Implement a Java or Python UDF to obtain a character at a specific position in a URL - MaxCompute

How it works

UDF_GET_URL_CHAR takes a URL string and an integer position n, then:

Finds the first occurrence of .htm in the URL.
Extracts the text between the last forward slash (/) before .htm and .htm itself.
Splits that segment by hyphen (-).
Returns the nth element counted from the right of the resulting array (n = 1 is the last element).
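
The four steps can be traced in plain Python, outside MaxCompute. This is an illustrative sketch only: the helper name trace_get_url_char is hypothetical, and it assumes the URL contains .htm and that n lies between 1 and the number of segments.

```python
def trace_get_url_char(url, n):
    """Return the intermediate value produced by each step (illustration only)."""
    # Step 1: locate the first occurrence of ".htm".
    htm_index = url.find(".htm")
    # Step 2: keep the text between the last "/" before ".htm" and ".htm" itself.
    prefix = url[:htm_index]
    segment = prefix[prefix.rfind("/") + 1:]
    # Step 3: split that segment on hyphens.
    parts = segment.split("-")
    # Step 4: pick the nth element counted from the right (n = 1 is the last one).
    return {"segment": segment, "parts": parts, "result": parts[-n]}


steps = trace_get_url_char("http://www.taobao.com/a-b-c-d.htm", 3)
print(steps["segment"])  # a-b-c-d
print(steps["parts"])    # ['a', 'b', 'c', 'd']
print(steps["result"])   # b
```

The sketch omits the guards for a missing .htm, n = 0, and an out-of-range n.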

Function signature

string UDF_GET_URL_CHAR(string <url>, bigint <n>)

Parameter	Type	Required	Description
`url`	STRING	Yes	The source URL to parse
`n`	BIGINT	Yes	The 1-based position to retrieve, counted from right to left

Return type: STRING.

Return value behavior

Condition	Return value
URL does not contain `.htm`	Empty string
`n` is `0`	Empty string
`n` exceeds the number of hyphen-delimited segments	Empty string
Valid input	The nth segment from the right

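For each condition in the table, the function returns an empty string rather than NULL. If an exception is raised inside the try block (for example, because url is NULL), the implementations in Step 1 catch it and return the string Internal error.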

Prerequisites

Before you begin, ensure that you have:

A MaxCompute project with permissions to create resources and register UDFs
The MaxCompute client or DataWorks installed and configured

Step 1: Write the UDF

Java UDF

The evaluate method accepts a String (maps to SQL STRING) and a Long (maps to SQL BIGINT), and returns a String. The class must extend com.aliyun.odps.udf.UDF.

package com.aliyun; // The package name, which is user-defined.
import com.aliyun.odps.udf.UDF;

public class GetUrlChar extends UDF {
    public String evaluate(String url, Long n) {
        if (n == 0) {
            return "";
        }
        try {
            // Find the index of the first occurrence of ".htm" in the URL.
            int index = url.indexOf(".htm");
            if (index < 0) {
                return "";
            }
            // Extract the prefix up to (but not including) ".htm".
            String a = url.substring(0, index);
            // Find the last forward slash in the prefix.
            index = a.lastIndexOf("/");
            // Extract the segment after the last slash.
            String b = a.substring(index + 1);
            // Split the segment by hyphen.
            String[] c = b.split("-");
            // Return empty string if n exceeds the number of segments.
            if (c.length < n) {
                return "";
            }
            // Return the nth element from the right.
            return c[c.length - n.intValue()];
        } catch (Exception e) {
            return "Internal error";
        }
    }
}

For code specifications and class requirements, see Java UDFs.

Python 3 UDF

from odps.udf import annotate


@annotate("string,bigint->string")
class GetUrlChar(object):

    def evaluate(self, url, n):
        if n == 0:
            return ""
        try:
            index = url.find(".htm")
            if index < 0:
                return ""
            a = url[:index]
            index = a.rfind("/")
            b = a[index + 1:]
            c = b.split("-")
            if len(c) < n:
                return ""
            return c[-n]
        except Exception:
            return "Internal error"

MaxCompute projects run Python 2 by default. To use a Python 3 UDF in a session, run:

set odps.sql.python.version=cp37;

For Python 3 UDF specifications, see Python 3 UDFs.

Python 2 UDF

#coding:utf-8
from odps.udf import annotate


@annotate("string,bigint->string")
class GetUrlChar(object):

    def evaluate(self, url, n):
        if n == 0:
            return ""
        try:
            index = url.find(".htm")
            if index < 0:
                return ""
            a = url[:index]
            index = a.rfind("/")
            b = a[index + 1:]
            c = b.split("-")
            if len(c) < n:
                return ""
            return c[-n]
        except Exception:
            return "Internal error"

If the code contains non-ASCII characters, such as Chinese characters, add an encoding declaration at the top of the file (first or second line): either #coding:utf-8 or # -*- coding: utf-8 -*-. Both formats are equivalent.

For Python 2 UDF specifications, see Python 2 UDFs.

Step 2: Upload and register the UDF

After writing and testing the UDF code, upload it to MaxCompute as a resource and register it under the name UDF_GET_URL_CHAR.

Java UDF: Package a Java program, upload the package, and create a MaxCompute UDF
Python UDF: Upload a Python program and create a MaxCompute UDF
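
Before uploading, the Python UDF can be exercised locally by calling evaluate directly. The sketch below is one possible approach, not an official test procedure. It assumes the odps package may be missing on the development machine, so it falls back to a hypothetical no-op stand-in for annotate; on MaxCompute the real decorator from odps.udf is used.

```python
try:
    from odps.udf import annotate
except ImportError:
    # Assumption: odps is not installed locally, so use a no-op stand-in
    # that accepts the signature string and returns the class unchanged.
    def annotate(signature):
        def decorator(cls):
            return cls
        return decorator


@annotate("string,bigint->string")
class GetUrlChar(object):
    # Same logic as the Python UDF in Step 1.
    def evaluate(self, url, n):
        if n == 0:
            return ""
        try:
            index = url.find(".htm")
            if index < 0:
                return ""
            a = url[:index]
            index = a.rfind("/")
            b = a[index + 1:]
            c = b.split("-")
            if len(c) < n:
                return ""
            return c[-n]
        except Exception:
            return "Internal error"


udf = GetUrlChar()
# Rows from the return value table.
assert udf.evaluate("http://www.taobao.com", 1) == ""              # no ".htm"
assert udf.evaluate("http://www.taobao.com/a-b-c-d.htm", 0) == ""  # n is 0
assert udf.evaluate("http://www.taobao.com/a-b-c-d.htm", 5) == ""  # n too large
# Valid inputs.
assert udf.evaluate("http://www.taobao.com/a.htm", 1) == "a"
assert udf.evaluate("http://www.taobao.com/a-b-c-d.htm", 3) == "b"
print("all local checks passed")
```

The same file can be run with either a Python 2 or a Python 3 interpreter, depending on which UDF version is being registered.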

Step 3: Call the UDF

Call UDF_GET_URL_CHAR with a SQL SELECT statement. If you registered a Python 3 UDF, prepend set odps.sql.python.version=cp37; to your query.

Example 1: URL without .htm

set odps.sql.python.version=cp37; -- Required only for Python 3 UDFs.
SELECT UDF_GET_URL_CHAR("http://www.taobao.com", 1);

Result:

+-----+
| _c0 |
+-----+
|     |
+-----+

The URL http://www.taobao.com contains no .htm, so the function returns an empty string.

Example 2: URL with a single-segment path

SELECT UDF_GET_URL_CHAR("http://www.taobao.com/a.htm", 1);

Result:

+-----+
| _c0 |
+-----+
| a   |
+-----+

The segment before .htm is a. Splitting by hyphen yields one element; n=1 from the right returns a.

Example 3: URL with a multi-segment path

SELECT UDF_GET_URL_CHAR("http://www.taobao.com/a-b-c-d.htm", 3);

Result:

+-----+
| _c0 |
+-----+
| b   |
+-----+

The segment a-b-c-d splits into ["a", "b", "c", "d"]. Counting from the right, position 3 is b.
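
The right-to-left counting can be checked with a few lines of plain Python. This sketch only illustrates the index arithmetic: the Java code in Step 1 indexes with c.length - n, and the Python code uses c[-n]. The variable names here are illustrative.

```python
segments = "a-b-c-d".split("-")  # ['a', 'b', 'c', 'd']

for n in range(1, len(segments) + 1):
    left_index = len(segments) - n               # index used by the Java version
    assert segments[left_index] == segments[-n]  # same element as Python's c[-n]
    print(n, left_index, segments[-n])

# Prints:
# 1 3 d
# 2 2 c
# 3 1 b
# 4 0 a
```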