This example shows how to write and register a Java or Python user-defined function (UDF) named UDF_GET_URL_CHAR that extracts a segment from a URL path by position. Use this UDF when built-in MaxCompute string functions do not cover your URL parsing logic.
How it works
UDF_GET_URL_CHAR takes a URL string and an integer position n, then:
- Finds the first occurrence of .htm in the URL.
- Extracts the path segment immediately before .htm.
- Splits that segment by hyphen (-).
- Returns the nth element from the right of the resulting array.
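The steps above can be traced in plain Python. This is a minimal sketch rather than the UDF itself: the helper name trace_url_segments and the URL http://www.example.com/docs/x-y-z.htm are hypothetical, and the sketch assumes the URL contains .htm and that n is between 1 and the number of segments.

```python
def trace_url_segments(url, n):
    """Return the intermediate values of each step (hypothetical helper)."""
    # Step 1: find the first occurrence of ".htm".
    index = url.find(".htm")
    if index < 0:
        raise ValueError("this sketch assumes the URL contains .htm")
    # Step 2: keep the text between the last "/" and ".htm".
    prefix = url[:index]
    segment = prefix[prefix.rfind("/") + 1:]
    # Step 3: split the segment on hyphens.
    parts = segment.split("-")
    # Step 4: pick the nth element from the right (assumes 1 <= n <= len(parts)).
    return segment, parts, parts[-n]


# Hypothetical URL used only for illustration.
segment, parts, result = trace_url_segments("http://www.example.com/docs/x-y-z.htm", 1)
print(segment)  # x-y-z
print(parts)    # ['x', 'y', 'z']
print(result)   # z
```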
Function signature
string UDF_GET_URL_CHAR(string <url>, bigint <n>)
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | STRING | Yes | The source URL to parse |
| n | BIGINT | Yes | The position to retrieve, counted from right to left |
Return type: STRING.
Return value behavior
| Condition | Return value |
|---|---|
| URL does not contain .htm | Empty string |
| n is 0 | Empty string |
| n exceeds the number of hyphen-delimited segments | Empty string |
| Valid input | The nth segment from the right |
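For the conditions listed above, the function returns an empty string rather than NULL. If an unexpected exception occurs during parsing (for example, when url is NULL), the sample code in Step 1 catches it and returns the string Internal error instead.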
Prerequisites
Before you begin, ensure that you have:
- A MaxCompute project with permissions to create resources and register UDFs
- The MaxCompute client or DataWorks installed and configured
Step 1: Write the UDF
Java UDF
The evaluate method accepts a String (maps to SQL STRING) and a Long (maps to SQL BIGINT), and returns a String. The class must extend com.aliyun.odps.udf.UDF.
package com.aliyun; // The package name, which is user-defined.

import com.aliyun.odps.udf.UDF;

public class GetUrlChar extends UDF {
    public String evaluate(String url, Long n) {
        if (n == 0) {
            return "";
        }
        try {
            // Find the index of the first occurrence of ".htm" in the URL.
            int index = url.indexOf(".htm");
            if (index < 0) {
                return "";
            }
            // Extract the prefix up to (but not including) ".htm".
            String a = url.substring(0, index);
            // Find the last forward slash in the prefix.
            index = a.lastIndexOf("/");
            // Extract the segment after the last slash.
            String b = a.substring(index + 1);
            // Split the segment by hyphen.
            String[] c = b.split("-");
            // Return empty string if n exceeds the number of segments.
            if (c.length < n) {
                return "";
            }
            // Return the nth element from the right.
            return c[c.length - n.intValue()];
        } catch (Exception e) {
            return "Internal error";
        }
    }
}
For code specifications and class requirements, see Java UDFs.
Python 3 UDF
from odps.udf import annotate

@annotate("string,bigint->string")
class GetUrlChar(object):
    def evaluate(self, url, n):
        if n == 0:
            return ""
        try:
            index = url.find(".htm")
            if index < 0:
                return ""
            a = url[:index]
            index = a.rfind("/")
            b = a[index + 1:]
            c = b.split("-")
            if len(c) < n:
                return ""
            return c[-n]
        except Exception:
            return "Internal error"
MaxCompute projects run Python 2 by default. To use a Python 3 UDF in a session, run:
set odps.sql.python.version=cp37;
For Python 3 UDF specifications, see Python 3 UDFs.
Python 2 UDF
#coding:utf-8
from odps.udf import annotate

@annotate("string,bigint->string")
class GetUrlChar(object):
    def evaluate(self, url, n):
        if n == 0:
            return ""
        try:
            index = url.find(".htm")
            if index < 0:
                return ""
            a = url[:index]
            index = a.rfind("/")
            b = a[index + 1:]
            c = b.split("-")
            if len(c) < n:
                return ""
            return c[-n]
        except Exception:
            return "Internal error"
If the code contains Chinese or other non-ASCII characters, add an encoding declaration at the top of the file: either #coding:utf-8 or # -*- coding: utf-8 -*-. Both formats are equivalent.
For Python 2 UDF specifications, see Python 2 UDFs.
Step 2: Upload and register the UDF
After writing and testing the UDF code, upload it to MaxCompute as a resource and register it under the name UDF_GET_URL_CHAR.
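The testing mentioned above can be done locally before uploading. The sketch below assumes the odps package is not installed on the local machine, so it substitutes a no-op annotate decorator; that stand-in is an assumption for illustration only and is not part of MaxCompute. The class body is the Python UDF from Step 1, and the test URLs other than the taobao.com ones are not needed, so only the documented examples are reused.

```python
def annotate(signature):
    """No-op stand-in for odps.udf.annotate (assumption, local testing only)."""
    def wrap(cls):
        return cls
    return wrap


@annotate("string,bigint->string")
class GetUrlChar(object):
    def evaluate(self, url, n):
        if n == 0:
            return ""
        try:
            index = url.find(".htm")
            if index < 0:
                return ""
            a = url[:index]
            index = a.rfind("/")
            b = a[index + 1:]
            c = b.split("-")
            if len(c) < n:
                return ""
            return c[-n]
        except Exception:
            return "Internal error"


udf = GetUrlChar()
# Each tuple is (url, n, expected return value).
cases = [
    ("http://www.taobao.com", 1, ""),               # no .htm in the URL
    ("http://www.taobao.com/a-b-c-d.htm", 0, ""),   # n is 0
    ("http://www.taobao.com/a-b-c-d.htm", 5, ""),   # n exceeds the segment count
    ("http://www.taobao.com/a-b-c-d.htm", 3, "b"),  # valid input
    (None, 1, "Internal error"),                    # NULL url hits the except branch
]
for url, n, expected in cases:
    assert udf.evaluate(url, n) == expected
print("all local checks passed")
```

Running the checks against the plain class keeps the feedback loop short; the real annotate decorator only declares the SQL signature, so it is still worth re-running the Step 3 queries after registration.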
Step 3: Call the UDF
Call UDF_GET_URL_CHAR with a SQL SELECT statement. If you registered a Python 3 UDF, prepend set odps.sql.python.version=cp37; to your query.
Example 1: URL with no .htm path
set odps.sql.python.version=cp37; -- Required only for Python 3 UDFs.
SELECT UDF_GET_URL_CHAR("http://www.taobao.com", 1);
Result:
+-----+
| _c0 |
+-----+
| |
+-----+
The URL http://www.taobao.com contains no .htm, so the function returns an empty string.
Example 2: URL with a single-segment path
SELECT UDF_GET_URL_CHAR("http://www.taobao.com/a.htm", 1);
Result:
+-----+
| _c0 |
+-----+
| a |
+-----+
The segment before .htm is a. Splitting by hyphen yields one element; n=1 from the right returns a.
Example 3: URL with a multi-segment path
SELECT UDF_GET_URL_CHAR("http://www.taobao.com/a-b-c-d.htm", 3);
Result:
+-----+
| _c0 |
+-----+
| b |
+-----+
The segment a-b-c-d splits into ["a", "b", "c", "d"]. Counting from the right, position 3 is b.
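To make the right-to-left counting concrete, the following sketch lists the result for every valid n on the Example 3 segment. It also checks that Python's negative index c[-n] selects the same element as the Java expression c[c.length - n]. The variable names are illustrative only.

```python
segment = "a-b-c-d"
parts = segment.split("-")

by_position = {}
for n in range(1, len(parts) + 1):
    # c[-n] in the Python UDFs and c[c.length - n] in the Java UDF agree.
    assert parts[-n] == parts[len(parts) - n]
    by_position[n] = parts[-n]

print(by_position)  # {1: 'd', 2: 'c', 3: 'b', 4: 'a'}
```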