MaxCompute: Example: Obtain the character at a specific position in a URL

Last Updated: Sep 01, 2023

This topic describes how to use a Java user-defined function (UDF) or Python UDF to obtain the character at a specific position in a URL.

Description

In this example, a UDF named UDF_GET_URL_CHAR is registered. The following section describes the function syntax and input parameters.

string UDF_GET_URL_CHAR(string <url>, bigint <n>)
  • Function description: The function finds the first occurrence of .htm in url and takes the substring between the last forward slash (/) before that occurrence and .htm. It then splits that substring on hyphens (-) and returns the nth element counted from right to left. The returned element can be longer than one character. If url does not contain .htm, if n is 0, or if n is greater than the number of elements, an empty string is returned.

  • Parameters:

    • url: The source URL of the STRING type. This parameter is required.

    • n: The position at which you want to obtain the character, which is of the BIGINT type. This parameter is required.
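The steps in the function description can be traced with a small standalone sketch. The helper name trace_url_segment and the keys of the returned dictionary are hypothetical and exist only for illustration. The helper is not part of the UDF and only exposes the intermediate values that the description refers to.

```python
def trace_url_segment(url, n):
    """Illustrative helper (not part of the UDF): return each intermediate value."""
    htm_index = url.find(".htm")              # first occurrence of ".htm"
    if n <= 0 or htm_index < 0:
        return {"result": ""}                 # nothing to extract
    prefix = url[:htm_index]                  # everything before ".htm"
    last_segment = prefix[prefix.rfind("/") + 1:]   # text after the last "/"
    parts = last_segment.split("-")           # hyphen-delimited elements
    # Count n elements from the right; fall back to "" when n is too large.
    result = parts[-n] if n <= len(parts) else ""
    return {
        "prefix": prefix,
        "last_segment": last_segment,
        "parts": parts,
        "result": result,
    }


steps = trace_url_segment("http://www.taobao.com/a-b-c-d.htm", 3)
for name in ("prefix", "last_segment", "parts", "result"):
    print(name, "=", steps[name])
```

For this URL, the elements are a, b, c, and d, so the third element from the right is b.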

Development and usage procedure

1. Write a UDF

Sample code of a Java UDF

package com.aliyun; // The package name, which is user-defined.
import com.aliyun.odps.udf.UDF;

public class GetUrlChar extends UDF {
    public String evaluate(String url, Long n)  {
        // Return an empty string for a NULL, zero, or negative position.
        // The null check prevents a NullPointerException when n is unboxed.
        if (n == null || n <= 0) {
            return "";
        }
        try {
            // Find the index of the first occurrence of ".htm" in url.
            int index = url.indexOf(".htm");
            if (index < 0) {
                return "";
            }
            // Keep everything before ".htm". The character at index is excluded.
            String a = url.substring(0, index);
            // Find the last forward slash (/) in the remaining string.
            index = a.lastIndexOf("/");
            // Keep everything after that slash. If no slash exists, index is -1 and the whole string is kept.
            String b = a.substring(index + 1);
            // Split b on hyphens (-) to obtain a string array. To use another delimiter, change the argument of b.split, which is a regular expression.
            String[] c = b.split("-");
            // A value can be returned only if c contains at least n elements.
            if (c.length < n) {
                return "";
            }
            // Return the nth element counted from the end of the array.
            return c[c.length - n.intValue()];
        } catch (Exception e) {
            return "Internal error";
        }
    }
}

If you write a Java UDF, your class must extend the UDF class. In this example, the evaluate method takes input parameters of the STRING and BIGINT types and returns a value of the STRING type. The data types of the input parameters and the return value form the signature of the UDF in SQL statements. For information about other code specifications and requirements, see Java UDFs.

Sample code of a Python 3 UDF

from odps.udf import annotate


@annotate("string,bigint->string")
class GetUrlChar(object):

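    def evaluate(self, url, n):
        # Return an empty string for a NULL, zero, or negative position.
        # Without the negative check, c[-n] would index from the left.
        if n is None or n <= 0:
            return ""
        try:
            # Find the first occurrence of ".htm" in url.
            index = url.find(".htm")
            if index < 0:
                return ""
            # Keep everything before ".htm".
            a = url[:index]
            # Keep everything after the last forward slash (/).
            index = a.rfind("/")
            b = a[index + 1:]
            # Split on hyphens (-) and return the nth element from the right.
            c = b.split("-")
            if len(c) < n:
                return ""
            return c[-n]
        except Exception:
            return "Internal error"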

By default, Python 2 is used to run UDFs in MaxCompute projects. If you want to run UDFs in Python 3, run the following command at the session level: set odps.sql.python.version=cp37. For more information about Python 3 UDF specifications, see Python 3 UDFs.

Sample code of a Python 2 UDF

#coding:utf-8
from odps.udf import annotate


@annotate("string,bigint->string")
class GetUrlChar(object):

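    def evaluate(self, url, n):
        # Return an empty string for a NULL, zero, or negative position.
        # Without the negative check, c[-n] would index from the left.
        if n is None or n <= 0:
            return ""
        try:
            # Find the first occurrence of ".htm" in url.
            index = url.find(".htm")
            if index < 0:
                return ""
            # Keep everything before ".htm".
            a = url[:index]
            # Keep everything after the last forward slash (/).
            index = a.rfind("/")
            b = a[index + 1:]
            # Split on hyphens (-) and return the nth element from the right.
            c = b.split("-")
            if len(c) < n:
                return ""
            return c[-n]
        except Exception:
            return "Internal error"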

If Chinese characters or other non-ASCII characters appear in UDF code that is written in Python 2, an error is returned when you run the UDF, because Python 2 reads source files as ASCII by default. To address this issue, add an encoding declaration to the first or second line of the code. The declaration format is #coding:utf-8 or # -*- coding: utf-8 -*-. The two formats are equivalent. For more information about Python 2 UDF specifications, see Python 2 UDFs.

2. Upload resources and register the UDF

After you write and debug a UDF, you must upload the resource to MaxCompute and register the UDF. In this example, the UDF named UDF_GET_URL_CHAR is registered. For more information about how to upload and register a Java UDF, see Package a Java program, upload the package, and create a MaxCompute UDF. For more information about how to upload and register a Python UDF, see Upload a Python program and create a MaxCompute UDF.
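Before you upload the resource, you can check the logic locally. The following sketch is one possible way to do so, not an official MaxCompute tool. It assumes that odps.udf may not be importable on your machine and substitutes a no-op annotate stand-in in that case. The CASES list and the run_cases helper are hypothetical names. The class body restates the Python sample in condensed form and treats a non-positive n as out of range.

```python
try:
    from odps.udf import annotate  # provided by the MaxCompute runtime
except ImportError:
    def annotate(signature):
        """Local stand-in (assumption): ignore the signature, return the class unchanged."""
        def decorator(cls):
            return cls
        return decorator


@annotate("string,bigint->string")
class GetUrlChar(object):
    def evaluate(self, url, n):
        if n is None or n <= 0:
            return ""
        try:
            index = url.find(".htm")
            if index < 0:
                return ""
            segment = url[:index]
            segment = segment[segment.rfind("/") + 1:]
            parts = segment.split("-")
            return parts[-n] if len(parts) >= n else ""
        except Exception:
            return "Internal error"


# (url, n, expected result); the first three rows come from the usage examples.
CASES = [
    ("http://www.taobao.com", 1, ""),
    ("http://www.taobao.com/a.htm", 1, "a"),
    ("http://www.taobao.com/a-b-c-d.htm", 3, "b"),
    ("http://www.taobao.com/a-b-c-d.htm", 5, ""),
]


def run_cases(udf, cases):
    """Return a list of (url, n, expected, actual) tuples for every failing case."""
    failures = []
    for url, n, expected in cases:
        actual = udf.evaluate(url, n)
        if actual != expected:
            failures.append((url, n, expected, actual))
    return failures


print(run_cases(GetUrlChar(), CASES))
```

An empty list means that every case passed. A local check of this kind covers only the evaluate logic. Resource upload, function registration, and type mapping still need to be verified in MaxCompute.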

3. Use the UDF

After you register the UDF, run the following commands:

  • Example 1

    set odps.sql.python.version=cp37; -- If you want to use a Python 3 UDF, run this command.
    SELECT UDF_GET_URL_CHAR("http://www.taobao.com", 1);

    The following result is returned:

    +-----+
    | _c0 |
    +-----+
    |     |
    +-----+
    Note

    The returned result is an empty string, not NULL, because the URL does not contain .htm.

  • Example 2

    select UDF_GET_URL_CHAR("http://www.taobao.com/a.htm", 1);

    The following result is returned:

    +-----+
    | _c0 |
    +-----+
    | a   |
    +-----+
  • Example 3

    select UDF_GET_URL_CHAR("http://www.taobao.com/a-b-c-d.htm", 3);

    The following result is returned:

    +-----+
    | _c0 |
    +-----+
    | b   |
    +-----+