Use Java and Python UDFs to extend REGEXP_REPLACE in MaxCompute

This topic describes how to write a Java UDF or a Python UDF that replaces parts of a string by using a regular expression.

Description

In this example, a UDF named UDF_REPLACE_BY_REGEXP is created.

Syntax:

string UDF_REPLACE_BY_REGEXP(string <s>, string <regex>, string <replacement>)

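Description:
Replaces every substring of s that matches the regular expression regex with the string replacement. Unlike the MaxCompute built-in function REGEXP_REPLACE, the UDF lets you use variables in the regular expression.

Parameters:
- s: Required. The source string. The value is of the STRING data type.
- regex: Required. The regular expression. The value is of the STRING data type.
- replacement: Required. The string that replaces each match of the regular expression in the source string. The value is of the STRING data type.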
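
The semantics described above can be sketched in plain Python, outside MaxCompute. This is only an illustration: the function name replace_by_regexp and the sample rows are hypothetical, and the point is that each row may carry its own pattern instead of a single constant one.

```python
import re


def replace_by_regexp(s, regex, replacement):
    """Replace every match of `regex` in `s` with `replacement`."""
    return re.sub(regex, replacement, s)


# Hypothetical rows: (source string, regular expression, replacement).
# The pattern differs from row to row, as it could when it comes from a column.
rows = [
    ("abc 123 def 456", r"\d+", "#"),
    ("2024-01-15", r"-", "/"),
]

results = [replace_by_regexp(s, regex, rep) for s, regex, rep in rows]
print(results)
```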

Development and usage procedure

1. Write the UDF

Sample Java UDF code

package com.aliyun.rewrite; // Specify a package name. 
import com.aliyun.odps.udf.UDF;
import com.aliyun.odps.udf.annotation.UdfProperty;

import java.util.Objects;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

@UdfProperty(isDeterministic=true)
public class ReplaceByRegExp extends UDF {
    /**
     * The regular expression in the most recent query, which is cached to avoid multiple compilations.
     */
    private String lastRegex = "";
    private Pattern pattern = null;

    /**
     * @param s The source string.
     * @param regex The regular expression.
     * @param replacement The string that replaces each match of the regular expression.
     * @return The source string with every match replaced.
     */
    public String evaluate(String s, String regex, String replacement) {
        Objects.requireNonNull(s, "The source string cannot be null");
        Objects.requireNonNull(regex, "The regular expression cannot be null");
        Objects.requireNonNull(replacement, "The string that replaces the source string cannot be null");

        // If the regular expression is changed, recompile the regular expression. 
        if (!regex.equals(lastRegex)) {
            lastRegex = regex;
            pattern = Pattern.compile(regex);
        }
        Matcher m = pattern.matcher(s);
        StringBuffer sb = new StringBuffer();

        // Perform text replacement.
        while (m.find()) {
            m.appendReplacement(sb, replacement);
        }
        m.appendTail(sb);
        return sb.toString();
    }
}

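When you write a UDF in Java, the class must extend the UDF class. In this example, the evaluate method takes three input parameters of the STRING data type and returns a value of the STRING data type. The data types of the input parameters and the return value serve as the signature of the UDF in SQL statements. For more information about other code specifications and requirements, see "Java UDF".

Sample Python 3 UDF code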

from odps.udf import annotate
import re

@annotate("string,string,string->string")
class ReplaceByRegExp(object):
    def __init__(self):
        self.lastRegex = ""
        self.pattern = None

    def evaluate(self, s, regex, replacement):
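        # Reject only None; an empty string is a valid argument (for example, an empty replacement deletes matches).
        if s is None or regex is None or replacement is None:
            raise ValueError("Arguments must not be None")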
        # If the regular expression is changed, recompile the regular expression. 
        if regex != self.lastRegex:
            self.lastRegex = regex
            self.pattern = re.compile(regex)
        result = self.pattern.sub(replacement, s)
        return result

By default, MaxCompute projects run UDFs with Python 2. To run a UDF with Python 3, run the set odps.sql.python.version=cp37 command at the session level. For more information about Python 3 UDF specifications, see "Python 3 UDF".
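
Before uploading, the class logic can be exercised locally. The sketch below assumes that the odps package is not installed on the local machine, so it defines a stand-in annotate decorator that only records the signature; that stand-in is an assumption for local testing and is not the MaxCompute implementation. The attribute name last_regex is also a local choice, and the sketch checks for None explicitly so that empty strings are accepted.

```python
import re


def annotate(signature):
    # Stand-in for odps.udf.annotate (assumption): only stores the signature.
    def wrap(cls):
        cls.signature = signature
        return cls
    return wrap


@annotate("string,string,string->string")
class ReplaceByRegExp(object):
    def __init__(self):
        self.last_regex = None
        self.pattern = None

    def evaluate(self, s, regex, replacement):
        if s is None or regex is None or replacement is None:
            raise ValueError("Arguments must not be None")
        # Recompile only when the regular expression changes between calls.
        if regex != self.last_regex:
            self.last_regex = regex
            self.pattern = re.compile(regex)
        return self.pattern.sub(replacement, s)


udf = ReplaceByRegExp()
out1 = udf.evaluate("abc 123 def 456", r"\d+", "#")
cached = udf.pattern
out2 = udf.evaluate("x9y", r"\d+", "#")
# The second call used the same regular expression, so the compiled pattern was reused.
reused = udf.pattern is cached
print(out1, out2, reused)
```

Caching the compiled pattern on the instance pays off when consecutive rows share the same regular expression, which is the common case when the pattern is a constant or changes rarely.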

Sample Python 2 UDF code

#coding:utf-8
from odps.udf import annotate
import re

@annotate("string,string,string->string")
class ReplaceByRegExp(object):
    def __init__(self):
        self.lastRegex = ""
        self.pattern = None

    def evaluate(self, s, regex, replacement):
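        # Reject only None; an empty string is a valid argument (for example, an empty replacement deletes matches).
        if s is None or regex is None or replacement is None:
            raise ValueError("Arguments must not be None")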
        # If the regular expression is changed, recompile the regular expression. 
        if regex != self.lastRegex:
            self.lastRegex = regex
            self.pattern = re.compile(regex)
        result = self.pattern.sub(replacement, s)
        return result

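If UDF code written in Python 2 contains Chinese characters, an error is returned when the UDF runs. To fix this, add an encoding declaration to the header of the code. The declaration format is #coding: utf-8 or # -*- coding: utf-8 -*-. The two formats are equivalent. For more information about Python 2 UDF specifications, see "Python 2 UDF".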
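
As a minimal illustration of the declaration, the following file contains a non-ASCII string literal. The variable names and sample text are hypothetical; the sketch only shows where the declaration goes and is not part of the UDF itself.

```python
# -*- coding: utf-8 -*-
# The declaration above must sit on the first or second line of the file.
import re

# A non-ASCII literal such as this one is what triggers the Python 2 error
# when the encoding declaration is missing.
replacement = u"数字"
result = re.sub(u"\\d+", replacement, u"abc 123")
print(result)
```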

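2. Upload resources and create the UDF

After you develop and debug the UDF code, upload the resources to MaxCompute and create the UDF. In this example, a UDF named UDF_REPLACE_BY_REGEXP is created. For more information about how to upload resources and create a Java UDF, see "Package a Java program, upload the package, and create a MaxCompute UDF". For more information about how to upload resources and create a Python UDF, see "Upload a Python program and create a MaxCompute UDF".

3. Use the UDF

After the UDF is created, run the following commands to replace each sequence of consecutive digits in a string with a number sign (#).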

set odps.sql.python.version=cp37; -- To use a UDF in Python 3, run this command.
SELECT UDF_REPLACE_BY_REGEXP('abc 123 def 456', '\\d+', '#');

The following result is returned:

+--------------+
| _c0          |
+--------------+
| abc # def #  |
+--------------+
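
The same replacement can be checked locally with Python's re module. This is a local sanity check that assumes Python regular expression semantics, not a MaxCompute run. It also shows why the result contains one number sign per group of digits: the pattern \d+ matches a whole run of digits, whereas \d would match each digit separately.

```python
import re

source = "abc 123 def 456"

# One replacement per run of consecutive digits, as in the query above.
per_run = re.sub(r"\d+", "#", source)

# One replacement per single digit, for comparison.
per_digit = re.sub(r"\d", "#", source)

print(per_run)    # abc # def #
print(per_digit)  # abc ### def ###
```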