Compared with the built-in REGEXP_REPLACE function, a user-defined function (UDF) allows you to use variables in a regular expression. This topic shows how to implement a UDF_REPLACE_BY_REGEXP UDF in Java or Python that accepts the regex pattern as an argument.
UDF signature
Syntax:
string UDF_REPLACE_BY_REGEXP(string <s>, string <regex>, string <replacement>)
Parameters:
All three parameters are required and accept STRING values.
| Parameter | Description |
|---|---|
| s | The source string |
| regex | The regular expression to match against s |
| replacement | The string that replaces each match |
Return value: STRING
Prerequisites
Before you begin, ensure that you have:
- A MaxCompute project with UDF development permissions
- A Java or Python development environment
Step 1: Write the UDF
Choose Java or Python based on your development environment.
Java UDF
package com.aliyun.rewrite; // Specify a package name.

import com.aliyun.odps.udf.UDF;
import com.aliyun.odps.udf.annotation.UdfProperty;
import java.util.Objects;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

@UdfProperty(isDeterministic=true)
public class ReplaceByRegExp extends UDF {
    /**
     * The regular expression from the most recent call, cached to avoid recompiling it for every row.
     */
    private String lastRegex = "";
    private Pattern pattern = null;

    /**
     * @param s The source string.
     * @param regex The regular expression.
     * @param replacement The string that replaces each match in the source string.
     */
    public String evaluate(String s, String regex, String replacement) {
        Objects.requireNonNull(s, "The source string cannot be null");
        Objects.requireNonNull(regex, "The regular expression cannot be null");
        Objects.requireNonNull(replacement, "The replacement string cannot be null");
        // Compile on the first call, and recompile whenever the regular expression changes.
        if (pattern == null || !regex.equals(lastRegex)) {
            lastRegex = regex;
            pattern = Pattern.compile(regex);
        }
        Matcher m = pattern.matcher(s);
        StringBuffer sb = new StringBuffer();
        // Replace each match and copy the text between matches.
        while (m.find()) {
            m.appendReplacement(sb, replacement);
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
A Java UDF must extend the UDF class. The evaluate method signature — three STRING input parameters and a STRING return value — defines the UDF's signature in SQL statements. For full Java UDF specifications, see Java UDFs.
Python 3 UDF
from odps.udf import annotate
import re


@annotate("string,string,string->string")
class ReplaceByRegExp(object):
    def __init__(self):
        # The regular expression from the most recent call, cached to avoid recompiling it.
        self.lastRegex = ""
        self.pattern = None

    def evaluate(self, s, regex, replacement):
        # Reject None only; empty strings are valid arguments.
        if s is None or regex is None or replacement is None:
            raise ValueError("Arguments must not be None")
        # Compile on the first call, and recompile whenever the regular expression changes.
        if self.pattern is None or regex != self.lastRegex:
            self.lastRegex = regex
            self.pattern = re.compile(regex)
        result = self.pattern.sub(replacement, s)
        return result
MaxCompute projects run UDFs with Python 2 by default. To use Python 3, run set odps.sql.python.version=cp37 at the session level before calling the UDF. For full Python 3 UDF specifications, see Python 3 UDFs.
Python 2 UDF
#coding:utf-8
from odps.udf import annotate
import re


@annotate("string,string,string->string")
class ReplaceByRegExp(object):
    def __init__(self):
        # The regular expression from the most recent call, cached to avoid recompiling it.
        self.lastRegex = ""
        self.pattern = None

    def evaluate(self, s, regex, replacement):
        # Reject None only; empty strings are valid arguments.
        if s is None or regex is None or replacement is None:
            raise ValueError("Arguments must not be None")
        # Compile on the first call, and recompile whenever the regular expression changes.
        if self.pattern is None or regex != self.lastRegex:
            self.lastRegex = regex
            self.pattern = re.compile(regex)
        result = self.pattern.sub(replacement, s)
        return result
If your Python 2 UDF code contains Chinese characters, add an encoding declaration at the top of the file. Both #coding:utf-8 and # -*- coding: utf-8 -*- are valid. For full Python 2 UDF specifications, see Python 2 UDFs.
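All three implementations above keep the most recently compiled pattern and recompile only when the regex argument changes. This matters when the pattern comes from a column rather than a constant. The following is a minimal sketch of that behavior in plain Python, without the odps dependency. The class name CachedReplacer, the compile_count counter, and the sample rows are hypothetical and exist only for illustration.

```python
import re


class CachedReplacer(object):
    """Hypothetical stand-in for the UDF class, without the odps dependency."""

    def __init__(self):
        self.last_regex = None
        self.pattern = None
        self.compile_count = 0  # Illustration only; not part of the UDF.

    def evaluate(self, s, regex, replacement):
        # Recompile only when the pattern text differs from the previous call.
        if self.pattern is None or regex != self.last_regex:
            self.last_regex = regex
            self.pattern = re.compile(regex)
            self.compile_count += 1
        return self.pattern.sub(replacement, s)


replacer = CachedReplacer()
# Each tuple mimics one row: (source string, regex, replacement).
rows = [
    ("abc 123", r"\d+", "#"),
    ("def 456", r"\d+", "#"),
    ("a-b-c", r"-", "+"),
]
results = [replacer.evaluate(s, regex, repl) for s, regex, repl in rows]
print(results)                 # ['abc #', 'def #', 'a+b+c']
print(replacer.compile_count)  # 2
```

Because only the last pattern is cached, rows that alternate between two patterns would trigger a compilation on every row. Sorting or grouping the input by pattern is one way to reduce that cost.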
Step 2: Upload resources and register the UDF
After writing and testing your UDF code, upload it to MaxCompute and register it as UDF_REPLACE_BY_REGEXP.
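Before uploading, the Python logic can be exercised locally. The sketch below is one way to do that without the MaxCompute runtime: it replaces odps.udf.annotate with a hypothetical no-op decorator and checks a few cases with assertions. The explicit None check and the chosen test cases are assumptions of this sketch, not requirements from MaxCompute.

```python
import re


def annotate(signature):
    """Hypothetical local stand-in for odps.udf.annotate; it ignores the signature."""
    def wrap(cls):
        return cls
    return wrap


@annotate("string,string,string->string")
class ReplaceByRegExp(object):
    def __init__(self):
        self.last_regex = None
        self.pattern = None

    def evaluate(self, s, regex, replacement):
        if s is None or regex is None or replacement is None:
            raise ValueError("Arguments must not be None")
        if self.pattern is None or regex != self.last_regex:
            self.last_regex = regex
            self.pattern = re.compile(regex)
        return self.pattern.sub(replacement, s)


udf = ReplaceByRegExp()

# Basic replacement, matching the example query in Step 3.
assert udf.evaluate("abc 123 def 456", r"\d+", "#") == "abc # def #"
# No match: the source string comes back unchanged.
assert udf.evaluate("abc", r"\d+", "#") == "abc"
# An empty replacement deletes every match.
assert udf.evaluate("a1b2", r"\d", "") == "ab"
# None arguments are rejected.
try:
    udf.evaluate(None, r"\d+", "#")
    rejected = False
except ValueError:
    rejected = True
assert rejected
print("all checks passed")
```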
Step 3: Use the UDF
Run the following SQL to replace all digit sequences in a string with #:
set odps.sql.python.version=cp37; -- Required only if you registered the Python 3 UDF.
SELECT UDF_REPLACE_BY_REGEXP('abc 123 def 456', '\\d+', '#');
Expected output:
+-------------+
| _c0         |
+-------------+
| abc # def # |
+-------------+
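The same replacement can be reproduced outside MaxCompute as a quick check. The sketch below calls Python's re module directly and assumes that the SQL literal '\\d+' reaches the UDF as the regex \d+. It also shows a capture-group reference in the replacement string. Python writes this as \1, whereas Java's Matcher.appendReplacement uses $1, so replacement strings that reference groups are not interchangeable between the Java and Python versions of the UDF.

```python
import re

source = "abc 123 def 456"

# Replace every run of digits with '#', as in the SQL example.
masked = re.sub(r"\d+", "#", source)
print(masked)  # abc # def #

# Capture-group reference in the replacement (Python syntax: \1).
bracketed = re.sub(r"(\d+)", r"[\1]", source)
print(bracketed)  # abc [123] def [456]
```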
What's next
- Java UDFs: Java UDF specifications and requirements
- Python 3 UDFs: Python 3 UDF specifications
- Python 2 UDFs: Python 2 UDF specifications