The Feature Generation (FG) framework supports custom feature operators as shared library plugins. Use them when built-in operators don't cover your domain-specific transformation logic—for example, edit distance between two text fields, spherical distance from GPS coordinates, or any numeric formula applied element-wise to a sequence.
How it works
Implement the IFeatureOP C++ interface and register the operator with REGISTER_PLUGIN.
Compile the implementation into a shared library (.so).
Reference the operator in fg.json by setting feature_type to custom_feature and operator_name to the registered name.
Deploy the .so file. Online services auto-discover it from the custom_fg_lib directory; offline tasks require uploading it as a MaxCompute resource.
Prerequisites
Before you begin, ensure that you have:
A C++17-compatible build environment using the official compiler image (see Compile a custom operator)
The API dependency package: fg-api.tar.gz — contains all required header files
Familiarity with the FG framework and how fg.json configuration works
Implement an operator
Minimal example
Every custom operator follows the same pattern: inherit IFeatureOP, implement Initialize and at least one ProcessWith* method, and register with REGISTER_PLUGIN.
Here is a minimal operator that outputs a constant integer:
#include "api/base_op.h"
namespace fg {
class ConstantOp : public IFeatureOP {
public:
int Initialize(const string& feature_config) override {
// Parse JSON config if needed; return 0 on success.
return 0;
}
int ProcessWithInt32Outputs(const vector<FieldPtr>& inputs,
vector<int32>& outputs) override {
outputs.push_back(42);
return 0; // 0 = success
}
};
} // namespace fg
REGISTER_PLUGIN("ConstantOp", ConstantOp);
Key requirements:
Include a parameterless constructor (the default constructor works here).
The Initialize method receives the entire fg.json entry as a JSON string. Parse the configuration items you need from it.
Return 0 for success from all methods; non-zero indicates failure.
Call REGISTER_PLUGIN in the implementation file, not the header.
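The last requirement follows from how registration works: a file-scope object records a factory when the library loads. The standalone sketch below mimics that pattern with hypothetical names (Op, Registry, REGISTER_OP, Create); it is not the fg macro. The real REGISTER_PLUGIN additionally defines an extern "C" factory function, which would be defined more than once if the macro sat in a header included by several source files.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for fg::IFeatureOP; not the real interface.
struct Op {
  virtual ~Op() = default;
  virtual int Run() = 0;
};

using Factory = std::function<std::unique_ptr<Op>()>;

// A function-local static avoids static-initialization-order problems.
inline std::map<std::string, Factory>& Registry() {
  static std::map<std::string, Factory> registry;
  return registry;
}

// Hypothetical macro: defines a file-scope object whose constructor
// registers a factory under the given name when the file is loaded.
#define REGISTER_OP(Name, Class)                                          \
  namespace {                                                             \
  struct Reg_##Class {                                                    \
    Reg_##Class() {                                                       \
      Registry()[Name] = [] { return std::unique_ptr<Op>(new Class()); }; \
    }                                                                     \
  };                                                                      \
  static Reg_##Class reg_instance_##Class;                                \
  }

struct ConstantOp : Op {
  int Run() override { return 42; }
};
REGISTER_OP("ConstantOp", ConstantOp)

// Looks up an operator by its registered name; returns nullptr if unknown.
inline std::unique_ptr<Op> Create(const std::string& name) {
  auto it = Registry().find(name);
  if (it == Registry().end()) return nullptr;
  return it->second();
}
```

Calling Create("ConstantOp") returns a working instance without the caller ever naming the class, which is the same indirection that operator_name provides in fg.json.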
Choose between ProcessWith* and BatchProcess
Two processing interfaces are available. Choose based on your throughput and complexity requirements:
| | ProcessWith* | BatchProcess |
|---|---|---|
| Granularity | One record at a time | One batch of records |
| Output types | Basic scalar types only (string, int32, int64, float, double) | Any valid VariantVector type, including Map |
| Use when | Simple transformations, low implementation cost | High throughput required; user-side features with one sample per request (use the broadcast mechanism to avoid redundant parsing) |
| Required | At least one method matching value_type | Override HasBatchProcessImpl() to return true |
When BatchProcess is implemented and HasBatchProcessImpl() returns true, the framework skips all ProcessWith* methods.
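The rule above can be pictured with a simplified dispatcher. This sketch is not the framework's code: MiniOp is a hypothetical stand-in for IFeatureOP with one row method and a simplified batch signature, and RunColumn only illustrates the documented behavior that the row methods are skipped once HasBatchProcessImpl() returns true.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for fg::IFeatureOP.
struct MiniOp {
  virtual ~MiniOp() = default;
  virtual bool HasBatchProcessImpl() const { return false; }
  // Row interface: one record in, zero or more values appended to out.
  virtual int ProcessWithInt32Outputs(int32_t /*input*/,
                                      std::vector<int32_t>& /*out*/) {
    return -1;  // not implemented
  }
  // Batch interface: a whole column in, a whole column out.
  virtual int BatchProcess(const std::vector<int32_t>& /*column*/,
                           std::vector<int32_t>& /*out*/) {
    return -1;  // not implemented
  }
};

// Assumed dispatch: take the batch path when the operator declares it,
// otherwise call the row method once per record.
inline int RunColumn(MiniOp& op, const std::vector<int32_t>& column,
                     std::vector<int32_t>& out) {
  if (op.HasBatchProcessImpl()) {
    return op.BatchProcess(column, out);
  }
  for (int32_t value : column) {
    int status = op.ProcessWithInt32Outputs(value, out);
    if (status != 0) return status;
  }
  return 0;
}

struct DoubleRowOp : MiniOp {
  int ProcessWithInt32Outputs(int32_t input,
                              std::vector<int32_t>& out) override {
    out.push_back(input * 2);
    return 0;
  }
};

struct DoubleBatchOp : MiniOp {
  bool HasBatchProcessImpl() const override { return true; }
  int BatchProcess(const std::vector<int32_t>& column,
                   std::vector<int32_t>& out) override {
    out.reserve(column.size());
    for (int32_t value : column) out.push_back(value * 2);
    return 0;
  }
};
```

Both operators produce the same column; the batch version simply receives all records in one call, which is where per-call overhead can be amortized.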
Full C++ interface
#pragma once
#ifndef FEATURE_GENERATOR_PLUGIN_BASE_H
#define FEATURE_GENERATOR_PLUGIN_BASE_H
#include <absl/container/flat_hash_map.h>
#include <absl/strings/string_view.h>
#include <absl/types/optional.h>
#include <stdexcept>
#include <utility>
#include <vector>
#include "fsmap.h"
#include "integral_types.h"
namespace fg {
using absl::optional;
using std::string;
using std::vector;
template <typename T>
using List = std::vector<T>;
template <typename K, typename V>
using Map = absl::flat_hash_map<K, V>;
template <typename K, typename V>
using MapArray = std::vector<std::pair<K, V>>;
using Matrix = std::vector<std::vector<float>>;
using MatrixL = std::vector<std::vector<int64>>;
using MatrixS = std::vector<std::vector<string>>;
template <typename K, typename V>
using FSMap = featurestore::type::fs_map<K, V>;
using FieldPtr = absl::variant<
const optional<string>*, const optional<int32>*, const optional<int64>*,
const optional<float>*, const optional<double>*,
const optional<absl::string_view>*,
const List<string>*, const List<int32>*, const List<int64>*,
const List<float>*, const List<double>*, const List<absl::string_view>*,
const Map<string, string>*, const Map<string, int32>*,
const Map<string, int64>*, const Map<string, float>*,
const Map<string, double>*, const Map<string, absl::string_view>*,
const Map<absl::string_view, absl::string_view>*,
const Map<absl::string_view, int32>*, const Map<absl::string_view, int64>*,
const Map<absl::string_view, float>*, const Map<absl::string_view, double>*,
const Map<absl::string_view, string>*,
const Map<int32, string>*, const Map<int32, int32>*,
const Map<int32, int64>*, const Map<int32, float>*,
const Map<int32, double>*, const Map<int32, absl::string_view>*,
const Map<int64, string>*, const Map<int64, float>*,
const Map<int64, double>*, const Map<int64, int32>*,
const Map<int64, int64>*, const Map<int64, absl::string_view>*,
const FSMap<absl::string_view, absl::string_view>*,
const FSMap<absl::string_view, int32>*,
const FSMap<absl::string_view, int64>*,
const FSMap<absl::string_view, float>*,
const FSMap<absl::string_view, double>*,
const FSMap<int32, int32>*, const FSMap<int32, int64>*,
const FSMap<int32, float>*, const FSMap<int32, double>*,
const FSMap<int32, absl::string_view>*,
const FSMap<int64, float>*, const FSMap<int64, double>*,
const FSMap<int64, int32>*, const FSMap<int64, int64>*,
const FSMap<int64, absl::string_view>*,
const MapArray<string, string>*, const MapArray<string, int32>*,
const MapArray<string, int64>*, const MapArray<string, float>*,
const MapArray<string, double>*,
const MapArray<int32, string>*, const MapArray<int32, float>*,
const MapArray<int32, double>*, const MapArray<int32, int32>*,
const MapArray<int32, int64>*,
const MapArray<int64, string>*, const MapArray<int64, float>*,
const MapArray<int64, double>*, const MapArray<int64, int32>*,
const MapArray<int64, int64>*, const Matrix*, const MatrixL*,
const MatrixS*>;
// Represents a COLUMN of the feature table.
using VariantVector = absl::variant<
vector<optional<string>>, vector<optional<int32>>, vector<optional<int64>>,
vector<optional<float>>, vector<optional<double>>,
vector<optional<absl::string_view>>,
vector<List<string>>, vector<List<int32>>, vector<List<int64>>,
vector<List<float>>, vector<List<double>>, vector<List<absl::string_view>>,
vector<Map<string, string>>, vector<Map<string, int32>>,
vector<Map<string, int64>>, vector<Map<string, float>>,
vector<Map<string, double>>, vector<Map<string, absl::string_view>>,
vector<Map<absl::string_view, absl::string_view>>,
vector<Map<absl::string_view, int32>>,
vector<Map<absl::string_view, int64>>,
vector<Map<absl::string_view, float>>,
vector<Map<absl::string_view, double>>,
vector<Map<int32, string>>, vector<Map<int32, int32>>,
vector<Map<int32, int64>>, vector<Map<int32, float>>,
vector<Map<int32, double>>, vector<Map<int32, absl::string_view>>,
vector<Map<int64, string>>, vector<Map<int64, float>>,
vector<Map<int64, double>>, vector<Map<int64, int32>>,
vector<Map<int64, int64>>, vector<Map<int64, absl::string_view>>,
vector<FSMap<absl::string_view, absl::string_view>>,
vector<FSMap<absl::string_view, int32>>,
vector<FSMap<absl::string_view, int64>>,
vector<FSMap<absl::string_view, float>>,
vector<FSMap<absl::string_view, double>>,
vector<FSMap<int32, int32>>, vector<FSMap<int32, int64>>,
vector<FSMap<int32, float>>, vector<FSMap<int32, double>>,
vector<FSMap<int32, absl::string_view>>,
vector<FSMap<int64, float>>, vector<FSMap<int64, double>>,
vector<FSMap<int64, int32>>, vector<FSMap<int64, int64>>,
vector<FSMap<int64, absl::string_view>>,
vector<MapArray<string, string>>, vector<MapArray<string, int32>>,
vector<MapArray<string, int64>>, vector<MapArray<string, float>>,
vector<MapArray<string, double>>,
vector<MapArray<int32, string>>, vector<MapArray<int32, float>>,
vector<MapArray<int32, double>>, vector<MapArray<int32, int32>>,
vector<MapArray<int32, int64>>,
vector<MapArray<int64, string>>, vector<MapArray<int64, float>>,
vector<MapArray<int64, double>>, vector<MapArray<int64, int32>>,
vector<MapArray<int64, int64>>, vector<Matrix>, vector<MatrixL>,
vector<MatrixS>>;
/**
* @brief The public base class for custom feature operators.
*
* The framework checks if a subclass overrides the `BatchProcess` method. If it is overridden,
* the framework calls this method to perform the feature transformation.
* Otherwise, the framework selects one of the `ProcessWith*` methods based on the `value_type` configuration.
* Implement the method that corresponds to the required output type.
*/
class IFeatureOP {
public:
class NotOverriddenException : public std::exception {
public:
explicit NotOverriddenException(std::string msg) : msg_(std::move(msg)) {}
const char* what() const noexcept override {
if (msg_.empty()) {
return "unimplemented method called";
}
// Cache the message to a member variable to ensure that the returned pointer remains valid.
cached_ = "unimplemented method called: " + msg_;
return cached_.c_str();
}
private:
std::string msg_;
mutable std::string cached_;
};
virtual ~IFeatureOP() = default;
/**
* @brief Initialization method.
* @param feature_config The full fg.json entry for this feature, as a JSON string.
* @return 0 on success; non-zero indicates failure.
*/
virtual int Initialize(const string& feature_config) = 0;
/**
* @brief Processes one record and outputs string values.
* @param inputs A record that can contain multiple fields.
* @param outputs The outputs of the feature transformation.
* @return 0 on success.
*/
virtual int ProcessWithStrOutputs(const vector<FieldPtr>& inputs,
vector<string>& outputs) {
throw NotOverriddenException("ProcessWithStrOutputs(FieldPtr)");
}
/**
* @brief Processes one record and outputs int32 values.
*/
virtual int ProcessWithInt32Outputs(const vector<FieldPtr>& inputs,
vector<int32>& outputs) {
throw NotOverriddenException("ProcessWithInt32Outputs(FieldPtr)");
}
/**
* @brief Processes one record and outputs int64 values.
*/
virtual int ProcessWithInt64Outputs(const vector<FieldPtr>& inputs,
vector<int64>& outputs) {
throw NotOverriddenException("ProcessWithInt64Outputs(FieldPtr)");
}
/**
* @brief Processes one record and outputs float values.
*/
virtual int ProcessWithFloatOutputs(const vector<FieldPtr>& inputs,
vector<float>& outputs) {
throw NotOverriddenException("ProcessWithFloatOutputs(FieldPtr)");
}
/**
* @brief Processes one record and outputs double values.
*/
virtual int ProcessWithDoubleOutputs(const vector<FieldPtr>& inputs,
vector<double>& outputs) {
throw NotOverriddenException("ProcessWithDoubleOutputs(FieldPtr)");
}
/**
* @brief Optional batch interface that processes one batch of records.
*
* @param inputs A vector of input columns; each `VariantVector` represents one feature column.
* @param outputs The transformed features. Supports complex output types usable as inputs for downstream operators.
* @return 0 on success.
*/
virtual int BatchProcess(const vector<VariantVector>& inputs,
VariantVector& outputs) {
throw NotOverriddenException("BatchProcess");
}
/**
* @brief Declares whether the subclass implements BatchProcess.
*
* Override this to return `true` if you implement `BatchProcess`.
* This avoids exception propagation issues across dynamic library boundaries
* when the .so and the main program use different C++ ABIs.
*
* @return true if BatchProcess is implemented; false otherwise (default).
*/
virtual bool HasBatchProcessImpl() const { return false; }
};
using CreateOperatorFunc = IFeatureOP* (*)();
inline FieldPtr GetFieldPtr(const VariantVector& input, size_t i) {
return absl::visit(
[&](const auto& vec) -> FieldPtr {
if (i >= vec.size()) {
throw std::out_of_range("GetFieldPtr: index " + std::to_string(i) +
" out of range [0, " +
std::to_string(vec.size()) + ")");
}
return &vec.at(i);
},
input);
}
} // namespace fg
#if defined(__GNUC__)
#define PLUGIN_API_HIDDEN \
__attribute__((visibility("hidden"))) __attribute__((used))
#define PLUGIN_API_EXPORT \
__attribute__((visibility("default"))) __attribute__((used))
#else
#define PLUGIN_API_HIDDEN
#define PLUGIN_API_EXPORT
#endif
std::vector<std::string>& getLocalNames();
std::vector<std::pair<std::string, void*>>& getLocalRegs();
#define REGISTER_PLUGIN(OpName, OpClass) \
extern "C" PLUGIN_API_EXPORT fg::IFeatureOP* create##OpClass() { \
return new fg::OpClass(); \
} \
namespace { \
struct _Reg_##OpClass { \
_Reg_##OpClass() { \
getLocalNames().push_back(OpName); \
getLocalRegs().emplace_back(OpName, (void*)&create##OpClass); \
} \
}; \
static _Reg_##OpClass _dummy_##OpClass __attribute__((used)); \
}
#endif // FEATURE_GENERATOR_PLUGIN_BASE_H
Implementation notes
Input types: Implement ProcessWith* methods for all input types you intend to support. For unsupported types, throw an exception directly. VariantRecord defines all feature field types that the framework can process. FSMap types are required when integrating with Feature Store; they significantly improve online processor performance.
Discretization: Implement only the transformation logic before discretization. If hash_bucket_size, vocab_list, boundaries, or another discretization operation is configured, the framework handles it automatically.
Thread safety: By default (is_op_thread_safe=true), the operator must be stateless or use only thread_local variables. Set is_op_thread_safe=false to have the framework create one object replica per thread; this is simpler to implement but uses more memory.
`string_view` inputs: Online services (EasyRecProcessor, TorchEasyRec Processor) pass item-side string features as absl::string_view for efficiency. If your operator can't handle string_view, set disable_string_view=true in the configuration to have the framework convert them to string before calling your operator. This degrades performance.
Custom config items: Use any key names in the JSON entry; the framework passes the full JSON string to Initialize. Don't reuse framework-reserved key names (such as feature_type, operator_name, value_type). Keys that reference external files must end with _file so the framework can sync them for offline tasks.
Operator directory: Use the FEATURE_OPERATOR_DIR environment variable to specify the directory where dynamic-link library files are located. Each dynamic-link library can contain implementations of multiple feature operators.
`BatchProcess` return type: The type of the VariantVector returned by BatchProcess depends on the values of is_sequence, value_dimension, and value_type. For more information, see the output table schema in Configuration reference. When stub_type=true is configured and no binning operation is set, BatchProcess can return any valid type, such as Map.
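The input-type and string_view notes come down to visiting a variant of typed pointers. The sketch below is a standalone illustration under stated assumptions: MiniFieldPtr is a hypothetical three-alternative subset of FieldPtr, and std::variant and std::string_view stand in for their Abseil counterparts so that the code compiles without the fg headers. AsText is likewise a made-up helper name.

```cpp
#include <optional>
#include <string>
#include <string_view>
#include <type_traits>
#include <variant>

// Hypothetical subset of fg::FieldPtr, built on std::variant.
using MiniFieldPtr = std::variant<const std::optional<std::string>*,
                                  const std::optional<std::string_view>*,
                                  const std::optional<int>*>;

// Returns a view of the text behind a field, or an empty view when the field
// is null. Sets *ok to false when the field is not a string type.
inline std::string_view AsText(const MiniFieldPtr& field, bool* ok) {
  *ok = true;
  return std::visit(
      [ok](const auto* value) -> std::string_view {
        using T = std::decay_t<decltype(*value)>;
        if constexpr (std::is_same_v<T, std::optional<std::string>> ||
                      std::is_same_v<T, std::optional<std::string_view>>) {
          // Both alternatives become a string_view without copying bytes.
          return value->has_value() ? std::string_view(**value)
                                    : std::string_view();
        } else {
          *ok = false;  // for example, an int field passed to a text operator
          return std::string_view();
        }
      },
      field);
}
```

Normalizing both string alternatives to a view in one place means the rest of the operator needs a single code path, so disable_string_view can stay off.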
Configuration reference
fg.json entry structure
{
"feature_name": "my_custom_fg_op",
"feature_type": "custom_feature",
"operator_name": "EditDistance",
"operator_lib_file": "libedit_distance.so",
"expression": [
"user:query",
"item:title"
],
"value_type": "string",
"separator": ",",
"default_value": "-1",
"value_dimension": 1,
"normalizer": "method=expression,expr=x>16?16:x",
"num_buckets": 10000,
"stub_type": false,
"is_sequence": false,
"is_op_thread_safe": true
}
Add any additional configuration items your operator needs. The entire JSON entry is passed to Initialize.
Configuration parameters
| Parameter | Required | Description |
|---|---|---|
| feature_type | Yes | Set to custom_feature. |
| operator_name | Yes | The operator name passed as the first argument to REGISTER_PLUGIN. The same operator can be reused across multiple features. |
| operator_lib_file | Offline: required; Online: optional | Name of the .so file. Online services scan the custom_fg_lib subdirectory of the fg.json model directory and load all .so files automatically. For official extension operators, set this to pyfg/lib/libxxx.so. For offline tasks, upload the .so as a MaxCompute resource with the same name. |
| expression | Yes | Input fields. Supports multiple inputs. |
| value_type | Yes | Output type. One of: string, int32, int64, float, double. |
| default_value | Yes | Default value as a string. The framework converts it to value_type. |
| separator | When output is multi-dimensional | Splits default_value into multiple values for multi-dimensional features. |
| stub_type | No | If true, the operator can only produce intermediate results and cannot be a leaf node in the Directed Acyclic Graph (DAG) execution graph. Default: false. |
| is_sequence | No | Whether the output is a sequence feature. Default: false. |
| sequence_length | When is_sequence=true | Maximum sequence length. Longer sequences are truncated. |
| sequence_delim | When input is string-type sequence | Separator between sequence elements. |
| split_sequence | No | For string-type input sequences, whether the framework splits the string before passing it to the operator. Default: true. After splitting, the field type becomes std::vector<std::string> even if it was originally a scalar. The split uses the CPU AVX-512 instruction set for better performance. If some inputs are sequences and others are scalars, consider whether framework-level splitting is appropriate. |
| value_dimension | No | Dimension of the output feature. Default: 0. Used to truncate output in offline tasks and affects the output table schema. Omit if the output dimension is variable. |
| normalizer | No | Post-transformation normalization for numeric features. See Normalizer frameworks. |
| placeholder | When is_sequence=true and value_dimension != 1 | Fills empty positions in a multi-value sequence element. Default for floats: NaN; default for integers: minimum value of the type. Sparse features with discretization output a jagged value; dense features without discretization use default_value instead. |
| disable_string_view | No | Converts string_view-type inputs to string before calling the operator. Default: false. Enable this if your operator can't handle string_view. Note: enabling this degrades performance. Map keys and values of string_view type are not converted; handle them in your operator. |
| is_op_thread_safe | No | Whether the operator is thread-safe. Default: true (operator must be stateless or use only thread_local variables). Set to false to have the framework create one object per thread (simpler, but uses more memory). |
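With the default is_op_thread_safe=true, a single operator object may be called from several threads at once, so per-call scratch data cannot live in ordinary members. The sketch below shows one way to meet that constraint. TokenCounter is a hypothetical operator-like class that does not derive from IFeatureOP; its Initialize and Process signatures are simplified for illustration.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical class that is safe to share across threads: configuration is
// read-only after Initialize, and scratch space is thread_local.
class TokenCounter {
 public:
  int Initialize(char delimiter) {
    delimiter_ = delimiter;  // written once, before any Process call
    return 0;
  }

  // Counts non-empty tokens. The method is const: no member changes per call.
  int Process(std::string_view text) const {
    // One buffer per thread, reused across calls to avoid reallocation.
    thread_local std::vector<std::string_view> tokens;
    tokens.clear();
    std::size_t start = 0;
    while (start <= text.size()) {
      std::size_t end = text.find(delimiter_, start);
      if (end == std::string_view::npos) end = text.size();
      if (end > start) tokens.push_back(text.substr(start, end - start));
      start = end + 1;
    }
    return static_cast<int>(tokens.size());
  }

 private:
  char delimiter_ = ',';
};
```

If an operator genuinely needs mutable per-object state, setting is_op_thread_safe to false trades memory for simplicity, as the table describes.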
Output table schema
The value_dimension and is_sequence settings determine the schema type of the output table in offline tasks:
| value_dimension | is_sequence | Schema type | With discretization |
|---|---|---|---|
| 1 | false | value_type | bigint |
| 1 | true | array<value_type> | array<bigint> |
| ≠1 | false | array<value_type> | array<bigint> |
| ≠1 | true | array<array<value_type>> | array<array<bigint>> |
Special cases: array<array<int>> is forced to array<array<bigint>>; array<array<double>> is forced to array<array<float>>.
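The table and its special cases can be condensed into one function. This is a sketch for reasoning about expected column types, not framework code: OutputSchema and ColumnType are hypothetical names, and the mapping of int32 to int and int64 to bigint is an assumption about MaxCompute type names.

```cpp
#include <string>

// Maps an fg value_type to a column type name.
// Assumption: int32 -> int and int64 -> bigint; other names carry over.
inline std::string ColumnType(const std::string& value_type) {
  if (value_type == "int32") return "int";
  if (value_type == "int64") return "bigint";
  return value_type;  // string, float, double
}

// Derives the offline output column type from the table above.
inline std::string OutputSchema(const std::string& value_type,
                                int value_dimension, bool is_sequence,
                                bool discretized) {
  std::string elem = discretized ? "bigint" : ColumnType(value_type);
  // Each of "value_dimension != 1" and "is_sequence" adds one array level.
  int depth = (value_dimension != 1 ? 1 : 0) + (is_sequence ? 1 : 0);
  if (depth == 2) {
    // Special cases apply only to doubly nested arrays.
    if (elem == "int") elem = "bigint";
    if (elem == "double") elem = "float";
  }
  std::string schema = elem;
  for (int i = 0; i < depth; ++i) schema = "array<" + schema + ">";
  return schema;
}
```

For example, a double-valued sequence with value_dimension=8 maps to array<array<float>> under these rules, because of the second special case.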
Normalizer frameworks
For numeric features, add a normalizer to further process the transformation result:
| Framework | Example config | Formula |
|---|---|---|
| log10 | method=log10,threshold=1e-10,default=-10 | x = x > threshold ? log10(x) : default |
| zscore | method=zscore,mean=0.0,standard_deviation=10.0 | x = (x - mean) / standard_deviation |
| minmax | method=minmax,min=2.1,max=2.2 | x = (x - min) / (max - min) |
| expression | method=expression,expr=sign(x) | Any expression; the variable x represents the input. |
For supported functions in expression, see Built-in feature operators.
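The three fixed formulas are simple enough to write out directly, which can help when checking expected values by hand. The function names below are hypothetical; only the formulas come from the table.

```cpp
#include <cmath>

// log10: x > threshold ? log10(x) : default
inline double NormalizeLog10(double x, double threshold, double default_value) {
  return x > threshold ? std::log10(x) : default_value;
}

// zscore: (x - mean) / standard_deviation
inline double NormalizeZScore(double x, double mean, double standard_deviation) {
  return (x - mean) / standard_deviation;
}

// minmax: (x - min) / (max - min)
inline double NormalizeMinMax(double x, double min, double max) {
  return (x - min) / (max - min);
}
```

With the example configurations from the table, an input of 0 to log10 falls at or below the threshold and yields the default of -10, and an input of 5 to zscore yields 0.5.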
Discretization operations
Six discretization types are available without any additional implementation:
| Type | Description |
|---|---|
| hash_bucket_size | Hash and modulo operation on the transformation result. |
| vocab_list | Converts the result to an index in a list. |
| vocab_dict | Converts the result to a value in a dictionary. The value must be convertible to int64. |
| vocab_file | Loads a vocab_list or vocab_dict from a file. |
| boundaries | Converts the result to a bucket number based on specified bin boundaries. |
| num_buckets | Uses the result directly as a bucket number. |
For more information, see Feature discretization (binning).
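To make the descriptions concrete, here is a rough sketch of three of these operations. It only illustrates the general idea: the function names are hypothetical, std::hash is a placeholder for whatever hash the framework uses, and both the boundary convention (a value equal to a boundary goes to the right-hand bin) and the -1 result for unknown vocabulary entries are assumptions.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// boundaries: index of the bin that x falls into (boundaries must be sorted).
// Assumption: a value equal to a boundary lands in the bin on its right.
inline int64_t BucketByBoundaries(double x,
                                  const std::vector<double>& boundaries) {
  return std::upper_bound(boundaries.begin(), boundaries.end(), x) -
         boundaries.begin();
}

// hash_bucket_size: hash, then modulo. std::hash is only a placeholder, so
// the bucket ids will not match the framework's.
inline int64_t BucketByHash(const std::string& value,
                            int64_t hash_bucket_size) {
  return static_cast<int64_t>(std::hash<std::string>{}(value) %
                              static_cast<uint64_t>(hash_bucket_size));
}

// vocab_list: position of the value in the list.
// Assumption: -1 marks a value that is not in the list.
inline int64_t IndexInVocab(const std::string& value,
                            const std::vector<std::string>& vocab_list) {
  auto it = std::find(vocab_list.begin(), vocab_list.end(), value);
  if (it == vocab_list.end()) return -1;
  return it - vocab_list.begin();
}
```

Because the framework applies these steps after the operator returns, a custom operator only needs to produce the raw value that feeds them.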
Configuration examples
Sequence time-difference feature
{
"feature_name": "time_diff_seq",
"feature_type": "custom_feature",
"operator_name": "SeqExpr",
"expression": ["user:cur_time", "user:clk_time_seq"],
"formula": "cur_time - clk_time_seq",
"default_value": "0",
"value_type": "int32",
"is_sequence": true,
"num_buckets": 1000,
"is_op_thread_safe": false
}
Spherical distance with normalization
{
"feature_name": "spherical_distance",
"feature_type": "custom_feature",
"operator_name": "SeqExpr",
"expression": ["item:click_id_lng", "item:click_id_lat", "user:j_lng", "user:j_lat"],
"formula": "spherical_distance",
"default_value": "0",
"value_type": "double",
"is_sequence": true,
"is_op_thread_safe": true,
"value_dimension": 1,
"normalizer": "method=expression,expr=sqrt(x)"
}
Both examples use the SeqExpr operator. The formula field is a SeqExpr-specific configuration item passed through Initialize.
spherical_distance: Calculates the distance between two latitude/longitude coordinate pairs. Parameters are [lng1_seq, lat1_seq, lng2, lat2]; the first two are sequences and the last two are scalars.
These examples demonstrate the tiled format for custom sequence features. For a nested-format example, see sequence_feature.
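Both configurations rely on the same idea: sequence inputs are paired by position and scalar inputs are broadcast to every element. The sketch below illustrates that idea outside the framework. The function names are hypothetical, and the haversine formula with a 6371 km Earth radius is an assumption; the formula and unit that SeqExpr actually uses may differ.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// cur_time - clk_time_seq: the scalar is broadcast against each element.
inline std::vector<int64_t> TimeDiffSeq(
    int64_t cur_time, const std::vector<int64_t>& clk_time_seq) {
  std::vector<int64_t> out;
  out.reserve(clk_time_seq.size());
  for (int64_t clk : clk_time_seq) out.push_back(cur_time - clk);
  return out;
}

// Great-circle distance in kilometers via the haversine formula (assumed).
inline double HaversineKm(double lng1, double lat1, double lng2, double lat2) {
  constexpr double kPi = 3.14159265358979323846;
  constexpr double kEarthRadiusKm = 6371.0;
  const double to_rad = kPi / 180.0;
  const double dlat = (lat2 - lat1) * to_rad;
  const double dlng = (lng2 - lng1) * to_rad;
  const double a = std::sin(dlat / 2) * std::sin(dlat / 2) +
                   std::cos(lat1 * to_rad) * std::cos(lat2 * to_rad) *
                       std::sin(dlng / 2) * std::sin(dlng / 2);
  return 2.0 * kEarthRadiusKm * std::asin(std::sqrt(a));
}

// [lng1_seq, lat1_seq, lng2, lat2]: two sequences paired by position,
// two scalars broadcast to every element.
inline std::vector<double> SphericalDistanceSeq(
    const std::vector<double>& lng1_seq, const std::vector<double>& lat1_seq,
    double lng2, double lat2) {
  const std::size_t n = std::min(lng1_seq.size(), lat1_seq.size());
  std::vector<double> out;
  out.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    out.push_back(HaversineKm(lng1_seq[i], lat1_seq[i], lng2, lat2));
  }
  return out;
}
```

The output has one value per sequence element, which is why both configurations set is_sequence to true.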
Sequence features
When is_sequence=true, the output requirements differ based on whether the feature is sparse or dense:
Sparse sequences
Single value per element: output any type.
Multiple values per element: output string only. Set value_type to string and use chr(29) as the separator between values within one element.
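In C++, chr(29) is the byte '\x1D' (the ASCII group separator). A minimal sketch of building one multi-value element follows; JoinElementValues and kValueSeparator are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// ASCII 29 (group separator), the chr(29) delimiter named above.
constexpr char kValueSeparator = '\x1D';

// Joins the values that belong to ONE sequence element into a single string.
inline std::string JoinElementValues(const std::vector<std::string>& values) {
  std::string joined;
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (i > 0) joined.push_back(kValueSeparator);
    joined += values[i];
  }
  return joined;
}
```

Each call produces one output string, so a sequence of N elements still yields N strings from the operator.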
Dense sequences
Set value_dimension to the dimension of each element.
Scalar elements: value_dimension=1.
Vector elements: value_dimension=length of the vector.
The total number of output values must be an integer multiple of value_dimension.
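One way to guarantee the integer-multiple requirement is to validate each element while flattening. The sketch below assumes that the operator holds its per-element vectors in memory before emitting a flat output list; FlattenDenseSequence is a hypothetical helper, not a framework function.

```cpp
#include <cstddef>
#include <vector>

// Flattens per-element vectors into one output list and checks that every
// element has exactly value_dimension values.
// Returns 0 on success and -1 on a size mismatch.
inline int FlattenDenseSequence(const std::vector<std::vector<float>>& elements,
                                std::size_t value_dimension,
                                std::vector<float>& outputs) {
  outputs.clear();
  if (value_dimension == 0) return -1;
  outputs.reserve(elements.size() * value_dimension);
  for (const auto& element : elements) {
    if (element.size() != value_dimension) {
      outputs.clear();
      return -1;  // a short or long element would misalign all later ones
    }
    outputs.insert(outputs.end(), element.begin(), element.end());
  }
  return 0;
}
```

On success, outputs.size() equals elements.size() * value_dimension, so the multiple-of-dimension condition holds by construction.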
Developer example
The following example implements an edit distance operator that computes the Levenshtein distance between two text inputs.
Header file (`edit_distance.h`):
#pragma once
#include <memory> // std::unique_ptr
#include "api/base_op.h"
namespace fg {
namespace functor {
class EditDistanceFunctor;
}
using std::string;
using std::vector;
/**
* @brief EditDistance: takes two strings, outputs their edit distance.
*/
class EditDistance : public IFeatureOP {
public:
int Initialize(const string& feature_config) override;
/// @return 0 on success.
int ProcessWithStrOutputs(const vector<FieldPtr>& inputs,
vector<string>& outputs) override;
/// @return 0 on success.
int ProcessWithInt32Outputs(const vector<FieldPtr>& inputs,
vector<int32>& outputs) override;
/// @return 0 on success.
int ProcessWithInt64Outputs(const vector<FieldPtr>& inputs,
vector<int64>& outputs) override;
/// @return 0 on success.
int ProcessWithFloatOutputs(const vector<FieldPtr>& inputs,
vector<float>& outputs) override;
/// @return 0 on success.
int ProcessWithDoubleOutputs(const vector<FieldPtr>& inputs,
vector<double>& outputs) override;
private:
string feature_name_;
std::unique_ptr<functor::EditDistanceFunctor> functor_p_;
};
} // end of namespace fg
Implementation file (`edit_distance.cc`):
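#include "edit_distance.h"
#include <absl/strings/ascii.h>
#include <absl/strings/str_join.h>
#include <nlohmann/json.hpp>
#include <algorithm> // std::min, std::transform
#include <cassert> // assert
#include <iterator> // std::back_inserter
#include <memory> // std::make_unique
#include <numeric> // std::iota
#include <stdexcept>
#include <string>
#include <typeinfo> // typeid
#include "api/log.h"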
namespace fg {
using absl::optional;
namespace functor {
template <class T>
int edit_distance(const T& s1, const T& s2) {
int l1 = s1.size();
int l2 = s2.size();
if (l1 * l2 == 0) {
return l1 + l2;
}
vector<int> prev(l2 + 1);
vector<int> curr(l2 + 1);
std::iota(prev.begin(), prev.end(), 0);
// prev already holds row 0, so start at row 1; starting at 0 would read s1[-1].
for (int i = 1; i <= l1; ++i) {
curr[0] = i;
for (int j = 1; j <= l2; ++j) {
int d = prev[j - 1];
if (s1[i - 1] == s2[j - 1]) {
curr[j] = d;
} else {
int d2 = std::min(prev[j], curr[j - 1]);
curr[j] = 1 + std::min(d, d2);
}
}
prev.swap(curr);
}
return prev[l2];
}
enum class Encoding : unsigned int { Latin = 0, UTF8 = 1 };
class EditDistanceFunctor {
public:
EditDistanceFunctor(const string& encoding) {
string enc = absl::AsciiStrToLower(encoding);
if (enc == "utf-8" || enc == "utf8") {
encoding_ = Encoding::UTF8;
} else {
encoding_ = Encoding::Latin;
}
}
int operator()(absl::string_view s1, absl::string_view s2) {
if (encoding_ == Encoding::Latin) {
return edit_distance(s1, s2);
}
if (encoding_ == Encoding::UTF8) {
return edit_distance(from_bytes(s1), from_bytes(s2));
}
LOG(ERROR) << "EditDistanceFunctor found unsupported text encoding";
assert(false);
return 0;
}
const Encoding TextEncoding() const { return encoding_; }
private:
Encoding encoding_;
std::wstring from_bytes(absl::string_view str) {
std::wstring result;
int i = 0;
int len = (int)str.length();
while (i < len) {
int char_size = 0;
int unicode = 0;
if ((str[i] & 0x80) == 0) {
unicode = str[i];
char_size = 1;
} else if ((str[i] & 0xE0) == 0xC0) {
unicode = str[i] & 0x1F;
char_size = 2;
} else if ((str[i] & 0xF0) == 0xE0) {
unicode = str[i] & 0x0F;
char_size = 3;
} else if ((str[i] & 0xF8) == 0xF0) {
unicode = str[i] & 0x07;
char_size = 4;
} else {
// Invalid UTF-8 sequence
++i;
continue;
}
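if (i + char_size > len) {
// Truncated multi-byte sequence at the end of the input; stop decoding.
break;
}
for (int j = 1; j < char_size; ++j) {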
unicode = (unicode << 6) | (str[i + j] & 0x3F);
}
if (unicode <= 0xFFFF) {
result += static_cast<wchar_t>(unicode);
} else {
// Handle surrogate pairs for characters outside the BMP
unicode -= 0x10000;
result += static_cast<wchar_t>((unicode >> 10) + 0xD800);
result += static_cast<wchar_t>((unicode & 0x3FF) + 0xDC00);
}
i += char_size;
}
return result;
}
};
} // namespace functor
// Overloaded helper for absl::visit (C++17).
template <class... Ts>
struct overloaded : Ts... {
using Ts::operator()...;
};
template <class... Ts>
overloaded(Ts...) -> overloaded<Ts...>;
int EditDistance::Initialize(const string& feature_config) {
nlohmann::json cfg;
try {
cfg = nlohmann::json::parse(feature_config);
} catch (nlohmann::json::parse_error& ex) {
LOG(ERROR) << "parse error at byte " << ex.byte;
LOG(ERROR) << "config: " << feature_config;
throw std::runtime_error("parse EditDistance config failed");
}
feature_name_ = cfg.at("feature_name").get<string>();
string encoding = cfg.value("encoding", "latin");
functor_p_ = std::make_unique<functor::EditDistanceFunctor>(encoding);
functor::Encoding enc = functor_p_->TextEncoding();
encoding = (enc == functor::Encoding::UTF8) ? "UTF-8" : "Latin";
LOG(INFO) << "feature <" << feature_name_ << "> with text encoding: " << encoding;
return 0;
}
int EditDistance::ProcessWithInt32Outputs(const vector<FieldPtr>& inputs,
vector<int32>& outputs) {
outputs.clear();
if (inputs.size() < 2) {
outputs.push_back(0);
return -1; // invalid inputs
}
int d = absl::visit(
overloaded{
[this](const optional<string>* s1, const optional<string>* s2) {
absl::string_view empty_view;
return functor_p_->operator()(*s1 ? **s1 : empty_view, *s2 ? **s2 : empty_view);
},
[this](const optional<absl::string_view>* s1,
const optional<absl::string_view>* s2) {
absl::string_view empty_view;
return functor_p_->operator()(*s1 ? **s1 : empty_view, *s2 ? **s2 : empty_view);
},
[this](const optional<absl::string_view>* s1,
const optional<string>* s2) {
absl::string_view empty_view;
return functor_p_->operator()(*s1 ? **s1 : empty_view, *s2 ? **s2 : empty_view);
},
[this](const optional<string>* s1,
const optional<absl::string_view>* s2) {
absl::string_view empty_view;
return functor_p_->operator()(*s1 ? **s1 : empty_view, *s2 ? **s2 : empty_view);
},
[this](const List<string>* s1, const List<string>* s2) {
string str1 = absl::StrJoin(*s1, "");
string str2 = absl::StrJoin(*s2, "");
return functor_p_->operator()(str1, str2);
},
[this](const List<absl::string_view>* s1,
const List<absl::string_view>* s2) {
string str1 = absl::StrJoin(*s1, "");
string str2 = absl::StrJoin(*s2, "");
return functor_p_->operator()(str1, str2);
},
[this](const auto* x, const auto* y) {
ERROR_EXIT(feature_name_,
"unsupported input type: ", typeid(*x).name(), " vs ",
typeid(*y).name());
return 0;
}},
inputs.at(0), inputs.at(1));
outputs.push_back(d);
return 0;
}
// int32 results are reused for int64, float, and double outputs.
int EditDistance::ProcessWithInt64Outputs(const vector<FieldPtr>& inputs,
vector<int64>& outputs) {
vector<int32> distances;
int status = ProcessWithInt32Outputs(inputs, distances);
if (0 != status) return status;
outputs.assign(distances.begin(), distances.end());
return 0;
}
int EditDistance::ProcessWithFloatOutputs(const vector<FieldPtr>& inputs,
vector<float>& outputs) {
vector<int32> distances;
int status = ProcessWithInt32Outputs(inputs, distances);
if (0 != status) return status;
outputs.assign(distances.begin(), distances.end());
return 0;
}
int EditDistance::ProcessWithDoubleOutputs(const vector<FieldPtr>& inputs,
vector<double>& outputs) {
vector<int32> distances;
int status = ProcessWithInt32Outputs(inputs, distances);
if (0 != status) return status;
outputs.assign(distances.begin(), distances.end());
return 0;
}
int EditDistance::ProcessWithStrOutputs(const vector<FieldPtr>& inputs,
vector<string>& outputs) {
vector<int32> distances;
int status = ProcessWithInt32Outputs(inputs, distances);
if (0 != status) return status;
outputs.reserve(distances.size());
std::transform(distances.begin(), distances.end(),
std::back_inserter(outputs),
[](int32& x) { return std::to_string(x); });
return 0;
}
} // end of namespace fg
REGISTER_PLUGIN("EditDistance", EditDistance);
Download the source code from the table in Available operators and run the build.sh script to compile the operator.
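For a quick check of expected values outside the framework, the dynamic-programming core can be exercised on its own. The sketch below is a standalone byte-wise version (comparable to the latin encoding mode) that uses std::string_view instead of absl::string_view; EditDistanceBytes is a hypothetical name and is not part of the downloadable source.

```cpp
#include <algorithm>
#include <numeric>
#include <string_view>
#include <vector>

// Two-row Levenshtein distance over bytes.
inline int EditDistanceBytes(std::string_view s1, std::string_view s2) {
  const int l1 = static_cast<int>(s1.size());
  const int l2 = static_cast<int>(s2.size());
  if (l1 == 0 || l2 == 0) return l1 + l2;
  std::vector<int> prev(l2 + 1);
  std::vector<int> curr(l2 + 1);
  std::iota(prev.begin(), prev.end(), 0);  // row 0: distance from empty prefix
  for (int i = 1; i <= l1; ++i) {          // rows 1..l1
    curr[0] = i;
    for (int j = 1; j <= l2; ++j) {
      if (s1[i - 1] == s2[j - 1]) {
        curr[j] = prev[j - 1];  // characters match: no extra cost
      } else {
        // substitution, deletion, insertion
        curr[j] = 1 + std::min({prev[j - 1], prev[j], curr[j - 1]});
      }
    }
    prev.swap(curr);
  }
  return prev[l2];
}
```

Keeping only two rows reduces memory from O(l1 * l2) to O(l2), which matters when the operator runs once per record.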
Compile a custom operator
Use the same compilation environment as the FG framework: C++17 and the official compiler image. The image details are in the build.sh script included with each example.
| Image | Base OS | Notes |
|---|---|---|
| mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easyrec/feature_generator:centos7-0.1.1 | CentOS 7 | Default image; the new C++11 ABI is not enabled |
| mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easyrec/feature_generator:0.1.1 | Rocky Linux 8 (CentOS 8 compatible) | Use this image if you need the new C++11 ABI (_GLIBCXX_USE_CXX11_ABI=1) |
Third-party dependencies: Embed all dependencies as source code or use static linking. Dynamic linking to third-party libraries causes the operator to fail to load at runtime.
Required dependency:
abseil-cpp — use the same version as the FG framework.
For CMake configuration details, see the CMakeLists.txt in each developer example.
Available operators
The following operators are available as source code and prebuilt binaries:
| Operator | Description | Source code | Binary |
|---|---|---|---|
| EditDistance | Edit distance between two text inputs | Download | Download |
| SeqExpr | Sequence expression evaluation | Download | Download |
| BPETokenize | Byte Pair Encoding (BPE) tokenization | Download | Included in built-in tokenize_feature |
EditDistance configuration
| Parameter | Options | Default |
|---|---|---|
| encoding | utf-8, latin | latin |
For a BatchProcess example, download and review RegexReplace.
What's next
Built-in feature operators: available operators and the expr_feature expression syntax
Use FG in offline tasks: how to deploy custom operators for offline tasks
Feature Generation overview and configuration: full fg.json schema and discretization reference