
Artificial Intelligence Recommendation:Custom feature operators

Last Updated:Jan 30, 2026

Custom feature operators are plugins that the Feature Generation (FG) framework dynamically loads and executes. The framework is designed to be lightweight and includes only a few common feature operators, which reduces compile time, minimizes service resource usage, and accelerates service startup. Less common transformations are implemented as custom operators and loaded only when they are required.

Configuration

{
    "feature_name": "my_custom_fg_op",
    "feature_type": "custom_feature",
    "operator_name": "EditDistance",
    "operator_lib_file": "libedit_distance.so",
    "expression": [
        "user:query",
        "item:title"
    ],
    "value_type": "string",
    "separator": ",",
    "default_value": "-1",
    "value_dimension": 1,
    "normalizer": "method=expression,expr=x>16?16:x",
    "num_buckets": 10000,
    "stub_type": false,
    "is_sequence": false,
    "is_op_thread_safe": true,
    ...
}

In addition to the listed configuration items, you can add other items as needed. The entire JSON configuration string is passed to the custom operator.

Configuration item

Description

feature_type

Set this to custom_feature.

operator_name

The name under which the feature operator is registered. We recommend that you keep this name consistent with the implemented class name. The same operator can be reused in multiple feature transformations.

operator_lib_file

The name of the feature operator's dynamic-link library file. The name must end with .so. This parameter is required for offline tasks and optional for online tasks.

  • Online services, such as Torch/EasyRecProcessor, scan all dynamic-link library files in the custom_fg_lib subdirectory of the directory where the fg.json model file is located. The services then load the files into memory.

  • A few official extension operators are provided, as listed in the Custom operator list section. To specify an official operator, set operator_lib_file to pyfg/lib/libxxx.so.

  • When you run an offline task, upload the dynamic-link library file (if it is not an official one) as a MaxCompute resource with the same name. After you upload the file, you must commit the resource.

expression

The input expression. Multiple inputs are supported.

value_type

The output type of the feature transformation. It can only be a basic type, such as string, int32, int64, float, or double.

default_value

The default value of the feature. Configure this as a string. The code converts it to the required type.

separator

The separator for multiple values. It is used to split the configured default_value. If the output feature is multi-dimensional, you can configure a default value with multiple values.

stub_type

Indicates whether the current feature operator can only be used as an intermediate result of a feature transformation. If you set this to true, the operator cannot be used as a leaf node in the Directed Acyclic Graph (DAG) execution graph.

is_sequence

Marks whether the feature is a sequence feature.

sequence_length

The maximum length of the sequence. If the length exceeds this value, the sequence is truncated.

sequence_delim

The separator between sequence elements. Set this only when the input is of the string type.

split_sequence

If the input sequence feature is of the string type, this parameter specifies whether the framework needs to perform a split operation on the sequence. The default value is true.

  • After the split operation, the type of the current input field becomes std::vector<std::string>, even if it was originally a scalar field.

  • If there are multiple input fields, some of which are sequences and some are scalars, carefully consider whether to perform the split operation at the framework layer.

  • The framework-level split operation uses the CPU AVX512 instruction set, which generally provides better performance.

value_dimension

The dimension of the output feature. This can be used to truncate the output of an offline task and affects the schema of the output table. If the feature has multiple values and the output dimension is uncertain, you can omit this configuration.

  • This is an optional parameter. The default value is 0.

  • If the value is 1 and is_sequence=false, the schema type of the output table is value_type. If a discretization operation is configured, the output type is bigint.

  • If the value is 1 and is_sequence=true, the schema type of the output table is array<value_type>. If a discretization operation is configured, the output type is array<bigint>.

  • If the value is not 1 and is_sequence=false, the schema type of the output table is array<value_type>. If a discretization operation is configured, the output type is array<bigint>.

  • If the value is not 1 and is_sequence=true, the schema type of the output table is array<array<value_type>>. If a discretization operation is configured, the output type is array<array<bigint>>.

  • Special case 1: According to the preceding rules, if the schema type of the output table is array<array<int>>, it is forcibly changed to array<array<bigint>>.

  • Special case 2: According to the preceding rules, if the schema type of the output table is array<array<double>>, it is forcibly changed to array<array<float>>.

Discretization operation

Six types of discretization operations are supported. You do not need to implement these operations yourself. For more information, see Feature discretization (binning).

  • hash_bucket_size: Performs a hash and modulo operation on the feature transformation result.

  • vocab_list: Converts the feature transformation result into an index of a list.

  • vocab_dict: Converts the feature transformation result into a value in a dictionary. The value must be convertible to the int64 type.

  • vocab_file: Loads a vocab_list or vocab_dict from a file.

  • boundaries: Specifies the binning boundaries to convert the feature transformation result into the corresponding bucket number.

  • num_buckets: Directly uses the feature transformation result as the binning bucket number.

normalizer

For numerical features, you can add this configuration to further process the transformation result, such as calculating the value of an expression.

For supported operators and functions, see Built-in feature operators. Four normalization methods are supported: minmax, zscore, log10, and expression. The configurations and calculation methods are as follows:

  • log10

    Example configuration: method=log10,threshold=1e-10,default=-10

    Formula: x = x > threshold ? log10(x) : default;

  • zscore

    Example configuration: method=zscore,mean=0.0,standard_deviation=10.0

    Formula: x = (x - mean) / standard_deviation

  • minmax

    Example configuration: method=minmax,min=2.1,max=2.2

    Formula: x = (x - min) / (max - min)

  • expression

    Example configuration: method=expression,expr=sign(x)

    Formula: You can configure any function or expression. The variable name is fixed as x, which represents the input of the expression.

placeholder

In a sequence feature, if each element of the sequence has multiple values (value_dimension != 1), the custom operator uses this special value to pad empty positions so that every element has the full dimension.

  • The default value for floating-point numbers is NaN. The default value for integers is the minimum value of the corresponding type.

  • The FG framework filters out these special placeholder values. For a sparse feature with a discretization operation configured, the output is a jagged feature value. For a dense feature without a discretization operation, the placeholder value is replaced with the feature's default value (default_value).

disable_string_view

Specifies whether to disable feature values of the string_view type. The default value is false.

  • When FG is integrated into a model inference service, such as EasyRecProcessor or TorchEasyRec Processor, string-type features on the item side are converted to the string_view type for input to FG for better performance.

  • For the convenience of operator developers, you can enable this configuration switch. The FG framework converts feature values of the string_view type to the string type and passes them to the custom operator.

  • Note: For Map-type data, if the key or value is of the string_view type, it cannot be converted. The developer still needs to handle the string_view type.

  • Note: Enabling this configuration switch will degrade performance.

is_op_thread_safe

Indicates whether the current feature operator is thread-safe. Set this to true if it is thread-safe, or false if it is not.

  • The default value is true. This means the operator developer must ensure that the operator is thread-safe. The operator must be stateless or have only thread_local variables.

    • [Recommended] Provide natively thread-safe operators.

  • If you set this parameter to false, the framework creates an object replica for each thread.

    • This method consumes more memory than a natively thread-safe operator.
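
The value_dimension rules in the table above can be condensed into one small function. This is only a sketch for reasoning about the output table schema: OutputSchemaType and its parameters are hypothetical names, not part of the FG API, and elem_type is assumed to already be the table column type that corresponds to value_type.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, NOT part of the FG API.
// elem_type:   table column type that corresponds to value_type (assumption:
//              the caller has already mapped it, for example "int" or "double").
// discretized: true if any discretization operation is configured.
std::string OutputSchemaType(const std::string& elem_type, int value_dimension,
                             bool is_sequence, bool discretized) {
  assert(value_dimension >= 0);
  // A configured discretization operation always yields bigint elements.
  std::string t = discretized ? "bigint" : elem_type;
  const bool multi_value = (value_dimension != 1);

  if (!multi_value && !is_sequence) {
    return t;  // scalar column
  }
  if (multi_value && is_sequence) {
    // Special cases 1 and 2 from the table: nested int and double are widened
    // or narrowed to bigint and float.
    if (t == "int") t = "bigint";
    if (t == "double") t = "float";
    return "array<array<" + t + ">>";
  }
  // Either a multi-value scalar feature or a sequence of single values.
  return "array<" + t + ">";
}
```

For example, OutputSchemaType("double", 8, true, false) yields array<array<float>>, which matches special case 2.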

Additional notes:

  • User-defined configuration items must not have the same names as the configuration items that are used by the framework.

    • Custom operators can read and use configuration items defined by the framework. However, attempting to change their semantics will cause undefined behavior.

  • Configuration items that depend on external resource files must end with _file.

    • This marker is used to sync resource files when you use FG in offline tasks.
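
The normalizer formulas listed in the configuration table can be written out as plain functions. The sketch below only restates those formulas; the function names are hypothetical, and whether the framework applies any additional clamping or type conversion is not covered here.

```cpp
#include <cassert>
#include <cmath>

// method=log10,threshold=...,default=...
// Formula from the table: x = x > threshold ? log10(x) : default
double NormalizeLog10(double x, double threshold, double default_value) {
  return x > threshold ? std::log10(x) : default_value;
}

// method=zscore,mean=...,standard_deviation=...
double NormalizeZScore(double x, double mean, double standard_deviation) {
  assert(standard_deviation != 0.0);  // avoid division by zero
  return (x - mean) / standard_deviation;
}

// method=minmax,min=...,max=...
double NormalizeMinMax(double x, double min_value, double max_value) {
  assert(max_value != min_value);  // avoid division by zero
  return (x - min_value) / (max_value - min_value);
}

// method=expression,expr=x>16?16:x (the expression used in the first
// configuration example): caps the value at 16.
double ClampAt16(double x) { return x > 16 ? 16 : x; }
```

With the example configuration method=zscore,mean=0.0,standard_deviation=10.0, an input of 5 becomes 0.5.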

Configuration examples

{
    "feature_name": "time_diff_seq",
    "feature_type": "custom_feature",
    "operator_name": "SeqExpr",
    "expression": ["user:cur_time", "user:clk_time_seq"],
    "formula": "cur_time - clk_time_seq",
    "default_value": "0",
    "value_type": "int32",
    "is_sequence": true,
    "num_buckets": 1000,
    "is_op_thread_safe": false
},
{
    "feature_name": "spherical_distance",
    "feature_type": "custom_feature",
    "operator_name": "SeqExpr",
    "expression": ["item:click_id_lng", "item:click_id_lat", "user:j_lng", "user:j_lat"],
    "formula": "spherical_distance",
    "default_value": "0",
    "value_type": "double",
    "is_sequence": true,
    "is_op_thread_safe": true,
    "value_dimension": 1,
    "normalizer": "method=expression,expr=sqrt(x)"
}
  • formula: An expression. For more information about supported expressions, see expr_feature.

    • spherical_distance: Calculates the distance between two latitude and longitude coordinates. The parameters are [lng1_seq, lat1_seq, lng2, lat2]. The first two parameters are sequences, and the last two are scalar values.

    This is an example of a custom Sequence feature in a tiled format. For an example of a custom Sequence feature in a nested format, see sequence_feature.
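
To make the two examples above concrete, the following sketch shows what the formulas compute for one record. It is not the SeqExpr source code. The function names are hypothetical, and the distance formula is an assumption: this page does not state which formula or unit spherical_distance uses, so the sketch uses the haversine formula with degrees as input and kilometers as output.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// "cur_time - clk_time_seq": the scalar input is broadcast over the sequence.
std::vector<std::int32_t> TimeDiffSeq(
    std::int32_t cur_time, const std::vector<std::int32_t>& clk_time_seq) {
  std::vector<std::int32_t> out;
  out.reserve(clk_time_seq.size());
  for (std::int32_t clk_time : clk_time_seq) {
    out.push_back(cur_time - clk_time);
  }
  return out;
}

// spherical_distance(lng1_seq, lat1_seq, lng2, lat2).
// ASSUMPTION: haversine formula, mean Earth radius of 6371 km.
std::vector<double> SphericalDistanceSeq(const std::vector<double>& lng1_seq,
                                         const std::vector<double>& lat1_seq,
                                         double lng2, double lat2) {
  assert(lng1_seq.size() == lat1_seq.size());  // the two sequences are paired
  const double kDegToRad = std::acos(-1.0) / 180.0;
  const double kEarthRadiusKm = 6371.0;
  std::vector<double> out;
  out.reserve(lng1_seq.size());
  for (std::size_t i = 0; i < lng1_seq.size(); ++i) {
    const double dlat = (lat2 - lat1_seq[i]) * kDegToRad;
    const double dlng = (lng2 - lng1_seq[i]) * kDegToRad;
    const double a =
        std::sin(dlat / 2) * std::sin(dlat / 2) +
        std::cos(lat1_seq[i] * kDegToRad) * std::cos(lat2 * kDegToRad) *
            std::sin(dlng / 2) * std::sin(dlng / 2);
    out.push_back(2.0 * kEarthRadiusKm *
                  std::atan2(std::sqrt(a), std::sqrt(1.0 - a)));
  }
  return out;
}
```

In both functions the output sequence has the same length as the input sequence, which is what the tiled sequence format with value_dimension set to 1 expects.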

C++ interface

#pragma once
#ifndef FEATURE_GENERATOR_PLUGIN_BASE_H
#define FEATURE_GENERATOR_PLUGIN_BASE_H

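#include <absl/container/flat_hash_map.h>
#include <absl/strings/string_view.h>
#include <absl/types/optional.h>
#include <absl/types/variant.h>  // absl::variant, absl::visit

#include <cstddef>  // size_t
#include <stdexcept>
#include <string>  // std::string, std::to_string
#include <utility>
#include <vector>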

#include "fsmap.h"
#include "integral_types.h"

namespace fg {

using absl::optional;
using std::string;
using std::vector;

template <typename T>
using List = std::vector<T>;
template <typename K, typename V>
using Map = absl::flat_hash_map<K, V>;
template <typename K, typename V>
using MapArray = std::vector<std::pair<K, V>>;
using Matrix = std::vector<std::vector<float>>;
using MatrixL = std::vector<std::vector<int64>>;
using MatrixS = std::vector<std::vector<string>>;
template <typename K, typename V>
using FSMap = featurestore::type::fs_map<K, V>;

using FieldPtr = absl::variant<
    const optional<string>*, const optional<int32>*, const optional<int64>*,
    const optional<float>*, const optional<double>*,
    const optional<absl::string_view>*,

    const List<string>*, const List<int32>*, const List<int64>*,
    const List<float>*, const List<double>*, const List<absl::string_view>*,

    const Map<string, string>*, const Map<string, int32>*,
    const Map<string, int64>*, const Map<string, float>*,
    const Map<string, double>*, const Map<string, absl::string_view>*,

    const Map<absl::string_view, absl::string_view>*,
    const Map<absl::string_view, int32>*, const Map<absl::string_view, int64>*,
    const Map<absl::string_view, float>*, const Map<absl::string_view, double>*,
    const Map<absl::string_view, string>*,

    const Map<int32, string>*, const Map<int32, int32>*,
    const Map<int32, int64>*, const Map<int32, float>*,
    const Map<int32, double>*, const Map<int32, absl::string_view>*,

    const Map<int64, string>*, const Map<int64, float>*,
    const Map<int64, double>*, const Map<int64, int32>*,
    const Map<int64, int64>*, const Map<int64, absl::string_view>*,

    const FSMap<absl::string_view, absl::string_view>*,
    const FSMap<absl::string_view, int32>*,
    const FSMap<absl::string_view, int64>*,
    const FSMap<absl::string_view, float>*,
    const FSMap<absl::string_view, double>*,

    const FSMap<int32, int32>*, const FSMap<int32, int64>*,
    const FSMap<int32, float>*, const FSMap<int32, double>*,
    const FSMap<int32, absl::string_view>*,

    const FSMap<int64, float>*, const FSMap<int64, double>*,
    const FSMap<int64, int32>*, const FSMap<int64, int64>*,
    const FSMap<int64, absl::string_view>*,

    const MapArray<string, string>*, const MapArray<string, int32>*,
    const MapArray<string, int64>*, const MapArray<string, float>*,
    const MapArray<string, double>*,

    const MapArray<int32, string>*, const MapArray<int32, float>*,
    const MapArray<int32, double>*, const MapArray<int32, int32>*,
    const MapArray<int32, int64>*,

    const MapArray<int64, string>*, const MapArray<int64, float>*,
    const MapArray<int64, double>*, const MapArray<int64, int32>*,
    const MapArray<int64, int64>*, const Matrix*, const MatrixL*,
    const MatrixS*>;

// represents a COLUMN of the feature table
using VariantVector = absl::variant<
    vector<optional<string>>, vector<optional<int32>>, vector<optional<int64>>,
    vector<optional<float>>, vector<optional<double>>,
    vector<optional<absl::string_view>>,

    vector<List<string>>, vector<List<int32>>, vector<List<int64>>,
    vector<List<float>>, vector<List<double>>, vector<List<absl::string_view>>,

    vector<Map<string, string>>, vector<Map<string, int32>>,
    vector<Map<string, int64>>, vector<Map<string, float>>,
    vector<Map<string, double>>, vector<Map<string, absl::string_view>>,

    vector<Map<absl::string_view, absl::string_view>>,
    vector<Map<absl::string_view, int32>>,
    vector<Map<absl::string_view, int64>>,
    vector<Map<absl::string_view, float>>,
    vector<Map<absl::string_view, double>>,

    vector<Map<int32, string>>, vector<Map<int32, int32>>,
    vector<Map<int32, int64>>, vector<Map<int32, float>>,
    vector<Map<int32, double>>, vector<Map<int32, absl::string_view>>,

    vector<Map<int64, string>>, vector<Map<int64, float>>,
    vector<Map<int64, double>>, vector<Map<int64, int32>>,
    vector<Map<int64, int64>>, vector<Map<int64, absl::string_view>>,

    vector<FSMap<absl::string_view, absl::string_view>>,
    vector<FSMap<absl::string_view, int32>>,
    vector<FSMap<absl::string_view, int64>>,
    vector<FSMap<absl::string_view, float>>,
    vector<FSMap<absl::string_view, double>>,

    vector<FSMap<int32, int32>>, vector<FSMap<int32, int64>>,
    vector<FSMap<int32, float>>, vector<FSMap<int32, double>>,
    vector<FSMap<int32, absl::string_view>>,

    vector<FSMap<int64, float>>, vector<FSMap<int64, double>>,
    vector<FSMap<int64, int32>>, vector<FSMap<int64, int64>>,
    vector<FSMap<int64, absl::string_view>>,

    vector<MapArray<string, string>>, vector<MapArray<string, int32>>,
    vector<MapArray<string, int64>>, vector<MapArray<string, float>>,
    vector<MapArray<string, double>>,

    vector<MapArray<int32, string>>, vector<MapArray<int32, float>>,
    vector<MapArray<int32, double>>, vector<MapArray<int32, int32>>,
    vector<MapArray<int32, int64>>,

    vector<MapArray<int64, string>>, vector<MapArray<int64, float>>,
    vector<MapArray<int64, double>>, vector<MapArray<int64, int32>>,
    vector<MapArray<int64, int64>>, vector<Matrix>, vector<MatrixL>,
    vector<MatrixS>>;

/**
 * @brief The public base class for custom feature operators.
 *
 * The framework checks if a subclass overrides the `BatchProcess` method. If it is overridden,
 * the framework calls this method to perform the feature transformation.
 * Otherwise, the framework selects one of the `ProcessWith*` methods to execute based on the `value_type` configuration.
 * You must implement the method that corresponds to the required output type.
 */
class IFeatureOP {
 public:
  class NotOverriddenException : public std::exception {
   public:
    explicit NotOverriddenException(std::string msg) : msg_(std::move(msg)) {}
    const char* what() const noexcept override {
      if (msg_.empty()) {
        return "unimplemented method called";
      }
      // Cache the message to a member variable to ensure that the returned pointer remains valid.
      cached_ = "unimplemented method called: " + msg_;
      return cached_.c_str();
    }

   private:
    std::string msg_;
    mutable std::string cached_;
  };

  virtual ~IFeatureOP() = default;

  /**
   * @brief The initialization method.
   * @param feature_config The feature configuration as a JSON string.
   * @return 0 if initialization succeeds. Any other value indicates that initialization failed.
   */
  virtual int Initialize(const string& feature_config) = 0;

  /**
   * @brief Performs feature transformation and outputs the results as the string type.
   * @param inputs A record that can contain multiple fields.
   * @param outputs The outputs of the feature transformation.
   * @return A status code. A value of 0 indicates successful execution.
   */
  virtual int ProcessWithStrOutputs(const vector<FieldPtr>& inputs,
                                    vector<string>& outputs) {
    throw NotOverriddenException("ProcessWithStrOutputs(FieldPtr)");
  }

  /**
   * @brief Performs feature transformation and outputs the results as the int32 type.
   * @param inputs A record that can contain multiple fields.
   * @param outputs The outputs of the feature transformation.
   * @return A status code. A value of 0 indicates successful execution.
   */
  virtual int ProcessWithInt32Outputs(const vector<FieldPtr>& inputs,
                                      vector<int32>& outputs) {
    throw NotOverriddenException("ProcessWithInt32Outputs(FieldPtr)");
  }

  /**
   * @brief Performs feature transformation and outputs the results as the int64 type.
   * @param inputs A record that can contain multiple fields.
   * @param outputs The outputs of the feature transformation.
   * @return A status code. A value of 0 indicates successful execution.
   */
  virtual int ProcessWithInt64Outputs(const vector<FieldPtr>& inputs,
                                      vector<int64>& outputs) {
    throw NotOverriddenException("ProcessWithInt64Outputs(FieldPtr)");
  }

  /**
   * @brief Performs feature transformation and outputs the results as the float type.
   * @param inputs A record that can contain multiple fields.
   * @param outputs The outputs of the feature transformation.
   * @return A status code. A value of 0 indicates successful execution.
   */
  virtual int ProcessWithFloatOutputs(const vector<FieldPtr>& inputs,
                                      vector<float>& outputs) {
    throw NotOverriddenException("ProcessWithFloatOutputs(FieldPtr)");
  }

  /**
   * @brief Performs feature transformation and outputs the results as the double type.
   * @param inputs A record that can contain multiple fields.
   * @param outputs The outputs of the feature transformation.
   * @return A status code. A value of 0 indicates successful execution.
   */
  virtual int ProcessWithDoubleOutputs(const vector<FieldPtr>& inputs,
                                       vector<double>& outputs) {
    throw NotOverriddenException("ProcessWithDoubleOutputs(FieldPtr)");
  }

  /**
   * @brief An optional batch interface for processing multiple records.
   *
   * @param inputs A vector of input columns. `VariantVector` represents a feature column.
   * @param outputs
   * The transformed features. This method supports complex output types that can be used as inputs for other feature transformations.
   * @return A status code. A value of 0 indicates successful execution.
   */
  virtual int BatchProcess(const vector<VariantVector>& inputs,
                           VariantVector& outputs) {
    throw NotOverriddenException("BatchProcess");
  }

  /**
   * @brief Explicitly declares whether the subclass implements the BatchProcess method.
   *
   * The framework preferentially calls this method to check if `BatchProcess` has been overridden.
   * If your subclass implements `BatchProcess`, you must override this method to return true.
   * By default, it returns false to indicate that `BatchProcess` is not implemented.
   *
   * Note: This method is used to avoid exception propagation issues across dynamic library boundaries.
   * When a custom operator (a .so file) and the main program use different C++ ABIs or compilation options,
   * attempting to detect the implementation by calling `BatchProcess` and catching an exception may fail.
   *
   * @return true if the subclass implements BatchProcess.
   * @return false if the subclass does not implement BatchProcess (default).
   */
  virtual bool HasBatchProcessImpl() const { return false; }
};

using CreateOperatorFunc = IFeatureOP* (*)();

inline FieldPtr GetFieldPtr(const VariantVector& input, size_t i) {
  return absl::visit(
      [&](const auto& vec) -> FieldPtr {
        if (i >= vec.size()) {
          throw std::out_of_range("GetFieldPtr: index " + std::to_string(i) +
                                  " out of range [0, " +
                                  std::to_string(vec.size()) + ")");
        }
        return &vec.at(i);
      },
      input);
}
}  // namespace fg

#if defined(__GNUC__)
#define PLUGIN_API_HIDDEN \
  __attribute__((visibility("hidden"))) __attribute__((used))
#define PLUGIN_API_EXPORT \
  __attribute__((visibility("default"))) __attribute__((used))
#else
#define PLUGIN_API_HIDDEN
#define PLUGIN_API_EXPORT
#endif

std::vector<std::string>& getLocalNames();
std::vector<std::pair<std::string, void*>>& getLocalRegs();

#define REGISTER_PLUGIN(OpName, OpClass)                            \
  extern "C" PLUGIN_API_EXPORT fg::IFeatureOP* create##OpClass() {  \
    return new fg::OpClass();                                       \
  }                                                                 \
  namespace {                                                       \
  struct _Reg_##OpClass {                                           \
    _Reg_##OpClass() {                                              \
      getLocalNames().push_back(OpName);                            \
      getLocalRegs().emplace_back(OpName, (void*)&create##OpClass); \
    }                                                               \
  };                                                                \
  static _Reg_##OpClass _dummy_##OpClass __attribute__((used));     \
  }

#endif  // FEATURE_GENERATOR_PLUGIN_BASE_H
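
Because FieldPtr is a variant over many pointer types, an operator typically visits it and handles each input type that it supports. The sketch below illustrates the pattern with the C++17 standard library only: DemoFieldPtr is a simplified stand-in with three alternatives (it is not fg::FieldPtr), and FieldToString is a hypothetical helper. Real operator code would use absl::visit on fg::FieldPtr in the same way.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <type_traits>
#include <variant>
#include <vector>

// Simplified stand-in for fg::FieldPtr (assumption: only three alternatives).
using DemoFieldPtr = std::variant<const std::optional<std::string>*,
                                  const std::optional<std::int64_t>*,
                                  const std::vector<std::string>*>;

// Converts one input field to text.
// Returns false if the field is a scalar with no value.
bool FieldToString(const DemoFieldPtr& field, std::string& out) {
  return std::visit(
      [&out](const auto* ptr) -> bool {
        assert(ptr != nullptr);
        using T = std::decay_t<decltype(*ptr)>;
        if constexpr (std::is_same_v<T, std::optional<std::string>>) {
          if (!ptr->has_value()) return false;
          out = **ptr;
          return true;
        } else if constexpr (std::is_same_v<T, std::optional<std::int64_t>>) {
          if (!ptr->has_value()) return false;
          out = std::to_string(**ptr);
          return true;
        } else {
          // List of strings: join the elements with ','.
          out.clear();
          for (std::size_t i = 0; i < ptr->size(); ++i) {
            if (i > 0) out += ',';
            out += (*ptr)[i];
          }
          return true;
        }
      },
      field);
}
```

A missing scalar is reported to the caller instead of being silently converted, so the operator can fall back to the configured default_value.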

Developer guide

  • Download the API dependency file fg-api.tar.gz. This file contains the necessary header files.

  • Inherit the IFeatureOP base class, implement the Initialize method, and implement at least one ProcessWith* method.

  • Your implementation class must include a parameterless constructor.

  • The framework passes the JSON configuration string to the Initialize method. You can then parse the required configuration items.

  • The framework calls the corresponding ProcessWith* method based on the value_type configuration item. If you do not implement the method for the corresponding type, a runtime exception is thrown.

    • The ProcessWith* method processes a single record. It can have multiple input fields and multi-dimensional outputs, such as a multi-value feature.

    • FieldPtr defines all feature field types that the framework can pass to a ProcessWith* method.

    • Your code should support as many types as possible by implementing the corresponding feature transformation operation for each possible input type. If you are certain that specific types are not required, you can throw an exception directly.

    • FSMap is a type that you need to support when you use featurestore. It can significantly improve processor performance.

  • You only need to implement the feature transformation operation before the discretization operation. If a discretization operation is configured, the framework automatically performs it.

  • Use the REGISTER_PLUGIN macro to register the new feature operator. Otherwise, the framework cannot use it.

    • REGISTER_PLUGIN("OperatorName", OperatorClass): Replace the two macro parameters as required. We recommend that you use the same name for both parameters.

    • The value for operator_name in the configuration item must be 'OperatorName'.

    • Register the operator in the implementation file, not the header file.

  • The framework scans all dynamic-link libraries in a specified directory and attempts to load the required feature operators when necessary.

    • Use the FEATURE_OPERATOR_DIR environment variable to specify the directory where the dynamic-link library files are located.

    • Each dynamic-link library can contain implementations of multiple feature operators.

  • The BatchProcess interface is used for batch processing and processes one batch of data at a time.

    • This interface is optional. If you implement it, the FG framework no longer calls the per-record ProcessWith* interfaces.

    • After implementing this interface, override the bool HasBatchProcessImpl() const function and return true to instruct the main program to use this interface.

    • Implementing this interface can improve performance. For example, if a user-side feature contains only one sample per request, you can use the broadcast mechanism to avoid repeatedly parsing the user-side feature for cross features.

    • When stub_type=true is configured and no binning operation is set, this interface can return any valid type, such as Map.

    • The type of the VariantVector returned by the BatchProcess function depends on the values of is_sequence, value_dimension, and value_type. For more information, see the description of the value_dimension configuration item.

    • For an example of a batch processing interface, you can download and review RegexReplace.

  • Third-party dependencies

    • abseil-cpp (We recommend that you use the same version as the FG framework.)

    • Third-party libraries that the custom operator depends on must be compiled by embedding the source code or using static linking. Do not depend on any dynamic-link libraries, because this can cause the operator to fail to load.
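
The broadcast mechanism mentioned for BatchProcess can be sketched as follows. The example uses plain std::vector columns instead of VariantVector, and RowAt and CrossUserItem are hypothetical names. It only illustrates the idea that a user-side column with a single row is reused for every item-side row in the batch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: a column that holds exactly one row is broadcast to
// every row of the batch. Otherwise the row index is used directly.
template <typename T>
const T& RowAt(const std::vector<T>& column, std::size_t row) {
  assert(!column.empty());
  return column.size() == 1 ? column[0] : column.at(row);
}

// Sketch of a cross feature computed for a whole batch: the batch size is
// taken from the item-side column, and the user-side value is broadcast.
std::vector<std::string> CrossUserItem(
    const std::vector<std::string>& user_col,
    const std::vector<std::string>& item_col) {
  std::vector<std::string> out;
  out.reserve(item_col.size());
  for (std::size_t i = 0; i < item_col.size(); ++i) {
    out.push_back(RowAt(user_col, i) + "_" + item_col[i]);
  }
  return out;
}
```

In a real operator, any expensive parsing of the user-side value would be done once before the loop when the user column has a single row, which is where the performance gain comes from.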

Sequence features

If the is_sequence configuration item is set to true, note the following items:

  • Sparse feature sequences

    • If the operator generates a sparse feature sequence, such as a sequence of previously visited item_ids, and each element of the sequence is a single value, you can output any type.

    • If the operator generates a sparse feature sequence and each element of the sequence can have multiple values, you can only output the string type. You must set value_type to string and use the separator chr(29) to separate multiple values.

  • Dense feature sequences

    • When an operator generates a dense feature sequence, such as the embedding vectors of historically accessed items, you must set value_dimension to the dimension of each element in the sequence.

    • If the elements of the sequence are scalars, set value_dimension to 1.

    • If the elements of the sequence are vectors, set value_dimension to the length of the vector.

    • The number of feature values output by the operator must be an integer multiple of value_dimension.
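
The two output conventions above can be sketched with a pair of helpers. The names JoinMultiValue and FlattenDenseSequence are hypothetical and are not part of the FG API. The first joins the values of one sparse sequence element with chr(29), and the second flattens a dense sequence so that the number of output values is a multiple of value_dimension.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// chr(29), the separator between the values of one sequence element.
constexpr char kMultiValueSep = '\x1d';

// Sparse sequence with multi-value elements: each element is emitted as one
// string whose values are separated by chr(29).
std::string JoinMultiValue(const std::vector<std::string>& values) {
  std::string out;
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (i > 0) out += kMultiValueSep;
    out += values[i];
  }
  return out;
}

// Dense sequence: every element must have exactly value_dimension values, so
// the flattened output length is sequence_length * value_dimension.
std::vector<float> FlattenDenseSequence(
    const std::vector<std::vector<float>>& sequence,
    std::size_t value_dimension) {
  assert(value_dimension > 0);
  std::vector<float> flat;
  flat.reserve(sequence.size() * value_dimension);
  for (const auto& element : sequence) {
    assert(element.size() == value_dimension);
    flat.insert(flat.end(), element.begin(), element.end());
  }
  return flat;
}
```

An operator that cannot fill an element completely would pad it with the placeholder value before flattening, so that the length check still holds.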

Custom operator list

Operator name | Operator function | Source code download link | Binary package download link
EditDistance | Edit distance | Download link | Click to download
SeqExpr | Sequence expression | Download link | Click to download
BPETokenize | BPE tokenization | Download link | Included in the built-in tokenize_feature.

Configuration items

  • EditDistance

    • encoding: The encoding of the input text. Options: utf-8, latin. The default value is latin.
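
The encoding option matters because a non-ASCII character occupies several bytes in UTF-8. The sketch below is independent of the operator's source code and uses hypothetical function names. It compares a byte-level edit distance, which corresponds to the latin setting, with a code-point-level distance, which corresponds to the utf-8 setting.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Levenshtein distance over any random-access sequence (two-row variant).
template <class Seq>
int Levenshtein(const Seq& a, const Seq& b) {
  std::vector<int> prev(b.size() + 1), curr(b.size() + 1);
  std::iota(prev.begin(), prev.end(), 0);  // distances from the empty prefix
  for (std::size_t i = 1; i <= a.size(); ++i) {
    curr[0] = static_cast<int>(i);
    for (std::size_t j = 1; j <= b.size(); ++j) {
      const int substitute = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
      curr[j] = std::min({substitute, prev[j] + 1, curr[j - 1] + 1});
    }
    prev.swap(curr);
  }
  return prev[b.size()];
}

// Minimal UTF-8 decoder: skips invalid lead bytes and stops at a truncated
// trailing sequence. It does not validate continuation bytes.
std::u32string DecodeUtf8(const std::string& s) {
  std::u32string out;
  std::size_t i = 0;
  while (i < s.size()) {
    const unsigned char c = static_cast<unsigned char>(s[i]);
    std::size_t len = 0;
    char32_t cp = 0;
    if (c < 0x80) { len = 1; cp = c; }
    else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
    else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
    else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
    else { ++i; continue; }
    if (i + len > s.size()) break;
    for (std::size_t j = 1; j < len; ++j) {
      cp = (cp << 6) | (static_cast<unsigned char>(s[i + j]) & 0x3F);
    }
    out.push_back(cp);
    i += len;
  }
  assert(out.size() <= s.size());  // code points never outnumber bytes
  return out;
}
```

For the inputs "café" (encoded in UTF-8) and "cafe", the byte-level distance is 2 because "é" is two bytes, while the code-point-level distance is 1.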

Developer examples

The following example shows how to calculate the edit distance between two input texts. The header file is edit_distance.h.

#pragma once
#include <memory>  // std::unique_ptr
#include <string>
#include <vector>

#include "api/base_op.h"

namespace fg {
namespace functor {
  class EditDistanceFunctor;
}

using std::string;
using std::vector;


/**
 * @brief Edit distance: Inputs two strings and outputs their text edit distance.
 */
class EditDistance : public IFeatureOP {
 public:
  int Initialize(const string& feature_config) override;

  /// @return A status code. 0 indicates successful execution.
  int ProcessWithStrOutputs(const vector<FieldPtr>& inputs,
                            vector<string>& outputs) override;

  /// @return A status code. 0 indicates successful execution.
  int ProcessWithInt32Outputs(const vector<FieldPtr>& inputs,
                              vector<int32>& outputs) override;

  /// @return A status code. 0 indicates successful execution.
  int ProcessWithInt64Outputs(const vector<FieldPtr>& inputs,
                              vector<int64>& outputs) override;

  /// @return A status code. 0 indicates successful execution.
  int ProcessWithFloatOutputs(const vector<FieldPtr>& inputs,
                              vector<float>& outputs) override;

  /// @return A status code. 0 indicates successful execution.
  int ProcessWithDoubleOutputs(const vector<FieldPtr>& inputs,
                               vector<double>& outputs) override;
 private:
  string feature_name_;
  std::unique_ptr<functor::EditDistanceFunctor> functor_p_;
};

}  // end of namespace fg

The implementation file is edit_distance.cc.

#include "edit_distance.h"

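#include <absl/strings/ascii.h>
#include <absl/strings/str_join.h>

#include <algorithm>  // std::min
#include <cassert>    // assert
#include <memory>     // std::make_unique
#include <nlohmann/json.hpp>
#include <numeric>  // Includes std::iota
#include <stdexcept>
#include <string>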

#include "api/log.h"

namespace fg {
using absl::optional;

namespace functor {
template <class T>
int edit_distance(const T& s1, const T& s2) {
  int l1 = s1.size();
  int l2 = s2.size();
  if (l1 == 0 || l2 == 0) {  // one side is empty: the distance is the other length
    return l1 + l2;
  }
  vector<int> prev(l2 + 1);
  vector<int> curr(l2 + 1);
  std::iota(prev.begin(), prev.end(), 0);  // row 0: distances from the empty prefix of s1
  for (int i = 1; i <= l1; ++i) {  // start at 1 so that s1[i - 1] is a valid index
    curr[0] = i;
    for (int j = 1; j <= l2; ++j) {
      int d = prev[j - 1];
      if (s1[i - 1] == s2[j - 1]) {
        curr[j] = d;
      } else {
        int d2 = std::min(prev[j], curr[j - 1]);
        curr[j] = 1 + std::min(d, d2);
      }
    }
    prev.swap(curr);
  }
  return prev[l2];
}

enum class Encoding : unsigned int { Latin = 0, UTF8 = 1 };

class EditDistanceFunctor {
 public:
  EditDistanceFunctor(const string& encoding) {
    string enc = absl::AsciiStrToLower(encoding);
    if (enc == "utf-8" || enc == "utf8") {
      encoding_ = Encoding::UTF8;
    } else {
      encoding_ = Encoding::Latin;
    }
  }

  int operator()(absl::string_view s1, absl::string_view s2) {
    if (encoding_ == Encoding::Latin) {
      return edit_distance(s1, s2);
    }
    if (encoding_ == Encoding::UTF8) {
      return edit_distance(from_bytes(s1), from_bytes(s2));
    }
    LOG(ERROR) << "EditDistanceFunctor found unsupported text encoding";
    assert(false);
    return 0;
  }

  Encoding TextEncoding() const { return encoding_; }

 private:
  Encoding encoding_;

  std::wstring from_bytes(absl::string_view str) {
    std::wstring result;
    int i = 0;
    int len = (int)str.length();
    while (i < len) {
      int char_size = 0;
      int unicode = 0;

      if ((str[i] & 0x80) == 0) {
        unicode = str[i];
        char_size = 1;
      } else if ((str[i] & 0xE0) == 0xC0) {
        unicode = str[i] & 0x1F;
        char_size = 2;
      } else if ((str[i] & 0xF0) == 0xE0) {
        unicode = str[i] & 0x0F;
        char_size = 3;
      } else if ((str[i] & 0xF8) == 0xF0) {
        unicode = str[i] & 0x07;
        char_size = 4;
      } else {
        // Invalid UTF-8 sequence
        ++i;
        continue;
      }

      if (i + char_size > len) {
        // Truncated multi-byte sequence at the end of the input
        break;
      }

      for (int j = 1; j < char_size; ++j) {
        unicode = (unicode << 6) | (str[i + j] & 0x3F);
      }

      if (unicode <= 0xFFFF) {
        result += static_cast<wchar_t>(unicode);
      } else {
        // Handle surrogate pairs for characters outside the BMP
        unicode -= 0x10000;
        result += static_cast<wchar_t>((unicode >> 10) + 0xD800);
        result += static_cast<wchar_t>((unicode & 0x3FF) + 0xDC00);
      }
      i += char_size;
    }
    return result;
  }
};
}  // namespace functor

// Helper that merges several lambdas into a single overload set for absl::visit.
template <class... Ts>
struct overloaded : Ts... {
  using Ts::operator()...;
};
// Class template argument deduction guide (C++17).
template <class... Ts>
overloaded(Ts...) -> overloaded<Ts...>;

int EditDistance::Initialize(const string& feature_config) {
  nlohmann::json cfg;
  try {
    cfg = nlohmann::json::parse(feature_config);
  } catch (nlohmann::json::parse_error& ex) {
    LOG(ERROR) << "parse error at byte " << ex.byte;
    LOG(ERROR) << "config: " << feature_config;
    throw std::runtime_error("parse EditDistance config failed");
  }

  feature_name_ = cfg.at("feature_name");
  string encoding = cfg.value("encoding", "latin");
  functor_p_ = std::make_unique<functor::EditDistanceFunctor>(encoding);
  functor::Encoding enc = functor_p_->TextEncoding();
  encoding = (enc == functor::Encoding::UTF8) ? "UTF-8" : "Latin";
  LOG(INFO) << "feature <" << feature_name_ << "> with text encoding: " << encoding;
  return 0;
}

int EditDistance::ProcessWithInt32Outputs(const vector<FieldPtr>& inputs,
                                          vector<int32>& outputs) {
  outputs.clear();
  if (inputs.size() < 2) {
    outputs.push_back(0);
    return -1;  // invalid inputs
  }

  int d = absl::visit(
      overloaded{
          [this](const optional<string>* s1, const optional<string>* s2) {
            absl::string_view empty_view;
            return functor_p_->operator()(*s1 ? **s1 : empty_view, *s2 ? **s2 : empty_view);
          },
          [this](const optional<absl::string_view>* s1,
                 const optional<absl::string_view>* s2) {
            absl::string_view empty_view;
            return functor_p_->operator()(*s1 ? **s1 : empty_view, *s2 ? **s2 : empty_view);
          },
          [this](const optional<absl::string_view>* s1,
                 const optional<string>* s2) {
            absl::string_view empty_view;
            return functor_p_->operator()(*s1 ? **s1 : empty_view, *s2 ? **s2 : empty_view);
          },
          [this](const optional<string>* s1,
                 const optional<absl::string_view>* s2) {
            absl::string_view empty_view;
            return functor_p_->operator()(*s1 ? **s1 : empty_view, *s2 ? **s2 : empty_view);
          },
          [this](const List<string>* s1, const List<string>* s2) {
            string str1 = absl::StrJoin(*s1, "");
            string str2 = absl::StrJoin(*s2, "");
            return functor_p_->operator()(str1, str2);
          },
          [this](const List<absl::string_view>* s1,
                 const List<absl::string_view>* s2) {
            string str1 = absl::StrJoin(*s1, "");
            string str2 = absl::StrJoin(*s2, "");
            return functor_p_->operator()(str1, str2);
          },
          [this](const auto* x, const auto* y) {
            ERROR_EXIT(feature_name_,
                       "unsupported input type: ", typeid(*x).name(), " vs ",
                       typeid(*y).name());
            return 0;
          }},
      inputs.at(0), inputs.at(1));
  outputs.push_back(d);
  return 0;
}

int EditDistance::ProcessWithInt64Outputs(const vector<FieldPtr>& inputs,
                                          vector<int64>& outputs) {
  vector<int32> distances;
  int status = ProcessWithInt32Outputs(inputs, distances);
  if (0 != status) {
    return status;
  }
  outputs.clear();
  outputs.insert(outputs.end(), distances.begin(), distances.end());
  return 0;
}

int EditDistance::ProcessWithFloatOutputs(const vector<FieldPtr>& inputs,
                                          vector<float>& outputs) {
  vector<int32> distances;
  int status = ProcessWithInt32Outputs(inputs, distances);
  if (0 != status) {
    return status;
  }
  outputs.clear();
  outputs.insert(outputs.end(), distances.begin(), distances.end());
  return 0;
}

int EditDistance::ProcessWithDoubleOutputs(const vector<FieldPtr>& inputs,
                                           vector<double>& outputs) {
  vector<int32> distances;
  int status = ProcessWithInt32Outputs(inputs, distances);
  if (0 != status) {
    return status;
  }
  outputs.clear();
  outputs.insert(outputs.end(), distances.begin(), distances.end());
  return 0;
}

int EditDistance::ProcessWithStrOutputs(const vector<FieldPtr>& inputs,
                                        vector<string>& outputs) {
  vector<int32> distances;
  int status = ProcessWithInt32Outputs(inputs, distances);
  if (0 != status) {
    return status;
  }
  outputs.clear();
  outputs.reserve(distances.size());
  std::transform(distances.begin(), distances.end(),
                 std::back_inserter(outputs),
                 [](int32 x) { return std::to_string(x); });
  return 0;
}

}  // end of namespace fg

REGISTER_PLUGIN("EditDistance", EditDistance);
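
The Process* methods above select a handler based on the runtime types of the two inputs by passing an overloaded set of lambdas to a visitor. The following is a minimal sketch of that dispatch pattern in isolation. It assumes std::variant and std::visit in place of the framework's FieldPtr and absl::visit, and the names Field, Overloaded, and CombinedLength are hypothetical and not part of the FG API.

```cpp
#include <string>
#include <variant>

// Hypothetical stand-in for the framework's FieldPtr (assumption).
using Field = std::variant<std::string, int>;

// Merges several lambdas into one callable with an overloaded operator().
template <class... Ts>
struct Overloaded : Ts... {
  using Ts::operator()...;
};
// Class template argument deduction guide (C++17).
template <class... Ts>
Overloaded(Ts...) -> Overloaded<Ts...>;

// Returns the combined length of two string fields,
// or -1 when either field holds an unsupported type.
int CombinedLength(const Field& a, const Field& b) {
  return std::visit(
      Overloaded{
          // Exact-type handler: chosen when both fields hold std::string.
          [](const std::string& x, const std::string& y) {
            return static_cast<int>(x.size() + y.size());
          },
          // Generic fallback: chosen for every other type combination.
          [](const auto&, const auto&) { return -1; }},
      a, b);
}
```

Each typed handler must name the stored type exactly. A handler that takes a merely convertible type would lose overload resolution to the generic fallback, which is why the example operator lists every supported combination of string and string_view inputs separately.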

Download the source code from the table above and run the build.sh script to compile the FG operator into a dynamic-link library (.so file).
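
Before you build the plugin, it can help to check the core algorithm on its own. The following is a standalone sketch of the two-row dynamic-programming recurrence that edit_distance is based on. It is an illustration and not the plugin source, and the name EditDistanceSketch is hypothetical.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Two-row Levenshtein distance over any indexable sequence type.
template <class T>
int EditDistanceSketch(const T& s1, const T& s2) {
  const int l1 = static_cast<int>(s1.size());
  const int l2 = static_cast<int>(s2.size());
  if (l1 == 0 || l2 == 0) {
    return l1 + l2;  // only insertions or only deletions are needed
  }
  std::vector<int> prev(l2 + 1);
  std::vector<int> curr(l2 + 1);
  // Row 0: the distance from the empty prefix of s1 to s2[0..j) is j.
  std::iota(prev.begin(), prev.end(), 0);
  for (int i = 1; i <= l1; ++i) {
    curr[0] = i;  // distance from s1[0..i) to the empty prefix of s2
    for (int j = 1; j <= l2; ++j) {
      const int substitute = prev[j - 1] + (s1[i - 1] == s2[j - 1] ? 0 : 1);
      const int remove = prev[j] + 1;
      const int insert = curr[j - 1] + 1;
      curr[j] = std::min({substitute, remove, insert});
    }
    prev.swap(curr);
  }
  return prev[l2];
}
```

Because the function is a template, the same code works on byte strings and on wide strings, which mirrors how the operator handles the Latin and UTF-8 encodings.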

Compile a custom operator

You must use the same compilation environment as the FG framework, including the language standard (C++17) and compilation options. We recommend that you use the official compiler image. The image details are available in the build.sh script.

  • Compiler environment image (CentOS 7): mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easyrec/feature_generator:centos7-0.1.1

  • Compiler environment image (rockylinux:8, compatible with CentOS 8): mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easyrec/feature_generator:0.1.1

  • By default, the C++11 ABI is not used. To use the new ABI, compile with the macro _GLIBCXX_USE_CXX11_ABI=1 defined. In this case, you can use only the second image (tag: 0.1.1), which is based on rockylinux:8.

  • Do not dynamically link third-party libraries into the custom operator. Link them statically, or copy their source code into the project and compile it together with the operator.
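
An ABI mismatch between the operator and the framework typically surfaces only as link or load errors, so it can be useful to confirm which libstdc++ ABI a translation unit is compiled against. The following is a small sketch that reads the _GLIBCXX_USE_CXX11_ABI macro at compile time. The function name LibstdcxxAbi is hypothetical and is not part of the FG framework.

```cpp
#include <string>  // any libstdc++ header defines _GLIBCXX_USE_CXX11_ABI

// Reports which libstdc++ ABI this translation unit is compiled against.
const char* LibstdcxxAbi() {
#if defined(_GLIBCXX_USE_CXX11_ABI) && _GLIBCXX_USE_CXX11_ABI
  return "cxx11";  // new ABI (_GLIBCXX_USE_CXX11_ABI=1)
#elif defined(_GLIBCXX_USE_CXX11_ABI)
  return "pre-cxx11";  // old ABI (_GLIBCXX_USE_CXX11_ABI=0)
#else
  return "not-libstdc++";  // a different standard library is in use
#endif
}
```

Logging this value once from the operator's Initialize method is one way to confirm that the build flags took effect.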

For more information, see the CMakeLists.txt file in the developer example.