Custom feature operators - Artificial Intelligence Recommendation

The Feature Generation (FG) framework supports custom feature operators as shared library plugins. Use them when built-in operators don't cover your domain-specific transformation logic—for example, edit distance between two text fields, spherical distance from GPS coordinates, or any numeric formula applied element-wise to a sequence.

How it works

Implement the IFeatureOP C++ interface and register the operator with REGISTER_PLUGIN.
Compile the implementation into a shared library (.so).
Reference the operator in fg.json by setting feature_type to custom_feature and operator_name to the registered name.
Deploy the .so file—online services auto-discover it from the custom_fg_lib directory; offline tasks require uploading it as a MaxCompute resource.

Prerequisites

Before you begin, ensure that you have:

A C++17-compatible build environment using the official compiler image (see Compile a custom operator)
The API dependency package: fg-api.tar.gz — contains all required header files
Familiarity with the FG framework and how fg.json configuration works

Implement an operator

Minimal example

Every custom operator follows the same pattern: inherit IFeatureOP, implement Initialize and at least one ProcessWith* method, and register with REGISTER_PLUGIN.

Here is a minimal operator that outputs a constant integer:

#include "api/base_op.h"

namespace fg {

class ConstantOp : public IFeatureOP {
 public:
  int Initialize(const string& feature_config) override {
    // Parse JSON config if needed; return 0 on success.
    return 0;
  }

  int ProcessWithInt32Outputs(const vector<FieldPtr>& inputs,
                              vector<int32>& outputs) override {
    outputs.push_back(42);
    return 0;  // 0 = success
  }
};

}  // namespace fg

REGISTER_PLUGIN("ConstantOp", ConstantOp);

Key requirements:

Include a parameterless constructor (the default constructor works here).
The Initialize method receives the entire fg.json entry as a JSON string. Parse configuration items you need from it.
Return 0 for success from all methods; non-zero indicates failure.
Call REGISTER_PLUGIN in the implementation file, not the header.

Choose between ProcessWith* and BatchProcess

Two processing interfaces are available. Choose based on your throughput and complexity requirements:

	`ProcessWith*`	`BatchProcess`
Granularity	One record at a time	One batch of records
Output types	Basic scalar types only (`string`, `int32`, `int64`, `float`, `double`)	Any valid `VariantVector` type, including `Map`
Use when	Simple transformations, low implementation cost	High throughput required; user-side features with one sample per request (use the broadcast mechanism to avoid redundant parsing)
Required	At least one method matching `value_type`	Override `HasBatchProcessImpl()` to return `true`

When BatchProcess is implemented and HasBatchProcessImpl() returns true, the framework skips all ProcessWith* methods.
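The dispatch rule can be sketched roughly as follows. This is not the framework's code: `MiniOp`, `Column`, `StrLenOp`, and `RunFeature` are hypothetical names, and `std::variant`/`std::optional` stand in for the real `VariantVector` and `absl` types so that the sketch compiles on its own.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Simplified stand-ins for FG column types (assumption, for illustration only).
using StrColumn = std::vector<std::optional<std::string>>;
using IntColumn = std::vector<std::optional<int32_t>>;
using Column = std::variant<StrColumn, IntColumn>;

// Hypothetical miniature of the operator interface.
class MiniOp {
 public:
  virtual ~MiniOp() = default;
  virtual int ProcessWithInt32Outputs(const std::optional<std::string>& /*input*/,
                                      std::vector<int32_t>& /*outputs*/) {
    return -1;  // not implemented
  }
  virtual int BatchProcess(const std::vector<Column>& /*inputs*/,
                           Column& /*outputs*/) {
    return -1;  // not implemented
  }
  virtual bool HasBatchProcessImpl() const { return false; }
};

// Batch operator: outputs the length of each string; missing values stay missing.
class StrLenOp : public MiniOp {
 public:
  bool HasBatchProcessImpl() const override { return true; }

  int BatchProcess(const std::vector<Column>& inputs, Column& outputs) override {
    if (inputs.size() != 1) return -1;
    const auto* col = std::get_if<StrColumn>(&inputs[0]);
    if (col == nullptr) return -1;  // unsupported input column type
    IntColumn result;
    result.reserve(col->size());
    for (const auto& value : *col) {
      if (value) {
        result.emplace_back(static_cast<int32_t>(value->size()));
      } else {
        result.emplace_back(std::nullopt);
      }
    }
    outputs = std::move(result);
    return 0;
  }
};

// Mimics the rule above: use the batch path when declared, else go record by record.
int RunFeature(MiniOp& op, const StrColumn& input, Column& output) {
  if (op.HasBatchProcessImpl()) {
    return op.BatchProcess({Column{input}}, output);
  }
  IntColumn result;
  for (const auto& record : input) {
    std::vector<int32_t> values;
    if (int status = op.ProcessWithInt32Outputs(record, values); status != 0) {
      return status;
    }
    if (values.empty()) {
      result.emplace_back(std::nullopt);
    } else {
      result.emplace_back(values.front());
    }
  }
  output = std::move(result);
  return 0;
}
```

Because `StrLenOp` declares the batch path, `RunFeature` never reaches the per-record loop for it. An operator that leaves `HasBatchProcessImpl()` at its default takes the per-record path instead.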

Full C++ interface

#pragma once
#ifndef FEATURE_GENERATOR_PLUGIN_BASE_H
#define FEATURE_GENERATOR_PLUGIN_BASE_H

#include <absl/container/flat_hash_map.h>
#include <absl/strings/string_view.h>
#include <absl/types/optional.h>

#include <stdexcept>
#include <utility>
#include <vector>

#include "fsmap.h"
#include "integral_types.h"

namespace fg {

using absl::optional;
using std::string;
using std::vector;

template <typename T>
using List = std::vector<T>;
template <typename K, typename V>
using Map = absl::flat_hash_map<K, V>;
template <typename K, typename V>
using MapArray = std::vector<std::pair<K, V>>;
using Matrix = std::vector<std::vector<float>>;
using MatrixL = std::vector<std::vector<int64>>;
using MatrixS = std::vector<std::vector<string>>;
template <typename K, typename V>
using FSMap = featurestore::type::fs_map<K, V>;

using FieldPtr = absl::variant<
    const optional<string>*, const optional<int32>*, const optional<int64>*,
    const optional<float>*, const optional<double>*,
    const optional<absl::string_view>*,

    const List<string>*, const List<int32>*, const List<int64>*,
    const List<float>*, const List<double>*, const List<absl::string_view>*,

    const Map<string, string>*, const Map<string, int32>*,
    const Map<string, int64>*, const Map<string, float>*,
    const Map<string, double>*, const Map<string, absl::string_view>*,

    const Map<absl::string_view, absl::string_view>*,
    const Map<absl::string_view, int32>*, const Map<absl::string_view, int64>*,
    const Map<absl::string_view, float>*, const Map<absl::string_view, double>*,
    const Map<absl::string_view, string>*,

    const Map<int32, string>*, const Map<int32, int32>*,
    const Map<int32, int64>*, const Map<int32, float>*,
    const Map<int32, double>*, const Map<int32, absl::string_view>*,

    const Map<int64, string>*, const Map<int64, float>*,
    const Map<int64, double>*, const Map<int64, int32>*,
    const Map<int64, int64>*, const Map<int64, absl::string_view>*,

    const FSMap<absl::string_view, absl::string_view>*,
    const FSMap<absl::string_view, int32>*,
    const FSMap<absl::string_view, int64>*,
    const FSMap<absl::string_view, float>*,
    const FSMap<absl::string_view, double>*,

    const FSMap<int32, int32>*, const FSMap<int32, int64>*,
    const FSMap<int32, float>*, const FSMap<int32, double>*,
    const FSMap<int32, absl::string_view>*,

    const FSMap<int64, float>*, const FSMap<int64, double>*,
    const FSMap<int64, int32>*, const FSMap<int64, int64>*,
    const FSMap<int64, absl::string_view>*,

    const MapArray<string, string>*, const MapArray<string, int32>*,
    const MapArray<string, int64>*, const MapArray<string, float>*,
    const MapArray<string, double>*,

    const MapArray<int32, string>*, const MapArray<int32, float>*,
    const MapArray<int32, double>*, const MapArray<int32, int32>*,
    const MapArray<int32, int64>*,

    const MapArray<int64, string>*, const MapArray<int64, float>*,
    const MapArray<int64, double>*, const MapArray<int64, int32>*,
    const MapArray<int64, int64>*, const Matrix*, const MatrixL*,
    const MatrixS*>;

// Represents a COLUMN of the feature table.
using VariantVector = absl::variant<
    vector<optional<string>>, vector<optional<int32>>, vector<optional<int64>>,
    vector<optional<float>>, vector<optional<double>>,
    vector<optional<absl::string_view>>,

    vector<List<string>>, vector<List<int32>>, vector<List<int64>>,
    vector<List<float>>, vector<List<double>>, vector<List<absl::string_view>>,

    vector<Map<string, string>>, vector<Map<string, int32>>,
    vector<Map<string, int64>>, vector<Map<string, float>>,
    vector<Map<string, double>>, vector<Map<string, absl::string_view>>,

    vector<Map<absl::string_view, absl::string_view>>,
    vector<Map<absl::string_view, int32>>,
    vector<Map<absl::string_view, int64>>,
    vector<Map<absl::string_view, float>>,
    vector<Map<absl::string_view, double>>,

    vector<Map<int32, string>>, vector<Map<int32, int32>>,
    vector<Map<int32, int64>>, vector<Map<int32, float>>,
    vector<Map<int32, double>>, vector<Map<int32, absl::string_view>>,

    vector<Map<int64, string>>, vector<Map<int64, float>>,
    vector<Map<int64, double>>, vector<Map<int64, int32>>,
    vector<Map<int64, int64>>, vector<Map<int64, absl::string_view>>,

    vector<FSMap<absl::string_view, absl::string_view>>,
    vector<FSMap<absl::string_view, int32>>,
    vector<FSMap<absl::string_view, int64>>,
    vector<FSMap<absl::string_view, float>>,
    vector<FSMap<absl::string_view, double>>,

    vector<FSMap<int32, int32>>, vector<FSMap<int32, int64>>,
    vector<FSMap<int32, float>>, vector<FSMap<int32, double>>,
    vector<FSMap<int32, absl::string_view>>,

    vector<FSMap<int64, float>>, vector<FSMap<int64, double>>,
    vector<FSMap<int64, int32>>, vector<FSMap<int64, int64>>,
    vector<FSMap<int64, absl::string_view>>,

    vector<MapArray<string, string>>, vector<MapArray<string, int32>>,
    vector<MapArray<string, int64>>, vector<MapArray<string, float>>,
    vector<MapArray<string, double>>,

    vector<MapArray<int32, string>>, vector<MapArray<int32, float>>,
    vector<MapArray<int32, double>>, vector<MapArray<int32, int32>>,
    vector<MapArray<int32, int64>>,

    vector<MapArray<int64, string>>, vector<MapArray<int64, float>>,
    vector<MapArray<int64, double>>, vector<MapArray<int64, int32>>,
    vector<MapArray<int64, int64>>, vector<Matrix>, vector<MatrixL>,
    vector<MatrixS>>;

/**
 * @brief The public base class for custom feature operators.
 *
 * The framework calls `HasBatchProcessImpl()` to check whether a subclass implements `BatchProcess`.
 * If it returns true, the framework calls `BatchProcess` to perform the feature transformation.
 * Otherwise, the framework selects one of the `ProcessWith*` methods based on the `value_type` configuration.
 * Implement the method that corresponds to the required output type.
 */
class IFeatureOP {
 public:
  class NotOverriddenException : public std::exception {
   public:
    explicit NotOverriddenException(std::string msg) : msg_(std::move(msg)) {}
    const char* what() const noexcept override {
      if (msg_.empty()) {
        return "unimplemented method called";
      }
      // Cache the message to a member variable to ensure that the returned pointer remains valid.
      cached_ = "unimplemented method called: " + msg_;
      return cached_.c_str();
    }

   private:
    std::string msg_;
    mutable std::string cached_;
  };

  virtual ~IFeatureOP() = default;

  /**
   * @brief Initialization method.
   * @param feature_config The full fg.json entry for this feature, as a JSON string.
   * @return 0 on success; non-zero indicates failure.
   */
  virtual int Initialize(const string& feature_config) = 0;

  /**
   * @brief Processes one record and outputs string values.
   * @param inputs A record that can contain multiple fields.
   * @param outputs The outputs of the feature transformation.
   * @return 0 on success.
   */
  virtual int ProcessWithStrOutputs(const vector<FieldPtr>& inputs,
                                    vector<string>& outputs) {
    throw NotOverriddenException("ProcessWithStrOutputs(FieldPtr)");
  }

  /**
   * @brief Processes one record and outputs int32 values.
   */
  virtual int ProcessWithInt32Outputs(const vector<FieldPtr>& inputs,
                                      vector<int32>& outputs) {
    throw NotOverriddenException("ProcessWithInt32Outputs(FieldPtr)");
  }

  /**
   * @brief Processes one record and outputs int64 values.
   */
  virtual int ProcessWithInt64Outputs(const vector<FieldPtr>& inputs,
                                      vector<int64>& outputs) {
    throw NotOverriddenException("ProcessWithInt64Outputs(FieldPtr)");
  }

  /**
   * @brief Processes one record and outputs float values.
   */
  virtual int ProcessWithFloatOutputs(const vector<FieldPtr>& inputs,
                                      vector<float>& outputs) {
    throw NotOverriddenException("ProcessWithFloatOutputs(FieldPtr)");
  }

  /**
   * @brief Processes one record and outputs double values.
   */
  virtual int ProcessWithDoubleOutputs(const vector<FieldPtr>& inputs,
                                       vector<double>& outputs) {
    throw NotOverriddenException("ProcessWithDoubleOutputs(FieldPtr)");
  }

  /**
   * @brief Optional batch interface that processes one batch of records.
   *
   * @param inputs A vector of input columns; each `VariantVector` represents one feature column.
   * @param outputs The transformed features. Supports complex output types usable as inputs for downstream operators.
   * @return 0 on success.
   */
  virtual int BatchProcess(const vector<VariantVector>& inputs,
                           VariantVector& outputs) {
    throw NotOverriddenException("BatchProcess");
  }

  /**
   * @brief Declares whether the subclass implements BatchProcess.
   *
   * Override this to return `true` if you implement `BatchProcess`.
   * This avoids exception propagation issues across dynamic library boundaries
   * when the .so and the main program use different C++ ABIs.
   *
   * @return true if BatchProcess is implemented; false otherwise (default).
   */
  virtual bool HasBatchProcessImpl() const { return false; }
};

using CreateOperatorFunc = IFeatureOP* (*)();

inline FieldPtr GetFieldPtr(const VariantVector& input, size_t i) {
  return absl::visit(
      [&](const auto& vec) -> FieldPtr {
        if (i >= vec.size()) {
          throw std::out_of_range("GetFieldPtr: index " + std::to_string(i) +
                                  " out of range [0, " +
                                  std::to_string(vec.size()) + ")");
        }
        return &vec.at(i);
      },
      input);
}
}  // namespace fg

#if defined(__GNUC__)
#define PLUGIN_API_HIDDEN \
  __attribute__((visibility("hidden"))) __attribute__((used))
#define PLUGIN_API_EXPORT \
  __attribute__((visibility("default"))) __attribute__((used))
#else
#define PLUGIN_API_HIDDEN
#define PLUGIN_API_EXPORT
#endif

std::vector<std::string>& getLocalNames();
std::vector<std::pair<std::string, void*>>& getLocalRegs();

#define REGISTER_PLUGIN(OpName, OpClass)                            \
  extern "C" PLUGIN_API_EXPORT fg::IFeatureOP* create##OpClass() {  \
    return new fg::OpClass();                                       \
  }                                                                 \
  namespace {                                                       \
  struct _Reg_##OpClass {                                           \
    _Reg_##OpClass() {                                              \
      getLocalNames().push_back(OpName);                            \
      getLocalRegs().emplace_back(OpName, (void*)&create##OpClass); \
    }                                                               \
  };                                                                \
  static _Reg_##OpClass _dummy_##OpClass __attribute__((used));     \
  }

#endif  // FEATURE_GENERATOR_PLUGIN_BASE_H

Implementation notes

Input types: In each ProcessWith* method, handle every input type you intend to support. For unsupported types, throw an exception directly. FieldPtr (defined in the interface above) lists all feature field types that the framework can pass to an operator. FSMap types are required when integrating with Feature Store, where they significantly improve online processor performance.
Discretization: Implement only the transformation logic before discretization. If hash_bucket_size, vocab_list, boundaries, or another discretization operation is configured, the framework handles it automatically.
Thread safety: By default (is_op_thread_safe=true), the operator must be stateless or use only thread_local variables. Set is_op_thread_safe=false to have the framework create one object replica per thread—this is simpler to implement but uses more memory.
`string_view` inputs: Online services (EasyRecProcessor, TorchEasyRec Processor) pass item-side string features as absl::string_view for efficiency. If your operator can't handle string_view, set disable_string_view=true in the configuration to have the framework convert them to string before calling your operator. This degrades performance.
Custom config items: Use any key names in the JSON entry—the framework passes the full JSON string to Initialize. Don't reuse framework-reserved key names (such as feature_type, operator_name, value_type). Keys that reference external files must end with _file so the framework can sync them for offline tasks.
Operator directory: Use the FEATURE_OPERATOR_DIR environment variable to specify the directory where dynamic-link library files are located. Each dynamic-link library can contain implementations of multiple feature operators.
`BatchProcess` return type: The type of the VariantVector returned by BatchProcess depends on the values of is_sequence, value_dimension, and value_type. For more information, see the output table schema in Configuration reference. When stub_type=true is configured and no binning operation is set, BatchProcess can return any valid type, such as Map.
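As a minimal sketch of the thread-safety note above, the following operator body keeps no mutable member state and holds its scratch buffer in a `thread_local` variable, so one shared instance could be called from several threads. `DistinctCharCount` and its `Process` method are hypothetical names that are not part of the FG API.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical operator body: counts the distinct bytes in a text field.
class DistinctCharCount {
 public:
  // const method: nothing in the object is modified while processing.
  int Process(std::string_view text, std::vector<int32_t>& outputs) const {
    // One scratch buffer per thread, reused across calls without locking.
    thread_local std::array<bool, 256> seen;
    seen.fill(false);
    int32_t distinct = 0;
    for (unsigned char c : text) {
      if (!seen[c]) {
        seen[c] = true;
        ++distinct;
      }
    }
    outputs.push_back(distinct);
    return 0;  // 0 = success
  }
};
```

If the scratch state were an ordinary member variable instead, concurrent calls would race on it, which is the case where is_op_thread_safe=false (one object per thread) is the simpler choice.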

Configuration reference

`fg.json` entry structure

{
    "feature_name": "my_custom_fg_op",
    "feature_type": "custom_feature",
    "operator_name": "EditDistance",
    "operator_lib_file": "libedit_distance.so",
    "expression": [
        "user:query",
        "item:title"
    ],
    "value_type": "string",
    "separator": ",",
    "default_value": "-1",
    "value_dimension": 1,
    "normalizer": "method=expression,expr=x>16?16:x",
    "num_buckets": 10000,
    "stub_type": false,
    "is_sequence": false,
    "is_op_thread_safe": true
}

Add any additional configuration items your operator needs. The entire JSON entry is passed to Initialize.

Configuration parameters

Parameter	Required	Description
`feature_type`	Yes	Set to `custom_feature`.
`operator_name`	Yes	The name registered with `REGISTER_PLUGIN`. Must match the name string passed as the first argument of `REGISTER_PLUGIN`, which does not have to be the class name. The same operator can be reused across multiple features.
`operator_lib_file`	Offline: required; Online: optional	Name of the `.so` file. Online services scan the `custom_fg_lib` subdirectory of the `fg.json` model directory and load all `.so` files automatically. For official extension operators, set this to `pyfg/lib/libxxx.so`. For offline tasks, upload the `.so` as a MaxCompute resource with the same name.
`expression`	Yes	Input fields. Supports multiple inputs.
`value_type`	Yes	Output type. One of: `string`, `int32`, `int64`, `float`, `double`.
`default_value`	Yes	Default value as a string. The framework converts it to `value_type`.
`separator`	When output is multi-dimensional	Splits `default_value` into multiple values for multi-dimensional features.
`stub_type`	No	If `true`, the operator only produces intermediate results and cannot be a leaf node in the execution directed acyclic graph (DAG). Default: `false`.
`is_sequence`	No	Whether the output is a sequence feature. Default: `false`.
`sequence_length`	When `is_sequence=true`	Maximum sequence length. Longer sequences are truncated.
`sequence_delim`	When input is string-type sequence	Separator between sequence elements.
`split_sequence`	No	For string-type input sequences, whether the framework splits the string before passing it to the operator. Default: `true`. After splitting, the field type becomes `std::vector<std::string>` even if it was originally a scalar. The split uses the CPU AVX-512 instruction set for better performance. If some inputs are sequences and others are scalars, consider whether framework-level splitting is appropriate.
`value_dimension`	No	Dimension of the output feature. Default: `0`. Used to truncate output in offline tasks and affects the output table schema. Omit if the output dimension is variable.
`normalizer`	No	Post-transformation normalization for numeric features. See Normalizer frameworks.
`placeholder`	When `is_sequence=true` and `value_dimension != 1`	Fills empty positions in a multi-value sequence element. Default for floats: `NaN`; default for integers: minimum value of the type. Sparse features with discretization output a jagged value; dense features without discretization use `default_value` instead.
`disable_string_view`	No	Converts `string_view`-type inputs to `string` before calling the operator. Default: `false`. Enable this if your operator can't handle `string_view`. Note: enabling this degrades performance. Map keys and values of `string_view` type are not converted—handle them in your operator.
`is_op_thread_safe`	No	Whether the operator is thread-safe. Default: `true` (operator must be stateless or use only `thread_local` variables). Set to `false` to have the framework create one object per thread (simpler, but uses more memory).

Output table schema

The value_dimension and is_sequence settings determine the schema type of the output table in offline tasks:

`value_dimension`	`is_sequence`	Schema type	With discretization
`1`	`false`	`value_type`	`bigint`
`1`	`true`	`array<value_type>`	`array<bigint>`
`≠1`	`false`	`array<value_type>`	`array<bigint>`
`≠1`	`true`	`array<array<value_type>>`	`array<array<bigint>>`

Special cases: array<array<int>> is forced to array<array<bigint>>; array<array<double>> is forced to array<array<float>>.
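The table and special cases above can be expressed as a small lookup function. This is only a sketch: `OutputSchemaType` and `ColumnType` are hypothetical helpers, and the mapping of `int32` to `int` and `int64` to `bigint` is an assumption about the column type names that you should verify against your own tables.

```cpp
#include <cassert>
#include <string>

// Assumed mapping from FG value_type names to table column type names.
std::string ColumnType(const std::string& value_type) {
  if (value_type == "int32") return "int";
  if (value_type == "int64") return "bigint";
  return value_type;  // string, float, double
}

// Returns the schema type for one feature, following the table above.
std::string OutputSchemaType(const std::string& value_type, int value_dimension,
                             bool is_sequence, bool discretized) {
  std::string elem = discretized ? "bigint" : ColumnType(value_type);
  // One level of array nesting for each of: value_dimension != 1, is_sequence.
  const int depth = (value_dimension != 1 ? 1 : 0) + (is_sequence ? 1 : 0);
  if (depth == 2) {
    // Special cases for doubly nested arrays.
    if (elem == "int") elem = "bigint";
    if (elem == "double") elem = "float";
  }
  for (int i = 0; i < depth; ++i) {
    elem = "array<" + elem + ">";
  }
  return elem;
}
```

For example, a `double` sequence with `value_dimension` set to 8 maps to `array<array<float>>` under these rules.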

Normalizer frameworks

For numeric features, add a normalizer to further process the transformation result:

Framework	Example config	Formula
`log10`	`method=log10,threshold=1e-10,default=-10`	`x = x > threshold ? log10(x) : default`
`zscore`	`method=zscore,mean=0.0,standard_deviation=10.0`	`x = (x - mean) / standard_deviation`
`minmax`	`method=minmax,min=2.1,max=2.2`	`x = (x - min) / (max - min)`
`expression`	`method=expression,expr=sign(x)`	Any expression; the variable `x` represents the input.

For supported functions in expression, see Built-in feature operators.
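The three fixed formulas in the table translate directly into code. The function names below are hypothetical and only restate the formulas; they make no claim about how the framework evaluates normalizers internally.

```cpp
#include <cassert>
#include <cmath>

// method=log10: x = x > threshold ? log10(x) : default
double NormalizeLog10(double x, double threshold, double default_value) {
  return x > threshold ? std::log10(x) : default_value;
}

// method=zscore: x = (x - mean) / standard_deviation
double NormalizeZScore(double x, double mean, double standard_deviation) {
  assert(standard_deviation != 0.0);
  return (x - mean) / standard_deviation;
}

// method=minmax: x = (x - min) / (max - min)
double NormalizeMinMax(double x, double min_value, double max_value) {
  assert(max_value != min_value);
  return (x - min_value) / (max_value - min_value);
}
```

With the example configs from the table, an input of 0 under `log10` falls back to the default of -10, and an input of 5 under `zscore` becomes 0.5.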

Discretization operations

Six discretization types are available without any additional implementation:

Type	Description
`hash_bucket_size`	Hash and modulo operation on the transformation result.
`vocab_list`	Converts the result to an index in a list.
`vocab_dict`	Converts the result to a value in a dictionary. The value must be convertible to `int64`.
`vocab_file`	Loads a `vocab_list` or `vocab_dict` from a file.
`boundaries`	Converts the result to a bucket number based on specified bin boundaries.
`num_buckets`	Uses the result directly as a bucket number.

For more information, see Feature discretization (binning).
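To give a feel for what three of these types compute, here is an illustrative sketch. The source does not specify the hash function, how boundary values themselves are assigned, or how out-of-vocabulary values are indexed, so `std::hash`, `std::upper_bound`, and the caller-supplied fallback index below are all assumptions, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// hash_bucket_size: hash, then modulo (std::hash is a stand-in hash function).
int64_t HashBucket(const std::string& value, int64_t hash_bucket_size) {
  assert(hash_bucket_size > 0);
  const uint64_t h = std::hash<std::string>{}(value);
  return static_cast<int64_t>(h % static_cast<uint64_t>(hash_bucket_size));
}

// vocab_list: position of the value in the list, or fallback_index if absent.
int64_t VocabListIndex(const std::vector<std::string>& vocab_list,
                       const std::string& value, int64_t fallback_index) {
  const auto it = std::find(vocab_list.begin(), vocab_list.end(), value);
  if (it == vocab_list.end()) return fallback_index;
  return static_cast<int64_t>(it - vocab_list.begin());
}

// boundaries: bucket number = count of boundaries <= x (boundaries must be sorted).
int64_t BoundaryBucket(const std::vector<double>& boundaries, double x) {
  const auto it = std::upper_bound(boundaries.begin(), boundaries.end(), x);
  return static_cast<int64_t>(it - boundaries.begin());
}
```

None of this needs to live in a custom operator. It only illustrates the step that the framework applies to your operator's output.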

Configuration examples

Sequence time-difference feature

{
    "feature_name": "time_diff_seq",
    "feature_type": "custom_feature",
    "operator_name": "SeqExpr",
    "expression": ["user:cur_time", "user:clk_time_seq"],
    "formula": "cur_time - clk_time_seq",
    "default_value": "0",
    "value_type": "int32",
    "is_sequence": true,
    "num_buckets": 1000,
    "is_op_thread_safe": false
}

Spherical distance with normalization

{
    "feature_name": "spherical_distance",
    "feature_type": "custom_feature",
    "operator_name": "SeqExpr",
    "expression": ["item:click_id_lng", "item:click_id_lat", "user:j_lng", "user:j_lat"],
    "formula": "spherical_distance",
    "default_value": "0",
    "value_type": "double",
    "is_sequence": true,
    "is_op_thread_safe": true,
    "value_dimension": 1,
    "normalizer": "method=expression,expr=sqrt(x)"
}

Both examples use the SeqExpr operator. The formula field is a SeqExpr-specific configuration item passed through Initialize.

spherical_distance: Calculates the distance between two latitude/longitude coordinate pairs. Parameters are [lng1_seq, lat1_seq, lng2, lat2]—the first two are sequences; the last two are scalars.

These examples demonstrate the tiled format for custom sequence features. For a nested-format example, see sequence_feature.
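The two formulas combine a scalar with each element of a sequence. The sketch below shows that element-wise pattern outside the framework. The haversine formula and the 6371 km Earth radius are assumptions, since the source does not state which formula or unit SeqExpr uses, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr double kEarthRadiusKm = 6371.0;  // assumed mean Earth radius
constexpr double kPi = 3.14159265358979323846;

// Great-circle distance in kilometers between two points given in degrees.
double HaversineKm(double lng1, double lat1, double lng2, double lat2) {
  const double to_rad = kPi / 180.0;
  const double dlat = (lat2 - lat1) * to_rad;
  const double dlng = (lng2 - lng1) * to_rad;
  const double a = std::sin(dlat / 2) * std::sin(dlat / 2) +
                   std::cos(lat1 * to_rad) * std::cos(lat2 * to_rad) *
                       std::sin(dlng / 2) * std::sin(dlng / 2);
  return 2.0 * kEarthRadiusKm * std::asin(std::sqrt(a));
}

// [lng1_seq, lat1_seq, lng2, lat2]: sequences first, scalars last.
std::vector<double> SphericalDistanceSeq(const std::vector<double>& lng1_seq,
                                         const std::vector<double>& lat1_seq,
                                         double lng2, double lat2) {
  assert(lng1_seq.size() == lat1_seq.size());
  std::vector<double> out;
  out.reserve(lng1_seq.size());
  for (std::size_t i = 0; i < lng1_seq.size(); ++i) {
    out.push_back(HaversineKm(lng1_seq[i], lat1_seq[i], lng2, lat2));
  }
  return out;
}

// cur_time - clk_time_seq: the scalar is applied to every sequence element.
std::vector<int64_t> TimeDiffSeq(int64_t cur_time,
                                 const std::vector<int64_t>& clk_time_seq) {
  std::vector<int64_t> out;
  out.reserve(clk_time_seq.size());
  for (int64_t t : clk_time_seq) {
    out.push_back(cur_time - t);
  }
  return out;
}
```

In both functions the output has one value per sequence element, which matches `value_dimension` set to 1 in the second configuration.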

Sequence features

When is_sequence=true, the output requirements differ based on whether the feature is sparse or dense:

Sparse sequences

Single value per element: output any type.
Multiple values per element: output string only. Set value_type to string and use chr(29) as the separator between values within one element.

Dense sequences

Set value_dimension to the dimension of each element.
Scalar elements: value_dimension=1.
Vector elements: set value_dimension to the length of the vector.
The total number of output values must be an integer multiple of value_dimension.
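The two output layouts can be sketched as follows. `JoinSparseElement` and `FlattenDenseSequence` are hypothetical helpers, not FG API functions. They only show how the values of one sparse element are joined with chr(29), and how dense elements are laid out so that the total count is a multiple of value_dimension.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

constexpr char kValueSeparator = '\x1d';  // chr(29)

// Sparse sequence, multiple values per element: join them into one string.
std::string JoinSparseElement(const std::vector<std::string>& values) {
  std::string joined;
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (i > 0) joined.push_back(kValueSeparator);
    joined += values[i];
  }
  return joined;
}

// Dense sequence: append each element's values in order.
// Returns false (and clears outputs) if any element has the wrong dimension.
bool FlattenDenseSequence(const std::vector<std::vector<float>>& elements,
                          std::size_t value_dimension,
                          std::vector<float>& outputs) {
  outputs.clear();
  for (const auto& element : elements) {
    if (element.size() != value_dimension) {
      outputs.clear();
      return false;
    }
    outputs.insert(outputs.end(), element.begin(), element.end());
  }
  return true;
}
```

Checking the per-element dimension up front is one way to guarantee the multiple-of-value_dimension requirement before the values reach the framework.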

Developer example

The following example implements an edit distance operator that computes the Levenshtein distance between two text inputs.

Header file (`edit_distance.h`):

#pragma once
#include <memory>  // std::unique_ptr
#include <string>
#include <vector>

#include "api/base_op.h"

namespace fg {
namespace functor {
  class EditDistanceFunctor;
}

using std::string;
using std::vector;

/**
 * @brief EditDistance: takes two strings, outputs their edit distance.
 */
class EditDistance : public IFeatureOP {
 public:
  int Initialize(const string& feature_config) override;

  /// @return 0 on success.
  int ProcessWithStrOutputs(const vector<FieldPtr>& inputs,
                            vector<string>& outputs) override;

  /// @return 0 on success.
  int ProcessWithInt32Outputs(const vector<FieldPtr>& inputs,
                              vector<int32>& outputs) override;

  /// @return 0 on success.
  int ProcessWithInt64Outputs(const vector<FieldPtr>& inputs,
                              vector<int64>& outputs) override;

  /// @return 0 on success.
  int ProcessWithFloatOutputs(const vector<FieldPtr>& inputs,
                              vector<float>& outputs) override;

  /// @return 0 on success.
  int ProcessWithDoubleOutputs(const vector<FieldPtr>& inputs,
                               vector<double>& outputs) override;
 private:
  string feature_name_;
  std::unique_ptr<functor::EditDistanceFunctor> functor_p_;
};

}  // end of namespace fg

Implementation file (`edit_distance.cc`):

#include "edit_distance.h"

#include <absl/strings/ascii.h>
#include <absl/strings/str_join.h>

#include <nlohmann/json.hpp>
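#include <algorithm>  // std::min, std::transform
#include <cassert>    // assert
#include <iterator>   // std::back_inserter
#include <numeric>    // std::iota
#include <stdexcept>
#include <string>
#include <typeinfo>   // typeid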

#include "api/log.h"

namespace fg {
using absl::optional;

namespace functor {
template <class T>
int edit_distance(const T& s1, const T& s2) {
  int l1 = s1.size();
  int l2 = s2.size();
  if (l1 * l2 == 0) {
    return l1 + l2;
  }
  vector<int> prev(l2 + 1);
  vector<int> curr(l2 + 1);
  std::iota(prev.begin(), prev.end(), 0);
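  // Row 0 is already in prev (filled by std::iota), so start at row 1;
  // starting at 0 would read s1[-1].
  for (int i = 1; i <= l1; ++i) {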
    curr[0] = i;
    for (int j = 1; j <= l2; ++j) {
      int d = prev[j - 1];
      if (s1[i - 1] == s2[j - 1]) {
        curr[j] = d;
      } else {
        int d2 = std::min(prev[j], curr[j - 1]);
        curr[j] = 1 + std::min(d, d2);
      }
    }
    prev.swap(curr);
  }
  return prev[l2];
}

enum class Encoding : unsigned int { Latin = 0, UTF8 = 1 };

class EditDistanceFunctor {
 public:
  EditDistanceFunctor(const string& encoding) {
    string enc = absl::AsciiStrToLower(encoding);
    if (enc == "utf-8" || enc == "utf8") {
      encoding_ = Encoding::UTF8;
    } else {
      encoding_ = Encoding::Latin;
    }
  }

  int operator()(absl::string_view s1, absl::string_view s2) {
    if (encoding_ == Encoding::Latin) {
      return edit_distance(s1, s2);
    }
    if (encoding_ == Encoding::UTF8) {
      return edit_distance(from_bytes(s1), from_bytes(s2));
    }
    LOG(ERROR) << "EditDistanceFunctor found unsupported text encoding";
    assert(false);
    return 0;
  }

  Encoding TextEncoding() const { return encoding_; }

 private:
  Encoding encoding_;

  std::wstring from_bytes(absl::string_view str) {
    std::wstring result;
    int i = 0;
    int len = (int)str.length();
    while (i < len) {
      int char_size = 0;
      int unicode = 0;

      if ((str[i] & 0x80) == 0) {
        unicode = str[i];
        char_size = 1;
      } else if ((str[i] & 0xE0) == 0xC0) {
        unicode = str[i] & 0x1F;
        char_size = 2;
      } else if ((str[i] & 0xF0) == 0xE0) {
        unicode = str[i] & 0x0F;
        char_size = 3;
      } else if ((str[i] & 0xF8) == 0xF0) {
        unicode = str[i] & 0x07;
        char_size = 4;
      } else {
        // Invalid UTF-8 sequence
        ++i;
        continue;
      }

      if (i + char_size > len) {
        // Truncated multi-byte sequence at the end of the input
        break;
      }

      for (int j = 1; j < char_size; ++j) {
        unicode = (unicode << 6) | (str[i + j] & 0x3F);
      }

      if (unicode <= 0xFFFF) {
        result += static_cast<wchar_t>(unicode);
      } else {
        // Handle surrogate pairs for characters outside the BMP
        unicode -= 0x10000;
        result += static_cast<wchar_t>((unicode >> 10) + 0xD800);
        result += static_cast<wchar_t>((unicode & 0x3FF) + 0xDC00);
      }
      i += char_size;
    }
    return result;
  }
};
}  // namespace functor

// Overloaded helper for absl::visit (C++17).
template <class... Ts>
struct overloaded : Ts... {
  using Ts::operator()...;
};
template <class... Ts>
overloaded(Ts...) -> overloaded<Ts...>;

int EditDistance::Initialize(const string& feature_config) {
  nlohmann::json cfg;
  try {
    cfg = nlohmann::json::parse(feature_config);
  } catch (nlohmann::json::parse_error& ex) {
    LOG(ERROR) << "parse error at byte " << ex.byte;
    LOG(ERROR) << "config: " << feature_config;
    throw std::runtime_error("parse EditDistance config failed");
  }

  feature_name_ = cfg.at("feature_name");
  string encoding = cfg.value("encoding", "latin");
  functor_p_ = std::make_unique<functor::EditDistanceFunctor>(encoding);
  functor::Encoding enc = functor_p_->TextEncoding();
  encoding = (enc == functor::Encoding::UTF8) ? "UTF-8" : "Latin";
  LOG(INFO) << "feature <" << feature_name_ << "> with text encoding: " << encoding;
  return 0;
}

int EditDistance::ProcessWithInt32Outputs(const vector<FieldPtr>& inputs,
                                          vector<int32>& outputs) {
  outputs.clear();
  if (inputs.size() < 2) {
    outputs.push_back(0);
    return -1;  // invalid inputs
  }

  int d = absl::visit(
      overloaded{
          [this](const optional<string>* s1, const optional<string>* s2) {
            absl::string_view empty_view;
            return functor_p_->operator()(*s1 ? **s1 : empty_view, *s2 ? **s2 : empty_view);
          },
          [this](const optional<absl::string_view>* s1,
                 const optional<absl::string_view>* s2) {
            absl::string_view empty_view;
            return functor_p_->operator()(*s1 ? **s1 : empty_view, *s2 ? **s2 : empty_view);
          },
          [this](const optional<absl::string_view>* s1,
                 const optional<string>* s2) {
            absl::string_view empty_view;
            return functor_p_->operator()(*s1 ? **s1 : empty_view, *s2 ? **s2 : empty_view);
          },
          [this](const optional<string>* s1,
                 const optional<absl::string_view>* s2) {
            absl::string_view empty_view;
            return functor_p_->operator()(*s1 ? **s1 : empty_view, *s2 ? **s2 : empty_view);
          },
          [this](const List<string>* s1, const List<string>* s2) {
            string str1 = absl::StrJoin(*s1, "");
            string str2 = absl::StrJoin(*s2, "");
            return functor_p_->operator()(str1, str2);
          },
          [this](const List<absl::string_view>* s1,
                 const List<absl::string_view>* s2) {
            string str1 = absl::StrJoin(*s1, "");
            string str2 = absl::StrJoin(*s2, "");
            return functor_p_->operator()(str1, str2);
          },
          [this](const auto* x, const auto* y) {
            ERROR_EXIT(feature_name_,
                       "unsupported input type: ", typeid(*x).name(), " vs ",
                       typeid(*y).name());
            return 0;
          }},
      inputs.at(0), inputs.at(1));
  outputs.push_back(d);
  return 0;
}

// int32 results are reused for int64, float, and double outputs.
int EditDistance::ProcessWithInt64Outputs(const vector<FieldPtr>& inputs,
                                          vector<int64>& outputs) {
  vector<int32> distances;
  int status = ProcessWithInt32Outputs(inputs, distances);
  if (0 != status) return status;
  outputs.assign(distances.begin(), distances.end());
  return 0;
}

int EditDistance::ProcessWithFloatOutputs(const vector<FieldPtr>& inputs,
                                          vector<float>& outputs) {
  vector<int32> distances;
  int status = ProcessWithInt32Outputs(inputs, distances);
  if (0 != status) return status;
  outputs.assign(distances.begin(), distances.end());
  return 0;
}

int EditDistance::ProcessWithDoubleOutputs(const vector<FieldPtr>& inputs,
                                           vector<double>& outputs) {
  vector<int32> distances;
  int status = ProcessWithInt32Outputs(inputs, distances);
  if (0 != status) return status;
  outputs.assign(distances.begin(), distances.end());
  return 0;
}

int EditDistance::ProcessWithStrOutputs(const vector<FieldPtr>& inputs,
                                        vector<string>& outputs) {
  vector<int32> distances;
  int status = ProcessWithInt32Outputs(inputs, distances);
  if (0 != status) return status;
  outputs.reserve(distances.size());
  std::transform(distances.begin(), distances.end(),
                 std::back_inserter(outputs),
                 [](int32& x) { return std::to_string(x); });
  return 0;
}

}  // end of namespace fg

REGISTER_PLUGIN("EditDistance", EditDistance);

Download the source code from the table in Available operators and run the build.sh script to compile the operator.
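Before building the plugin, it can help to exercise the dynamic-programming core on its own. The following standalone sketch uses the same two-row approach as the example but depends only on the standard library. `Levenshtein` is a hypothetical name and compares raw bytes, which corresponds to the `latin` encoding setting.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string_view>
#include <vector>

// Two-row Levenshtein distance over bytes.
int Levenshtein(std::string_view s1, std::string_view s2) {
  const std::size_t l1 = s1.size();
  const std::size_t l2 = s2.size();
  if (l1 == 0 || l2 == 0) {
    return static_cast<int>(l1 + l2);
  }
  std::vector<int> prev(l2 + 1);
  std::vector<int> curr(l2 + 1);
  std::iota(prev.begin(), prev.end(), 0);  // row 0: distance from empty prefix
  for (std::size_t i = 1; i <= l1; ++i) {
    curr[0] = static_cast<int>(i);
    for (std::size_t j = 1; j <= l2; ++j) {
      if (s1[i - 1] == s2[j - 1]) {
        curr[j] = prev[j - 1];  // characters match: no extra cost
      } else {
        // substitution, deletion, insertion
        curr[j] = 1 + std::min({prev[j - 1], prev[j], curr[j - 1]});
      }
    }
    prev.swap(curr);
  }
  return prev[l2];
}
```

A few known pairs, such as "kitten" and "sitting" with distance 3, make a quick sanity check before the operator is wired into fg.json.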

Compile a custom operator

Use the same compilation environment as the FG framework: C++17 and the official compiler image. The image details are in the build.sh script included with each example.

Image	Base OS	Notes
`mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easyrec/feature_generator:centos7-0.1.1`	CentOS 7	Default image. The new C++11 ABI is not enabled.
`mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easyrec/feature_generator:0.1.1`	Rocky Linux 8 (CentOS 8 compatible)	Use this image if you need the new C++11 ABI (`_GLIBCXX_USE_CXX11_ABI=1`)

Third-party dependencies: Embed all dependencies as source code or use static linking. Dynamic linking to third-party libraries causes the operator to fail to load at runtime.

Required dependency:

abseil-cpp — use the same version as the FG framework.

For CMake configuration details, see the CMakeLists.txt in each developer example.

Available operators

The following operators are available as source code and prebuilt binaries:

Operator	Description	Source code	Binary
EditDistance	Edit distance between two text inputs	Download	Download
SeqExpr	Sequence expression evaluation	Download	Download
BPETokenize	Byte Pair Encoding (BPE) tokenization	Download	Included in built-in tokenize_feature

EditDistance configuration

Parameter	Options	Default
`encoding`	`utf-8`, `latin`	`latin`

For a BatchProcess example, download and review RegexReplace.

What's next

Built-in feature operators — available operators and the expr_feature expression syntax
Use FG in offline tasks — how to deploy custom operators for offline tasks
Feature Generation overview and configuration — full fg.json schema and discretization reference