Artificial Intelligence Recommendation: Custom feature operators

Last Updated: Apr 01, 2026

The Feature Generation (FG) framework supports custom feature operators as shared library plugins. Use them when built-in operators don't cover your domain-specific transformation logic—for example, edit distance between two text fields, spherical distance from GPS coordinates, or any numeric formula applied element-wise to a sequence.

How it works

  1. Implement the IFeatureOP C++ interface and register the operator with REGISTER_PLUGIN.

  2. Compile the implementation into a shared library (.so).

  3. Reference the operator in fg.json by setting feature_type to custom_feature and operator_name to the registered name.

  4. Deploy the .so file—online services auto-discover it from the custom_fg_lib directory; offline tasks require uploading it as a MaxCompute resource.

Prerequisites

Before you begin, ensure that you have:

  • A C++17-compatible build environment using the official compiler image (see Compile a custom operator)

  • The API dependency package: fg-api.tar.gz — contains all required header files

  • Familiarity with the FG framework and how fg.json configuration works

Implement an operator

Minimal example

Every custom operator follows the same pattern: inherit IFeatureOP, implement Initialize and at least one ProcessWith* method, and register with REGISTER_PLUGIN.

Here is a minimal operator that outputs a constant integer:

#include "api/base_op.h"

namespace fg {

class ConstantOp : public IFeatureOP {
 public:
  int Initialize(const string& feature_config) override {
    // Parse JSON config if needed; return 0 on success.
    return 0;
  }

  int ProcessWithInt32Outputs(const vector<FieldPtr>& inputs,
                              vector<int32>& outputs) override {
    outputs.push_back(42);
    return 0;  // 0 = success
  }
};

}  // namespace fg

REGISTER_PLUGIN("ConstantOp", ConstantOp);

Key requirements:

  • Include a parameterless constructor (the default constructor works here).

  • The Initialize method receives the entire fg.json entry as a JSON string. Parse configuration items you need from it.

  • Return 0 for success from all methods; non-zero indicates failure.

  • Call REGISTER_PLUGIN in the implementation file, not the header.

Choose between ProcessWith* and BatchProcess

Two processing interfaces are available. Choose based on your throughput and complexity requirements:

| | ProcessWith* | BatchProcess |
| --- | --- | --- |
| Granularity | One record at a time | One batch of records |
| Output types | Basic scalar types only (string, int32, int64, float, double) | Any valid VariantVector type, including Map |
| Use when | Simple transformations, low implementation cost | High throughput required; user-side features with one sample per request (use the broadcast mechanism to avoid redundant parsing) |
| Required | At least one method matching value_type | Override HasBatchProcessImpl() to return true |

When BatchProcess is implemented and HasBatchProcessImpl() returns true, the framework skips all ProcessWith* methods.
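The batch contract can be sketched roughly as follows. This is not the framework's code: it uses simplified stand-ins (std::variant and std::optional instead of the absl types, a VariantVector with only two alternatives, and a hypothetical StrLenOp that does not inherit from IFeatureOP) so that it compiles without api/base_op.h. It shows the two pieces named in the table: HasBatchProcessImpl() returning true, and one call handling a whole column.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Simplified stand-ins for the framework types; the real definitions
// live in api/base_op.h and use absl::variant / absl::optional.
using StrColumn = std::vector<std::optional<std::string>>;
using Int64Column = std::vector<std::optional<int64_t>>;
using VariantVector = std::variant<StrColumn, Int64Column>;

// Hypothetical operator: outputs the byte length of each input string.
class StrLenOp {
 public:
  // Tells the caller that BatchProcess is implemented.
  bool HasBatchProcessImpl() const { return true; }

  // inputs[0] is one feature column; outputs receives one value per row.
  int BatchProcess(const std::vector<VariantVector>& inputs,
                   VariantVector& outputs) {
    if (inputs.empty()) return -1;
    const auto* col = std::get_if<StrColumn>(&inputs[0]);
    if (col == nullptr) return -1;  // unsupported input column type
    Int64Column result;
    result.reserve(col->size());
    for (const auto& cell : *col) {
      if (cell) {
        result.emplace_back(static_cast<int64_t>(cell->size()));
      } else {
        result.emplace_back(std::nullopt);  // missing stays missing
      }
    }
    assert(result.size() == col->size());  // one output per input row
    outputs = std::move(result);
    return 0;
  }
};
```

In a real operator the same loop would sit inside an IFeatureOP subclass, and the output alternative would have to match what the feature's is_sequence, value_dimension, and value_type settings require.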

Full C++ interface

#pragma once
#ifndef FEATURE_GENERATOR_PLUGIN_BASE_H
#define FEATURE_GENERATOR_PLUGIN_BASE_H

#include <absl/container/flat_hash_map.h>
#include <absl/strings/string_view.h>
#include <absl/types/optional.h>

#include <stdexcept>
#include <utility>
#include <vector>

#include "fsmap.h"
#include "integral_types.h"

namespace fg {

using absl::optional;
using std::string;
using std::vector;

template <typename T>
using List = std::vector<T>;
template <typename K, typename V>
using Map = absl::flat_hash_map<K, V>;
template <typename K, typename V>
using MapArray = std::vector<std::pair<K, V>>;
using Matrix = std::vector<std::vector<float>>;
using MatrixL = std::vector<std::vector<int64>>;
using MatrixS = std::vector<std::vector<string>>;
template <typename K, typename V>
using FSMap = featurestore::type::fs_map<K, V>;

using FieldPtr = absl::variant<
    const optional<string>*, const optional<int32>*, const optional<int64>*,
    const optional<float>*, const optional<double>*,
    const optional<absl::string_view>*,

    const List<string>*, const List<int32>*, const List<int64>*,
    const List<float>*, const List<double>*, const List<absl::string_view>*,

    const Map<string, string>*, const Map<string, int32>*,
    const Map<string, int64>*, const Map<string, float>*,
    const Map<string, double>*, const Map<string, absl::string_view>*,

    const Map<absl::string_view, absl::string_view>*,
    const Map<absl::string_view, int32>*, const Map<absl::string_view, int64>*,
    const Map<absl::string_view, float>*, const Map<absl::string_view, double>*,
    const Map<absl::string_view, string>*,

    const Map<int32, string>*, const Map<int32, int32>*,
    const Map<int32, int64>*, const Map<int32, float>*,
    const Map<int32, double>*, const Map<int32, absl::string_view>*,

    const Map<int64, string>*, const Map<int64, float>*,
    const Map<int64, double>*, const Map<int64, int32>*,
    const Map<int64, int64>*, const Map<int64, absl::string_view>*,

    const FSMap<absl::string_view, absl::string_view>*,
    const FSMap<absl::string_view, int32>*,
    const FSMap<absl::string_view, int64>*,
    const FSMap<absl::string_view, float>*,
    const FSMap<absl::string_view, double>*,

    const FSMap<int32, int32>*, const FSMap<int32, int64>*,
    const FSMap<int32, float>*, const FSMap<int32, double>*,
    const FSMap<int32, absl::string_view>*,

    const FSMap<int64, float>*, const FSMap<int64, double>*,
    const FSMap<int64, int32>*, const FSMap<int64, int64>*,
    const FSMap<int64, absl::string_view>*,

    const MapArray<string, string>*, const MapArray<string, int32>*,
    const MapArray<string, int64>*, const MapArray<string, float>*,
    const MapArray<string, double>*,

    const MapArray<int32, string>*, const MapArray<int32, float>*,
    const MapArray<int32, double>*, const MapArray<int32, int32>*,
    const MapArray<int32, int64>*,

    const MapArray<int64, string>*, const MapArray<int64, float>*,
    const MapArray<int64, double>*, const MapArray<int64, int32>*,
    const MapArray<int64, int64>*, const Matrix*, const MatrixL*,
    const MatrixS*>;

// Represents a COLUMN of the feature table.
using VariantVector = absl::variant<
    vector<optional<string>>, vector<optional<int32>>, vector<optional<int64>>,
    vector<optional<float>>, vector<optional<double>>,
    vector<optional<absl::string_view>>,

    vector<List<string>>, vector<List<int32>>, vector<List<int64>>,
    vector<List<float>>, vector<List<double>>, vector<List<absl::string_view>>,

    vector<Map<string, string>>, vector<Map<string, int32>>,
    vector<Map<string, int64>>, vector<Map<string, float>>,
    vector<Map<string, double>>, vector<Map<string, absl::string_view>>,

    vector<Map<absl::string_view, absl::string_view>>,
    vector<Map<absl::string_view, int32>>,
    vector<Map<absl::string_view, int64>>,
    vector<Map<absl::string_view, float>>,
    vector<Map<absl::string_view, double>>,

    vector<Map<int32, string>>, vector<Map<int32, int32>>,
    vector<Map<int32, int64>>, vector<Map<int32, float>>,
    vector<Map<int32, double>>, vector<Map<int32, absl::string_view>>,

    vector<Map<int64, string>>, vector<Map<int64, float>>,
    vector<Map<int64, double>>, vector<Map<int64, int32>>,
    vector<Map<int64, int64>>, vector<Map<int64, absl::string_view>>,

    vector<FSMap<absl::string_view, absl::string_view>>,
    vector<FSMap<absl::string_view, int32>>,
    vector<FSMap<absl::string_view, int64>>,
    vector<FSMap<absl::string_view, float>>,
    vector<FSMap<absl::string_view, double>>,

    vector<FSMap<int32, int32>>, vector<FSMap<int32, int64>>,
    vector<FSMap<int32, float>>, vector<FSMap<int32, double>>,
    vector<FSMap<int32, absl::string_view>>,

    vector<FSMap<int64, float>>, vector<FSMap<int64, double>>,
    vector<FSMap<int64, int32>>, vector<FSMap<int64, int64>>,
    vector<FSMap<int64, absl::string_view>>,

    vector<MapArray<string, string>>, vector<MapArray<string, int32>>,
    vector<MapArray<string, int64>>, vector<MapArray<string, float>>,
    vector<MapArray<string, double>>,

    vector<MapArray<int32, string>>, vector<MapArray<int32, float>>,
    vector<MapArray<int32, double>>, vector<MapArray<int32, int32>>,
    vector<MapArray<int32, int64>>,

    vector<MapArray<int64, string>>, vector<MapArray<int64, float>>,
    vector<MapArray<int64, double>>, vector<MapArray<int64, int32>>,
    vector<MapArray<int64, int64>>, vector<Matrix>, vector<MatrixL>,
    vector<MatrixS>>;

/**
 * @brief The public base class for custom feature operators.
 *
 * The framework checks if a subclass overrides the `BatchProcess` method. If it is overridden,
 * the framework calls this method to perform the feature transformation.
 * Otherwise, the framework selects one of the `ProcessWith*` methods based on the `value_type` configuration.
 * Implement the method that corresponds to the required output type.
 */
class IFeatureOP {
 public:
  class NotOverriddenException : public std::exception {
   public:
    explicit NotOverriddenException(std::string msg) : msg_(std::move(msg)) {}
    const char* what() const noexcept override {
      if (msg_.empty()) {
        return "unimplemented method called";
      }
      // Cache the message to a member variable to ensure that the returned pointer remains valid.
      cached_ = "unimplemented method called: " + msg_;
      return cached_.c_str();
    }

   private:
    std::string msg_;
    mutable std::string cached_;
  };

  virtual ~IFeatureOP() = default;

  /**
   * @brief Initialization method.
   * @param feature_config The full fg.json entry for this feature, as a JSON string.
   * @return 0 on success; non-zero indicates failure.
   */
  virtual int Initialize(const string& feature_config) = 0;

  /**
   * @brief Processes one record and outputs string values.
   * @param inputs A record that can contain multiple fields.
   * @param outputs The outputs of the feature transformation.
   * @return 0 on success.
   */
  virtual int ProcessWithStrOutputs(const vector<FieldPtr>& inputs,
                                    vector<string>& outputs) {
    throw NotOverriddenException("ProcessWithStrOutputs(FieldPtr)");
  }

  /**
   * @brief Processes one record and outputs int32 values.
   */
  virtual int ProcessWithInt32Outputs(const vector<FieldPtr>& inputs,
                                      vector<int32>& outputs) {
    throw NotOverriddenException("ProcessWithInt32Outputs(FieldPtr)");
  }

  /**
   * @brief Processes one record and outputs int64 values.
   */
  virtual int ProcessWithInt64Outputs(const vector<FieldPtr>& inputs,
                                      vector<int64>& outputs) {
    throw NotOverriddenException("ProcessWithInt64Outputs(FieldPtr)");
  }

  /**
   * @brief Processes one record and outputs float values.
   */
  virtual int ProcessWithFloatOutputs(const vector<FieldPtr>& inputs,
                                      vector<float>& outputs) {
    throw NotOverriddenException("ProcessWithFloatOutputs(FieldPtr)");
  }

  /**
   * @brief Processes one record and outputs double values.
   */
  virtual int ProcessWithDoubleOutputs(const vector<FieldPtr>& inputs,
                                       vector<double>& outputs) {
    throw NotOverriddenException("ProcessWithDoubleOutputs(FieldPtr)");
  }

  /**
   * @brief Optional batch interface that processes one batch of records.
   *
   * @param inputs A vector of input columns; each `VariantVector` represents one feature column.
   * @param outputs The transformed features. Supports complex output types usable as inputs for downstream operators.
   * @return 0 on success.
   */
  virtual int BatchProcess(const vector<VariantVector>& inputs,
                           VariantVector& outputs) {
    throw NotOverriddenException("BatchProcess");
  }

  /**
   * @brief Declares whether the subclass implements BatchProcess.
   *
   * Override this to return `true` if you implement `BatchProcess`.
   * This avoids exception propagation issues across dynamic library boundaries
   * when the .so and the main program use different C++ ABIs.
   *
   * @return true if BatchProcess is implemented; false otherwise (default).
   */
  virtual bool HasBatchProcessImpl() const { return false; }
};

using CreateOperatorFunc = IFeatureOP* (*)();

inline FieldPtr GetFieldPtr(const VariantVector& input, size_t i) {
  return absl::visit(
      [&](const auto& vec) -> FieldPtr {
        if (i >= vec.size()) {
          throw std::out_of_range("GetFieldPtr: index " + std::to_string(i) +
                                  " out of range [0, " +
                                  std::to_string(vec.size()) + ")");
        }
        return &vec.at(i);
      },
      input);
}
}  // namespace fg

#if defined(__GNUC__)
#define PLUGIN_API_HIDDEN \
  __attribute__((visibility("hidden"))) __attribute__((used))
#define PLUGIN_API_EXPORT \
  __attribute__((visibility("default"))) __attribute__((used))
#else
#define PLUGIN_API_HIDDEN
#define PLUGIN_API_EXPORT
#endif

std::vector<std::string>& getLocalNames();
std::vector<std::pair<std::string, void*>>& getLocalRegs();

#define REGISTER_PLUGIN(OpName, OpClass)                            \
  extern "C" PLUGIN_API_EXPORT fg::IFeatureOP* create##OpClass() {  \
    return new fg::OpClass();                                       \
  }                                                                 \
  namespace {                                                       \
  struct _Reg_##OpClass {                                           \
    _Reg_##OpClass() {                                              \
      getLocalNames().push_back(OpName);                            \
      getLocalRegs().emplace_back(OpName, (void*)&create##OpClass); \
    }                                                               \
  };                                                                \
  static _Reg_##OpClass _dummy_##OpClass __attribute__((used));     \
  }

#endif  // FEATURE_GENERATOR_PLUGIN_BASE_H

Implementation notes

  • Input types: Implement ProcessWith* methods for all input types you intend to support. For unsupported types, throw an exception directly. VariantRecord defines all feature field types that the framework can process. FSMap types are required when integrating with Feature Store—they significantly improve online processor performance.

  • Discretization: Implement only the transformation logic before discretization. If hash_bucket_size, vocab_list, boundaries, or another discretization operation is configured, the framework handles it automatically.

  • Thread safety: By default (is_op_thread_safe=true), the operator must be stateless or use only thread_local variables. Set is_op_thread_safe=false to have the framework create one object replica per thread—this is simpler to implement but uses more memory.

  • `string_view` inputs: Online services (EasyRecProcessor, TorchEasyRec Processor) pass item-side string features as absl::string_view for efficiency. If your operator can't handle string_view, set disable_string_view=true in the configuration to have the framework convert them to string before calling your operator. This degrades performance.

  • Custom config items: Use any key names in the JSON entry—the framework passes the full JSON string to Initialize. Don't reuse framework-reserved key names (such as feature_type, operator_name, value_type). Keys that reference external files must end with _file so the framework can sync them for offline tasks.

  • Operator directory: Use the FEATURE_OPERATOR_DIR environment variable to specify the directory where dynamic-link library files are located. Each dynamic-link library can contain implementations of multiple feature operators.

  • `BatchProcess` return type: The type of the VariantVector returned by BatchProcess depends on the values of is_sequence, value_dimension, and value_type. For more information, see the output table schema in Configuration reference. When stub_type=true is configured and no binning operation is set, BatchProcess can return any valid type, such as Map.
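The thread-safety note above can be illustrated with a small sketch. JoinTokens is a hypothetical class, not part of the FG API, and it deliberately avoids the framework headers. It only shows the pattern that is_op_thread_safe=true relies on: configuration is written once in Initialize, the processing method is const, and any scratch space is thread_local rather than a member.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper in the style of a thread-safe operator.
class JoinTokens {
 public:
  int Initialize(const std::string& delimiter) {
    delimiter_ = delimiter;  // written once, before any Process call
    return 0;
  }

  // const: concurrent calls never mutate shared state.
  int Process(const std::vector<std::string>& tokens,
              std::vector<std::string>& outputs) const {
    thread_local std::string scratch;  // one buffer per calling thread
    scratch.clear();
    for (std::size_t i = 0; i < tokens.size(); ++i) {
      if (i > 0) scratch += delimiter_;
      scratch += tokens[i];
    }
    outputs.push_back(scratch);  // copy out; the buffer is reused next call
    return 0;
  }

 private:
  std::string delimiter_;
};
```

If an operator needs mutable member state instead (for example a cache filled during processing), the is_op_thread_safe=false route described above trades memory for a simpler implementation.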

Configuration reference

fg.json entry structure

{
    "feature_name": "my_custom_fg_op",
    "feature_type": "custom_feature",
    "operator_name": "EditDistance",
    "operator_lib_file": "libedit_distance.so",
    "expression": [
        "user:query",
        "item:title"
    ],
    "value_type": "string",
    "separator": ",",
    "default_value": "-1",
    "value_dimension": 1,
    "normalizer": "method=expression,expr=x>16?16:x",
    "num_buckets": 10000,
    "stub_type": false,
    "is_sequence": false,
    "is_op_thread_safe": true
}

Add any additional configuration items your operator needs. The entire JSON entry is passed to Initialize.

Configuration parameters

| Parameter | Required | Description |
| --- | --- | --- |
| feature_type | Yes | Set to custom_feature. |
| operator_name | Yes | The name used in REGISTER_PLUGIN. Must match the registered class name. The same operator can be reused across multiple features. |
| operator_lib_file | Offline: required; Online: optional | Name of the .so file. Online services scan the custom_fg_lib subdirectory of the fg.json model directory and load all .so files automatically. For official extension operators, set this to pyfg/lib/libxxx.so. For offline tasks, upload the .so as a MaxCompute resource with the same name. |
| expression | Yes | Input fields. Supports multiple inputs. |
| value_type | Yes | Output type. One of: string, int32, int64, float, double. |
| default_value | Yes | Default value as a string. The framework converts it to value_type. |
| separator | When output is multi-dimensional | Splits default_value into multiple values for multi-dimensional features. |
| stub_type | No | If true, the operator can only produce intermediate results and cannot be a leaf node in the Directed Acyclic Graph (DAG) execution graph. Default: false. |
| is_sequence | No | Whether the output is a sequence feature. Default: false. |
| sequence_length | When is_sequence=true | Maximum sequence length. Longer sequences are truncated. |
| sequence_delim | When input is string-type sequence | Separator between sequence elements. |
| split_sequence | No | For string-type input sequences, whether the framework splits the string before passing it to the operator. Default: true. After splitting, the field type becomes std::vector<std::string> even if it was originally a scalar. The split uses the CPU AVX-512 instruction set for better performance. If some inputs are sequences and others are scalars, consider whether framework-level splitting is appropriate. |
| value_dimension | No | Dimension of the output feature. Default: 0. Used to truncate output in offline tasks and affects the output table schema. Omit if the output dimension is variable. |
| normalizer | No | Post-transformation normalization for numeric features. See Normalizer frameworks. |
| placeholder | When is_sequence=true and value_dimension != 1 | Fills empty positions in a multi-value sequence element. Default for floats: NaN; default for integers: minimum value of the type. Sparse features with discretization output a jagged value; dense features without discretization use default_value instead. |
| disable_string_view | No | Converts string_view-type inputs to string before calling the operator. Default: false. Enable this if your operator can't handle string_view. Note: enabling this degrades performance. Map keys and values of string_view type are not converted; handle them in your operator. |
| is_op_thread_safe | No | Whether the operator is thread-safe. Default: true (operator must be stateless or use only thread_local variables). Set to false to have the framework create one object per thread (simpler, but uses more memory). |
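To make the interaction of default_value and separator concrete, the following sketch expands a default such as "0,0,0" into a multi-dimensional float default. ParseDefaultValue is a hypothetical helper written for this page; the framework's own parsing is not shown in the source and may differ, for example in how it treats empty pieces or other value types.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Illustrative only: splits default_value on separator and converts each
// piece to float. std::stof throws std::invalid_argument on a piece that
// is not a number.
std::vector<float> ParseDefaultValue(const std::string& default_value,
                                     const std::string& separator) {
  assert(!default_value.empty());
  std::vector<float> values;
  if (separator.empty()) {
    values.push_back(std::stof(default_value));
    return values;
  }
  std::size_t start = 0;
  while (true) {
    const std::size_t pos = default_value.find(separator, start);
    // When pos is npos, substr clamps the count to the end of the string.
    values.push_back(std::stof(default_value.substr(start, pos - start)));
    if (pos == std::string::npos) break;
    start = pos + separator.size();
  }
  return values;
}
```

With separator "," the string "-1" yields a single value, and "0,0.5,2" yields three values, one per output dimension.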

Output table schema

The value_dimension and is_sequence settings determine the schema type of the output table in offline tasks:

| value_dimension | is_sequence | Schema type | With discretization |
| --- | --- | --- | --- |
| 1 | false | value_type | bigint |
| 1 | true | array<value_type> | array<bigint> |
| ≠1 | false | array<value_type> | array<bigint> |
| ≠1 | true | array<array<value_type>> | array<array<bigint>> |

Special cases: array<array<int>> is forced to array<array<bigint>>; array<array<double>> is forced to array<array<float>>.
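The schema rules above can be written down as a small lookup function. OutputSchema is a hypothetical helper, not a framework API. It takes the element type as a table type name (such as "double" or "int"), because the mapping from value_type names like int32 to table type names is not spelled out here.

```cpp
#include <cassert>
#include <string>

// Returns the offline output column type for the given settings,
// following the table and the two special cases listed above.
std::string OutputSchema(int value_dimension, bool is_sequence,
                         std::string elem_type, bool discretized) {
  assert(value_dimension >= 0);
  if (discretized) elem_type = "bigint";
  const bool multi_value = (value_dimension != 1);
  if (!multi_value && !is_sequence) return elem_type;
  if (multi_value && is_sequence) {
    // Special cases: nested int becomes bigint, nested double becomes float.
    if (elem_type == "int") elem_type = "bigint";
    if (elem_type == "double") elem_type = "float";
    return "array<array<" + elem_type + ">>";
  }
  return "array<" + elem_type + ">";
}
```

For example, a dense sequence of 8-dimensional double vectors maps to array<array<float>>, while the same feature with value_dimension=1 maps to array<double>.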

Normalizer frameworks

For numeric features, add a normalizer to further process the transformation result:

| Framework | Example config | Formula |
| --- | --- | --- |
| log10 | method=log10,threshold=1e-10,default=-10 | x = x > threshold ? log10(x) : default |
| zscore | method=zscore,mean=0.0,standard_deviation=10.0 | x = (x - mean) / standard_deviation |
| minmax | method=minmax,min=2.1,max=2.2 | x = (x - min) / (max - min) |
| expression | method=expression,expr=sign(x) | Any expression; the variable x represents the input. |

For supported functions in expression, see Built-in feature operators.
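The three fixed formulas can be transcribed directly into code, which may help when checking expected values by hand. The function names below are illustrative and are not part of the FG API; the framework applies these normalizers itself, so an operator does not need to implement them.

```cpp
#include <cassert>
#include <cmath>

// log10: x > threshold ? log10(x) : default
double NormalizeLog10(double x, double threshold, double default_value) {
  return x > threshold ? std::log10(x) : default_value;
}

// zscore: (x - mean) / standard_deviation
double NormalizeZScore(double x, double mean, double standard_deviation) {
  assert(standard_deviation != 0.0);
  return (x - mean) / standard_deviation;
}

// minmax: (x - min) / (max - min)
double NormalizeMinMax(double x, double min, double max) {
  assert(max != min);
  return (x - min) / (max - min);
}
```

Using the example configs from the table, an input of 0 under log10 falls back to the default of -10, and an input of 5 under zscore (mean 0, standard deviation 10) becomes 0.5.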

Discretization operations

Six discretization types are available without any additional implementation:

| Type | Description |
| --- | --- |
| hash_bucket_size | Hash and modulo operation on the transformation result. |
| vocab_list | Converts the result to an index in a list. |
| vocab_dict | Converts the result to a value in a dictionary. The value must be convertible to int64. |
| vocab_file | Loads a vocab_list or vocab_dict from a file. |
| boundaries | Converts the result to a bucket number based on specified bin boundaries. |
| num_buckets | Uses the result directly as a bucket number. |

For more information, see Feature discretization (binning).
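As a rough picture of what two of these operations do to an operator's output, consider the sketch below. Both functions are hypothetical, and two details are assumptions rather than documented behavior: that a value equal to a boundary falls into the higher bucket, and that a value missing from the vocabulary maps to -1. Check the binning documentation for the conventions the framework actually uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// boundaries: bucket id = number of boundaries <= x (assumed convention).
int64_t BucketizeByBoundaries(double x, const std::vector<double>& boundaries) {
  assert(std::is_sorted(boundaries.begin(), boundaries.end()));
  const auto it = std::upper_bound(boundaries.begin(), boundaries.end(), x);
  return static_cast<int64_t>(it - boundaries.begin());
}

// vocab_list: index of the value in the list, or -1 if it is absent
// (the out-of-vocabulary result is an assumption of this sketch).
int64_t LookupVocabIndex(const std::string& value,
                         const std::vector<std::string>& vocab_list) {
  const auto it = std::find(vocab_list.begin(), vocab_list.end(), value);
  if (it == vocab_list.end()) return -1;
  return static_cast<int64_t>(it - vocab_list.begin());
}
```

Under these assumptions, boundaries of 1, 5, and 10 produce four buckets numbered 0 to 3.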

Configuration examples

Sequence time-difference feature

{
    "feature_name": "time_diff_seq",
    "feature_type": "custom_feature",
    "operator_name": "SeqExpr",
    "expression": ["user:cur_time", "user:clk_time_seq"],
    "formula": "cur_time - clk_time_seq",
    "default_value": "0",
    "value_type": "int32",
    "is_sequence": true,
    "num_buckets": 1000,
    "is_op_thread_safe": false
}

Spherical distance with normalization

{
    "feature_name": "spherical_distance",
    "feature_type": "custom_feature",
    "operator_name": "SeqExpr",
    "expression": ["item:click_id_lng", "item:click_id_lat", "user:j_lng", "user:j_lat"],
    "formula": "spherical_distance",
    "default_value": "0",
    "value_type": "double",
    "is_sequence": true,
    "is_op_thread_safe": true,
    "value_dimension": 1,
    "normalizer": "method=expression,expr=sqrt(x)"
}

Both examples use the SeqExpr operator. The formula field is a SeqExpr-specific configuration item passed through Initialize.

  • spherical_distance: Calculates the distance between two latitude/longitude coordinate pairs. Parameters are [lng1_seq, lat1_seq, lng2, lat2]—the first two are sequences; the last two are scalars.

These examples demonstrate the tiled format for custom sequence features. For a nested-format example, see sequence_feature.
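The element-wise evaluation with broadcast scalars can be sketched as follows. The source does not state which distance formula or unit SeqExpr uses, so the haversine formula, the 6371 km mean Earth radius, and both function names are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Haversine great-circle distance in kilometers (assumed formula and unit).
// All angles are given in degrees.
double Haversine(double lng1, double lat1, double lng2, double lat2) {
  constexpr double kPi = 3.14159265358979323846;
  constexpr double kEarthRadiusKm = 6371.0;  // mean Earth radius
  constexpr double kDegToRad = kPi / 180.0;
  const double dlat = (lat2 - lat1) * kDegToRad;
  const double dlng = (lng2 - lng1) * kDegToRad;
  const double a = std::sin(dlat / 2) * std::sin(dlat / 2) +
                   std::cos(lat1 * kDegToRad) * std::cos(lat2 * kDegToRad) *
                       std::sin(dlng / 2) * std::sin(dlng / 2);
  // Clamp to 1 to guard asin against rounding error.
  return 2.0 * kEarthRadiusKm * std::asin(std::sqrt(std::min(1.0, a)));
}

// The first two arguments are sequences; the last two are scalars that
// are broadcast to every sequence element.
std::vector<double> SphericalDistanceSeq(const std::vector<double>& lng1_seq,
                                         const std::vector<double>& lat1_seq,
                                         double lng2, double lat2) {
  assert(lng1_seq.size() == lat1_seq.size());
  std::vector<double> out;
  out.reserve(lng1_seq.size());
  for (std::size_t i = 0; i < lng1_seq.size(); ++i) {
    out.push_back(Haversine(lng1_seq[i], lat1_seq[i], lng2, lat2));
  }
  return out;
}
```

The time-difference example follows the same shape: the scalar cur_time is broadcast against each element of clk_time_seq.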

Sequence features

When is_sequence=true, the output requirements differ based on whether the feature is sparse or dense:

Sparse sequences

  • Single value per element: output any type.

  • Multiple values per element: output string only. Set value_type to string and use chr(29) as the separator between values within one element.

Dense sequences

  • Set value_dimension to the dimension of each element.

  • Scalar elements: value_dimension=1.

  • Vector elements: value_dimension = length of the vector.

  • The total number of output values must be an integer multiple of value_dimension.
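Two of these rules are easy to get wrong in operator code, so here is a small sketch of each. The helper names are hypothetical. The first joins the values of one sparse sequence element with chr(29); the second checks that a flat dense output contains a whole number of elements.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

constexpr char kValueSeparator = '\x1d';  // chr(29)

// Sparse sequence, multiple values per element: join the values of one
// element into a single string output.
std::string JoinElementValues(const std::vector<std::string>& values) {
  std::string joined;
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (i > 0) joined.push_back(kValueSeparator);
    joined += values[i];
  }
  return joined;
}

// Dense sequence: the flat output must hold a whole number of elements,
// each value_dimension values long.
bool IsValidDenseOutput(std::size_t num_values, std::size_t value_dimension) {
  return value_dimension > 0 && num_values % value_dimension == 0;
}
```

For a dense sequence with value_dimension=3, six output values describe two elements, while seven values would be rejected by the check.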

Developer example

The following example implements an edit distance operator that computes the Levenshtein distance between two text inputs.

Header file (`edit_distance.h`):

#pragma once
#include <memory>  // std::unique_ptr

#include "api/base_op.h"

namespace fg {
namespace functor {
  class EditDistanceFunctor;
}

using std::string;
using std::vector;

/**
 * @brief EditDistance: takes two strings, outputs their edit distance.
 */
class EditDistance : public IFeatureOP {
 public:
  int Initialize(const string& feature_config) override;

  /// @return 0 on success.
  int ProcessWithStrOutputs(const vector<FieldPtr>& inputs,
                            vector<string>& outputs) override;

  /// @return 0 on success.
  int ProcessWithInt32Outputs(const vector<FieldPtr>& inputs,
                              vector<int32>& outputs) override;

  /// @return 0 on success.
  int ProcessWithInt64Outputs(const vector<FieldPtr>& inputs,
                              vector<int64>& outputs) override;

  /// @return 0 on success.
  int ProcessWithFloatOutputs(const vector<FieldPtr>& inputs,
                              vector<float>& outputs) override;

  /// @return 0 on success.
  int ProcessWithDoubleOutputs(const vector<FieldPtr>& inputs,
                               vector<double>& outputs) override;
 private:
  string feature_name_;
  std::unique_ptr<functor::EditDistanceFunctor> functor_p_;
};

}  // end of namespace fg

Implementation file (`edit_distance.cc`):

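#include "edit_distance.h"

#include <absl/strings/ascii.h>
#include <absl/strings/str_join.h>

#include <algorithm>  // std::min, std::transform
#include <cassert>    // assert
#include <iterator>   // std::back_inserter
#include <nlohmann/json.hpp>
#include <numeric>  // std::iota
#include <stdexcept>
#include <typeinfo>  // typeid

#include "api/log.h"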

namespace fg {
using absl::optional;

namespace functor {
template <class T>
int edit_distance(const T& s1, const T& s2) {
  int l1 = s1.size();
  int l2 = s2.size();
  if (l1 * l2 == 0) {
    return l1 + l2;
  }
  vector<int> prev(l2 + 1);
  vector<int> curr(l2 + 1);
  std::iota(prev.begin(), prev.end(), 0);
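  // Row 0 is already stored in prev, so start at row 1; this also keeps
  // s1[i - 1] inside the string.
  for (int i = 1; i <= l1; ++i) {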
    curr[0] = i;
    for (int j = 1; j <= l2; ++j) {
      int d = prev[j - 1];
      if (s1[i - 1] == s2[j - 1]) {
        curr[j] = d;
      } else {
        int d2 = std::min(prev[j], curr[j - 1]);
        curr[j] = 1 + std::min(d, d2);
      }
    }
    prev.swap(curr);
  }
  return prev[l2];
}

enum class Encoding : unsigned int { Latin = 0, UTF8 = 1 };

class EditDistanceFunctor {
 public:
  EditDistanceFunctor(const string& encoding) {
    string enc = absl::AsciiStrToLower(encoding);
    if (enc == "utf-8" || enc == "utf8") {
      encoding_ = Encoding::UTF8;
    } else {
      encoding_ = Encoding::Latin;
    }
  }

  int operator()(absl::string_view s1, absl::string_view s2) {
    if (encoding_ == Encoding::Latin) {
      return edit_distance(s1, s2);
    }
    if (encoding_ == Encoding::UTF8) {
      return edit_distance(from_bytes(s1), from_bytes(s2));
    }
    LOG(ERROR) << "EditDistanceFunctor found unsupported text encoding";
    assert(false);
    return 0;
  }

  Encoding TextEncoding() const { return encoding_; }

 private:
  Encoding encoding_;

  std::wstring from_bytes(absl::string_view str) {
    std::wstring result;
    int i = 0;
    int len = (int)str.length();
    while (i < len) {
      int char_size = 0;
      int unicode = 0;

      if ((str[i] & 0x80) == 0) {
        unicode = str[i];
        char_size = 1;
      } else if ((str[i] & 0xE0) == 0xC0) {
        unicode = str[i] & 0x1F;
        char_size = 2;
      } else if ((str[i] & 0xF0) == 0xE0) {
        unicode = str[i] & 0x0F;
        char_size = 3;
      } else if ((str[i] & 0xF8) == 0xF0) {
        unicode = str[i] & 0x07;
        char_size = 4;
      } else {
        // Invalid UTF-8 sequence
        ++i;
        continue;
      }

      if (i + char_size > len) {
        // Truncated multi-byte sequence at the end of the input
        break;
      }

      for (int j = 1; j < char_size; ++j) {
        unicode = (unicode << 6) | (str[i + j] & 0x3F);
      }

      if (unicode <= 0xFFFF) {
        result += static_cast<wchar_t>(unicode);
      } else {
        // Handle surrogate pairs for characters outside the BMP
        unicode -= 0x10000;
        result += static_cast<wchar_t>((unicode >> 10) + 0xD800);
        result += static_cast<wchar_t>((unicode & 0x3FF) + 0xDC00);
      }
      i += char_size;
    }
    return result;
  }
};
}  // namespace functor

// Overloaded helper for absl::visit (C++17).
template <class... Ts>
struct overloaded : Ts... {
  using Ts::operator()...;
};
template <class... Ts>
overloaded(Ts...) -> overloaded<Ts...>;

int EditDistance::Initialize(const string& feature_config) {
  nlohmann::json cfg;
  try {
    cfg = nlohmann::json::parse(feature_config);
  } catch (nlohmann::json::parse_error& ex) {
    LOG(ERROR) << "parse error at byte " << ex.byte;
    LOG(ERROR) << "config: " << feature_config;
    throw std::runtime_error("parse EditDistance config failed");
  }

  feature_name_ = cfg.at("feature_name");
  string encoding = cfg.value("encoding", "latin");
  functor_p_ = std::make_unique<functor::EditDistanceFunctor>(encoding);
  functor::Encoding enc = functor_p_->TextEncoding();
  encoding = (enc == functor::Encoding::UTF8) ? "UTF-8" : "Latin";
  LOG(INFO) << "feature <" << feature_name_ << "> with text encoding: " << encoding;
  return 0;
}

int EditDistance::ProcessWithInt32Outputs(const vector<FieldPtr>& inputs,
                                          vector<int32>& outputs) {
  outputs.clear();
  if (inputs.size() < 2) {
    outputs.push_back(0);
    return -1;  // invalid inputs
  }

  int d = absl::visit(
      overloaded{
          [this](const optional<string>* s1, const optional<string>* s2) {
            absl::string_view empty_view;
            return functor_p_->operator()(*s1 ? **s1 : empty_view, *s2 ? **s2 : empty_view);
          },
          [this](const optional<absl::string_view>* s1,
                 const optional<absl::string_view>* s2) {
            absl::string_view empty_view;
            return functor_p_->operator()(*s1 ? **s1 : empty_view, *s2 ? **s2 : empty_view);
          },
          [this](const optional<absl::string_view>* s1,
                 const optional<string>* s2) {
            absl::string_view empty_view;
            return functor_p_->operator()(*s1 ? **s1 : empty_view, *s2 ? **s2 : empty_view);
          },
          [this](const optional<string>* s1,
                 const optional<absl::string_view>* s2) {
            absl::string_view empty_view;
            return functor_p_->operator()(*s1 ? **s1 : empty_view, *s2 ? **s2 : empty_view);
          },
          [this](const List<string>* s1, const List<string>* s2) {
            string str1 = absl::StrJoin(*s1, "");
            string str2 = absl::StrJoin(*s2, "");
            return functor_p_->operator()(str1, str2);
          },
          [this](const List<absl::string_view>* s1,
                 const List<absl::string_view>* s2) {
            string str1 = absl::StrJoin(*s1, "");
            string str2 = absl::StrJoin(*s2, "");
            return functor_p_->operator()(str1, str2);
          },
          [this](const auto* x, const auto* y) {
            ERROR_EXIT(feature_name_,
                       "unsupported input type: ", typeid(*x).name(), " vs ",
                       typeid(*y).name());
            return 0;
          }},
      inputs.at(0), inputs.at(1));
  outputs.push_back(d);
  return 0;
}

// int32 results are reused for int64, float, and double outputs.
int EditDistance::ProcessWithInt64Outputs(const vector<FieldPtr>& inputs,
                                          vector<int64>& outputs) {
  vector<int32> distances;
  int status = ProcessWithInt32Outputs(inputs, distances);
  if (0 != status) return status;
  outputs.assign(distances.begin(), distances.end());
  return 0;
}

int EditDistance::ProcessWithFloatOutputs(const vector<FieldPtr>& inputs,
                                          vector<float>& outputs) {
  vector<int32> distances;
  int status = ProcessWithInt32Outputs(inputs, distances);
  if (0 != status) return status;
  outputs.assign(distances.begin(), distances.end());
  return 0;
}

int EditDistance::ProcessWithDoubleOutputs(const vector<FieldPtr>& inputs,
                                           vector<double>& outputs) {
  vector<int32> distances;
  int status = ProcessWithInt32Outputs(inputs, distances);
  if (0 != status) return status;
  outputs.assign(distances.begin(), distances.end());
  return 0;
}

int EditDistance::ProcessWithStrOutputs(const vector<FieldPtr>& inputs,
                                        vector<string>& outputs) {
  vector<int32> distances;
  int status = ProcessWithInt32Outputs(inputs, distances);
  if (0 != status) return status;
  outputs.reserve(distances.size());
  std::transform(distances.begin(), distances.end(),
                 std::back_inserter(outputs),
                 [](int32 x) { return std::to_string(x); });
  return 0;
}

}  // end of namespace fg

REGISTER_PLUGIN("EditDistance", EditDistance);

Download the source code from the table in Available operators and run the build.sh script to compile the operator.

Compile a custom operator

Use the same compilation environment as the FG framework: C++17 and the official compiler image. The image details are in the build.sh script included with each example.

| Image | Base OS | Notes |
| --- | --- | --- |
| mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easyrec/feature_generator:centos7-0.1.1 | CentOS 7 | Default; the new C++11 ABI is not enabled |
| mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easyrec/feature_generator:0.1.1 | Rocky Linux 8 (CentOS 8 compatible) | Use this image if you need the new C++11 ABI (_GLIBCXX_USE_CXX11_ABI=1) |

Third-party dependencies: Embed all dependencies as source code or use static linking. Dynamic linking to third-party libraries causes the operator to fail to load at runtime.

Required dependency:

  • abseil-cpp — use the same version as the FG framework.

For CMake configuration details, see the CMakeLists.txt in each developer example.

Available operators

The following operators are available as source code and prebuilt binaries:

| Operator | Description | Source code | Binary |
| --- | --- | --- | --- |
| EditDistance | Edit distance between two text inputs | Download | Download |
| SeqExpr | Sequence expression evaluation | Download | Download |
| BPETokenize | Byte Pair Encoding (BPE) tokenization | Download | Included in built-in tokenize_feature |

EditDistance configuration

| Parameter | Options | Default |
| --- | --- | --- |
| encoding | utf-8, latin | latin |
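The encoding option matters for any text containing multi-byte characters. The sketch below is a simplified, standalone variant of the logic in the developer example, not the shipped operator: it compares raw bytes in one case and decoded code points in the other. Levenshtein and DecodeUtf8 are illustrative names, and the decoder does not validate continuation bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Levenshtein distance over any indexable sequence of symbols.
template <class T>
int Levenshtein(const T& s1, const T& s2) {
  const int l1 = static_cast<int>(s1.size());
  const int l2 = static_cast<int>(s2.size());
  std::vector<int> prev(l2 + 1), curr(l2 + 1);
  std::iota(prev.begin(), prev.end(), 0);  // row 0: distance from empty prefix
  for (int i = 1; i <= l1; ++i) {
    curr[0] = i;
    for (int j = 1; j <= l2; ++j) {
      const int substitution = prev[j - 1] + (s1[i - 1] == s2[j - 1] ? 0 : 1);
      curr[j] = std::min({substitution, prev[j] + 1, curr[j - 1] + 1});
    }
    prev.swap(curr);
  }
  return prev[l2];
}

// Decodes UTF-8 into code points; invalid lead bytes are skipped and a
// truncated trailing sequence is dropped.
std::u32string DecodeUtf8(const std::string& s) {
  std::u32string out;
  std::size_t i = 0;
  while (i < s.size()) {
    const unsigned char c = static_cast<unsigned char>(s[i]);
    std::size_t len = 0;
    char32_t cp = 0;
    if (c < 0x80) { cp = c; len = 1; }
    else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; len = 2; }
    else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; len = 3; }
    else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; len = 4; }
    else { ++i; continue; }
    if (i + len > s.size()) break;
    for (std::size_t j = 1; j < len; ++j) {
      cp = (cp << 6) | (static_cast<unsigned char>(s[i + j]) & 0x3F);
    }
    out.push_back(cp);
    i += len;
  }
  return out;
}
```

Comparing "café" (with é encoded as the two bytes C3 A9) against "cafe" gives a distance of 2 on raw bytes but 1 on decoded code points, which is the difference between the latin and utf-8 settings.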

For a BatchProcess example, download and review RegexReplace.

What's next