doc-stats data statistics node - CloudMonitor

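The doc-stats node computes several document-level statistics (character count, word count, line count, and so on) for a specified text field and aggregates them into a single JSON column. Each record is processed independently, and the number of rows does not change.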

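Use cases

Data quality checks: filter out abnormal records whose text is too short or too long.
Pipeline run statistics: understand the text length distribution at each processing stage.
Dataset metadata: attach standardized text statistics to every record.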

Node configuration

Add a node of type doc-stats to the Pipeline JSON. A sample configuration:

{
  "id": "node_1",
  "type": "doc-stats",
  "parameters": {
    "field": "<field name>",
    "as": "<output column name>",
    "output": "<output column list>"
  }
}

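Parameters

Parameter	Type	Required	Default	Description
`field`	String	Yes	-	Name of the text field to compute statistics for.
`as`	String	No	`__doc_stats`	Name of the output statistics column.
`output`	String	No	`*`	Output columns of the node; separate multiple columns with commas. `*` (default) keeps all columns, including extended columns; when columns are listed, only those columns are output.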

Input and output

Input requirements

Any columns output by the upstream node.
The input must contain the field named by the field parameter.

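Output columns

Column	Type	Source	Description
Columns listed in `output`	-	Pass-through	`*` keeps all original input columns.
`{as}`	JSON	Added	Document statistics JSON containing the character count, word count, line count, and other metrics.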

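Statistics (internal JSON structure)

Key	Type	Description
`doc_len_char`	bigint	Number of characters in the text.
`doc_len_words`	bigint	Number of words in the text (split on spaces, so text without spaces counts as 1 word).
`line_counts`	bigint	Number of lines in the text.

Sample output: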

{"doc_len_char": 42, "doc_len_words": 8, "line_counts": 1}
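
The three metrics can be approximated outside the Pipeline with a short Python sketch. It assumes the split-based semantics described in this document (words separated by a single space, lines separated by a line feed, NULL treated as an empty string); the function name `doc_stats` is hypothetical and is not part of the product API.

```python
import json
from typing import Optional


def doc_stats(text: Optional[str]) -> dict:
    """Approximate the doc-stats metrics for one record (illustrative only)."""
    value = text if text is not None else ""  # NULL is treated as an empty string
    return {
        "doc_len_char": len(value),              # number of Unicode code points
        "doc_len_words": len(value.split(" ")),  # split on a single space
        "line_counts": len(value.split("\n")),   # split on line feed
    }


# Serialize the result the same way the node emits it: one JSON object per record.
print(json.dumps(doc_stats("AI")))
```

Because the word count is a plain split on spaces, consecutive spaces yield empty tokens that are still counted, and an empty string still counts as one word and one line.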

Row count change

M to N (M = N): a 1:1 transformation; no rows are added or removed.

Preview

Before processing (3 records):

question	input	output
什么是机器学习？	请解释	机器学习是人工智能的一个分支
如何学习Python？请推荐一些入门资源	入门	推荐从官方教程开始然后做项目
AI	简述	人工智能

After processing (3 records, field = "question"):

question	input	output	__doc_stats
什么是机器学习？	请解释	机器学习是...	`{"doc_len_char":8,"doc_len_words":1,"line_counts":1}`
如何学习Python？请推荐...	入门	推荐从...	`{"doc_len_char":20,"doc_len_words":1,"line_counts":1}`
AI	简述	人工智能	`{"doc_len_char":2,"doc_len_words":1,"line_counts":1}`

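The row count is unchanged (3 to 3), and each row gains a statistics JSON column. Combine with a where node to filter out text that is too short or too long by character count.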

Examples

Example 1: basic document statistics

Compute statistics for the question field; the output column uses the default name __doc_stats.

{
  "id": "n5",
  "type": "doc-stats",
  "parameters": {
    "field": "question"
  }
}

Example 2: custom output column name

Use the as parameter to name the output column question_stats.

{
  "id": "n5",
  "type": "doc-stats",
  "parameters": {
    "field": "question",
    "as": "question_stats"
  }
}

Example 3: combining with a filter

First compute the statistics with doc-stats, then use a where node to filter out text that is too short.

{
  "nodes": [
    { "id": "n1", "type": "project", "parameters": { "question": "a", "output": "c" } },
    { "id": "n2", "type": "doc-stats", "parameters": { "field": "question" } },
    { "id": "n3", "type": "where", "parameters": { "filter": "cast(json_extract_scalar(__doc_stats, '$.doc_len_char') as bigint) > 10" } }
  ]
}
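
When filtering on an extracted metric, the comparison should be numeric. The Python sketch below emulates the filter step under the assumption that json_extract_scalar returns its value as a string, as it does in Presto-style SQL. The helper and the sample rows are hypothetical stand-ins, not product code.

```python
import json


def json_extract_scalar(doc: str, path: str) -> str:
    """Hypothetical stand-in: supports only top-level paths such as '$.key'."""
    assert path.startswith("$."), "only '$.key' paths are supported in this sketch"
    return str(json.loads(doc)[path[2:]])


# Illustrative rows, as they might look after a doc-stats node.
rows = [
    {"question": "AI", "__doc_stats": '{"doc_len_char": 2, "doc_len_words": 1, "line_counts": 1}'},
    {"question": "What is ML?", "__doc_stats": '{"doc_len_char": 11, "doc_len_words": 3, "line_counts": 1}'},
    {"question": "Why?", "__doc_stats": '{"doc_len_char": 4, "doc_len_words": 1, "line_counts": 1}'},
]

# Numeric comparison: cast the extracted string to an integer first.
kept_numeric = [
    r["question"]
    for r in rows
    if int(json_extract_scalar(r["__doc_stats"], "$.doc_len_char")) > 10
]

# String comparison is lexicographic, so "2" > "10" and "4" > "10" both hold.
kept_as_string = [
    r["question"]
    for r in rows
    if json_extract_scalar(r["__doc_stats"], "$.doc_len_char") > "10"
]

print(kept_numeric)
print(kept_as_string)
```

In this sketch the numeric filter keeps only the 11-character question, while the string comparison keeps every row, which is why a cast to a numeric type is the safer form for length thresholds.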

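Recommendations

Combine with a where node for data quality filtering: compute the statistics first, then filter out text that is too short or too long by metric.
Use at the end of a Pipeline to attach standardized text metadata to the Dataset output.
Compute statistics on output text after LLM processing to support quality monitoring.
Use json_extract_scalar to extract a specific metric, for example json_extract_scalar(__doc_stats, '$.doc_len_char').
Use the as parameter to customize the output column name (default __doc_stats), which makes it easy to compute statistics for several fields separately.
This node does not depend on remote functions and has very low compute overhead, so it can be used anywhere in a Pipeline.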
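
As a sketch of per-field statistics, the Python snippet below generates one doc-stats node per field, each with its own as column. The helper `make_doc_stats_node`, the `<field>_stats` naming scheme, and the choice of fields are assumptions made for illustration; only the node shape (id, type, parameters) and the nodes wrapper come from the examples in this document.

```python
import json
from typing import Optional


def make_doc_stats_node(node_id: str, field: str, as_name: Optional[str] = None) -> dict:
    """Build a doc-stats node config; omit 'as' to fall back to the default column name."""
    parameters = {"field": field}
    if as_name is not None:
        parameters["as"] = as_name
    return {"id": node_id, "type": "doc-stats", "parameters": parameters}


# One node per field, each writing to a distinct column (hypothetical naming scheme).
fields = ["question", "output"]
nodes = [
    make_doc_stats_node(f"n{i}", name, f"{name}_stats")
    for i, name in enumerate(fields, start=1)
]
pipeline = {"nodes": nodes}

print(json.dumps(pipeline, indent=2))
```

Giving each node a distinct as value keeps the statistics columns from sharing the default __doc_stats name.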

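Error handling

Scenario	Behavior
`field` is missing	Validation fails.
`field` value is NULL	The statistics are `{"doc_len_char": 0, "doc_len_words": 1, "line_counts": 1}`.
Text is an empty string	Same result as NULL, because NULL is coalesced to an empty string before the statistics are computed.
Input data is empty	An empty result set is returned normally.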

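Implementation notes

doc-stats is not packaged as a standalone operator at the SPL layer; during API translation it is emitted directly as an equivalent extend expression.

API → SPL translation example: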

{ "field": "question", "as": "question_stats" }

↓ translates to:

extend question_stats=cast(map(array['doc_len_char','doc_len_words','line_counts'],array[cast(length(coalesce(question,'')) as bigint),cast(cardinality(split(coalesce(question,''),' ')) as bigint),cast(cardinality(split(coalesce(question,''),chr(10))) as bigint)]) as json)
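
The translation can be sketched as simple string templating. The Python function below reproduces the extend expression shown above for a given field and as value; the function name `doc_stats_to_spl` is hypothetical, and the sketch does no escaping or validation of field names, which a real translator would presumably need.

```python
def doc_stats_to_spl(field: str, as_name: str = "__doc_stats") -> str:
    """Render the extend expression for a doc-stats node (illustrative only)."""
    text = f"coalesce({field},'')"  # NULL becomes an empty string
    metrics = [
        ("doc_len_char", f"length({text})"),
        ("doc_len_words", f"cardinality(split({text},' '))"),
        ("line_counts", f"cardinality(split({text},chr(10)))"),
    ]
    keys = ",".join(f"'{key}'" for key, _ in metrics)
    values = ",".join(f"cast({expr} as bigint)" for _, expr in metrics)
    return f"extend {as_name}=cast(map(array[{keys}],array[{values}]) as json)"


print(doc_stats_to_spl("question", "question_stats"))
```

Keeping the metric names and expressions in one list makes it clear that the keys array and the values array of the map stay aligned.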

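Related nodes

Node	Relationship
`where`	After computing statistics, use `where` to filter by metric.
`llm-call`	Compute quality metrics on output text after AI processing.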