高度データ分析のためのウィンドウ関数の理解 - AnalyticDB for MySQL

ウィンドウ関数を使用して、グループランキング、移動平均、累積合計などの複雑な計算を実行できます。このトピックでは、ウィンドウ関数の構文について説明し、AnalyticDB for MySQLでウィンドウ関数を使用する方法の例を示します。

集計関数
ソート関数
- CUME_DIST: 値のセット内の各値の累積分布を返します。
- RANK: データセット内の各値のランクを返します。
- DENSE_RANK: 値のセット内の各値のランクを返します。
- NTILE: 各ウィンドウパーティション内のデータをn個のバケットに分散します。バケツは1からnまで番号が付けられています。
- ROW_NUMBER: 1から始まる、ウィンドウパーティション内の行のシーケンスに基づいて、各行の一意のシーケンシャル番号を返します。
- PERCENT_RANK: データセット内の各値のランキング率を (r - 1)/(n - 1) 形式で返します。 rは、rank () 関数を使用して計算された現在の行のランクであり、nは、現在のウィンドウパーティション内の行の総数である。
値関数
- FIRST_VALUE: ウィンドウパーティション内の最初の行の値を返します。
- LAST_VALUE: ウィンドウパーティション内の最後の行の値を返します。
- LAG: ウィンドウ内のオフセット行によって現在の行の前にある行の値を返します。
- LEAD: ウィンドウ内の行をオフセットして、現在の行に続く行の値を返します。
- NTH_VALUE: ウィンドウ内の指定されたオフセット行数だけオフセットされた行の値を返します。オフセットは1から始まります。

概要

ウィンドウ関数は、クエリ結果からの行データに基づいて集計値を計算します。ウィンドウ関数は、HAVING句の後およびORDER BY句の前に実行されます。 OVER句を使用してウィンドウを指定すると、ウィンドウ関数がトリガーされます。

AnalyticDB for MySQLは、集計関数、ソート関数、値関数の3種類のウィンドウ関数をサポートしています。

構文

function over ([partition by a] order by b RANGE|ROWS BETWEEN start AND end)

ウィンドウ関数には、次の部分が含まれます。

パーティションルール: 入力行を異なるパーティションに分割します。このプロセスは、GROUP BY句のグループ化プロセスと同様です。パーティションルール部分はオプションです。
ソートルール: 入力行がウィンドウ関数で実行される順序を決定します。
ウィンドウフレーム: ウィンドウ関数が計算を実行するデータの境界を指定します。
ウィンドウフレームは、RANGEおよびROWSモードをサポートします。
- RANGEは列値の範囲を定義します。
- ROWSは、現在の行に対する相対行数を定義します。
- RANGEおよびROWSでは、BETWEEN start and endを使用して境界値を指定できます。 BETWEEN start AND endの引数の有効な値:
  - 現在の行: 現在の行。
  - N PRECEDING: 前のn行。
  - UNBOUNDED PRECEDING: 最初の行から現在の行までの行。
  - N FOLLOWING: 次のn行。
  - UNBOUNDED FOLLOWING: 現在の行から最後の行までの行。

たとえば、次のクエリでは、現在のウィンドウのデータの各行に基づいて利益の部分合計を計算します。

select year,country,profit,sum(profit) over (partition by country order by year ROWS BETWEEN UNBOUNDED PRECEDING and CURRENT ROW) as slidewindow from testwindow;
+------+---------+--------+-------------+
| year | country | profit | slidewindow |
+------+---------+--------+-------------+
| 2001 | USA     |     50 |          50 |
| 2001 | USA     |   1500 |        1550 |
| 2000 | Germany |     75 |          75 |
| 2000 | Germany |     75 |         150 |
| 2001 | Germany |     79 |         229 |
| 2000 | Finland |   1500 |        1500 |
| 2001 | Finland |     10 |        1510 |

次のクエリでは、利益の合計のみを計算できます。

select country,sum(profit) over (partition by country) from testwindow;
+---------+-----------------------------------------+
| country | sum(profit) OVER (PARTITION BY country) |
+---------+-----------------------------------------+
| Germany |                                     229 |
| Germany |                                     229 |
| Germany |                                     229 |
| USA     |                                    1550 |
| USA     |                                    1550 |
| Finland |                                    1510 |
| Finland |                                    1510 |

使用上の注意

境界値が次の要件を満たしていることを確認します。

startはUNBOUNDED FOLLOWINGできません。それ以外の場合、Window frame start cannot be UNBOUNDED FOLLOWINGエラーが返されます。
endをUNBOUNDED PRECEDINGにすることはできません。それ以外の場合は、Window frame end cannot be UNBOUNDED PRECEDINGエラーが返されます。
startがCURRENT ROW、endがN PRECEDINGの場合、Window frame starting from CURRENT ROW cannot end with PRECEDINGエラーが返されます。
startがN FOLLOWING、endがN PRECEDINGの場合、Window frame starting from FOLLOWING cannot end with PRECEDINGエラーが返されます。
startがN FOLLOWINGで、endがCURRENT ROWの場合、Window frame starting from FOLLOWING cannot end with CURRENT ROWエラーが返されます。

ウィンドウフレームがRANGEモードの場合、次のルールが適用されます。

startまたはendがN PRECEDINGの場合、Window frame RANGE PRECEDING is supported only with UNBOUNDEDエラーが返されます。
startまたはendがN FOLLOWINGの場合、Window frame RANGE FOLLOWING is supported only with UNBOUNDEDエラーが返されます。

準備

このトピックでは、testwindowテーブルのデータをウィンドウ関数の例で使用します。

create table testwindow(year int, country varchar(20), product varchar(20), profit int) distributed by hash(year);

insert into testwindow values (2000,'Finland','Computer',1500);
insert into testwindow values (2001,'Finland','Phone',10);
insert into testwindow values (2000,'Germany','Calculator',75);
insert into testwindow values (2000,'Germany','Calculator',75);
insert into testwindow values (2001,'Germany','Calculator',79);
insert into testwindow values (2001,'USA','Calculator',50);
insert into testwindow values (2001,'USA','Computer',1500);

SELECT * FROM testwindow;
+------+---------+------------+--------+
| year | country | product    | profit |
+------+---------+------------+--------+
| 2000 | Finland | Computer   |   1500 |
| 2001 | Finland | Phone      |     10 |
| 2000 | Germany | Calculator |     75 |
| 2000 | Germany | Calculator |     75 |
| 2001 | Germany | Calculator |     79 |
| 2001 | USA     | Calculator |     50 |
| 2001 | USA     | Computer   |   1500 |

集計関数

OVER句を追加することで、すべての集計関数をウィンドウ関数として使用できます。集計関数は、現在のスライディングウィンドウ内の行に基づいてデータの各行を計算します。詳細については、「集計関数」をご参照ください。

たとえば、次のクエリでは、各クラークの日付ごとの注文価格のローリング合計が生成されます。

SELECT clerk, orderdate, orderkey, totalprice,sum(totalprice) OVER (PARTITION BY clerk ORDER BY orderdate) AS rolling_sum FROM orders ORDER BY clerk, orderdate, orderkey

CUME_DIST

CUME_DIST()

説明: この関数は、値のセット内の各値の累積分布を返します。
Return result: 現在の行と現在の行の前のデータ行の数を含む、ウィンドウパーティション内でソートした後に取得されたデータセット。ソートにおける任意の関連する値は、同じ分布値に対して計算される。
戻り値のデータ型: DOUBLE。

例：

select year,country,product,profit,cume_dist() over (partition by country order by profit) as cume_dist from testwindow;
+------+---------+------------+--------+--------------------+
| year | country | product    | profit | cume_dist          |
+------+---------+------------+--------+--------------------+
| 2001 | USA     | Calculator |     50 |                0.5 |
| 2001 | USA     | Computer   |   1500 |                1.0 |
| 2001 | Finland | Phone      |     10 |                0.5 |
| 2000 | Finland | Computer   |   1500 |                1.0 |
| 2000 | Germany | Calculator |     75 | 0.6666666666666666 |
| 2000 | Germany | Calculator |     75 | 0.6666666666666666 |
| 2001 | Germany | Calculator |     79 |                1.0 |

RANK

RANK()

説明: この関数は、データセット内の各値のランクを返します。
rank値は、現在の行の前の行数に1を加えたものです。現在の行はカウントされません。したがって、ソーティングにおける関連する値は、シーケンス内にギャップを生成し得る。ランクは、各ウィンドウ区画に対して計算される。
戻り値のデータ型: BIGINT。

例：

select year,country,product,profit,rank() over (partition by country order by profit) as rank from testwindow;
+------+---------+------------+--------+------+
| year | country | product    | profit | rank |
+------+---------+------------+--------+------+
| 2001 | Finland | Phone      |     10 |    1 |
| 2000 | Finland | Computer   |   1500 |    2 |
| 2001 | USA     | Calculator |     50 |    1 |
| 2001 | USA     | Computer   |   1500 |    2 |
| 2000 | Germany | Calculator |     75 |    1 |
| 2000 | Germany | Calculator |     75 |    1 |
| 2001 | Germany | Calculator |     79 |    3 |

DENSE_RANK

DENSE_RANK()

説明: この関数は、値のセット内の各値のランクを返します。
DENSE_RANK() とRANK() には同様の特徴がありますが、DENSE_RANK() の関連する値はシーケンスにギャップを生成しません。
戻り値のデータ型: BIGINT。

例：

select year,country,product,profit,dense_rank() over (partition by country order by profit) as dense_rank from testwindow;
+------+---------+------------+--------+------------+
| year | country | product    | profit | dense_rank |
+------+---------+------------+--------+------------+
| 2001 | Finland | Phone      |     10 |          1 |
| 2000 | Finland | Computer   |   1500 |          2 |
| 2001 | USA     | Calculator |     50 |          1 |
| 2001 | USA     | Computer   |   1500 |          2 |
| 2000 | Germany | Calculator |     75 |          1 |
| 2000 | Germany | Calculator |     75 |          1 |
| 2001 | Germany | Calculator |     79 |          2 |

NTILE

NTILE(n)

説明: この関数は、各ウィンドウパーティション内のデータをn個のバケットに分散します。バケツは1からnまで番号が付けられています。
バケット番号の最大差は1です。ウィンドウパーティション内のデータ行が各バケットに均等に分散されていない場合、残りのデータは最初のバケットから分散され、各バケットに1つのデータ行が含まれます。例えば、6つの行と4つのバケットが存在する場合、行は、1、1、2、2、3、および4の方法でバケットに分配される。
戻り値のデータ型: BIGINT。

例：

select year,country,product,profit,ntile(2) over (partition by country order by profit) as ntile2 from testwindow;
+------+---------+------------+--------+--------+
| year | country | product    | profit | ntile2 |
+------+---------+------------+--------+--------+
| 2001 | USA     | Calculator |     50 |      1 |
| 2001 | USA     | Computer   |   1500 |      2 |
| 2001 | Finland | Phone      |     10 |      1 |
| 2000 | Finland | Computer   |   1500 |      2 |
| 2000 | Germany | Calculator |     75 |      1 |
| 2000 | Germany | Calculator |     75 |      1 |
| 2001 | Germany | Calculator |     79 |      2 |

ROW_NUMBER

ROW_NUMBER()

説明: この関数は、1から始まるウィンドウパーティション内の行のシーケンスに基づいて、各行に一意のシーケンシャル番号を返します。
戻り値のデータ型: BIGINT。

例：

SELECT year, country, product, profit, ROW_NUMBER() OVER(PARTITION BY country) AS row_num1 FROM testwindow;
+------+---------+------------+--------+----------+
| year | country | product    | profit | row_num1 |
+------+---------+------------+--------+----------+
| 2001 | USA     | Calculator |     50 |        1 |
| 2001 | USA     | Computer   |   1500 |        2 |
| 2000 | Germany | Calculator |     75 |        1 |
| 2000 | Germany | Calculator |     75 |        2 |
| 2001 | Germany | Calculator |     79 |        3 |
| 2000 | Finland | Computer   |   1500 |        1 |
| 2001 | Finland | Phone      |     10 |        2 |

PERCENT_RANK

PERCENT_RANK()

説明: この関数は、データセット内の各値のランキング率を (r - 1)/(n - 1) 形式で返します。 rは、rank () によって計算された現在の行のランクであり、nは、現在のウィンドウパーティション内の行の総数である。
戻り値のデータ型: DOUBLE。

例：

select year,country,product,profit,PERCENT_RANK() over (partition by country order by profit) as ntile3 from testwindow;
+------+---------+------------+--------+--------+
| year | country | product    | profit | ntile3 |
+------+---------+------------+--------+--------+
| 2001 | Finland | Phone      |     10 |    0.0 |
| 2000 | Finland | Computer   |   1500 |    1.0 |
| 2001 | USA     | Calculator |     50 |    0.0 |
| 2001 | USA     | Computer   |   1500 |    1.0 |
| 2000 | Germany | Calculator |     75 |    0.0 |
| 2000 | Germany | Calculator |     75 |    0.0 |
| 2001 | Germany | Calculator |     79 |    1.0 |

FIRST_VALUE

FIRST_VALUE(x)

説明: この関数は、ウィンドウパーティション内の最初の行の値を返します。
戻り値のデータ型: 入力引数型と同じ。

例：

select year,country,product,profit,first_value(profit) over (partition by country order by profit) as firstValue from testwindow;
+------+---------+------------+--------+------------+
| year | country | product    | profit | firstValue |
+------+---------+------------+--------+------------+
| 2000 | Germany | Calculator |     75 |         75 |
| 2000 | Germany | Calculator |     75 |         75 |
| 2001 | Germany | Calculator |     79 |         75 |
| 2001 | USA     | Calculator |     50 |         50 |
| 2001 | USA     | Computer   |   1500 |         50 |
| 2001 | Finland | Phone      |     10 |         10 |
| 2000 | Finland | Computer   |   1500 |         10 |

LAST_VALUE

LAST_VALUE(x)

説明: この関数は、ウィンドウパーティション内の最後の行の値を返します。 LAST_VALUEのデフォルトのウィンドウフレームは、先行行と現在行の間の行で、現在行のデータと先行行のデータを比較します。 LAST_VALUEが最後の行の値を返す場合は、ORDER BY句の後に次の句を追加します。
戻り値のデータ型: 入力引数型と同じ。

例 1：

select year,country,product,profit,last_value(profit) over (partition by country order by profit) as firstValue from testwindow;
+----------------+-------------------+-------------------+------------------+----------------------+
| year           | country           | product           | profit           | firstValue           |
+----------------+-------------------+-------------------+------------------+----------------------+
|           2001 | USA               | Calculator        |               50 |                   50 |
|           2001 | USA               | Computer          |             1500 |                 1500 |
|           2001 | Finland           | Phone             |               10 |                   10 |
|           2000 | Finland           | Computer          |             1500 |                 1500 |
|           2000 | Germany           | Calculator        |               75 |                   75 |
|           2000 | Germany           | Calculator        |               75 |                   75 |
|           2001 | Germany           | Calculator        |               79 |                   79 |

例 2：

select year,country,product,profit,last_value(profit) over (partition by country order by profitrows between unbounded preceding and unbounded following) as lastValue from testwindow;
+------+---------+-----------+--------+-----------+
| year | country | product   | profit | lastValue |
+------+---------+-----------+--------+-----------+
| 2001 | Finland | Phone     |   10   |   1500    |
| 2000 | Finland | Computer  |  1500  |   1500    |
| 2000 | Germany | Calculator|   75   |    79     |
| 2000 | Germany | Calculator|   75   |    79     |
| 2001 | Germany | Calculator|   79   |    79     |
| 2001 | USA     | Calculator|   50   |   1500    |
| 2001 | USA     | Computer  |  1500  |   1500    |
+------+---------+-----------+--------+-----------+

LAG

LAG(x[, offset[, default_value]])

説明: この関数は、ウィンドウ内の現在の行の前にある行の値をoffsetだけ返します。
開始オフセット値は0で、現在のデータ行を指定します。オフセット値は、スカラー式とすることができる。デフォルトのオフセット値は1です。
オフセット値がnullまたはウィンドウの長さより大きい場合、default_valueが返されます。 default_valueを指定しない場合は、nullが返されます。
戻り値のデータ型: 入力引数型と同じ。

例：

select year,country,product,profit,lag(profit) over (partition by country order by profit) as lag from testwindow;
+------+---------+------------+--------+------+
| year | country | product    | profit | lag  |
+------+---------+------------+--------+------+
| 2001 | USA     | Calculator |     50 | NULL |
| 2001 | USA     | Computer   |   1500 |   50 |
| 2000 | Germany | Calculator |     75 | NULL |
| 2000 | Germany | Calculator |     75 |   75 |
| 2001 | Germany | Calculator |     79 |   75 |
| 2001 | Finland | Phone      |     10 | NULL |
| 2000 | Finland | Computer   |   1500 |   10 |

LEAD

LEAD(x[,offset[, default_value]])

説明: この関数は、ウィンドウ内の現在の行に続く行の値をオフセットして返します。
開始オフセット値は0で、現在のデータ行を指定します。オフセット値は、スカラー式とすることができる。デフォルトのオフセット値は1です。
オフセット値がnullまたはウィンドウの長さより大きい場合、default_valueが返されます。 default_valueを指定しない場合は、nullが返されます。
戻り値のデータ型: 入力引数型と同じ。

例：

select year,country,product,profit,lead(profit) over (partition by country order by profit) as lead from testwindow;
+------+---------+------------+--------+------+
| year | country | product    | profit | lead |
+------+---------+------------+--------+------+
| 2000 | Germany | Calculator |     75 |   75 |
| 2000 | Germany | Calculator |     75 |   79 |
| 2001 | Germany | Calculator |     79 | NULL |
| 2001 | Finland | Phone      |     10 | 1500 |
| 2000 | Finland | Computer   |   1500 | NULL |
| 2001 | USA     | Calculator |     50 | 1500 |
| 2001 | USA     | Computer   |   1500 | NULL |

NTH_VALUE

NTH_VALUE(x, offset)

説明: この関数は、ウィンドウ内で指定された数のオフセット行だけオフセットされた行の値を返します。オフセットは1から始まります。
offset値がnullまたはウィンドウ内の値の数より大きい場合、nullが返されます。 オフセット値が0または負の場合、エラーが返されます。
戻り値のデータ型: 入力引数型と同じ。

例：

select year,country,product,profit,nth_value(profit,1) over (partition by country order by profit) as nth_value from testwindow;
+------+---------+------------+--------+-----------+
| year | country | product    | profit | nth_value |
+------+---------+------------+--------+-----------+
| 2001 | Finland | Phone      |     10 |        10 |
| 2000 | Finland | Computer   |   1500 |        10 |
| 2001 | USA     | Calculator |     50 |        50 |
| 2001 | USA     | Computer   |   1500 |        50 |
| 2000 | Germany | Calculator |     75 |        75 |
| 2000 | Germany | Calculator |     75 |        75 |
| 2001 | Germany | Calculator |     79 |        75 |