Window functions - syntax and examples - Simple Log Service

Introduction

Unlike aggregate functions that collapse rows into a single result, window functions compute a result for each row based on a related set of rows. Window functions have three core elements: partition, order, and frame (Window Function Concepts and Syntax).

function over (
    [partition by partition_expression]
    [order by order_expression]
    [frame]
)

Partition: The partition by clause divides rows into partitions. If this clause is omitted, the entire result set is treated as a single partition.
Order: The order by clause sorts the rows within each partition.

Note
Using the order by clause on columns with duplicate values results in a non-deterministic row order. To ensure a consistent sort order, specify multiple columns. For example, order by request_time, request_method.
Frame: Restricts rows within a partition. Cannot be used with ranking functions. Syntax: { rows | range} { frame_start | frame_between }. Example: range between unbounded preceding and unbounded following. Window Function Frame Specification.

Functions

Category	Function	Syntax	Description	SQL	SPL
Aggregate functions	Aggregate functions	N/A	All aggregate functions can be used as window functions.	√	×
Ranking functions	cume_dist function	`cume_dist()`	Calculates the cumulative distribution of a value within a partition, which is the fraction of rows with values less than or equal to the current row's value. The return value is in the range (0, 1].	√	×
	dense_rank function	`dense_rank()`	Calculates the rank of a value in a partition. Ties receive the same rank. Ranks are consecutive. For example, if two rows are ranked 1, the next rank is 2.	√	×
	ntile function	ntile(n)	Divides the ordered rows in a partition into a specified number of groups, n.	√	×
	percent_rank function	`percent_rank()`	Calculates the percentage rank of each row within a partition.	√	×
	rank function	`rank()`	Calculates the rank of a value in a partition. Ties receive the same rank. Ranks are not consecutive. For example, if two rows are ranked 1, the next rank is 3.	√	×
	row_number function	`row_number()`	Assigns a unique, sequential integer to each row in a partition, starting from 1. For example, three rows with the same value are ranked 1, 2, and 3.	√	×
Offset functions	first_value function	first_value(x)	Returns the value of x from the first row of a partition.	√	×
	last_value function	last_value(x)	Returns the value of x from the last row of a partition.	√	×
	lag function	lag(x, offset, default_value)	Returns the value of the row that is offset rows above the current row in the window partition. If the row does not exist, default_value is returned.	√	×
	lead function	lead(x, offset, default_value)	Returns the value of the row that is offset rows after the current row in the window partition. If the row does not exist, it returns default_value.	√	×
	nth_value function	nth_value(x, offset)	Returns the value of the row at the offset position in the window partition.	√	×

Aggregate functions

All aggregate functions can serve as window functions. The following example uses sum().

Syntax

sum() over (
    [partition by partition_expression]
    [order by order_expression]
    [frame]
)

Parameters

Parameter	Description
partition by partition_expression	Divides rows into partitions based on the specified partition expression.
order by order_expression	Sorts rows within each partition based on the specified order expression.
frame	Window frame, such as `range between unbounded preceding and unbounded following`.

Return value type

double

Examples

Calculates each employee's salary as a proportion of their department total.

Query statement

* |
SELECT
  department,
  staff_name,
  salary,
  round ( salary * 1.0 / sum(salary) over(partition by department), 3) AS salary_percentage

Query results: The results show data for four employees in the dev department and three in the Marketing department. The salary_percentage column shows each employee's salary as a fraction of their department total.

Cume_dist function

Calculates the cumulative distribution of a value within a partition: the ratio of rows with values less than or equal to the current row's value to total rows. Returns a value in the range (0, 1].

Syntax

cume_dist() over (
    [partition by partition_expression]
    [order by order_expression]
)

Parameters

Parameter	Description
partition by partition_expression	Partitions the rows based on the specified partition expression.
order by order_expression	Sorts the rows within each partition based on the specified order expression.

Return value type

double

Examples

Calculates the cumulative distribution of object sizes in OSS bucket bucket00788.

Query statement

bucket=bucket00788 |
select
  object,
  object_size,
  cume_dist() over (
    partition by object
    order by
      object_size
  ) as cume_dist
from  oss-log-store

Dense_rank function

Returns the rank of a value within a partition. Ties receive the same rank, and ranks are consecutive without gaps. For example, if two rows share rank 1, the next rank is 2.

Syntax

dense_rank() over (
    [partition by partition_expression]
    [order by order_expression]
)

Parameters

Parameter	Description
partition by partition_expression	Divides rows into partitions based on the partition expression.
order by order_expression	Sorts rows within each partition based on the order expression.

Return value type

bigint

Examples

Calculates salary rank by department.

Query and analysis statement

* |
select
  department,
  staff_name,
  salary,
  dense_rank() over(
    partition by department
    order by
      salary desc
  ) as salary_rank
order by
  department,
  salary_rank

Query and analysis results: In Marketing, Blan Stark and Smith share rank 1 (salary 9000); Achilles is rank 2 (8000). In dev, Rob is rank 1 (9000), Blan rank 2 (8500), Sansa rank 3 (8000).

Ntile function

Divides ordered rows in a partition into a specified number of groups.

Syntax

ntile(n) over (
    [partition by partition_expression]
    [order by order_expression]
)

Parameters

Parameter	Description
n	The number of groups to divide the rows into.
partition by partition_expression	Divides the rows into partitions using the partition expression.
order by order_expression	Sorts the rows within each partition by the order expression.

Return value type

bigint

Examples

Divides data for a specified object into three groups.

Query statement

object=245-da918c.model |
select
  object,
  object_size,
  ntile(3) over (
    partition by object
    order by
      object_size
  ) as ntile
from  oss-log-store

Query and analysis results: The query returns nine rows. The values for object_size, in ascending order, are 3396, 3701, 3750, 3757, 3914, 3918, 7440, 7490, and 7521. The corresponding values in the ntile column are 1, 1, 1, 2, 2, 2, 3, 3, and 3. As a result, ntile(3) divides the nine rows into three equal groups of three.

percent_rank function

Calculates the percent rank of each row in a partition. Formula: (rank - 1) / (total_rows - 1), where rank is the current row's rank and total_rows is the partition size.

Syntax

percent_rank() over (
    [partition by partition_expression]
    [order by order_expression]
)

Parameters

Parameter	Description
partition by partition_expression	Partitions the rows based on the given expression.
order by order_expression	Sorts the rows within each partition by the given expression.

Return value type

double

Examples

Calculates the percent rank of an OSS object by size.

Query statement

object=245-da918c3e2dd9dc9cb4d9283b%2F555e2441b6a4c7f094099a6dba8e7a5f.model|
select
  object,
  object_size,
 percent_rank() over (
    partition by object
    order by
      object_size
  ) as ntile
FROM  oss-log-store

Query and analysis results: Six rows sorted by object_size. The ntile column shows percent rank values distributed evenly from 0.0 to 1.0.

Rank function

Returns the rank of each row within a partition. Ties receive the same rank, creating gaps in the sequence. For example, if two rows share rank 1, the next rank is 3.

Syntax

rank() over (
    [partition by partition_expression]
    [order by order_expression]
)

Parameters

Parameter	Description
partition by partition_expression	Divides rows into partitions based on the specified partition expression.
order by order_expression	Sorts rows within each partition based on the specified order expression.

Return value type

bigint

Examples

Ranks employees by salary within each department.

Query and analysis statement

* |
select
  department,
  staff_name,
  salary,
  rank() over(
    partition by department
    order by
      salary desc
  ) as salary_rank
order by
  department,
  salary_rank

Query and analysis results: In Marketing, two employees share rank 1 (salary 9000). The next rank is 3, demonstrating how ties create gaps.

row_number function

Assigns a unique sequential integer to each row within a partition, starting from 1.

Syntax

row_number() over (
    [partition by partition_expression]
    [order by order_expression]
)

Parameters

Parameter	Description
partition by partition_expression	Divides the rows into partitions based on the specified partition expression.
order by order_expression	Sorts the rows within each partition based on the specified order expression.

Return value type

bigint.

Examples

Ranks employees by salary within each department.

Query

* |
select
  department,
  staff_name,
  salary,
  row_number() over(
    partition by department
    order by
      salary desc
  ) as salary_rank
order by
  department,
  salary_rank

Six rows: In Marketing, Blan Stark (9000) is row 1, Smith (9000) row 2, Achilles (8000) row 3. In dev, Rob (9000) is row 1, Blan (8500) row 2, Sansa (8000) row 3.

First_value function

Returns the value of the first row in each partition.

Syntax

first_value(x) over (
    [partition by partition_expression]
    [order by order_expression]
    [frame]
)

Parameters

Parameter	Description
x	The column name. It can be of any data type.
partition by partition_expression	Divides rows into window partitions based on the partition expression.
order by order_expression	Sorts the rows within each window partition based on the order expression.
frame	Specifies the window frame, a subset of the current window partition. For example, `range between unbounded preceding and unbounded following`.

Return value type

Returns the same data type as x.

Examples

Returns the minimum size of each object in an OSS bucket.

Query and analysis statement

bucket :bucket90 |
select
  object,
  object_size,
  first_value(object_size) over (
    partition by object
    order by
      object_size 
     range between unbounded preceding and unbounded following
  ) as first_value
from  oss-log-store

Query and analysis results: Seven rows partitioned into three groups by object. Within each group, first_value equals the minimum object_size (the first row in ascending order).

Last_value function

Returns the value from the last row in a partition.

Syntax

last_value(x) over (
    [partition by partition_expression]
    [order by order_expression]
    [frame]
)

Parameters

Parameter	Description
x	The column name. It can be any data type.
partition by partition_expression	Divides the rows into partitions based on the specified partition expression.
order by order_expression	Sorts the rows within each partition based on the specified order expression.
frame	Specifies the window frame, which is a subset of the current partition. For example, `range between unbounded preceding and unbounded following`.

Return value type

Same type as x.

Examples

Finds the maximum object size in the specified OSS bucket.

Query and analysis statement

bucket :bucket90 |
select
  object,
  object_size,
  last_value(object_size) over (
    partition by object
    order by
      object_size 
     range between unbounded preceding and unbounded following
  ) as last_value
from  oss-log-store

The query returns three columns: object, object_size, and last_value. For 245-da918c.model, seven rows show object_size from 2383 to 6936, with last_value of 6936 for all rows. For dashboard%2F2020%2F05%2F20%2F16%2F47.csv, two rows have object_size 2435 and 2603, with last_value of 2603.

Lag function

Returns the value from a row at a specified offset preceding the current row in a partition.

Syntax

lag(x, offset, default_value) over (
    [partition by partition_expression]
    [order by order_expression]
    [frame]
)

Parameters

Parameter	Description
x	A column or expression of any data type.
offset	The offset, which specifies the number of rows preceding the current row. If you set offset to 0, the function returns the value of the current row.
default_value	Returns default_value if the specified offset row does not exist.
partition by partition_expression	Divides the rows into window partitions based on the specified partition expression.
order by order_expression	Sorts the rows within each window partition based on the specified order expression.
frame	Specifies the window frame, a subset of the current window partition. For example, `range between unbounded preceding and unbounded following`.

Return value type

Same type as x.

Examples

Calculates daily UVs and day-over-day growth rate.

Query statement

* |
select
  day,
  UV,
  UV * 1.0 /(lag(UV, 1, 0) over()) as diff_percentage
from  (
    select
      approx_distinct(client_ip) as UV,
      date_trunc('day', __time__) as day
    from  log
    group by
      day
    order by
      day asc
  )

Query and analysis results: Eight records from August 2-9, 2021. The first row's diff_percentage is Infinity because lag returns the default value 0, causing division by zero. Subsequent rows show the ratio of current-day to previous-day UVs, such as 2.098 and 0.976.

Lead function

Returns the value from a row at a specified offset after the current row in a partition.

Syntax

lead(x, offset, default_value) over (
    [partition by partition_expression]
    [order by order_expression]
    [frame]
)

Parameters

Parameter	Description
x	A column or expression of any data type.
offset	The number of rows after the current row. If offset is 0, the function returns the value of the current row.
default_value	If the specified offset row does not exist, it returns default_value.
partition by partition_expression	Divides the rows into window partitions based on the specified partition expression.
order by order_expression	Sorts the rows within each window partition based on the specified order expression.
frame	Specifies the window frame, for example, `range between unbounded preceding and unbounded following`.

Return value type

Returns a value of the same data type as x.

Examples

Calculates the hourly UV ratio on August 26, 2021.

Query statement

* |
select
  time,
  UV,
  UV * 1.0 /(lead(UV, 1, 0) over()) as diff_percentage
from  (
    select
      approx_distinct(client_ip) as uv,
      date_trunc('hour', __time__) as time
    from    log
    group by
      time
    order by
      time asc
  )

Query and analysis results: The query returns eight rows, showing the hourly unique visitors (UV) and the ratio of the current hour's UVs to the next hour's UVs (diff_percentage) from 00:00 to 07:00 on August 26, 2021.

nth_value function

Returns the value from the offset-th row in a partition.

Syntax

nth_value(x, offset) over (
    [partition by partition_expression]
    [order by order_expression]
    [frame]
)

Parameters

Parameter	Description
x	A column or expression. It can be of any data type.
offset	The row offset. It must be a positive integer.
partition by partition_expression	Divides the rows into partitions based on the partition expression.
order by order_expression	Sorts the rows within each partition based on the order expression.
frame	Specifies the window frame, which is a subset of the current partition. For example, `range between unbounded preceding and unbounded following`.

Return value type

Same as the data type of x.

Examples

Returns the employee with the second-highest salary per department.

Query statement

* |
select
  department,
  staff_name,
  salary,
  nth_value(staff_name, 2) over(
    partition by department
    order by
      salary desc
      range between unbounded preceding and unbounded following
  ) as second_highest_salary from log

Query results: The second_highest_salary column shows Blan (8500) for dev and San (7000) for Marketing across all rows in each partition.