Deduplication and sorting by using DISTINCT and ORDER BY - PolarDB

DISTINCT and ORDER BY are two Oracle SQL clauses. DISTINCT deduplicates query results, and ORDER BY sorts query results. When the two clauses are used together, you need to notice their syntax and execution order. This topic describes how to use DISTINCT and ORDER BY in Oracle SQL statements.

Syntax

SELECT DISTINCT column1, column2, ...
FROM table_name
WHERE condition
ORDER BY column1 [ASC|DESC], column2 [ASC|DESC], ...;

SELECT DISTINCT: duplicates the result and returns unique records.
FROM: specifies the table to be queried.
WHERE: optional. The query conditions.
ORDER BY: sorts the result set. You can specify one or more columns and select ascending (ASC) or descending (DESC ).

How they work

In SQL queries, DISTINCT is executed before ORDER BY. The system selects unique records and then sorts them.

Examples

A table named employees has the following structure:

CREATE TABLE employees (
    employee_id INT PRIMARY KEY , - The unique ID of an employee.
    first_name VARCHAR(50) NOT NULL, -- The first name of an employee
    last_name VARCHAR(50) NOT NULL, -- The last name of an employee.
    department VARCHAR(50), -- The department of an employee.
    salary DECIMAL (10,2) CHECK (salary > 0) -- The salary of an employee.
);

INSERT INTO employees VALUES (1, 'John', 'Doe', 'Sales', 5000);
INSERT INTO employees VALUES (2, 'Jane', 'Smith', 'Marketing', 6000);
INSERT INTO employees VALUES (3, 'John', 'Doe', 'Sales', 5000);
INSERT INTO employees VALUES (4, 'Alice', 'Johnson', 'IT', 7000);
INSERT INTO employees VALUES (5, 'Bob', 'Brown', 'Sales', 5000);

Select unique department values and sort them in ascending order.

SELECT DISTINCT DEPARTMENT
FROM employees
ORDER BY DEPARTMENT ASC;

Sample result:

 department 
------------
 IT
 Marketing
 Sales
(3 rows)

Select unique salary values and sort them in descending order.

SELECT DISTINCT SALARY
FROM employees
ORDER BY SALARY DESC;

Sample result:

 salary  
---------
 7000.00
 6000.00
 5000.00
(3 rows)

Select unique name values and sort them in ascending order.

SELECT DISTINCT FIRST_NAME, LAST_NAME
FROM employees
ORDER BY FIRST_NAME ASC, LAST_NAME ASC;

Sample result:

Note

Even if multiple records with the same FIRST_NAME and LAST_NAME, only one unique name is returned because DISTINCT is used.

 first_name | last_name 
------------+-----------
 Alice      | Johnson
 Bob        | Brown
 Jane       | Smith
 John       | Doe
(4 rows)

Usage notes

Columns in ORDER BY
When you use ORDER BY, the columns specified in the clause must exist in the SELECT statement. If you reference a column in ORDER BY that is not in the SELECT statement, PolarDB returns an error.
Incorrect example:
```
SELECT DISTINCT FIRST_NAME
FROM employees
ORDER BY LAST_NAME ASC;
```
The following error message is returned:
```
ERROR:  for SELECT DISTINCT, ORDER BY expressions must appear in select list
```
For the sake of Oracle compatibility, you can reference columns that are not in a SELECT statement after setting the polar_enable_distinct_orderby_new_column parameter to ON. To use this feature, perform the following steps:
1. Set the polar_enable_distinct_orderby_new_column parameter to ON.
  Note
  The polar_enable_distinct_orderby_new_column parameter is first introduced in PolarDB for PostgreSQL (Compatible with Oracle) 2.0 with revision version 2.0.14.26.0. If your cluster version is earlier than 2.0.14.26.0 and you want to use this feature, upgrade your cluster to 2.0.14.26.0 or later. For more information about how to view and upgrade minor versions, see Version management.
2. Reference a column in ORDER BY that is not in the SELECT statement.
3. Add the FETCH FIRST X rows ONLY or LIMIT keyword to the end of the query statement.
```
SELECT DISTINCT FIRST_NAME
FROM employees
ORDER BY LAST_NAME ASC
FETCH FIRST 2 rows ONLY;
```
Use aliases
You can specify aliases for columns in the SELECT statement and then use these aliases in ORDER BY.
```
SELECT DISTINCT FIRST_NAME AS FN
FROM employees
ORDER BY FN ASC;
```
Use positions of the columns referenced in ORDER BY
You can use the position of a column to sort data. For example, ORDER BY 1 indicates that data is sorted by the first column in the SELECT statement.
```
SELECT DISTINCT FIRST_NAME, LAST_NAME
FROM employees
ORDER BY 1 ASC;
```

Advanced usage

Combined with WHERE

DISTINCT and ORDER BY can be combined with WHERE to filter records.

Example:

SELECT DISTINCT DEPARTMENT
FROM employees
WHERE SALARY > 5000
ORDER BY DEPARTMENT ASC;

Sample result:

 department 
------------
 IT
 Marketing
(2 rows)

Combined with subqueries

In some complex query scenarios, DISTINCT and ORDER BY can be combined with subqueries.

Example:

Select the department with the highest salary and then sort the departments:

SELECT DEPARTMENT, MAX(SALARY) AS MAX_SALARY
FROM employees
GROUP BY DEPARTMENT
ORDER BY DEPARTMENT ASC;

Sample result:

 department | max_salary 
------------+------------
 IT         |    7000.00
 Marketing  |    6000.00
 Sales      |    5000.00
(3 rows)

Combined with window functions

DISTINCT and ORDER BY can be combined with window functions to implement more complex data analysis.

Example:

Select the employee with the highest salary in each department and sort the departments in descending order of salary.

SELECT DISTINCT DEPARTMENT, FIRST_NAME, LAST_NAME, SALARY
FROM (
    SELECT DEPARTMENT, FIRST_NAME, LAST_NAME, SALARY,
           ROW_NUMBER() OVER (PARTITION BY DEPARTMENT ORDER BY SALARY DESC) AS rn
    FROM employees
)
WHERE rn = 1
ORDER BY SALARY DESC;

Sample result:

 department | first_name | last_name | salary 
------------+------------+-----------+--------
 IT         | Alice      | Johnson   |   7000
 Marketing  | Jane       | Smith     |   6000
 Sales      | John       | Doe       |   5000
(3 rows)

Common errors and debugging

Sort the column not specified in WHERE

As described in Columns in ORDER BY, an error is returned if you use a column in the ORDER BY that is not in the SELECT statement. You must ensure that all columns specified in ORDER BY exist in the SELECT statement, or use aliases or column positions.

Note

If you set the polar_enable_distinct_orderby_new_column parameter to ON and add the FETCH FIRST X rows ONLY keyword, it is allowed for DISTINCT to implicitly reference the column in the ORDER BY.

Confuse the coverage of DISTINCT

DISTINCT apply to all columns in the SELECT statement, not just a single column. Therefore, the SELECT DISTINCT statement that specifies multiple columns deduplicates all specified columns.

Example:

SELECT DISTINCT FIRST_NAME, DEPARTMENT
FROM employees
ORDER BY FIRST_NAME ASC;

The above SQL statement deduplicates records filtered using FIRST_NAME and DEPARTMENT, not just using FIRST_NAME.

Performance improvements

When large datasets are involved, using DISTINCT and ORDER BY has impacts on query performance. The following optimization suggestions can be used:

Index optimization: Create appropriate indexes for the columns specified in WHERE, DISTINCT, and ORDER BY to speed up queries.
Minimize columns returned: Select only the necessary columns to reduce data volumes to be processed.
Use partitioning: You can partition large tables to improve query efficiency.
Execution plan analysis: Use the EXPLAIN PLAN statement to analyze execution plans and identify potential performance bottlenecks.

Example:

Use index optimization.

-- Create an index to optimize queries on the DEPARTMENT column.
CREATE INDEX idx_employees_department ON employees(DEPARTMENT);

Summary

DISTINCT and ORDER BY are powerful SQL tools in PolarDB for deduplicating records and sorting result sets. Understanding how they work, their syntax, and common scenarios helps you improve query efficiency and accuracy. These two clauses can be combined with other SQL features to implement complex data processing and analysis tasks.