The PolarDB for PostgreSQL Distributed Edition cluster is a distributed database solution developed based on the PolarDB for PostgreSQL centralized edition. The cluster uses a two-layer architecture of compute nodes (CNs) and data nodes (DNs) to decouple compute from storage and provide distributed scaling capabilities. The cluster combines the strengths of a distributed database with the proven capabilities of PolarDB for PostgreSQL and can meet the diverse performance and reliability requirements of enterprise-level applications.
How it works
PolarDB for PostgreSQL Distributed Edition contains two types of core nodes:
Compute Nodes (CNs): Act as the access point for the cluster. They parse SQL, create distributed query plans, and manage metadata.
Data Nodes (DNs): Store the physical data shards of tables.
Each compute or data node is a high availability (HA) PolarDB for PostgreSQL cluster. Each node includes a read-write (RW) node, read-only (RO) nodes, and distributed storage (PolarStore) to ensure component-level reliability.

Architectural advantages
Online horizontal scaling: Add nodes online to expand compute and storage capacity. This breaks through single-machine bottlenecks and supports petabyte-scale data and high-concurrency services.
Flexible scaling methods: Nodes use a shared storage architecture. You can scale out by adding more nodes or scale up by increasing node specifications.
High availability and low cost: The distributed storage (PolarStore) uses triplicate replicas and the ParallelRaft replication protocol to ensure high data availability. You are charged only for a single replica. Storage is billed based on actual usage, so you do not need to manually adjust capacity.
High performance: Deep I/O optimizations based on the distributed file system (PolarFileSystem, or PolarFS) deliver extremely high input/output operations per second (IOPS). These optimizations include parallel dirty page flushing, batch reads and writes, and table size caching.
Quick backup and recovery: Create backups in seconds. Multiple data restoration methods are available, such as point-in-time restore (PITR).
Distributed development patterns
When you develop on PolarDB for PostgreSQL Distributed Edition, the core task is to plan data distribution. You can choose one of the following two patterns based on your business scenario:
Horizontal splitting: Distributes the rows of a large single table across multiple DNs based on the hash value of a specific column (the distribution column). This pattern solves performance issues caused by excessively large single tables, such as user tables or order tables. For optimal performance, include the distribution column as a filter condition in your queries.
Distributed table: A table that is logically complete but whose data is physically split and stored across multiple DNs. For example, an application accesses the sensors_data table, but the data is actually stored in shards such as sensors_data_shard1 and sensors_data_shard2.
Distribution column: The column used to calculate hash values to determine data distribution when you create a distributed table. An example is sensor_id.
Replicated table: A special type of table where a full copy of its data is stored on every DN. It is typically used for small dimension tables, such as tables for countries or configuration information, that are frequently joined with large distributed tables. This optimizes cross-node JOINs into local JOINs and significantly improves query performance.
Distributed transaction: When a single operation (an explicit BEGIN ... COMMIT/ROLLBACK or an implicit transaction) needs to modify data distributed across multiple DNs, the system automatically enables a distributed transaction to ensure data consistency (atomicity, consistency, isolation, and durability (ACID)).
Note Any modification to a replicated table triggers a distributed transaction.
Primary/Follower CNs: All CNs can process query requests. However, to ensure metadata consistency, Data Definition Language (DDL) operations, such as CREATE TABLE, can be executed only on the primary CN. Changes are automatically synchronized to all other nodes (CNs and DNs).
Vertical splitting: Deploys tables from different business modules to different DNs. This pattern is suitable for isolating resources by business. For example, you can store tables for high-frequency transaction services and backend reporting services on different nodes to prevent them from affecting each other. This pattern is minimally intrusive to applications because the CNs shield the underlying details from the upper layer. Applications can still access all tables as if they were accessing a standalone database.
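Conceptually, horizontal splitting routes each row to a DN based on a hash of the distribution column. The following pure-Python sketch illustrates the routing idea only; the shard count, hash function, and values are illustrative and do not reflect PolarDB internals:

```python
import hashlib

NUM_SHARDS = 4  # illustrative DN (shard) count


def shard_for(distribution_value: str) -> int:
    """Map a distribution-column value (for example, sensor_id) to a shard."""
    digest = hashlib.md5(distribution_value.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


# Rows with the same distribution-column value always land on the same
# shard, which is why queries that filter on the distribution column can
# be routed to a single DN instead of fanning out to all of them.
assert shard_for("sensor_42") == shard_for("sensor_42")

placement = {sid: shard_for(sid) for sid in ["sensor_1", "sensor_2", "sensor_42"]}
print(placement)
```

A query without the distribution column in its filter must be broadcast to every shard, which is why the document recommends including it as a filter condition.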

How to use
Connect to the distributed database
Create a database account
Log on to the PolarDB console. In the cluster list, click the ID of the cluster to go to its Basic Information page. In the left-side navigation pane, go to the account management page and create a database account.
Note You can create a Privileged Account or a Standard Account. These two types of accounts have different permissions. Create a database account based on your business requirements.
Configure a cluster whitelist
Log on to the PolarDB console. In the cluster list, click the ID of the cluster that you want to connect to go to its Basic Information page. In the left-side navigation pane, go to the whitelist settings and add an IP whitelist or a security group.
Note If you want to access the PolarDB cluster from an ECS instance and the ECS instance is in the same VPC as the PolarDB cluster, create an IP address whitelist and add the internal IP address of the ECS instance to the whitelist, or add the security group to which the ECS instance belongs.
If you want to access the PolarDB cluster from an ECS instance and the ECS instance is in a different VPC from the PolarDB cluster, create an IP address whitelist and add the public IP address of the ECS instance to the whitelist, or add the security group to which the ECS instance belongs.
If you want to access the PolarDB cluster from your on-premises environment, create an IP address whitelist and add the public IP address of your on-premises environment to the whitelist.
Use the following methods to obtain the public IP address of your on-premises environment:
Linux: Open the CLI, enter the curl ifconfig.me command, and then press the Enter key.
Windows: Open Command Prompt, enter the curl ip.me command, and then press the Enter key.
macOS: Start Terminal, enter the curl ifconfig.me command, and then press the Enter key.
If a proxy is used for your on-premises network environment, the IP address obtained by the preceding method may not be your actual public IP address. You can add the 0.0.0.0/0 CIDR block to a whitelist of the PolarDB cluster. After you connect to the cluster, run the SELECT pid,usename,datname,client_addr,state,query FROM pg_stat_activity WHERE state = 'active'; command to obtain the actual public IP address and add it to a whitelist of the cluster. Then, delete the 0.0.0.0/0 CIDR block from the whitelist.

If you add the 0.0.0.0/0 CIDR block to an IP whitelist, all sources are allowed to access the cluster. Do not add 0.0.0.0/0 to an IP whitelist of the cluster unless it is necessary.
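After you connect with the temporary 0.0.0.0/0 rule, the pg_stat_activity query above returns one row per active session. The following sketch shows how to extract the distinct client addresses from such rows; the sample rows and addresses are illustrative, not real output:

```python
# Each tuple mirrors the queried columns:
# (pid, usename, datname, client_addr, state, query)
rows = [
    (101, "testuser", "postgres", "203.0.113.10", "active", "SELECT 1"),
    (102, "testuser", "postgres", "203.0.113.10", "active", "SELECT 2"),
    (103, "admin", "postgres", "198.51.100.7", "active", "SELECT 3"),
]

# Collect the distinct client addresses. Add these to the whitelist,
# then remove the temporary 0.0.0.0/0 rule.
client_addrs = sorted({r[3] for r in rows if r[3]})
print(client_addrs)  # ['198.51.100.7', '203.0.113.10']
```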
Obtain the database endpoint and port
Log on to the PolarDB console. In the cluster list, click the ID of the cluster to go to its Basic Information page. Then, you can view the endpoint information in the Database Connections section.
Note By default, a distributed PolarDB for PostgreSQL cluster has only one Primary Endpoint, and the default port number is 5432.
Use the Private or Public endpoint based on your access environment.
If you want to access the PolarDB cluster from an ECS instance and the ECS instance is in the same VPC as the PolarDB cluster, use the Private endpoint.
If you want to access the PolarDB cluster from your on-premises environment, use the Public endpoint. By default, no public endpoint is available. Click Apply to apply for a public endpoint.
A PolarDB cluster cannot achieve optimal performance if you connect to it by using a public endpoint.
You cannot connect to a PolarDB cluster from virtual hosts and lightweight servers by using a Private endpoint.
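Whichever endpoint you use, it simply becomes the host part of the client connection string. The following sketch assembles a libpq-style DSN and a JDBC URL from placeholder values; the endpoint, account, and database names are hypothetical:

```python
host = "pc-xxx.rwlb.rds.aliyuncs.com"  # private or public endpoint (hypothetical)
port = 5432                            # default port
user = "testusername"                  # database account (hypothetical)
dbname = "postgres"                    # database name

# libpq-style keyword/value DSN, as accepted by psql and psycopg2.
dsn = f"host={host} port={port} user={user} dbname={dbname}"

# JDBC URL, as accepted by the PostgreSQL JDBC driver.
jdbc_url = f"jdbc:postgresql://{host}:{port}/{dbname}"

print(dsn)
print(jdbc_url)
```

Switching between the private and public endpoint changes only the host value; all other parameters stay the same.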
Connect to the distributed cluster
Use DMS to connect to a cluster
Data Management (DMS) is a graphical data management tool provided by Alibaba Cloud. It provides various data management services, including data management, schema management, user management, security audit, data trends, data tracking, business intelligence (BI) charts, performance optimization, and server management. You can manage your PolarDB cluster directly by using DMS without using other tools.
Log on to the PolarDB console. In the cluster list, click the ID of the cluster that you want to connect to go to its Basic Information page. In the upper-right corner of the page, click Log On To Database.
In the dialog box that appears, enter the database account and password that you created for the cluster, and click Login.
After you log on to the cluster, use the left-side navigation pane to manage the cluster.
Use a client to connect to a cluster
You can use a client to connect to a PolarDB cluster. The following procedure uses the pgAdmin 4 v9.0 client to connect to a cluster.
Download and install the pgAdmin 4 client.
Open the pgAdmin 4 client, right-click Servers, and choose Register > Server.
On the General tab, set the connection name. On the Connection tab, configure the cluster connection information, and click Save.

Parameter | Description
Host name/address | The endpoint of the PolarDB cluster. To access the PolarDB cluster from an ECS instance that is in the same VPC as the PolarDB cluster, specify the Private endpoint. To access the PolarDB cluster from your on-premises environment, specify the Public endpoint.
Port | The port number of the PolarDB cluster. The default port number is 5432.
Username | The database account of the PolarDB cluster.
Password | The password of the database account.
View the connection result. If the connection information is correct, the connection succeeds and the server appears in the pgAdmin navigation tree.
Note postgres is the default system database. Do not perform any operations on this database.
Use psql to connect to a cluster
You can download psql from PostgreSQL Downloads to connect to a PolarDB cluster. You can also use psql in the PolarDB-Tools to connect to a PolarDB cluster.
Note The procedure for connecting by using psql is the same on Windows and Linux.
For more information about how to use psql, see psql.
Syntax
psql -h <host> -p <port> -U <username> -d <dbname>
Parameter | Description
host | The endpoint of the PolarDB cluster. To access the PolarDB cluster from an ECS instance that is in the same VPC as the PolarDB cluster, specify the Private endpoint. To access the PolarDB cluster from your on-premises environment, specify the Public endpoint.
port | The port number of the PolarDB cluster. The default port number is 5432.
username | The database account of the PolarDB cluster.
dbname | The database name.
Example
psql -h pc-xxx.rwlb.rds.aliyuncs.com -p 5432 -U testusername -d postgres
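When you script connections, it can help to build the psql invocation as an argument list instead of concatenating strings. A sketch that reuses the placeholder values from the example above:

```python
import shlex


def build_psql_cmd(host: str, port: int, user: str, dbname: str) -> list:
    """Return the psql invocation as an argument list."""
    return ["psql", "-h", host, "-p", str(port), "-U", user, "-d", dbname]


cmd = build_psql_cmd("pc-xxx.rwlb.rds.aliyuncs.com", 5432, "testusername", "postgres")
# shlex.join renders the list as a safely quoted shell command line.
print(shlex.join(cmd))
```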
Connect to a cluster in a programming language
Connecting to a PolarDB for PostgreSQL cluster is similar to connecting to a regular PostgreSQL database. You only need to change the connection parameters, including the endpoint, port, account, and password. The following examples show how to connect to a PolarDB cluster in specific programming languages.
Java
This example describes how to connect to a PolarDB for PostgreSQL cluster by using the PostgreSQL JDBC driver in a Maven-based Java project.
Add the PostgreSQL JDBC driver dependency to your pom.xml file. Sample code:
<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
    <version>42.2.18</version>
</dependency>
Connect to the cluster. Replace the <HOST>, <PORT>, <USER>, <PASSWORD>, <DATABASE>, <YOUR_TABLE_NAME>, and <YOUR_TABLE_COLUMN_NAME> placeholders with the actual cluster connection parameters.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PolarDBConnection {
    public static void main(String[] args) {
        // Database URL, username, and password.
        String url = "jdbc:postgresql://<HOST>:<PORT>/<DATABASE>";
        String user = "<USER>";
        String password = "<PASSWORD>";
        try {
            // Load the PostgreSQL JDBC driver.
            Class.forName("org.postgresql.Driver");
            // Establish the connection.
            Connection conn = DriverManager.getConnection(url, user, password);
            // Create a Statement object.
            Statement stmt = conn.createStatement();
            // Execute an SQL query.
            ResultSet rs = stmt.executeQuery("SELECT * FROM <YOUR_TABLE_NAME>");
            // Process the result set.
            while (rs.next()) {
                System.out.println(rs.getString("<YOUR_TABLE_COLUMN_NAME>"));
            }
            // Close resources.
            rs.close();
            stmt.close();
            conn.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Python
This example describes how to connect to a PolarDB for PostgreSQL cluster by using the psycopg2 library in Python 3.
Install the psycopg2 library.
pip3 install psycopg2-binary
Connect to the cluster. Replace the <HOST>, <PORT>, <USER>, <PASSWORD>, <DATABASE>, and <YOUR_TABLE_NAME> placeholders with the actual cluster connection parameters.
import psycopg2

try:
    # Connection parameters.
    conn = psycopg2.connect(
        host="<HOST>",          # The cluster endpoint.
        database="<DATABASE>",  # The database name.
        user="<USER>",          # The username.
        password="<PASSWORD>",  # The password.
        port="<PORT>",          # The port number.
    )
    # Create a cursor object.
    cursor = conn.cursor()
    # Execute a query.
    cursor.execute("SELECT * FROM <YOUR_TABLE_NAME>")
    # Fetch all results.
    records = cursor.fetchall()
    for record in records:
        print(record)
except Exception as e:
    print("Error:", e)
finally:
    # Close the connection.
    if 'cursor' in locals():
        cursor.close()
    if 'conn' in locals():
        conn.close()
Go
This example describes how to connect to a PolarDB for PostgreSQL cluster by using the database/sql package and the lib/pq driver in Go 1.23.0.
Install the lib/pq driver.
go get -u github.com/lib/pq
Connect to the cluster. Replace the <HOST>, <PORT>, <USER>, <PASSWORD>, <DATABASE>, and <YOUR_TABLE_NAME> placeholders with the actual cluster connection parameters.
package main

import (
    "database/sql"
    "fmt"
    "log"

    _ "github.com/lib/pq" // Register the PostgreSQL driver.
)

func main() {
    // The connection string format.
    connStr := "user=<USER> password=<PASSWORD> dbname=<DATABASE> host=<HOST> port=<PORT> sslmode=disable"
    // Open a database connection.
    db, err := sql.Open("postgres", connStr)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close() // Close the connection when the program exits.
    // Test the connection.
    err = db.Ping()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("Connected to PostgreSQL!")
    // Execute a query.
    rows, err := db.Query("SELECT * FROM <YOUR_TABLE_NAME>")
    if err != nil {
        log.Fatal(err)
    }
    defer rows.Close()
}
Create and manage distributed tables and replicated tables
DML operations on distributed tables
Optimize queries on distributed tables
Configure CDC to synchronize data changes