Community Blog

How to Build a Customer Data Platform on Alibaba Cloud with DataWorks and ApsaraDB RDS


Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.

Marketing teams often struggle with fragmented customer data spread across CRMs, email tools, databases, and spreadsheets. A Customer Data Platform (CDP) solves this by creating a unified, continuously updated customer profile from all data sources.

This guide shows how to use Alibaba Cloud DataWorks to integrate customer data from sources like ApsaraDB RDS, transform it in MaxCompute, and build a centralized customer profile layer for marketing tools and analytics.

What a CDP on Alibaba Cloud Actually Looks Like

Before writing any code or clicking anything in the console, it helps to understand the architecture you are building.

A well-designed CDP on Alibaba Cloud has four layers:

Source layer. This is where your raw customer data lives today: ApsaraDB RDS (MySQL or PostgreSQL) for transactional data, OSS for log files and event data from your website or mobile app, and any other structured or semi-structured sources your business uses.

Ingestion layer. DataWorks Data Integration handles the movement of data from your sources into the processing layer.

It supports batch synchronisation for scheduled loads and real-time synchronisation for continuous streaming, and it can move data reliably even in complex network environments.

You can configure a batch synchronization node by using the codeless UI on the DataStudio page to periodically synchronize offline data. You can also create a real-time synchronization node on the DataStudio page to synchronize incremental data from a single table or a database in real time.

Processing and storage layer. MaxCompute is where your customer data lands after ingestion. This is where you run SQL-based transformations to build your unified customer profile table, calculate segment tags, and produce the clean, joined datasets your marketing team actually needs.

Serving layer. The output of your CDP feeds back into ApsaraDB RDS as clean, enriched customer tables that your marketing tools can query, or into OSS and DataWorks Data Service as an API that your campaign tools can call.

Seen as a process, the build has two phases: unified data collection through DataWorks Data Integration (batch and real time), followed by full-cycle data development in DataWorks, covering ETL and transformation, KPI computation, and the scheduling, monitoring, and alerting of the data jobs.

What You Will Build

By the end of this guide your CDP will:

  • Ingest customer records from an ApsaraDB RDS MySQL instance into MaxCompute
  • Ingest website event logs from OSS into MaxCompute
  • Join and transform both sources to produce a unified customer profile table
  • Add computed marketing attributes: recency, frequency, monetary value (RFM), and engagement segments
  • Write the cleaned, enriched profiles back to ApsaraDB RDS for use by your marketing tools
  • Run the entire pipeline automatically on a daily schedule through DataWorks

Prerequisites

Before starting, have the following ready:

  • An Alibaba Cloud account with DataWorks activated and a workspace created
  • A MaxCompute project linked to your DataWorks workspace
  • An ApsaraDB RDS for MySQL instance containing your customer and transaction data
  • An OSS bucket containing website event log files (CSV or JSON format)
  • RAM user credentials with permissions for DataWorks, MaxCompute, RDS, and OSS
  • The RDS instance's public endpoint and a read-only database user for the ingestion step

DataWorks Basic Edition is free to use and is sufficient for this tutorial. For production workloads with scheduling at scale, the Professional Edition covers the majority of core data development and governance features.

Step 1: Set Up Your Source Data in ApsaraDB RDS

For this tutorial, assume your RDS MySQL database contains two tables relevant to the CDP:

A customers table with basic profile information:

CREATE TABLE customers (
    customer_id   VARCHAR(50) PRIMARY KEY,
    email         VARCHAR(255),
    first_name    VARCHAR(100),
    last_name     VARCHAR(100),
    country       VARCHAR(50),
    signup_date   DATE,
    updated_at    DATETIME
);

A transactions table with purchase history:

CREATE TABLE transactions (
    transaction_id   VARCHAR(50) PRIMARY KEY,
    customer_id      VARCHAR(50),
    product_id       VARCHAR(50),
    amount           DECIMAL(10,2),
    transaction_date DATE,
    status           VARCHAR(20)
);

In your actual setup, these tables exist already. The structure above just illustrates what field names this guide assumes. Adjust the field names in the DataWorks sync configuration to match your actual schema.

Add DataWorks IP addresses to the RDS whitelist:

DataWorks accesses your RDS instance over the network. Before adding RDS as a data source, add the DataWorks resource group IP addresses for your region to the RDS whitelist.

You can find the current IP ranges in the DataWorks documentation for your region. Without this step, the data source connection test will fail.

Step 2: Add Data Sources to DataWorks

Log on to the DataWorks console. Select your workspace and navigate to Data Integration > Data Sources.

You need to add three data sources: your RDS MySQL instance as a source, your OSS bucket as a source, and MaxCompute as the destination.

Add ApsaraDB RDS MySQL as a data source:

Click New Data Source and select MySQL from the relational database section. Fill in the connection details:

Field              Value
Data Source Name   customer_rds_prod
JDBC URL           jdbc:mysql://<your-rds-endpoint>:3306/<database-name>
Username           Your read-only database user
Password           The corresponding password
Click Test Connection. A successful test confirms DataWorks can reach your RDS instance. If the test fails, the most common cause is the RDS whitelist not yet including the DataWorks IP addresses from the previous step.

Add OSS as a data source:

Click New Data Source and select OSS. Enter the Endpoint for your region (for example oss-ap-southeast-1.aliyuncs.com), the bucket name, and the AccessKey credentials of a RAM user with read access to the bucket.

MaxCompute is pre-configured as the default compute engine in a standard DataWorks workspace. No manual data source setup is needed for MaxCompute destinations.

Step 3: Create the Target Tables in MaxCompute

Before you can sync data into MaxCompute, you need the target tables ready. Open DataStudio in your DataWorks workspace, navigate to your MaxCompute project, and run the following DDL statements to create the ODS (Operational Data Store) tables:

-- Raw customer records from RDS
CREATE TABLE IF NOT EXISTS cdp.ods_customers (
    customer_id   STRING,
    email         STRING,
    first_name    STRING,
    last_name     STRING,
    country       STRING,
    signup_date   STRING,
    updated_at    STRING
)
PARTITIONED BY (dt STRING)
LIFECYCLE 90;

-- Raw transaction records from RDS
CREATE TABLE IF NOT EXISTS cdp.ods_transactions (
    transaction_id   STRING,
    customer_id      STRING,
    product_id       STRING,
    amount           DOUBLE,
    transaction_date STRING,
    status           STRING
)
PARTITIONED BY (dt STRING)
LIFECYCLE 90;

-- Raw website events from OSS logs
CREATE TABLE IF NOT EXISTS cdp.ods_web_events (
    event_id     STRING,
    customer_id  STRING,
    event_type   STRING,
    page_url     STRING,
    event_time   STRING
)
PARTITIONED BY (dt STRING)
LIFECYCLE 90;

Partitioning by dt (business date) means each daily sync writes to its own partition, which keeps historical data intact and makes it easy to reprocess a specific day if needed.
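If a day ever needs to be reprocessed, you can drop just that partition and rerun the sync for the corresponding bizdate. A minimal sketch of the maintenance commands, using the table from this guide and an illustrative date:

```sql
-- Clear a single business date without touching other days
ALTER TABLE cdp.ods_customers DROP IF EXISTS PARTITION (dt='20260115');

-- ...rerun the sync node for that bizdate, then verify the reload:
SELECT COUNT(*) FROM cdp.ods_customers WHERE dt = '20260115';
```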

Step 4: Configure the Batch Synchronisation Nodes

In this step, Data Integration pulls the raw inputs into the compute engine: basic customer and transaction records from ApsaraDB RDS, and website access logs from OSS, so the rest of the pipeline can work on them in MaxCompute.

In DataStudio, create a workflow to hold your CDP pipeline nodes. Right-click the workflow folder and choose Create Workflow. Name it cdp_daily_pipeline.

Create the RDS-to-MaxCompute sync node:

Inside the workflow, click Create Node and select Data Integration > Batch Synchronization. Name the node sync_customers_rds_to_mc.

In the node editor, configure:

  • Source (Reader): Select MySQL, choose customer_rds_prod as the data source, and write a query that reads yesterday's updated records:
SELECT
    customer_id,
    email,
    first_name,
    last_name,
    country,
    signup_date,
    updated_at
FROM customers
WHERE DATE(updated_at) = '${bizdate}'

The ${bizdate} variable is automatically substituted with the previous business date on each scheduled run. Note that DATE(updated_at) returns a yyyy-mm-dd value, while the default ${bizdate} format is yyyymmdd, so declare the scheduling parameter with a matching yyyy-mm-dd format (DataWorks scheduling parameters support custom date formats). This incremental approach means the sync only moves records that changed since the last run, keeping sync times fast even as your table grows.

  • Destination (Writer): Select MaxCompute, choose cdp.ods_customers as the target table, and set the partition to dt=${bizdate}.
  • Field Mapping: Click Auto Map in the field mapping section. DataWorks maps source columns to target columns by name automatically. Review the mapping and adjust any mismatches.

Save and commit the node.
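For hand-testing the incremental query, it helps to compute the substituted value yourself. The substitution is the previous calendar day; whether it renders as yyyymmdd (the default) or yyyy-mm-dd depends on how you declare the scheduling parameter. A small Python sketch of the default convention:

```python
from datetime import date, timedelta

def bizdate(run_day: date) -> str:
    """Previous business date in the default yyyymmdd format."""
    return (run_day - timedelta(days=1)).strftime("%Y%m%d")

# A scheduled run on 16 Jan 2026 processes the partition for 15 Jan 2026
print(bizdate(date(2026, 1, 16)))  # 20260115
```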

Create the transactions sync node:

Repeat the same process for the transactions table, creating a node named sync_transactions_rds_to_mc that reads from the transactions table and writes to cdp.ods_transactions.

Create the OSS events sync node:

Create a third Batch Synchronization node named sync_web_events_oss_to_mc. Set the source to OSS, point it to the path in your bucket where event log files are stored (for example oss://your-bucket/events/${bizdate}/), and set the destination to cdp.ods_web_events.

For OSS sources, DataWorks reads CSV or JSON files from the specified path. Set the file format to match your log format and map the columns to the target table fields.
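Before configuring the mapping, it can help to parse one log line locally and confirm the column order matches ods_web_events. A sketch assuming comma-separated logs with the five fields in table order (adjust the field list to your real format):

```python
import csv
import io

# Assumed column order, matching cdp.ods_web_events in this guide
FIELDS = ["event_id", "customer_id", "event_type", "page_url", "event_time"]

def parse_event_line(line: str) -> dict:
    """Split one CSV log line and label the values with the target fields."""
    values = next(csv.reader(io.StringIO(line)))
    return dict(zip(FIELDS, values))

row = parse_event_line("e1,c42,page_view,/pricing,2026-01-15 09:30:00")
print(row["event_type"])  # page_view
```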


Step 5: Build the Unified Customer Profile Table

With raw data landing in MaxCompute every day, the next step is transforming it into a unified customer profile.

This is the core of the CDP: a single table where every row represents one customer and every column is a marketing-relevant attribute derived from all available data sources.

Create a new MaxCompute SQL node in your workflow named build_customer_profiles. This node runs after all three sync nodes have completed.

-- Unified customer profile with RFM scores and engagement tags
INSERT OVERWRITE TABLE cdp.dws_customer_profiles PARTITION (dt='${bizdate}')
SELECT
    c.customer_id,
    c.email,
    c.first_name,
    c.last_name,
    c.country,
    c.signup_date,

    -- Transaction-derived attributes
    t.total_orders,
    t.total_revenue,
    t.avg_order_value,
    t.first_purchase_date,
    t.last_purchase_date,
    DATEDIFF('${bizdate}', t.last_purchase_date) AS days_since_last_purchase,

    -- RFM segmentation
    CASE
        WHEN DATEDIFF('${bizdate}', t.last_purchase_date) <= 30  THEN 'Active'
        WHEN DATEDIFF('${bizdate}', t.last_purchase_date) <= 90  THEN 'At Risk'
        WHEN DATEDIFF('${bizdate}', t.last_purchase_date) <= 180 THEN 'Lapsed'
        ELSE 'Churned'
    END AS recency_segment,

    CASE
        WHEN t.total_orders >= 10 THEN 'High'
        WHEN t.total_orders >= 3  THEN 'Medium'
        ELSE 'Low'
    END AS frequency_segment,

    CASE
        WHEN t.total_revenue >= 1000 THEN 'High Value'
        WHEN t.total_revenue >= 200  THEN 'Mid Value'
        ELSE 'Low Value'
    END AS monetary_segment,

    -- Web engagement attributes
    e.total_page_views,
    e.last_session_date,
    CASE
        WHEN e.total_page_views >= 20 THEN 'Highly Engaged'
        WHEN e.total_page_views >= 5  THEN 'Moderately Engaged'
        ELSE 'Low Engagement'
    END AS engagement_segment

FROM cdp.ods_customers c

LEFT JOIN (
    SELECT
        customer_id,
        COUNT(DISTINCT transaction_id) AS total_orders,
        SUM(amount)                    AS total_revenue,
        AVG(amount)                    AS avg_order_value,
        MIN(transaction_date)          AS first_purchase_date,
        MAX(transaction_date)          AS last_purchase_date
    FROM cdp.ods_transactions
    WHERE status = 'completed'
    GROUP BY customer_id
) t ON c.customer_id = t.customer_id

LEFT JOIN (
    SELECT
        customer_id,
        COUNT(event_id)       AS total_page_views,
        MAX(event_time)       AS last_session_date
    FROM cdp.ods_web_events
    GROUP BY customer_id
) e ON c.customer_id = e.customer_id

WHERE c.dt = '${bizdate}';

Create the output table before running this node:

CREATE TABLE IF NOT EXISTS cdp.dws_customer_profiles (
    customer_id          STRING,
    email                STRING,
    first_name           STRING,
    last_name            STRING,
    country              STRING,
    signup_date          STRING,
    total_orders         BIGINT,
    total_revenue        DOUBLE,
    avg_order_value      DOUBLE,
    first_purchase_date  STRING,
    last_purchase_date   STRING,
    days_since_last_purchase BIGINT,
    recency_segment      STRING,
    frequency_segment    STRING,
    monetary_segment     STRING,
    total_page_views     BIGINT,
    last_session_date    STRING,
    engagement_segment   STRING
)
PARTITIONED BY (dt STRING)
LIFECYCLE 365;

This unified profile table is the heart of your CDP. Every customer record now carries attributes derived from transactions, web behaviour, and their original profile record.

Marketing teams can query it for audience segments, analysts can measure cohort behaviour, and campaign tools can consume it for personalisation.
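The segment thresholds in the SQL above are easy to unit-test offline. A minimal Python replica of the three CASE expressions, using the same cut-offs as the query, is handy for validating threshold changes before editing the node:

```python
def recency_segment(days_since_last_purchase: int) -> str:
    """Mirrors the recency CASE expression in build_customer_profiles."""
    if days_since_last_purchase <= 30:
        return "Active"
    if days_since_last_purchase <= 90:
        return "At Risk"
    if days_since_last_purchase <= 180:
        return "Lapsed"
    return "Churned"

def frequency_segment(total_orders: int) -> str:
    if total_orders >= 10:
        return "High"
    if total_orders >= 3:
        return "Medium"
    return "Low"

def monetary_segment(total_revenue: float) -> str:
    if total_revenue >= 1000:
        return "High Value"
    if total_revenue >= 200:
        return "Mid Value"
    return "Low Value"

print(recency_segment(45), frequency_segment(5), monetary_segment(250))
# At Risk Medium Mid Value
```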

Step 6: Write Enriched Profiles Back to ApsaraDB RDS

Many marketing tools, CRMs, and campaign platforms connect directly to MySQL rather than MaxCompute. To make your CDP output accessible to those tools, write the enriched customer profiles back to an ApsaraDB RDS table.

Create another Batch Synchronization node named write_profiles_to_rds. This time the direction is reversed: MaxCompute is the source and RDS is the destination.

  • Source (Reader): Select MaxCompute, choose cdp.dws_customer_profiles, and set the partition to dt=${bizdate}.
  • Destination (Writer): Select MySQL, choose customer_rds_prod, and set the target table to cdp_customer_profiles (create this table in RDS first with the matching schema).
  • Write Mode: Set to Insert or Update (UPSERT). This ensures that existing customer records are updated rather than duplicated on each run.

Your marketing tools can now query cdp_customer_profiles in RDS to get the latest enriched customer attributes without any access to MaxCompute or DataWorks.
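The guide assumes the matching RDS table already exists. One possible MySQL DDL, mirroring the MaxCompute schema with native MySQL types (column names come from this guide; adjust types and sizes to your data):

```sql
CREATE TABLE IF NOT EXISTS cdp_customer_profiles (
    customer_id              VARCHAR(50) PRIMARY KEY,
    email                    VARCHAR(255),
    first_name               VARCHAR(100),
    last_name                VARCHAR(100),
    country                  VARCHAR(50),
    signup_date              DATE,
    total_orders             BIGINT,
    total_revenue            DECIMAL(12,2),
    avg_order_value          DECIMAL(12,2),
    first_purchase_date      DATE,
    last_purchase_date       DATE,
    days_since_last_purchase BIGINT,
    recency_segment          VARCHAR(20),
    frequency_segment        VARCHAR(20),
    monetary_segment         VARCHAR(20),
    total_page_views         BIGINT,
    last_session_date        DATETIME,
    engagement_segment       VARCHAR(30)
);
```

The primary key on customer_id is what makes the Insert or Update write mode work: each daily run updates the existing row per customer instead of appending duplicates.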

Under the hood, both directions of this pipeline run on Alibaba Cloud Data Integration, a synchronization platform that supports real-time and offline data exchange across more than 400 pairs of disparate data sources, including RDS databases, semi-structured storage, unstructured storage, NoSQL databases, and big data storage, across different networks and locations.
Step 7: Set Up the Pipeline Dependencies and Schedule

Your pipeline currently has five nodes: three sync nodes, one transformation node, and one write-back node. They need to run in the right order. If the transformation runs before the sync nodes finish, it operates on incomplete or stale partitions and produces wrong profiles.

In DataStudio, configure the dependencies between nodes by drawing lines between them on the workflow canvas:

sync_customers_rds_to_mc     --\
sync_transactions_rds_to_mc  ----> build_customer_profiles --> write_profiles_to_rds
sync_web_events_oss_to_mc    --/

build_customer_profiles will only trigger after all three upstream sync nodes have completed successfully. write_profiles_to_rds will only trigger after the transformation node completes.

Configure the schedule:

Click the root node of your workflow (or each node individually) and open the Properties panel. Set:

  • Scheduling Cycle: Daily
  • Start Time: 2:00 AM (or whatever time falls after your source systems finish their nightly processing)
  • Retry on Failure: Enable and set to 3 retries with a 5-minute interval

Publish the workflow to the production environment. DataWorks will execute the full pipeline automatically every morning. By the time your marketing team starts their day, the CDP table in RDS reflects last night's transactions, web events, and profile updates.

Step 8: Build Audience Segments for Campaigns

The unified profile table you built is useful on its own, but its real value comes from the audience segments you derive from it. A segment is simply a query against your CDP that produces a list of customer IDs and email addresses meeting specific criteria.

Here are four practical segment queries your marketing team can run or schedule:

High-value customers who have not purchased recently:

SELECT customer_id, email, first_name, last_purchase_date
FROM cdp.dws_customer_profiles
WHERE dt = '${bizdate}'
  AND monetary_segment = 'High Value'
  AND recency_segment IN ('At Risk', 'Lapsed')
ORDER BY days_since_last_purchase ASC;

This segment is your win-back priority list: customers who have spent a lot in the past but have not come back recently. A targeted re-engagement campaign to this group typically outperforms broad campaigns significantly.

New customers who are highly engaged but have not purchased:

SELECT customer_id, email, first_name, total_page_views
FROM cdp.dws_customer_profiles
WHERE dt = '${bizdate}'
  AND total_orders = 0
  AND engagement_segment = 'Highly Engaged'
  AND DATEDIFF('${bizdate}', signup_date) <= 14
ORDER BY total_page_views DESC;

This segment identifies visitors who signed up recently, are browsing frequently, but have not converted yet. They are your highest-likelihood first-purchase candidates.

Customers eligible for a loyalty upgrade:

SELECT customer_id, email, first_name, total_orders, total_revenue
FROM cdp.dws_customer_profiles
WHERE dt = '${bizdate}'
  AND frequency_segment = 'Medium'
  AND total_revenue BETWEEN 200 AND 999
  AND recency_segment = 'Active'
ORDER BY total_revenue DESC;

This segment identifies customers who are close to your high-value threshold and are still actively purchasing. A targeted campaign nudging them toward that tier tends to have strong ROI because the behaviour is already established.

You can save each of these as a DataWorks scheduled node that writes its output to a dedicated RDS table, or export the results to OSS for your email campaign tool to pick up.
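For the OSS export path, the file itself is just a CSV rendering of the segment result set. A dependency-free sketch of building that payload with the standard library; the actual upload would then be a single put_object call with the oss2 SDK, omitted here to keep the example self-contained:

```python
import csv
import io

def segment_to_csv(rows: list[dict], fields: list[str]) -> str:
    """Render a segment result set as a CSV string ready for upload."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Illustrative rows standing in for a segment query's output
rows = [
    {"customer_id": "c42", "email": "a@example.com"},
    {"customer_id": "c77", "email": "b@example.com"},
]
payload = segment_to_csv(rows, ["customer_id", "email"])
print(payload.splitlines()[0])  # customer_id,email
```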

Step 9: Add Data Quality Monitoring

A CDP is only as trustworthy as the data inside it. If the sync nodes run but write zero rows due to a connection issue, your transformation will run on empty data and produce misleading segments.

DataWorks provides built-in data quality rules that catch these issues before they affect your campaigns.

In DataWorks, navigate to Data Quality and create monitoring rules for your key tables.

Rule 1: Minimum row count check on daily syncs.

For cdp.ods_customers, create a rule that alerts if the row count for today's partition falls below a minimum threshold. Set this threshold to 80 percent of your average daily sync volume. A sync that produces fewer rows than expected indicates a connection issue or a change in the source table.

Rule 2: Null rate check on critical fields.

Create a rule that checks the null rate of the email field in cdp.dws_customer_profiles. If email nulls exceed 5 percent, something has changed in the source schema or the join logic. Flag this for immediate review before campaigns fire.

Rule 3: Duplicate customer ID check.

The unified profile should have exactly one row per customer per day. Create a uniqueness rule on customer_id in the daily partition of cdp.dws_customer_profiles. Any duplicate indicates a join problem that needs to be resolved.

Configure each rule to block the downstream pipeline node if it fails. This prevents incorrect data from propagating to RDS and being picked up by marketing campaigns.
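Outside DataWorks, the same three rules reduce to plain assertions you can run as a smoke test against a sample of the day's partition. A sketch using the thresholds from the rules above (the row shape is illustrative):

```python
def check_partition(rows: list[dict], expected_daily_avg: int) -> list[str]:
    """Return the names of any failed rules for one day's profile partition."""
    failures = []
    # Rule 1: row count at least 80 percent of the average daily volume
    if len(rows) < 0.8 * expected_daily_avg:
        failures.append("row_count")
    # Rule 2: null rate of email must stay under 5 percent
    nulls = sum(1 for r in rows if not r.get("email"))
    if rows and nulls / len(rows) > 0.05:
        failures.append("email_null_rate")
    # Rule 3: customer_id must be unique within the partition
    ids = [r["customer_id"] for r in rows]
    if len(ids) != len(set(ids)):
        failures.append("duplicate_customer_id")
    return failures

sample = [{"customer_id": "c1", "email": "a@x.com"},
          {"customer_id": "c2", "email": "b@x.com"}]
print(check_partition(sample, expected_daily_avg=2))  # []
```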

Step 10: Make CDP Data Available to Your Marketing Stack

With a clean, enriched customer profile table in ApsaraDB RDS, connecting your marketing tools is straightforward. Most email platforms, CRM systems, and ad platforms can read directly from a MySQL source.

Direct database connection: Configure your email platform or CRM to connect to the cdp_customer_profiles table in RDS using a read-only database user. Most marketing automation platforms support MySQL connections as a native integration or through a data connector.

DataWorks Data Service API: For tools that need an API rather than a direct database connection, DataWorks provides a Data Service module that wraps a MaxCompute or RDS query as a REST API endpoint. This is useful for real-time personalisation scenarios where a website or app needs to fetch a customer's segment information at page load time.

Scheduled CSV exports to OSS: For tools that prefer file-based ingestion, add a final node to your pipeline that exports the day's segment lists as CSV files to an OSS bucket. Your campaign tools pick up the files from the bucket on their own schedule.

Visualise CDP metrics in Quick BI: Connect the cdp.dws_customer_profiles table in MaxCompute directly to Alibaba Cloud Quick BI. Build a dashboard showing segment sizes, RFM distribution, and how your active/at-risk/lapsed/churned breakdown changes over time. This gives your marketing team a live view of the customer base health without querying the database directly.

Step 11: Extend the CDP as Your Data Grows

The pipeline you built handles two source systems: RDS transactions and OSS event logs. As your business grows, you will want to add more sources.

Adding a third-party marketing platform: If your email platform exports engagement data (opens, clicks, unsubscribes) as CSV to OSS or provides a direct API, you can add a new Data Integration node that syncs that data into MaxCompute and join it into the customer profile transformation query.

This gives you email engagement attributes alongside purchase behaviour in the same unified profile.

Real-time profile updates: The current pipeline runs daily. For businesses where customer behaviour changes fast enough to matter within the day, consider adding a real-time sync node alongside the batch nodes.

The real-time synchronization feature lets you combine multiple source types into a single synchronization solution and keeps destination tables continuously up to date as changes arrive.

A real-time sync on the transactions table means high-value purchases trigger an immediate profile update rather than waiting until 2:00 AM.

AI-powered segment scoring: Once your customer profiles are stable in MaxCompute, you can connect Alibaba Cloud Platform for AI (PAI) to the same table and build propensity models that score each customer's likelihood to purchase, churn, or upgrade.

These scores become additional columns in the unified profile, giving your campaigns a much richer targeting signal than rule-based segments alone.

Summary

In this guide you built a Customer Data Platform on Alibaba Cloud that unifies customer data from multiple sources into a single, queryable profile layer your entire marketing team can act on.

The pipeline you built covers:

  • Adding ApsaraDB RDS MySQL and OSS as data sources in DataWorks
  • Creating partitioned target tables in MaxCompute for raw data landing
  • Configuring Batch Synchronisation nodes to pull customer records, transactions, and web events on a daily schedule
  • Building a unified customer profile transformation that joins all sources and computes RFM segments and engagement attributes
  • Writing enriched profiles back to ApsaraDB RDS for marketing tool consumption
  • Configuring pipeline dependencies and a daily schedule so the whole process runs automatically
  • Deriving actionable audience segments from the unified profile table
  • Adding data quality rules that catch problems before they reach your campaigns
  • Connecting the CDP output to your downstream marketing stack

The architecture is modular. Every new data source you add becomes another sync node and a few more columns in the transformation query. Every new marketing tool you adopt can connect to the same RDS table or OSS export. The CDP grows with your business without requiring you to rebuild anything from scratch.



SQL examples and console navigation in this article reflect Alibaba Cloud DataWorks and MaxCompute as of early 2026. Field names in the sample schemas are illustrative. Adjust them to match your actual table structure before running any queries in production.

Kalpesh Parmar