Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.
Marketing teams often struggle with fragmented customer data spread across CRMs, email tools, databases, and spreadsheets. A Customer Data Platform (CDP) solves this by creating a unified, continuously updated customer profile from all data sources.
This guide shows how to use Alibaba Cloud DataWorks to integrate customer data from sources like ApsaraDB RDS, transform it in MaxCompute, and build a centralized customer profile layer for marketing tools and analytics.
Before writing any code or clicking anything in the console, it helps to understand the architecture you are building.
A well-designed CDP on Alibaba Cloud has four layers:
Source layer. This is where your raw customer data lives today: ApsaraDB RDS (MySQL or PostgreSQL) for transactional data, OSS for log files and event data from your website or mobile app, and any other structured or semi-structured sources your business uses.
Ingestion layer. DataWorks Data Integration handles the movement of data from your sources into the processing layer.
It supports batch synchronization for scheduled loads and real-time synchronization for continuous streaming, and it can move data across complex network environments such as VPCs and on-premises networks.
You can configure a batch synchronization node by using the codeless UI on the DataStudio page to periodically synchronize offline data. You can also create a real-time synchronization node on the DataStudio page to synchronize incremental data from a single table or a database in real time.
Processing and storage layer. MaxCompute is where your customer data lands after ingestion. This is where you run SQL-based transformations to build your unified customer profile table, calculate segment tags, and produce the clean, joined datasets your marketing team actually needs.
Serving layer. The output of your CDP feeds back into ApsaraDB RDS as clean, enriched customer tables that your marketing tools can query, or into OSS and DataWorks Data Service as an API that your campaign tools can call.
The build therefore has two broad steps. The first is unified data collection through DataWorks Data Integration (batch and real time). The second is full-pipeline data development in DataWorks: integration, ETL and transformation, metric computation, and the scheduling, monitoring, and alerting of the resulting jobs.
By the end of this guide your CDP will:
- Sync customer and transaction records from ApsaraDB RDS, and web event logs from OSS, into MaxCompute every day
- Build a unified customer profile table with RFM and engagement segments
- Write the enriched profiles back to RDS for marketing tools to query
- Enforce data quality rules so a bad load never reaches a campaign
Before starting, have the following ready:
- An Alibaba Cloud account with a DataWorks workspace and an attached MaxCompute project
- An ApsaraDB RDS for MySQL instance holding your customer and transaction tables, plus a read-only database user for DataWorks
- An OSS bucket receiving your website or app event logs
- A RAM user with AccessKey credentials that can read that bucket
DataWorks Basic Edition is free to use. For production workloads with scheduling at scale, the Professional Edition covers the core data development and governance features this guide relies on.
For this tutorial, assume your RDS MySQL database contains two tables relevant to the CDP:
A customers table with basic profile information:
CREATE TABLE customers (
customer_id VARCHAR(50) PRIMARY KEY,
email VARCHAR(255),
first_name VARCHAR(100),
last_name VARCHAR(100),
country VARCHAR(50),
signup_date DATE,
updated_at DATETIME
);
A transactions table with purchase history:
CREATE TABLE transactions (
transaction_id VARCHAR(50) PRIMARY KEY,
customer_id VARCHAR(50),
product_id VARCHAR(50),
amount DECIMAL(10,2),
transaction_date DATE,
status VARCHAR(20)
);
In your actual setup, these tables exist already. The structure above just illustrates what field names this guide assumes. Adjust the field names in the DataWorks sync configuration to match your actual schema.
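If you want to dry-run the pipeline against a scratch database before pointing it at production, a few seed rows are enough. The values below are entirely made up:

-- Hypothetical seed data for a scratch database (all values are illustrative)
INSERT INTO customers VALUES
('C001', 'ana@example.com', 'Ana', 'Silva', 'BR', '2025-11-02', '2026-01-14 09:30:00'),
('C002', 'liam@example.com', 'Liam', 'Chen', 'SG', '2024-03-18', '2026-01-14 17:05:00');

INSERT INTO transactions VALUES
('T100', 'C002', 'P17', 249.90, '2026-01-10', 'completed'),
('T101', 'C002', 'P08', 59.00, '2026-01-12', 'completed'),
('T102', 'C001', 'P17', 249.90, '2026-01-13', 'refunded');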
Add DataWorks IP addresses to the RDS whitelist:
DataWorks accesses your RDS instance over the network. Before adding RDS as a data source, add the DataWorks resource group IP addresses for your region to the RDS whitelist.
You can find the current IP ranges in the DataWorks documentation for your region. Without this step, the data source connection test will fail.
Log on to the DataWorks console. Select your workspace and navigate to Data Integration > Data Sources.
You need to add three data sources: your RDS MySQL instance as a source, your OSS bucket as a source, and MaxCompute as the destination.
Add ApsaraDB RDS MySQL as a data source:
Click New Data Source and select MySQL from the relational database section. Fill in the connection details:
| Field | Value |
|---|---|
| Data Source Name | customer_rds_prod |
| JDBC URL | jdbc:mysql://<your-rds-endpoint>:3306/<database-name> |
| Username | Your read-only database user |
| Password | The corresponding password |
Click Test Connection. A successful test confirms DataWorks can reach your RDS instance. If the test fails, the most common cause is the RDS whitelist not yet including the DataWorks IP addresses from the previous step.
Add OSS as a data source:
Click New Data Source and select OSS. Enter the Endpoint for your region (for example oss-ap-southeast-1.aliyuncs.com), the bucket name, and the AccessKey credentials of a RAM user with read access to the bucket.
MaxCompute is pre-configured as the default compute engine in a standard DataWorks workspace. No manual data source setup is needed for MaxCompute destinations.
Before you can sync data into MaxCompute, you need the target tables ready. Open DataStudio in your DataWorks workspace, navigate to your MaxCompute project, and run the following DDL statements to create the ODS (Operational Data Store) tables:
-- Raw customer records from RDS
CREATE TABLE IF NOT EXISTS cdp.ods_customers (
customer_id STRING,
email STRING,
first_name STRING,
last_name STRING,
country STRING,
signup_date STRING,
updated_at STRING
)
PARTITIONED BY (dt STRING)
LIFECYCLE 90;
-- Raw transaction records from RDS
CREATE TABLE IF NOT EXISTS cdp.ods_transactions (
transaction_id STRING,
customer_id STRING,
product_id STRING,
amount DOUBLE,
transaction_date STRING,
status STRING
)
PARTITIONED BY (dt STRING)
LIFECYCLE 90;
-- Raw website events from OSS logs
CREATE TABLE IF NOT EXISTS cdp.ods_web_events (
event_id STRING,
customer_id STRING,
event_type STRING,
page_url STRING,
event_time STRING
)
PARTITIONED BY (dt STRING)
LIFECYCLE 90;
Partitioning by dt (business date) means each daily sync writes to its own partition, which keeps historical data intact and makes it easy to reprocess a specific day if needed.
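For example, if one day's load turns out to be bad, you can drop just that partition and re-run the sync for that business date. A sketch, using an illustrative date:

-- Check how many rows landed for one business date
SELECT COUNT(*) FROM cdp.ods_customers WHERE dt = '2026-01-15';

-- Drop only that day's partition, then re-run the sync node for that date
ALTER TABLE cdp.ods_customers DROP IF EXISTS PARTITION (dt = '2026-01-15');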
With the target tables ready, the next step is ingestion: use Data Integration to extract basic customer information from RDS and website access logs from OSS into MaxCompute.
In DataStudio, create a workflow to hold your CDP pipeline nodes. Right-click the workflow folder and choose Create Workflow. Name it cdp_daily_pipeline.
Create the RDS-to-MaxCompute sync node:
Inside the workflow, click Create Node and select Data Integration > Batch Synchronization. Name the node sync_customers_rds_to_mc.
In the node editor, configure:
- Source: select customer_rds_prod as the data source, and write a query that reads yesterday's updated records:

SELECT
customer_id,
email,
first_name,
last_name,
country,
signup_date,
updated_at
FROM customers
WHERE DATE(updated_at) = '${bizdate}'
The ${bizdate} variable is substituted with the previous business date on each scheduled run; you control its format in the node's scheduling properties (for example, a custom parameter such as bizdate=$[yyyy-mm-dd-1] yields dates like 2026-01-15, matching the DATE column above). This incremental approach means the sync only moves records that changed since the last run, keeping sync times fast even as your table grows.
- Destination: select cdp.ods_customers as the target table, and set the partition to dt=${bizdate}.

Save and commit the node.
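One caveat: the incremental filter only captures changes, so the very first run needs history. A hedged approach is to temporarily widen the reader query for a one-off backfill, then restore the incremental WHERE clause:

-- One-off historical backfill: load everything up to the business date,
-- then switch back to the incremental filter for daily runs
SELECT
customer_id,
email,
first_name,
last_name,
country,
signup_date,
updated_at
FROM customers
WHERE DATE(updated_at) <= '${bizdate}'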
Create the transactions sync node:
Repeat the same process for the transactions table, creating a node named sync_transactions_rds_to_mc that reads from the transactions table and writes to cdp.ods_transactions.
Create the OSS events sync node:
Create a third Batch Synchronization node named sync_web_events_oss_to_mc. Set the source to OSS, point it to the path in your bucket where event log files are stored (for example oss://your-bucket/events/${bizdate}/), and set the destination to cdp.ods_web_events.
For OSS sources, DataWorks reads CSV or JSON files from the specified path. Set the file format to match your log format and map the columns to the target table fields.
With raw data landing in MaxCompute every day, the next step is transforming it into a unified customer profile.
This is the core of the CDP: a single table where every row represents one customer and every column is a marketing-relevant attribute derived from all available data sources.
Create a new MaxCompute SQL node in your workflow named build_customer_profiles. This node runs after all three sync nodes have completed.
-- Unified customer profile with RFM scores and engagement tags
INSERT OVERWRITE TABLE cdp.dws_customer_profiles PARTITION (dt='${bizdate}')
SELECT
c.customer_id,
c.email,
c.first_name,
c.last_name,
c.country,
c.signup_date,
-- Transaction-derived attributes
COALESCE(t.total_orders, 0) AS total_orders,
COALESCE(t.total_revenue, 0) AS total_revenue,
t.avg_order_value,
t.first_purchase_date,
t.last_purchase_date,
-- assumes ${bizdate} resolves to yyyy-mm-dd (see the scheduling parameters) and a
-- Hive-compatible two-argument DATEDIFF; add the 'dd' datepart argument if your
-- project requires MaxCompute's three-argument form
DATEDIFF('${bizdate}', t.last_purchase_date) AS days_since_last_purchase,
-- RFM segmentation
CASE
WHEN DATEDIFF('${bizdate}', t.last_purchase_date) <= 30 THEN 'Active'
WHEN DATEDIFF('${bizdate}', t.last_purchase_date) <= 90 THEN 'At Risk'
WHEN DATEDIFF('${bizdate}', t.last_purchase_date) <= 180 THEN 'Lapsed'
ELSE 'Churned'
END AS recency_segment,
CASE
WHEN t.total_orders >= 10 THEN 'High'
WHEN t.total_orders >= 3 THEN 'Medium'
ELSE 'Low'
END AS frequency_segment,
CASE
WHEN t.total_revenue >= 1000 THEN 'High Value'
WHEN t.total_revenue >= 200 THEN 'Mid Value'
ELSE 'Low Value'
END AS monetary_segment,
-- Web engagement attributes
COALESCE(e.total_page_views, 0) AS total_page_views,
e.last_session_date,
CASE
WHEN e.total_page_views >= 20 THEN 'Highly Engaged'
WHEN e.total_page_views >= 5 THEN 'Moderately Engaged'
ELSE 'Low Engagement'
END AS engagement_segment
FROM cdp.ods_customers c
LEFT JOIN (
SELECT
customer_id,
COUNT(DISTINCT transaction_id) AS total_orders,
SUM(amount) AS total_revenue,
AVG(amount) AS avg_order_value,
MIN(transaction_date) AS first_purchase_date,
MAX(transaction_date) AS last_purchase_date
FROM cdp.ods_transactions
WHERE status = 'completed'
GROUP BY customer_id
) t ON c.customer_id = t.customer_id
LEFT JOIN (
SELECT
customer_id,
COUNT(event_id) AS total_page_views,
MAX(event_time) AS last_session_date
FROM cdp.ods_web_events
GROUP BY customer_id
) e ON c.customer_id = e.customer_id
WHERE c.dt = '${bizdate}';
Create the output table before running this node:
CREATE TABLE IF NOT EXISTS cdp.dws_customer_profiles (
customer_id STRING,
email STRING,
first_name STRING,
last_name STRING,
country STRING,
signup_date STRING,
total_orders BIGINT,
total_revenue DOUBLE,
avg_order_value DOUBLE,
first_purchase_date STRING,
last_purchase_date STRING,
days_since_last_purchase BIGINT,
recency_segment STRING,
frequency_segment STRING,
monetary_segment STRING,
total_page_views BIGINT,
last_session_date STRING,
engagement_segment STRING
)
PARTITIONED BY (dt STRING)
LIFECYCLE 365;
This unified profile table is the heart of your CDP. Every customer record now carries attributes derived from transactions, web behaviour, and their original profile record.
Marketing teams can query it for audience segments, analysts can measure cohort behaviour, and campaign tools can consume it for personalisation.
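A quick way to sanity-check each day's build is to look at segment sizes for the partition just written. A sketch:

-- Segment distribution for today's build
SELECT recency_segment, monetary_segment, COUNT(*) AS customers
FROM cdp.dws_customer_profiles
WHERE dt = '${bizdate}'
GROUP BY recency_segment, monetary_segment
ORDER BY customers DESC
LIMIT 100;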
Many marketing tools, CRMs, and campaign platforms connect directly to MySQL rather than MaxCompute. To make your CDP output accessible to those tools, write the enriched customer profiles back to an ApsaraDB RDS table.
Create another Batch Synchronization node named write_profiles_to_rds. This time the direction is reversed: MaxCompute is the source and RDS is the destination.
- Source: select cdp.dws_customer_profiles, and set the partition to dt=${bizdate}.
- Destination: select customer_rds_prod, and set the target table to cdp_customer_profiles (create this table in RDS first with a matching schema; a sample DDL follows below).

Your marketing tools can now query cdp_customer_profiles in RDS to get the latest enriched customer attributes without any access to MaxCompute or DataWorks.
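If the RDS table does not exist yet, a minimal MySQL DDL mirroring the MaxCompute schema might look like this (the column types and precisions are assumptions; adjust them to your needs):

CREATE TABLE cdp_customer_profiles (
customer_id VARCHAR(50) PRIMARY KEY,
email VARCHAR(255),
first_name VARCHAR(100),
last_name VARCHAR(100),
country VARCHAR(50),
signup_date DATE,
total_orders BIGINT,
total_revenue DECIMAL(12,2),
avg_order_value DECIMAL(12,2),
first_purchase_date DATE,
last_purchase_date DATE,
days_since_last_purchase BIGINT,
recency_segment VARCHAR(20),
frequency_segment VARCHAR(20),
monetary_segment VARCHAR(20),
total_page_views BIGINT,
last_session_date VARCHAR(50),
engagement_segment VARCHAR(30)
);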
Under the hood, the same Data Integration engine handles every hop in this pipeline: it supports real-time and offline data exchange between more than 400 pairs of data sources, including relational databases such as RDS, semi-structured and unstructured storage, NoSQL databases, and big data stores.
Your pipeline currently has five nodes: three sync nodes, one transformation node, and one write-back node. They need to run in the right order: if the transformation runs before the sync nodes finish, it reads an empty or stale partition and produces misleading profiles.
In DataStudio, configure the dependencies between nodes by drawing lines between them on the workflow canvas:
sync_customers_rds_to_mc --\
sync_transactions_rds_to_mc ----> build_customer_profiles --> write_profiles_to_rds
sync_web_events_oss_to_mc --/
build_customer_profiles will only trigger after all three upstream sync nodes have completed successfully. write_profiles_to_rds will only trigger after the transformation node completes.
Configure the schedule:
Click the root node of your workflow (or each node individually) and open the Properties panel. Set:
- Scheduling Cycle: Day, so the pipeline runs once per calendar day
- Scheduled time: 02:00, early enough for the pipeline to finish before the workday starts
- Parameters: a custom bizdate parameter (for example bizdate=$[yyyy-mm-dd-1]) so ${bizdate} resolves to the previous day in the format the queries expect
- Alerting: notify the pipeline owner on failure or timeout
Publish the workflow to the production environment. DataWorks will execute the full pipeline automatically every morning. By the time your marketing team starts their day, the CDP table in RDS reflects last night's transactions, web events, and profile updates.
The unified profile table you built is useful on its own, but its real value comes from the audience segments you derive from it. A segment is simply a query against your CDP that produces a list of customer IDs and email addresses meeting specific criteria.
Here are three practical segment queries your marketing team can run or schedule:
High-value customers who have not purchased recently:
SELECT customer_id, email, first_name, last_purchase_date
FROM cdp.dws_customer_profiles
WHERE dt = '${bizdate}'
AND monetary_segment = 'High Value'
AND recency_segment IN ('At Risk', 'Lapsed')
ORDER BY days_since_last_purchase ASC
LIMIT 10000; -- MaxCompute requires a LIMIT whenever ORDER BY is used
This segment is your win-back priority list: customers who have spent a lot in the past but have not come back recently. A targeted re-engagement campaign to this group typically outperforms broad campaigns significantly.
New customers who are highly engaged but have not purchased:
SELECT customer_id, email, first_name, total_page_views
FROM cdp.dws_customer_profiles
WHERE dt = '${bizdate}'
AND total_orders = 0
AND engagement_segment = 'Highly Engaged'
AND DATEDIFF('${bizdate}', signup_date) <= 14
ORDER BY total_page_views DESC
LIMIT 10000; -- MaxCompute requires a LIMIT whenever ORDER BY is used
This segment identifies visitors who signed up recently, are browsing frequently, but have not converted yet. They are your highest-likelihood first-purchase candidates.
Customers eligible for a loyalty upgrade:
SELECT customer_id, email, first_name, total_orders, total_revenue
FROM cdp.dws_customer_profiles
WHERE dt = '${bizdate}'
AND frequency_segment = 'Medium'
AND total_revenue BETWEEN 200 AND 999
AND recency_segment = 'Active'
ORDER BY total_revenue DESC
LIMIT 10000; -- MaxCompute requires a LIMIT whenever ORDER BY is used
This segment identifies customers who are close to your high-value threshold and are still actively purchasing. A targeted campaign nudging them toward that tier tends to have strong ROI because the behaviour is already established.
You can save each of these as a DataWorks scheduled node that writes its output to a dedicated RDS table, or export the results to OSS for your email campaign tool to pick up.
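For example, materializing the win-back list as its own daily table is one more SQL node in the workflow. A sketch, where the segment table name is an assumption:

CREATE TABLE IF NOT EXISTS cdp.seg_winback_high_value (
customer_id STRING,
email STRING,
first_name STRING,
last_purchase_date STRING
)
PARTITIONED BY (dt STRING)
LIFECYCLE 30;

-- Rewrite the segment for each business date
INSERT OVERWRITE TABLE cdp.seg_winback_high_value PARTITION (dt = '${bizdate}')
SELECT customer_id, email, first_name, last_purchase_date
FROM cdp.dws_customer_profiles
WHERE dt = '${bizdate}'
AND monetary_segment = 'High Value'
AND recency_segment IN ('At Risk', 'Lapsed');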
A CDP is only as trustworthy as the data inside it. If the sync nodes run but write zero rows due to a connection issue, your transformation will run on empty data and produce misleading segments.
DataWorks provides built-in data quality rules that catch these issues before they affect your campaigns.
In DataWorks, navigate to Data Quality and create monitoring rules for your key tables.
Rule 1: Minimum row count check on daily syncs.
For cdp.ods_customers, create a rule that alerts if the row count for today's partition falls below a minimum threshold. Set this threshold to 80 percent of your average daily sync volume. A sync that produces fewer rows than expected indicates a connection issue or a change in the source table.
Rule 2: Null rate check on critical fields.
Create a rule that checks the null rate of the email field in cdp.dws_customer_profiles. If email nulls exceed 5 percent, something has changed in the source schema or the join logic. Flag this for immediate review before campaigns fire.
Rule 3: Duplicate customer ID check.
The unified profile should have exactly one row per customer per day. Create a uniqueness rule on customer_id in the daily partition of cdp.dws_customer_profiles. Any duplicate indicates a join problem that needs to be resolved.
Configure each rule to block the downstream pipeline node if it fails. This prevents incorrect data from propagating to RDS and being picked up by marketing campaigns.
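While you are still setting the managed rules up, you can run equivalent ad hoc checks by hand. A sketch against the tables above:

-- Rule 1 equivalent: did today's sync land a plausible number of rows?
SELECT COUNT(*) AS row_cnt
FROM cdp.ods_customers
WHERE dt = '${bizdate}';

-- Rule 2 equivalent: email null rate in the unified profiles
SELECT SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS email_null_rate
FROM cdp.dws_customer_profiles
WHERE dt = '${bizdate}';

-- Rule 3 equivalent: duplicate customer IDs in today's partition
SELECT customer_id, COUNT(*) AS dup_cnt
FROM cdp.dws_customer_profiles
WHERE dt = '${bizdate}'
GROUP BY customer_id
HAVING COUNT(*) > 1;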
With a clean, enriched customer profile table in ApsaraDB RDS, connecting your marketing tools is straightforward. Most email platforms, CRM systems, and ad platforms can read directly from a MySQL source.
Direct database connection: Configure your email platform or CRM to connect to the cdp_customer_profiles table in RDS using a read-only database user. Most marketing automation platforms support MySQL connections as a native integration or through a data connector.
DataWorks Data Service API: For tools that need an API rather than a direct database connection, DataWorks provides a Data Service module that wraps a MaxCompute or RDS query as a REST API endpoint. This is useful for real-time personalisation scenarios where a website or app needs to fetch a customer's segment information at page load time.
Scheduled CSV exports to OSS: For tools that prefer file-based ingestion, add a final node to your pipeline that exports the day's segment lists as CSV files to an OSS bucket. Your campaign tools pick up the files from the bucket on their own schedule.
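One way to produce those files is MaxCompute's UNLOAD statement, which writes query results to OSS. The bucket path is yours to choose, the project needs permission to write to OSS, and the exact storage-handler syntax should be checked against the current MaxCompute documentation; treat this as a sketch:

-- Export the day's active customers for file-based campaign tools
-- (path and storage handler are assumptions; verify against current docs)
UNLOAD FROM (
SELECT customer_id, email, first_name
FROM cdp.dws_customer_profiles
WHERE dt = '${bizdate}' AND recency_segment = 'Active'
)
INTO LOCATION 'oss://your-bucket/exports/active_customers/${bizdate}/'
STORED BY 'com.aliyun.odps.CsvStorageHandler';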
Visualise CDP metrics in Quick BI: Connect the cdp.dws_customer_profiles table in MaxCompute directly to Alibaba Cloud Quick BI. Build a dashboard showing segment sizes, RFM distribution, and how your active/at-risk/lapsed/churned breakdown changes over time. This gives your marketing team a live view of the customer base health without querying the database directly.
The pipeline you built handles two source systems: RDS transactions and OSS event logs. As your business grows, you will want to add more sources.
Adding a third-party marketing platform: If your email platform exports engagement data (opens, clicks, unsubscribes) as CSV to OSS or provides a direct API, you can add a new Data Integration node that syncs that data into MaxCompute and join it into the customer profile transformation query.
This gives you email engagement attributes alongside purchase behaviour in the same unified profile.
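A sketch of what that join input could look like, assuming a hypothetical cdp.ods_email_events table and event_type values; the aggregation plugs into build_customer_profiles as one more LEFT JOIN subquery, exactly like the web events block:

-- Hypothetical aggregation over a new cdp.ods_email_events table
-- (table name and event_type values are assumptions for illustration)
SELECT
customer_id,
SUM(CASE WHEN event_type = 'open' THEN 1 ELSE 0 END) AS email_opens,
SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END) AS email_clicks,
MAX(CASE WHEN event_type = 'unsubscribe' THEN 1 ELSE 0 END) AS unsubscribed
FROM cdp.ods_email_events
GROUP BY customer_id;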
Real-time profile updates: The current pipeline runs daily. For businesses where customer behaviour changes fast enough to matter within the day, consider adding a real-time sync node alongside the batch nodes.
The real-time synchronization feature can combine multiple types of sources and destinations in a single synchronization link, streaming changes between different kinds of data sources as they happen rather than on a schedule.
A real-time sync on the transactions table means high-value purchases trigger an immediate profile update rather than waiting until 2:00 AM.
AI-powered segment scoring: Once your customer profiles are stable in MaxCompute, you can connect Alibaba Cloud Platform for AI (PAI) to the same table and build propensity models that score each customer's likelihood to purchase, churn, or upgrade.
These scores become additional columns in the unified profile, giving your campaigns a much richer targeting signal than rule-based segments alone.
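A sketch of how that enrichment could look, assuming PAI batch scoring writes a hypothetical cdp.ads_propensity_scores table keyed by customer and date:

-- Hypothetical: PAI writes cdp.ads_propensity_scores
-- (customer_id, churn_score, purchase_score, dt); join onto the daily profiles
SELECT p.*, s.churn_score, s.purchase_score
FROM cdp.dws_customer_profiles p
LEFT JOIN cdp.ads_propensity_scores s
ON p.customer_id = s.customer_id AND s.dt = p.dt
WHERE p.dt = '${bizdate}';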
In this guide you built a Customer Data Platform on Alibaba Cloud that unifies customer data from multiple sources into a single, queryable profile layer your entire marketing team can act on.
The pipeline you built covers:
- Daily batch ingestion from ApsaraDB RDS and OSS into MaxCompute ODS tables
- A MaxCompute transformation that merges profile, transaction, and web event data into one table with RFM and engagement segments
- A write-back sync that publishes the enriched profiles to RDS for marketing tools
- Dependency-aware scheduling and data quality rules that stop bad data from reaching campaigns
The architecture is modular. Every new data source you add becomes another sync node and a few more columns in the transformation query. Every new marketing tool you adopt can connect to the same RDS table or OSS export. The CDP grows with your business without requiring you to rebuild anything from scratch.
SQL examples and console navigation in this article reflect Alibaba Cloud DataWorks and MaxCompute as of early 2026. Field names in the sample schemas are illustrative. Adjust them to match your actual table structure before running any queries in production.