×
Community Blog Distributed Pandas Processing with MaxCompute MaxFrame

Distributed Pandas Processing with MaxCompute MaxFrame

This article introduces how to use common Pandas operators with MaxFrame.

MaxCompute MaxFrame is a distributed computing framework developed by Alibaba Cloud. It is compatible with Pandas interfaces and automatically performs distributed processing. This ensures powerful data processing capabilities and greatly improves the data processing scale and computing efficiency.

MaxFrame allows you to use the same APIs as Pandas in a distributed environment to analyze data. With MaxFrame, you can perform data analysis and computation on MaxCompute with performance tens of times higher than that of open-source Pandas. This article introduces how to use common Pandas operators with MaxFrame.

Prerequisites

MaxFrame has been installed. For more information, please refer to Preparations.

Data Preparation

1.  Run the following script in the Python environment with MaxFrame installed to prepare test tables and test data:

from odps import ODPS
from maxframe.session import new_session
import maxframe.dataframe as md
import pandas as pd
o = ODPS(
    # Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_ID to the Access Key ID of your Alibaba Cloud account.
    # Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_SECRET to the Access Key Secret of your Alibaba Cloud account.
    # We recommend that you do not directly use your Access Key ID / Access Key Secret.
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='your-default-project',
    endpoint='your-end-point',
)
data_sets = [{
    "table_name": "product",
    "table_schema" : "index bigint, product_id bigint, product_name string, current_price bigint",
    "source_type": "records",
    "records" : [
        [1, 100, 'Nokia', 1000],
        [2, 200, 'Apple', 5000],
        [3, 300, 'Samsung', 9000]
    ],
},
{
    "table_name" : "sales",
    "table_schema" : "index bigint, sale_id bigint, product_id bigint, user_id bigint, year bigint, quantity bigint, price bigint",
    "source_type": "records",
    "records" : [
        [1, 1, 100, 101, 2008, 10, 5000],
        [2, 2, 300, 101, 2009, 7, 4000],
        [3, 4, 100, 102, 2011, 9, 4000],
        [4, 5, 200, 102, 2013, 6, 6000],
        [5, 8, 300, 102, 2015, 10, 9000],
        [6, 9, 100, 102, 2015, 6, 2000]
    ],
    "lifecycle": 5
}]
def prepare_data(o: ODPS, data_sets, suffix="", drop_if_exists=False):
    for index, data in enumerate(data_sets):
        table_name = data.get("table_name")
        table_schema = data.get("table_schema")
        source_type = data.get("source_type")
        
        if not table_name or not table_schema or not source_type:
            raise ValueError(f"Dataset at index {index} is missing one or more required keys: 'table_name', 'table_schema', or 'source_type'.")
        lifecycle = data.get("lifecycle", 5)
        table_name += suffix
        
        print(f"Processing {table_name}...")
        if drop_if_exists:
            print(f"Deleting {table_name}...")
            o.delete_table(table_name, if_exists=True)
        
        o.create_table(name=table_name, table_schema=table_schema, lifecycle=lifecycle, if_not_exists=True)
        if source_type == "local_file":
            file_path = data.get("file")
            if not file_path:
                raise ValueError(f"Dataset at index {index} with source_type 'local_file' is missing the 'file' key.")
            sep = data.get("sep", ",")
            pd_df = pd.read_csv(file_path, sep=sep)
            ODPSDataFrame(pd_df).persist(table_name, drop_table=True)
        elif source_type == 'records':
            records = data.get("records")
            if not records:
                raise ValueError(f"Dataset at index {index} with source_type 'records' is missing the 'records' key.")
            with o.get_table(table_name).open_writer() as writer:
                writer.write(records)
        else:
            raise ValueError(f"Unknown data set source_type: {source_type}")
        
        print(f"Processed {table_name} Done")
prepare_data(o, data_sets, "_maxframe_demo", True)

Parameters:

  • ALIBABA_CLOUD_ACCESS_KEY_ID: Set this environment variable to the AccessKey ID that has the necessary MaxCompute permissions for the destination MaxCompute project's objects. You can obtain the AccessKey ID on the AccessKey management page.
  • ALIBABA_CLOUD_ACCESS_KEY_SECRET: Set this environment variable to the AccessKey Secret corresponding to the AccessKey ID.
  • your-default-project: the name of your MaxCompute project. You can log on to the MaxCompute console. In the left-side navigation pane, choose Workspaces > Projects to view the name of the MaxCompute project.
  • your-end-point: the endpoint of the region where the destination MaxCompute project resides. You can select a network connection method, such as http://service.cn-chengdu.maxcompute.aliyun.com/api. For more information, please refer to Configure endpoints.

2.  Execute the following SQL statement to query data in the sales_maxframe_demo and product_maxframe_demo:

-- Query the sales_maxframe_demo table.
SELECT * FROM sales_maxframe_demo;
-- The following result is returned:
+------------+------------+------------+------------+------------+------------+------------+
| index      | sale_id    | product_id | user_id    | year       | quantity   | price      |
+------------+------------+------------+------------+------------+------------+------------+
| 1          | 1          | 100        | 101        | 2008       | 10         | 5000       |
| 2          | 2          | 300        | 101        | 2009       | 7          | 4000       |
| 3          | 4          | 100        | 102        | 2011       | 9          | 4000       |
| 4          | 5          | 200        | 102        | 2013       | 6          | 6000       |
| 5          | 8          | 300        | 102        | 2015       | 10         | 9000       |
| 6          | 9          | 100        | 102        | 2015       | 6          | 2000       |
+------------+------------+------------+------------+------------+------------+------------+
-- Query the product_maxframe_demo table.
SELECT * FROM product_maxframe_demo;
-- The following result is returned:
+------------+------------+--------------+---------------+
| index      | product_id | product_name | current_price |
+------------+------------+--------------+---------------+
| 1          | 100        | Nokia        | 1000          |
| 2          | 200        | Apple        | 5000          |
| 3          | 300        | Samsung      | 9000          |
+------------+------------+--------------+---------------+

Use MaxFrame for Data Analysis

Scenario 1: Use the Merge Method to Connect Two Data Tables to Obtain the product_name Corresponding to All sale_id in the sales_maxframe_demo Table and All year Values and price Values of the Product.

• Sample code

from odps import ODPS
from maxframe.session import new_session
import maxframe.dataframe as md
o = ODPS(
    # Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_ID to the Access Key ID of your Alibaba Cloud account.
    # Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_SECRET to the Access Key Secret of your Alibaba Cloud account.
    # We recommend that you do not directly use your Access Key ID / Access Key Secret.
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='your-default-project',
    endpoint='your-end-point',
)
session = new_session(o)
# The session id is a string of characters used to associate a MaxFrame task. It is important for debugging and tracking the task status.
print(session.session_id)
sales = md.read_odps_table("sales_maxframe_demo", index_col="index")
product = md.read_odps_table("product_maxframe_demo", index_col="product_id")
# The df is not immediately executed unless you use the df.execute() to trigger it.
# This means that all computations will be fully completed within the MaxCompute cluster, avoiding unnecessary intermediate data transfers and bottlenecks.
df = sales.merge(product, left_on="product_id", right_index=True)
df = df[["product_name", "year", "price"]]
print(df.execute().fetch())
# Save the result to the MaxCompute table and destroy the Session.
md.to_odps_table(df, "result_df", overwrite=True).execute()
session.destroy()

• The following result is returned:

index product_name  year  price                   
1            Nokia  2008   5000
2          Samsung  2009   4000
3            Nokia  2011   4000
4            Apple  2013   6000
5          Samsung  2015   9000
6            Nokia  2015   2000

• Performance comparison

In a data sample where the sales table has 50 million records (size: 1.96 GB) and the product table has 100,000 records (size: 3 MB), the computation yields the following time comparison results:

Environment Time (unit: seconds)
Local Pandas (version 1.3.5) 65.8
MaxFrame 22

Scenario 2: Select the Product ID, Year, Quantity, and Price of Each Product Sold in the First Year.

• Sample code:

from odps import ODPS
from maxframe.session import new_session
import maxframe.dataframe as md
o = ODPS(
    # Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_ID to the Access Key ID of your Alibaba Cloud account.
    # Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_SECRET to the Access Key Secret of your Alibaba Cloud account.
    # We recommend that you do not directly use your Access Key ID / Access Key Secret.
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='your-default-project',
    endpoint='your-end-point',
)
session = new_session(o)
# The session id is a string of characters used to associate a MaxFrame task. It is important for debugging and tracking the task status.
print(session.session_id)
# Aggregate to get the first year of each product.
min_year_df = md.read_odps_table("sales_maxframe_demo", index_col="index")
min_year_df = min_year_df.groupby('product_id', as_index=False).agg(first_year=('year', 'min'))
# Find the corresponding sales record through join.
sales = md.read_odps_table("sales_maxframe_demo", index_col=['product_id', 'year'])
result_df = md.merge(sales, min_year_df, 
                        left_index=True, 
                        right_on=['product_id','first_year'],
                        how='inner')
# The result_df is not immediately executed unless you use the result_df.execute() to trigger it.
# This means that all computations will be fully completed within the MaxCompute cluster, avoiding unnecessary intermediate data transfers and bottlenecks.
result_df = result_df[['product_id', 'first_year', 'quantity', 'price']]
print(result_df.execute().fetch())
# Destroy the Session.
session.destroy()

• The following result is returned:

product_id  first_year  quantity  price
100         100        2008        10   5000
300         300        2009         7   4000
200         200        2013         6   6000

• Performance comparison

In a data sample where the sales table has 50 million records (size: 1.96 GB) and the product table has 100,000 records (size: 3 MB), the computation yields the following time comparison results:

Environment Time (unit: seconds)
Local Pandas (version 1.3.5) 186
MaxFrame 21

Scenario 3: Obtain the Most Consumed Product ID for Each User

Note: This scenario demonstrates multiple groupby, join, drop_duplicates, and sort_values operations.

• Sample code

from odps import ODPS
from maxframe.session import new_session
import maxframe.dataframe as md
o = ODPS(
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='your-default-project',
    endpoint='your-end-point',
)
session = new_session(o)
# The session id is a string of characters used to associate a MaxFrame task. It is important for debugging and tracking the task status.
print(session.session_id)
sales = md.read_odps_table("sales_maxframe_demo", index_col="index")
product = md.read_odps_table("product_maxframe_demo", index_col="product_id")
sales['total'] = sales['price'] * sales['quantity']
product_cost_df = sales.groupby(['product_id', 'user_id'], as_index=False).agg(user_product_total=('total','sum'))
product_cost_df = product_cost_df.merge(product, left_on="product_id", right_index=True, how='right')
user_cost_df = product_cost_df.groupby('user_id').agg(max_total=('user_product_total', 'max'))
merge_df = product_cost_df.merge(user_cost_df, left_on='user_id', right_index=True)
# The result_df is not immediately executed unless you use the result_df.execute() to trigger it.
# This means that all computations will be fully completed within the MaxCompute cluster, avoiding unnecessary intermediate data transfers and bottlenecks.
result_df = merge_df[merge_df['user_product_total'] == merge_df['max_total']][['user_id', 'product_id']].drop_duplicates().sort_values(['user_id'], ascending = [1])
print(result_df.execute().fetch())
# Destroy the Session.
session.destroy()

• The following result is returned:

user_id  product_id
100      101         100
300      102         300

• Performance comparison

In a data sample where the sales table has 50 million records (size: 1.96 GB) and the product table has 100,000 records (size: 3 MB), the computation yields the following time comparison results:

Environment Time (unit: seconds)
Local Pandas (version 1.3.5) 176
MaxFrame 85

Test Conclusion

MaxFrame is compatible with Pandas interfaces and is capable of automatic distributed processing. This ensures powerful data processing capabilities and greatly improves the data processing scale and computing efficiency.

1 1 0
Share on

Alibaba Cloud MaxCompute

137 posts | 18 followers

You may also like

Comments

Santhakumar Munuswamy August 5, 2024 at 6:54 am

Thanks for the sharing

Alibaba Cloud MaxCompute

137 posts | 18 followers

Related Products