ODPS Optimization Suggestions for Beginners

By Huiling Zhou (Zhule)

Introduction

This article is a sharing on Open Data Processing Service (ODPS) offline development. It lists some important experiences and conclusions accumulated over the years, especially the skills and configuration methods in the daily data processing and scheduling of the algorithm. As for the specific reasons, we recommend visiting the Alibaba Cloud official website to check the working principles of the underlying map reduce and data flow.

Reference: https://www.alibabacloud.com/help/en/maxcompute/latest/overview-of-maxcompute-sql

Usage of Join

Input data: two test tables zhule_a and zhule_b. The specific schema and data are listed below (to illustrate the problems caused by duplicate data, a duplicate key=2, ds='20220930' data will be inserted into tables a and b, respectively):

read zhule_a;

key	ds
1	20220930
2	20220930
1	20221001
2	20221001
3	20221001

read zhule_b;

key	ds
2	20220930
3	20220930
1	20221001
2	20221001
3	20221001
4	20221001
5	20221001

Join/Inner Join

Usage: Returns the rows that have matching column values in the left table and the right table based on the Join condition. In a word, find the common parts of the two tables and pay attention to the performance optimization under the Cartesian product.

1. Select the subset of each table and then join:

-- better way to perform join, select small range of data first.
SELECT A.*, B.*
FROM
(SELECT * FROM A WHERE ds='20180101') A
JOIN
(SELECT * FROM B WHERE ds='20180101') B
ON a.key = b.key;

Note: Before performing Join operations, be sure to check whether there is duplicate data in the left and right tables. Otherwise, the final duplicate results will increase by the number of Cartesian products. For example, if there are two duplicate data in each left and right tables, there will be as many as four duplicate data after Join.

2. In the best case, large tables join small tables, and then mapjoin is used to implement.

Official explanation: In the map stage, MAPJOIN loads all data in the specified tables into the memory of the program that performs the JOIN operation. The tables specified for MAPJOIN must be small tables, and the total memory occupied by the table data cannot exceed 512 MB.

Limits on JOIN operations in MAPJOIN:

The left table in a LEFT OUTER JOIN operation must be a large table.
The right table in a RIGHT OUTER JOIN operation must be a large table.
MAPJOIN cannot be used in a FULL OUTER JOIN operation.
The left or right table in an INNER JOIN operation can be a large table.

SELECT  /*+ MAPJOIN(b) */
        a.*
FROM    test_a a
JOIN test_b b
ON      a.user_key = b.user_key
;
// Add a mark before the sql statement to say that this is mapjoin, and write the alias of the small table in parentheses.

An Interesting Point:

In addition to normal equality, mapjoin supports inequality, as shown the following example:

Left Join

Usage: A LEFT JOIN operation takes the Cartesian product of the rows in Table A and Table B and returns all the rows of Table A and rows in Table B that meet the join condition. If the Join condition finds no matching rows in Table B for a row in Table A, the row in Table A is returned in the result set with NULL values in each column from Table B. In a word, return all records in the left table A and data in the right table B that meet the association conditions. When the rows in the right table do not meet the association conditions, NULL values are returned.

The content of the left table must be retained. You can choose to use Left Join. For example, add key_attrs.
There is no essential difference between Right Join and Left Join. We recommend using Left Join after you define the left table.
If the right table has duplicate data, the final Left Join result will have duplicates.

Left Semi Join

Usage: A LEFT SEMI JOIN operation only returns the rows in Table A that have a matching row in Table B. For a piece of data in the left table A, if there are rows in the right table B that meet the association conditions, the left table A will be returned. Otherwise, it will not be returned.

If the right table B has no duplicate data and is the same as Join, only the same columns are retained.
The Left semi Join does not return any data from the right table B, so you cannot specify any filtering condition for the right table B in the where condition. The following example can be clearer (the example is quoted from the open-source forum).

employee (2 columns - e_id and e_name)
10, Tom
20, Jerry
30, Emily

employee_department_mapping (2 columns - e_id and d_name)
10, IT
10, Sales
20, HR

-- inner join results
SELECT e.e_id, e.e_name, d.d_name FROM
employee e INNER JOIN employee_department_mapping d
on e.e_id = d.e_id
-- results
10, Tom, IT
10, Tom, Sales
20, Jerry, HR

-- left semi join results
SELECT e.e_id, e.e_name, d.d_name FROM
employee e LEFT SEMI JOIN employee_department_mapping d
on e.e_id = d.e_id
-- results
10, Tom, IT
20, Jerry, HR

Left Anti Join

Usage: A LEFT ANTI JOIN operation only returns the rows in Table A that have no matching rows in Table B. In a word, for a piece of data in the left table A, if there is no data in the right table B that meets the association conditions, the left table A is returned.

The best scenario to use is to find the differences between the two tables.
The algorithm can be used to add and modify the extraction of goods every day during daily scheduling. Just put the key field in the ON condition.

Full Join

Usage: A FULL JOIN operation takes the Cartesian product of the rows in Table A and Table B and returns all the rows in Table A and Table B, whether the Join condition is met or not. In the result set, NULL values are returned in the columns from the table that lacks a matching row in the other table. In a word, return all records of the left table A and the right table B. For data that does not meet the association conditions, return NULL on the other side that is not associated.

It is useful to update the latest downstream data with addition, deletion, and modification operations, but few people know about it.

For example, today_feat is the new feature table computed today, and yest_feat is the feature of the previous partition.

SELECT  COALESCE(a.main_image_url,b.main_image_url) AS main_image_url
        ,COALESCE(a.embedding,b.embedding) AS embedding
FROM    today_feat a
FULL JOIN yest_feat b
ON      a.main_image_url = b.main_image_url

The effect of full Join is shown below, which meets the update of new, old, and updated features and is smoothly coupled with COALESCE.

Set Mapper and Reducepriority Appropriately

set odps.instance.priority

ODPS's update can only take effect in developing dev space. Setting priorities can improve the execution priority of queuing tasks to a certain extent. However, it will not take effect in the current online formal environment. We recommend optimizing your health scores and setting baselines for important online scheduling tasks to ensure the timeliness of output.

set odps.sql.mapper.split.size

Official guidance changes the input data amount of each Map worker, which is the split size of the input file. You can use this property to indirectly control the number of workers at each Map stage (default value: 256, unit: MB). In a word, if there are many small files, you can increase the value of split.size, which ensures that it is easier to apply to Mapper under limited resources and improves execution efficiency. If you have abundant resources and want more Mapper resources, reduce the value of split.size to apply for more Mappers and improve execution efficiency. Deal with it in light of the circumstances.

For example:

-- original sql
CREATE TABLE if not EXISTS tmp_zhl_test LIFECYCLE 1 AS
SELECT sig, venture, seller_id, COUNT(product_id) as cnt
FROM sku_main_image_sig
WHERE LENGTH(sig) > 10 --some bad cases may have weird sigs like '#NEXT#'
GROUP BY sig, venture, seller_id
HAVING cnt>2
;

If it is the default setting and 553 mappers and 184 reducers are allocated, it takes about 3m18s.

Under the condition of abundant resources, we can set odps.sql.mapper.split.size=64 to apply for more Mappers to process the parts of the file. At the same time, more reducers can be allocated. The execution time of the same SQL code is reduced to 2m34s.

Similarly, if your data volume is large with small pieces of data and limited space resources (after all, resource control is relatively strict now), instead of waiting for 9999 Mappers to be allocated, you can try to set odps.sql.mapper.split.size=2048 (or even larger) to reduce the number of Mappers to be allocated so the task can be executed quickly.

set odps.sql.reducer.instances

Display the number of reducers set (the default value is 0 to 4000). If it is not set, it will be dynamically allocated according to the task. After setting, the number of reducers set will be allocated. In the same example, when we set odps.sql.reducer.instances=1000, the duration becomes 2m.

set odps.sql.mapper(reducer).memory

Set the memory of each Map/Reducer worker (the default value is 1024, which can be set to a value between 256 and 12288). Generally, we do not need to set this value specially, but when the task reports an error and says, data exceeds the memory, we can set this option accordingly.

Use Third-Party Libraries in Python UDF

In this part, I would like to discuss how to install the required third-party libraries (such as numpy, opencv, etc.) in Python UDF of ODPS and the effective solutions if there is version incompatibility between different dependent libraries.

Upload&Call Package

You need to download the installation package xxx.whl of the third-party library directly onto your computer. Then, the consistency of multiple versions can be verified in the offline environment. We need to look at the Python version number required by the installation package and the compatible machine environment. Generally speaking, cp37-cp37m or py2.py3-none-any is in the middle, and then the installation package of x86_64 is at the end.
Convert xxx.whl to xxx.zip locally and use the command mv xxx.whl xxx.zip
Upload the zip resource file to the corresponding ODPS environment
In your UDF, use the following code to specify the path and reference of the resource package (just copy).

def include_package_path(res_name, lib_name):    
    archive_files = get_cache_archive(res_name)
    dir_names = sorted([os.path.dirname(os.path.normpath(f.name)) for f in archive_files
                        if '.dist_info' not in f.name], key=lambda v: len(v))
    for dir_name in dir_names:
        if dir_name.endswith(lib_name):
            sys.path.insert(0, os.path.dirname(dir_name))
            break
        elif os.path.exists(os.path.join(dir_name, lib_name + '.py')):
            sys.path.insert(0, os.path.abspath(dir_name))
            break

class PostProcess(BaseUDTF):
    def __init__(self):
        include_package_path('opencv_python-3.4.0.zip','cv2')
        include_package_path('numpy-1.16.6.zip','numpy')

After Python UDF is written, you can write your resource name into Resources in the creation function, so your resources will be effectively called after the process starts.

The default version of Python UDF is 2.x. If your Python version is 3.x, you need to add the following instructions before ODPS runs. At the same time, some functions need to open sandboxes. So, if an error is reported, you can add the sandbox command in the second line.

set odps.sql.python.version = cp37; --use python 3.7, default is 2.x version
set odps.isolation.session.enable = true;

Solve Compatibility Issue

Sometimes, when using multiple libraries, conflicts between different versions may occur (for example, when using the opencv library, if the corresponding numpy version is incompatible, an error will be reported). Therefore, before uploading resource packages of multiple libraries, you need to confirm the compatibility between versions. We do not recommend trying different versions. Confirm the version locally before uploading. The feasible steps are listed below:

A virtual environment can be built locally with tools like conda.
Use pip or conda install to install the three-party library you need locally.
Query the version of the three-party library and the dependency library you downloaded. For example, you can print cv2.__version__if it is python-opencv.
Follow the preceding method to download the corresponding version of the xxx.whl package and upload it to the ODPS resource for dependency.

Additional Recommendations When Releasing Tasks

When you release a task configuration, you can flexibly use exclude and extra to remove or add the dependencies you want. Exclude can remove temporary tables that you produce in the middle, while extra can help you add upstream tables that you want to depend on even though they are not in the code (this is useful when summarizing data from different upstream tables writing to corresponding partitions in the downstream and hoping to produce output data at the same time).

--exclude input or output tables (especially those tmp tables)
--@exclude_input=lsiqg_iqc_sku_product_detection_result
--@exclude_output=lsmp_sku_image_url_bizdate

-- include input or output tables (especially those separate venture tables)
--@extra_input=lsiqg_iqc_sku_product_detection_result
--@extra_output=lsmp_sku_image_url_bizdate

If you need to use a temporary table to transition the intermediate output data during the SQL code process (to avoid excessive SQL nesting that affects the operation efficiency), we recommend adding a timestamp ex. lsiqg_iqc_input_tmp_${bizdate} to the temporary table. Otherwise, a series of overwrite problems will occur when two tasks are scheduled at the same time and blocked or when data is being filled.
If the upstream table has multiple partitions, but the processing logic of each partition is the same (for example, the processing logic of partition tables in different countries is the same), we highly recommend summarizing the data of different partition tables in the first step and adding another partition (such as venture) to store the merged data. Example:

INSERT OVERWRITE TABLE sku_main_image_sig PARTITION (ds = '${bizdate}',venture)
SELECT  id
        ,image_url
        ,venture
FROM    (
            SELECT  id
                    ,image_url
                    ,'ID' AS venture
            FROM    auction_image_id
            WHERE   ds = MAX_PT('auction_image_id')
            UNION
            SELECT  id
                    ,image_url
                    ,'PH' AS venture
            FROM    auction_image_ph
            WHERE   ds = MAX_PT('auction_image_ph')
            UNION
            SELECT  id
                    ,image_url
                    ,'VN' AS venture
            FROM    auction_image_vn
            WHERE   ds = MAX_PT('auction_image_vn')
            UNION
            SELECT  id
                    ,image_url
                    ,'SG' AS venture
            FROM    auction_image_sg
            WHERE   ds = MAX_PT('auction_image_sg')
            UNION
            SELECT  id
                    ,image_url
                    ,'MY' AS venture
            FROM    auction_image_my
            WHERE   ds = MAX_PT('auction_image_my')
            UNION
            SELECT  id
                    ,image_url
                    ,'TH' AS venture
            FROM    auction_image_th
            WHERE   ds = MAX_PT('auction_image_th')
        ) union_table
;

For important data tables, you must set up monitoring to prevent data loss and abnormal output. The specific methods can be divided into two categories:

Set a task baseline to ensure the task priority so the scheduling time is more guaranteed
Set warning SMS/phone calls or DQC monitoring rules to monitor specific data

Simple tasks can be set in the task center details.

You can configure the DQC platform for more detailed data-level monitoring (such as no data output, data fluctuation, and data maximum/minimum monitoring).

Summary

Data development has a process from unfamiliarity to familiarity. If you write enough, you will find all kinds of useful tools/functions and all kinds of pitfalls. My experience from data acquisition to data development to data monitoring is listed below:

Verify the repeatability and validity of the data as soon as you get the data. Feedback if it is a problem in the group, and the upstream link problem should be filtered by us.
After writing each part of the data, verify the rationality first, which will improve the success rate of the final data.
Generally, after the node goes online, I will observe for 3-4 days to ensure the output is in line with expectations (if data that should be stable suddenly increases or decreases, there is likely a problem with the data logic).
Define reasonable data monitoring to avoid problems (such as empty data, data fluctuations, and unreasonable data fields)

Enjoy Data Engineering!!

Community

ODPS Optimization Suggestions for Beginners

Introduction

Usage of Join

Join/Inner Join

Left Join

Left Semi Join

Left Anti Join

Full Join

Set Mapper and Reducepriority Appropriately

set odps.instance.priority

set odps.sql.mapper.split.size

set odps.sql.reducer.instances

set odps.sql.mapper(reducer).memory

Use Third-Party Libraries in Python UDF

Upload&Call Package

Solve Compatibility Issue

Additional Recommendations When Releasing Tasks

Summary

Read previous post:

Read next post:

Alibaba Cloud Community

You may also like

Comments

Alibaba Cloud Community

Related Products

Big Data Consulting for Data Technology Solution

Big Data Consulting Services for Retail Solution

ApsaraDB for HBase

Financial Services Solutions