Processing Cartesian Products with PyODPS DataFrame

By Ji Sheng

PyODPS provides DataFrame APIs to analyze and preprocess large-scale data with interfaces like Pandas. This article mainly introduces how to use PyODPS to perform Cartesian product operations.

The most common scenario of Cartesian products is the comparison or operation of two collections. Taking the computation of geographic distance as an example, assume that the big table Coordinates1 stores the longitude and latitude coordinates of destination points, with a total of M rows of data, while the small table Coordinates2 stores the longitude and latitude coordinates of departure points, with a total of N rows of data. Now, the coordinates of all the departure points closest to each of the destination points need to be computed. For a destination point, we need to compute the distance from all the departure points to the destination point and find the minimum distance. Therefore, the entire intermediate process requires M * N data records, which results in a Cartesian product.

Haversine Formula

First, let's have a brief understanding of the background. Given the longitude and latitude of the coordinate points of two geographical locations, the haversine formula can be used to compute the distance between the two points. The Python expression is as follows:

def  haversine(lat1,  lon1,  lat2,  lon2):
        #  lat1,  lon1  为位置  1  的经纬度坐标
        #  lat2,  lon2  为位置  2  的经纬度坐标
        import  numpy  as  np

        dlon  =  np.radians(lon2  -  lon1)
        dlat  =  np.radians(lat2  -  lat1)
        a  =  np.sin(  dlat  /2  )  **2  +  np.cos(np.radians(lat1))  *  np.cos(np.radians(lat2))  *  np.sin(  dlon  /2  )  **2
        c  =  2  *  np.arcsin(np.sqrt(a))
        r  =  6371  #  地球平均半径，单位为公里
        return  c  *  r

MapJoin

Currently, MapJoin is the most recommended method. In PyODPS, the method of using MapJoin is simple. You only need to specify mapjoin=True when two dataframes join, and the MapJoin will be performed on the table on the right during execution.

In  [3]:  df1  =  o.get_table('coordinates1').to_df()                                                                                                                                                                                        

In  [4]:  df2  =  o.get_table('coordinates2').to_df()                                                                                                                                                                                        

In  [5]:  df3  =  df1.join(df2,  mapjoin=True)                                                                                                                                                                                                        

In  [6]:  df1.schema                                                                                                                                                                                                                                                      
Out[6]:  
odps.Schema  {
    latitude                    float64              
    longitude                  float64              
    id                                string                
}

In  [7]:  df2.schema                                                                                                                                                                                                                                                      
Out[7]:  
odps.Schema  {
    latitude                    float64              
    longitude                  float64              
    id                                string                
}

In  [8]:  df3.schema                                                                                                                                                                                                                                                      
Out[8]:  
odps.Schema  {
    latitude_x                        float64              
    longitude_x                      float64              
    id_x                                    string                
    latitude_y                        float64              
    longitude_y                      float64              
    id_y                                    string                
}

We can see that the _x and _y suffixes are added to the duplicate columns by default during Join operations. The suffixes can be customized by passing a binary tuple in the suffixes parameter. After the joined table is obtained, the distance can be computed by using the user-defined function DataFrame in PyODPS, which is very simple, clear and efficient.

In  [9]:  r  =  6371  
      ...:  dis1  =  (df3.latitude_y  -  df3.latitude_x).radians()  
      ...:  dis2  =  (df3.longitude_y  -  df3.longitude_x).radians()  
      ...:  a  =  (dis1  /  2).sin()  **  2  +  df3.latitude_x.radians().cos()  *  df3.latitude_y.radians().cos()  *  (dis2  /  2).sin()  **  2  
      ...:  df3['dis']  =  2  *  a.sqrt().arcsin()  *  r                                                                                                                                                                                              
                                                                                                                                                                                                        
In [12]: df3.head(10)                                                                                                                        
Out[12]: 
    latitude_x  longitude_x id_x  latitude_y   longitude_y id_y       dis
0   76.252432    59.628253    0   84.045210     6.517522    0  1246.864981
1   76.252432    59.628253    0   59.061796     0.794939    1  2925.953147
2   76.252432    59.628253    0   42.368304    30.119837    2  4020.604942
3   76.252432    59.628253    0   81.290936    51.682749    3   584.779748
4   76.252432    59.628253    0   34.665222   147.167070    4  6213.944942
5   76.252432    59.628253    0   58.058854   165.471565    5  4205.219179
6   76.252432    59.628253    0   79.150677    58.661890    6   323.070785
7   76.252432    59.628253    0   72.622352   123.195778    7  1839.380760
8   76.252432    59.628253    0   80.063614   138.845193    8  1703.782421
9   76.252432    59.628253    0   36.231584    90.774527    9  4717.284949

In [13]: df1.count()                                                                                                                         
Out[13]: 2000

In [14]: df2.count()                                                                                                                         
Out[14]: 100

In [15]: df3.count()                                                                                                                         
Out[15]: 200000

df3 already has M * N data records. Next, if we need to know the minimum distance, we can directly call groupby to df3, followed by the min aggregate function, to get the minimum distance of each target point.

In [16]: df3.groupby('id_x').dis.min().head(10)                                                                                              
Out[16]: 
       dis_min
0   323.070785
1    64.755493
2  1249.283169
3   309.818288
4  1790.484748
5   385.107739
6   498.816157
7   615.987467
8   437.765432
9   272.589621

DataFrame User-defined Function

If we need to know the city corresponding to the minimum distance point, that is, the corresponding ID in the table, we can call MapReduce after MapJoin, but we can also use the apply method of DataFrame. To use user-defined functions (UDFs) for one row, we can use the apply method. The axis parameter must be 1 to indicate that the operation works on the row.

Table Resources

Note that "apply" is a UDF executed on the server, so expressions similar to df=o.get_table('table_name').to_df() cannot be used in the function to obtain table data. The specific principle is covered in Where is the Code for PyODPS DataFrame Executed?. Take the case in this article as an example. To compute all records in Tables 1 and 2, we need to use table 2 as a resource table, and then reference the table resource in customization. It is also very convenient to use table resources in PyODPS. You only need to pass a collection into the resources parameter. A collection is an iteratable object, not a DataFrame object, and cannot directly call the DataFrame interface. Each iteration value is a namedtuple, and the corresponding value can be obtained by the field name or offset.

## use dataframe udf

df1 = o.get_table('coordinates1').to_df()
df2 = o.get_table('coordinates2').to_df()

def func(collections):
    import pandas as pd
    
    collection = collections[0]
    
    ids = []
    latitudes = []
    longitudes = []
    for r in collection:
        ids.append(r.id)
        latitudes.append(r.latitude)
        longitudes.append(r.longitude)

    df = pd.DataFrame({'id': ids, 'latitude':latitudes, 'longitude':longitudes})
    def h(x):        
        df['dis'] = haversine(x.latitude, x.longitude, df.latitude, df.longitude)
        return df.iloc[df['dis'].idxmin()]['id']
    return h

df1[df1.id, df1.apply(func, resources=[df2], axis=1, reduce=True, types='string').rename('min_id')].execute(
    libraries=['pandas.zip', 'python-dateutil.zip', 'pytz.zip', 'six.tar.gz'])

In a UDF, the table resource is looped to be read into the Pandas DataFrame, and the row corresponding to the minimum value can be easily found by using the Pandas loc, thus obtaining the ID of the closest departure point.

Global Variable

When the data size of a small table is small, we can even use the data in the small table as global variables in a UDF.

df1 = o.get_table('coordinates1').to_df()
df2 = o.get_table('coordinates2').to_df()
df = df2.to_pandas()

def func(x):
    df['dis'] = haversine(x.latitude, x.longitude, df.latitude, df.longitude)
    return df.iloc[df['dis'].idxmin()]['id']

df1[df1.id, df1.apply(func, axis=1, reduce=True, types='string').rename('min_id')].execute(
    libraries=['pandas.zip', 'python-dateutil.zip', 'pytz.zip', 'six.tar.gz'])

When the function is uploaded, the global variables (df in the above code) used in the function will be pickled into the UDF. However, note that applicable use cases for this method are very limited, because the size of files to be uploaded in ODPS is limited. Therefore, if there is too much data, the resources generated by the UDF will be too large to be uploaded. In addition, for this method, it is better to ensure that the client version of the third-party package is consistent with that of the server. Otherwise, serialization problems are very likely to occur, so we recommend that you only use this method when dealing with small data volumes.

Summary

PyODPS can be used to solve Cartesian product problems in two ways: One is MapJoin, which is intuitive and has good performance. Generally, we recommend MapJoin to solve problems that can be solved by MapJoin, and it is better to compute using built-in functions to maximize efficiency. The problem is that it is not flexible enough. The other is the DataFrame UDF, which is relatively flexible, but its performance is relatively poor (Pandas or NumPy can be used to improve performance). By using table resources, the small table is passed into the DataFrame UDF as a table resource to complete the Cartesian product operation.

Community

Processing Cartesian Products with PyODPS DataFrame

Haversine Formula

MapJoin

DataFrame User-defined Function

Table Resources

Global Variable

Summary

Read previous post:

Read next post:

Alibaba Cloud MaxCompute

You may also like

Comments

Alibaba Cloud MaxCompute

Related Products

Big Data Consulting for Data Technology Solution

MaxCompute

Big Data Consulting Services for Retail Solution

E-MapReduce Service