MaxCompute cannot directly call external API operations. This tutorial provides a complete case study that shows you how to use an offline IP database and a user-defined function (UDF) to efficiently analyze IP address geolocations in MaxCompute.
Background
MaxCompute is a closed computing environment and cannot directly call external API operations via HTTP, such as the Taobao IP database API, to query IP geolocations in real time. The following three solutions can be used to analyze the geolocation of IP addresses in MaxCompute.
Solution | Implementation | Pros | Cons | Conclusion |
Export data for query | Export the IP addresses to be analyzed from MaxCompute to a local device, and then use a script to cyclically call API operations for queries. | Simple | Low efficiency. Subject to API rate limits (such as 10 QPS), making it unsuitable for large-scale data. | Not recommended |
Query a local database | Download both the IP database and the IP addresses to be analyzed to a local device, and then perform matching using a local program. | No API rate limits | The data is separated from the data warehouse and cannot be associated with other business data (such as order tables) in MaxCompute for analysis. | Not recommended |
Import the IP database for query | Upload the entire IP address database to a MaxCompute table, and then use a UDF in SQL to perform queries. |
| Requires regular manual updates to the IP database. | Recommended |
Conclusion: The third solution is the only one that can meet the demands for high-performance and highly integrated analysis in big data use cases. This tutorial details how to implement this solution.
Solution overview
This solution consists of three main modules:
Data preparation: Upload a public IP address database file to a MaxCompute table as a reference.
UDF development: Create a UDF to convert string IP addresses into integers. This is essential for efficient range matching.
SQL analysis: Write SQL statements that call the UDF to look up IP addresses in the MaxCompute table and obtain their geolocations.
Preparations
Create a workspace with Data Studio public preview enabled. This tutorial uses a DataWorks workspace in Basic mode.
Bind a serverless resource group to the workspace.
Bind a MaxCompute computing resource to the workspace and test its network connectivity.
Procedure
Prepare an IP address database
Sample database: ipdata.csv.
Sample data structure:
The data format is
UTF-8.The first four data fields are the start and end addresses of the IP range. The first two are in decimal integer format, and the next two are in dotted-decimal format.
Upload the IP address database to MaxCompute
Create a MaxCompute table.
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose in the Actions column.
On the Data Studio page, click the
icon in the navigation pane on the left to go to the Workspace Directories.Click the
icon and select .Enter a name and click OK.
On the node editor page, enter the following code:
-- If a table named ipresource already exists, delete it first. DROP TABLE IF EXISTS ipresource; -- Create a table to store IP address database information. CREATE TABLE IF NOT EXISTS ipresource ( start_ip BIGINT, -- Start address of the IP range (in decimal integer format). end_ip BIGINT, -- End address of the IP range (in decimal integer format). start_ip_arg STRING, -- Start address of the IP range (in dotted-decimal string format, such as "1.0.1.x"). end_ip_arg STRING, -- End address of the IP range (in dotted-decimal string format, such as "1.0.3.x"). country STRING, -- Country name. area STRING, -- Area name (usually empty or same as the country). city STRING, -- City name (may be a province for some data). county STRING, -- District/county name. isp STRING -- Internet service provider (such as China Telecom, China Unicom, or China Mobile). );In the Select DataSource dropdown list, select the MaxCompute computing resource you bound.
Click Running Duration and view the results.
Upload data to the table.
Click the
icon in the upper-left corner. In the displayed page, click . Click the
icon in the left navigation pane.Click Upload Data. Configure the parameters as described in the following table.
Parameter
Description
Specify Data to Be Uploaded
Select File
Upload the
ipdata.csvfile.Configure Destination Table
Compute Engine
MaxCompute
MaxComputeProject Name
Select the taget MaxCompute project.
Destination Table
Select the ipresource table.
ImportantIf the newly created table is not in the list, go to and manually refresh the metadata for
odps.<project_name>.ipresource.Resource Group
Select the serverless resource group you prepared.
Preview Data of Uploaded File
Click Mapping by Order to map the data in the .csv file to the fields of the ipresource table.
Click Upload Data.
Verify that the data is uploaded.
Return to the MaxCompute SQL node editor on the Data Studio page. Execute the following SQL query. If you can see the number of data rows and sample data, the upload was successful.
-- Query the number of data rows in the table. SELECT COUNT(*) FROM ipresource; -- Preview the first 10 rows of data. SELECT * FROM ipresource limit 10;
Develop a UDF
To look up the geolocation of an IP addresses in SQL, you need a UDF that converts a dotted-decimal IP string (such as '1.0.2xx.4x') into a comparable decimal integer.
On the Data Studio page, click the Resource Management icon
.Create a MaxCompute Python resource.
On the Resource Management page, click the
icon and select .In the dialog box, enter a name such as
mc.py, and click OK.Upload the local file mc.py to File Content and set Data Source to the computing resource you bound.
In the toolbar, click Save and Publish.
Create a MaxCompute UDF.
On the Resource Management page, click the
icon and choose .In the dialog box, enter a name such as
ip2int, and click OK. This will be the UDF name and cannot be changed after creation.On the editor page, configure the parameters. Key parameters are described below:
For more information, see MaxCompute resources and functions.
Parameter
Description
Function Type
Keep the default setting OTHER.
Data Source
Select the compute resource bound to the workspace.
Class Name
The main class that implements the UDF. Example:
mc.IPtoInt.Resources
Select the resource file
mc.pyfrom the previous step.In the toolbar, click Save and Publish.
Look up IP geolocation by using SQL
Return to the MaxCompute SQL node editor. Execute the following SQL query:
-- Replace this with the actual IP address you want to query. SELECT * FROM ipresource WHERE ip2int('1.0.2xx.4x') >= start_ip AND ip2int('1.0.2xx.4x') <= end_ip;In the Select DataSource dropdown list, select the MaxCompute computing resource you bound.
Click Running Duration. You can find the location of the IP address.