MaxCompute is a closed computing environment that cannot call external HTTP APIs directly. To query IP geolocations at scale, upload an offline IP database into MaxCompute and query it with a user-defined function (UDF).
By the end of this tutorial, you will have:
- Uploaded a public IP address database into a MaxCompute table
- Created a Python UDF that converts dotted-decimal IP strings to integers for range matching
- Queried IP geolocations using SQL within MaxCompute
Background
Because MaxCompute cannot make outbound HTTP calls, external IP lookup APIs (such as the Taobao IP database API) are not accessible from within a SQL job. The table below compares three approaches and explains why importing the IP database into MaxCompute is the right choice for large-scale analysis.
| Approach | How it works | Pros | Cons | Verdict |
|---|---|---|---|---|
| Export data for query | Export IP addresses from MaxCompute to a local device and call an external API in a loop | Simple | Rate-limited (~10 QPS); not viable for large datasets | Not recommended |
| Query a local database | Download both the IP database and the IP addresses locally, then match with a local program | No rate limits | Data is separated from MaxCompute; cannot join with other tables (such as order tables) | Not recommended |
| Import IP database for query | Upload the IP database to a MaxCompute table and query it with a UDF in SQL | Efficient (parallel computing), integrated (joins any MaxCompute table), scalable (any data size) | Requires periodic manual updates to the IP database | Recommended |
This tutorial implements the third approach.
Prerequisites
Before you begin, ensure that you have:
- A DataWorks workspace in basic mode (without the new version of Data Studio enabled)
- A serverless resource group bound to the workspace
- A MaxCompute computing resource bound to the workspace, with network connectivity verified
- A business flow created in the target workspace
How it works
The solution has three phases:
- Data preparation: Create a MaxCompute table and upload a public IP database file into it.
- UDF development: Write a Python UDF that converts a dotted-decimal IP string (such as '203.0.113.5') to a BIGINT for range matching (see the worked example after this list). Then register the UDF in DataWorks.
- SQL analysis: Call the UDF in a SQL query to look up the geolocation of any IP address in the database.
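For example, '203.0.113.5' converts to 203 × 2^24 + 0 × 2^16 + 113 × 2^8 + 5 = 3,405,803,781. This exceeds the 32-bit signed integer maximum of 2,147,483,647, which is why the conversion must return a BIGINT rather than an INT.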
Upload the IP address database
Create the table
- Log on to the DataWorks console. In the top navigation bar, select the region. In the left-side navigation pane, choose Data Development and O&M > Data Development. Select the target workspace from the drop-down list and click Go to Data Development.
- On the DataStudio page, right-click the target business flow and choose Create Node > MaxCompute > ODPS SQL.
- In the dialog box, select the computing resource for Engine Instance, enter a name, and click Confirm.
- In the node editor, run the following SQL to create the table:

  ```sql
  -- Drop the table if it already exists.
  DROP TABLE IF EXISTS ipresource;

  -- Create a table to store the IP database.
  CREATE TABLE IF NOT EXISTS ipresource (
      start_ip     BIGINT, -- Start of the IP range (decimal integer).
      end_ip       BIGINT, -- End of the IP range (decimal integer).
      start_ip_arg STRING, -- Start of the IP range (dotted-decimal, e.g. "1.0.1.x").
      end_ip_arg   STRING, -- End of the IP range (dotted-decimal, e.g. "1.0.3.x").
      country      STRING, -- Country name.
      area         STRING, -- Area name (usually empty or the same as the country).
      city         STRING, -- City name (may be a province for some data).
      county       STRING, -- District or county name.
      isp          STRING  -- Internet service provider (e.g., China Telecom, China Unicom, China Mobile).
  );
  ```

- Click the Run icon to run the code. In the Parameters dialog box, select the serverless resource group and click Run. Wait for Estimate MaxCompute Computing Cost to complete, then click Run.
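Optionally, confirm the schema before loading any data. A minimal check:

```sql
-- Inspect the schema of the newly created table.
DESC ipresource;
```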
Upload data to the table
The sample IP database file is ipdata.csv (UTF-8 encoded). The first four fields are the start and end addresses of each IP range: two in decimal integer format and two in dotted-decimal format.
- Click the icon in the upper-left corner. On the page that appears, click All Products > Data Integration > Upload and Download.
- Click the upload icon in the left-side navigation pane.
- Click Upload Data and configure the following parameters:

  Important: If the `ipresource` table does not appear in the list, go to Data Map > My Data > Refresh Table Metadata and manually refresh the metadata for `odps.<project_name>.ipresource`.

  | Parameter | Value |
  |---|---|
  | Specify Data to Be Uploaded > Select File | Upload the `ipdata.csv` file |
  | Configure Destination Table > Compute Engine | MaxCompute |
  | MaxCompute Project Name | Select the target MaxCompute project |
  | Destination Table | Select the `ipresource` table |
  | Resource Group | Select the serverless resource group |
  | Preview Data of Uploaded File | Click Mapping by Order to map the CSV columns to the table fields |

- Click Upload Data.
Verify the upload
Return to the ODPS SQL node editor on the DataStudio page and run the following queries:
```sql
-- Check the total number of rows loaded.
SELECT COUNT(*) FROM ipresource;

-- Preview the first 10 rows.
SELECT * FROM ipresource LIMIT 10;
```
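A rough distribution check can also catch column-mapping mistakes from the upload step. A sketch:

```sql
-- Top 10 countries by number of IP ranges. An empty or obviously
-- skewed result usually means the CSV columns were mapped incorrectly.
SELECT country, COUNT(*) AS range_cnt
FROM ipresource
GROUP BY country
ORDER BY range_cnt DESC
LIMIT 10;
```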
Develop the UDF
The UDF converts a dotted-decimal IP string to a BIGINT so that SQL can compare it against the `start_ip` and `end_ip` integer ranges in the `ipresource` table. BIGINT is required because the maximum IPv4 value (255.255.255.255) converts to 4,294,967,295, which is larger than the 32-bit signed integer limit.
This involves two steps: uploading the Python source as a MaxCompute resource, then registering that resource as a callable UDF function.
Create the Python resource
- On the DataStudio page, right-click the target business flow and choose Create Resource > MaxCompute > Python.
- In the dialog box, enter a name (for example, `mc.py`). Select Upload to MaxCompute and click Create.
- In the node editor, enter the following code:

  ```python
  from functools import reduce

  from odps.udf import annotate

  # Declare the UDF signature: takes a STRING, returns a BIGINT.
  @annotate("string->bigint")
  class IPtoInt(object):
      def evaluate(self, ip):
          try:
              # Convert each octet to an integer and combine them by bit-shifting.
              # Example: '203.0.113.5' -> (203 << 24) | (0 << 16) | (113 << 8) | 5
              return reduce(lambda x, y: (x << 8) + y, map(int, ip.split('.')))
          except Exception:
              # Return 0 for NULL or malformed input instead of failing the job.
              return 0
  ```

- Click the Save icon to save and upload the resource.
Register the UDF
- On the DataStudio page, right-click the target business flow and choose Create Function > MaxCompute > Function.
- In the dialog box, enter a name (for example, `ip2int`) and click Create. This name is the UDF name used in SQL and cannot be changed after creation.
- On the editor page, configure the following parameters. For a full parameter reference, see Register a MaxCompute function.

  | Parameter | Value |
  |---|---|
  | Function Type | Other Function (default) |
  | Class Name | `mc.IPtoInt` (format: `<resource_name_without_.py>.<ClassName>`) |
  | Resources | Select `mc.py` from the previous step |

- Click the Save icon to save the function.
- Click the Submit icon. In the dialog box, enter a change description and click Confirm.
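After the function is committed, a quick smoke test confirms that the UDF resolves and computes correctly. This sketch assumes your project allows SELECT without a FROM clause (supported in MaxCompute 2.0):

```sql
-- Expected output: 3405803781 (203*2^24 + 113*2^8 + 5).
SELECT ip2int('203.0.113.5');
```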
Query IP geolocations
- Return to the ODPS SQL node editor and run the following query. Replace '1.0.2xx.4x' with the actual IP address to look up.

  ```sql
  -- Replace the placeholder with the actual IP address to query.
  SELECT * FROM ipresource
  WHERE ip2int('1.0.2xx.4x') >= start_ip
    AND ip2int('1.0.2xx.4x') <= end_ip;
  ```
- Click the Run icon to run the code. In the Parameters dialog box, select the serverless resource group and click Run. Wait for Estimate MaxCompute Computing Cost to complete, then click Run.
- Check the query result. A matching row contains the country, city, and ISP for the queried IP address. If no rows are returned, either the IP address is not covered by the database or the UDF returned 0 for malformed input.
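Because the IP database is an ordinary MaxCompute table, the same UDF also scales to batch lookups against any other table. A sketch, assuming a hypothetical `user_log` table with a STRING column named `ip`; the MAPJOIN hint is used because MaxCompute only allows non-equi join conditions in map joins, and it assumes `ipresource` fits within the MAPJOIN size limit:

```sql
-- Geolocate every IP address in a (hypothetical) user_log table in one pass.
SELECT /*+ MAPJOIN(r) */
       l.ip,
       r.country,
       r.city,
       r.isp
FROM   user_log l
JOIN   ipresource r
ON     ip2int(l.ip) >= r.start_ip
   AND ip2int(l.ip) <= r.end_ip;
```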