Process unstructured data in OSS using external volumes for Spark and Proxima jobs - MaxCompute

External volumes act as distributed file systems in MaxCompute, backed by Object Storage Service (OSS). Mount an external volume to an OSS directory, and your Spark on MaxCompute and MapReduce jobs can read and write files through MaxCompute's permission system—without granting direct OSS access to each user.

Each MaxCompute project can have multiple external volumes.

Use cases

External volumes are useful when you need to:

Load job dependencies at startup — automatically download JAR files, Python wheels, or model archives to the job's working directory before execution starts
Read and write OSS files in Spark code — access files stored in OSS using the odps:// path scheme directly in your Spark job code
Apply fine-grained permission control — use MaxCompute's permission system to control who can read or write specific volume paths, instead of managing OSS bucket policies per user
Store ML job outputs — save index data or model files generated by engines like Proxima CE back to OSS through a volume

Billing

Data in external volumes is stored in OSS. You are not charged for storage within MaxCompute. Compute charges apply when a MaxCompute engine reads or processes data in an external volume—for example, when running a Spark on MaxCompute or MapReduce job. Outputs written back to OSS (such as index data from Proxima CE) are charged standard OSS storage fees.

Prerequisites

Before you begin, ensure that you have:

Submitted and received approval for trial use of external volumes. See Apply for trial use of new features
MaxCompute client (odpscmd) V0.43.0 or later installed. See MaxCompute client (odpscmd). If you use the SDK for Java, version V0.43.0 or later is required. See Version updates
An OSS bucket created. See Create buckets
Your MaxCompute project authorized to access OSS. See Configure an OSS access method

Quick start

Step 1: Grant the required permissions

To use external volumes, your account needs the following permissions: CreateInstance, CreateVolume, List, Read, and Write. See MaxCompute permissions.

Check whether your account has the CreateVolume permission:
```
SHOW GRANTS FOR <user_name>;
```

If the CreateVolume permission is missing, grant it:

GRANT CreateVolume ON project <project_name> TO USER <user_name>;

To revoke the permission later:

REVOKE CreateVolume ON project <project_name> FROM USER <user_name>;

After you grant the permission, run SHOW GRANTS again to confirm that CreateVolume is listed.

Step 2: Create an external volume

Run the following command using the account that has the CreateVolume permission:

vfs -create <volume_name>
    -storage_provider oss
    -url oss://<oss_endpoint>/<bucket>/<path>
    -acd <true|false>
    -role_arn <arn:aliyun:xxx/aliyunodpsdefaultrole>

For parameter details and other volume operations, see External volume operations.

After creation, the volume path is odps://[project_name]/[volume_name]. Use this path in Spark on MaxCompute and MapReduce jobs.
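The path format can be checked before a job is submitted. The following is a minimal sketch, not part of any MaxCompute SDK: `VolumePath` and `parse_volume_path` are hypothetical helper names, and the sketch only assumes the odps://[project_name]/[volume_name]/[path] layout described above.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class VolumePath(NamedTuple):
    project: str
    volume: str
    relative_path: str  # empty string when the path points at the volume root


def parse_volume_path(path: str) -> VolumePath:
    """Split odps://[project_name]/[volume_name]/[path] into its parts."""
    parts = urlsplit(path)
    if parts.scheme != "odps" or not parts.netloc:
        raise ValueError(f"not an odps:// volume path: {path!r}")
    # The first path segment is the volume name; the rest is the file path.
    volume, _, relative = parts.path.lstrip("/").partition("/")
    if not volume:
        raise ValueError(f"missing volume name: {path!r}")
    return VolumePath(parts.netloc, volume, relative)


print(parse_volume_path("odps://mc_project/external_volume/data/mllib/kmeans_data.txt"))
```

A check like this catches a path that accidentally uses the oss:// scheme or omits the volume name before the job reaches the cluster.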

Step 3: Verify the volume

List all volumes in the current project to confirm the volume was created:

vfs -ls /;

Use Spark on MaxCompute with external volumes

Spark on MaxCompute is compatible with open source Spark and runs on MaxCompute's integrated computing resources, datasets, and permission system.

There are two ways to access external volumes from a Spark job:

Reference files at job startup — volume files are downloaded to the job's working directory before the job starts
Access files in code — use the odps:// path scheme directly in your Spark code to read and write volume files at runtime

Reference files at job startup

Configure the following parameters in the Parameters section of the DataWorks ODPS Spark node, or in the spark-defaults.conf file. These parameters cannot be set inside your job code.

Parameter	Description
`spark.hadoop.odps.cupid.volume.files`	Files to download to the job's working directory before startup. Separate multiple files with commas. Each value must include the full file name.
`spark.hadoop.odps.cupid.volume.archives`	Archive files (`.zip`, `.tar.gz`, `.tar`) to download and decompress to the job's working directory before startup. Separate multiple archives with commas.

Value format:

odps://[project_name]/[volume_name]/[path_to_file]

Example — files:

spark.hadoop.odps.cupid.volume.files=
odps://mc_project/external_volume/data/mllib/kmeans_data.txt,
odps://mc_project/external_volume/target/PythonKMeansExample/KMeansModel/data/part-00000-a2d44ac5-54f6-49fd-b793-f11e6a189f90-c000.snappy.parquet

After the job starts, the working directory contains kmeans_data.txt and part-00000-a2d44ac5-54f6-49fd-b793-f11e6a189f90-c000.snappy.parquet.

Example — archives:

spark.hadoop.odps.cupid.volume.archives=
odps://spark_test_wj2/external_volume/pyspark-3.1.1.zip,
odps://spark_test_wj2/external_volume/python-3.7.9-ucs4.tar.gz

After the job starts, the working directory contains the decompressed contents of pyspark-3.1.1.zip and python-3.7.9-ucs4.tar.gz.
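When the list of dependencies is long, the two parameter values can be generated instead of typed by hand. The sketch below is illustrative only: `startup_download_conf` and `working_dir_names` are hypothetical helpers, and routing every `.zip`, `.tar.gz`, or `.tar` path to the archives parameter is an assumption of this sketch (an archive that should stay compressed would belong in the files parameter instead).

```python
import posixpath

# Archive types named in the parameter table above.
ARCHIVE_SUFFIXES = (".zip", ".tar.gz", ".tar")


def startup_download_conf(volume_paths):
    """Group odps:// paths into the two startup-download parameters."""
    files, archives = [], []
    for path in volume_paths:
        target = archives if path.endswith(ARCHIVE_SUFFIXES) else files
        target.append(path)
    conf = {}
    if files:
        conf["spark.hadoop.odps.cupid.volume.files"] = ",".join(files)
    if archives:
        conf["spark.hadoop.odps.cupid.volume.archives"] = ",".join(archives)
    return conf


def working_dir_names(conf):
    """File names expected in the working directory for the files parameter."""
    value = conf.get("spark.hadoop.odps.cupid.volume.files", "")
    return [posixpath.basename(p) for p in value.split(",") if p]


conf = startup_download_conf([
    "odps://mc_project/external_volume/data/mllib/kmeans_data.txt",
    "odps://spark_test_wj2/external_volume/pyspark-3.1.1.zip",
    "odps://spark_test_wj2/external_volume/python-3.7.9-ucs4.tar.gz",
])
for key, value in conf.items():
    print(f"{key}={value}")
```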

Access files in code

To read and write external volume files from your Spark job code, configure the following parameters for the job:

Parameter	Value	Description
`spark.hadoop.odps.volume.common.filesystem`	`true`	Enables external volume recognition. Default: `false`.
`spark.hadoop.odps.cupid.volume.paths`	`odps://[project_name]/[volume_name]/`	The volume path to access. Default: empty.
`spark.hadoop.fs.odps.impl`	`org.apache.hadoop.fs.aliyun.volume.OdpsVolumeFileSystem`	File system implementation class for the `odps://` scheme.
`spark.hadoop.fs.AbstractFileSystem.odps.impl`	`org.apache.hadoop.fs.aliyun.volume.abstractfsimpl.OdpsVolumeFs`	Abstract file system implementation class.
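Only the volume path changes from project to project, so the four entries can be rendered from a project name and a volume name. This is a small sketch with hypothetical helper names (`volume_access_conf`, `to_properties`); it produces key=value lines in the format used by spark-defaults.conf and does not talk to MaxCompute.

```python
def volume_access_conf(project, volume):
    """Return the four volume-access parameters from the table above."""
    return {
        "spark.hadoop.odps.volume.common.filesystem": "true",
        "spark.hadoop.odps.cupid.volume.paths": f"odps://{project}/{volume}/",
        "spark.hadoop.fs.odps.impl":
            "org.apache.hadoop.fs.aliyun.volume.OdpsVolumeFileSystem",
        "spark.hadoop.fs.AbstractFileSystem.odps.impl":
            "org.apache.hadoop.fs.aliyun.volume.abstractfsimpl.OdpsVolumeFs",
    }


def to_properties(conf):
    """Render a dict as key=value lines, one per entry."""
    return "\n".join(f"{key}={value}" for key, value in conf.items())


print(to_properties(volume_access_conf("ms_proj1_dev", "volume_yyy1")))
```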

Example — K-means clustering with external volume:

The following example uses the K-means algorithm. It reads training data from odps://ms_proj1_dev/volume_yyy1/, trains a model, and saves the output back to the same volume.

All file paths in the code use the odps:// scheme to read from and write to the external volume.

Note

Set the four parameters above in spark-defaults.conf or in the Parameters section of the DataWorks ODPS Spark node before running this code. The example also requires the following additional parameters for OSS access, the JindoFS SDK, and the Python runtime:

-- Parameters
spark.hadoop.odps.cupid.volume.paths=odps://ms_proj1_dev/volume_yyy1/
spark.hadoop.odps.volume.common.filesystem=true
spark.hadoop.fs.odps.impl=org.apache.hadoop.fs.aliyun.volume.OdpsVolumeFileSystem
spark.hadoop.fs.AbstractFileSystem.odps.impl=org.apache.hadoop.fs.aliyun.volume.abstractfsimpl.OdpsVolumeFs

spark.hadoop.odps.access.id=xxxxxxxxx
spark.hadoop.odps.access.key=xxxxxxxxx
spark.hadoop.fs.oss.endpoint=oss-cn-beijing-internal.aliyuncs.com
spark.hadoop.fs.oss.impl=com.aliyun.emr.fs.oss.JindoOssFileSystem

spark.hadoop.odps.cupid.resources=ms_proj1_dev.jindofs-sdk-3.8.0.jar,public.python-2.7.13-ucs4.tar.gz
spark.pyspark.python=./public.python-2.7.13-ucs4.tar.gz/python-2.7.13-ucs4/bin/python
spark.hadoop.odps.spark.version=spark-2.4.5-odps0.34.0

-- Code

from numpy import array
from math import sqrt

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans, KMeansModel

if __name__ == "__main__":
    sc = SparkContext(appName="KMeansExample")

    # Read training data from the external volume
    data = sc.textFile("odps://ms_proj1_dev/volume_yyy1/kmeans_data.txt")
    parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

    # Train the K-means model
    clusters = KMeans.train(parsedData, 2, maxIterations=10, initializationMode="random")

    # Evaluate the model
    def error(point):
        center = clusters.centers[clusters.predict(point)]
        return sqrt(sum([x**2 for x in (point - center)]))

    WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
    print("Within Set Sum of Squared Error = " + str(WSSSE))

    # Save the model to the external volume
    clusters.save(sc, "odps://ms_proj1_dev/volume_yyy1/target/PythonKMeansExample/KMeansModel")
    print(parsedData.map(lambda feature: clusters.predict(feature)).collect())

    # Load and use the saved model
    sameModel = KMeansModel.load(sc, "odps://ms_proj1_dev/volume_yyy1/target/PythonKMeansExample/KMeansModel")
    print(parsedData.map(lambda feature: sameModel.predict(feature)).collect())

    sc.stop()

After the job completes, view the output files in the OSS directory mapped to the volume.

Import and export data between external volume files and internal tables

To import data files from an external volume into a MaxCompute table or partition, use the LOAD command. To store structured data in OSS, export data from a MaxCompute project to the OSS path mapped to an external volume with the UNLOAD command.

LOAD

Syntax:
```
{load overwrite|into} table <table_name> [partition (<pt_spec>)]
from location <Volume_location>
stored by <StorageHandler>
[with serdeproperties (<Options>)];
```
Volume_location is the external volume path, in the format odps://[project_name]/[volume_name]/, where project_name is the name of the MaxCompute project and volume_name is the name of the external volume. For descriptions of the other parameters, see LOAD.

Important
You must also set the odps.properties.rolearn parameter in the with serdeproperties list to provide RoleArn authentication information. The RoleArn can differ from the one used by the external volume, as long as it has permission to access the OSS folder. For more information, see ORC external tables.

Example:

In this example, the OSS path mapped to an external volume named volume_external contains several CSV files with a uniform schema. Run the following statements to import the files into a MaxCompute internal table.

-- Create the sink table ambulance_data_csv_load
create table ambulance_data_csv_load (
  vehicleId INT,
  recordId INT,
  patientId INT,
  calls INT,
  locationLatitute DOUBLE,
  locationLongtitue DOUBLE,
  recordTime STRING,
  direction STRING );

-- Run the load command
load overwrite table ambulance_data_csv_load
from
location 'odps://<project_name>/volume_external/'
stored by 'com.aliyun.odps.CsvStorageHandler'
with serdeproperties (
  'odps.properties.rolearn'='acs:ram::xxxxx:role/aliyunodpsdefaultrole',
  'odps.text.option.delimiter'=','
);

-- Query
SELECT * from ambulance_data_csv_load;

-- Example result
vehicleid	recordid	patientid	calls	locationlatitute	locationlongtitue	recordtime	direction
1	1	51	1	46.81006	-92.08174	9/14/2014 0:00	S
1	2	13	1	46.81006	-92.08174	9/14/2014 0:01	NE
1	3	48	1	46.81006	-92.08174	9/14/2014 0:02	NE
1	4	30	1	46.81006	-92.08174	9/14/2014 0:03	W
1	5	47	1	46.81006	-92.08174	9/14/2014 0:04	S
1	6	9	1	46.81006	-92.08174	9/14/2014 0:05	S
1	7	53	1	46.81006	-92.08174	9/14/2014 0:06	N
1	8	63	1	46.81006	-92.08174	9/14/2014 0:07	SW
1	9	4	1	46.81006	-92.08174	9/14/2014 0:08	NE
1	10	31	1	46.81006	-92.08174	9/14/2014 0:09	N

UNLOAD

Syntax:
```
unload from {<select_statement>|<table_name> [partition (<pt_spec>)]} 
into 
location <Volume_location>
[stored by <StorageHandler>]
[with serdeproperties ('<property_name>'='<property_value>',...)];
```
Volume_location is the external volume path, in the format odps://[project_name]/[volume_name]/, where project_name is the name of the MaxCompute project and volume_name is the name of the external volume. For descriptions of the other parameters, see UNLOAD.

Example:

This example exports data from a MaxCompute internal table to the OSS path mapped to an external volume named volume_external_unload. Run the following statements to export the table data.

-- Control the number of exported files: Set the size of MaxCompute table data that a single worker can read, in MB. Because MaxCompute tables are compressed, the data exported to OSS is generally about four times larger.
set odps.stage.mapper.split.size=256;
-- Export data.
unload from
(select * from ambulance_data_csv_load)
into
location 'odps://project_name/volume_external_unload'
stored by 'com.aliyun.odps.CsvStorageHandler'
with serdeproperties (
  'odps.properties.rolearn'='acs:ram::139xxx:role/aliyunodpsdefaultrole',
  'odps.text.option.delimiter'=',');

-- This is equivalent to the following statements.
set odps.stage.mapper.split.size=256;
unload from ambulance_data_csv_load 
into
location 'odps://project_name/volume_external_unload'
stored by 'com.aliyun.odps.CsvStorageHandler'
with serdeproperties (
  'odps.properties.rolearn'='acs:ram::139xxx:role/aliyunodpsdefaultrole',
  'odps.text.option.delimiter'=',');

Example result: A data file is generated in the OSS folder that is mapped by the External Volume.
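The effect of odps.stage.mapper.split.size on the export can be estimated with simple arithmetic. The sketch below rests on two assumptions that should be verified against a real run: each worker writes one file, and exported data is about four times the compressed table size, as the comment in the example suggests. `estimate_unload` is a hypothetical helper name.

```python
import math


def estimate_unload(table_size_mb, split_size_mb=256, expansion=4.0):
    """Estimate (file_count, approx_file_size_mb) for an UNLOAD job.

    Assumes one output file per worker and a fixed expansion factor
    between compressed table data and exported data.
    """
    workers = max(1, math.ceil(table_size_mb / split_size_mb))
    exported_mb = table_size_mb * expansion
    return workers, exported_mb / workers


# A 1024 MB table read in 256 MB splits.
files, size_mb = estimate_unload(1024)
print(f"about {files} files of roughly {size_mb:.0f} MB each")
```

Under these assumptions, raising the split size yields fewer, larger files, and lowering it yields more, smaller ones.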

Create an external table from an external volume file

If the OSS path mapped to the external volume contains semi-structured files with a uniform schema, such as CSV, Parquet, or ORC files, you can run the following command to create an external table on the external volume. For more information about the syntax for creating an external table, see ORC external tables.

Syntax:
```
create external table [if not exists] <mc_oss_extable_name> 
(
<col_name> <data_type>,
...
)
[partitioned by (<col_name> <data_type>, ...)] 
stored by '<StorageHandler>'  
with serdeproperties (
 ['<property_name>'='<property_value>',...]
) 
location '<Volume_location>';
```
Volume_location is the external volume path, in the format odps://[project_name]/[volume_name]/, where project_name is the name of the MaxCompute project and volume_name is the name of the external volume. For descriptions of the other parameters, see ORC external tables.

Note
You must also set the odps.properties.rolearn parameter in the with serdeproperties list to provide RoleArn authentication information. The RoleArn can differ from the one used by the external volume, as long as it has permission to access the OSS folder. For more information, see ORC external tables.

Example:

In this example, the OSS path mapped to an external volume named demo_volume3 contains several CSV files with a uniform schema. Run the following statements to create and query an external table.

create external table ext_tbl_onvolume
(
    col1 string,
    col2 string,
    col3 string
)
stored by 'com.aliyun.odps.CsvStorageHandler' 
with serdeproperties (
 'odps.properties.rolearn'='acs:ram::1248xxx:role/aliyunodpsdefaultrole'
) 
location 'odps://project_name/demo_volume3/';

-- Query the external table
SELECT * from ext_tbl_onvolume;

Use Proxima CE for vectorization in MaxCompute

Proxima CE performs vector indexing and nearest-neighbor search on data stored in MaxCompute tables. Results are saved to an external volume in OSS.

Limitations

The Proxima SDK for Java supports Linux and macOS only. JAR files contain Linux-specific dependencies and cannot run on the MaxCompute client on Windows.
Proxima CE runs two types of tasks: local tasks (not involving SQL, MapReduce, or Graph) and MaxCompute tasks (executed via SQL, MapReduce, or Graph engines). The two types run alternately. At startup, Proxima CE attempts to load the Proxima kernel on the local machine. If the kernel loads successfully, certain modules run locally; if loading fails, errors are reported but the job continues using fallback functions.
Submit the task using the MaxCompute client (odpscmd). DataWorks MapReduce nodes are not supported because the underlying MaxCompute client version is being upgraded.

Run a Proxima CE vectorization task

Step 1: Install the Proxima CE resource package.

Step 2: Prepare input data.

Create the input tables and insert sample data:

-- Create a base table and a query table
CREATE TABLE doc_table_float_smoke(pk STRING, vector STRING) PARTITIONED BY (pt STRING);
CREATE TABLE query_table_float_smoke(pk STRING, vector STRING) PARTITIONED BY (pt STRING);

-- Insert data into the base table
ALTER TABLE doc_table_float_smoke ADD PARTITION(pt='20230116');
INSERT OVERWRITE TABLE doc_table_float_smoke PARTITION (pt='20230116') VALUES
('1.nid','1~1~1~1~1~1~1~1'),
('2.nid','2~2~2~2~2~2~2~2'),
('3.nid','3~3~3~3~3~3~3~3'),
('4.nid','4~4~4~4~4~4~4~4'),
('5.nid','5~5~5~5~5~5~5~5'),
('6.nid','6~6~6~6~6~6~6~6'),
('7.nid','7~7~7~7~7~7~7~7'),
('8.nid','8~8~8~8~8~8~8~8'),
('9.nid','9~9~9~9~9~9~9~9'),
('10.nid','10~10~10~10~10~10~10~10');

-- Insert data into the query table
ALTER TABLE query_table_float_smoke ADD PARTITION(pt='20230116');
INSERT OVERWRITE TABLE query_table_float_smoke PARTITION (pt='20230116') VALUES
('q1.nid','1~1~1~1~2~2~2~2'),
('q2.nid','4~4~4~4~3~3~3~3'),
('q3.nid','9~9~9~9~5~5~5~5');

Step 3: Submit the Proxima CE task.

jar -libjars proxima-ce-aliyun-1.0.0.jar
-classpath proxima-ce-aliyun-1.0.0.jar com.alibaba.proxima2.ce.ProximaCERunner
-doc_table doc_table_float_smoke
-doc_table_partition 20230116
-query_table query_table_float_smoke
-query_table_partition 20230116
-output_table output_table_float_smoke
-output_table_partition 20230116
-data_type float
-dimension 8
-topk 1
-job_mode train:build:seek:recall
-external_volume shanghai_vol_ceshi
-owner_id 1248953xxx
;

Step 4: Verify the results.

Query the output table to check the nearest-neighbor results:

SELECT * FROM output_table_float_smoke WHERE pt='20230116';

Expected output:

+------------+------------+------------+------------+
| pk         | knn_result | score      | pt         |
+------------+------------+------------+------------+
| q1.nid     | 2.nid      | 4.0        | 20230116   |
| q1.nid     | 1.nid      | 4.0        | 20230116   |
| q1.nid     | 3.nid      | 20.0       | 20230116   |
| q2.nid     | 4.nid      | 4.0        | 20230116   |
| q2.nid     | 3.nid      | 4.0        | 20230116   |
| q2.nid     | 2.nid      | 20.0       | 20230116   |
| q3.nid     | 7.nid      | 32.0       | 20230116   |
| q3.nid     | 8.nid      | 40.0       | 20230116   |
| q3.nid     | 6.nid      | 40.0       | 20230116   |
+------------+------------+------------+------------+
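The scores in this table can be cross-checked offline with a brute-force search over the sample data. The sketch below assumes that the score is the squared Euclidean distance between the query vector and the base vector; that assumption is inferred from the sample values and is not a description of how Proxima CE computes results. The helper names are hypothetical, and the order of entries with equal scores may differ from the table.

```python
def parse_vector(text, sep="~"):
    """Parse a '~'-separated vector string such as '1~1~1~1~2~2~2~2'."""
    return [float(x) for x in text.split(sep)]


def squared_euclidean(a, b):
    """Sum of squared component differences between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query, docs, k):
    """Return the k closest (pk, score) pairs by exhaustive comparison."""
    scored = sorted((squared_euclidean(query, vec), pk) for pk, vec in docs.items())
    return [(pk, score) for score, pk in scored[:k]]


# Same rows as the base table and query table created in Step 2.
docs = {f"{i}.nid": parse_vector("~".join([str(i)] * 8)) for i in range(1, 11)}
queries = {
    "q1.nid": parse_vector("1~1~1~1~2~2~2~2"),
    "q2.nid": parse_vector("4~4~4~4~3~3~3~3"),
    "q3.nid": parse_vector("9~9~9~9~5~5~5~5"),
}

for pk, vector in queries.items():
    print(pk, nearest(vector, docs, 3))
```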

What's next

External volume operations — create, list, and manage external volumes
Access OSS from Spark on MaxCompute — direct OSS access without external volumes
MaxCompute permissions — manage user permissions for volumes and projects