External volumes act as distributed file systems in MaxCompute, backed by Object Storage Service (OSS). Mount an external volume to an OSS directory, and your Spark on MaxCompute and MapReduce jobs can read and write files through MaxCompute's permission system—without granting direct OSS access to each user.
Each MaxCompute project can have multiple external volumes.
Use cases
External volumes are useful when you need to:
- Load job dependencies at startup: automatically download JAR files, Python wheels, or model archives to the job's working directory before execution starts
- Read and write OSS files in Spark code: access files stored in OSS using the odps:// path scheme directly in your Spark job code
- Apply fine-grained permission control: use MaxCompute's permission system to control who can read or write specific volume paths, instead of managing OSS bucket policies per user
- Store ML job outputs: save index data or model files generated by engines like Proxima CE back to OSS through a volume
Billing
Data in external volumes is stored in OSS. You are not charged for storage within MaxCompute. Compute charges apply when a MaxCompute engine reads or processes data in an external volume—for example, when running a Spark on MaxCompute or MapReduce job. Outputs written back to OSS (such as index data from Proxima CE) are charged standard OSS storage fees.
Prerequisites
Before you begin, ensure that you have:
- Submitted and received approval for trial use of external volumes. See Apply for trial use of new features.
- MaxCompute client (odpscmd) V0.43.0 or later installed. See MaxCompute client (odpscmd). If you use the SDK for Java, V0.43.0 or later is required. See Version updates.
- An OSS bucket created. See Create buckets.
- Your MaxCompute project authorized to access OSS. See Configure an OSS access method.
Quick start
Step 1: Grant the required permissions
To use external volumes, your account needs the following permissions: CreateInstance, CreateVolume, List, Read, and Write. See MaxCompute permissions.
- Check whether your account has the CreateVolume permission:
  SHOW GRANTS FOR <user_name>;
- If the CreateVolume permission is missing, grant it:
  GRANT CreateVolume ON project <project_name> TO USER <user_name>;
  To revoke the permission later:
  REVOKE CreateVolume ON project <project_name> FROM USER <user_name>;
- Run SHOW GRANTS again to confirm the permission is granted.
Step 2: Create an external volume
Run the following command using the account that has the CreateVolume permission:
vfs -create <volume_name>
-storage_provider oss
-url oss://<oss_endpoint>/<bucket>/<path>
-acd <true|false>
-role_arn <arn:aliyun:xxx/aliyunodpsdefaultrole>
For parameter details and other volume operations, see External volume operations.
After creation, the volume path is odps://[project_name]/[volume_name]. Use this path in Spark on MaxCompute and MapReduce jobs.
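The path format can be composed and checked with a few lines of Python. This is only a sketch: `build_volume_path` and `split_volume_path` are hypothetical helper names, not part of any MaxCompute SDK, and they only manipulate strings.

```python
from urllib.parse import urlsplit


def build_volume_path(project_name, volume_name, relative_path=""):
    """Compose odps://[project_name]/[volume_name]/[relative_path]."""
    base = f"odps://{project_name}/{volume_name}"
    relative_path = relative_path.strip("/")
    return f"{base}/{relative_path}" if relative_path else base


def split_volume_path(path):
    """Split a volume path into (project_name, volume_name, relative_path)."""
    parts = urlsplit(path)
    if parts.scheme != "odps":
        raise ValueError(f"not an odps:// path: {path!r}")
    segments = parts.path.strip("/").split("/", 1)
    if not parts.netloc or not segments[0]:
        raise ValueError(f"missing project or volume name: {path!r}")
    relative = segments[1] if len(segments) > 1 else ""
    return parts.netloc, segments[0], relative


if __name__ == "__main__":
    print(build_volume_path("mc_project", "external_volume", "data/mllib/kmeans_data.txt"))
    print(split_volume_path("odps://ms_proj1_dev/volume_yyy1/kmeans_data.txt"))
```

Validating paths before submitting a job can catch a missing project or volume segment early, before the job fails at startup.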
Step 3: Verify the volume
List all volumes in the current project to confirm the volume was created:
vfs -ls /;
Use Spark on MaxCompute with external volumes
Spark on MaxCompute is compatible with open source Spark and runs on MaxCompute's integrated computing resources, datasets, and permission system.
There are two ways to access external volumes from a Spark job:
- Reference files at job startup: volume files are downloaded to the job's working directory before the job starts
- Access files in code: use the odps:// path scheme directly in your Spark code to read and write volume files at runtime
Reference files at job startup
Configure the following parameters in the Parameters section of the DataWorks ODPS Spark node, or in the spark-defaults.conf file. These parameters cannot be set inside your job code.
| Parameter | Description |
|---|---|
| spark.hadoop.odps.cupid.volume.files | Files to download to the job's working directory before startup. Separate multiple files with commas. Each value must include the full file name. |
| spark.hadoop.odps.cupid.volume.archives | Archive files (.zip, .tar.gz, .tar) to download and decompress to the job's working directory before startup. Separate multiple archives with commas. |
Value format:
odps://[project_name]/[volume_name]/[path_to_file]
Example — files:
spark.hadoop.odps.cupid.volume.files=
odps://mc_project/external_volume/data/mllib/kmeans_data.txt,
odps://mc_project/external_volume/target/PythonKMeansExample/KMeansModel/data/part-00000-a2d44ac5-54f6-49fd-b793-f11e6a189f90-c000.snappy.parquet
After the job starts, the working directory contains kmeans_data.txt and part-00000-a2d44ac5-54f6-49fd-b793-f11e6a189f90-c000.snappy.parquet.
Example — archives:
spark.hadoop.odps.cupid.volume.archives=
odps://spark_test_wj2/external_volume/pyspark-3.1.1.zip,
odps://spark_test_wj2/external_volume/python-3.7.9-ucs4.tar.gz
After the job starts, the working directory contains the decompressed contents of pyspark-3.1.1.zip and python-3.7.9-ucs4.tar.gz.
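When the list of files or archives is generated by a script, the comma-separated values can be assembled programmatically. The sketch below is illustrative: the helper names are hypothetical, and the file-name prediction assumes, as in the files example, that each downloaded file keeps the last segment of its volume path.

```python
import posixpath


def join_volume_values(paths):
    """Join volume paths with commas, as both startup parameters expect."""
    for path in paths:
        if not path.startswith("odps://"):
            raise ValueError(f"expected an odps:// path, got {path!r}")
    return ",".join(paths)


def startup_conf(files=(), archives=()):
    """Return key=value lines for the two startup parameters."""
    lines = []
    if files:
        lines.append("spark.hadoop.odps.cupid.volume.files=" + join_volume_values(files))
    if archives:
        lines.append("spark.hadoop.odps.cupid.volume.archives=" + join_volume_values(archives))
    return lines


def downloaded_file_names(files):
    """File names expected in the working directory (last path segment)."""
    return [posixpath.basename(path) for path in files]


if __name__ == "__main__":
    files = ["odps://mc_project/external_volume/data/mllib/kmeans_data.txt"]
    archives = ["odps://spark_test_wj2/external_volume/pyspark-3.1.1.zip"]
    print("\n".join(startup_conf(files, archives)))
    print(downloaded_file_names(files))
```

The generated lines are meant to be pasted into spark-defaults.conf or the DataWorks node parameters, since these two settings cannot be set from job code.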
Access files in code
To read and write external volume files from your Spark job code, set the following parameters in the code:
| Parameter | Value | Description |
|---|---|---|
| spark.hadoop.odps.volume.common.filesystem | true | Enables external volume recognition. Default: false. |
| spark.hadoop.odps.cupid.volume.paths | odps://[project_name]/[volume_name]/ | The volume path to access. Default: empty. |
| spark.hadoop.fs.odps.impl | org.apache.hadoop.fs.aliyun.volume.OdpsVolumeFileSystem | Implementation class for OSS access. |
| spark.hadoop.fs.AbstractFileSystem.odps.impl | org.apache.hadoop.fs.aliyun.volume.abstractfsimpl.OdpsVolumeFs | Abstract file system implementation class. |
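The four settings in the table above can be collected in one place so that only the project and volume names vary. This is a minimal sketch with hypothetical helper names; it builds plain strings and does not require Spark to run.

```python
def volume_access_conf(project_name, volume_name):
    """Return the four settings from the table above as a dict."""
    return {
        "spark.hadoop.odps.volume.common.filesystem": "true",
        "spark.hadoop.odps.cupid.volume.paths": f"odps://{project_name}/{volume_name}/",
        "spark.hadoop.fs.odps.impl":
            "org.apache.hadoop.fs.aliyun.volume.OdpsVolumeFileSystem",
        "spark.hadoop.fs.AbstractFileSystem.odps.impl":
            "org.apache.hadoop.fs.aliyun.volume.abstractfsimpl.OdpsVolumeFs",
    }


def to_conf_lines(conf):
    """Render a settings dict as key=value lines."""
    return [f"{key}={value}" for key, value in conf.items()]


if __name__ == "__main__":
    for line in to_conf_lines(volume_access_conf("ms_proj1_dev", "volume_yyy1")):
        print(line)
```

With PySpark available, the same dict could presumably be applied through `SparkConf().setAll(list(conf.items()))` before the context is created; whether that is sufficient on a given MaxCompute Spark version is not confirmed here.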
Example — K-means clustering with external volume:
The following example uses the K-means algorithm. It reads training data from odps://ms_proj1_dev/volume_yyy1/, trains a model, and saves the output back to the same volume.
All file paths in the code use the odps:// scheme to read from and write to the external volume.
Set the four parameters above in spark-defaults.conf or in the Parameters section of the DataWorks ODPS Spark node before running this code. The example also requires the following additional parameters for OSS access, the JindoFS SDK, and the Python runtime:
-- Parameters
spark.hadoop.odps.cupid.volume.paths=odps://ms_proj1_dev/volume_yyy1/
spark.hadoop.odps.volume.common.filesystem=true
spark.hadoop.fs.odps.impl=org.apache.hadoop.fs.aliyun.volume.OdpsVolumeFileSystem
spark.hadoop.fs.AbstractFileSystem.odps.impl=org.apache.hadoop.fs.aliyun.volume.abstractfsimpl.OdpsVolumeFs
spark.hadoop.odps.access.id=xxxxxxxxx
spark.hadoop.odps.access.key=xxxxxxxxx
spark.hadoop.fs.oss.endpoint=oss-cn-beijing-internal.aliyuncs.com
spark.hadoop.odps.cupid.resources=ms_proj1_dev.jindofs-sdk-3.8.0.jar,public.python-2.7.13-ucs4.tar.gz
spark.hadoop.fs.oss.impl=com.aliyun.emr.fs.oss.JindoOssFileSystem
spark.pyspark.python=./public.python-2.7.13-ucs4.tar.gz/python-2.7.13-ucs4/bin/python
spark.hadoop.odps.spark.version=spark-2.4.5-odps0.34.0
-- Code
from numpy import array
from math import sqrt

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans, KMeansModel

if __name__ == "__main__":
    sc = SparkContext(appName="KMeansExample")

    # Read training data from the external volume
    data = sc.textFile("odps://ms_proj1_dev/volume_yyy1/kmeans_data.txt")
    parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

    # Train the K-means model
    clusters = KMeans.train(parsedData, 2, maxIterations=10, initializationMode="random")

    # Evaluate the model: distance from each point to its assigned cluster center
    def error(point):
        center = clusters.centers[clusters.predict(point)]
        return sqrt(sum([x**2 for x in (point - center)]))

    WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
    print("Within Set Sum of Squared Error = " + str(WSSSE))

    # Save the model to the external volume
    clusters.save(sc, "odps://ms_proj1_dev/volume_yyy1/target/PythonKMeansExample/KMeansModel")
    print(parsedData.map(lambda feature: clusters.predict(feature)).collect())

    # Load and use the saved model
    sameModel = KMeansModel.load(sc, "odps://ms_proj1_dev/volume_yyy1/target/PythonKMeansExample/KMeansModel")
    print(parsedData.map(lambda feature: sameModel.predict(feature)).collect())

    sc.stop()
After the job completes, view the output files in the OSS directory mapped to the volume.
Import and export data between external volume files and internal tables
To import data files from an external volume into a MaxCompute table or partition, use the LOAD command. To store structured data in OSS, export data from a MaxCompute project to OSS with the UNLOAD command.
LOAD
- Syntax:
  {load overwrite|into} table <table_name> [partition (<pt_spec>)]
  from location <Volume_location>
  stored by <StorageHandler>
  [with serdeproperties (<Options>)];
  Volume_location is the specified external volume path. The format is odps://[project_name]/[volume_name]/. project_name is the name of the MaxCompute project. volume_name is the name of the external volume. For descriptions of other parameters, see LOAD.
  Important: You must also configure the odps.properties.rolearn parameter in the with serdeproperties list to provide RoleArn authentication information. The RoleArn information can be different from that of the external volume, as long as it has permission to access the OSS folder. For more information, see ORC external tables.
- Example:
  In this example, the OSS path mapped by an external volume contains several CSV files with a uniform schema. An external volume named volume_external is created. You can run the following statements to import the files into a MaxCompute internal table.
  -- Create the sink table ambulance_data_csv_load
  create table ambulance_data_csv_load (
    vehicleId INT,
    recordId INT,
    patientId INT,
    calls INT,
    locationLatitute DOUBLE,
    locationLongtitue DOUBLE,
    recordTime STRING,
    direction STRING
  );
  -- Run the load command
  load overwrite table ambulance_data_csv_load
  from location 'odps://<project_name>/volume_external/'
  stored by 'com.aliyun.odps.CsvStorageHandler'
  with serdeproperties (
    'odps.properties.rolearn'='acs:ram::xxxxx:role/aliyunodpsdefaultrole',
    'odps.text.option.delimiter'=','
  );
  -- Query
  SELECT * from ambulance_data_csv_load;
  -- Example result
  +-----------+----------+-----------+-------+------------------+-------------------+----------------+-----------+
  | vehicleid | recordid | patientid | calls | locationlatitute | locationlongtitue | recordtime     | direction |
  +-----------+----------+-----------+-------+------------------+-------------------+----------------+-----------+
  | 1         | 1        | 51        | 1     | 46.81006         | -92.08174         | 9/14/2014 0:00 | S         |
  | 1         | 2        | 13        | 1     | 46.81006         | -92.08174         | 9/14/2014 0:01 | NE        |
  | 1         | 3        | 48        | 1     | 46.81006         | -92.08174         | 9/14/2014 0:02 | NE        |
  | 1         | 4        | 30        | 1     | 46.81006         | -92.08174         | 9/14/2014 0:03 | W         |
  | 1         | 5        | 47        | 1     | 46.81006         | -92.08174         | 9/14/2014 0:04 | S         |
  | 1         | 6        | 9         | 1     | 46.81006         | -92.08174         | 9/14/2014 0:05 | S         |
  | 1         | 7        | 53        | 1     | 46.81006         | -92.08174         | 9/14/2014 0:06 | N         |
  | 1         | 8        | 63        | 1     | 46.81006         | -92.08174         | 9/14/2014 0:07 | SW        |
  | 1         | 9        | 4         | 1     | 46.81006         | -92.08174         | 9/14/2014 0:08 | NE        |
  | 1         | 10       | 31        | 1     | 46.81006         | -92.08174         | 9/14/2014 0:09 | N         |
  +-----------+----------+-----------+-------+------------------+-------------------+----------------+-----------+
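If LOAD statements are generated from a script, for example one per volume, the statement text can be rendered from its parts. The following is a rough sketch: `build_load_statement` is a hypothetical helper, it performs no quoting or escaping of its inputs, and it only produces the SQL string.

```python
def build_load_statement(table_name, volume_location, storage_handler,
                         role_arn, overwrite=True, extra_properties=None):
    """Render a LOAD statement that reads from an external volume path."""
    # The rolearn property is always included, as the syntax notes require.
    properties = {"odps.properties.rolearn": role_arn}
    properties.update(extra_properties or {})
    rendered = ",\n  ".join(f"'{k}'='{v}'" for k, v in properties.items())
    mode = "overwrite" if overwrite else "into"
    return (
        f"load {mode} table {table_name}\n"
        f"from location '{volume_location}'\n"
        f"stored by '{storage_handler}'\n"
        f"with serdeproperties (\n  {rendered}\n);"
    )


if __name__ == "__main__":
    print(build_load_statement(
        "ambulance_data_csv_load",
        "odps://<project_name>/volume_external/",
        "com.aliyun.odps.CsvStorageHandler",
        "acs:ram::xxxxx:role/aliyunodpsdefaultrole",
        extra_properties={"odps.text.option.delimiter": ","},
    ))
```

The resulting string can be run in the MaxCompute client, or, if PyODPS is installed, it could likely be submitted with the `execute_sql` method of an `ODPS` object.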
UNLOAD
- Syntax:
  unload from {<select_statement>|<table_name> [partition (<pt_spec>)]}
  into location <Volume_location>
  [stored by <StorageHandler>]
  [with serdeproperties ('<property_name>'='<property_value>',...)];
  Volume_location is the specified external volume path. The format is odps://[project_name]/[volume_name]/. project_name is the name of the MaxCompute project. volume_name is the name of the external volume. For descriptions of other parameters, see UNLOAD.
- Example:
  This example shows how to export data from a MaxCompute internal table to the OSS path that is mapped by an external volume. An external volume named volume_external_unload is created. You can run the following statements to export the table data to the external volume.
  -- Control the number of exported files: set the size of MaxCompute table data that a single worker can read, in MB.
  -- Because MaxCompute tables are compressed, the data exported to OSS is generally about four times larger.
  set odps.stage.mapper.split.size=256;
  -- Export data.
  unload from (select * from ambulance_data_csv_load)
  into location 'odps://project_name/volume_external_unload'
  stored by 'com.aliyun.odps.CsvStorageHandler'
  with serdeproperties (
    'odps.text.option.delimiter'=','
  );
  -- This is equivalent to the following statements.
  set odps.stage.mapper.split.size=256;
  unload from ambulance_data_csv_load
  into location 'odps://project_name/volume_external_unload'
  stored by 'com.aliyun.odps.CsvStorageHandler'
  with serdeproperties (
    'odps.properties.rolearn'='acs:ram::139xxx:role/aliyunodpsdefaultrole',
    'odps.text.option.delimiter'=','
  );
  Example result: A data file is generated in the OSS folder that is mapped by the external volume.
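The effect of odps.stage.mapper.split.size can be estimated with simple arithmetic. The sketch below rests on two assumptions that the example only implies: each worker writes roughly one output file, and the exported data is about four times the compressed table size. Actual results depend on the data and the storage format.

```python
import math


def estimate_unload(table_size_mb, split_size_mb=256, expansion_factor=4.0):
    """Rough estimate of (worker count, exported size in MB) for UNLOAD."""
    if table_size_mb <= 0 or split_size_mb <= 0:
        raise ValueError("sizes must be positive")
    # Each worker reads at most split_size_mb of compressed table data.
    workers = math.ceil(table_size_mb / split_size_mb)
    # Exported data is uncompressed, so it grows by the expansion factor.
    exported_mb = table_size_mb * expansion_factor
    return workers, exported_mb


if __name__ == "__main__":
    workers, exported_mb = estimate_unload(1024, split_size_mb=256)
    print(f"about {workers} workers, about {exported_mb:.0f} MB written to OSS")
```

Under these assumptions, a 1,024 MB table with a 256 MB split size is read by about 4 workers and produces about 4,096 MB in OSS. Raising the split size reduces the number of files, and lowering it increases it.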
Create an external table from an External Volume file
If the OSS path that is mapped by the external volume contains files with a uniform schema, such as CSV, PARQUET, or ORC files, you can run the following command to create an external table from the external volume. For more information about the syntax for creating an external table, see ORC external tables.
- Syntax:
  create external table [if not exists] <mc_oss_extable_name>
  (
    <col_name> <data_type>,
    ...
  )
  [partitioned by (<col_name> <data_type>, ...)]
  stored by '<StorageHandler>'
  with serdeproperties (
    ['<property_name>'='<property_value>',...]
  )
  location '<Volume_location>';
  Volume_location is the specified external volume path. The format is odps://[project_name]/[volume_name]/. project_name is the name of the MaxCompute project. volume_name is the name of the external volume. For descriptions of other parameters, see ORC external tables.
  Note: You must also configure the odps.properties.rolearn parameter in the with serdeproperties list to provide RoleArn authentication information. The RoleArn information can be different from that of the external volume, as long as it has permission to access the OSS folder. For more information, see ORC external tables.
- Example:
  In this example, the OSS path mapped by an external volume contains several CSV files with a uniform schema. An external volume named demo_volume3 is created. You can run the following statements to create and query an external table.
  create external table ext_tbl_onvolume
  (
    col1 string,
    col2 string,
    col3 string
  )
  stored by 'com.aliyun.odps.CsvStorageHandler'
  with serdeproperties (
    'odps.properties.rolearn'='acs:ram::1248xxx:role/aliyunodpsdefaultrole'
  )
  location 'odps://project_name/demo_volume3/';
  -- Query the external table
  SELECT * from ext_tbl_onvolume;
Use Proxima CE for vectorization in MaxCompute
Proxima CE performs vector indexing and nearest-neighbor search on data stored in MaxCompute tables. Results are saved to an external volume in OSS.
Limitations
- The Proxima SDK for Java supports Linux and macOS only. JAR files contain Linux-specific dependencies and cannot run on the MaxCompute client on Windows.
- Proxima CE runs two types of tasks: local tasks (not involving SQL, MapReduce, or Graph) and MaxCompute tasks (executed via SQL, MapReduce, or Graph engines). The two types run alternately. At startup, Proxima CE attempts to load the Proxima kernel on the local machine. If the kernel loads successfully, certain modules run locally; if loading fails, errors are reported but the job continues using fallback functions.
- Submit the task using the MaxCompute client (odpscmd). DataWorks MapReduce nodes are not supported because the underlying MaxCompute client version is being upgraded.
Run a Proxima CE vectorization task
Step 1: Install the Proxima CE resource package.
Step 2: Prepare input data.
Create the input tables and insert sample data:
-- Create a base table and a query table
CREATE TABLE doc_table_float_smoke(pk STRING, vector STRING) PARTITIONED BY (pt STRING);
CREATE TABLE query_table_float_smoke(pk STRING, vector STRING) PARTITIONED BY (pt STRING);
-- Insert data into the base table
ALTER TABLE doc_table_float_smoke ADD PARTITION(pt='20230116');
INSERT OVERWRITE TABLE doc_table_float_smoke PARTITION (pt='20230116') VALUES
('1.nid','1~1~1~1~1~1~1~1'),
('2.nid','2~2~2~2~2~2~2~2'),
('3.nid','3~3~3~3~3~3~3~3'),
('4.nid','4~4~4~4~4~4~4~4'),
('5.nid','5~5~5~5~5~5~5~5'),
('6.nid','6~6~6~6~6~6~6~6'),
('7.nid','7~7~7~7~7~7~7~7'),
('8.nid','8~8~8~8~8~8~8~8'),
('9.nid','9~9~9~9~9~9~9~9'),
('10.nid','10~10~10~10~10~10~10~10');
-- Insert data into the query table
ALTER TABLE query_table_float_smoke ADD PARTITION(pt='20230116');
INSERT OVERWRITE TABLE query_table_float_smoke PARTITION (pt='20230116') VALUES
('q1.nid','1~1~1~1~2~2~2~2'),
('q2.nid','4~4~4~4~3~3~3~3'),
('q3.nid','9~9~9~9~5~5~5~5');
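The sample rows store each vector as a string of components separated by `~`. When such rows are produced from numeric data, a small helper can do the conversion and check the dimension. This is a sketch with hypothetical function names; the `~` separator and the dimension of 8 are taken from the sample data, and any other encoding rule is an assumption.

```python
def format_component(value):
    """Render whole numbers without a decimal point, others at full precision."""
    number = float(value)
    return str(int(number)) if number.is_integer() else repr(number)


def encode_vector(values, separator="~"):
    """Encode a sequence of numbers as a separator-delimited string."""
    return separator.join(format_component(v) for v in values)


def decode_vector(text, dimension=None, separator="~"):
    """Decode a vector string into floats, optionally checking its length."""
    values = [float(part) for part in text.split(separator)]
    if dimension is not None and len(values) != dimension:
        raise ValueError(f"expected {dimension} components, got {len(values)}")
    return values


if __name__ == "__main__":
    print(encode_vector([1] * 8))
    print(decode_vector("4~4~4~4~3~3~3~3", dimension=8))
```

Checking the dimension before inserting rows helps keep the table consistent with the dimension passed to the task.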
Step 3: Submit the Proxima CE task.
jar -libjars proxima-ce-aliyun-1.0.0.jar
-classpath proxima-ce-aliyun-1.0.0.jar com.alibaba.proxima2.ce.ProximaCERunner
-doc_table doc_table_float_smoke
-doc_table_partition 20230116
-query_table query_table_float_smoke
-query_table_partition 20230116
-output_table output_table_float_smoke
-output_table_partition 20230116
-data_type float
-dimension 8
-topk 1
-job_mode train:build:seek:recall
-external_volume shanghai_vol_ceshi
-owner_id 1248953xxx
;
Step 4: Verify the results.
Query the output table to check the nearest-neighbor results:
SELECT * FROM output_table_float_smoke WHERE pt='20230116';
Expected output:
+------------+------------+------------+------------+
| pk | knn_result | score | pt |
+------------+------------+------------+------------+
| q1.nid | 2.nid | 4.0 | 20230116 |
| q1.nid | 1.nid | 4.0 | 20230116 |
| q1.nid | 3.nid | 20.0 | 20230116 |
| q2.nid | 4.nid | 4.0 | 20230116 |
| q2.nid | 3.nid | 4.0 | 20230116 |
| q2.nid | 2.nid | 20.0 | 20230116 |
| q3.nid | 7.nid | 32.0 | 20230116 |
| q3.nid | 8.nid | 40.0 | 20230116 |
| q3.nid | 6.nid | 40.0 | 20230116 |
+------------+------------+------------+------------+
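The scores in this sample are consistent with squared Euclidean distance between the query vector and each base vector. That is an observation about the sample data, not a statement about how Proxima CE computes scores internally. The brute-force sketch below recomputes the three nearest neighbors per query as a sanity check; the function names are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query, docs, k):
    """Return the k nearest (pk, score) pairs; ties are broken by pk."""
    scored = sorted((squared_l2(query, vec), pk) for pk, vec in docs.items())
    return [(pk, score) for score, pk in scored[:k]]


# Base table rows: '1.nid' -> eight 1s, ..., '10.nid' -> eight 10s.
docs = {f"{i}.nid": [float(i)] * 8 for i in range(1, 11)}
queries = {
    "q1.nid": [1, 1, 1, 1, 2, 2, 2, 2],
    "q2.nid": [4, 4, 4, 4, 3, 3, 3, 3],
    "q3.nid": [9, 9, 9, 9, 5, 5, 5, 5],
}

if __name__ == "__main__":
    for pk, vector in queries.items():
        print(pk, top_k(vector, docs, 3))
```

The neighbors and scores agree with the table above. The order of neighbors with equal scores can differ, because this sketch breaks ties by key, and for q2.nid the base vector 5.nid also has a score of 20.0, tied with 2.nid.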
What's next
- External volume operations: create, list, and manage external volumes
- Access OSS from Spark on MaxCompute: direct OSS access without external volumes
- MaxCompute permissions: manage user permissions for volumes and projects