Mount and use OSS

This topic uses practical code examples to demonstrate how to efficiently and securely mount and use Alibaba Cloud Object Storage Service (OSS) as storage for distributed computing in MaxFrame. The with_fs_mount decorator enables file system-level mounting to provide stable and reliable external data access for large-scale data processing.

Scenarios

This method is applicable to big data analytics scenarios that combine MaxFrame jobs with persistent object storage, such as OSS. For example:

Load raw data from OSS for cleaning or processing.
Write intermediate results to OSS for consumption by downstream tasks.
Share static resources, such as trained model files and configuration files.

Traditional read/write methods, such as pd.read_csv("oss://..."), are limited by SDK performance and network overhead, making them inefficient in a distributed environment. Using file system-level mounting (FS Mount), you can access OSS files in MaxCompute as if they were on a local disk. This greatly improves development efficiency.
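To illustrate what file system-level access means in practice, the following minimal sketch reads a CSV file with ordinary Python file APIs. A temporary directory stands in for the mount point so that the snippet runs anywhere. On a MaxFrame worker, the path would instead be the mount point passed to with_fs_mount. The helper name read_rows and the file name sample.csv are hypothetical.

```python
import csv
import os
import tempfile


def read_rows(mount_point, file_name):
    """Read a CSV file under mount_point using plain file I/O."""
    path = os.path.join(mount_point, file_name)
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


# A temporary directory stands in for a mounted OSS path such as /mnt/oss_data.
with tempfile.TemporaryDirectory() as fake_mount:
    with open(os.path.join(fake_mount, "sample.csv"), "w", newline="") as f:
        f.write("id,name\n1,alpha\n2,beta\n")
    rows = read_rows(fake_mount, "sample.csv")

print(rows)
```

No OSS SDK call appears in the reading code, which is the point of mounting: existing file-based code can be reused unchanged.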

Guide

Activate services and grant permissions

Activate OSS and create a bucket.
1. Log on to the Object Storage Service (OSS) console.
2. In the navigation pane on the left, click Buckets.
3. On the Buckets page, click Create Bucket.
  In this example, the bucket name is xxx-oss-test-sh.
Create a RAM role for MaxCompute and attach the role to the MaxCompute runtime environment.
1. Log on to the Resource Access Management (RAM) console.
2. In the navigation pane on the left, choose Identities > Roles.
3. On the Roles page, click Create Role.
4. In the upper-right corner of the Create Role page, click Create Service Linked Role.
  1. On the Create Role page, set Principal Type to Cloud Service.
  2. For Principal Name, select Cloud-native Big Data Computing Service MaxCompute.
  3. On the Permissions tab, click Grant Permission. In the Grant Permission panel, select an access policy for the role and click OK.
    Select the following access policies:
    - Permission to manage Object Storage Service (OSS): AliyunOSSFullAccess
    - Permission to manage MaxCompute: AliyunMaxComputeFullAccess

Mount OSS using `with_fs_mount`

Recommended usage

from maxframe.udf import with_fs_mount

@with_fs_mount(
    "oss://oss-cn-xxxx-internal.aliyuncs.com/xxx-oss-test-sh/test/",
    "/mnt/oss_data",
    storage_options={
        "role_arn": "acs:ram::xxx:role/maxframe-oss"
    },
)
def _process(batch_df):
    import os
    if os.path.exists('/mnt/oss_data'):
        print(f"Mounted files: {os.listdir('/mnt/oss_data')}")
    else:
        print("/mnt/oss_data not mounted!")
    return batch_df * 2

Not recommended
Passing an AccessKey pair directly in storage_options works for testing but is not suitable for production environments.
```
storage_options={
    "access_key_id": "LTAI5t...",
    "access_key_secret": "Wp9H..."
}
```
Important
Avoid hard-coding AccessKey pairs. When you specify role_arn, the system automatically requests a temporary Security Token Service (STS) token, so the AccessKey ID and AccessKey secret do not need to appear in your job code. This reduces the risk of leaking them.
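One way to enforce this guidance before a job is submitted is a small pre-flight check on the storage_options dictionary. The sketch below is not part of MaxFrame. The helper name check_storage_options is hypothetical, and the regular expression assumes a numeric account ID and a simple role-name character set, which may be stricter or looser than what RAM actually accepts.

```python
import re

# ARN shape used in this topic: acs:ram::<account ID>:role/<role name>.
# Assumption: numeric account ID; the role-name character set is illustrative only.
_ROLE_ARN = re.compile(r"^acs:ram::\d+:role/[A-Za-z0-9._-]+$")
_ACCESS_KEY_FIELDS = {"access_key_id", "access_key_secret"}


def check_storage_options(options):
    """Return options unchanged if they use a role ARN and no AccessKey pair."""
    leaked = sorted(_ACCESS_KEY_FIELDS.intersection(options))
    if leaked:
        raise ValueError(f"hard-coded credentials found: {leaked}")
    role_arn = options.get("role_arn", "")
    if not _ROLE_ARN.match(role_arn):
        raise ValueError(f"unexpected role ARN format: {role_arn!r}")
    return options


# The account ID below is a made-up placeholder.
safe = check_storage_options({"role_arn": "acs:ram::1234567890:role/maxframe-oss"})

try:
    check_storage_options({"access_key_id": "placeholder", "access_key_secret": "placeholder"})
    rejected = False
except ValueError:
    rejected = True

print(safe, rejected)
```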

Control resource allocation with `with_running_options`

Set appropriate CPU and memory resources based on the task type:

from maxframe.udf import with_fs_mount, with_running_options

@with_running_options(engine="dpe", cpu=2, memory=16)
@with_fs_mount(...)
def _process(batch_df):
    ...

| Parameter | Recommended value | Description |
| --- | --- | --- |
| `engine="dpe"` | Fixed | Currently, FS Mount supports only the DPE engine. |
| `cpu` | 1 to 4 | Increase this value for complex I/O operations or decompression. |
| `memory` | 8 GB or more | For loading large files, 16 GB or more is recommended. |
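The recommendations in the table can be encoded as a quick sanity check that runs before a job is submitted. The sketch below is a hypothetical helper and not a MaxFrame API. It assumes that the memory argument is expressed in GB, as the table suggests, and it treats the recommended ranges as warnings rather than hard limits.

```python
def review_running_options(engine, cpu, memory):
    """Return a list of warnings; an empty list means the values match the table."""
    notes = []
    if engine != "dpe":
        notes.append("FS Mount supports only the DPE engine")
    if not 1 <= cpu <= 4:
        notes.append("cpu is outside the recommended range of 1 to 4")
    if memory < 8:
        # Assumption: memory is given in GB.
        notes.append("memory is below the recommended 8 GB")
    return notes


ok_notes = review_running_options(engine="dpe", cpu=2, memory=16)
bad_notes = review_running_options(engine="other", cpu=8, memory=4)

print(ok_notes)
print(bad_notes)
```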

Usage example

Recommended pattern: Data batch processing.

For large-scale data processing, use the MaxFrame apply_chunk feature to process input data in batches.

Create a MaxFrame session

import os
from odps import ODPS
from maxframe import new_session
from maxframe.config import options
from maxframe.udf import with_fs_mount, with_running_options

# Initialize the ODPS client
o = ODPS(
    # Make sure the ALIBABA_CLOUD_ACCESS_KEY_ID environment variable is set to your AccessKey ID.
    # Make sure the ALIBABA_CLOUD_ACCESS_KEY_SECRET environment variable is set to your AccessKey secret.
    # Using the AccessKey ID and AccessKey secret strings directly is not recommended.
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='<your project>',
    endpoint='https://service.cn-<region>.maxcompute.aliyun.com/api',
)

# Set the runtime image
# The maxframe_service_dpe_runtime image includes ossfs2_2.0.3.1_linux_x86_64.deb.
# If you use a custom image, download the OSS dependency, and then upload and use it in the image. The dependency package is listed below this code block.
options.sql.settings = { "odps.session.image": "maxframe_service_dpe_runtime"}

# Start the session
session = new_session(o)

print("LogView:", session.get_logview_address())
print("Session ID:", session.session_id)

ossfs dependency package: ossfs2_2.0.3.1_linux_x86_64.deb

Create a user-defined function

@with_running_options(engine="dpe", cpu=2, memory=8)
@with_fs_mount(
    "oss://oss-cn-<region>-internal.aliyuncs.com/xxx-oss-test-sh/test/",
    "/mnt/oss_data",
    storage_options={
        "role_arn": "acs:ram::<uid>:role/maxframe-oss"
    },
)
def _process(batch_df):
    import os
    import pandas as pd

    # Step 1: Check if the mount is successful
    mount_point = "/mnt/oss_data"
    if not os.path.exists(mount_point):
        raise RuntimeError("OSS mount failed!")

    # Step 2: Load data (such as mapping tables or dictionaries)
    # mapping_df only demonstrates file access; this example does not use it further.
    mapping_file = os.path.join(mount_point, "category_map.csv")
    if os.path.isfile(mapping_file):
        mapping_df = pd.read_csv(mapping_file)

    # Step 3: Process the current chunk
    result = batch_df.copy()
    result['F'] = result['A'] * 10

    return result
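Because the function runs once per chunk, a lookup file such as the mapping table is read again for every chunk. A possible refinement is to cache the parsed file, as in the sketch below. The sketch uses only the standard library, with a temporary directory standing in for the mount point. The helper name load_category_map and the two-column file layout are hypothetical. Whether the cache is actually reused across chunks depends on whether the runtime reuses worker processes, which is an assumption here.

```python
import csv
import functools
import os
import tempfile


@functools.lru_cache(maxsize=None)
def load_category_map(path):
    """Parse a two-column CSV (key, value) into a dict, cached per path."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # Skip the header row.
        return {row[0]: row[1] for row in reader}


# A temporary directory stands in for a mounted OSS path such as /mnt/oss_data.
with tempfile.TemporaryDirectory() as fake_mount:
    path = os.path.join(fake_mount, "category_map.csv")
    with open(path, "w", newline="") as f:
        f.write("code,category\nA1,fruit\nB2,vegetable\n")
    first = load_category_map(path)   # Reads and parses the file.
    second = load_category_map(path)  # Served from the cache.

cache_hits = load_category_map.cache_info().hits
print(first, cache_hits)
```

The cached dictionary is shared between calls, so callers should treat it as read-only.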

Build a DataFrame and apply the user-defined function

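import numpy as np
import pandas as pd
import maxframe.dataframe as md

data = [[1.0, 2.0, 3.0, 4.0, 5.0], ...]
df = md.DataFrame(data, columns=['A', 'B', 'C', 'D', 'E'])

# _process adds column 'F', so the declared output dtypes must include it.
output_dtypes = pd.concat([df.dtypes, pd.Series({'F': np.dtype('float64')})])

# Use apply_chunk to apply the function after mounting
result_df = df.mf.apply_chunk(
    _process,
    skip_infer=True,
    output_type="dataframe",
    dtypes=output_dtypes,
    index=df.index
)

# Execute and get the result
result = result_df.execute().fetch()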

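skip_infer=True skips type inference, which speeds up execution. Because nothing is inferred, the dtypes and index that you pass must describe the output of the function exactly, including any columns that the function adds.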
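One way to obtain matching output dtypes without writing them by hand is to run the same transformation on a small local pandas sample and read the dtypes from the result. The sketch below assumes that pandas is available locally. The function _process_local is a hypothetical local copy of the chunk transformation without the mount check, so it can run outside MaxFrame.

```python
import pandas as pd


def _process_local(batch_df):
    """Local copy of the chunk transformation (adds column F = A * 10)."""
    result = batch_df.copy()
    result["F"] = result["A"] * 10
    return result


# A one-row sample with the same columns and types as the real input.
sample = pd.DataFrame([[1.0, 2.0, 3.0, 4.0, 5.0]], columns=["A", "B", "C", "D", "E"])

# The dtypes of the sample output describe what each chunk will return.
output_dtypes = _process_local(sample).dtypes
print(output_dtypes)
```

The resulting Series lists columns A through F, which can then be passed as the dtypes argument.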

Debugging tips

Verify the mount status

Add debugging logs to the _process function:

import os
print("Mount path exists:", os.path.exists("/mnt/oss_data"))
print("Files in mount:", os.listdir("/mnt/oss_data") if os.path.exists("/mnt/oss_data") else [])
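The same checks can be wrapped in a small reusable helper that returns a structured status instead of printing directly. The sketch below uses only the standard library. The name describe_mount is hypothetical, and the is_mount field assumes that the mounted path appears as a regular mount point inside the worker, which may not hold in every runtime.

```python
import os
import tempfile


def describe_mount(mount_point):
    """Summarize whether a path exists, is a mount point, and what it contains."""
    exists = os.path.isdir(mount_point)
    return {
        "exists": exists,
        "is_mount": os.path.ismount(mount_point),
        "entries": sorted(os.listdir(mount_point)) if exists else [],
    }


# A temporary directory stands in for /mnt/oss_data so the snippet runs anywhere.
with tempfile.TemporaryDirectory() as fake_mount:
    open(os.path.join(fake_mount, "data.csv"), "w").close()
    status = describe_mount(fake_mount)
    missing = describe_mount(os.path.join(fake_mount, "not_there"))

print(status)
print(missing)
```

Inside _process, printing describe_mount("/mnt/oss_data") would write the summary to the job log.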

Check the LogView output for logs similar to the following:

FS Mount successful! /mnt/oss_data: ['data.csv', 'config.json', 'model.pkl']
Processing batch with shape: (1000, 5)