
Object Storage Service:Overview

Last Updated:Sep 28, 2023

ossimport is a tool for migrating data to Object Storage Service (OSS). You can deploy ossimport on local servers or Elastic Compute Service (ECS) instances in the cloud to migrate data from your computer or other cloud storage systems to OSS.

Features

  • Supports a wide range of data sources, such as on-premises file systems, Qiniu Cloud Object Storage (KODO), Baidu Object Storage (BOS), Amazon Simple Storage Service (Amazon S3), Azure Blob, UPYUN Storage Service (USS), Tencent Cloud Object Service (COS), Kingsoft Standard Storage Service (KS3), HTTP and HTTPS URL lists, and Alibaba Cloud OSS.

  • Supports the standalone deployment and distributed deployment modes. ossimport is easy to deploy and use in standalone mode. The distributed mode is suitable for large-scale data migration.

    Note

    In standalone mode, only one bucket can be migrated at a time.

  • Supports resumable upload.

  • Supports traffic throttling.

  • Supports migration of objects that are modified later than a specified time or objects whose names contain a specified prefix.

  • Supports data uploads and downloads in parallel.

Billing

ossimport is available free of charge. However, when you use ossimport to migrate data from third-party data sources over the Internet, you may incur outbound traffic fees and request fees on the data source side, and OSS-related fees such as fees for calling API operations. If you use transfer acceleration to speed up data migration between OSS buckets across regions, you are additionally charged transfer acceleration fees.

Usage notes

  • Migration speed

    The migration speed of ossimport varies based on factors such as the read bandwidth of the data source, local network bandwidth, and file size. Migration of files smaller than 200 KB is slow because large numbers of small files generate high IOPS.

  • Migration of archived files

    If you want to migrate archived files, you must restore the archived files before you can migrate the files.

  • Data staging

    When you use ossimport to migrate data, data streams are first transferred to the local memory and then uploaded to the destination.

  • Source data retention

    During a data migration task, ossimport performs only read operations on the data in the data source. It does not perform write operations, ensuring that the original data is not modified or deleted.

  • Other migration tools

    • Data Online Migration

      To migrate data from third-party data sources, we recommend that you use Data Online Migration.

    • ossutil

      To migrate data smaller than 30 TB in size, we recommend that you use ossutil. ossutil is a lightweight, easy-to-use tool. You can use the -u, --update and --snapshot-path options to incrementally migrate files. For more information, see cp.
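For example, an incremental migration with ossutil might be run as follows. The local directory and bucket name are placeholders; replace them with your own values.

```shell
# Copy a local directory to OSS recursively (-r), uploading only files
# that do not exist at the destination or are newer locally (-u/--update).
ossutil cp -r -u /data/example/ oss://examplebucket/data/

# Alternatively, keep upload state in a snapshot directory so that files
# that have not changed since the last run are skipped.
ossutil cp -r --snapshot-path=/tmp/oss-snapshot /data/example/ oss://examplebucket/data/
```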

Runtime environment

ossimport can be deployed on a Linux or Windows system that meets the following requirements:

  • Windows 7 or later

  • Latest version of Linux

  • Java 7

Important

ossimport cannot be deployed on Windows in distributed mode.

Deployment modes

ossimport supports the standalone and distributed deployment modes.

  • Standalone deployment is suitable for the migration of data smaller than 30 TB in size. To deploy ossimport in standalone mode, download the standalone package. You can deploy ossimport on a device that can access the data to be migrated and the OSS bucket to which you want to migrate the data.

  • Distributed deployment is suitable for the migration of data larger than 30 TB in size. To deploy ossimport in distributed mode, download the distributed package. You can deploy ossimport on any number of devices that can access the data that you want to migrate and the OSS bucket to which you want to migrate the data.

    Note

    To reduce the time required to migrate large amounts of data, you can deploy ossimport on an ECS instance that resides in the same region as your OSS bucket, and use a leased line to connect the source server to a virtual private cloud (VPC). Data migration over internal networks is faster.

    You can also use ossimport to transmit data over the Internet. In this case, the transmission speed is affected by the bandwidth of your on-premises machine.

Standalone mode

The master, worker, tracker, and console modules are compressed into ossimport2.jar and run on a single device. The system has only one worker.

The following content shows the file structure in standalone mode:

ossimport
├── bin
│ └── ossimport2.jar  # The JAR package that contains the master, worker, tracker, and console modules.
├── conf
│ ├── local_job.cfg   # The job configuration file in standalone deployment.
│ └── sys.properties  # The configuration file that contains system parameters.
├── console.bat         # The Windows command-line tool used to run commands step by step.
├── console.sh          # The Linux command-line tool used to run commands step by step.
├── import.bat          # The script that automatically imports files based on the conf/local_job.cfg configuration file on Windows. The configuration file contains parameters that specify data migration operations such as start, migration, verification, and retry.
├── import.sh           # The script that automatically imports files based on the conf/local_job.cfg configuration file on Linux. The configuration file contains parameters that specify data migration operations such as start, migration, verification, and retry.
├── logs                # The directory that contains logs.
└── README.md           # The file that provides a description of ossimport. We recommend that you read this file before you use ossimport.
  • import.bat and import.sh are scripts used to import files. You can run these scripts after you modify the local_job.cfg configuration file.

  • console.bat and console.sh are scripts used to perform specific operations step by step.

  • Run scripts or commands from the ossimport directory, which contains the *.bat and *.sh files.
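The scripts above might be used as follows in standalone mode; a sketch assuming the package has been decompressed into an ossimport directory and conf/local_job.cfg has already been edited.

```shell
cd ossimport
# One-click migration: starts the service, submits the job defined in
# conf/local_job.cfg, migrates and verifies data, and retries failed tasks.
bash import.sh

# Alternatively, perform the steps one by one through the console:
bash console.sh start     # start the service
bash console.sh submit    # submit the job
bash console.sh retry     # retry failed tasks, if any
```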

Distributed mode

The ossimport architecture in distributed mode consists of a master and multiple workers:

Master --------- Job --------- Console
    |
TaskTracker
    |___________________________
    |           |           |
   Task        Task        Task
    |           |           |
  Worker      Worker      Worker

Component

Description

Master

Splits a job into multiple tasks by data size and number of files. The data size and number of files can be configured in the sys.properties file. The master splits a job into multiple tasks by performing the following steps:

  1. The master traverses the full list of files to be migrated from the local source or a cloud storage system.

  2. The master splits a job into multiple tasks by data size and number of files. Each task is responsible for the migration or verification of a portion of files.

Worker

  • Migrates files and verifies data for tasks. A worker pulls a specific file from the data source and uploads the file to the specified directory in OSS. You can specify the data source and OSS configurations in the job.cfg or local_job.cfg configuration file.

  • Supports throttling and a custom number of concurrent tasks for data migration. You configure the settings in the sys.properties configuration file.

TaskTracker

Distributes tasks and tracks task status. The TaskTracker is abbreviated as tracker.

Console

Interacts with users, receives command input, and displays command output. The console supports system management commands including deploy, start, and stop, and job management commands including submit, retry, and clean.

Job

The data migration job submitted by a user. One job corresponds to one job.cfg configuration file.

Task

Migrates a portion of files. A job can be divided into multiple tasks by data size and number of files. The minimal unit for dividing a job into tasks is a file. One file is not assigned to multiple tasks.

In distributed deployment, you can start multiple devices and run only one worker on each device to migrate data. Tasks are evenly assigned to workers, and a worker runs multiple tasks.
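The splitting rule described above can be sketched as follows. This is a simplified illustration of the technique, not the actual ossimport implementation; the file names and limits are hypothetical.

```python
# Simplified sketch of how a master splits a job into tasks: a new task
# starts whenever adding the next file would exceed taskObjectCountLimit
# or taskObjectSizeLimit. Each file belongs to exactly one task.

def split_job(files, task_object_count_limit=10000,
              task_object_size_limit=1 << 30):
    """files: list of (name, size_in_bytes) tuples produced by the scan."""
    tasks, current, current_size = [], [], 0
    for name, size in files:
        if current and (len(current) >= task_object_count_limit
                        or current_size + size > task_object_size_limit):
            tasks.append(current)
            current, current_size = [], 0
        current.append((name, size))
        current_size += size
    if current:
        tasks.append(current)
    return tasks

# Example: with a count limit of 2, five files are split into three tasks.
tasks = split_job([("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5)],
                  task_object_count_limit=2)
```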

The following content shows the file structure in distributed mode:

ossimport
├── bin
│ ├── console.jar     # The JAR package for the console module.
│ ├── master.jar      # The JAR package for the master module.
│ ├── tracker.jar     # The JAR package for the tracker module.
│ └── worker.jar      # The JAR package for the worker module.
├── conf
│ ├── job.cfg         # The Job configuration file template.
│ ├── sys.properties  # The configuration file that contains system parameters.
│ └── workers         # The list of workers.
├── console.sh          # The command-line tool. Only Linux is supported.
├── logs                # The directory that contains logs.
└── README.md           # The file that provides a description of ossimport. We recommend that you read this file before you use ossimport.

Configuration files

The sys.properties and local_job.cfg configuration files are available in standalone mode. The sys.properties, job.cfg, and workers configuration files are available in distributed mode. The local_job.cfg and job.cfg configuration files have the same parameters. The workers configuration file is exclusive to the distributed mode.

  • sys.properties: the system parameters.

    Parameter

    Description

    Remarks

    workingDir

    The working directory.

    The directory to which the package is decompressed. Do not modify this parameter in standalone mode. In distributed mode, the working directory must be the same for each device.

    workerUser

    The SSH username used to log on to the device on which a worker resides.

    • If privateKeyFile is configured, the value specified for privateKeyFile is used.

    • If privateKeyFile is not configured, the values specified for workerUser and workerPassword are used.

    • Do not modify this parameter in standalone mode.

    workerPassword

    The SSH password used to log on to the device on which a worker resides.

    Do not modify this parameter in standalone mode.

    privateKeyFile

    The path of the private key file.

    • If you have already established an SSH connection, you can specify this parameter. Otherwise, leave this parameter empty.

    • If privateKeyFile is configured, the value specified for privateKeyFile is used.

    • If privateKeyFile is not configured, the values specified for workerUser and workerPassword are used.

    • Do not modify this parameter in standalone mode.

    sshPort

    The SSH port.

    The default value is 22. In most cases, we recommend that you retain the default value. Do not modify this parameter in standalone mode.

    workerTaskThreadNum

    The maximum number of threads for a worker to run tasks.

    • This parameter is related to the device memory and network conditions. We recommend that you set this parameter to 60.

    • The value can be increased for physical machines. For example, you can set this parameter to 150. If the network bandwidth is already full, do not further increase the value.

    • If the network conditions are poor, reduce the value. For example, you can set this parameter to 30. This way, you can prevent timeout errors caused by competition for network resources.

    workerMaxThroughput(KB/s)

    The maximum throughput for data migration of a worker.

    This parameter can be used for throttling. The default value is 0, which indicates that no throttling is imposed.

    dispatcherThreadNum

    The number of threads for task distribution and state confirmation of the tracker.

    If you do not have special requirements, retain the default value.

    workerAbortWhenUncatchedException

    Specifies whether to skip or terminate a task if an unknown error occurs.

    By default, a task is skipped if an unknown error occurs.

    workerRecordMd5

    Indicates whether to use the x-oss-meta-md5 metadata item to record the MD5 hash of files to be migrated. By default, the MD5 hash is not recorded.

    The value of this parameter is used to verify data integrity.

  • job.cfg: the configurations for data migration jobs. The local_job.cfg and job.cfg configuration files differ in names but contain the same parameters.

    Parameter

    Description

    Remarks

    jobName

    The name of the job. The value is of the String type.

    • A job name uniquely identifies a job. A job name must comply with the following naming rules: The name can contain letters, digits, underscores (_), and hyphens (-). The name must be 4 to 128 characters in length. You can submit multiple jobs that have different names.

    • If you submit a job with the same name as an existing job, the system prompts that the job already exists. Before you clean the existing job, you are not allowed to submit the job with the same name.

    jobType

    The type of the job. The value is of the String type.

    Valid values: import and audit. Default value: import.

    • import: runs the data migration job and verifies the consistency of the migrated data.

    • audit: only verifies data consistency.

    isIncremental

    Specifies whether to enable the incremental migration mode. The value is of the Boolean type.

    • Default value: false.

    • If this parameter is set to true, ossimport scans for incremental data at the interval, in seconds, specified by incrementalModeInterval, and migrates the incremental data to OSS.

    incrementalModeInterval

    The migration interval in seconds in incremental migration mode. The value is of the Integer type.

    This parameter is valid when isIncremental is set to true. The minimum interval is 900 seconds. We recommend that you set the parameter to a value not less than 3,600 seconds to prevent request surges and additional overhead.

    importSince

    The time condition for the data migration job. Data whose last modified time is later than the value of this parameter is migrated. The value is of the Integer type. Unit: seconds.

    • The timestamp must be in the UNIX format. It is the number of seconds that have elapsed since 00:00:00 Thursday, January 1, 1970. You can run the date +%s command to query the UNIX timestamp.

    • The default value is 0, which indicates that all data is migrated.

    srcType

    The source of the data migration. The value is of the String type and is case-sensitive.

    Valid values:

    • local: migrates data from a local file to OSS. If this value is specified, specify srcPrefix and leave srcAccessKey, srcSecretKey, srcDomain, and srcBucket unspecified.

    • oss: migrates data from an OSS bucket to another bucket.

    • qiniu: migrates data from KODO to OSS.

    • bos: migrates data from BOS to OSS.

    • ks3: migrates data from KS3 to OSS.

    • s3: migrates data from Amazon S3 to OSS.

    • youpai: migrates data from USS to OSS.

    • http: migrates data from HTTP or HTTPS URL lists to OSS.

    • cos: migrates data from COS to OSS.

    • azure: migrates data from Azure Blob to OSS.

    srcAccessKey

    The AccessKey ID used to access the source. The value is of the String type.

    • If srcType is set to oss, qiniu, bos, ks3, or s3, specify the AccessKey ID used to access the source.

    • If srcType is set to local or http, ignore this parameter.

    • If srcType is set to youpai or azure, specify the username used to access the source.

    srcSecretKey

    The AccessKey secret used to access the source. The value is of the String type.

    • If srcType is set to oss, qiniu, bos, ks3, or s3, specify the AccessKey secret used to access the source.

    • If srcType is set to local or http, ignore this parameter.

    • If srcType is set to youpai, specify the operator password used to access the source.

    • If srcType is set to azure, specify the account key used to access the source.

    srcDomain

    The source endpoint.

    • If srcType is set to local or http, ignore this parameter.

    • If srcType is set to oss, enter the endpoint obtained from the OSS console. The endpoint is a root domain, which does not include the bucket name.

    • If srcType is set to qiniu, enter the domain name corresponding to the bucket obtained from the KODO console.

    • If srcType is set to bos, enter the BOS domain name. Example: http://bj.bcebos.com or http://gz.bcebos.com.

    • If srcType is set to ks3, enter the KS3 domain name, such as http://kss.ksyun.com, http://ks3-cn-beijing.ksyun.com, or http://ks3-us-west-1.ksyun.com.

    • If srcType is set to s3, enter the endpoint for the corresponding Amazon S3 region.

    • If srcType is set to youpai, enter one of the USS domain names, such as http://v0.api.upyun.com (automatically identified optimal line), http://v1.api.upyun.com (China Telecom line), http://v2.api.upyun.com (China Unicom or China Netcom line), or http://v3.api.upyun.com (China Mobile or China Railcom line).

    • If srcType is set to cos, enter the region where the COS bucket resides. Example: ap-guangzhou.

    • If srcType is set to azure, enter the endpoint suffix in the Azure Blob connection string. Example: core.chinacloudapi.cn.

    srcBucket

    The name of the source bucket or container.

    • If srcType is set to local or http, ignore this parameter.

    • If srcType is set to azure, enter the name of the source container.

    • In other cases, enter the name of the source bucket.

    srcPrefix

    The source prefix. The value is of the String type. This parameter is empty by default.

    • If srcType is set to local, enter the full path that ends with a forward slash (/). If the path contains two or more directory levels, separate them with forward slashes (/). Examples: c:/example/ and /data/example/.

      Important

      Paths such as c:/example//, /data//example/, and /data/example// are invalid.

    • If srcType is set to oss, qiniu, bos, ks3, youpai, or s3, enter the prefix of the objects to be synchronized, excluding the bucket name. Example: data/to/oss/.

    • To migrate all objects, leave srcPrefix empty.

    destAccessKey

    The AccessKey ID used to access the destination OSS bucket. The value is of the String type.

    You can obtain the AccessKey ID used to access the destination OSS bucket from the OSS console.

    destSecretKey

    The AccessKey secret used to access the destination OSS bucket. The value is of the String type.

    You can obtain the AccessKey secret used to access the destination OSS bucket from the OSS console.

    destDomain

    The destination endpoint. The value is of the String type.

    The endpoint is a root domain, which does not include the bucket name. You can obtain the endpoint from the OSS console.

    destBucket

    The destination bucket. The value is of the String type.

    The name of the destination OSS bucket. The name cannot end with a forward slash (/).

    destPrefix

    The prefix.

    • The name prefix of the migrated object in the destination OSS bucket. This parameter is empty by default. If you retain the default value, the migrated objects are stored in the root directory in the destination bucket.

    • To synchronize data to a specified directory in OSS, end the prefix with a forward slash (/). Example: data/in/oss/.

    • OSS object names cannot start with a forward slash (/). Do not set destPrefix to a value that starts with a forward slash (/).

    • A local file whose path is in the srcPrefix+relativePath format is migrated to an OSS path in the destDomain/destBucket/destPrefix+relativePath format.

    • An object whose path is in the srcDomain/srcBucket/srcPrefix+relativePath format in the cloud is migrated to an OSS path in the destDomain/destBucket/destPrefix+relativePath format.

    taskObjectCountLimit

    The maximum number of files in each task. The value is of the Integer type. The default value is 10000.

    This parameter affects the concurrency level of jobs that you want to run. In most cases, this parameter is set to a value calculated based on the following formula: Total number of files/Total number of workers/Number of migration threads (workerTaskThreadNum). The maximum value is 50000. If the total number of files is unknown, retain the default value.

    taskObjectSizeLimit

    The maximum data size in bytes for each task. The value is of the Integer type. Default value: 1 GB.

    This parameter affects the concurrency level of jobs that you want to run. In most cases, this parameter is set to a value calculated based on the following formula: Total data size/Total number of workers/Number of migration threads (workerTaskThreadNum). If the total data size is unknown, retain the default value.

    isSkipExistFile

    Specifies whether to skip objects that already exist during data migration. The value is of the Boolean type.

    If this parameter is set to true, the objects are skipped based on their size and last modified time. If this parameter is set to false, objects that already exist are overwritten. Default value: false. This parameter is invalid when jobType is set to audit.

    scanThreadCount

    The number of threads that scan files in parallel. The value is of the Integer type.

    • Default value: 1.

    • Valid values: 1 to 32.

    This parameter affects the efficiency of file scanning. If you do not have special requirements, retain the default value.

    maxMultiThreadScanDepth

    The maximum depth of directories for parallel scanning. The value is of the Integer type.

    • Default value: 1.

    • Valid values: 1 to 16.

    • The default value indicates parallel scanning in top-level directories.

    • If you do not have special requirements, retain the default value. A large value may cause task failures.

    appId

    The application ID of COS. The value is of the Integer type.

    This parameter is valid when srcType is set to cos.

    httpListFilePath

    The absolute path of the HTTP URL list file. The value is of the String type.

    • This parameter is valid when srcType is set to http. If the source is an HTTP URL list, specify the absolute path of the URL list file. Example: c:/example/http.list.

    • Each HTTP URL in the file is divided into two parts separated by spaces. The first part specifies the URL prefix, and the second part specifies the relative path of the object in OSS after the object hosted on the URL is migrated. For example, the HTTP URL list file in the c:/example/http.list path contains the following URLs:

      http://xxx.xxx.com/aa/  bb.jpg
      http://xxx.xxx.com/cc/  dd.jpg

      If you set destPrefix to ee/, the objects migrated to OSS have the following names:

      ee/bb.jpg
      ee/dd.jpg
  • workers: only available in distributed mode. Multiple IP addresses are separated with line feeds. Examples:

    192.168.1.6
    192.168.1.7
    192.168.1.8
    • In the preceding configuration, 192.168.1.6 in the first line must be the master. In other words, 192.168.1.6 is the IP address of the device on which the master, worker, and tracker are started. The console also runs on this device.

    • Make sure that the same username, logon method, and working directory are used for all workers.
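For reference, a minimal local_job.cfg for a standalone, local-to-OSS migration might look like the following sketch. The AccessKey pair, endpoint, bucket name, and paths are placeholders; parameters that are not shown keep their defaults.

```ini
# conf/local_job.cfg (fragment, standalone mode; all values are placeholders)
jobName=local-to-oss-example
jobType=import
isIncremental=false
srcType=local
srcPrefix=/data/example/
destAccessKey=yourAccessKeyId
destSecretKey=yourAccessKeySecret
destDomain=http://oss-cn-hangzhou.aliyuncs.com
destBucket=examplebucket
destPrefix=data/in/oss/
```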

Configuration file examples

The following table describes the configuration files of data migration jobs in distributed mode for common scenarios. The configuration file in standalone mode is named local_job.cfg and contains the same configuration items as the configuration file in distributed mode.

Migration scenario

Configuration file

Description

Migrate local data to OSS

job.cfg

srcPrefix specifies an absolute path that ends with a forward slash (/). Examples: D:/work/oss/data/ and /home/user/work/oss/data/.

Migrate data from KODO to OSS

job.cfg

srcPrefix and destPrefix can be left empty. If they are not left empty, end the prefixes with a forward slash (/). Example: destPrefix=docs/.

Migrate data from BOS to OSS

job.cfg

srcPrefix and destPrefix can be left empty. If they are not left empty, end the prefixes with a forward slash (/). Example: destPrefix=docs/.

Migrate data from Amazon S3 to OSS

job.cfg

For more information, see AWS service endpoints.

Migrate data from USS to OSS

job.cfg

Set srcAccessKey to the operator account and srcSecretKey to the corresponding password.

Migrate data from COS to OSS

job.cfg

Specify srcDomain in the COS V4 region format. Example: srcDomain=sh. srcPrefix can be left empty. If it is not left empty, start and end the prefix with forward slashes (/). Example: srcPrefix=/docs/.

Migrate data from Azure Blob to OSS

job.cfg

Set srcAccessKey to the storage account and srcSecretKey to the access key. Set srcDomain to the endpoint suffix in the Azure Blob connection string. Example: core.chinacloudapi.cn.

Migrate data between buckets in OSS

job.cfg

This method is suitable for data migration between buckets across different regions, of different storage classes, or with different prefixes in their names. We recommend that you deploy your service on ECS and use internal endpoints to minimize traffic costs.
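As a sketch of the bucket-to-bucket scenario, a job.cfg fragment might look like the following. All values are placeholders, and the internal endpoint assumes that ossimport runs on an ECS instance in the same region as the buckets.

```ini
# job.cfg fragment for bucket-to-bucket migration (values are placeholders).
jobName=oss-to-oss-example
jobType=import
srcType=oss
srcAccessKey=yourAccessKeyId
srcSecretKey=yourAccessKeySecret
srcDomain=http://oss-cn-hangzhou-internal.aliyuncs.com
srcBucket=source-bucket
srcPrefix=
destAccessKey=yourAccessKeyId
destSecretKey=yourAccessKeySecret
destDomain=http://oss-cn-hangzhou-internal.aliyuncs.com
destBucket=dest-bucket
destPrefix=
```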

Advanced settings

  • Throttle traffic

    In the sys.properties configuration file, workerMaxThroughput(KB/s) specifies the maximum throughput for data migration of a worker. To throttle traffic, for example when the source has limited read bandwidth or the network is constrained, set this parameter to a value less than the maximum available bandwidth of your device based on your business needs. After you modify the parameter, restart the service for the modification to take effect.

    In distributed mode, modify the sys.properties configuration file in the $OSS_IMPORT_WORK_DIR/conf directory for each worker and restart the service.

    To throttle traffic, modify the sys.properties configuration file as scheduled by using crontab and restart the service for the modification to take effect.
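For example, scheduled throttling might be implemented with crontab entries such as the following sketch; the sed expressions, throughput limits, and installation path are placeholders.

```shell
# Illustrative crontab entries: throttle each worker to 10 MB/s at 09:00
# and lift the limit at 21:00, then restart the service so that the
# change takes effect. Paths and values are placeholders.
0 9  * * * sed -i 's|^workerMaxThroughput(KB/s)=.*|workerMaxThroughput(KB/s)=10240|' /root/ossimport/conf/sys.properties && cd /root/ossimport && bash console.sh stop && bash console.sh start
0 21 * * * sed -i 's|^workerMaxThroughput(KB/s)=.*|workerMaxThroughput(KB/s)=0|' /root/ossimport/conf/sys.properties && cd /root/ossimport && bash console.sh stop && bash console.sh start
```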

  • Modify the number of concurrent tasks

    • In the sys.properties configuration file, workerTaskThreadNum specifies the number of concurrent tasks run by a worker. If the network conditions are poor and a worker has to process a large number of tasks, timeout errors occur. To resolve this issue, modify the configuration by reducing the number of concurrent tasks and restart the service.

    • In the sys.properties configuration file, workerMaxThroughput(KB/s) specifies the maximum throughput for data migration of a worker. To configure throttling for scenarios such as source-side throttling and network throttling, set this parameter to a value less than the maximum available bandwidth of your device based on your business needs.

    • In the job.cfg configuration file, taskObjectCountLimit specifies the maximum number of files in each task. The default value is 10000. This parameter configuration affects the number of tasks. The efficiency of implementing concurrent tasks degrades if the number of tasks is small.

    • In the job.cfg configuration file, taskObjectSizeLimit specifies the maximum data size for each task. The default maximum data size for each task is 1 GB. This parameter configuration affects the number of tasks. The efficiency of implementing concurrent tasks degrades if the number of tasks is small.

      Important
      • Before you start your data migration, configure the parameters in the configuration files.

      • After you modify parameters in the sys.properties configuration file, restart the local server or the ECS instance on which ossimport is deployed for the modification to take effect.

      • After job.cfg is submitted, parameters in the job.cfg configuration file cannot be modified.

  • Verify data without migrating data

    To only verify data by using ossimport, set jobType to audit instead of import in the job.cfg or local_job.cfg configuration file. Configure other parameters in the same way as you configure them for data migration.

  • Specify the incremental data migration mode

    In incremental data migration mode, ossimport migrates the existing full data after you submit the migration task, and then migrates incremental data at the specified interval. The incremental data migration mode is suitable for data backup and synchronization.

    The following configuration items are related to incremental data migration:

    • In the job.cfg configuration file, isIncremental specifies whether to enable the incremental data migration mode. true indicates that the incremental data migration mode is enabled. false indicates that the incremental data migration mode is disabled. The default value is false.

    • In the job.cfg configuration file, incrementalModeInterval specifies the interval, in seconds, at which incremental data migration is performed. This configuration item takes effect only when isIncremental is set to true. The minimum value is 900. We recommend that you do not set this parameter to a value less than 3600; otherwise, a large number of requests are wasted and additional overhead is incurred.
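    For example, the two items above might be set as follows in job.cfg to scan for incremental data every hour:

```ini
# job.cfg fragment: enable incremental migration with a one-hour interval.
isIncremental=true
incrementalModeInterval=3600
```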

  • Filter data to be migrated

    You can set filtering conditions to migrate objects that meet specific conditions. ossimport allows you to use the prefix and last modified time to specify objects to migrate.

    • In the job.cfg configuration file, srcPrefix specifies the prefix of source objects. This parameter is empty by default.

      • If you set srcType to local, enter the path of the local directory. Enter the full path that ends with a forward slash (/). If the path contains two or more directory levels, separate them with forward slashes (/). Examples: c:/example/ or /data/example/.

      • If you set srcType to oss, qiniu, bos, ks3, youpai, or s3, enter the name prefix of source objects without the bucket name. Example: data/to/oss/. To migrate all objects, leave srcPrefix empty.

    • In the job.cfg configuration file, importSince specifies the last modified time of source objects as a UNIX timestamp, which is the number of seconds that have elapsed since 00:00:00 Thursday, January 1, 1970. You can run the date +%s command to query the current UNIX timestamp. The default value is 0, which indicates that all data is migrated. In incremental data migration mode, this parameter applies only to the full data migration. In other modes, it applies to the entire migration job.

      • If the value of LastModified Time of an object is less than the value of importSince, the object is not migrated.

      • If the value of LastModified Time of an object is greater than the value of importSince, the object is migrated.
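The importSince comparison can be sketched as follows; a simplified illustration with hypothetical object metadata, not ossimport's actual code.

```python
# Simplified sketch of the importSince filter: only objects whose last
# modified time is later than importSince are selected for migration.
# An importSince of 0 means that everything is migrated.
def should_migrate(last_modified, import_since):
    """Both arguments are UNIX timestamps in seconds."""
    return import_since == 0 or last_modified > import_since

# Hypothetical objects: (name, last modified time).
objects = [("old.txt", 1_600_000_000), ("new.txt", 1_700_000_000)]
import_since = 1_650_000_000  # e.g. obtained earlier with: date +%s
selected = [name for name, mtime in objects
            if should_migrate(mtime, import_since)]
# selected == ["new.txt"]
```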