ossimport is a tool used to migrate data to Object Storage Service (OSS). You can deploy ossimport on local servers or Elastic Compute Service (ECS) instances in the cloud to migrate data from your computer or other cloud storage systems to OSS.

Features

  • Supports a wide range of data sources including local files, Qiniu Cloud Object Storage (KODO), Baidu Object Storage (BOS), Amazon Simple Storage Service (Amazon S3), Azure Blob, UPYUN Storage Service (USS), Tencent Cloud Object Service (COS), Kingsoft Standard Storage Service (KS3), HTTP and HTTPS URL lists, and OSS. These sources can be expanded as needed.
  • Supports the standalone and distributed modes. ossimport is easy to deploy and use in standalone mode. The distributed mode is suitable for large-scale data migration.
    Note In standalone mode, only one bucket can be migrated at a time.
  • Supports resumable upload.
  • Supports throttling.
  • Supports migration of objects that are modified later than a specified time or objects whose names contain a specified prefix.
  • Supports data upload and download in parallel.

Usage notes

  • You can also use Data Online Migration to migrate your data without deploying other migration tools.
  • To migrate data smaller than 30 TB in size, we recommend that you use ossutil, which is easy to use. You can use the -u, --update, and --snapshot-path options to incrementally upload objects. For more information, see cp.

Runtime environments

ossimport can be deployed on a Linux or Windows device that meets the following requirements:
  • Windows 7 or later
  • Latest version of Linux
  • Java 1.7
Important ossimport cannot be deployed in distributed mode on Windows.

Deployment modes

ossimport supports the standalone and distributed modes.

  • The standalone mode is suitable for the migration of data smaller than 30 TB in size. You can deploy ossimport on a device that can access the data to be migrated and the OSS bucket to which you want to migrate the data.
  • The distributed mode is suitable for the migration of data larger than 30 TB in size. You can deploy ossimport on multiple devices that can access the data to be migrated and the OSS bucket to which you want to migrate the data.
    Note To reduce the time required to migrate large amounts of data, you can deploy ossimport on an ECS instance that is in the same region as your OSS bucket, and connect the server that stores the source data to a virtual private cloud (VPC) over a leased line. Migration efficiency is greatly improved when data is migrated from ECS instances to OSS over the internal network.

Standalone mode

In standalone mode, the master, worker, tracker, and console modules are packaged into ossimport2.jar and run on a single device. The system runs only one worker.

The following code shows the file structure in standalone mode:
ossimport
├── bin
│ └── ossimport2.jar  # The JAR package that contains the master, worker, tracker, and console modules.
├── conf
│ ├── local_job.cfg   # The job configuration file for the standalone mode.
│ └── sys.properties  # The configuration file that contains system parameters.
├── console.bat         # The Windows command-line tool used to run tasks step by step.
├── console.sh          # The Linux command-line tool used to run tasks step by step.
├── import.bat          # The script that automatically imports files based on the conf/local_job.cfg configuration file on Windows. The configuration file contains parameters for data migration operations such as start, migration, verification, and retry.
├── import.sh           # The script that automatically imports files based on the conf/local_job.cfg configuration file on Linux. The configuration file contains parameters for data migration operations such as start, migration, verification, and retry.
├── logs                # The directory that stores logs.
└── README.md           # The file that provides a description of ossimport. We recommend that you read this file before you use ossimport.
  • import.bat and import.sh are scripts used to import files based on the configuration file. You can run these scripts after you modify the local_job.cfg configuration file.
  • console.bat and console.sh are scripts used to perform specific operations step by step.
  • Run the scripts or commands from the ossimport directory, which is the directory at the same level as the *.bat and *.sh files.

Distributed mode

The ossimport architecture in distributed mode consists of a master and multiple workers. The following figure shows the architecture:
Master --------- Job --------- Console
    |
    |
   TaskTracker
    |_____________________
    |Task     | Task      | Task
    |         |           |
Worker      Worker      Worker
  • Master: Splits a job into multiple tasks by data size and number of files. The data size and number of files can be configured in the sys.properties file. The master splits a job into tasks by performing the following steps:
    1. The master traverses the full list of files to be migrated from the local or cloud storage source.
    2. The master splits the job into multiple tasks by data size and number of files. Each task migrates or verifies a portion of the files.
  • Worker:
    • Migrates files and verifies data for tasks. A worker pulls a specific file from the data source and uploads the file to the specified directory in OSS. You can specify the data source and OSS configurations in the job.cfg or local_job.cfg configuration file.
    • Supports throttling and a configurable number of concurrent tasks for data migration. You can configure these settings in the sys.properties configuration file.
  • TaskTracker: Distributes tasks and tracks task states. TaskTracker is abbreviated as tracker.
  • Console: Interacts with users, and receives and displays command output. The console supports system management commands such as deploy, start, and stop, and job management commands such as submit, retry, and clean.
  • Job: A data migration job submitted by a user. One job corresponds to one configuration file named job.cfg.
  • Task: Migrates a portion of the files. A job can be split into multiple tasks by data size and number of files. The minimum unit for splitting a job into tasks is a file. A single file is not assigned to multiple tasks.

In distributed mode, multiple workers can be started to migrate data. Tasks are evenly allocated to workers. One worker can run multiple tasks. Only a single worker can be started on each device. The master and tracker are started on the device in which the first worker specified by workers resides. The console must also run on this device.
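The splitting behavior described above can be sketched in a few lines. The following Python function is a hypothetical illustration of how a job might be divided into tasks by file count and data size; the names and logic are illustrative and are not ossimport's actual implementation:

```python
# Hypothetical sketch of how a master could split a job into tasks,
# following the taskObjectCountLimit and taskObjectSizeLimit rules
# described above. A single file is never split across tasks.

def split_job(files, count_limit=10000, size_limit=1 << 30):
    """Split a list of (name, size) pairs into tasks.

    A task is closed as soon as it reaches count_limit files or
    size_limit bytes of accumulated data.
    """
    tasks, current, current_size = [], [], 0
    for name, size in files:
        current.append((name, size))
        current_size += size
        if len(current) >= count_limit or current_size >= size_limit:
            tasks.append(current)
            current, current_size = [], 0
    if current:
        tasks.append(current)
    return tasks

# Example: 5 files of 600 MB each with a 1 GB size limit per task.
files = [(f"file{i}", 600 * 1024 * 1024) for i in range(5)]
tasks = split_job(files, count_limit=10000, size_limit=1 << 30)
print(len(tasks))  # → 3
```

With a 1 GB limit, each task closes after two 600 MB files, so five files yield three tasks. The last task holds the remaining file.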

The following code shows the file structure in distributed mode:
ossimport
├── bin
│ ├── console.jar     # The JAR package for the console module.
│ ├── master.jar      # The JAR package for the master module.
│ ├── tracker.jar     # The JAR package for the tracker module.
│ └── worker.jar      # The JAR package for the worker module.
├── conf
│ ├── job.cfg         # The job configuration file template.
│ ├── sys.properties  # The configuration file that contains system parameters.
│ └── workers         # The list of workers.
├── console.sh          # The command-line tool. Only Linux is supported.
├── logs                # The directory that stores logs.
└── README.md           # The file that provides a description of ossimport. We recommend that you read this file before you use ossimport.

Configuration files

The sys.properties and local_job.cfg configuration files are available in standalone mode. The sys.properties, job.cfg, and workers configuration files are available in distributed mode. The local_job.cfg and job.cfg configuration files contain the same parameters. The workers configuration file is only available in the distributed mode.

  • sys.properties: the system parameters
    • workingDir: The working directory, which is the directory to which the tool package is decompressed. Do not modify this parameter in standalone mode. In distributed mode, the working directory must be the same on each device.
    • workerUser: The SSH username used to log on to the device on which a worker resides.
      • If privateKeyFile is configured, the value specified for privateKeyFile is used.
      • If privateKeyFile is not configured, the values specified for workerUser and workerPassword are used.
      • Do not modify this parameter in standalone mode.
    • workerPassword: The SSH password used to log on to the device on which a worker resides. Do not modify this parameter in standalone mode.
    • privateKeyFile: The path of the private key file.
      • If you have already established an SSH connection, you can specify this parameter. Otherwise, leave this parameter empty.
      • If privateKeyFile is configured, the value specified for privateKeyFile is used.
      • If privateKeyFile is not configured, the values specified for workerUser and workerPassword are used.
      • Do not modify this parameter in standalone mode.
    • sshPort: The SSH port. Default value: 22. In most cases, we recommend that you retain the default value. Do not modify this parameter in standalone mode.
    • workerTaskThreadNum: The maximum number of threads that a worker can use to run tasks.
      • This parameter is related to the memory and network conditions of the device. We recommend that you set this parameter to 60.
      • The value can be increased for physical machines. For example, you can set this parameter to 150. If the network bandwidth is already saturated, do not further increase the value.
      • If the network conditions are poor, reduce the value. For example, set this parameter to 30 to prevent timeout errors caused by competition for network resources.
    • workerMaxThroughput(KB/s): The maximum data migration throughput of a worker. This parameter can be used for throttling. The default value is 0, which indicates that no throttling is imposed.
    • dispatcherThreadNum: The number of threads that the tracker uses for task distribution and state confirmation. If you do not have special requirements, retain the default value.
    • workerAbortWhenUncatchedException: Specifies whether to skip or terminate a task when an unknown error occurs. By default, the task is skipped.
    • workerRecordMd5: Specifies whether to record the MD5 hash of migrated files in the x-oss-meta-md5 metadata field. By default, the MD5 hash is not recorded. The MD5 hash is used to verify the data integrity of migrated files.
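    For reference, the following is an illustrative sys.properties for standalone mode. All values are examples based on the parameter descriptions above, not required settings:

    ```properties
    # Illustrative standalone-mode sys.properties; values are examples only.
    workingDir=/root/ossimport
    workerUser=root
    workerPassword=******
    privateKeyFile=
    sshPort=22
    workerTaskThreadNum=60
    workerMaxThroughput(KB/s)=0
    workerRecordMd5=false
    ```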
  • job.cfg: the configurations for data migration jobs. The local_job.cfg and job.cfg configuration files have different file names but contain the same parameters. The following table describes the parameters.
    • jobName: The name of the job. The value is of the String type.
      • A job name uniquely identifies a job. The name can contain letters, digits, underscores (_), and hyphens (-), and must be 4 to 128 characters in length. You can submit multiple jobs that have different names.
      • If you submit a job that has the same name as an existing job, the system displays a message indicating that the job already exists. You must delete the existing job before you can resubmit a job that has the same name.
    • jobType: The type of the job. The value is of the String type. Valid values: import and audit. Default value: import.
      • import: runs the data migration job and verifies data consistency.
      • audit: only verifies data consistency.
    • isIncremental: Specifies whether to enable incremental migration mode. The value is of the Boolean type.
      • Default value: false.
      • If this parameter is set to true, ossimport scans the source at the interval, in seconds, specified by incrementalModeInterval to detect incremental data and then migrates the incremental data to OSS.
    • incrementalModeInterval: The migration interval, in seconds, in incremental migration mode. The value is of the Integer type. This parameter takes effect only when isIncremental is set to true. The minimum interval is 900 seconds. We recommend that you set this parameter to a value of at least 3,600 seconds to prevent request surges and additional overhead.
    • importSince: The time condition for the data migration job. Only data whose last modified time is later than the value of this parameter is migrated. The value is of the Integer type. Unit: seconds.
      • The value is a UNIX timestamp: the number of seconds that have elapsed since 00:00:00 UTC on Thursday, January 1, 1970. You can run the date +%s command to query the current timestamp.
      • Default value: 0, which indicates that all data is migrated.
    • srcType: The source of the data migration. The value is of the String type and is case-sensitive. Valid values:
      • local: migrates data from local files to OSS. If you set srcType to this value, specify srcPrefix and leave srcAccessKey, srcSecretKey, srcDomain, and srcBucket unspecified.
      • oss: migrates data from one OSS bucket to another OSS bucket.
      • qiniu: migrates data from KODO to OSS.
      • bos: migrates data from BOS to OSS.
      • ks3: migrates data from KS3 to OSS.
      • s3: migrates data from Amazon S3 to OSS.
      • youpai: migrates data from USS to OSS.
      • http: migrates data from HTTP or HTTPS URL lists to OSS.
      • cos: migrates data from COS to OSS.
      • azure: migrates data from Azure Blob to OSS.
    • srcAccessKey: The AccessKey ID used to access the source. The value is of the String type.
      • If you set srcType to oss, qiniu, bos, ks3, or s3, specify the AccessKey ID used to access the source.
      • If you set srcType to local or http, ignore this parameter.
      • If you set srcType to youpai or azure, specify the username used to access the source.
    • srcSecretKey: The AccessKey secret used to access the source. The value is of the String type.
      • If you set srcType to oss, qiniu, bos, ks3, or s3, specify the AccessKey secret used to access the source.
      • If you set srcType to local or http, ignore this parameter.
      • If you set srcType to youpai, specify the operator password used to access the source.
      • If you set srcType to azure, specify the account key used to access the source.
    • srcDomain: The source endpoint. The value is of the String type.
      • If you set srcType to local or http, ignore this parameter.
      • If you set srcType to oss, enter the endpoint obtained from the OSS console. The endpoint is a subdomain, which does not include the bucket name.
      • If you set srcType to qiniu, enter the domain name that corresponds to the bucket, which you can obtain from the KODO console.
      • If you set srcType to bos, enter the BOS domain name. Examples: http://bj.bcebos.com and http://gz.bcebos.com.
      • If you set srcType to ks3, enter the KS3 domain name. Examples: http://kss.ksyun.com, http://ks3-cn-beijing.ksyun.com, and http://ks3-us-west-1.ksyun.com.
      • If you set srcType to s3, enter the domain name of the region in which the source Amazon S3 bucket is located.
      • If you set srcType to youpai, enter the USS domain name. Examples: http://v0.api.upyun.com (automatically identified optimal line), http://v1.api.upyun.com (telecommunication line), http://v2.api.upyun.com (China Unicom or China Netcom line), and http://v3.api.upyun.com (China Mobile or China Railcom line).
      • If you set srcType to cos, enter the region in which the COS bucket is located. Example: ap-guangzhou.
      • If you set srcType to azure, enter the endpoint suffix in the Azure Blob connection string. Example: core.chinacloudapi.cn.
    • srcBucket: The name of the source bucket or container.
      • If you set srcType to local or http, ignore this parameter.
      • If you set srcType to azure, enter the name of the source container.
      • In other cases, enter the name of the source bucket.
    • srcPrefix: The source prefix. The value is of the String type. This parameter is empty by default.
      • If you set srcType to local, enter the full path that ends with a forward slash (/). If the path contains two or more directory levels, separate them with forward slashes (/). Examples: c:/example/ and /data/example/.
        Important Paths such as c:/example//, /data//example/, and /data/example// are invalid.
      • If you set srcType to oss, qiniu, bos, ks3, youpai, or s3, enter the prefix of the names of the objects that you want to migrate. The prefix does not include the bucket name. Example: data/to/oss/.
      • To migrate all objects, leave srcPrefix empty.
    • destAccessKey: The AccessKey ID used to access the destination OSS bucket. The value is of the String type. You can obtain the AccessKey ID from the OSS console.
    • destSecretKey: The AccessKey secret used to access the destination OSS bucket. The value is of the String type. You can obtain the AccessKey secret from the OSS console.
    • destDomain: The destination endpoint. The value is of the String type. The endpoint is a subdomain, which does not include the bucket name. You can obtain the endpoint from the OSS console.
    • destBucket: The name of the destination OSS bucket. The value is of the String type. The name cannot end with a forward slash (/).
    • destPrefix: The name prefix of the migrated objects in the destination OSS bucket. The value is of the String type. This parameter is empty by default. If you retain the default value, the migrated objects are stored in the root directory of the destination bucket.
      • To migrate data to a specific directory in the bucket, end the prefix with a forward slash (/). Example: data/in/oss/.
      • OSS object names cannot start with a forward slash (/). Do not set destPrefix to a value that starts with a forward slash (/).
      • A local file whose path is in the srcPrefix+relativePath format is migrated to the OSS path destDomain/destBucket/destPrefix+relativePath.
      • An object whose path is in the srcDomain/srcBucket/srcPrefix+relativePath format in the source cloud storage system is migrated to the OSS path destDomain/destBucket/destPrefix+relativePath.
    • taskObjectCountLimit: The maximum number of files in each task. The value is of the Integer type. Default value: 10000. This parameter affects the concurrency level of jobs. In most cases, set this parameter to a value calculated based on the following formula: Total number of files/Total number of workers/Number of migration threads (workerTaskThreadNum). The maximum value is 50000. If the total number of files is unknown, retain the default value.
    • taskObjectSizeLimit: The maximum data size of each task. The value is of the Integer type. Unit: bytes. Default value: 1 GB. This parameter affects the concurrency level of jobs. In most cases, set this parameter to a value calculated based on the following formula: Total data size/Total number of workers/Number of migration threads (workerTaskThreadNum). If the total data size is unknown, retain the default value.
    • isSkipExistFile: Specifies whether to skip objects that already exist during data migration. The value is of the Boolean type. If this parameter is set to true, objects are skipped based on their size and last modified time. If this parameter is set to false, objects that already exist are overwritten. Default value: false. This parameter does not take effect when jobType is set to audit.
    • scanThreadCount: The number of threads that scan files in parallel. The value is of the Integer type.
      • Default value: 1.
      • Valid values: 1 to 32.
      • This parameter affects the efficiency of file scanning. If you do not have special requirements, retain the default value.
    • maxMultiThreadScanDepth: The maximum directory depth for parallel scanning. The value is of the Integer type.
      • Default value: 1, which indicates parallel scanning of top-level directories.
      • Valid values: 1 to 16.
      • If you do not have special requirements, retain the default value. A large value may cause task failures.
    • appId: The application ID of COS. The value is of the Integer type. This parameter takes effect only when srcType is set to cos.
    • httpListFilePath: The absolute path of the HTTP URL list file. The value is of the String type.
      • This parameter takes effect only when srcType is set to http. If the source is an HTTP URL list, you must specify the absolute path of the list file. Example: c:/example/http.list.
      • Each HTTP URL in the file must be divided into two parts separated by spaces. The first part specifies the prefix, and the second part specifies the relative path of the object in OSS after the object hosted at the URL is migrated. For example, the HTTP URL list file in the c:/example/http.list path contains the following URLs:
        http://xxx.xxx.com/aa/  bb.jpg
        http://xxx.xxx.com/cc/  dd.jpg
        If you set destPrefix to ee/, the objects migrated to OSS have the following names:
        ee/bb.jpg
        ee/dd.jpg
  • workers: the list of worker IP addresses. This file is available only in distributed mode. Separate multiple IP addresses with line feeds. Example:
    192.168.1.6
    192.168.1.7
    192.168.1.8
    • In the preceding configuration, 192.168.1.6 in the first line must be the master. That is, 192.168.1.6 is the IP address of the device on which the master and tracker are started. The console also runs on this device.
    • Make sure that the same username, logon method, and working directory are used on all workers.
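To make the parameters above concrete, the following is an illustrative job.cfg (or local_job.cfg) fragment for migrating a local directory to OSS. The paths, bucket name, endpoint, and AccessKey placeholders are examples, not values from this document:

```properties
# Illustrative job.cfg fragment: migrate /data/example/ to an OSS bucket.
jobName=local-to-oss-example
jobType=import
isIncremental=false
srcType=local
srcPrefix=/data/example/
destAccessKey=<yourAccessKeyId>
destSecretKey=<yourAccessKeySecret>
destDomain=http://oss-cn-hangzhou.aliyuncs.com
destBucket=examplebucket
destPrefix=data/in/oss/
```

With this configuration, a local file /data/example/a/b.txt would be migrated to the object data/in/oss/a/b.txt in examplebucket.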

Configuration file examples

The following list describes the configuration files of data migration jobs in distributed mode. In standalone mode, the configuration file is named local_job.cfg and contains the same configuration items.

  • Migrate local data to OSS (job.cfg): srcPrefix specifies an absolute path that ends with a forward slash (/). Examples: D:/work/oss/data/ and /home/user/work/oss/data/.
  • Migrate data from KODO to OSS (job.cfg): srcPrefix and destPrefix can be left empty. Otherwise, end the prefixes with a forward slash (/). Example: destPrefix=docs/.
  • Migrate data from BOS to OSS (job.cfg): srcPrefix and destPrefix can be left empty. Otherwise, end the prefixes with a forward slash (/). Example: destPrefix=docs/.
  • Migrate data from Amazon S3 to OSS (job.cfg): For more information, see AWS service endpoints.
  • Migrate data from USS to OSS (job.cfg): Set srcAccessKey to the operator account and srcSecretKey to the corresponding password.
  • Migrate data from COS to OSS (job.cfg): Specify srcDomain based on V4. Example: srcDomain=sh. srcPrefix can be left empty. Otherwise, start and end the prefix with a forward slash (/). Example: srcPrefix=/docs/.
  • Migrate data from Azure Blob to OSS (job.cfg): Set srcAccessKey to the storage account and srcSecretKey to the access key. Set srcDomain to the endpoint suffix in the Azure Blob connection string. Example: core.chinacloudapi.cn.
  • Migrate data between OSS buckets (job.cfg): This method is suitable for data migration between buckets across regions, of different storage classes, or with different name prefixes. We recommend that you deploy ossimport on an ECS instance and use internal endpoints to minimize traffic costs.

Advanced settings

  • Throttle traffic

    In the sys.properties configuration file, workerMaxThroughput(KB/s) specifies the maximum data migration throughput of a worker. To throttle traffic, for example, to limit the load on the source or comply with network limits, set this parameter to a value less than the maximum available bandwidth of your device based on your business needs. After you modify the parameter, restart the service for the modification to take effect.

    In distributed mode, modify the sys.properties configuration file in the $OSS_IMPORT_WORK_DIR/conf directory for each worker and restart the service.

    To throttle traffic on a schedule, use crontab to modify the sys.properties configuration file at the scheduled times and restart the service for the modifications to take effect.
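    As a sketch of the scheduled throttling described above, crontab entries could swap in prepared variants of sys.properties and restart the service. The file names (sys.properties.daytime, sys.properties.night), the paths, and the console.sh restart commands are assumptions for illustration; adapt them to your deployment:

    ```
    # Illustrative crontab entries: apply a low-throughput configuration
    # during business hours (09:00) and remove the limit at night (21:00).
    0 9  * * * cp /root/ossimport/conf/sys.properties.daytime /root/ossimport/conf/sys.properties && cd /root/ossimport && bash console.sh stop && bash console.sh start
    0 21 * * * cp /root/ossimport/conf/sys.properties.night   /root/ossimport/conf/sys.properties && cd /root/ossimport && bash console.sh stop && bash console.sh start
    ```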

  • Modify the number of concurrent tasks
    • In the sys.properties configuration file, workerTaskThreadNum specifies the number of concurrent tasks that a worker runs. If the network conditions are poor and a worker processes a large number of tasks, timeout errors may occur. To resolve this issue, reduce the number of concurrent tasks and restart the service.
    • In the sys.properties configuration file, workerMaxThroughput(KB/s) specifies the maximum data migration throughput of a worker. To throttle traffic, for example, to limit the load on the source or comply with network limits, set this parameter to a value less than the maximum available bandwidth of your device based on your business needs.
    • In the job.cfg configuration file, taskObjectCountLimit specifies the maximum number of files in each task. Default value: 10000. This parameter affects the number of tasks. If the number of tasks is small, concurrency is less effective.
    • In the job.cfg configuration file, taskObjectSizeLimit specifies the maximum data size of each task. Default value: 1 GB. The value is specified in bytes. This parameter affects the number of tasks. If the number of tasks is small, concurrency is less effective.
      Important
      • Before you start your data migration, configure the parameters in the configuration files.
      • After you modify parameters in the sys.properties configuration file, restart the local server or the ECS instance on which ossimport is deployed for the modification to take effect.
      • After job.cfg is submitted, parameters in the job.cfg configuration file cannot be modified.
  • Verify data without migrating data

    To only verify data by using ossimport, set jobType to audit instead of import in the job.cfg or local_job.cfg configuration file. Configure other parameters in the same way as you configure them for data migration.

  • Specify the incremental data migration mode

    In incremental data migration mode, ossimport migrates the existing full data after the migration task is started, and then migrates incremental data at the specified interval. The incremental data migration mode is suitable for data backup and synchronization.

    The following configuration items are available for the incremental data migration mode:
    • In the job.cfg configuration file, isIncremental specifies whether to enable the incremental data migration mode. true indicates that the incremental data migration mode is enabled. false indicates that the incremental data migration mode is disabled. The default value is false.
    • In the job.cfg configuration file, incrementalModeInterval specifies the interval, in seconds, at which incremental data is migrated. This parameter takes effect only when isIncremental is set to true. The minimum value is 900. We recommend that you do not set this parameter to a value less than 3600. Otherwise, a large number of requests are wasted, which results in additional overhead.
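    For example, the following job.cfg fragment enables incremental migration at a one-hour interval, which meets the recommended minimum; the interval value is illustrative:

    ```properties
    # Scan the source for new or changed data every hour (3600 seconds).
    isIncremental=true
    incrementalModeInterval=3600
    ```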
  • Filter data to be migrated
    You can set filter conditions so that only objects that meet the conditions are migrated. ossimport allows you to filter objects by name prefix and last modified time.
    • In the job.cfg configuration file, srcPrefix specifies the prefix of source objects. This parameter is empty by default.
      • If you set srcType to local, enter the path of the local directory. Enter the full path that ends with a forward slash (/). If the path contains two or more directory levels, separate them with one forward slash (/). Examples: c:/example/ and /data/example/.
      • If you set srcType to oss, qiniu, bos, ks3, youpai, or s3, enter the name prefix of source objects without the bucket name. Example: data/to/oss/. To migrate all objects, leave srcPrefix empty.
    • In the job.cfg configuration file, importSince specifies the last modified time condition, in seconds, for source objects. The value is a UNIX timestamp: the number of seconds that have elapsed since 00:00:00 UTC on Thursday, January 1, 1970. You can run the date +%s command to query the current timestamp. The default value is 0, which indicates that all data is migrated. In incremental data migration mode, this parameter applies only to the full data migration. In other modes, this parameter applies to the entire migration job.
      • If the LastModified time of an object is earlier than the importSince value, the object is not migrated.
      • If the LastModified time of an object is later than the importSince value, the object is migrated.
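The prefix and importSince filters described above amount to a simple predicate. The following Python sketch illustrates the selection logic only; the function name and signature are illustrative and are not part of ossimport:

```python
# Illustrative sketch of the srcPrefix and importSince filters.
# An object is migrated only if its name starts with the configured
# prefix and its last modified time is later than importSince.

def should_migrate(name: str, last_modified: int,
                   src_prefix: str = "", import_since: int = 0) -> bool:
    """Return True if the object passes both filters.

    name: object name or file path relative to the source root.
    last_modified: UNIX timestamp (seconds since 1970-01-01 00:00:00 UTC).
    """
    return name.startswith(src_prefix) and last_modified > import_since

# importSince=0 (the default) admits every object under the prefix.
print(should_migrate("data/to/oss/a.txt", 1700000000,
                     src_prefix="data/to/oss/"))        # True
print(should_migrate("logs/b.txt", 1700000000,
                     src_prefix="data/to/oss/"))        # False
print(should_migrate("data/to/oss/old.txt", 100,
                     src_prefix="data/to/oss/",
                     import_since=1600000000))          # False
```

An empty srcPrefix matches every object, which is why leaving srcPrefix empty migrates all objects.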