ossimport is a tool for migrating data to Object Storage Service (OSS). You can deploy ossimport on local servers or Elastic Compute Service (ECS) instances in the cloud to migrate data from your computer or other cloud storage systems to OSS.
Features
Supports a wide range of data sources, such as on-premises file systems, Qiniu Cloud Object Storage (KODO), Baidu Object Storage (BOS), Amazon Simple Storage Service (Amazon S3), Azure Blob, UPYUN Storage Service (USS), Tencent Cloud Object Service (COS), Kingsoft Standard Storage Service (KS3), HTTP and HTTPS URL lists, and Alibaba Cloud OSS.
Supports the standalone deployment and distributed deployment modes. ossimport is easy to deploy and use in standalone mode. The distributed mode is suitable for large-scale data migration.
Note: In standalone mode, only one bucket can be migrated at a time.
Supports resumable upload.
Supports traffic throttling.
Supports migration of objects that are modified later than a specified time or objects whose names contain a specified prefix.
Supports data uploads and downloads in parallel.
Billing
ossimport is free of charge. However, if you use ossimport to migrate data from third-party data sources over the Internet, you may be charged outbound traffic fees and request fees on the data source side, and OSS-side fees such as fees for calling API operations. If you use transfer acceleration to speed up data migration between OSS buckets across regions, you are also charged transfer acceleration fees.
Usage notes
Migration speed
The migration speed of ossimport varies based on factors such as the read bandwidth of the data source, the local network bandwidth, and file sizes. Migration of files smaller than 200 KB is slow because large numbers of small files generate high IOPS.
Migration of archived files
If you want to migrate archived files, you must restore the archived files before you can migrate the files.
Data staging
When you use ossimport to migrate data, data streams are first transferred to the local memory and then uploaded to the destination.
Source data retention
During a data migration task, ossimport performs only read operations on the data in the data source. It does not perform write operations, ensuring that the original data is not modified or deleted.
Other migration tools
Data Online Migration
To migrate data from third-party data sources, we recommend that you use Data Online Migration.
ossutil
To migrate data smaller than 30 TB in size, we recommend that you use ossutil. ossutil is a lightweight, easy-to-use tool. You can use the -u, --update and --snapshot-path options to incrementally migrate files. For more information, see cp.
Runtime environment
ossimport can be deployed on Linux or Windows devices that meet the following requirements:
Windows 7 or later
Latest version of Linux
Java 7
ossimport cannot be deployed on Windows in distributed mode.
Deployment modes
ossimport supports the standalone and distributed deployment modes.
Standalone deployment is suitable for the migration of data smaller than 30 TB in size. To deploy ossimport in standalone mode, download the standalone package. You can deploy ossimport on a device that can access the data to be migrated and the OSS bucket to which you want to migrate the data.
Distributed deployment is suitable for the migration of data larger than 30 TB in size. To deploy ossimport in distributed mode, download the distributed package. You can deploy ossimport on any number of devices that can access the data that you want to migrate and the OSS bucket to which you want to migrate the data.
Note: To reduce the time required to migrate large amounts of data, you can deploy ossimport on an ECS instance that resides in the same region as your OSS bucket, and use a leased line to connect the source server to a virtual private cloud (VPC). Data migration over internal networks is faster.
You can also use ossimport to transmit data over the Internet. In this case, the transmission speed is affected by the bandwidth of your on-premises machine.
Standalone mode
In standalone mode, the master, worker, tracker, and console modules are compressed into ossimport2.jar and run on a single device. The system has only one worker.
The following content shows the file structure in standalone mode:
ossimport
├── bin
│ └── ossimport2.jar # The JAR package that contains the master, worker, tracker, and console modules.
├── conf
│ ├── local_job.cfg # The job configuration file in standalone deployment.
│ └── sys.properties # The configuration file that contains system parameters.
├── console.bat # The Windows command-line tool used to run tasks step by step.
├── console.sh # The Linux command-line tool used to run tasks step by step.
├── import.bat # The script that automatically imports files based on the conf/local_job.cfg configuration file on Windows. The configuration file contains parameters that specify data migration operations such as start, migration, verification, and retry.
├── import.sh # The script that automatically imports files based on the conf/local_job.cfg configuration file on Linux. The configuration file contains parameters that specify data migration operations such as start, migration, verification, and retry.
├── logs # The directory that contains logs.
└── README.md # The file that provides a description of ossimport. We recommend that you read this file before you use ossimport.
import.bat and import.sh are scripts used to import files. You can run these scripts after you modify the local_job.cfg configuration file.
console.bat and console.sh are scripts used to perform specific operations step by step.
Run the scripts and commands in the ossimport directory, which is the directory in which the *.bat and *.sh files reside.
Distributed mode
The ossimport architecture in distributed mode consists of a master and multiple workers:
Master --------- Job --------- Console
|
|
TaskTracker
|_____________________
|Task | Task | Task
| | |
Worker Worker Worker
| Module | Description |
| --- | --- |
| Master | Splits a job into multiple tasks by data size and number of files. The data size and number of files can be configured in the sys.properties file. |
| Worker | Runs the tasks that are assigned to it to migrate files. Each device runs one worker. |
| TaskTracker | Distributes tasks and tracks task status. TaskTracker is abbreviated as tracker. |
| Console | Interacts with users, receives command input, and displays command output. The console supports system management commands, including deploy, start, and stop, and job management commands, including submit, retry, and clean. |
| Job | A data migration job submitted by a user. One job corresponds to one job.cfg configuration file. |
| Task | Migrates a portion of the files. A job can be divided into multiple tasks by data size and number of files. The smallest unit into which a job can be divided is a single file. One file is not assigned to multiple tasks. |
In distributed deployment, you can start multiple devices and run only one worker on each device to migrate data. Tasks are evenly assigned to workers, and a worker runs multiple tasks.
The following content shows the file structure in distributed mode:
ossimport
├── bin
│ ├── console.jar # The JAR package for the console module.
│ ├── master.jar # The JAR package for the master module.
│ ├── tracker.jar # The JAR package for the tracker module.
│ └── worker.jar # The JAR package for the worker module.
├── conf
│ ├── job.cfg # The Job configuration file template.
│ ├── sys.properties # The configuration file that contains system parameters.
│ └── workers # The list of workers.
├── console.sh # The command-line tool. Only Linux is supported.
├── logs # The directory that contains logs.
└── README.md # The file that provides a description of ossimport. We recommend that you read this file before you use ossimport.
Configuration files
The sys.properties and local_job.cfg configuration files are available in standalone mode. The sys.properties, job.cfg, and workers configuration files are available in distributed mode. The local_job.cfg and job.cfg configuration files have the same parameters. The workers configuration file is exclusive to the distributed mode.
sys.properties: the system parameters.
Parameter
Description
Remarks
workingDir
The working directory.
The directory to which the package is decompressed. Do not modify this parameter in standalone mode. In distributed mode, the working directory must be the same for each device.
workerUser
The SSH username used to log on to the device on which a worker resides.
If privateKeyFile is configured, the value specified for privateKeyFile is used.
If privateKeyFile is not configured, the values specified for workerUser and workerPassword are used.
Do not modify this parameter in standalone mode.
workerPassword
The SSH password used to log on to the device on which a worker resides.
Do not modify this parameter in standalone mode.
privateKeyFile
The path of the private key file.
If you have already established an SSH connection, you can specify this parameter. Otherwise, leave this parameter empty.
If privateKeyFile is configured, the value specified for privateKeyFile is used.
If privateKeyFile is not configured, the values specified for workerUser and workerPassword are used.
Do not modify this parameter in standalone mode.
sshPort
The SSH port.
The default value is 22. In most cases, we recommend that you retain the default value. Do not modify this parameter in standalone mode.
workerTaskThreadNum
The maximum number of threads for a worker to run tasks.
This parameter is related to the device memory and network conditions. We recommend that you set this parameter to 60.
The value can be increased for physical machines. For example, you can set this parameter to 150. If the network bandwidth is already full, do not further increase the value.
If the network conditions are poor, reduce the value. For example, you can set this parameter to 30. This way, you can prevent timeout errors caused by competition for network resources.
workerMaxThroughput(KB/s)
The maximum throughput for data migration of a worker.
This parameter can be used for throttling. The default value is 0, which indicates that no throttling is imposed.
dispatcherThreadNum
The number of threads used by the tracker to distribute tasks and confirm task states.
If you do not have special requirements, retain the default value.
workerAbortWhenUncatchedException
Specifies whether to skip or terminate a task if an unknown error occurs.
By default, a task is skipped if an unknown error occurs.
workerRecordMd5
Specifies whether to use the x-oss-meta-md5 metadata item to record the MD5 hash of the migrated objects. By default, the MD5 hash is not recorded.
The value of this parameter is used to verify data integrity.
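Based on the parameters above, a minimal sys.properties for a standalone deployment might look like the following sketch. All values are illustrative assumptions, not required settings:

```properties
# Working directory: the directory to which the ossimport package was decompressed.
workingDir=/root/ossimport
# Maximum number of task threads per worker; reduce this value on poor networks.
workerTaskThreadNum=60
# Maximum worker throughput in KB/s; 0 means no throttling.
workerMaxThroughput(KB/s)=0
# Skip a task when an unknown error occurs instead of terminating the job.
workerAbortWhenUncatchedException=false
# Record the MD5 hash of migrated files in the x-oss-meta-md5 metadata item.
workerRecordMd5=false
```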
job.cfg: the configurations for data migration jobs. The local_job.cfg and job.cfg configuration files differ in name but contain the same parameters.
Parameter
Description
Remarks
jobName
The name of the job. The value is of the String type.
A job name uniquely identifies a job. The name can contain letters, digits, underscores (_), and hyphens (-), and must be 4 to 128 characters in length. You can submit multiple jobs that have different names.
If you submit a job that has the same name as an existing job, the system prompts that the job already exists. You cannot resubmit a job that has the same name until the existing job is cleaned.
jobType
The type of the job. The value is of the String type.
Valid values: import and audit. Default value: import.
import: runs the data migration job and verifies the consistency of the migrated data.
audit: only verifies data consistency. Data is not migrated.
isIncremental
Specifies whether to enable the incremental migration mode. The value is of the Boolean type.
Default value: false.
If this parameter is set to true, ossimport scans the source at the interval specified by incrementalModeInterval to detect incremental data and migrates the incremental data to OSS.
incrementalModeInterval
The migration interval in seconds in incremental migration mode. The value is of the Integer type.
This parameter is valid when isIncremental is set to true. The minimum interval is 900 seconds. We recommend that you set the parameter to a value not less than 3,600 seconds to prevent request surges and additional overhead.
importSince
The time condition for the data migration job. Data whose last modified time is later than the value of this parameter is migrated. The value is of the Integer type. Unit: seconds.
The timestamp must be in the UNIX format. It is the number of seconds that have elapsed since 00:00:00 Thursday, January 1, 1970. You can run the date +%s command to query the UNIX timestamp.
The default value is 0, which indicates that all data is migrated.
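For example, you can use the date command to obtain a UNIX timestamp for importSince. The cutoff date below is an arbitrary example, and the -d option assumes GNU date on Linux:

```shell
# Current UNIX timestamp (seconds since 00:00:00 UTC, January 1, 1970):
date +%s
# Timestamp of a specific cutoff time, for use as importSince:
date -d "2024-01-01 00:00:00 UTC" +%s   # → 1704067200
```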
srcType
The source of the data migration. The value is of the String type and is case-sensitive.
Valid values:
local: migrates data from local files to OSS. If this value is specified, specify srcPrefix and leave srcAccessKey, srcSecretKey, srcDomain, and srcBucket unspecified.
oss: migrates data from one OSS bucket to another OSS bucket.
qiniu: migrates data from KODO to OSS.
bos: migrates data from BOS to OSS.
ks3: migrates data from KS3 to OSS.
s3: migrates data from Amazon S3 to OSS.
youpai: migrates data from USS to OSS.
http: migrates data from HTTP or HTTPS URL lists to OSS.
cos: migrates data from COS to OSS.
azure: migrates data from Azure Blob to OSS.
srcAccessKey
The AccessKey ID used to access the source. The value is of the String type.
If srcType is set to oss, qiniu, bos, ks3, or s3, specify the AccessKey ID used to access the source.
If srcType is set to local or http, ignore this parameter.
If srcType is set to youpai or azure, specify the username used to access the source.
srcSecretKey
The AccessKey secret used to access the source. The value is of the String type.
If srcType is set to oss, qiniu, bos, ks3, or s3, specify the AccessKey secret used to access the source.
If srcType is set to local or http, ignore this parameter.
If srcType is set to youpai, specify the operator password used to access the source.
If srcType is set to azure, specify the account key used to access the source.
srcDomain
The source endpoint.
If srcType is set to local or http, ignore this parameter.
If srcType is set to oss, enter the endpoint obtained from the OSS console. The endpoint is a root domain, which does not include the bucket name.
If srcType is set to qiniu, enter the domain name that corresponds to the bucket, which you can obtain from the KODO console.
If srcType is set to bos, enter the BOS domain name. Example: http://bj.bcebos.com or http://gz.bcebos.com.
If srcType is set to ks3, enter the KS3 domain name. Example: http://kss.ksyun.com, http://ks3-cn-beijing.ksyun.com, or http://ks3-us-west-1.ksyun.com.
If srcType is set to s3, enter the endpoint of the corresponding Amazon S3 region.
If srcType is set to youpai, enter a USS domain name, such as http://v0.api.upyun.com (automatically identified optimal line), http://v1.api.upyun.com (telecom line), http://v2.api.upyun.com (China Unicom or China Netcom line), or http://v3.api.upyun.com (China Mobile or China Railcom line).
If srcType is set to cos, enter the region in which the COS bucket resides. Example: ap-guangzhou.
If srcType is set to azure, enter the endpoint suffix in the Azure Blob connection string. Example: core.chinacloudapi.cn.
srcBucket
The name of the source bucket or container.
If srcType is set to local or http, ignore this parameter.
If srcType is set to azure, enter the name of the source container.
In other cases, enter the name of the source bucket.
srcPrefix
The source prefix. The value is of the String type. This parameter is empty by default.
If srcType is set to local, enter the full path that ends with a forward slash (/). If the path contains two or more directory levels, separate them with forward slashes (/). Examples: c:/example/ and /data/example/.
Important: Paths such as c:/example//, /data//example/, and /data/example// are invalid.
If srcType is set to oss, qiniu, bos, ks3, youpai, or s3, enter the prefix of the objects to be synchronized, excluding the bucket name. Example: data/to/oss/.
To migrate all objects, leave srcPrefix empty.
destAccessKey
The AccessKey ID used to access the destination OSS bucket. The value is of the String type.
You can obtain the AccessKey ID used to access the destination OSS bucket from the OSS console.
destSecretKey
The AccessKey secret used to access the destination OSS bucket. The value is of the String type.
You can obtain the AccessKey secret used to access the destination OSS bucket from the OSS console.
destDomain
The destination endpoint. The value is of the String type.
The endpoint is a root domain, which does not include the bucket name. You can obtain the endpoint from the OSS console.
destBucket
The destination bucket. The value is of the String type.
The name of the destination OSS bucket. The name cannot end with a forward slash (/).
destPrefix
The prefix.
The name prefix of the migrated object in the destination OSS bucket. This parameter is empty by default. If you retain the default value, the migrated objects are stored in the root directory in the destination bucket.
To synchronize data to a specified directory in OSS, end the prefix with a forward slash (/). Example: data/in/oss/.
OSS object names cannot start with a forward slash (/). Do not set destPrefix to a value that starts with a forward slash (/).
A local file whose path is in the srcPrefix+relativePath format is migrated to an OSS path in the destDomain/destBucket/destPrefix+relativePath format.
An object whose path is in the srcDomain/srcBucket/srcPrefix+relativePath format in the cloud is migrated to an OSS path in the destDomain/destBucket/destPrefix+relativePath format.
taskObjectCountLimit
The maximum number of files in each task. The value is of the Integer type. The default value is 10000.
This parameter affects the concurrency level of jobs that you want to run. In most cases, this parameter is set to a value calculated based on the following formula: Total number of files/Total number of workers/Number of migration threads (workerTaskThreadNum). The maximum value is 50000. If the total number of files is unknown, retain the default value.
taskObjectSizeLimit
The maximum data size in bytes for each task. The value is of the Integer type. Default value: 1 GB.
This parameter affects the concurrency level of jobs that you want to run. In most cases, this parameter is set to a value calculated based on the following formula: Total data size/Total number of workers/Number of migration threads (workerTaskThreadNum). If the total data size is unknown, retain the default value.
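As a sketch of the sizing formulas above, suppose a hypothetical migration of 10,000,000 files and 500 GB of data across 10 workers with workerTaskThreadNum set to 60:

```shell
# taskObjectCountLimit ≈ total number of files / number of workers / threads per worker
echo $((10000000 / 10 / 60))                    # → 16666
# taskObjectSizeLimit ≈ total bytes / number of workers / threads per worker
echo $((500 * 1024 * 1024 * 1024 / 10 / 60))    # → 894784853
```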
isSkipExistFile
Specifies whether to skip objects that already exist during data migration. The value is of the Boolean type.
If this parameter is set to true, the objects are skipped based on their size and last modified time. If this parameter is set to false, objects that already exist are overwritten. Default value: false. This parameter is invalid when jobType is set to audit.
scanThreadCount
The number of threads that scan files in parallel. The value is of the Integer type.
Default value: 1.
Valid values: 1 to 32.
This parameter affects the efficiency of file scanning. If you do not have special requirements, retain the default value.
maxMultiThreadScanDepth
The maximum depth of directories for parallel scanning. The value is of the Integer type.
Default value: 1.
Valid values: 1 to 16.
The default value indicates parallel scanning in top-level directories.
If you do not have special requirements, retain the default value. A large value may cause task failures.
appId
The application ID of COS. The value is of the Integer type.
This parameter is valid when srcType is set to cos.
httpListFilePath
The absolute path of the HTTP URL list file. The value is of the String type.
This parameter is valid when srcType is set to http. If the source is an HTTP URL list, you must specify the absolute path of the list file. Example: c:/example/http.list.
Each HTTP URL in the file must be divided into two parts separated by a space. The first part specifies the prefix, and the second part specifies the relative path of the object in OSS after the object hosted on the URL is migrated. For example, the HTTP URL list file in the c:/example/http.list path contains the following lines:
http://xxx.xxx.com/aa/ bb.jpg
http://xxx.xxx.com/cc/ dd.jpg
If you set destPrefix to ee/, the objects migrated to OSS have the following names:
ee/bb.jpg
ee/dd.jpg
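Putting the job parameters above together, a minimal local_job.cfg for migrating a local directory to OSS might look like the following sketch. The bucket name, endpoint, and AccessKey pair are placeholder assumptions:

```properties
jobName=local-to-oss-example
jobType=import
isIncremental=false
# Source: a local directory. The path must end with a forward slash (/).
srcType=local
srcPrefix=/data/example/
# Destination: an OSS bucket. Replace the placeholders with your own values.
destAccessKey=yourAccessKeyId
destSecretKey=yourAccessKeySecret
destDomain=http://oss-cn-hangzhou.aliyuncs.com
destBucket=examplebucket
destPrefix=data/in/oss/
```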
workers: available only in distributed mode. Specify one IP address per line. Example:
192.168.1.6
192.168.1.7
192.168.1.8
In the preceding configuration, 192.168.1.6 in the first line must be the master. In other words, 192.168.1.6 is the IP address of the device on which the master, workers, and tracker are started. The console also runs on this device.
Make sure that the same username, logon method, and working directory are used for all workers.
Configuration file examples
The following table describes the configuration files for data migration jobs in distributed mode. The configuration file in standalone mode is named local_job.cfg and contains the same configuration items as the configuration file in distributed mode.

| Migration scenario | Description |
| --- | --- |
| Migrate local data to OSS | srcPrefix specifies an absolute path that ends with a forward slash (/). |
| Migrate data from KODO to OSS | srcPrefix and destPrefix can be left empty. If they are not left empty, end the prefixes with a forward slash (/). |
| Migrate data from BOS to OSS | srcPrefix and destPrefix can be left empty. If they are not left empty, end the prefixes with a forward slash (/). |
| Migrate data from Amazon S3 to OSS | For more information, see AWS service endpoints. |
| Migrate data from USS to OSS | Set srcAccessKey to the operator account and srcSecretKey to the corresponding password. |
| Migrate data from COS to OSS | Specify srcDomain based on V4. |
| Migrate data from Azure Blob to OSS | Set srcAccessKey to the storage account and srcSecretKey to the access key. Set srcDomain to the endpoint suffix in the Azure Blob connection string. |
| Migrate data between buckets in OSS | This method is suitable for data migration between buckets across different regions, of different storage classes, or with different name prefixes. We recommend that you deploy ossimport on ECS and use internal endpoints to minimize traffic costs. |
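For the bucket-to-bucket scenario in the preceding table, a job.cfg might look like the following sketch. The bucket names, internal endpoints, and AccessKey pairs are placeholder assumptions:

```properties
jobName=oss-to-oss-example
jobType=import
srcType=oss
# Endpoints are root domains without the bucket name. Use internal endpoints
# when ossimport runs on an ECS instance in the same region as the buckets.
srcDomain=http://oss-cn-hangzhou-internal.aliyuncs.com
srcAccessKey=yourAccessKeyId
srcSecretKey=yourAccessKeySecret
srcBucket=source-bucket
# Leave the prefixes empty to migrate all objects.
srcPrefix=
destDomain=http://oss-cn-hangzhou-internal.aliyuncs.com
destAccessKey=yourAccessKeyId
destSecretKey=yourAccessKeySecret
destBucket=destination-bucket
destPrefix=
```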
Advanced settings
Throttle traffic
In the sys.properties configuration file, workerMaxThroughput(KB/s) specifies the maximum throughput for data migration of a worker. To configure throttling for scenarios such as source-side throttling and network throttling, set this parameter to a value less than the maximum available bandwidth of your device based on your business needs. After the modification is complete, restart the service for the modification to take effect.
In distributed mode, modify the sys.properties configuration file in the $OSS_IMPORT_WORK_DIR/conf directory for each worker and restart the service.
To throttle traffic on a schedule, use crontab to modify the sys.properties configuration file at specified times and restart the service for each modification to take effect.
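For example, the following crontab entries (a sketch; the installation path and throughput values are assumptions) lower the throughput cap during business hours, lift it at night, and restart the service after each change by using the console's stop and start commands:

```
# Throttle workers to 10 MB/s at 08:00, then restart ossimport.
0 8 * * * sed -i 's|^workerMaxThroughput(KB/s)=.*|workerMaxThroughput(KB/s)=10240|' /root/ossimport/conf/sys.properties && cd /root/ossimport && bash console.sh stop && bash console.sh start
# Remove the cap (0 = no throttling) at 20:00, then restart ossimport.
0 20 * * * sed -i 's|^workerMaxThroughput(KB/s)=.*|workerMaxThroughput(KB/s)=0|' /root/ossimport/conf/sys.properties && cd /root/ossimport && bash console.sh stop && bash console.sh start
```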
Modify the number of concurrent tasks
In the sys.properties configuration file, workerTaskThreadNum specifies the number of concurrent tasks run by a worker. If the network conditions are poor and a worker has to process a large number of tasks, timeout errors occur. To resolve this issue, modify the configuration by reducing the number of concurrent tasks and restart the service.
In the sys.properties configuration file, workerMaxThroughput(KB/s) specifies the maximum throughput for data migration of a worker. To configure throttling for scenarios such as source-side throttling and network throttling, set this parameter to a value less than the maximum available bandwidth of your device based on your business needs.
In the job.cfg configuration file, taskObjectCountLimit specifies the maximum number of files in each task. The default value is 10000. This parameter affects the number of tasks. If the number of tasks is too small, tasks cannot be run concurrently in an efficient manner.
In the job.cfg configuration file, taskObjectSizeLimit specifies the maximum data size for each task. The default value is 1 GB. This parameter affects the number of tasks. If the number of tasks is too small, tasks cannot be run concurrently in an efficient manner.
Important: Before you start data migration, configure the parameters in the configuration files.
After you modify parameters in the sys.properties configuration file, restart the local server or the ECS instance on which ossimport is deployed for the modification to take effect.
After job.cfg is submitted, parameters in the job.cfg configuration file cannot be modified.
Verify data without migrating data
To only verify data by using ossimport, set jobType to audit instead of import in the job.cfg or local_job.cfg configuration file. Configure other parameters in the same way as you configure them for data migration.
Specify the incremental data migration mode
In incremental data migration mode, ossimport migrates the existing full data after the migration task is started, and then migrates incremental data at the specified interval. The incremental data migration mode is suitable for data backup and synchronization.
The following configuration items are related to incremental data migration:
In the job.cfg configuration file, isIncremental specifies whether to enable the incremental data migration mode. true indicates that the incremental data migration mode is enabled. false indicates that the incremental data migration mode is disabled. The default value is false.
In the job.cfg configuration file, incrementalModeInterval specifies the interval in seconds at which incremental data migration is performed. This configuration item takes effect only when isIncremental is set to true. The minimum value is 900. We recommend that you do not set this parameter to a value less than 3600. Otherwise, a large number of requests are wasted, which results in additional overhead.
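For example, to migrate the full data and then rescan the source for incremental data every hour, you could set the following in job.cfg (all other required parameters are omitted from this sketch):

```properties
isIncremental=true
# Rescan interval in seconds. Minimum: 900. Recommended: at least 3600.
incrementalModeInterval=3600
```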
Filter data to be migrated
You can set filtering conditions to migrate objects that meet specific conditions. ossimport allows you to use the prefix and last modified time to specify objects to migrate.
In the job.cfg configuration file, srcPrefix specifies the prefix of source objects. This parameter is empty by default.
If you set srcType to local, enter the full path of the local directory. The path must end with a forward slash (/). If the path contains two or more directory levels, separate them with forward slashes (/). Examples: c:/example/ and /data/example/.
If you set srcType to oss, qiniu, bos, ks3, youpai, or s3, enter the name prefix of the source objects, excluding the bucket name. Example: data/to/oss/. To migrate all objects, leave srcPrefix empty.
In the job.cfg configuration file, importSince specifies the last modified time in seconds for source objects. The value of importSince is a UNIX timestamp, which is the number of seconds that have elapsed since 00:00:00 (UTC) on January 1, 1970. You can run the date +%s command to query the UNIX timestamp. The default value is 0, which indicates that all data is migrated. In incremental data migration mode, this parameter is valid only for full data migration. In other modes, this parameter is valid for the entire migration job.
If the LastModified Time value of an object is earlier than the value of importSince, the object is not migrated.
If the LastModified Time value of an object is later than the value of importSince, the object is migrated.