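Tunnel Command Operations - MaxCompute

Overview

You can use the Tunnel commands provided by the client to perform the functions of the original Dship tool.

Tunnel commands are mainly used to upload and download data.

Upload: Uploads a file or a directory (first-level directory only). Each upload can write to only one table or one partition of a table. For a partitioned table you must specify the target partition, and for a multi-level partitioned table you must specify the partition down to the last level.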

tunnel upload log.txt test_project.test_table/p1="b1",p2="b2";
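-- Upload the data in log.txt to partition p1="b1",p2="b2" of table test_table (a two-level partitioned table) in project test_project.
tunnel upload log.txt test_table --scan=only;
-- Scan log.txt against the definition of test_table. With --scan=only the data is only checked and is not uploaded; if the data does not match the table definition, an error is reported.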
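
The partition rule above can be sketched in Python as a small helper that assembles the `<[project.]table[/partition]>` argument and refuses to proceed when a partition level is missing. The function name `build_upload_target` and its parameters are hypothetical and are not part of the MaxCompute client; the helper only builds a string.

```python
def build_upload_target(table, partition_columns, partition_values, project=None):
    """Return the '<[project.]table[/partition]>' argument for tunnel upload.

    partition_columns: partition column names in table order
                       (empty list for a non-partitioned table).
    partition_values:  dict mapping partition column -> value.
    """
    missing = [c for c in partition_columns if c not in partition_values]
    if missing:
        # A partitioned table needs every level, down to the last one.
        raise ValueError(f"missing partition levels: {missing}")
    target = f"{project}.{table}" if project else table
    if partition_columns:
        spec = ",".join(f'{c}="{partition_values[c]}"' for c in partition_columns)
        target = f"{target}/{spec}"
    return target


print(build_upload_target("test_table", ["p1", "p2"],
                          {"p1": "b1", "p2": "b2"}, project="test_project"))
# prints: test_project.test_table/p1="b1",p2="b2"
```

Calling the helper with only `p1` for a two-level table raises `ValueError`, which mirrors the requirement to specify the last-level partition.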

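Download: Downloads to a single file only. Each download can export only one table or one partition to one file. For a partitioned table you must specify the partition to download, and for a multi-level partitioned table you must specify the partition down to the last level.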
```
tunnel download  test_project.test_table/p1="b1",p2="b2"  test_table.txt;
-- Download the data in partition p1="b1",p2="b2" of table test_project.test_table (a two-level partitioned table) to the file test_table.txt
```
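Resume: If a transfer fails because of a network or Tunnel service error, you can resume the transfer of the file or directory and continue the previous upload. The Resume command does not currently support downloads.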
```
tunnel resume;
```

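Show: Displays information about historical tasks.

tunnel show history -n 5;
-- Show the detailed commands of the last 5 uploads/downloads
tunnel show log;
-- Show the log of the most recent upload/download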

Purge: Cleans up the session directory. The default period is 3 days.
```
tunnel purge 5;
-- Purge the logs from the previous 5 days
```

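Tunnel upload and download limits

Tunnel commands do not support uploading or downloading data of the Array, Map, or Struct types.

Each Tunnel session has a lifecycle of 24 hours on the server. It can be used at any time within 24 hours of its creation and can be shared across processes or threads, provided that the same BlockId is never reused.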
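
As a rough illustration of the BlockId constraint, the sketch below shows one way several threads could draw block IDs from a shared counter so that no ID is handed out twice. `BlockIdAllocator` is a hypothetical helper written for this example; it does not call any Tunnel API and says nothing about how the service itself assigns IDs.

```python
import itertools
import threading


class BlockIdAllocator:
    """Hands out block IDs that are never repeated within one session."""

    def __init__(self):
        self._counter = itertools.count()
        self._lock = threading.Lock()

    def next_id(self):
        # The lock makes sure two threads never receive the same ID.
        with self._lock:
            return next(self._counter)


def collect_ids(allocator, n, out):
    for _ in range(n):
        out.append(allocator.next_id())


allocator = BlockIdAllocator()
ids = []
threads = [threading.Thread(target=collect_ids, args=(allocator, 100, ids))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(ids), len(set(ids)))  # prints: 400 400
```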

Using Tunnel commands

On the client, run the help subcommand to get help information for Tunnel commands. Each command and option also has a short form.

odps@ project_name>tunnel help;
    Usage: tunnel <subcommand> [options] [args]
    Type 'tunnel help <subcommand>' for help on a specific subcommand.
Available subcommands:
    upload (u)
    download (d)
    resume (r)
    show (s)
    purge (p)
    help (h)
tunnel is a command for uploading data to / downloading data from ODPS.

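Subcommand description:

upload: Uploads data to a MaxCompute table.
download: Downloads data from a MaxCompute table.
resume: If an upload fails, use the resume command to continue from the breakpoint. Currently only uploads can be resumed. Each upload or download is called a session; specify the session ID after the resume command to resume it.
show: Shows the history of past runs.
purge: Cleans up the session directory.
help: Prints help information for tunnel.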

Upload

Imports data from a local file into a MaxCompute table in append mode. Subcommand usage:

odps@ project_name>tunnel help upload;
usage: tunnel upload [options] <path> <[project.]table[/partition]>
              upload data from local file
 -acp,-auto-create-partition <ARG>   auto create target partition if not
                                     exists, default false
 -bs,-block-size <ARG>               block size in MiB, default 100
 -c,-charset <ARG>                   specify file charset, default ignore.
                                     set ignore to download raw data
 -cp,-compress <ARG>                 compress, default true
 -dbr,-discard-bad-records <ARG>     specify discard bad records
                                     action(true|false), default false
 -dfp,-date-format-pattern <ARG>     specify date format pattern, default
                                     yyyy-MM-dd HH:mm:ss; 
 -fd,-field-delimiter <ARG>          specify field delimiter, support
                                     unicode, eg \u0001. default ","
 -h,-header <ARG>                    if local file should have table
                                     header, default false
 -mbr,-max-bad-records <ARG>         max bad records, default 1000
 -ni,-null-indicator <ARG>           specify null indicator string,
                                     default ""(empty string)
 -rd,-record-delimiter <ARG>         specify record delimiter, support
                                     unicode, eg \u0001. default "\r\n"
 -s,-scan <ARG>                      specify scan file
                                     action(true|false|only), default true
 -sd,-session-dir <ARG>              set session dir, default
                                     D:\software\odpscmd_public\plugins\ds
                                     hip
 -ss,-strict-schema <ARG>            specify strict schema mode. If false,
                                     extra data will be abandoned and
                                     insufficient field will be filled
                                     with null. Default true
 -te,-tunnel_endpoint <ARG>          tunnel endpoint
    -threads <ARG>                   number of threads, default 1
 -tz,-time-zone <ARG>                time zone, default local timezone:
                                     Asia/Shanghai
Example:
    tunnel upload log.txt test_project.test_table/p1="b1",p2="b2"

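Parameters:

-acp: Automatically creates the target partition if it does not exist. Disabled by default.
-bs: Size of each data block uploaded to Tunnel. Default: 100 MiB (1 MiB = 1024*1024 B).
-c: Encoding of the local data file, UTF-8 by default. If the option is not set, the raw source data is transferred as-is (the ignore setting shown in the help output).
-cp: Whether to compress data locally before uploading to reduce network traffic. Enabled by default.
-dbr: Whether to ignore dirty data (extra columns, missing columns, mismatched column data types, and so on).
- true: all data that does not match the table definition is ignored.
- false: an error is reported when dirty data is encountered, and the existing data in the target table is not polluted.
-dfp: Format of DateTime data. Default: yyyy-MM-dd HH:mm:ss. To specify the time format down to milliseconds, use tunnel upload -dfp 'yyyy-MM-dd HH:mm:ss.SSS'. For details about the DateTime data type, see Data types.
-fd: Column delimiter of the local data file. Default: comma.
-h: Whether the data file contains a table header. If true, the header is skipped and the upload starts from the second line.
-mbr: By default, the upload is terminated when more than 1000 dirty records are found. Use this option to adjust the number of dirty records tolerated.
-ni: NULL indicator. Default: "" (empty string).
-rd: Row delimiter of the local data file. Default: \r\n.
-s: Whether to scan the local data file. Default: true, as listed in the help output above.
- true: the data is scanned first and is imported only if its format is correct.
- false: the data is imported directly without scanning.
- only: the local data is only scanned and is not imported after the scan.
-sd: Sets the session directory.
-te: Specifies the Tunnel endpoint.
-threads: Number of threads. Default: 1.
-tz: Time zone. Defaults to the local time zone, for example Asia/Shanghai.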
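
To make the `-bs` arithmetic concrete, here is a minimal Python sketch that estimates how many blocks a file would be split into. It assumes fixed-size blocks and at least one block per file; the actual splitting logic of the client is not documented here, and `block_count` is a hypothetical name.

```python
import math

MIB = 1024 * 1024  # 1 MiB = 1024*1024 B, as stated for -bs


def block_count(total_bytes, block_size_mib=100):
    """Estimate the number of blocks, assuming fixed-size blocks."""
    return max(1, math.ceil(total_bytes / (block_size_mib * MIB)))


print(block_count(41))          # prints: 1
print(block_count(250 * MIB))   # prints: 3
```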

Example:

Create the target table:

CREATE TABLE IF NOT EXISTS sale_detail(
      shop_name     STRING,
      customer_id   STRING,
      total_price   DOUBLE)
PARTITIONED BY (sale_date STRING,region STRING);

Add a partition:

alter table sale_detail add partition (sale_date='201312', region='hangzhou');

Prepare the data file data.txt with the following content:
```
shopx,x_id,100
shopy,y_id,200
shopz,z_id
```
The third row of this file does not match the table definition of sale_detail: the table defines three columns, but the row has only two.
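
The column-count check and the `-dbr` / `-mbr` behaviour described in the parameter list can be imitated in a few lines of Python. This is a simplified sketch, not the client's implementation: `scan_rows` is a hypothetical function, it only compares column counts (type mismatches are not modelled), and the error messages are made up for the example.

```python
def scan_rows(lines, expected_columns, field_delimiter=",",
              discard_bad_records=False, max_bad_records=1000):
    """Return the rows that would be kept under the given -dbr / -mbr settings."""
    kept, bad = [], 0
    for line_no, line in enumerate(lines, start=1):
        fields = line.split(field_delimiter)
        if len(fields) == expected_columns:
            kept.append(fields)
            continue
        if not discard_bad_records:
            # -dbr false: stop at the first dirty record.
            raise ValueError(
                f"line {line_no}: expected {expected_columns} columns, "
                f"{len(fields)} columns found")
        bad += 1
        if bad > max_bad_records:
            # -mbr: too many dirty records, give up.
            raise ValueError(f"more than {max_bad_records} bad records")
    return kept


data_txt = ["shopx,x_id,100", "shopy,y_id,200", "shopz,z_id"]
print(scan_rows(data_txt, 3, discard_bad_records=True))
# prints: [['shopx', 'x_id', '100'], ['shopy', 'y_id', '200']]
```

With the default `discard_bad_records=False`, the same call raises `ValueError` on line 3.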

Import the data:

odps@ project_name>tunnel u d:\data.txt sale_detail/sale_date=201312,region=hangzhou -s false
Upload session: 20150610xxxxxxxxxxx70a002ec60c
Start upload:d:\data.txt
Total bytes:41   Split input to 1 blocks
2015-06-10 16:39:22     upload block: '1'
ERROR: column mismatch -,expected 3 columns, 2 columns found, please check data or delimiter

Because data.txt contains dirty data, the import fails, and the session ID and an error message are returned.

Verify the data:

odps@ odpstest_ay52c_ay52> select * from sale_detail where sale_date='201312';
ID = 20150610xxxxxxxxxxxvc61z5
+-----------+-------------+-------------+-----------+--------+
| shop_name | customer_id | total_price | sale_date | region |
+-----------+-------------+-------------+-----------+--------+
+-----------+-------------+-------------+-----------+--------+

Because of the dirty data, the import failed and the table contains no data.

Show

Displays history records. Subcommand usage:

odps@ project_name>tunnel help show;
usage: tunnel show history [options]
              show session information
 -n,-number <ARG>   lines
Example:
       tunnel show history -n 5
       tunnel show log

Parameters:

-n: Number of lines to display.

Example:

odps@ project_name>tunnel show  history;
20150610xxxxxxxxxxx70a002ec60c  failed  'u --config-file /D:/console/conf/odps_config.ini --project odpstest_ay52c_ay52 --endpoint http://service.odps.aliyun.com/api --id UlxxxxxxxxxxxrI1 --key 2m4r3WvTxxxxxxxxxx0InVke7UkvR d:\data.txt sale_detail/sale_date=201312,region=hangzhou -s false'

Note: 20150610xxxxxxxxxxx70a002ec60c is the session ID of the failed data import in the previous section.

Resume

Resumes a session from the history. This applies to uploads only. Subcommand usage:

odps@  project_name>tunnel help resume;
usage: tunnel resume [session_id] [-force]
              resume an upload session
 -f,-force   force resume
Example:
       tunnel resume

Example:

Modify data.txt so that it contains the following:

shopx,x_id,100
shopy,y_id,200

Resume the upload:

odps@ project_name>tunnel resume 20150610xxxxxxxxxxx70a002ec60c --force;
start resume
20150610xxxxxxxxxxx70a002ec60c
Upload session: 20150610xxxxxxxxxxx70a002ec60c
Start upload:d:\data.txt
Resume 1 blocks 
2015-06-10 16:46:42     upload block: '1'
2015-06-10 16:46:42     upload block complete, blockid=1
upload complete, average speed is 0 KB/s
OK

Note: 20150610xxxxxxxxxxx70a002ec60c is the session ID of the failed upload.

Verify the data:

odps@ project_name>select * from sale_detail where sale_date='201312';
 ID = 20150610xxxxxxxxxxxa741z5
 +-----------+-------------+-------------+-----------+--------+
 | shop_name | customer_id | total_price | sale_date | region |
 +-----------+-------------+-------------+-----------+--------+
 | shopx     | x_id        | 100.0       | 201312    | hangzhou|
 | shopy     | y_id        | 200.0       | 201312    | hangzhou|
 +-----------+-------------+-------------+-----------+--------+

Download

Subcommand usage:

odps@ project_name>tunnel help download;
usage:tunnel download [options] <[project.]table[/partition]> <path>
              download data to local file
 -c,-charset <ARG>                 specify file charset, default ignore.
                                   set ignore to download raw data
 -ci,-columns-index <ARG>          specify the columns index(starts from
                                   0) to download, use comma to split each
                                   index
 -cn,-columns-name <ARG>           specify the columns name to download,
                                   use comma to split each name
 -cp,-compress <ARG>               compress, default true
 -dfp,-date-format-pattern <ARG>   specify date format pattern, default
                                   yyyy-MM-dd HH:mm:ss
 -e,-exponential <ARG>             When download double values, use
                                   exponential express if necessary.
                                   Otherwise at most 20 digits will be
                                   reserved. Default false
 -fd,-field-delimiter <ARG>        specify field delimiter, support
                                   unicode, eg \u0001. default ","
 -h,-header <ARG>                  if local file should have table header,
                                   default false
    -limit <ARG>                   specify the number of records to
                                   download
 -ni,-null-indicator <ARG>         specify null indicator string, default
                                   ""(empty string)
 -rd,-record-delimiter <ARG>       specify record delimiter, support
                                   unicode, eg \u0001. default "\r\n"
 -sd,-session-dir <ARG>            set session dir, default
                                   D:\software\odpscmd_public\plugins\dshi
                                   p
 -te,-tunnel_endpoint <ARG>        tunnel endpoint
    -threads <ARG>                 number of threads, default 1
 -tz,-time-zone <ARG>              time zone, default local timezone:
                                   Asia/Shanghai
usage: tunnel download [options] instance://<[project/]instance_id> <path>
              download instance result to local file
 -c,-charset <ARG>                 specify file charset, default ignore.
                                   set ignore to download raw data
 -ci,-columns-index <ARG>          specify the columns index(starts from
                                   0) to download, use comma to split each
                                   index
 -cn,-columns-name <ARG>           specify the columns name to download,
                                   use comma to split each name
 -cp,-compress <ARG>               compress, default true
 -dfp,-date-format-pattern <ARG>   specify date format pattern, default
                                   yyyy-MM-dd HH:mm:ss
 -e,-exponential <ARG>             When download double values, use
                                   exponential express if necessary.
                                   Otherwise at most 20 digits will be
                                   reserved. Default false
 -fd,-field-delimiter <ARG>        specify field delimiter, support
                                   unicode, eg \u0001. default ","
 -h,-header <ARG>                  if local file should have table header,
                                   default false
    -limit <ARG>                   specify the number of records to
                                   download
 -ni,-null-indicator <ARG>         specify null indicator string, default
                                   ""(empty string)
 -rd,-record-delimiter <ARG>       specify record delimiter, support
                                   unicode, eg \u0001. default "\r\n"
 -sd,-session-dir <ARG>            set session dir, default
                                   D:\software\odpscmd_public\plugins\dshi
                                   p
 -te,-tunnel_endpoint <ARG>        tunnel endpoint
    -threads <ARG>                 number of threads, default 1
 -tz,-time-zone <ARG>              time zone, default local timezone:
                                   Asia/Shanghai
Example:
       tunnel download test_project.test_table/p1="b1",p2="b2" log.txt
       tunnel download instance://test_project/test_instance log.txt

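Parameters:

-c: Encoding of the local data file. Default: UTF-8.
-ci: Indexes of the columns to download (starting from 0), separated by commas.
-cn: Names of the columns to download, separated by commas.
-cp, -compress: Whether to compress data before downloading to reduce network traffic. Enabled by default.
-dfp: Format of DateTime data. Default: yyyy-MM-dd HH:mm:ss.
-e: When downloading DOUBLE values, uses exponential notation if necessary; otherwise at most 20 digits are retained.
-fd: Column delimiter of the local data file. Default: comma.
-h: Whether the downloaded data file includes a table header. Default: false.

Note: -h=true cannot be used together with threads>1 (multithreading).
-limit: Number of records to download.
-ni: NULL indicator. Default: "" (empty string).
-rd: Row delimiter of the local data file. Default: \r\n.
-sd: Sets the session directory.
-te: Specifies the Tunnel endpoint.
-threads: Number of threads. Default: 1.
-tz: Time zone. Defaults to the local time zone, for example Asia/Shanghai.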
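
The effect of `-ci` and `-cn` can be pictured as a projection over each row. The Python sketch below is only an illustration under that assumption; `select_by_index` and `select_by_name` are hypothetical helpers, and the schema list is typed in by hand rather than read from MaxCompute.

```python
def select_by_index(rows, indexes):
    """Keep only the columns at the given 0-based indexes (like -ci)."""
    return [[row[i] for i in indexes] for row in rows]


def select_by_name(rows, schema, names):
    """Keep only the named columns (like -cn), using the table's column order."""
    return select_by_index(rows, [schema.index(name) for name in names])


schema = ["shop_name", "customer_id", "total_price"]
rows = [["shopx", "x_id", "100.0"], ["shopy", "y_id", "200.0"]]

print(select_by_index(rows, [0, 2]))
# prints: [['shopx', '100.0'], ['shopy', '200.0']]
print(select_by_name(rows, schema, ["customer_id"]))
# prints: [['x_id'], ['y_id']]
```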

Example:

Download the data to the file result.txt:

$ ./tunnel download sale_detail/sale_date=201312,region=hangzhou result.txt;
    Download session: 20150610xxxxxxxxxxx70a002ed0b9
    Total records: 2
    2015-06-10 16:58:24     download records: 2
    2015-06-10 16:58:24     file size: 30 bytes
    OK

Verify the content of result.txt:

shopx,x_id,100.0
shopy,y_id,200.0

Purge

Clears the session directory. The default period is 3 days. Subcommand usage:

odps@ project_name>tunnel help purge;
usage: tunnel purge [n]
              force session history to be purged.([n] days before, default
              3 days)
Example:
       tunnel purge 5

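Data types:

Type	Description
STRING	String type. The length cannot exceed 8 MB.
BOOLEAN	Uploaded values can only be true, false, 0, or 1. Downloaded values are true/false and are case-insensitive.
BIGINT	Value range: [-9223372036854775807, 9223372036854775807].
DOUBLE	16 significant digits. Uploads support scientific notation; downloads use plain digits only. Maximum: 1.7976931348623157E308. Minimum: 4.9E-324. Positive infinity: Infinity. Negative infinity: -Infinity.
DATETIME	By default, Datetime data is uploaded in the GMT+8 time zone. You can specify the date format pattern of your data on the command line.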
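
As a sketch of what the BOOLEAN and BIGINT rows of the table imply for input data, the following Python helpers check a field before upload. They are hypothetical pre-checks written for this page, not functions of the client, and treating BOOLEAN input as case-insensitive is an assumption.

```python
BIGINT_MIN = -9223372036854775807
BIGINT_MAX = 9223372036854775807


def parse_boolean(text):
    """Accept the upload values listed in the table: true, false, 0, 1."""
    mapping = {"true": True, "false": False, "1": True, "0": False}
    key = text.strip().lower()  # assumption: case does not matter
    if key not in mapping:
        raise ValueError(f"not a valid BOOLEAN value: {text!r}")
    return mapping[key]


def parse_bigint(text):
    """Parse an integer and check it against the documented BIGINT range."""
    value = int(text)
    if not BIGINT_MIN <= value <= BIGINT_MAX:
        raise ValueError(f"BIGINT out of range: {text!r}")
    return value


print(parse_boolean("1"), parse_boolean("false"))  # prints: True False
print(parse_bigint("9223372036854775807"))         # prints: 9223372036854775807
```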

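If you upload DATETIME data, you must specify the date and time format. For the pattern syntax, see SimpleDateFormat.

"yyyyMMddHHmmss": matches data such as "20140209101000"
"yyyy-MM-dd HH:mm:ss" (default): matches data such as "2014-02-09 10:10:00"
"yyyy年MM月dd日": matches data such as "2014年09月01日"

Example: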

tunnel upload log.txt test_table -dfp "yyyy-MM-dd HH:mm:ss"
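
To check locally that a value matches one of the patterns listed above, the patterns can be translated by hand into Python `strptime` directives. The mapping below covers only those three patterns and is an illustrative sketch; it is not a general SimpleDateFormat converter, and `parse_datetime` is a hypothetical name.

```python
from datetime import datetime

# Hand-written strptime equivalents of the three patterns listed above.
PATTERNS = {
    "yyyyMMddHHmmss": "%Y%m%d%H%M%S",
    "yyyy-MM-dd HH:mm:ss": "%Y-%m-%d %H:%M:%S",
    "yyyy年MM月dd日": "%Y年%m月%d日",
}


def parse_datetime(value, pattern="yyyy-MM-dd HH:mm:ss"):
    """Parse value with the given pattern; raises ValueError on mismatch."""
    return datetime.strptime(value, PATTERNS[pattern])


print(parse_datetime("20140209101000", "yyyyMMddHHmmss"))
# prints: 2014-02-09 10:10:00
print(parse_datetime("2014-02-09 10:10:00"))
# prints: 2014-02-09 10:10:00
```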

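Null values: All data types can have null values.

By default, the null indicator is an empty string.
Use the -null-indicator option on the command line to specify the string that represents a null value.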

tunnel upload log.txt test_table -ni "NULL"
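
The null indicator can be thought of as a string that is replaced by a null value when a row is read. A minimal Python sketch of that idea follows; `apply_null_indicator` is a hypothetical helper, and the sample rows are invented for the example.

```python
def apply_null_indicator(fields, null_indicator=""):
    """Replace every field equal to the null indicator with None."""
    return [None if field == null_indicator else field for field in fields]


print(apply_null_indicator(["shopz", "NULL", "300"], null_indicator="NULL"))
# prints: ['shopz', None, '300']
print(apply_null_indicator(["shopz", "", "300"]))
# prints: ['shopz', None, '300']
```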

Character encoding: You can specify the character encoding of the file. The default is UTF-8.

tunnel upload log.txt test_table -c "gbk"
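
Why the charset matters can be seen with a short Python sketch: bytes written in GBK do not decode as UTF-8, so the encoding has to be stated. `to_utf8` is a hypothetical helper and the sample text is made up; this only illustrates the conversion and is not what the client runs internally.

```python
def to_utf8(raw_bytes, charset):
    """Decode bytes in the given charset and re-encode them as UTF-8."""
    return raw_bytes.decode(charset).encode("utf-8")


gbk_bytes = "店铺x,x_id,100".encode("gbk")

# Reading GBK bytes as UTF-8 fails, which is why the charset must be given.
try:
    gbk_bytes.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

print(decoded_as_utf8)                              # prints: False
print(to_utf8(gbk_bytes, "gbk").decode("utf-8"))    # prints: 店铺x,x_id,100
```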

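Delimiters: Tunnel commands support custom file delimiters. The row delimiter option is -record-delimiter, and the column delimiter option is -field-delimiter.

Notes on delimiters:

Row and column delimiters can consist of multiple characters.
The column delimiter cannot contain the row delimiter.
For escape-character delimiters, only \r, \n, and \t are supported on the command line.

Example: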

tunnel upload log.txt test_table -fd "||" -rd "\r\n"
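
The delimiter rules above can be sketched in Python as follows. The code splits text with multi-character delimiters, expands the three supported escapes, and rejects a column delimiter that contains the row delimiter. `decode_escapes` and `split_records` are hypothetical helpers, and skipping empty records is a simplification made for this example.

```python
def decode_escapes(text):
    """Expand the \\r, \\n and \\t escapes as typed on the command line."""
    for escape, char in (("\\r", "\r"), ("\\n", "\n"), ("\\t", "\t")):
        text = text.replace(escape, char)
    return text


def split_records(data, field_delimiter=",", record_delimiter="\r\n"):
    """Split data into rows and columns using possibly multi-character delimiters."""
    if record_delimiter in field_delimiter:
        raise ValueError("the column delimiter must not contain the row delimiter")
    records = [r for r in data.split(record_delimiter) if r]
    return [r.split(field_delimiter) for r in records]


data = "shopx||x_id||100\r\nshopy||y_id||200\r\n"
print(split_records(data, "||", decode_escapes(r"\r\n")))
# prints: [['shopx', 'x_id', '100'], ['shopy', 'y_id', '200']]
```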