Adolph
Engineer
Engineer
  • UID623
  • Fans2
  • Follows1
  • Posts72
Reads:812Replies:0

Customize data partitions for log shipping to OSS to optimize your computing

Created#
More Posted time:Feb 28, 2017 13:17 PM
Data partitioning
OSS features high reliability and a low cost and is ranked as one of the best options for massive data storage. It is particularly applicable to semi-structured log storage and can be used in concert with E-MapReduce (with Hive, Impala or other computing engines) to load data for query analysis in the schema-on-read approach.
It is a common practice to specify analysis on logs, a time-series data type, by year, month or day, such as analyzing OSS logs using Hive:
• Without partitions, the computation requires to scan the entire table (all the files under directories with the prefix of the OSS bucket in the name).
• If partitions have been set by time, you only need to load objects under the directories of the specified date in the OSS.


For most log analysis, there are three benefits to storing OSS logs in partitions so as to reduce the data size required to be scanned by the computing engine:
• Reduced data analysis latency
• Overhead for additional data processing by the computing engine removed
• The cost incurred by unnecessary OSS reading requests saved

Set partitions for log shipping to OSS
Alibaba Cloud Log Service runs through the entire lifecycle of the log. It provides log collection, storage, real-time consumption, query, warning and shipping (OSS/MaxCompute/TableStore) functions.


Next I would like to introduce how to use the custom partition feature:

1. Configure log shipping rules to OSS for Logstore in Log Service
Generate partitioning character strings from the shipping job creation time in the %Y, %m, %d, %H and %M formats (refer to strptime APIstrptime API for format) to define the hierarchical structure of the directory where the object files written to the OSS are located. The virgule “/” represents the first-level OSS directory.
Every shipping job of the Log Service will write a file into the OSS, and the path format is oss://OSS-BUCKET/OSS-PREFIX/PARTITION-FROMAT_RANDOM-ID. Taking the shipping job created at 2017/01/20 19:50:43 as an example, the relationship between the OSS Prefix, the partition format and the OSS target file path can be illustrated in the following table:


Here the partition format is set to: %Y/%m/%d/log_%H%M%s:

2. Organization of OSS data partitions
$./osscmd listallobject oss://test-bucket/log_ship_oss_example/
2017-01-20 19:50:53 34.83MB oss://test-bucket/log_ship_oss_example/2017/01/20/log_195043_1484913043351525351_2850008
2017-01-20 19:51:12 34.89MB oss://test-bucket/log_ship_oss_example/2017/01/20/log_195105_1484913065536560919_2850011
2017-01-20 19:51:21 34.83MB oss://test-bucket/log_ship_oss_example/2017/01/20/log_195109_1484913069790140729_2850012
2017-01-20 19:51:31 34.88MB oss://test-bucket/log_ship_oss_example/2017/01/20/log_195123_1484913083833159138_2850013
2017-01-20 19:51:44 34.99MB oss://test-bucket/log_ship_oss_example/2017/01/20/log_195136_1484913096459291992_2850016
2017-01-20 19:52:06 34.83MB oss://test-bucket/log_ship_oss_example/2017/01/20/log_195159_1484913119479868401_2850018
2017-01-20 19:52:11 34.95MB oss://test-bucket/log_ship_oss_example/2017/01/20/log_195204_1484913124490493518_2850019
2017-01-20 19:52:29 34.98MB oss://test-bucket/log_ship_oss_example/2017/01/20/log_195219_1484913139902136845_2850022
2017-01-20 19:52:42 34.90MB oss://test-bucket/log_ship_oss_example/2017/01/20/log_195235_1484913155018441825_2850025
2017-01-21 19:53:02 34.99MB oss://test-bucket/log_ship_oss_example/2017/01/21/log_195255_1484999575415811298_2850027
2017-01-21 19:53:12 34.94MB oss://test-bucket/log_ship_oss_example/2017/01/21/log_195303_1485003383961701486_2850028
2017-01-21 19:53:25 34.80MB oss://test-bucket/log_ship_oss_example/2017/01/21/log_195317_1485085997822904318_2850029
2017-01-21 19:53:48 34.97MB oss://test-bucket/log_ship_oss_example/2017/01/21/log_195340_1485077380309025973_2850032
2017-01-21 19:53:54 34.82MB oss://test-bucket/log_ship_oss_example/2017/01/21/log_195348_1485077388819212161_2850036
2017-01-21 19:54:08 34.96MB oss://test-bucket/log_ship_oss_example/2017/01/21/log_195359_1485077399681928884_2850038
2017-01-21 19:54:24 34.98MB oss://test-bucket/log_ship_oss_example/2017/01/21/log_195415_1485077415925150389_2850040
2017-01-22 19:54:53 34.82MB oss://test-bucket/log_ship_oss_example/2017/01/22/log_195441_1485086081854680230_2850044
2017-01-22 19:54:55 34.89MB oss://test-bucket/log_ship_oss_example/2017/01/22/log_195443_1485086083617274509_2850045
2017-01-22 19:55:06 34.91MB oss://test-bucket/log_ship_oss_example/2017/01/22/log_195459_1485086099107535552_2850046
2017-01-22 19:55:24 34.97MB oss://test-bucket/log_ship_oss_example/2017/01/22/log_195514_1485086114057357853_2850047
2017-01-22 19:55:44 34.97MB oss://test-bucket/log_ship_oss_example/2017/01/22/log_195537_1485086137670335428_2850051
2017-01-22 19:55:54 34.80MB oss://test-bucket/log_ship_oss_example/2017/01/22/log_195543_1485086143710253025_2850052


3. OSS partitions for Hive
CREATE EXTERNAL TABLE log_ship_oss_example (
    time STRING,
    ip STRING,
    url STRING,
    method STRING,
    status STRING,
    user-agent STRING,
    api-version STRING
  )
  PARTITIONED BY (
    year STRING,
    mon STRING,
    day STRING
  )
  ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
  STORED AS TEXTFILE
  LOCATION 'oss://${AccessKeyId}:${AccessKeySecret}@test-bucket.oss-cn-hangzhou.aliyuncs.com/log_ship_oss_example';


Create an external table and load the partitions on the OSS, Hive will be able to start the data analysis.
Guest