Customize data partitions for log shipping to OSS to optimize computing
Data partitioning
OSS features high reliability at low cost and is one of the best options for massive data storage. It is particularly suitable for semi-structured log storage and works well with E-MapReduce (using Hive, Impala, or other computing engines) to load data for query analysis in a schema-on-read approach.

It is common practice to analyze logs, a time-series data type, by year, month, or day. Taking Hive analysis of OSS logs as an example:

• Without partitions, the computation has to scan the entire table (all the files under the directories matching the OSS bucket prefix).
• With partitions set by time, only the objects under the directories of the specified date need to be loaded from OSS.

For most log analysis scenarios, storing OSS logs in partitions reduces the amount of data the computing engine must scan, which brings three benefits:

• Lower data analysis latency
• No overhead for additional data processing by the computing engine
• No cost for unnecessary OSS read requests

Set partitions for log shipping to OSS

Alibaba Cloud Log Service covers the entire lifecycle of logs: it provides log collection, storage, real-time consumption, query, alerting, and shipping (to OSS/MaxCompute/Table Store). The following sections describe how to use the custom partition feature.

1. Configure OSS shipping rules for a Logstore in Log Service

Partitioning strings are generated from the shipping job's creation time using the %Y, %m, %d, %H, and %M tokens (see the strptime API for the format details), and they define the directory hierarchy of the object files written to OSS. The slash "/" separates OSS directory levels. Each shipping job of Log Service writes one file to OSS, with the path format oss://OSS-BUCKET/OSS-PREFIX/PARTITION-FORMAT_RANDOM-ID.
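To make the path construction concrete, here is a minimal sketch using GNU coreutils `date` with the same strptime-style tokens (this also uses %S for seconds, matching the partition format in the example that follows; the bucket and prefix names are the ones from that example, and RANDOM-ID stands in for the suffix Log Service appends):

```shell
# Sketch (assumed tool: GNU date): render the OSS object path that a
# shipping job created at 2017-01-20 19:50:43 would write.
created="2017-01-20 19:50:43"
bucket="test-bucket"
prefix="log_ship_oss_example"
# Format the creation time with the partition tokens %Y/%m/%d/log_%H%M%S.
partition=$(date -d "$created" +'%Y/%m/%d/log_%H%M%S')
# RANDOM-ID is a placeholder for the unique suffix appended by Log Service.
echo "oss://${bucket}/${prefix}/${partition}_RANDOM-ID"
# prints: oss://test-bucket/log_ship_oss_example/2017/01/20/log_195043_RANDOM-ID
```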
Taking a shipping job created at 2017/01/20 19:50:43 as an example, with the partition format set to %Y/%m/%d/log_%H%M%S, the relationship between the OSS prefix, the partition format, and the OSS target file path is as follows:

| OSS Bucket | OSS Prefix | Partition format | OSS target file path |
| --- | --- | --- | --- |
| test-bucket | log_ship_oss_example | %Y/%m/%d/log_%H%M%S | oss://test-bucket/log_ship_oss_example/2017/01/20/log_195043_1484913043351525351_2850008 |

2. Organization of OSS data partitions

```
$ ./osscmd listallobject oss://test-bucket/log_ship_oss_example/
2017-01-20 19:50:53 34.83MB oss://test-bucket/log_ship_oss_example/2017/01/20/log_195043_1484913043351525351_2850008
2017-01-20 19:51:12 34.89MB oss://test-bucket/log_ship_oss_example/2017/01/20/log_195105_1484913065536560919_2850011
2017-01-20 19:51:21 34.83MB oss://test-bucket/log_ship_oss_example/2017/01/20/log_195109_1484913069790140729_2850012
2017-01-20 19:51:31 34.88MB oss://test-bucket/log_ship_oss_example/2017/01/20/log_195123_1484913083833159138_2850013
2017-01-20 19:51:44 34.99MB oss://test-bucket/log_ship_oss_example/2017/01/20/log_195136_1484913096459291992_2850016
2017-01-20 19:52:06 34.83MB oss://test-bucket/log_ship_oss_example/2017/01/20/log_195159_1484913119479868401_2850018
2017-01-20 19:52:11 34.95MB oss://test-bucket/log_ship_oss_example/2017/01/20/log_195204_1484913124490493518_2850019
2017-01-20 19:52:29 34.98MB oss://test-bucket/log_ship_oss_example/2017/01/20/log_195219_1484913139902136845_2850022
2017-01-20 19:52:42 34.90MB oss://test-bucket/log_ship_oss_example/2017/01/20/log_195235_1484913155018441825_2850025
2017-01-21 19:53:02 34.99MB oss://test-bucket/log_ship_oss_example/2017/01/21/log_195255_1484999575415811298_2850027
2017-01-21 19:53:12 34.94MB oss://test-bucket/log_ship_oss_example/2017/01/21/log_195303_1485003383961701486_2850028
2017-01-21 19:53:25 34.80MB oss://test-bucket/log_ship_oss_example/2017/01/21/log_195317_1485085997822904318_2850029
2017-01-21 19:53:48 34.97MB oss://test-bucket/log_ship_oss_example/2017/01/21/log_195340_1485077380309025973_2850032
2017-01-21 19:53:54 34.82MB oss://test-bucket/log_ship_oss_example/2017/01/21/log_195348_1485077388819212161_2850036
2017-01-21 19:54:08 34.96MB oss://test-bucket/log_ship_oss_example/2017/01/21/log_195359_1485077399681928884_2850038
2017-01-21 19:54:24 34.98MB oss://test-bucket/log_ship_oss_example/2017/01/21/log_195415_1485077415925150389_2850040
2017-01-22 19:54:53 34.82MB oss://test-bucket/log_ship_oss_example/2017/01/22/log_195441_1485086081854680230_2850044
2017-01-22 19:54:55 34.89MB oss://test-bucket/log_ship_oss_example/2017/01/22/log_195443_1485086083617274509_2850045
2017-01-22 19:55:06 34.91MB oss://test-bucket/log_ship_oss_example/2017/01/22/log_195459_1485086099107535552_2850046
2017-01-22 19:55:24 34.97MB oss://test-bucket/log_ship_oss_example/2017/01/22/log_195514_1485086114057357853_2850047
2017-01-22 19:55:44 34.97MB oss://test-bucket/log_ship_oss_example/2017/01/22/log_195537_1485086137670335428_2850051
2017-01-22 19:55:54 34.80MB oss://test-bucket/log_ship_oss_example/2017/01/22/log_195543_1485086143710253025_2850052
```

3. OSS partitions for Hive

```sql
CREATE EXTERNAL TABLE log_ship_oss_example (
  time STRING,
  ip STRING,
  url STRING,
  method STRING,
  status STRING,
  `user-agent` STRING,
  `api-version` STRING
)
PARTITIONED BY (
  year STRING,
  mon STRING,
  day STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'oss://${AccessKeyId}:${AccessKeySecret}@test-bucket.oss-cn-hangzhou.aliyuncs.com/log_ship_oss_example';
```

After creating the external table and loading the partitions on OSS, Hive can start analyzing the data.
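The DDL only declares the partition columns; each shipped directory still has to be registered as a partition before Hive can prune by date. A minimal sketch, assuming the %Y/%m/%d directory layout shown in the listing above:

```sql
-- Register one day's directory as a partition (repeat per day, or use
-- MSCK REPAIR TABLE if the layout is Hive-style key=value directories).
ALTER TABLE log_ship_oss_example
  ADD PARTITION (year='2017', mon='01', day='20')
  LOCATION 'oss://${AccessKeyId}:${AccessKeySecret}@test-bucket.oss-cn-hangzhou.aliyuncs.com/log_ship_oss_example/2017/01/20';

-- A query filtered on the partition columns then scans only that day's
-- objects instead of the whole table:
SELECT url, count(*) AS pv
FROM log_ship_oss_example
WHERE year = '2017' AND mon = '01' AND day = '20'
GROUP BY url;
```

Since this layout uses plain %Y/%m/%d directories rather than year=2017/mon=01/day=20, the explicit `ADD PARTITION ... LOCATION` form is the one that applies here.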