
Data cleansing and ETL

Last Updated: May 08, 2018

An assumption during log processing is that data is not perfect. Because of the gap between the raw data and the final results, the raw data must be cleansed, transformed, and sorted using methods such as ETL (Extract, Transform, Load) to produce the final results.


“I Want Take-away” is an e-commerce website with a platform involving users, restaurants, and couriers. Users can place their take-away orders using the website, app, WeChat, or Alipay. Once an order is received, the merchant starts preparing the ordered dishes, and the nearest couriers are automatically notified. One of the couriers then accepts the order and delivers it to the buyer.


The operation team has two jobs:

  • Determine couriers’ locations and assign orders based on location.
  • Understand how coupons and cash are used, and distribute coupons by location as part of interactive operations.

Processing the courier’s location information (GPS)

GPS data (X and Y) is reported once a minute by the courier’s app in the following format:

  1. 2016-06-26 19:00:15 ID:10015 DeviceID:EXX12345678 Network:4G GPS-X:10.30.339 GPS-Y: Status:Delivering

The data feed records the reporting time, courier ID, device serial number, network in use, and coordinates (GPS-X and GPS-Y). The longitude and latitude given by GPS are accurate, but the operation team does not need such precise data to understand the current status of each area. Therefore, the raw data must be transformed to convert the coordinates into readable fields such as city, district, and ZIP code.

  1. (GPS-X,GPS-Y) --> (GPS-X, GPS-Y, City, District, ZipCode)
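The mapping above can be sketched in Python. The `REGION_INDEX` lookup table and its rounding scheme are illustrative stand-ins; a real deployment would call a reverse-geocoding service or load a proper region index instead.

```python
# Sketch of the (GPS-X, GPS-Y) -> (GPS-X, GPS-Y, City, District, ZipCode)
# transform. The region lookup table is a hard-coded stand-in for a real
# reverse-geocoding service; the key values below are illustrative only.

REGION_INDEX = {
    # (rounded X, rounded Y) -> (city, district, ZIP code)
    (10, 30): ("Hangzhou", "Xihu", "310000"),
}

def enrich(record):
    """Add City/District/ZipCode fields to a parsed GPS record."""
    key = (int(float(record["GPS-X"])), int(float(record["GPS-Y"])))
    city, district, zipcode = REGION_INDEX.get(key, ("Unknown", "Unknown", ""))
    return {**record, "City": city, "District": district, "ZipCode": zipcode}

raw = {"ID": "10015", "GPS-X": "10.5", "GPS-Y": "30.2"}
print(enrich(raw))
```

The original coordinates are kept alongside the new readable fields, so downstream consumers that still need precise positions are not affected.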

This is a typical ETL requirement. Use the LogHub function to create two LogStores: the raw LogStore (PositionLog) and the transformed LogStore (EnhancedLog). Run the ETL application (for example, Spark Streaming, Storm, or the Consumer Library enabled in a container) to subscribe to PositionLog in real time, convert the coordinates, and write the results to EnhancedLog. You can then carry out real-time computing on EnhancedLog for visualization, or create an index on it for queries and warehouse-style analysis.

We recommend that you use the following architecture for the entire process:

  1. Each courier’s location is shown in the app; their GPS locations are reported every minute and written into the first LogStore (PositionLog).
    • We recommend that you use the LogHub Android/iOS SDK or MAN (mobile analytics) to collect mobile device logs.
  2. Use the real-time application to subscribe to the PositionLog data in real time, and write the processed data into the EnhancedLog LogStore.
  3. Process the enhanced log, for example, by visualizing the computed log data.
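The subscribe-transform-write loop in steps 1 and 2 can be sketched as follows. The `pull_position_log` and `write_enhanced_log` callables are placeholders for whichever consumer you choose (Spark Streaming, Storm, or the Consumer Library); only the data flow between the two LogStores is meant to be accurate.

```python
# Minimal sketch of the PositionLog -> ETL -> EnhancedLog data flow.
# The I/O callables are stand-ins for a real LogHub consumer and producer.

def run_etl(pull_position_log, write_enhanced_log, enrich):
    """Read raw position batches, enrich each record, forward the result."""
    for batch in pull_position_log():          # subscribe to PositionLog
        enriched = [enrich(rec) for rec in batch]
        write_enhanced_log(enriched)           # write to EnhancedLog

# Stand-in I/O for demonstration:
sink = []
run_etl(
    pull_position_log=lambda: [[{"ID": "10015"}]],   # one fake batch
    write_enhanced_log=sink.extend,
    enrich=lambda r: {**r, "City": "Hangzhou"},      # fake enrichment
)
print(sink)
```

In production, the pull side would track consumption cursors per shard so the ETL job can resume after a restart without losing or duplicating data.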


Desensitization and analysis of payment orders

The Payment Service receives a payment request that includes the payment account, payment method, amount, and coupon.

  • Part of the sensitive information must be desensitized.
  • Two types of information, coupon and cash, must be separated from the payment information.

The entire process is as follows:

  1. Create four LogStores for the raw data (PayRecords), desensitized data (Cleaned), cash orders (Cash), and coupon orders (Coupon), respectively. The application uses the Log4J Appender to write the order data into the raw data LogStore (PayRecords).

    We recommend that you use the Log4J Appender or Producer Library, so that the sensitive data is not written to disk.

  2. The desensitization application consumes the PayRecords LogStore in real time and writes the desensitized data into the Cleaned LogStore after stripping off the account-related information.

  3. The traffic delivery application consumes the Cleaned LogStore in real time and, according to the business logic, saves the two types of information (coupon and cash) into the corresponding LogStores (Cash and Coupon) for subsequent processing.

    We recommend that you use Spark Streaming, Storm, the Consumer Library (an auto-load-balancing programming model), or an SDK subscription for real-time desensitization and traffic delivery.
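Steps 2 and 3 can be sketched as two small functions: one that masks the account field, and one that routes each cleaned record to the Cash or Coupon stream. The field names (`Account`, `Amount`, `Coupon`) are assumptions based on the payment fields described above, not a documented schema.

```python
# Sketch of desensitization (step 2) and traffic delivery (step 3).
# Field names are assumed for illustration; adapt them to your schema.

def desensitize(order):
    """Mask the payment account, keeping only the last 4 characters."""
    acct = order["Account"]
    return {**order, "Account": "*" * (len(acct) - 4) + acct[-4:]}

def split(order, cash_store, coupon_store):
    """Route coupon payments to Coupon and cash payments to Cash."""
    (coupon_store if order.get("Coupon") else cash_store).append(order)

cash, coupon = [], []
cleaned = desensitize({"Account": "6222020405006", "Amount": "35.0"})
split(cleaned, cash, coupon)
print(cleaned["Account"], len(cash), len(coupon))
```

Because desensitization happens before the record reaches the Cleaned LogStore, downstream consumers such as the traffic delivery application never see the full account number.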


Additional information

  • With the LogHub function, access permissions for each LogStore can be controlled using RAM. For more information, see RAM.
  • The current LogHub read and write throughput can be obtained from Log Service monitoring, and the consumption status can be viewed in the console.