Alibaba Cloud Data Lake Analytics is an interactive analytics service that utilizes serverless architecture for big data applications.
With advances in cloud computing, issues such as data availability and accessibility can be addressed easily. However, the growth in data volume also means that systems need to process more kinds of raw data. This raw data may include structured data, unstructured data, images, videos, files, blobs, and other formats.
So what makes a data lake a necessity for organizations? A data lake allows raw data to be stored in its native format while still being processed at very high speeds, for example for reporting and visualization through business intelligence tools.
Alibaba Cloud Data Lake Analytics does exactly this. It is a repository where all of a company's raw data sits. It can perform ETL (Extract, Transform, and Load) activities right in the data lake, and all the extracted and transformed data is available for consumption at the same location. Currently, organizations typically use Hadoop for this purpose, but Hadoop requires a batch compute mechanism, which is another dependency for enterprises and might not be easy to work with.
People often mistake data lakes for data warehouses. In fact, they are quite different. A product such as Alibaba Cloud Table Store is a data warehouse, while Data Lake Analytics, as its name suggests, is a data lake. The two types of data storage differ in several ways:
Data warehouses remove data that is not part of the decision-making process. This helps reduce disk space usage, and may also follow from the scoped process of the enterprise data warehouse (EDW). Data Lake Analytics, on the other hand, keeps track of all data, because of the different actions we are going to perform and the wide range of data types it supports. A majority of Data Lake Analytics users connect processed data to their own BI tools to automate reporting. A smaller group of users may analyze old data using new techniques. Another group may bring in new data and create new datasets from it.
Data Lake Analytics supports almost all types of data, whereas data warehouses contain data generated by transactional systems. Warehouse data is well structured, with all the attributes that describe it. The data in a data lake, by contrast, may or may not carry much descriptive information.
The data in the data lake might come from web application logs, JSON files, CSV files, images, videos, PDFs, sensor data, and any other files. In other words, it also includes schemaless data, which is analogous to a NoSQL database compared to a SQL one. Also, data in an EDW is read-only, while a data lake supports both reads and writes.
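To make the idea of schemaless raw data more concrete, here is a minimal Python sketch of applying a schema only at read time. The sample log lines, the field names, and the read_with_schema helper are all hypothetical and are not part of Data Lake Analytics. They only illustrate the general pattern of keeping raw data untouched and shaping it when it is read.

```python
import json

# Hypothetical raw web-application log lines, stored as-is with no schema enforced on write.
RAW_LINES = [
    '{"user": "alice", "path": "/home", "ms": 120}',
    '{"user": "bob", "path": "/cart"}',   # the "ms" field is missing
    'not json at all',                    # unparseable line
]


def read_with_schema(lines, schema):
    """Apply a schema at read time.

    Keeps only the fields named in `schema`, fills missing fields with None,
    and skips lines that are not JSON objects.
    """
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        rows.append({field: record.get(field) for field in schema})
    return rows


rows = read_with_schema(RAW_LINES, schema=("user", "path", "ms"))
for row in rows:
    print(row)
```

The raw lines are never modified, so a different schema can be tried later against the same data.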
Implementing a data warehouse carries overhead because there is a lot to do: cleaning up the data, assigning a schema, and bringing the data into a proper shape for the business intelligence team to consume. This takes a significant amount of development time. Data Lake Analytics doesn't pose these problems. Since the raw data is available, we can experiment and discard the result if we don't want it, or else persist it and provide the information.
Data Lake Analytics will be much faster due to the power of parallel computing, and the data will be readily available. This comes at a cost: the user who accesses the data is responsible for cleaning it up and checking whether it is a good fit for the business's users.
Is Data Lake Analytics Really for Me? You can refer to Why I Love Data Lake Analytics.
This article describes how to use Data Lake Analytics (DLA) to analyze files stored in Object Storage Service (OSS) instances based on the file format.
Alibaba Cloud Data Lake Analytics (DLA) is a serverless interactive query and analysis service in Alibaba Cloud. You can query and analyze data stored in Object Storage Service (OSS) and Table Store instances simply by running standard SQL statements, without moving the data.
Currently, DLA has been officially launched on Alibaba Cloud. You can apply for a trial of this out-of-the-box data analysis service.
Visit the official documentation page to apply for activation of the DLA service.
In addition to plain text files such as CSV and TSV files, DLA can also query and analyze data files in other formats, such as ORC, Parquet, JSON, RCFile, and Avro. DLA can even query geographical JSON data in line with the ESRI standard and files matching the specified regular expressions.
DLA provides various built-in Serializers/Deserializers (SerDes) for file processing. Instead of writing programs yourself, you can choose one or more SerDes to match the formats of the data files in your OSS instances. Contact us if the SerDes do not meet your needs for processing special file formats.
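As a rough illustration of matching a table definition to a file format, the following Python sketch assembles a Hive-style external table statement for files in OSS. It assumes Hive-compatible DDL (ROW FORMAT DELIMITED, STORED AS, LOCATION), and the table name, columns, bucket path, and the external_table_ddl helper are hypothetical. Check the official documentation for the exact syntax and the SerDe that fits each format.

```python
def external_table_ddl(table, columns, location, file_format="TEXTFILE", delimiter=","):
    """Build a Hive-style CREATE EXTERNAL TABLE statement as a string.

    Assumption: delimited text files need a ROW FORMAT clause, while
    self-describing formats such as ORC or PARQUET only need STORED AS.
    """
    cols = ",\n".join(f"    {name} {sql_type}" for name, sql_type in columns)
    parts = [f"CREATE EXTERNAL TABLE {table} (\n{cols}\n)"]
    if file_format == "TEXTFILE":
        parts.append(f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'")
    parts.append(f"STORED AS {file_format}")
    parts.append(f"LOCATION '{location}';")
    return "\n".join(parts)


# Hypothetical CSV click logs stored under a hypothetical bucket path.
ddl = external_table_ddl(
    "click_logs",
    [("user_id", "STRING"), ("clicked_at", "TIMESTAMP")],
    "oss://example-bucket/logs/",
)
print(ddl)
```

The sketch only builds the statement text. Running it against DLA would still require a connection with the credentials issued at activation.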
I started my own company in the business of big data for promotion. A major difficulty that I had to conquer was processing offline data and cold backup data.
We needed to collect hundreds of millions of search results from Apple's App Store. To reduce costs, we kept only the past month of data in the online database and backed up historical data in Alibaba Cloud Object Storage Service (OSS). When we needed the historical data, we had to import it to the database or a Hadoop cluster. Companies with limited budgets use similar methods for data such as users' click logs. This method is inconvenient because it requires a large amount of ETL import and export work. As data operations grow increasingly complex, frequent requirements such as historical data backtracking and analysis are consuming a considerable part of R&D engineers' time and energy.
Alibaba Cloud's Data Lake Analytics was like a ray of hope. It allows us to query and analyze data in OSS using simple database methods that support MySQL. It simplifies big data analysis and delivers satisfying performance. To a certain extent, it can directly support some of the online services.
Here I'd like to share my experience, using an example of big data analysis for blockchain logs. I picked blockchain data because the data volume is huge, which can fully demonstrate the features of Data Lake Analytics. Another reason is that all the blockchain data is open.
The dataset used in this article contains all Ethereum data as of June 2018. More than 80% of today's DApps are deployed on Ethereum, so analyzing and mining this data is of high value.
Ethereum data is organized as a chain of data blocks, and each block contains detailed transaction data. The transaction data is similar to common access logs and includes the From (source user address), To (destination user address), and Data (sent data) fields. For example, ETH exchanges between users and invocations of Ethereum programs (smart contracts) by users are both completed through transactions. Our experimental dataset contains about 5 million blocks and 170 million transaction records.
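The transaction fields described above can be pictured as a simple record. The Python sketch below is only an illustration: the Transaction class, the sample addresses, and the top_destinations helper are hypothetical, and the real dataset would be queried in place rather than loaded into memory. It shows the kind of access-log-style question the data supports, here which destination address receives the most transactions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    """One transaction record with the fields named in the text."""
    block_number: int
    from_address: str  # From: source user address
    to_address: str    # To: destination user address
    data: str          # Data: sent data (empty for a plain ETH transfer)


# Hypothetical sample records; the addresses are placeholders, not real ones.
SAMPLE = [
    Transaction(1, "0xuser_a", "0xcontract", "0xa9059cbb"),
    Transaction(1, "0xuser_b", "0xuser_a", ""),
    Transaction(2, "0xuser_b", "0xcontract", "0xa9059cbb"),
]


def top_destinations(transactions, n=1):
    """Return the n most frequent destination addresses with their counts."""
    counts = Counter(tx.to_address for tx in transactions)
    return counts.most_common(n)


print(top_destinations(SAMPLE))
```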
In this tutorial, we will analyze raw data with Tableau using Data Lake Analytics. We will analyze the files that are available in Object Storage Service (OSS).
First, you have to activate Alibaba Cloud Data Lake Analytics using the console or through the product page. After successful activation, you will receive an email containing the credentials for accessing Data Lake Analytics.
Upload Data Files to Object Storage Service (OSS)
You have to create a bucket in the OSS console or OSS Browser in the same region as Data Lake Analytics. For example, if you are using Data Lake Analytics outside China, you can use the Singapore region, which means that the data has to be stored in an OSS bucket in Singapore. The choice of availability zone does not matter.
Grant Data Lake Analytics permission to access OSS in the RAM console or by clicking the link below:
Data Lake Analytics does not require any ETL tools. This service allows you to use standard SQL syntax and business intelligence (BI) tools to efficiently analyze your data stored in the cloud with extremely low costs. Through the introduction and usage of the product, this course enables learners to quickly get started with Data Lake Analytics.
This course briefly explains the basics of the Alibaba Cloud big data product system and several products used in big data applications, such as MaxCompute, DataWorks, RDS, DRDS, QuickBI, TS, Analytic DB, OSS, and Data Integration. Students can refer to the application scenarios explained, combine them with their enterprise's own business and demands, and apply what they have learned in practice.
The dataset for this course comes from a virtual blog site. We are going to use the data to solve business problems, for example: Which countries do your customers come from? Which day of the week gets the most online traffic? Which region contributes the most clickstream data?
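Business questions like these boil down to grouping and counting. Here is a minimal Python sketch, assuming hypothetical clickstream records of the form (timestamp, country code). The sample data and helper names are illustrative only and do not come from the course dataset.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

# Hypothetical clickstream records: (ISO timestamp, country code).
CLICKS = [
    ("2018-06-04T09:15:00", "SG"),  # a Monday
    ("2018-06-04T17:40:00", "US"),  # a Monday
    ("2018-06-06T11:05:00", "SG"),  # a Wednesday
]


def busiest_weekday(clicks):
    """Return the name of the weekday with the most click records."""
    counts = Counter(
        DAY_NAMES[datetime.fromisoformat(ts).weekday()] for ts, _country in clicks
    )
    return counts.most_common(1)[0][0]


def top_country(clicks):
    """Return the country code that contributes the most click records."""
    counts = Counter(country for _ts, country in clicks)
    return counts.most_common(1)[0][0]


print(busiest_weekday(CLICKS), top_country(CLICKS))
```

In SQL, the same questions would typically be expressed as GROUP BY queries over the clickstream table.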
XpoLog Log Analysis platform collects and indexes any machine-generated data from any device, server or application in the environment. Immediately start searching and analyzing your log data, run complex correlations and create mission critical real time visualizations of the data.
Data Lake Analytics (DLA) is a serverless cloud native interactive query and analytics service. DLA does not require Extract, Transform, Load (ETL) to connect to Alibaba Cloud services, such as Object Storage Service (OSS), Table Store, and Relational Database Service (RDS). For enterprise edition users, DLA supports Pay-As-You-Go billing. Due to its serverless architecture, DLA requires zero server maintenance. DLA allows you to run queries across multiple data sources in different formats through standard Java Database Connectivity (JDBC).
This Alibaba Cloud International Website Data Lake Analytics (DLA) Service Level Agreement (“SLA”) applies to your purchase and use of the Alibaba Cloud International Website Data Lake Analytics (DLA) (“Service”) and your use of the Service is subjected to the terms and conditions of the Alibaba Cloud International Website Product Terms of Service (“Product Terms”) between the relevant Alibaba Cloud entity described in the Product Terms (“Alibaba Cloud”, “us”, or “we”) and you. This SLA only applies to your purchase and use of the Services for a fee, and shall not apply to any free Services or trial Services provided by us.
MaxCompute (previously known as ODPS) is a general purpose, fully managed, multi-tenancy data processing platform for large-scale data warehousing. MaxCompute supports various data importing solutions and distributed computing models, enabling users to effectively query massive datasets, reduce production costs, and ensure data security.