Say we want to gather data in real time and plug it into any Alibaba Cloud service for analytics. With an Alibaba Cloud Elastic Compute Service (ECS) instance, we can achieve this.
Today we'll be making our very own data collection "Bot" with Python and Selenium. Selenium WebDriver is a very powerful extension to our Python toolkit. With Selenium we can mimic many user behaviors, such as clicking on a specific element or hovering the mouse over a particular screen position. We can also automate many repetitive day-to-day tasks, such as scraping real-time cryptocurrency market data, as we do in this tutorial. This data can be passed into any other Alibaba Cloud service.
First, let's create and log in to a GUI-based Windows Alibaba Cloud Elastic Compute Service instance. I have chosen Windows Server 2016 as it seems to run fast enough. The main point here is that we can watch the Python Selenium script work its magic in a graphical environment. At some point we may wish to run headless, but that is beyond the scope of this article.
Open Internet Explorer and navigate to mozilla.org. You may need to add some exceptions to the built-in Windows firewall to allow the Firefox installer to download.
Download Firefox from www.mozilla.org
Download the Python 3.7.1 Windows installer from http://www.python.org/download/ and select the 64-bit executable.
Run the installer. Be sure to check the option to add Python to your PATH while installing.
The Windows version of Python 3.7.1 includes the pip installer. This makes it easy for us to install all the Python modules we will need, specifically Selenium WebDriver.
Open Windows PowerShell and run python --version to make sure that Python was installed successfully.
We should see Python 3.7.1 here.
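The same check can also be made from inside Python. The following is a minimal sketch; the helper name version_ok and the (3, 7) minimum are assumptions chosen for this tutorial, not part of any library.

```python
import sys


def version_ok(version_info, minimum=(3, 7)):
    """Return True when the interpreter is at least `minimum` (major, minor)."""
    return tuple(version_info[:2]) >= minimum


# Print the full version string, then a short verdict.
print(sys.version)
print("OK" if version_ok(sys.version_info) else "Older than expected")
```

Save it as a .py file and run it with python from the same shell to confirm which interpreter is first on the PATH.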
So let's install Selenium with pip.
pip install selenium
Now we need to install the proper web driver for Selenium. Since we are using Firefox as our browser, we need geckodriver. The process is similar to installing the Chrome web driver.
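Save the link to a local file on your Alibaba Cloud Windows instance, then extract the zip. I extracted it to the desktop. Make sure to add the folder that contains geckodriver.exe, not the file itself, to the system path. The %path% syntax is expanded by the Command Prompt rather than by PowerShell, so run the following from cmd.exe.
setx path "%path%;c:\Users\Administrator\Desktop"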
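To confirm that the driver can actually be found, a quick check along these lines may help. It is a sketch that uses only the standard library; find_driver is a hypothetical helper name. Changes made with setx only apply to shells opened afterwards, so run it from a new window.

```python
import shutil


def find_driver(name="geckodriver"):
    """Return the full path of `name` if it is on the PATH, else None."""
    return shutil.which(name)


path = find_driver()
print(path if path else "geckodriver not found on PATH; try a new shell window")
```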
Now let's get our development environment organized. You can always just use Notepad to write your code. I prefer Notepad++, a free open source editor, for its syntax highlighting.
You can refer to Data Collection Bot with Python and Selenium to decide what we want our Python Selenium "Web Bot" to do.
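As a rough idea of what such a bot can look like, here is a minimal sketch. It assumes Firefox and geckodriver are installed as described above; the function names, the URL, and the CSS selector in the usage comment are hypothetical placeholders, not taken from the full tutorial.

```python
import re


def parse_price(text):
    """Convert scraped text such as '$6,412.53' to a float, or None if it has no number."""
    cleaned = re.sub(r"[^0-9.]", "", text)
    try:
        return float(cleaned)
    except ValueError:
        return None


def collect_prices(url, css_selector):
    """Open `url` in Firefox and return the numeric values found under `css_selector`."""
    # Imported here so parse_price can be used on machines without a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()  # requires geckodriver on the PATH
    try:
        driver.get(url)
        elements = driver.find_elements(By.CSS_SELECTOR, css_selector)
        prices = [parse_price(element.text) for element in elements]
        return [price for price in prices if price is not None]
    finally:
        driver.quit()  # always close the browser window


# Hypothetical usage; replace the URL and selector with those of the page you scrape:
# print(collect_prices("https://example.com/markets", "td.price"))
```

Keeping the parsing step separate from the browser step makes it easy to test the cleanup logic without launching Firefox.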
This article looks into how you can use Apache Arrow to assist PySpark in data processing operations. It also discusses Apache Arrow and its usage in Spark in general, and how Column Store and Zero Copy have improved the efficiency of data transmission.
What Is Column Store
Before we look into how you can use Apache Arrow in Spark, let's take a brief look at Apache Arrow and, first of all, at Column Store.
To understand Column Store, you first need to know that most storage engines used Row Store before the big data era. Many early systems, such as trading systems and ERP systems, process all the information of an entity each time they add, delete, modify, or query it. With Row Store, a single entity can be quickly located and processed. With Column Store, by contrast, operations on different attributes of an entity require multiple random reads and writes, which can be very inefficient.
However, with the advent of the big data era, and especially with the continuous development of data analysis, a task often does not need to read all attributes of an entity at once. Instead, it focuses on specific attributes and performs complex operations, such as aggregation, on them. In this situation Row Store has to read additional data, which becomes a bottleneck in the system. With the more recently developed Column Store, the reading of extra data can be greatly reduced, and data of the same attribute can also be compressed, which greatly accelerates processing.
The following is a comparison between Row Store and Column Store. It is taken from the Apache Arrow official website. At the top is a two-dimensional table consisting of three attributes: session_id, timestamp, and source_ip. On the left is the representation of Row Store in memory. Data is stored in sequence by rows, and each row is stored in sequence by columns. On the right is the representation of Column Store in memory. Each column is stored separately, and the number of column values written at one time is controlled by settings such as the batch size. As such, when the query statement involves only a few columns (such as the SQL query in the figure), only the session_id column needs to be scanned, which avoids reading all data columns and saves a large amount of I/O. Because the values of a column are stored contiguously, this layout also suits CPU pipelining and SIMD instructions, which further improves query speed.
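The difference between the two layouts can be illustrated in plain Python. This is only a conceptual sketch: the sample values are made up, the helper names are hypothetical, and real engines work on contiguous memory buffers rather than Python lists.

```python
# Row Store: one record per entity, all attributes kept together.
rows = [
    {"session_id": 1331, "timestamp": "2016-03-10 10:00:01", "source_ip": "10.0.0.1"},
    {"session_id": 1332, "timestamp": "2016-03-10 10:00:02", "source_ip": "10.0.0.2"},
    {"session_id": 1331, "timestamp": "2016-03-10 10:00:03", "source_ip": "10.0.0.3"},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per attribute (Column Store)."""
    return {key: [record[key] for record in records] for key in records[0]}


def select_where(columns, filter_col, wanted, output_col):
    """Scan only `filter_col`, then fetch the matching values from `output_col`."""
    hits = [i for i, value in enumerate(columns[filter_col]) if value == wanted]
    return [columns[output_col][i] for i in hits]


columns = to_columns(rows)
# Only the session_id list is scanned; the timestamp column is never touched.
print(select_where(columns, "session_id", 1331, "source_ip"))
# prints ['10.0.0.1', '10.0.0.3']
```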
In the big data field, Column Store is inspired by the Dremel paper published in 2010 by Google. The Dremel paper discusses a storage format that supports nested structures and uses Column Store to improve query performance. It also describes how Google uses this storage format to implement parallel queries. This paper influenced the development of the Hadoop ecosystem. Apache Parquet and Apache ORC are used as Column Store formats by Hadoop ecosystem projects such as Spark, Hive, and Impala.
But what is Apache Arrow anyway? As defined on the official website, Apache Arrow is a cross-language and cross-platform in-memory data structure. From this definition, we can see how Apache Arrow differs from Apache Parquet and Apache ORC. Parquet and ORC are designed to compress data on disk by using efficient compression algorithms on top of Column Store. For example, algorithms such as Snappy, Gzip, and Zlib are used to compress column data. Therefore, in most cases, data must first be decompressed when it is read, which consumes some CPU. Arrow does not support compressing data in memory (data written to disk can be compressed). Instead, Arrow performs similar indexing operations through its dictionary-encoded mode.
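The idea behind dictionary encoding can be sketched in a few lines of plain Python. This illustrates the concept only and is not Arrow's own API; the function names and sample values are hypothetical.

```python
def dictionary_encode(values):
    """Replace repeated values with small integer indices into a dictionary."""
    dictionary = []   # each distinct value, in order of first appearance
    positions = {}    # value -> index in `dictionary`
    indices = []      # one integer per input value
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original column from the dictionary and the indices."""
    return [dictionary[i] for i in indices]


column = ["cn", "us", "cn", "cn", "us"]
dictionary, indices = dictionary_encode(column)
print(dictionary, indices)
# prints ['cn', 'us'] [0, 1, 0, 0, 1]
```

A column with many repeated strings is thus stored as one short list of distinct values plus a list of integers, which saves memory without a separate decompression step on read.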
This article explores different ways in which you can present and visualize data through the Zeppelin interface.
The main aim of Big Data is to analyze data and reveal the business insights hidden in it, so that the business can be transformed accordingly. Big data analytics is a combination of various tools and technologies. Used together, these technologies help companies examine large quantities of data to uncover hidden patterns and arrive at useful findings. In more practical terms, by gathering and analyzing data about customers and transactions, organizations can make better judgments about how they run their business.
As we continue to walk through the big data cycle, it's time for us to collect and make full use of business intelligence with various BI tools. After transforming data into information using Spark, we are now all set to convert this information into insights, for which we might need the help of visualization tools. In this article, we will focus on building visual stories using Apache Zeppelin, an open-source notebook for all your needs including
Apache Zeppelin is a web-based multi-purpose notebook that is built into the set of services provided by Alibaba Cloud E-MapReduce. This makes it easy for you to process and visualize large datasets with familiar code, without moving to a separate set of Business Intelligence (BI) tools.
Zeppelin brings interactive data analytics to all kinds of users. It adds data exploration and visualization, as well as sharing and collaboration features, to Spark. It supports Python and a growing list of languages and interpreters such as Scala, Hive, SparkSQL, and Markdown. It has built-in visualizations, such as graphs and tables, which can be used to present our newly found data in a more appealing format. In this article, we will focus on accessing the Zeppelin interface and explore different ways in which we can present data.
Alibaba Cloud E-MapReduce supports Apache Zeppelin. Once the cluster is created, we can see the set of services initiated and running. To access Zeppelin, the first step is to open its port. When an E-MapReduce cluster is created, port 22 is accessible by default. This is the default port of Secure Shell (SSH), which we used to access the cluster in our previous articles. We can create security groups and assign ECS instances to them. The best approach is to divide ECS instances by function and put them into different security groups, since each security group has its own access control rules.
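After changing a security group rule, it can be useful to check from your own machine whether a port is actually reachable. The sketch below uses only the Python standard library; is_port_open is a hypothetical helper, and the address in the usage comment is a placeholder for your cluster's public IP.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Hypothetical usage: port 22 (SSH) should be reachable by default.
# print(is_port_open("203.0.113.10", 22))
```

If the SSH port reports open but the port you added for Zeppelin does not, the security group rule is the first thing to re-check.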
In this article series, we will be exploring data analytics for businesses using Alibaba Cloud Quick BI and sample data from banking and financial services.
This multi-part article talks about how to collect data, wrangle the data, ingest the data, model the data, and visualize the data from three viewpoints (conceptual, practical, and best practice).
In the first article of this series, we saw how to understand data conceptually through an example from the Banking, Financial Services and Insurance (BFSI) domain.
In this article, we will learn how to wrangle (that is, clean) the data according to your business scenario for use in Alibaba Cloud Quick BI. We will need Quick BI in the upcoming steps of deciphering data, so please ensure that you have registered for an Alibaba Cloud account. If you haven't, sign up for a free account through this link.
Data wrangling, sometimes referred to as data munging, is the process of transforming data from one format to another with the intent of making it more appropriate and valuable for analytics. With the rapid rise of Big Data and IoT applications, the number of data types and formats is increasing each day. This makes data wrangling an indispensable element of big data processing, especially for larger applications.
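As a small illustration of what wrangling can involve, the sketch below normalizes names, dates, and amounts that arrive in inconsistent formats. The field names, sample records, and accepted date formats are assumptions made up for this example; they are not the data set used in the article series.

```python
from datetime import datetime

# Date layouts we assume may appear in the raw data.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(text):
    """Return the date as an ISO string (YYYY-MM-DD), or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def parse_amount(text):
    """Strip currency symbols and thousands separators, then convert to float."""
    cleaned = text.replace(",", "").replace("$", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def wrangle(record):
    """Return a cleaned copy of one raw record."""
    return {
        "customer": record["customer"].strip().title(),
        "date": parse_date(record["date"]),
        "amount": parse_amount(record["amount"]),
    }


raw = [
    {"customer": "  alice WONG ", "date": "03/01/2019", "amount": "$1,250.00"},
    {"customer": "Bob Lee", "date": "2019-01-04", "amount": "980"},
]
clean = [wrangle(record) for record in raw]
print(clean)
```

Values that cannot be parsed come back as None, so they can be counted or reviewed instead of silently corrupting later analysis.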
Analysts often use libraries and tools in the Python ecosystem to analyze data on their personal computers. They like these tools because they are efficient, intuitive, and widely trusted. However, when they apply their analyses to larger datasets, they find that these tools were not designed to scale beyond a single machine.
The data set for this course comes from a virtual blog site. We are going to use the data to solve business problems, for example: Which countries do your customers come from? Which day of the week gets the most online traffic? Which region contributes the most clickstream data?
Learn how to utilize data to make better business decisions. Optimize Alibaba Cloud's big data products to get the most value out of your data.
Through this course, you will learn what Python visualization is and how to use Python visualization tools to generate plots to visualize data.
This topic describes how to build a Python development environment on a local PC so you can run VPC Python SDK examples. Alibaba Cloud SDK for Python supports Python 2.7. To run VPC Python SDK examples, you first need to install the core library of Alibaba Cloud SDK for Python and then install VPC Python SDK. Python SDK supports Windows, Linux, and Mac operating systems, and all VPC Python SDK examples can run in a Windows system.
MaxCompute UDFs are classified into UDFs, UDAFs, and UDTFs. This topic describes how to implement these functions by using MaxCompute Python.
MaxCompute uses Python 2.7. User code is executed in a sandbox, which is a restricted environment. In this environment, the following operations are prohibited:
Because of these restrictions, uploaded code must be implemented entirely in Python, since C extension modules are disabled.
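For orientation, a scalar UDF in MaxCompute Python typically looks like the sketch below. It assumes the odps.udf.annotate decorator and the evaluate method convention of MaxCompute Python UDFs; the fallback annotate defined in the except branch is a hypothetical stand-in added here only so the class can be exercised outside MaxCompute.

```python
try:
    from odps.udf import annotate  # provided by the MaxCompute runtime
except ImportError:
    def annotate(signature):
        """Local stand-in (an assumption for testing), records the signature only."""
        def decorator(cls):
            cls.signature = signature
            return cls
        return decorator


@annotate("bigint,bigint->bigint")
class MyPlus(object):
    """Scalar UDF that adds two BIGINT values and passes NULL through."""

    def evaluate(self, arg0, arg1):
        # SQL NULL arrives as None, so the result is NULL if either input is NULL.
        if arg0 is None or arg1 is None:
            return None
        return arg0 + arg1
```

The class uses only built-in Python features, which keeps it within the pure Python requirement described above.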
MySQL is one of the most popular open-source databases in the world. As a key component of the open-source software bundle LAMP (Linux, Apache, MySQL, and Perl/PHP/Python), MySQL has been widely applied to different scenarios.
An intelligent image search service with product search and generic search features to help users resolve image search requests.