Use Hive and Notebook to Analyze Video Website Data

Effective and efficient website data analysis and mining are critical in ensuring stable website operations. Alibaba Cloud's Elastic MapReduce (E-MapReduce) allows you to maximize business growth by analyzing data such as massive user attributes, video resource attributes, and users' video playback records. This article uses a video website as an example to introduce how to analyze website data with E-MapReduce.

E-MapReduce is a Hadoop-friendly SaaS launched by Alibaba Cloud, which supports Hadoop, Hive, Spark, Pig, Presto, and Storm. It offers a one-stop big data processing and analysis solution that covers cluster, job, and data management. This article uses Hive and Notebook to demonstrate how to efficiently and effectively conduct interim data analysis. The entire analysis process involves two major steps:
● Prepare data sources
● Analyze data with Notebook

The Alibaba Cloud products and services used in this article include: E-MapReduce and Object Storage Service (OSS).

1. Prepare Data Sources

At this step, you need to prepare the data sources and make sure that Hadoop can read the data. You can prepare data with either of these methods:
● OSS
Place the data to be analyzed on OSS through timed synchronization
● HDFS
Directly write the data to be analyzed to a Hadoop cluster
● Database
Directly read source data from a database

We recommend using OSS for the following reasons:
● OSS costs less than HDFS built with cloud disks
● HDFS is not flexible enough to cope with changes in data volume
● Directly reading data from a database will overload it

Since OSS is slightly slower than cloud-disk-based HDFS, it is the best used for low-cost offline computing.

We will use three sets of sample data to demonstrate how it works. When you become familiar with the entire process, you will be able to analyze the actual data.
User information data
Video information data
User playback data

1. Download the sample data above to your local disk.
2. Create directories on OSS.
First, create a bucket or use an existing bucket. The bucket and E-MapReduce must be in the same region.
Next, create three directories in a bucket to store data, as shown below.

The sample directory structure is as follows:
/mybucket/video-data/userinfo/
/mybucket/video-data/videoinfo/
/mybucket/video-data/videoplayinfo/

3. Upload the three sets of sample data you just downloaded to the corresponding OSS directories.
Take the userinfo/ data as an example.

After completing the steps above, you can proceed with data analysis.

2. Analyze Data with Notebook

With a data source available, you can create computing resources to carry out analysis. First, let's create a computing cluster.
a) Create a cluster.
i. Run E-MapReduce, click the Cluster tab on the left, and then click Create a Cluster at the upper right-hand corner. Please note that when choosing a region, you must choose the same one for the OSS data you prepared.
ii. Choose EMR-3.0.0 as the product version and use default options for other settings.
iii. Choose Pay-As-You-Go on the hardware page.
iv. Choose 4 Core 8 GB for both Master and Core nodes, and Ultra Cloud Disk or SSD as needed for Disk. Set 2 nodes for Core.

v. Set the cluster name.
vi. Set the OSS path to keep logs.
These logs are generated by Hadoop and other jobs in operation, and can be analyzed for troubleshooting when there is a job error or exception. Therefore, it is necessary to select this option and save logs for future maintenance.
vii. Set a password and then complete creating a cluster.
After you click OK, the cluster building process will begin. It takes about 5 minutes.
b) Create a Notebook.
i. Click the Notebook tab on the left to go to the Notebook page.

ii. Click New Notebook to create a Hive job.
Attached Cluster here can be left blank. You can attach a cluster later. Click OK to create a new Notebook job and set it to open automatically by default.

iii. Associate a cluster.
On the upper right-hand corner of the Notebook job page, click Not Attached, and select the cluster you created earlier from the drop-down list. The cluster should have been created at this point and is in either the Running or Idle status. If the cluster is still being created, wait for it to complete.

You are now all set and can begin to compile the scripts for data processing.
c) Analyze data.
i. Create a Hive table.

User information table

CREATE EXTERNAL TABLE IF NOT EXISTS user_info(id int,sex int,age int, marital_status int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 'oss://[bucketname].[endpoint]/[bucketpath]/userinfo'

The fields in this table are
FieldTypeDescription
idintUser ID
sexintGender
ageintAge
marital_statusintMarital status

Video information table

CREATE EXTERNAL TABLE IF NOT EXISTS video_info(id int,title string,type string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 'oss://[bucketname].[endpoint]/[bucketpath]/videoinfo'

FieldTypeDescription
IdintVideo ID
titlestringVideo title
typestringVideo type

Video playback information table

CREATE EXTERNAL TABLE IF NOT EXISTS play_video(user_id int,video_id int, play_time bigint) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 'oss://[bucketname].[endpoint]/[bucketpath]/videoplayinfo'

FieldTypeDescription
user_IdintUser ID
video_idintVideo ID
play_timebigintPlayback time

Bucketname: The OSS bucket designated to keep data earlier.
Endpoint: Any endpoint in any region can be queried using OSS Endpoint.
Bucketpath: The OSS path designated to keep data earlier.

Click Run to complete creating the data table.
Note:
● Use only one statement in every paragraph, and create the same number of paragraphs as sentences.
● No statement should end with a semicolon.
ii. Preview data.
You can use the following statements to preview the first 100 data entries.

select * from user_info limit 100


iii. Try a simple query.
Enter and run the following SQL statements in different paragraphs, and observe the outputs.
Query the total number of users

select count(*) from user_info

Query the total number of videos
select count(*) from video_info

Query the total number of playbacks
select count(*) from play_info

iv. Complex query composition.
You are now able to use simple SQL to compile some simple queries in Notebook. Now let's try some more complex analysis.
a) Let's take a look at the ranking of the most viewed videos.
select video.id, video.title, video.type, video_count.count from (select video_id, count(video_id) as count from play_video group by video_id order by count desc limit 10) video_count join video_info video on (video_count.video_id = video.id) order by count desc

Display the top 10 most viewed videos and their details.

b) Age distribution of viewers of the most viewed videos.

select age , count(*) as count from (select distinct(user_id) from play_video where video_id =49 ) play join user_info userinfo on (play.user_id = userinfo.id) group by userinfo.age order by count desc

Examine the age distribution of the video viewers.

You are welcome to explore other intriguing data analysis techniques. Remember to close the Pay-As-You-Go cluster after you complete testing.

Conclusion
This article introduces how to use Alibaba Cloud's E-MapReduce to analyze the data of a website. E-MapReduce is ideal for website operations and business growth. The example used in this article is just a rudimentary analysis. For data analysis and mining with higher complexity, you will need to consider the specific needs of a business or website.