Note: The cluster on which an interactive task runs must be EMR-2.3 or later and must have at least three nodes, each with at least 4 cores and 8 GB of memory.
Log on to the Alibaba Cloud E-MapReduce Console and open the Interactive Work page.
Click New notebook or File > New notebook.
Enter a name and select the default type. Associating a cluster at this point is optional. Click OK to create the notebook.
Three types are supported. Spark is used to write Scala Spark code. Spark SQL is used to write SQL statements supported by Spark. Hive is used to write SQL statements supported by Hive.
An associated cluster must be an existing cluster of EMR-2.3 or later with at least three nodes, each with at least 4 cores and 8 GB of memory. You can also associate the cluster later, before running the task.
Up to 20 interactive tasks can be created in one account.
A paragraph is the smallest unit that can be run in a notebook. A notebook can contain multiple paragraphs. Each paragraph can start with a type prefix such as %hive, which indicates whether the paragraph is a Scala Spark code paragraph, a Spark SQL paragraph, or a Hive SQL paragraph. The type prefix is separated from the actual content by a blank space or a line feed. If no type prefix is specified, the default type of the interactive task is used as the run type of the paragraph.
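The prefix rule above can be sketched in a few lines of plain Scala. This is only an illustration of the rule, not the service's actual parser. The object name `ParagraphPrefix` and the function `parse` are hypothetical, and only `%hive` is taken from the text; `%spark` and `%sql` are assumed names for the other two prefixes.

```scala
object ParagraphPrefix {
  // Only "%hive" appears in the documentation text;
  // "%spark" and "%sql" are assumed names used for illustration.
  val knownPrefixes: Set[String] = Set("%spark", "%sql", "%hive")

  /** Splits a paragraph into (runType, body).
    * The prefix is the first token, ended by a blank space or a line feed.
    * If no known prefix is present, defaultType is used and the text is kept as is.
    */
  def parse(text: String, defaultType: String): (String, String) = {
    val trimmed = text.dropWhile(_.isWhitespace)
    // head is the first whitespace-delimited token, rest is everything after it
    val (head, rest) = trimmed.span(c => !c.isWhitespace)
    if (knownPrefixes.contains(head)) (head.drop(1), rest.dropWhile(_.isWhitespace))
    else (defaultType, text)
  }
}
```

For example, `ParagraphPrefix.parse("%hive show tables", "spark")` yields the run type `hive` with the body `show tables`, while a paragraph without a prefix falls back to the default type passed in.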
The following is an example showing how to create a temporary Spark table:
Paste the following code into the paragraph. A red * symbol is displayed, indicating that the notebook has unsaved changes. Click the save paragraph button or the run button to save the changes to the paragraph. Click + below the paragraph to create a new paragraph. Up to 30 paragraphs can be created in one notebook.
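// load bank data (the original data source is not shown here; the two inline lines below are an illustrative stand-in)
val bankText = sc.parallelize(Seq(
  "\"age\";\"job\";\"marital\";\"education\";\"default\";\"balance\"",
  "30;\"unemployed\";\"married\";\"primary\";\"no\";1787"))
case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)
val bank = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
  s => Bank(s(0).toInt, s(1).replaceAll("\"", ""), s(2).replaceAll("\"", ""),
    s(3).replaceAll("\"", ""), s(5).replaceAll("\"", "").toInt)
).toDF()
bank.registerTempTable("bank") // register the DataFrame as the temporary table "bank"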
Before running a notebook, you must associate it with an existing cluster. If a notebook is not associated with a cluster, Not associated is displayed in the upper right corner of the page. Click it to select a cluster from the list of available clusters. Note that the associated cluster must be EMR-2.3 or later and must have at least three nodes, each with at least 4 cores and 8 GB of memory.
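The cluster requirements can be written down as a small check. This is a sketch in plain Scala under stated assumptions: the types `Node` and `Cluster` and the function `eligible` are hypothetical and not part of any E-MapReduce API, and the version string is assumed to look like `EMR-2.3` with purely numeric components.

```scala
object ClusterCheck {
  // Hypothetical data model; not an E-MapReduce API.
  final case class Node(cores: Int, memoryGb: Int)
  final case class Cluster(emrVersion: String, nodes: Seq[Node])

  // Compares dotted versions numerically, so that "2.10" counts as later than "2.3".
  private def atLeast(version: String, min: Seq[Int]): Boolean = {
    val parts = version.stripPrefix("EMR-").split('.').toSeq.map(_.toInt)
    val len = math.max(parts.length, min.length)
    parts.padTo(len, 0).zip(min.padTo(len, 0))
      .find { case (x, y) => x != y }      // first component that differs
      .forall { case (x, y) => x > y }     // no difference means the versions are equal
  }

  /** True if the cluster is EMR-2.3 or later and has at least three nodes,
    * each with at least 4 cores and 8 GB of memory.
    */
  def eligible(c: Cluster): Boolean =
    atLeast(c.emrVersion, Seq(2, 3)) &&
      c.nodes.size >= 3 &&
      c.nodes.forall(n => n.cores >= 4 && n.memoryGb >= 8)
}
```

Comparing version components as numbers rather than as strings avoids treating a version such as 2.10 as earlier than 2.3.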
Click the Run button to automatically save the current paragraph and run the content. If this paragraph is the last one, a new paragraph is automatically created.
After a paragraph is run, its current status is displayed. PENDING means the paragraph has not run yet, RUNNING means it is running, FINISHED means the run has completed, and ERROR means an error has occurred. The result is displayed below the run button of the paragraph. While a paragraph is running, you can click Cancel below the run button to cancel the run. END is displayed after the run is canceled.
A paragraph can be run multiple times, and only the result of the last run is retained. You cannot modify the content of a paragraph while it is running. The content can be modified only after the run has finished.
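The statuses described above can be summarized as a small model. The following is a sketch in plain Scala, not the service's implementation. The helper names `isTerminal`, `canCancel`, and `canEdit` are hypothetical, and treating a PENDING paragraph as editable is an assumption, since the text only rules out editing during running.

```scala
object ParagraphStatus extends Enumeration {
  // Status names as displayed in the notebook.
  val PENDING, RUNNING, FINISHED, ERROR, END = Value

  // A run is over once it has finished, failed, or been canceled.
  def isTerminal(s: Value): Boolean = s == FINISHED || s == ERROR || s == END

  // Cancel is offered while the paragraph is running.
  def canCancel(s: Value): Boolean = s == RUNNING

  // Assumption: editing is blocked only while the paragraph is running.
  def canEdit(s: Value): Boolean = s != RUNNING
}
```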
You can click Run All on the menu bar to run all paragraphs of a notebook. The paragraphs are submitted sequentially. Each type has its own independent execution queue, so if a notebook contains paragraphs of several types, the order in which they execute on the cluster depends on their type after they have been submitted. Spark and Spark SQL paragraphs are executed one by one. Hive paragraphs can be executed concurrently, and at most 10 interactive paragraphs can run concurrently on the same cluster. Note that all concurrently executed paragraphs are limited by cluster resources. If the cluster is small and many paragraphs need to run concurrently, the paragraphs still have to queue in YARN.
After a notebook is run on a cluster, the cluster creates a process that caches part of the runtime context to ensure a quick response when the notebook is run again. If you do not need to run other notebooks and want to release the cluster resources occupied by this cache, disassociate all notebooks that have been run from their associated clusters. This releases the memory occupied on those clusters.
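The queueing behavior of Run All can be illustrated with a simplified model. This sketch is in plain Scala and ignores YARN resource limits. The type labels `spark`, `sql`, and `hive`, the object `RunAllModel`, and the function `startImmediately` are hypothetical names chosen for the example.

```scala
object RunAllModel {
  // Spark and Spark SQL run one by one; Hive runs up to 10 paragraphs at once per cluster.
  def limit(kind: String): Int = if (kind == "hive") 10 else 1

  /** Given paragraph types in submission order, returns the indices of the
    * paragraphs that can start right away, before any paragraph has finished.
    * Each type has its own independent queue.
    */
  def startImmediately(kinds: Seq[String]): Seq[Int] =
    kinds.zipWithIndex
      .groupBy(_._1)                 // one queue per type
      .values
      .flatMap(q => q.sortBy(_._2).take(limit(q.head._1)).map(_._2))
      .toSeq
      .sorted
}
```

For a notebook whose paragraphs are of types spark, hive, spark, sql, and hive in that order, the model starts every paragraph except the second Spark paragraph, which waits behind the first one in the Spark queue.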
Hide the paragraph results and display only the entered content of the paragraph.
Delete the current paragraph. A paragraph that is running can also be deleted.
Create a notebook and switch to the created notebook interface.
Add a new paragraph to the end of the notebook. A notebook can have up to 30 paragraphs.
Save all modified paragraphs.
Delete the current notebook. If the notebook is associated with a cluster, it is disassociated first.
Display only the entered code for all paragraphs, or display both the code and the results.