DSWMagic

Last Updated: Jun 02, 2020

Alibaba Cloud continues to improve the algorithm development and model training capabilities of Data Science Workshop (DSW), which is part of Alibaba Cloud Machine Learning Platform for AI. The current focus is on expanding its big data analytics capability. Building on the built-in pyodps package, which lets you read MaxCompute data, DSW now supports interactive big data analytics. DSW covers the full big data analytics workflow in one place: data ingestion, data exploration and analysis, algorithm development, model training, and model deployment by using the built-in processors of PAI EAS. This gives you a single interactive environment for algorithm development and data analytics.

Features

After DSW is upgraded, you can write SQL statements in DSW. The built-in SQL editor supports features such as syntax highlighting, auto lookup, and auto completion. After you configure a data source, you can run SQL statements to read data from MaxCompute tables in different projects and display the analysis results in charts.

How to use DSW to analyze big data

1. Load dswmagic.

dswmagic is a notebook magic extension that is built into DSW. To use the big data features provided by DSW, install the required package and then load dswmagic:

  %load_ext dswmagic
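
If you want to load the extension from Python code instead of typing the magic in a cell, the step above can be sketched as follows. This is a minimal sketch: `load_dswmagic` is a hypothetical helper name, it relies only on the standard IPython calls `get_ipython()` and `run_line_magic()`, and it has not been verified against a DSW instance.

```python
def load_dswmagic(shell=None):
    """Load the dswmagic extension; return True on success, False otherwise.

    `shell` is an IPython shell object. If it is omitted, the helper looks
    up the active shell. This helper is hypothetical, not part of DSW.
    """
    if shell is None:
        try:
            from IPython import get_ipython
        except ImportError:
            # IPython is not installed, so there is no notebook to extend.
            return False
        shell = get_ipython()
    if shell is None:
        # Running as a plain Python script, outside a notebook kernel.
        return False
    try:
        # Equivalent to typing "%load_ext dswmagic" in a notebook cell.
        shell.run_line_magic("load_ext", "dswmagic")
    except ImportError:
        # The dswmagic package is not available in this environment.
        return False
    return True
```

In a notebook cell, calling `load_dswmagic()` with no arguments should have the same effect as the `%load_ext` line above.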

2. Switch the edit mode to SQL.

After you load the dswmagic command, add a cell to the .ipynb file, and select the SQL editor for the cell. The cell is then switched to SQL edit mode.

3. Specify data sources and endpoints.

Before you write SQL statements, you must specify the projects that contain the source MaxCompute tables, the endpoint for your DSW instance, and your AccessKey information. You can reuse this configuration in subsequent big data analytics tasks. Click the Plus (+) icon next to New DataSource to open the Config DataSource dialog box, and then enter the required information. The data source is then added to the DataSource drop-down list, and you can reference it by selecting it from the list.

Parameter descriptions:

  1. AccessKey ID: your Alibaba Cloud AccessKey ID.
  2. AccessKey Secret: your Alibaba Cloud AccessKey secret.
  3. Endpoint of DSW P100 instances deployed in China (Beijing) and DSW M40 instances deployed in China (Shanghai): http://service-all.ext.odps.aliyun-inc.com/api
  4. Endpoint of other DSW instances: http://service.cn.maxcompute.aliyun.com/api
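
The endpoint rule in the parameter list can be expressed as a small lookup. The following is an illustrative sketch only: `select_endpoint`, `DataSourceConfig`, and the field names are hypothetical and are not part of DSW or dswmagic. Only the two endpoint URLs and the instance and region pairs come from the list above.

```python
from dataclasses import dataclass

# Endpoint URLs copied from the parameter list above.
SPECIAL_ENDPOINT = "http://service-all.ext.odps.aliyun-inc.com/api"
DEFAULT_ENDPOINT = "http://service.cn.maxcompute.aliyun.com/api"

# (instance type, region) pairs that the list maps to SPECIAL_ENDPOINT.
SPECIAL_INSTANCES = {
    ("P100", "China (Beijing)"),
    ("M40", "China (Shanghai)"),
}


def select_endpoint(instance_type: str, region: str) -> str:
    """Return the endpoint that the parameter list assigns to an instance."""
    key = (instance_type.strip().upper(), region.strip())
    return SPECIAL_ENDPOINT if key in SPECIAL_INSTANCES else DEFAULT_ENDPOINT


@dataclass(frozen=True)
class DataSourceConfig:
    """Hypothetical record of the values entered in Config DataSource."""
    project: str
    access_key_id: str
    access_key_secret: str
    endpoint: str


# Example with placeholder credentials; never hard-code real AccessKey pairs.
config = DataSourceConfig(
    project="my_project",
    access_key_id="<your-access-key-id>",
    access_key_secret="<your-access-key-secret>",
    endpoint=select_endpoint("P100", "China (Beijing)"),
)
```

Keeping the rule in one function makes it easy to check which endpoint to enter in the dialog box for a given instance.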

4. Write and run SQL statements.

After you prepare the data sources, you can write SQL statements in DSW. You can use the SQL editor to run one or more SQL statements. To run multiple SQL statements at the same time, separate them with semicolons (;). The execution results are output line by line.

DSW can display the execution results of SQL statements in several forms: an Excel-style table, a histogram, a pie chart, a line chart, or a scatter chart. Click the Settings icon in the upper-right corner of a chart to adjust the X-axis and Y-axis, or click the WebExcel icon to edit the execution results in Excel mode.

The execution results are saved to the variable df0. df0.values is a standard pandas DataFrame. DSW has optimized how the returned DataFrame is displayed, so that you can view the execution results in WebExcel mode or as charts.
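
To illustrate how a cell that contains several semicolon-separated statements breaks down into individual statements, here is a minimal sketch. It is not how DSW itself parses SQL: `split_sql_statements` is a hypothetical helper, and it assumes that semicolons inside single-quoted or double-quoted strings do not end a statement. It does not handle SQL comments or escaped quotes.

```python
from typing import List


def split_sql_statements(script: str) -> List[str]:
    """Split a SQL script on semicolons that are outside quoted strings."""
    statements = []
    current = []
    quote = None  # The quote character we are currently inside, if any.
    for ch in script:
        if quote is not None:
            current.append(ch)
            if ch == quote:
                quote = None  # Closing quote found.
        elif ch in ("'", '"'):
            quote = ch
            current.append(ch)
        elif ch == ";":
            # A semicolon outside quotes ends the current statement.
            statement = "".join(current).strip()
            if statement:
                statements.append(statement)
            current = []
        else:
            current.append(ch)
    # Keep a trailing statement that has no terminating semicolon.
    tail = "".join(current).strip()
    if tail:
        statements.append(tail)
    return statements


# The table and column names below are placeholders.
cell = "SELECT * FROM t1 LIMIT 10; SELECT name FROM t2 WHERE tag = 'a;b'"
for statement in split_sql_statements(cell):
    print(statement)
```

A check like this can help you confirm that a cell contains the statements you expect before you run it against a MaxCompute project.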

The big data analytics features provided by DSW make it easier to ingest data, write SQL statements, and analyze data. DSW converts the execution results of SQL statements to standard pandas DataFrames, and trained models can be deployed as services faster than before. Together, these features improve the efficiency of algorithm development.