This topic describes how to use a PyODPS node to process NULL values.
Prerequisites
- MaxCompute is activated. For more information, see Activate MaxCompute.
- DataWorks is activated. For more information, see Activate DataWorks.
- A workflow is created in the DataWorks console. In this example, a workflow is created for a DataWorks workspace in basic mode. For more information, see Create a workflow.
Procedure
- Prepare test data.
- Log on to the DataWorks console.
- Create a table and import data to the table. For more information, see Create tables and import data.Execute the following statement to create a table named pytable2:
Import the data in the pytable2.txt file to the pytable2 table. The pytable2.txt file contains the following data:CREATE TABLE `pytable2` ( `id` string, `name` string, `f1` double, `f2` double, `f3` double, `f4` double ) ;
0, name1, 1.0, NaN, 3.0, 4.0 1, name1, 2.0, NaN, NaN, 1.0 2, name1, 3.0, 4.0, 1.0, NaN 3, name1, NaN, 1.0, 2.0, 3.0 4, name1, 1.0, NaN, 3.0, 4.0 5, name1, 1.0, 2.0, 3.0, 4.0 6, name1, NaN, NaN, NaN, NaN
- In the left-side navigation pane, click Workspaces.
- On the page that appears, find the workspace that you want to manage and click Data Analytics in the Actions column.
- On the DataStudio page, right-click a workflow and choose New > MaxCompute > PyODPS 2.
- In the New node dialog box, specify the Node name parameter and click Submit.
- On the configuration tab of the PyODPS 2 node, enter the code of the node in the code
editor.In this example, enter the following code:
df2 = DataFrame(o.get_table('pytable2')) # Call the dropna method to delete the rows that contain NULL values. print df2.dropna(subset=['f1','f2','f3','f4']).head() # If you do not want to delete the rows that also contain non-NULL values, specify how='all' for the dropna method. print df2.dropna(how='all', subset=['f1','f2','f3','f4']).head() print df2.dropna(thresh=3, subset=['f1', 'f2', 'f3', 'f4']).head() # Call the fillna method to replace NULL values with a specified constant or values in an existing column. print df2.fillna(100, subset=['f1','f2','f3','f4']).head() # Replace NULL values with the values in an existing column. print df2.fillna(df2.f2, subset=['f1','f2','f3','f4']).head() # Replace each NULL value with the value in the same column of the previous row. print df2.fillna(method='bfill', subset=['f1', 'f2', 'f3', 'f4']).head() # Replace each NULL value with the value in the same column of the following row. print df2.fillna(method='ffill', subset=['f1', 'f2', 'f3', 'f4']).head()
- Click the Run icon in the toolbar.
- View the running result of the PyODPS 2 node on the Run Log tab.In this example, the following information appears on the Run Log tab:
Sql compiled: CREATE TABLE tmp_pyodps_d0c7d8c2_be38_4d48_b0eb_e89bae5bde01 LIFECYCLE 1 AS SELECT * FROM WB_BestPractice_dev.`pytable2` t1 WHERE (((IF(t1.`f1` IS NOT NULL, 1, 0) + IF(t1.`f2` IS NOT NULL, 1, 0)) + IF(t1.`f3` IS NOT NULL, 1, 0)) + IF(t1.`f4` IS NOT NULL, 1, 0)) >= 4 Instance ID: 20191010071154980g2hic292 id name f1 f2 f3 f4 0 0 name1 1.0 NaN 3.0 4.0 1 1 name1 2.0 NaN NaN 1.0 2 2 name1 3.0 4.0 1.0 NaN 3 3 name1 NaN 1.0 2.0 3.0 4 4 name1 1.0 NaN 3.0 4.0 5 5 name1 1.0 2.0 3.0 4.0 6 6 name1 NaN NaN NaN NaN Sql compiled: CREATE TABLE tmp_pyodps_49b46768_f589_48f6_be8a_b7139f31f6f2 LIFECYCLE 1 AS SELECT * FROM WB_BestPractice_dev.`pytable2` t1 WHERE (((IF(t1.`f1` IS NOT NULL, 1, 0) + IF(t1.`f2` IS NOT NULL, 1, 0)) + IF(t1.`f3` IS NOT NULL, 1, 0)) + IF(t1.`f4` IS NOT NULL, 1, 0)) >= 1 Instance ID: 20191010071159759g0dk9592 id name f1 f2 f3 f4 0 0 name1 1.0 NaN 3.0 4.0 1 1 name1 2.0 NaN NaN 1.0 2 2 name1 3.0 4.0 1.0 NaN 3 3 name1 NaN 1.0 2.0 3.0 4 4 name1 1.0 NaN 3.0 4.0 5 5 name1 1.0 2.0 3.0 4.0 6 6 name1 NaN NaN NaN NaN Sql compiled: CREATE TABLE tmp_pyodps_7f941800_1539_415b_9257_283ebeb893a6 LIFECYCLE 1 AS SELECT * FROM WB_BestPractice_dev.`pytable2` t1 WHERE (((IF(t1.`f1` IS NOT NULL, 1, 0) + IF(t1.`f2` IS NOT NULL, 1, 0)) + IF(t1.`f3` IS NOT NULL, 1, 0)) + IF(t1.`f4` IS NOT NULL, 1, 0)) >= 3 Instance ID: 20191010071204544giyswx7 0 0 name1 1.0 NaN 3.0 4.0 1 1 name1 2.0 NaN NaN 1.0 2 2 name1 3.0 4.0 1.0 NaN 3 3 name1 NaN 1.0 2.0 3.0 4 4 name1 1.0 NaN 3.0 4.0 5 5 name1 1.0 2.0 3.0 4.0 6 6 name1 NaN NaN NaN NaN Sql compiled: CREATE TABLE tmp_pyodps_16d6ea6d_5195_4e4c_8346_644a395852f7 LIFECYCLE 1 AS SELECT t1.`id`, t1.`name`, IF(t1.`f1` IS NULL, 100, t1.`f1`) AS `f1`, IF(t1.`f2` IS NULL, 100, t1.`f2`) AS `f2`, IF(t1.`f3` IS NULL, 100, t1.`f3`) AS `f3`, IF(t1.`f4` IS NULL, 100, t1.`f4`) AS `f4` FROM WB_BestPractice_dev.`pytable2` t1 Instance ID: 20191010071209190gyl56292 id name f1 f2 f3 f4 0 0 name1 1.0 NaN 3.0 4.0 1 1 name1 2.0 NaN NaN 1.0 2 2 name1 3.0 4.0 1.0 NaN 3 3 name1 NaN 1.0 2.0 3.0 4 4 name1 1.0 NaN 3.0 4.0 5 5 name1 1.0 2.0 3.0 4.0 6 6 name1 NaN NaN NaN NaN Sql compiled: CREATE TABLE tmp_pyodps_40755ebd_2d2a_482e_b360_3f3da0d5422c LIFECYCLE 1 AS SELECT t1.`id`, t1.`name`, IF(t1.`f1` IS NULL, t1.`f2`, t1.`f1`) AS `f1`, IF(t1.`f2` IS NULL, t1.`f2`, t1.`f2`) AS `f2`, IF(t1.`f3` IS NULL, t1.`f2`, t1.`f3`) AS `f3`, IF(t1.`f4` IS NULL, t1.`f2`, t1.`f4`) AS `f4` FROM WB_BestPractice_dev.`pytable2` t1 Instance ID: 20191010071213970gbp66792 id name f1 f2 f3 f4 0 0 name1 1.0 NaN 3.0 4.0 1 1 name1 2.0 NaN NaN 1.0 2 2 name1 3.0 4.0 1.0 NaN 3 3 name1 NaN 1.0 2.0 3.0 4 4 name1 1.0 NaN 3.0 4.0 5 5 name1 1.0 2.0 3.0 4.0 6 6 name1 NaN NaN NaN NaN Sql compiled: CREATE TABLE tmp_pyodps_d39fcce1_d8a9_4cc2_8aff_2ed1e9c6bb1b LIFECYCLE 1 AS SELECT pyodps_udf_1570691538_d9441c59_c666_4a5d_8154_67d8bc8c24ad(t1.`id`, t1.`name`, t1.`f1`, t1.`f2`, t1.`f3`, t1.`f4`) AS (`id`, `name`, `f1`, `f2`, `f3`, `f4`) FROM WB_BestPractice_dev.`pytable2` t1 Instance ID: 20191010071219627goqv9292 id name f1 f2 f3 f4 0 0 name1 1.0 3.0 3.0 4.0 1 1 name1 2.0 1.0 1.0 1.0 2 2 name1 3.0 4.0 1.0 NaN 3 3 name1 1.0 1.0 2.0 3.0 4 4 name1 1.0 3.0 3.0 4.0 5 5 name1 1.0 2.0 3.0 4.0 6 6 name1 NaN NaN NaN NaN Sql compiled: CREATE TABLE tmp_pyodps_3f190cf0_f9fb_4e06_a942_ab31c0241cd3 LIFECYCLE 1 AS SELECT pyodps_udf_1570691566_0330848b_82d3_411c_88e1_cbbcc6adb9c1(t1.`id`, t1.`name`, t1.`f1`, t1.`f2`, t1.`f3`, t1.`f4`) AS (`id`, `name`, `f1`, `f2`, `f3`, `f4`) FROM WB_BestPractice_dev.`pytable2` t1 Instance ID: 20191010071247729gt776792 id name f1 f2 f3 f4 0 0 name1 1.0 1.0 3.0 4.0 1 1 name1 2.0 2.0 2.0 1.0 2 2 name1 3.0 4.0 1.0 1.0 3 3 name1 NaN 1.0 2.0 3.0 4 4 name1 1.0 1.0 3.0 4.0 5 5 name1 1.0 2.0 3.0 4.0 6 6 name1 NaN NaN NaN NaN