This topic describes how to reference a third-party package in a PyODPS node.
Prerequisites
- MaxCompute is activated. For more information, see Activate MaxCompute.
- DataWorks is activated. For more information, see Activate DataWorks.
Procedure
- Download the packages that are listed in the following table.
  | Package | File | Resource file |
  | --- | --- | --- |
  | python-dateutil | python-dateutil-2.6.0.zip | python-dateutil.zip |
  | pytz | pytz-2017.2.zip | pytz.zip |
  | six | six-1.11.0.tar.gz | six.tar.gz |
  | pandas | pandas-0.20.2-cp27-cp27m-manylinux1_x86_64.whl | pandas.zip |
  | scipy | scipy-0.19.0-cp27-cp27m-manylinux1_x86_64.whl | scipy.zip |
  | scikit-learn | scikit_learn-0.18.1-cp27-cp27m-manylinux1_x86_64.whl | sklearn.zip |

  Note: You must manually convert the downloaded pandas, scipy, and scikit-learn packages into the .zip files shown in the Resource file column before you upload them. A .whl file is itself a ZIP archive, so renaming the extension is sufficient; see the sketch after this step.
- Log on to the DataWorks console.
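The renaming mentioned in the note above can be scripted. The following is a minimal sketch that assumes the downloaded .whl files sit in the current working directory; the file names are taken from the table, and the paths are illustrative:

    import os

    # Map each downloaded .whl file to the resource name DataWorks expects.
    # A .whl is a ZIP archive, so renaming the extension is enough.
    renames = {
        'pandas-0.20.2-cp27-cp27m-manylinux1_x86_64.whl': 'pandas.zip',
        'scipy-0.19.0-cp27-cp27m-manylinux1_x86_64.whl': 'scipy.zip',
        'scikit_learn-0.18.1-cp27-cp27m-manylinux1_x86_64.whl': 'sklearn.zip',
    }
    for src, dst in renames.items():
        os.rename(src, dst)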
- Create a workflow.
- On the DataStudio page, right-click Business Flow and select Create Workflow.
- In the Create Workflow dialog box, specify Workflow Name and click Create.
- Create and commit resources.
- On the DataStudio page, move the pointer over the Create icon and choose MaxCompute > Resource > Archive. Alternatively, you can unfold Business Flow, right-click a workflow, and choose Create > MaxCompute > Resource > Archive.
- In the Create Resource dialog box, click Upload and select the python-dateutil-2.6.0.zip file.
- Enter python-dateutil.zip in the Resource Name field and click OK.
- Click the Commit icon to commit the resource.
- Repeat the preceding steps to create and commit the resource files named pytz.zip, six.tar.gz, pandas.zip, sklearn.zip, and scipy.zip.
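If you prefer to script these uploads instead of clicking through the console, PyODPS can also create archive resources programmatically. The following is an optional sketch, not part of the console procedure; it assumes you run it locally with your own credentials and with the renamed files in the current working directory:

    from odps import ODPS

    # Assumption: replace the placeholders with your own credentials,
    # project name, and endpoint.
    o = ODPS('<access_id>', '<secret_access_key>', '<project>',
             endpoint='<endpoint>')

    resource_files = ['python-dateutil.zip', 'pytz.zip', 'six.tar.gz',
                      'pandas.zip', 'scipy.zip', 'sklearn.zip']
    for name in resource_files:
        with open(name, 'rb') as f:
            # Create each file as an archive resource so that it can be
            # referenced later through the `libraries` option.
            o.create_resource(name, 'archive', file_obj=f)

Resources created this way belong to the project, just like resources committed from the DataStudio console.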
- Create a PyODPS 2 node.
- Right-click the workflow that you created and choose Create > MaxCompute > PyODPS 2.
- In the Create Node dialog box, specify Node Name and click Commit.
- On the tab of the PyODPS 2 node, enter the code of the node in the code editor. Sample code:
    def test(x):
        # Import the third-party packages inside the function so that they
        # are resolved on the MaxCompute workers where the uploaded
        # archives are unpacked.
        from sklearn import datasets, svm
        from scipy import misc
        import numpy as np

        iris = datasets.load_iris()
        assert iris.data.shape == (150, 4)
        assert np.array_equal(np.unique(iris.target), [0, 1, 2])

        # Train a linear SVM on the bundled iris dataset and sanity-check
        # a prediction to verify that scikit-learn works on the workers.
        clf = svm.LinearSVC()
        clf.fit(iris.data, iris.target)
        pred = clf.predict([[5.0, 3.6, 1.3, 0.25]])
        assert pred[0] == 0
        assert misc.face().shape is not None
        return x

    from odps import options

    # Enable isolation, which is required to reference third-party packages.
    hints = {
        'odps.isolation.session.enable': True
    }
    libraries = ['python-dateutil.zip', 'pytz.zip', 'six.tar.gz',
                 'pandas.zip', 'scipy.zip', 'sklearn.zip']

    iris = o.get_table('pyodps_iris').to_df()
    print iris[:1].sepallength.map(test).execute(hints=hints, libraries=libraries)
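As a variant, PyODPS lets you set the libraries once through a global option instead of passing `libraries` to every execute call; subsequent DataFrame executions then pick them up automatically. A short sketch using the same resource names:

    from odps import options

    # Applies to all subsequent DataFrame executions in this node, so
    # individual execute() calls no longer need a `libraries=` argument.
    options.df.libraries = ['python-dateutil.zip', 'pytz.zip', 'six.tar.gz',
                            'pandas.zip', 'scipy.zip', 'sklearn.zip']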
- Click the Run icon to run the node.
- View the running result of the PyODPS 2 node on the Run Log tab.
    Sql compiled:
    CREATE TABLE tmp_pyodps_a3172c30_a0d7_4c88_bc39_434168263897 LIFECYCLE 1 AS
    SELECT pyodps_udf_1576485276_94d9d978_af66_4e27_a874_e787022dfb3d(t1.`sepallength`) AS `sepallength`
    FROM WB_BestPractice_dev.`pyodps_iris` t1
    LIMIT 1

    Instance ID: 20191216083438175gcv6n4pr2
    Log view: http://logview.odps.aliyun.com/logview/?h=xxxxxx

       sepallength
    0          5.1
Note: For more information about best practices, see Use a PyODPS node to segment Chinese text based on Jieba.