This topic describes how to reference a third-party package in a PyODPS node.

Prerequisites

  A DataWorks workspace is created, and a MaxCompute compute engine instance is associated with the workspace.

Procedure

  1. Download the packages that are listed in the following table.
    Package          File                                                    Resource file
    python-dateutil  python-dateutil-2.6.0.zip                               python-dateutil.zip
    pytz             pytz-2017.2.zip                                         pytz.zip
    six              six-1.11.0.tar.gz                                       six.tar.gz
    pandas           pandas-0.20.2-cp27-cp27m-manylinux1_x86_64.whl          pandas.zip
    scipy            scipy-0.19.0-cp27-cp27m-manylinux1_x86_64.whl           scipy.zip
    scikit-learn     scikit_learn-0.18.1-cp27-cp27m-manylinux1_x86_64.whl    sklearn.zip
    Note The pandas, scipy, and scikit-learn packages are downloaded as .whl files. Because a .whl file is itself a ZIP archive, rename each of these files to the name that is listed in the Resource file column, as shown in the sketch after this step.
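    The renaming can be scripted. The following is a minimal sketch; it assumes that the downloaded .whl files are in the current directory and keep the exact names from the File column:

      import shutil

      # A .whl file is a ZIP archive, so copying it under the name in the
      # Resource file column is sufficient.
      WHEEL_TO_RESOURCE = {
          'pandas-0.20.2-cp27-cp27m-manylinux1_x86_64.whl': 'pandas.zip',
          'scipy-0.19.0-cp27-cp27m-manylinux1_x86_64.whl': 'scipy.zip',
          'scikit_learn-0.18.1-cp27-cp27m-manylinux1_x86_64.whl': 'sklearn.zip',
      }
      for wheel, resource in WHEEL_TO_RESOURCE.items():
          shutil.copyfile(wheel, resource)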
  2. Log on to the DataWorks console.
  3. Create a workflow.
    1. On the DataStudio page, right-click Business Flow and select Create Workflow.
    2. In the Create Workflow dialog box, specify Workflow Name and click Create.
  4. Create and commit resources.
    1. On the DataStudio page, move the pointer over the Create icon and choose MaxCompute > Resource > Archive.
      Alternatively, you can expand Business Flow, right-click a workflow, and choose Create > MaxCompute > Resource > Archive.
    2. In the Create Resource dialog box, click Upload and select the python-dateutil-2.6.0.zip file.
    3. Enter python-dateutil.zip in the Resource Name field and click OK.
    4. Click the Commit icon to complete the upload.
    5. Repeat the preceding steps to create and commit the resource files named pytz.zip, six.tar.gz, pandas.zip, sklearn.zip, and scipy.zip. A scripted alternative is sketched after this step.
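    If you prefer to script the uploads, the PyODPS SDK can create archive resources in the MaxCompute project directly. The following is a minimal sketch; it assumes that o is an initialized ODPS entry object and that the prepared files are in the current directory. Resources created this way go straight to the MaxCompute project and do not appear as files in DataStudio:

      # Upload each prepared file as an archive resource.
      for name in ['python-dateutil.zip', 'pytz.zip', 'six.tar.gz', 'pandas.zip', 'sklearn.zip', 'scipy.zip']:
          with open(name, 'rb') as f:
              # create_resource fails if a resource with the same name exists;
              # call o.delete_resource(name) first to replace it.
              o.create_resource(name, 'archive', file_obj=f)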
  5. Create a PyODPS 2 node.
    1. Right-click the workflow that you created and choose Create > MaxCompute > PyODPS 2.
    2. In the Create Node dialog box, specify Node Name and click Commit.
    3. On the configuration tab of the PyODPS 2 node, enter the following code in the code editor.
      Sample code:
      def test(x):
          # These imports run inside the MaxCompute worker, where the
          # uploaded third-party packages are extracted and made available.
          from sklearn import datasets, svm
          from scipy import misc
          import numpy as np

          # Check that scikit-learn can load its bundled iris dataset.
          iris = datasets.load_iris()
          assert iris.data.shape == (150, 4)
          assert np.array_equal(np.unique(iris.target), [0, 1, 2])

          # Train a linear SVM and verify a prediction on a known sample.
          clf = svm.LinearSVC()
          clf.fit(iris.data, iris.target)
          pred = clf.predict([[5.0, 3.6, 1.3, 0.25]])
          assert pred[0] == 0

          # Check that scipy works by loading its sample image.
          assert misc.face().shape is not None

          return x

      # Isolation must be enabled before third-party libraries can be
      # used in user-defined functions.
      hints = {
          'odps.isolation.session.enable': True
      }
      # The archive resources that were committed in the previous step.
      libraries = ['python-dateutil.zip', 'pytz.zip', 'six.tar.gz', 'pandas.zip', 'scipy.zip', 'sklearn.zip']

      # o is the ODPS entry object that DataWorks provides in PyODPS nodes.
      # The pyodps_iris table must already exist in the MaxCompute project.
      iris = o.get_table('pyodps_iris').to_df()

      # PyODPS 2 nodes run Python 2.7, so the print statement is valid.
      print iris[:1].sepallength.map(test).execute(hints=hints, libraries=libraries)
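      If every DataFrame call in the node needs the same libraries, PyODPS can also register them once through its global options (options.sql.settings and options.df.libraries) instead of passing hints and libraries to each execute call. A minimal sketch of the same query with global options:

      from odps import options

      # Register the isolation hint and the libraries globally; later
      # DataFrame executions pick them up automatically.
      options.sql.settings = {'odps.isolation.session.enable': True}
      options.df.libraries = ['python-dateutil.zip', 'pytz.zip', 'six.tar.gz', 'pandas.zip', 'scipy.zip', 'sklearn.zip']

      iris = o.get_table('pyodps_iris').to_df()
      print iris[:1].sepallength.map(test).execute()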
                                  
  6. Click the Run icon.
  7. View the result of the PyODPS 2 node on the Run Log tab. The following information is returned:
    Sql compiled:
    CREATE TABLE tmp_pyodps_a3172c30_a0d7_4c88_bc39_434168263897 LIFECYCLE 1 AS
    SELECT pyodps_udf_1576485276_94d9d978_af66_4e27_a874_e787022dfb3d(t1.`sepallength`) AS `sepallength`
    FROM WB_BestPractice_dev.`pyodps_iris` t1
    LIMIT 1
    
    Instance ID: 20191216083438175gcv6n4pr2
      Log view: http://logview.odps.aliyun.com/logview/?h=xxxxxx
    
       sepallength
    0          5.1
    Note For more best practices, see Use a PyODPS node to segment Chinese text based on Jieba.