This topic describes how to reference a third-party package in a PyODPS node.

Prerequisites

The following operations are completed:

Procedure

  1. Download the files that are listed in the following table. Note that the .whl files are uploaded under resource names that end in .zip, as shown in the Resource name column.
    Package name     File name                                              Resource name
    python-dateutil  python-dateutil-2.6.0.zip                              python-dateutil.zip
    pytz             pytz-2017.2.zip                                        pytz.zip
    six              six-1.11.0.tar.gz                                      six.tar.gz
    pandas           pandas-0.20.2-cp27-cp27m-manylinux1_x86_64.whl         pandas.zip
    scipy            scipy-0.19.0-cp27-cp27m-manylinux1_x86_64.whl          scipy.zip
    scikit-learn     scikit_learn-0.18.1-cp27-cp27m-manylinux1_x86_64.whl   sklearn.zip
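    Because the downloaded file names differ from the resource names in the table, the files must be renamed before they are uploaded. The following sketch automates the renaming; the `RESOURCE_NAMES` mapping mirrors the table above, and the `rename_downloads` helper name is hypothetical.

```python
import os

# Downloaded file name -> resource name, exactly as listed in the table above.
RESOURCE_NAMES = {
    'python-dateutil-2.6.0.zip': 'python-dateutil.zip',
    'pytz-2017.2.zip': 'pytz.zip',
    'six-1.11.0.tar.gz': 'six.tar.gz',
    'pandas-0.20.2-cp27-cp27m-manylinux1_x86_64.whl': 'pandas.zip',
    'scipy-0.19.0-cp27-cp27m-manylinux1_x86_64.whl': 'scipy.zip',
    'scikit_learn-0.18.1-cp27-cp27m-manylinux1_x86_64.whl': 'sklearn.zip',
}

def rename_downloads(directory):
    """Rename the downloaded packages in `directory` to their resource names."""
    for filename, resource_name in RESOURCE_NAMES.items():
        src = os.path.join(directory, filename)
        if os.path.exists(src):
            os.rename(src, os.path.join(directory, resource_name))
```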
  2. Log on to the DataWorks console.
  3. Create a workflow.
    1. On the homepage of the DataWorks console, click Workspaces in the left-side navigation pane. On the Workspaces page, find the target workspace and click Data Analytics in the Actions column. On the Data Development tab, right-click Business process and select New business process.
    2. In the New business process dialog box, set the Business Name parameter and click New.
  4. Create and commit resources.
    1. Right-click the workflow that you created in the Business process section and choose New > MaxCompute > Resource > Archive.
    2. In the New resource dialog box, click Click Upload and upload the python-dateutil-2.6.0.zip package.
    3. Set the Resource Name parameter to python-dateutil.zip and click Confirm.
    4. Click the Submit icon in the toolbar to commit the resource.
    5. Use the same method to create and commit the resources named pytz.zip, six.tar.gz, pandas.zip, sklearn.zip, and scipy.zip.
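    As an alternative to uploading each file in the console, resources can also be created programmatically with the PyODPS `create_resource` method. The following is a sketch; `upload_archives` is a hypothetical helper name, and `odps_entry` stands for an initialized ODPS entry object (in a DataWorks PyODPS node, the built-in `o` object serves this role).

```python
def upload_archives(odps_entry, file_paths):
    """Create MaxCompute archive resources from local package files (sketch).

    `odps_entry` is an initialized odps.ODPS object; in a DataWorks PyODPS
    node, the built-in `o` object can be passed instead.
    """
    import os

    for path in file_paths:
        name = os.path.basename(path)  # e.g. 'pytz.zip'
        with open(path, 'rb') as f:
            # 'archive' matches the Archive resource type chosen in the console.
            odps_entry.create_resource(name, 'archive', file_obj=f)
```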
  5. Create a PyODPS 2 node.
    1. Right-click the workflow that you created and choose New > MaxCompute > PyODPS 2.
    2. In the New node dialog box, set the Node name parameter and click Submit. We recommend that you set the Node name parameter to a value that indicates the purpose of the node, such as its use of third-party packages.
    3. On the configuration tab of the PyODPS 2 node, enter the code of the node in the code editor.
      In this example, enter the following code:
      def test(x):
          # These imports resolve against the uploaded archive resources at run time.
          from sklearn import datasets, svm
          from scipy import misc
          import numpy as np

          # Sanity checks: the bundled scikit-learn, NumPy, and SciPy all work.
          iris = datasets.load_iris()
          assert iris.data.shape == (150, 4)
          assert np.array_equal(np.unique(iris.target), [0, 1, 2])

          # Train a linear SVM classifier and verify a prediction.
          clf = svm.LinearSVC()
          clf.fit(iris.data, iris.target)
          pred = clf.predict([[5.0, 3.6, 1.3, 0.25]])
          assert pred[0] == 0

          assert misc.face().shape is not None

          return x

      # Session-level isolation must be enabled to run third-party code.
      hints = {
          'odps.isolation.session.enable': True
      }
      # The archive resources that you committed in the previous step.
      libraries = ['python-dateutil.zip', 'pytz.zip', 'six.tar.gz', 'pandas.zip', 'scipy.zip', 'sklearn.zip']

      iris = o.get_table('pyodps_iris').to_df()

      # PyODPS 2 nodes run Python 2, so the print statement is valid here.
      print iris[:1].sepallength.map(test).execute(hints=hints, libraries=libraries)
                                  
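      Instead of passing hints and libraries on every execute() call, PyODPS also lets you set them once through global options. The following is a sketch; the helper names `pyodps_library_settings` and `apply_globally` are hypothetical, while `options.sql.settings` and `options.df.libraries` are PyODPS configuration options.

```python
def pyodps_library_settings():
    """Return the hints and resource list used in this topic as one bundle."""
    hints = {'odps.isolation.session.enable': True}
    libraries = ['python-dateutil.zip', 'pytz.zip', 'six.tar.gz',
                 'pandas.zip', 'scipy.zip', 'sklearn.zip']
    return hints, libraries

def apply_globally():
    """Apply the settings via PyODPS global options (requires PyODPS;
    the import is deferred so this sketch can be loaded without it)."""
    from odps import options

    hints, libraries = pyodps_library_settings()
    options.sql.settings = hints       # passed as hints to the generated SQL
    options.df.libraries = libraries   # used by subsequent DataFrame calls
```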
  6. Click the Run icon in the toolbar.
  7. View the running result of the PyODPS 2 node on the Run Log tab.
    Sql compiled:
    CREATE TABLE tmp_pyodps_a3172c30_a0d7_4c88_bc39_434168263897 LIFECYCLE 1 AS
    SELECT pyodps_udf_1576485276_94d9d978_af66_4e27_a874_e787022dfb3d(t1.`sepallength`) AS `sepallength`
    FROM WB_BestPractice_dev.`pyodps_iris` t1
    LIMIT 1
    
    Instance ID: 20191216083438175gcv6n4pr2
      Log view: http://logview.odps.aliyun.com/logview/?h=xxxxxx
    
       sepallength
    0          5.1