This topic describes how to reference a third-party package in a PyODPS node.

Prerequisites

Procedure

  1. Download the packages that are listed in the following table.
    Package          Download file                                         Resource name
    python-dateutil  python-dateutil-2.6.0.zip                             python-dateutil.zip
    pytz             pytz-2017.2.zip                                       pytz.zip
    six              six-1.11.0.tar.gz                                     six.tar.gz
    pandas           pandas-0.20.2-cp27-cp27m-manylinux1_x86_64.whl        pandas.zip
    scipy            scipy-0.19.0-cp27-cp27m-manylinux1_x86_64.whl         scipy.zip
    scikit-learn     scikit_learn-0.18.1-cp27-cp27m-manylinux1_x86_64.whl  sklearn.zip
    Note
    • Change the file name extensions of the pandas, scipy, and scikit-learn packages from .whl to .zip, and rename each downloaded file to match the name listed in the Resource name column of the preceding table.
    • You upload the renamed files as resources of the Archive type in Step 4.
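
The renaming described in the note can be scripted. The following is a minimal sketch, not part of the official procedure: the function name prepare_resources and the directory arguments are hypothetical, and the mapping is copied from the table in this step. It copies the files instead of renaming them in place, so the original downloads are kept.

```python
import os
import shutil

# Mapping copied from the table in step 1: downloaded file -> resource name.
RESOURCE_NAMES = {
    "python-dateutil-2.6.0.zip": "python-dateutil.zip",
    "pytz-2017.2.zip": "pytz.zip",
    "six-1.11.0.tar.gz": "six.tar.gz",
    "pandas-0.20.2-cp27-cp27m-manylinux1_x86_64.whl": "pandas.zip",
    "scipy-0.19.0-cp27-cp27m-manylinux1_x86_64.whl": "scipy.zip",
    "scikit_learn-0.18.1-cp27-cp27m-manylinux1_x86_64.whl": "sklearn.zip",
}


def prepare_resources(download_dir, output_dir):
    """Copy each downloaded package to output_dir under its resource name.

    Hypothetical helper. Returns the list of files written and raises
    IOError if one of the expected downloads is missing.
    """
    if not os.path.isdir(output_dir):
        os.makedirs(output_dir)
    written = []
    for source_name, resource_name in sorted(RESOURCE_NAMES.items()):
        source_path = os.path.join(download_dir, source_name)
        if not os.path.isfile(source_path):
            raise IOError("missing download: " + source_path)
        target_path = os.path.join(output_dir, resource_name)
        # A .whl file is already a zip archive, so a plain copy is enough.
        shutil.copyfile(source_path, target_path)
        written.append(target_path)
    return written
```

For example, prepare_resources("downloads", "resources") would leave six files in the resources directory, named as the Resource name column requires.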
  2. Log on to the DataWorks console.
  3. Create a workflow.
    1. In the left-side navigation pane, click Workspaces.
    2. On the Workspaces page, select a project and click Data Analytics in the Actions column.
    3. On the DataStudio page, right-click Business Flow and select Create Workflow.
    4. In the Create Workflow dialog box, enter a workflow name in the Workflow Name field and click Create.
  4. Create and commit resources.
    1. On the DataStudio page, move the pointer over the Create icon and choose MaxCompute > Resource > Archive.
      You can also expand Business Flow, right-click a workflow, and then choose Create > MaxCompute > Resource > Archive.
    2. In the New resource dialog box, click Click Upload and select the python-dateutil-2.6.0.zip file from your on-premises machine.
    3. Enter python-dateutil.zip in the Resource Name field and click Confirm.
      Note For more information about the naming rules of resource files, see Step 1.
    4. Click the Commit icon. In the Commit dialog box, enter the change description and click Commit.
    5. Repeat the preceding steps to create and commit the resource files named pytz.zip, six.tar.gz, pandas.zip, sklearn.zip, and scipy.zip.
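
The six uploads in this step can also be sketched in code. This is an alternative to the console steps above, not something this topic prescribes. It assumes that the renamed files are in one local directory and that entry is a PyODPS entry object whose create_resource method accepts a resource name, the resource type 'archive', and a file_obj keyword argument, as in the PyODPS resource documentation. The helper name upload_archives and the directory argument are hypothetical.

```python
import os

# Resource names from the table in step 1.
ARCHIVE_RESOURCES = [
    "python-dateutil.zip", "pytz.zip", "six.tar.gz",
    "pandas.zip", "scipy.zip", "sklearn.zip",
]


def upload_archives(entry, resource_dir, names=ARCHIVE_RESOURCES):
    """Upload each file in resource_dir as an Archive resource.

    Hypothetical helper. entry is assumed to be a PyODPS entry object;
    the call fails if a resource with the same name already exists.
    """
    uploaded = []
    for name in names:
        path = os.path.join(resource_dir, name)
        with open(path, "rb") as file_obj:
            # Assumed PyODPS call: create_resource(name, type, file_obj=...).
            entry.create_resource(name, "archive", file_obj=file_obj)
        uploaded.append(name)
    return uploaded
```

Resources created this way belong to the MaxCompute project directly, so they may not appear in the DataStudio resource tree that the console steps populate.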
  5. Create a PyODPS node.
    1. On the DataStudio page, right-click the created workflow and choose MaxCompute > Data Development > Create > PyODPS 2.
    2. In the Create Node dialog box, enter a node name in the Node Name field and click Commit.
    3. In the PyODPS node, enter code and click the Run icon.
      Sample code:
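      def test(x):
          # Import inside the function so that the packages are loaded on the
          # MaxCompute side, from the uploaded resources.
          from sklearn import datasets, svm
          from scipy import misc
          import numpy as np

          # Check that scikit-learn can load its bundled iris dataset.
          iris = datasets.load_iris()
          assert iris.data.shape == (150, 4)
          assert np.array_equal(np.unique(iris.target), [0, 1, 2])

          # Check that a model can be trained and used for prediction.
          clf = svm.LinearSVC()
          clf.fit(iris.data, iris.target)
          pred = clf.predict([[5.0, 3.6, 1.3, 0.25]])
          assert pred[0] == 0

          # Check that scipy can load its sample image.
          assert misc.face().shape is not None

          return x

      # Enable session isolation so that packages that contain binary code can run.
      hints = {
          'odps.isolation.session.enable': True
      }
      # Names of the Archive resources that were committed in the preceding step.
      libraries = ['python-dateutil.zip', 'pytz.zip', 'six.tar.gz', 'pandas.zip', 'scipy.zip', 'sklearn.zip']

      # o is the MaxCompute entry object that DataWorks provides in PyODPS nodes.
      iris = o.get_table('pyodps_iris').to_df()

      # Apply test() to the first row. The node succeeds only if all imports work.
      print(iris[:1].sepallength.map(test).execute(hints=hints, libraries=libraries))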
  6. View the running result of the node on the Run Log tab.
    In this example, the following information appears on the Run Log tab:
    Sql compiled:
    CREATE TABLE tmp_pyodps_a3172c30_a0d7_4c88_bc39_434168263897 LIFECYCLE 1 AS
    SELECT pyodps_udf_1576485276_94d9d978_af66_4e27_a874_e787022dfb3d(t1.`sepallength`) AS `sepallength`
    FROM WB_BestPractice_dev.`pyodps_iris` t1
    LIMIT 1
    
    Instance ID: 20191216083438175gcv6n4pr2
      Log view: http://logview.odps.aliyun.com/logview/?h=xxxxxx
    
       sepallength
    0          5.1
    Note For more information about best practices, see Use a PyODPS node to segment Chinese text based on Jieba.