When you develop Python compute nodes, your business scenario may require additional resource packages. Dataphin provides a set of pre-installed resource packages that you can use without installing anything: add an import <package_name> statement at the beginning of your code, for example import configparser.
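For instance, the following minimal sketch imports one of the pre-installed packages and reads an illustrative in-memory configuration; no installation step is needed, and the configuration contents are made up for the example.
import configparser   # one of the pre-installed packages; no pip install needed

# Parse an in-memory configuration string (contents are illustrative).
parser = configparser.ConfigParser()
parser.read_string("[db]\nhost = 127.0.0.1\nport = 3306\n")
print(parser.get("db", "host"))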
List of built-in resource packages
The following table lists the resource packages that are built into Dataphin. You can also run the pip list command in a Shell compute node to view the built-in package modules.
| Resource package | Version | Scenario |
| --- | --- | --- |
| configparser | >=3.5.0 | Read configuration files. |
| DateTime | None | Data processing. |
| hdfs | >=2.1.0 | Use HDFS with the Hadoop compute engine. |
| jumpssh | None | Connect to a server through a jump server. |
| mysql-connector-python | >=8.0.11 | Connect to and operate MySQL. |
| numpy | None | Basic algorithm processing. |
| pandas | None | Basic algorithm processing. |
| psycopg2 | >=2.7.4 | Connect to and operate PostgreSQL. |
| pyhdfs | >=0.2.1 | Use HDFS with the Hadoop compute engine. |
| pyhs2 | >=0.6.0 | Connect to and operate Hive (HiveServer2). |
| pyodps | >=0.7.16 | Perform MaxCompute (ODPS) operations. |
| pyspark | >=2.3.1 | Use Spark with the Hadoop compute engine. |
| requests | >=2.4.0 | Send HTTP requests. |
| scikit-learn | None | Basic algorithm processing. |
| scipy | None | Basic algorithm processing. |
| setuptools | >=3.0 | Basic Python feature library. |
| yarn-api-client | >=0.2.3 | YARN API client. |
| Matplotlib | None | Plotting and visualization. |
Use PyHive and PyOdps in Dataphin
In Dataphin, you can use PyHive and PyOdps locally or through object handles. To use the object handles, import them with the from dataphin import odps and from dataphin import hivec statements (a short sketch follows the list below). Using the object handles avoids the following issues that can occur with local usage:
- If Hive uses username and password authentication, developers must obtain the credentials, which increases the risk of leaks. In addition, whenever the username or password changes, the program or its variables must be updated accordingly.
- If Hive uses Kerberos authentication, implementing it locally is complex and introduces risks such as keytab file leaks and keytab maintenance issues.
- Dataphin cannot authenticate the user, so its permission system can be bypassed.
- Logical tables cannot be queried.
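A minimal sketch of obtaining the handles, assuming the node runs inside Dataphin; in practice you typically import only the handle that matches your project's compute engine:
# Hive object handle (Hadoop compute engine)
from dataphin import hivec
# MaxCompute object handle (MaxCompute compute engine)
from dataphin import odps

# No host, port, username, password, or keytab is supplied here:
# authentication and permissions are handled by Dataphin itself.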
PyHive and PyOdps syntax scope
In Dataphin, PyHive and PyOdps support the following syntax (an illustrative sketch follows this list):
- Querying physical tables.
- Querying logical tables.
- Calling system functions and user-defined functions (UDFs).
- Replacing workspace variable names. For example, ${project_name} and ${LD_name} are replaced with their actual values.
- Using global and local variables.
- Using Data Manipulation Language (DML) statements on physical and logical tables.
- Using Data Definition Language (DDL) statements on physical tables.
- Using object authentication.
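For example, the following sketch exercises several of these capabilities through hivec; the table and column names are hypothetical placeholders, and the syntax that is ultimately accepted depends on the compute engine.
from dataphin import hivec

# ${project_name} is a workspace variable and is replaced with the actual
# project name before the statement is submitted.
hivec.execute("SELECT id, upper(name) FROM ${project_name}.demo_table LIMIT 10")  # query + system function
print(hivec.fetchall())

# DML on a (hypothetical) physical table.
hivec.execute("INSERT INTO ${project_name}.demo_target SELECT id, name FROM ${project_name}.demo_table")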
Methods supported by hivec
You can use the following code to view the methods of hivec.
from dataphin import hivec
print(dir(hivec))

The methods of hivec are the same as the correspondingly named methods of pyhive.hive.Cursor (a combined usage sketch follows this list):
- Execute SQL statements: execute
- Fetch all query results: fetchall
- Fetch a specified number of query results: fetchmany
- Fetch one query result: fetchone
- Close the cursor and connection: close
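The cursor-style methods can be combined as in the following sketch; the table name and the workspace variable are placeholders.
from dataphin import hivec

hivec.execute("SELECT * FROM ${project_name}.demo_table")  # run the query
first_row = hivec.fetchone()       # read a single row
next_rows = hivec.fetchmany(5)     # read the next five rows
remaining = hivec.fetchall()       # read everything that is left
print(first_row, next_rows, remaining)

hivec.close()                      # release the cursor and connection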
Usage examples
Hadoop compute engine
Outside Dataphin, you typically install PyHive and then import the hive module to connect to and interact with Hive. The following code is an example:
# Load the package
from pyhive import hive
# Establish a connection
conn = hive.connect(host='100.100.***.100',  # HiveServer address
                    port=10000,              # Port
                    username='xxx',          # Username
                    database='xxx',          # Database
                    password='xxx')          # Password
# Query
cursor = conn.cursor()
cursor.execute('select * from table limit 10')
for result in cursor.fetchall():
    print(result)
# Close the connection
cursor.close()
conn.close()

In Dataphin, you can operate on Hive directly in a Python compute node. To do this, import the pre-installed resource package using the from dataphin import hivec statement. The following code shows an example:
# Import the package
from dataphin import hivec
# Execute the SQL statement
hivec.execute("SELECT * FROM ${project_dev}.table WHERE ds != 0")
# Print the SQL results
for result in hivec.fetchall():
    print(result)

MaxCompute compute engine
When using the MaxCompute compute engine, you can operate on MaxCompute in a Python compute node. To do this, import the pre-installed resource package using the from dataphin import odps statement. The following code shows an example:
# Load the package
from dataphin import odps
# Execute a SQL statement; run_sql submits it asynchronously and returns an instance
odps.run_sql('SELECT * FROM ${project_dev}.table WHERE ds != 0')
# Execute a query synchronously (execute_sql waits for completion) and print its results
with odps.execute_sql('select 1').open_reader() as reader:
    for record in reader:
        print(record)
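Because the odps handle behaves like a standard PyODPS entry object, other common PyODPS calls should also be available through it. The following sketch is an assumption-based example that uses a hypothetical table name; adjust it to your own tables.
from dataphin import odps

# Check whether a (hypothetical) table exists before reading it.
if odps.exist_table('demo_table'):
    t = odps.get_table('demo_table')     # obtain the table object
    print(t.schema)                      # inspect its schema
    with t.open_reader() as reader:      # read records directly from the table
        for record in reader:
            print(record)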