When you develop Python compute nodes, your business scenario may require additional resource packages. Dataphin provides a set of pre-installed resource packages that you can use without installing anything: add an import <package_name> statement at the beginning of your code, for example import configparser.
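For instance, the following minimal sketch imports one of the pre-installed packages and reads an illustrative in-memory configuration; no installation step is needed, and the configuration contents are made up for the example.
import configparser   # one of the pre-installed packages; no pip install needed

# Parse an in-memory configuration string (contents are illustrative).
parser = configparser.ConfigParser()
parser.read_string("[db]\nhost = 127.0.0.1\nport = 3306\n")
print(parser.get("db", "host"))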
List of built-in resource packages
The following table lists the resource packages that are built into Dataphin. You can also run the pip list command in a Shell compute node to view the built-in package modules.
| Resource package | Version | Scenario |
| --- | --- | --- |
| configparser | >=3.5.0 | Read configuration files. |
| DateTime | None | Data processing. |
| hdfs | >=2.1.0 | Use HDFS with the Hadoop compute engine. |
| jumpssh | None | Connect to a server through a jump server. |
| mysql-connector-python | >=8.0.11 | Connect to and operate MySQL. |
| numpy | None | Basic algorithm processing. |
| pandas | None | Basic algorithm processing. |
| psycopg2 | >=2.7.4 | Connect to and operate PostgreSQL. |
| pyhdfs | >=0.2.1 | Use HDFS with the Hadoop compute engine. |
| pyhs2 | >=0.6.0 | Connect to and operate Hive (HiveServer2). |
| pyodps | >=0.7.16 | Perform MaxCompute (ODPS) operations. |
| pyspark | >=2.3.1 | Use Spark with the Hadoop compute engine. |
| requests | >=2.4.0 | Send HTTP requests. |
| scikit-learn | None | Basic algorithm processing. |
| scipy | None | Basic algorithm processing. |
| setuptools | >=3.0 | Basic Python feature library. |
| yarn-api-client | >=0.2.3 | YARN API client. |
| Matplotlib | None | Plotting and visualization. |
Use PyHive and PyOdps in Dataphin
In Dataphin, you can use PyHive and PyOdps locally or through object handles. To use the object handles, import them with the from dataphin import odps and from dataphin import hivec statements (a short sketch follows the list below). Using the object handles avoids the following issues that can occur with local usage:
- If Hive uses username and password authentication, developers must obtain the credentials, which increases the risk of leaks. In addition, whenever the username or password changes, the program or its variables must be updated accordingly.
- If Hive uses Kerberos authentication, implementing it locally is complex and introduces risks such as keytab file leaks and keytab maintenance issues.
- Dataphin cannot authenticate the user, so its permission system can be bypassed.
- Logical tables cannot be queried.
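A minimal sketch of obtaining the handles, assuming the node runs inside Dataphin; in practice you typically import only the handle that matches your project's compute engine:
# Hive object handle (Hadoop compute engine)
from dataphin import hivec
# MaxCompute object handle (MaxCompute compute engine)
from dataphin import odps

# No host, port, username, password, or keytab is supplied here:
# authentication and permissions are handled by Dataphin itself.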
PyHive and PyOdps syntax scope
In Dataphin, PyHive and PyOdps support the following syntax (an illustrative sketch follows this list):
- Querying physical tables.
- Querying logical tables.
- Calling system functions and user-defined functions (UDFs).
- Replacing workspace variable names. For example, ${project_name} and ${LD_name} are replaced with their actual values.
- Using global and local variables.
- Using Data Manipulation Language (DML) statements on physical and logical tables.
- Using Data Definition Language (DDL) statements on physical tables.
- Using object authentication.
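For example, the following sketch exercises several of these capabilities through hivec; the table and column names are hypothetical placeholders, and the syntax that is ultimately accepted depends on the compute engine.
from dataphin import hivec

# ${project_name} is a workspace variable and is replaced with the actual
# project name before the statement is submitted.
hivec.execute("SELECT id, upper(name) FROM ${project_name}.demo_table LIMIT 10")  # query + system function
print(hivec.fetchall())

# DML on a (hypothetical) physical table.
hivec.execute("INSERT INTO ${project_name}.demo_target SELECT id, name FROM ${project_name}.demo_table")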
Methods supported by hivec
You can use the following code to view the methods of hivec.
from dataphin import hivec
print(dir(hivec))

The methods of hivec are the same as the correspondingly named methods of pyhive.hive.Cursor (a combined usage sketch follows this list):
- Execute SQL statements: execute
- Fetch all query results: fetchall
- Fetch a specified number of query results: fetchmany
- Fetch one query result: fetchone
- Close the cursor and connection: close
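The cursor-style methods can be combined as in the following sketch; the table name and the workspace variable are placeholders.
from dataphin import hivec

hivec.execute("SELECT * FROM ${project_name}.demo_table")  # run the query
first_row = hivec.fetchone()       # read a single row
next_rows = hivec.fetchmany(5)     # read the next five rows
remaining = hivec.fetchall()       # read everything that is left
print(first_row, next_rows, remaining)

hivec.close()                      # release the cursor and connection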
Usage examples
Hadoop compute engine
Outside Dataphin, you typically install PyHive and then import the hive module to connect to and interact with Hive. The following code is an example:
# Load the package
from pyhive import hive
# Establish a connection
conn = hive.connect(host='100.100.***.100',  # HiveServer address
                    port=10000,              # Port
                    username='xxx',          # Username
                    database='xxx',          # Database
                    password='xxx')          # Password
# Query
cursor = conn.cursor()
cursor.execute('select * from table limit 10')
for result in cursor.fetchall():
    print(result)
# Close the connection
cursor.close()
conn.close()

In Dataphin, you can operate on Hive directly in a Python compute node. To do this, import the pre-installed resource package using the from dataphin import hivec statement. The following code shows an example:
# Import the package
from dataphin import hivec
# Execute the SQL statement
hivec.execute("SELECT * FROM ${project_dev}.table WHERE ds != 0")
# Print the SQL results
for result in hivec.fetchall():
    print(result)

MaxCompute compute engine
When using the MaxCompute compute engine, you can operate on MaxCompute in a Python compute node. To do this, import the pre-installed resource package using the from dataphin import odps statement. The following code shows an example:
# Load the package
from dataphin import odps
# Execute a SQL statement; run_sql submits it asynchronously and returns an instance
odps.run_sql('SELECT * FROM ${project_dev}.table WHERE ds != 0')
# Execute a query synchronously (execute_sql waits for completion) and print its results
with odps.execute_sql('select 1').open_reader() as reader:
    for record in reader:
        print(record)
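Because the odps handle behaves like a standard PyODPS entry object, other common PyODPS calls should also be available through it. The following sketch is an assumption-based example that uses a hypothetical table name; adjust it to your own tables.
from dataphin import odps

# Check whether a (hypothetical) table exists before reading it.
if odps.exist_table('demo_table'):
    t = odps.get_table('demo_table')     # obtain the table object
    print(t.schema)                      # inspect its schema
    with t.open_reader() as reader:      # read records directly from the table
        for record in reader:
            print(record)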