This topic describes how to create a DataFrame object. After this object is created, it can be used to reference a data source.

Background information

To use DataFrame, you must be familiar with operations on Collection (DataFrame), Sequence, and Scalar objects. A Collection is a two-dimensional tabular data structure. A Sequence is a one-dimensional structure that represents a column. A Scalar is a single scalar value.

If you create a DataFrame object from Pandas data, the object contains the actual data. If you create a DataFrame object from a MaxCompute table, the object does not contain the actual data. It contains only the operations to perform on the data. After the DataFrame object is created, MaxCompute stores the data and performs the computation.
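The difference between an object that holds data and an object that holds only operations can be illustrated with a toy sketch in plain Python. This is not PyODPS code: the DeferredSum class and the read_remote_rows function are hypothetical names used only to show that a deferred expression reads nothing until it is executed.

```python
class DeferredSum:
    """Toy deferred expression: it records what to compute, not the result."""

    def __init__(self, read_rows, column):
        self._read_rows = read_rows  # callable standing in for a remote table scan
        self._column = column

    def execute(self):
        # Rows are read only here, when the result is actually requested.
        return sum(row[self._column] for row in self._read_rows())


read_log = []


def read_remote_rows():
    """Hypothetical reader; each call represents one scan of the remote table."""
    read_log.append("read")
    return [{"a": 1, "b": 2}, {"a": 3, "b": 4}]


expr = DeferredSum(read_remote_rows, "a")  # builds the operation only
reads_before_execute = len(read_log)       # 0: no data has been read yet
total = expr.execute()                     # 4: the scan happens now
reads_after_execute = len(read_log)        # 1
```

An object backed by in-memory data would correspond to holding the row list directly, so that no read step is deferred.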

Procedure

A DataFrame object is the only Collection object that you must create yourself. A DataFrame object can reference MaxCompute tables, MaxCompute partitions, Pandas DataFrame objects, and SQLAlchemy tables (database tables). The operations are the same for all of these data sources, so you can process data without modifying your code. You only need to change the input and output pointers. This way, you can migrate small-scale test code that runs locally to MaxCompute. PyODPS DataFrame ensures the accuracy of the migration.

To create a DataFrame object, pass a MaxCompute table, a MaxCompute partition, a Pandas DataFrame object, or an SQLAlchemy table to the DataFrame constructor.

from odps.df import DataFrame

# o is the MaxCompute entry object (an ODPS instance) that was created beforehand.

# Create a DataFrame object by using a MaxCompute table.
iris = DataFrame(o.get_table('pyodps_iris'))
iris2 = o.get_table('pyodps_iris').to_df()  # Use the to_df method for the table.

# Create a DataFrame object by using a MaxCompute partition.
pt_df = DataFrame(o.get_table('partitioned_table').get_partition('pt=20171111'))
pt_df2 = o.get_table('partitioned_table').get_partition('pt=20171111').to_df()  # Use the to_df method for the partition.

# Create a DataFrame object by using a Pandas DataFrame object.
import pandas as pd
import numpy as np
df = DataFrame(pd.DataFrame(np.arange(9).reshape(3, 3), columns=list('abc')))

# Create a DataFrame object by using an SQLAlchemy table.
# MetaData(bind=...) and autoload=True require a SQLAlchemy version earlier than 2.0.
import sqlalchemy
engine = sqlalchemy.create_engine('mysql://root:123456@localhost/movielens')
metadata = sqlalchemy.MetaData(bind=engine)  # Bind the metadata to a database engine.
table = sqlalchemy.Table('top_users', metadata, extend_existing=True, autoload=True)
users = DataFrame(table)
If you create a DataFrame object from a Pandas DataFrame object, note the following points about the initialization of the DataFrame object:
  • PyODPS DataFrame attempts to infer the data type of columns whose NumPy type is OBJECT or STRING. If such a column is entirely empty, an error is returned. To avoid this error, set unknown_as_string to True so that the column is treated as STRING.
  • You can use the as_type parameter to force a data type conversion. For a basic data type, the conversion is performed when the PyODPS DataFrame object is created. If a Pandas DataFrame object contains a LIST or DICT column, the system does not infer the data type of this column, and you must specify it manually in as_type. The value of as_type must be a dict that maps column names to data types.
df2 = DataFrame(df, unknown_as_string=True, as_type={'null_col2': 'float'})
df2.dtypes
odps.Schema {
  sepallength           float64
  sepalwidth            float64
  petallength           float64
  petalwidth            float64
  name                  string
  null_col1             string   # The data type cannot be identified. You can set unknown_as_string to True to convert the data type to STRING.
  null_col2             float64  # The data type is forcibly converted to FLOAT.
}
df4 = DataFrame(df3, as_type={'list_col': 'list<int64>'})
df4.dtypes
odps.Schema {
  id        int64
  list_col  list<int64>  # The data type cannot be identified or automatically converted. You must specify as_type.
}
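The inference rules described above can be modelled with a short plain-Python sketch. This is an illustrative model only, not the PyODPS implementation: the infer_column_type helper, its parameters, and the sample column data are hypothetical, and the type names are written as strings for readability.

```python
def infer_column_type(values, unknown_as_string=False, forced_type=None):
    """Illustrative model of the rules above; not PyODPS's actual code."""
    # Rule: a type given through as_type always wins.
    if forced_type is not None:
        return forced_type
    non_null = [v for v in values if v is not None]
    # Rule: an entirely empty column cannot be inferred.
    if not non_null:
        if unknown_as_string:
            return "string"
        raise TypeError("cannot infer the type of an entirely empty column")
    sample = non_null[0]
    # Rule: LIST and DICT columns are never inferred.
    if isinstance(sample, (list, dict)):
        raise TypeError("list or dict column: specify the type in as_type")
    if isinstance(sample, bool):
        return "boolean"
    if isinstance(sample, int):
        return "int64"
    if isinstance(sample, float):
        return "float64"
    if isinstance(sample, str):
        return "string"
    raise TypeError("unsupported value type: %r" % type(sample))


# Hypothetical column data that mirrors the examples above.
columns = {
    "name": ["setosa", None],
    "null_col1": [None, None],
    "null_col2": [None, None],
    "list_col": [[1, 2], [3]],
}
as_type = {"null_col2": "float64", "list_col": "list<int64>"}

schema = {
    name: infer_column_type(
        values, unknown_as_string=True, forced_type=as_type.get(name)
    )
    for name, values in columns.items()
}
```

In this model, removing unknown_as_string=True makes null_col1 raise an error, and removing the list_col entry from as_type makes list_col raise an error, which matches the two cases the bullet points warn about.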
Note: PyODPS DataFrame does not allow you to upload Object Storage Service (OSS) or Tablestore (OTS) external tables to MaxCompute.