This topic describes how to create a DataFrame object. After this object is created, it can be used to reference a data source.
Background information
Before you use DataFrame, you must familiarize yourself with the operations on the Collection (DataFrame), Sequence, and Scalar objects. Collection represents a tabular (two-dimensional) data structure, Sequence represents a column (one-dimensional structure), and Scalar represents a scalar value.
If you use Pandas data to create a DataFrame object, the object contains the actual data. If you use a MaxCompute table to create a DataFrame object, the object does not contain the actual data; it contains only a description of the operations to perform. After the DataFrame object is created, the data is stored and computed in MaxCompute.
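The first case can be checked with plain Pandas: a Pandas DataFrame holds its data in memory, so computations return immediately. A minimal sketch that assumes only pandas and numpy, without a MaxCompute connection:

```python
import numpy as np
import pandas as pd

# A Pandas DataFrame contains the actual data in memory.
pdf = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list('abc'))

# The values are already materialized, so computation is immediate.
total = int(pdf['a'].sum())
print(total)  # 0 + 3 + 6 = 9
```

A DataFrame object that references a MaxCompute table, by contrast, records only the operations; no values exist locally until the computation is executed in MaxCompute.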
Procedure
The DataFrame object is the only Collection object that you must create explicitly. A DataFrame object can reference a MaxCompute table, a MaxCompute table partition, a Pandas DataFrame object, or a SQLAlchemy table (database table). The reference operations for these data sources are the same, so you can process data from different sources without modifying your code: you only need to change the data source that the DataFrame points to. This way, you can migrate small-scale test code that runs locally to MaxCompute, and PyODPS DataFrame ensures that the results remain consistent.
To create a DataFrame object, pass the required MaxCompute table, MaxCompute partition, Pandas DataFrame object, or SQLAlchemy table to the DataFrame constructor.
from odps.df import DataFrame
# Create a DataFrame object by using a MaxCompute table.
iris = DataFrame(o.get_table('pyodps_iris'))
iris2 = o.get_table('pyodps_iris').to_df() # Use the to_df method for the table.
# Create a DataFrame object by using a MaxCompute partition.
pt_df = DataFrame(o.get_table('partitioned_table').get_partition('pt=20171111'))
pt_df2 = o.get_table('partitioned_table').get_partition('pt=20171111').to_df() # Use the to_df method for the partition.
# Create a DataFrame object by using a Pandas DataFrame object.
import pandas as pd
import numpy as np
df = DataFrame(pd.DataFrame(np.arange(9).reshape(3, 3), columns=list('abc')))
# Create a DataFrame object by using a SQLAlchemy table.
import sqlalchemy
engine = sqlalchemy.create_engine('mysql://root:123456@localhost/movielens')
metadata = sqlalchemy.MetaData(bind=engine)  # Bind the metadata to a database engine.
table = sqlalchemy.Table('top_users', metadata, extend_existing=True, autoload=True)
users = DataFrame(table)
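The engine above points at a MySQL server, so the snippet cannot run without one. The reflection step that produces the Table object accepted by DataFrame can be sketched against an in-memory SQLite database instead. This sketch uses the autoload_with keyword from SQLAlchemy 1.4 and later, and the top_users table is created here only for illustration:

```python
import sqlalchemy

# An in-memory SQLite database stands in for the MySQL server.
engine = sqlalchemy.create_engine('sqlite://')
with engine.begin() as conn:
    conn.execute(sqlalchemy.text(
        'CREATE TABLE top_users (id INTEGER PRIMARY KEY, name TEXT)'))

# Reflect the table definition from the database.
metadata = sqlalchemy.MetaData()
table = sqlalchemy.Table('top_users', metadata, autoload_with=engine)
print([c.name for c in table.columns])  # ['id', 'name']
```

The reflected Table object is what you pass to DataFrame, regardless of which database backend produced it.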
- PyODPS DataFrame attempts to infer the data type of columns whose NumPy data type is OBJECT or STRING. If such a column contains only null values, type inference fails and an error is returned. To avoid this error, set unknown_as_string to True, which converts the data type of these columns to STRING.
- You can use the as_type parameter to forcibly convert data types. If a basic data type is specified, the conversion is applied when the PyODPS DataFrame object is created. If a Pandas DataFrame object contains a LIST or DICT column, the system does not infer the data type of that column. You must specify it manually by using as_type.
# In this example, df is assumed to be a Pandas DataFrame that contains the null_col1 and null_col2 columns.
df2 = DataFrame(df, unknown_as_string=True, as_type={'null_col2': 'float'})
df2.dtypes
odps.Schema {
sepallength float64
sepalwidth float64
petallength float64
petalwidth float64
name string
null_col1 string # The data type cannot be identified. You can set unknown_as_string to True to convert the data type to STRING.
null_col2 float64 # The data type is forcibly converted to FLOAT.
}
# df3 is assumed to be a Pandas DataFrame that contains a LIST column named list_col.
df4 = DataFrame(df3, as_type={'list_col': 'list<int64>'})
df4.dtypes
odps.Schema {
id int64
list_col list<int64> # The data type cannot be identified or automatically converted. You must specify as_type.
}
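The behavior described above can be reproduced with plain Pandas: a column that contains only nulls and a column that contains lists both end up with the generic object dtype, which is why unknown_as_string and as_type are needed. A sketch (df3 is not defined in this topic; the frame below is only a hypothetical stand-in):

```python
import pandas as pd

# Hypothetical stand-in for df3: an all-null column and a LIST column.
df3 = pd.DataFrame({
    'id': [1, 2],
    'null_col1': [None, None],   # nothing to infer a type from
    'list_col': [[1, 2], [3]],   # Pandas stores lists as generic objects
})

# Pandas reports the generic 'object' dtype for both columns, so PyODPS
# cannot infer a MaxCompute type without unknown_as_string or as_type.
print(df3['null_col1'].dtype)  # object
print(df3['list_col'].dtype)   # object
```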