Alibaba Cloud MaxCompute SDK for Python (PyODPS) DataFrame optimizes the execution process of each operation. You can use the visualization feature to display the entire computation process.
DataFrame visualization
>>> df = iris.groupby('name').agg(id=iris.sepalwidth.sum())
>>> df = df[df.name, df.id + 3]
>>> df.visualize()

groupby
operation and the column filtering operation. >>> df = iris.groupby('name').agg(id=iris.sepalwidth.sum()).cache()
>>> df2 = df[df.name, df.id + 3]
>>> df2.visualize()

View compiled results at the MaxCompute SQL backend
compile
method to view compiled results of SQL statements at the MaxCompute SQL backend.
>>> df = iris.groupby('name').agg(sepalwidth=iris.sepalwidth.max())
>>> df.compile()
Stage 1:
SQL compiled:
SELECT
t1.`name`,
MAX(t1.`sepalwidth`) AS `sepalwidth`
FROM test_pyodps_dev.`pyodps_iris` t1
GROUP BY
t1.`name`
Perform local debugging by using the pandas computation backend
- Select all or some rows of data from a non-partitioned table, or filter the column data, and then calculate the number of specified rows. The column filtering operation does not compute column values.
- Select all or some rows of data from all or the first several partition columns that are specified in a partitioned table, or filter the column data, and then calculate the number of the specified rows.
iris
DataFrame object uses a non-partitioned MaxCompute table as the source. The following
operation uses the MaxCompute Tunnel service to download data: >>> iris.count()
>>> iris['name', 'sepalwidth'][:10]
Assume that the DataFrame object uses a partitioned table that includes the ds
, hh
, and mm
partition fields. The following operation uses the MaxCompute Tunnel service to download
data:
>>> df[:10]
>>> df[df.ds == '20160808']['f0', 'f1']
>>> df[(df.ds == '20160808') & (df.hh == 3)][:10]
>>> df[(df.ds == '20160808') & (df.hh == 3) & (df.mm == 15)]
In this case, you can use the to_pandas
method to download the data to a local directory for debugging.
>>> DEBUG = True
>>> if DEBUG:
>>> df = iris[:100].to_pandas(wrap=True)
>>> else:
>>> df = iris
At the end of compiling, set DEBUG
to False to complete computation on MaxCompute.