E-MapReduce: Basic operations on PySpark

Last Updated: Apr 26, 2024

PySpark is the Python API for Spark. It provides the DataFrame API, which you can use to implement various types of computing logic. This topic describes the basic operations of PySpark.

Procedure

  1. Log on to your cluster in SSH mode. For more information, see Log on to a cluster.
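
    The exact logon command depends on your cluster configuration. The following is a minimal sketch, assuming you connect to the master node as the root user over its public IP address; replace the placeholder with your own address:

    ssh root@<master-node-public-IP>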

  2. Run the following command to start the interactive PySpark shell:

    pyspark

    You can run the pyspark --help command to view additional command-line options.
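
    For example, you can pass standard spark-submit options when you start the shell. The following command is a minimal sketch; the resource values are placeholders that you should adjust for your cluster:

    pyspark --master yarn --num-executors 2 --executor-memory 2g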

  3. Initialize a Spark session.

    from pyspark.sql import SparkSession

    # In the PySpark shell, a session already exists; getOrCreate() returns it.
    spark = SparkSession.builder.getOrCreate()
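
    If you want to name the application or set Spark configurations when the session is created, you can chain the corresponding builder methods. The following is a minimal sketch; the application name and configuration value are placeholders:

    spark = SparkSession.builder \
        .appName("pyspark-demo") \
        .config("spark.sql.shuffle.partitions", "200") \
        .getOrCreate()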
  4. Create a DataFrame.

    from datetime import datetime, date

    # Build a DataFrame from in-memory rows with an explicit schema.
    df = spark.createDataFrame([
        (1, 2., 'string1', date(2000, 1, 1), datetime(2000, 1, 1, 12, 0)),
        (2, 3., 'string2', date(2000, 2, 1), datetime(2000, 1, 2, 12, 0)),
        (3, 4., 'string3', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0))
    ], schema='a long, b double, c string, d date, e timestamp')

    After the DataFrame is created, you can apply various transformation operators to it to implement your computing logic, as shown in the sketch below.
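
    The following is a minimal sketch of a few common transformations on the df created above; the derived column name is arbitrary. Transformations are lazy and only define a new DataFrame:

    from pyspark.sql import functions as F

    # Keep rows where column a is greater than 1 and add a derived column.
    df2 = df.filter(F.col('a') > 1).withColumn('b_doubled', F.col('b') * 2)

    # An action such as count() triggers the actual computation.
    df2.count()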

  5. Run the following commands to display the DataFrame and its schema:

    df.show()
    df.printSchema()
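
    If the DataFrame was created as shown in the preceding step, the output should be similar to the following:

    +---+---+-------+----------+-------------------+
    |  a|  b|      c|         d|                  e|
    +---+---+-------+----------+-------------------+
    |  1|2.0|string1|2000-01-01|2000-01-01 12:00:00|
    |  2|3.0|string2|2000-02-01|2000-01-02 12:00:00|
    |  3|4.0|string3|2000-03-01|2000-01-03 12:00:00|
    +---+---+-------+----------+-------------------+

    root
     |-- a: long (nullable = true)
     |-- b: double (nullable = true)
     |-- c: string (nullable = true)
     |-- d: date (nullable = true)
     |-- e: timestamp (nullable = true)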