PySpark is the Python API for Spark. It provides the DataFrame API for implementing various computing logic. This topic describes the basic operations of PySpark.
Procedure
Log on to your cluster in SSH mode. For more information, see Log on to a cluster.
Run the following command to start the interactive PySpark environment:
pyspark

You can run the pyspark --help command to view more command-line parameters.

Initialize a Spark session.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

In the interactive PySpark environment, a SparkSession is already available as spark, and getOrCreate() returns that existing session instead of creating a new one.

Create a DataFrame.
from datetime import datetime, date

df = spark.createDataFrame([
    (1, 2., 'string1', date(2000, 1, 1), datetime(2000, 1, 1, 12, 0)),
    (2, 3., 'string2', date(2000, 2, 1), datetime(2000, 1, 2, 12, 0)),
    (3, 4., 'string3', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0))
], schema='a long, b double, c string, d date, e timestamp')

After the DataFrame is created, you can apply various transformation operators to it for data computing, as shown in the sketch below.
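For example, the following minimal sketch (assuming the df defined above) chains a few common transformations and then calls the show() action to trigger the computation. The derived names b_doubled, year, and avg_b are illustrative only:

from pyspark.sql import functions as F

# Keep rows where column a is greater than 1 (lazy transformation).
filtered = df.filter(df.a > 1)

# Add a derived column that doubles b (lazy transformation).
derived = filtered.withColumn('b_doubled', F.col('b') * 2)

# Average b per year of the date column d (lazy transformation).
agg = derived.groupBy(F.year('d').alias('year')).agg(F.avg('b').alias('avg_b'))

# show() is an action: it executes the transformations above and prints the result.
agg.show()

Transformations such as filter, withColumn, and groupBy are evaluated lazily; Spark runs them only when an action such as show() or collect() is called.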
Run the following commands to display the DataFrame and its schema:
df.show()
df.printSchema()
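Assuming the DataFrame created above, the output should look similar to the following:

+---+---+-------+----------+-------------------+
|  a|  b|      c|         d|                  e|
+---+---+-------+----------+-------------------+
|  1|2.0|string1|2000-01-01|2000-01-01 12:00:00|
|  2|3.0|string2|2000-02-01|2000-01-02 12:00:00|
|  3|4.0|string3|2000-03-01|2000-01-03 12:00:00|
+---+---+-------+----------+-------------------+

root
 |-- a: long (nullable = true)
 |-- b: double (nullable = true)
 |-- c: string (nullable = true)
 |-- d: date (nullable = true)
 |-- e: timestamp (nullable = true)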