E-MapReduce: Basic operations on PySpark

Last Updated: Apr 26, 2024

PySpark is the Python API for Spark. It provides the DataFrame API, which you can use to implement various types of computing logic. This topic describes the basic operations of PySpark.

Procedure

  1. Log on to your cluster in SSH mode. For more information, see Log on to a cluster.
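
    The exact logon command depends on your cluster configuration. The following is a minimal sketch, assuming you connect to the master node as the root user over its public IP address; replace the placeholder with your own address:

    ssh root@<master-node-public-IP>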

  2. Run the following command to start the interactive PySpark shell:

    pyspark

    You can run the pyspark --help command to view additional command-line options.
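
    For example, you can pass standard spark-submit options when you start the shell. The following command is a minimal sketch; the resource values are placeholders that you should adjust for your cluster:

    pyspark --master yarn --num-executors 2 --executor-memory 2g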

  3. Initialize a Spark session.

    from pyspark.sql import SparkSession

    # In the PySpark shell, a session already exists; getOrCreate() returns it.
    spark = SparkSession.builder.getOrCreate()
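
    If you want to name the application or set Spark configurations when the session is created, you can chain the corresponding builder methods. The following is a minimal sketch; the application name and configuration value are placeholders:

    spark = SparkSession.builder \
        .appName("pyspark-demo") \
        .config("spark.sql.shuffle.partitions", "200") \
        .getOrCreate()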
  4. Create a DataFrame.

    from datetime import datetime, date

    # Build a DataFrame from in-memory rows with an explicit schema.
    df = spark.createDataFrame([
        (1, 2., 'string1', date(2000, 1, 1), datetime(2000, 1, 1, 12, 0)),
        (2, 3., 'string2', date(2000, 2, 1), datetime(2000, 1, 2, 12, 0)),
        (3, 4., 'string3', date(2000, 3, 1), datetime(2000, 1, 3, 12, 0))
    ], schema='a long, b double, c string, d date, e timestamp')

    After the DataFrame is created, you can apply various transformation operators to it to implement your computing logic, as shown in the sketch below.
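
    The following is a minimal sketch of a few common transformations on the df created above; the derived column name is arbitrary. Transformations are lazy and only define a new DataFrame:

    from pyspark.sql import functions as F

    # Keep rows where column a is greater than 1 and add a derived column.
    df2 = df.filter(F.col('a') > 1).withColumn('b_doubled', F.col('b') * 2)

    # An action such as count() triggers the actual computation.
    df2.count()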

  5. Run the following commands to display the DataFrame and its schema:

    df.show()
    df.printSchema()
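
    If the DataFrame was created as shown in the preceding step, the output should be similar to the following:

    +---+---+-------+----------+-------------------+
    |  a|  b|      c|         d|                  e|
    +---+---+-------+----------+-------------------+
    |  1|2.0|string1|2000-01-01|2000-01-01 12:00:00|
    |  2|3.0|string2|2000-02-01|2000-01-02 12:00:00|
    |  3|4.0|string3|2000-03-01|2000-01-03 12:00:00|
    +---+---+-------+----------+-------------------+

    root
     |-- a: long (nullable = true)
     |-- b: double (nullable = true)
     |-- c: string (nullable = true)
     |-- d: date (nullable = true)
     |-- e: timestamp (nullable = true)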