Friday, July 24, 2020

Load Parquet file data into a Spark DataFrame using Scala

In this article we will load Parquet file data into a Spark DataFrame using Scala.

The Parquet file is available here - https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata1.parquet

After downloading the Parquet file, simply rename it to "data.parquet" and place it in "C:\data".

The SBT library dependencies are shown below for reference.

scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"

The Scala program is provided below.
Remember to use JDK 1.8 with Spark 2.3.0; newer JDK versions are not supported by this Spark release.

import org.apache.spark.sql.SparkSession

object ParquetReader extends App {

  val spark = SparkSession.builder()
    .master("local")
    .appName("ParquetFileReader")
    .getOrCreate()

  import spark.implicits._

  // Parquet files carry their own schema, so no header or
  // schema-inference options are needed (those apply to CSV sources)
  val df = spark.read
    .format("parquet")
    .load("C:\\data\\data.parquet")

  df.show()
}

Running the program prints the first 20 rows of the DataFrame in tabular form (df.show() displays 20 rows by default and truncates long values).
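
Since the Parquet format stores its own schema, the DataFrame can be queried right away. As a rough sketch (assuming the sample file contains columns such as first_name, country and salary, as the kylo userdata1 sample does), the following lines could be appended inside the ParquetReader object after df.show():

  // Inspect the schema recovered from the Parquet metadata
  df.printSchema()

  // Select a few columns and filter rows (uses spark.implicits._ for the $ syntax)
  df.select("first_name", "country", "salary")
    .filter($"salary" > 100000)
    .show(10)

  // Register a temporary view and query it with Spark SQL
  df.createOrReplaceTempView("users")
  spark.sql("SELECT country, COUNT(*) AS cnt FROM users GROUP BY country ORDER BY cnt DESC").show(5)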

That's all!
