Wednesday, July 15, 2020

Load ORC file data into a Spark DataFrame using Scala

In this article, we will load ORC file data into a Spark DataFrame using Scala.

The ORC file is available here -> https://github.com/Teradata/kylo/blob/master/samples/sample-data/orc/userdata1_orc

After downloading the ORC file, simply rename the file to "data.orc".

The SBT library dependencies are shown below for reference.

scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"

The Scala program is provided below.


import org.apache.spark.sql.SparkSession

object ORCReader extends App {

  // Create a local SparkSession. Setting "spark.sql.orc.impl" to "native"
  // selects Spark's built-in ORC reader rather than the Hive-based one.
  val spark = SparkSession.builder()
    .master("local")
    .appName("ORCFileReader")
    .config("spark.sql.orc.impl", "native")
    .getOrCreate()

  // Needed only for implicit conversions such as the $"column" syntax.
  import spark.implicits._

  // ORC is a self-describing format: the schema is stored in the file itself,
  // so no "header" or "inferSchema" options are needed (those apply to CSV).
  val df = spark.read
    .format("orc")
    .load("C:\\data\\data.orc")

  df.show()
}
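
Once the file is loaded, the rest of the DataFrame API is available for exploring the data. The lines below are a sketch of a few follow-up calls that could be added inside ORCReader after df.show(); the "country" column is an assumption based on the userdata1 sample and should be adjusted to match whatever printSchema reports for your file.

  // Print the schema Spark read from the ORC file's embedded metadata.
  df.printSchema()

  // Total number of rows in the file.
  println(s"Row count: ${df.count()}")

  // Hypothetical aggregation; "country" is assumed to be a column in the sample data.
  df.groupBy("country")
    .count()
    .orderBy($"count".desc)
    .show(10)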
Running the program prints the first 20 rows of the DataFrame (the default for df.show()).


Thanks. That is all for now!
