Monday, July 27, 2020

Load AVRO data into a Spark DataFrame using Scala

In this article, we will load AVRO data into a Spark DataFrame using Scala.

The AVRO file can be downloaded from: https://github.com/Teradata/kylo/blob/master/samples/sample-data/avro/userdata1.avro

Simply rename the downloaded file to "data.avro" before using it with the code below.

The SBT library dependencies are shown below for reference.

scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"
libraryDependencies += "com.databricks" %% "spark-avro" % "3.2.0"
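
Note that com.databricks:spark-avro is the external Avro connector used with Spark 2.3 and earlier. From Spark 2.4 onward, Avro support ships with Spark itself, so the dependency and the format name change slightly. A sketch of the newer setup (the version number is an assumption; it should match your Spark version):

```scala
// build.sbt for Spark 2.4+ (Avro support is bundled with Spark itself):
//   libraryDependencies += "org.apache.spark" %% "spark-avro" % "2.4.0"
// The reader then uses the short "avro" format name:
//   val df = spark.read.format("avro").load("C:\\data\\data.avro")
```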

The Scala program is provided below.

import org.apache.spark.sql.SparkSession

object AVROReader extends App {
  // Required on Windows so Spark can locate winutils.exe; adjust the path
  // to your own winutils installation, or remove this line on Linux/macOS.
  System.setProperty("hadoop.home.dir", "C:\\intellij.winutils")

  val spark = SparkSession.builder()
    .master("local")
    .appName("AVROReader")
    .getOrCreate()

  // Enables the $"column" syntax and other implicit conversions.
  import spark.implicits._

  // Read the Avro file through the spark-avro data source.
  val df = spark.read
    .format("com.databricks.spark.avro")
    .load("C:\\data\\data.avro")

  df.show()

  spark.stop()
}
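
Once the DataFrame is loaded, the usual DataFrame operations apply. A small sketch of common follow-ups is shown below; it assumes the same SparkSession and df as above, and the column names (first_name, country) come from the sample userdata1.avro file, so adjust them to your own data:

```scala
// Inspect the schema Spark inferred from the Avro file.
df.printSchema()

// Select and filter columns; the $"column" syntax requires
// import spark.implicits._ as in the program above.
df.select("first_name", "country")
  .where($"country" === "United States")
  .show(5)

// Write the DataFrame back out in Avro format
// (Spark writes a directory of part files, not a single file).
df.write
  .format("com.databricks.spark.avro")
  .save("C:\\data\\output_avro")
```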
Here is the output after running the program.



Thanks. That is all for now!
