namaste-data: Load Nested XML data into a Spark Dataframe using Scala

Tuesday, July 28, 2020

Load Nested XML data into a Spark Dataframe using Scala

In this article, we will load nested XML data into a Spark DataFrame using Scala.

The nested XML file is provided below for reference.

The SBT library dependencies are shown below for reference.

scalaVersion := "2.11.12"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"

libraryDependencies += "com.databricks" %% "spark-xml" % "0.6.0"

The Scala program is provided below.

import org.apache.spark.sql.SparkSession

object NestedXMLReader extends App {

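  // Windows only: point Hadoop at the folder that contains bin\winutils.exe
  System.setProperty("hadoop.home.dir", "C:\\intellij.winutils")

  // Run Spark locally in a single JVM
  val spark = SparkSession.builder()
    .master("local")
    .appName("XMLFileReader")
    .getOrCreate()

  // Each <person> element becomes one row; nested elements become struct columns
  val df = spark.read
    .format("xml")
    .option("rowTag", "person")
    .load("C:\\data\\nested-data.xml")

  // Use dot notation to select fields inside the nested Name struct
  df.select("Id", "Age", "Name.FirstName", "Name.LastName").show()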

}


Here is the output after running the program.
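As a follow-up, here is a minimal, self-contained sketch of the same idea. It prints the inferred schema and flattens the nested Name fields into top-level columns. The sample XML is hypothetical: its element names are inferred from the select call in the program, and the values are made up. The object name NestedXMLFlattenSketch and the temporary file are also assumptions for illustration. On Windows, the hadoop.home.dir setting from the program may still be needed.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object NestedXMLFlattenSketch {

  def main(args: Array[String]): Unit = {

    // Hypothetical sample data: element names are inferred from the
    // select(...) call in the article; the values are invented.
    val sampleXml =
      """<people>
        |  <person>
        |    <Id>1</Id>
        |    <Name>
        |      <FirstName>Asha</FirstName>
        |      <LastName>Rao</LastName>
        |    </Name>
        |    <Age>30</Age>
        |  </person>
        |  <person>
        |    <Id>2</Id>
        |    <Name>
        |      <FirstName>Ravi</FirstName>
        |      <LastName>Kumar</LastName>
        |    </Name>
        |    <Age>41</Age>
        |  </person>
        |</people>""".stripMargin

    // Write the sample to a temporary file so the sketch does not depend
    // on a file such as C:\data\nested-data.xml being present.
    val xmlPath = Files.createTempFile("nested-data", ".xml")
    Files.write(xmlPath, sampleXml.getBytes(StandardCharsets.UTF_8))

    val spark = SparkSession.builder()
      .master("local")
      .appName("NestedXMLFlattenSketch")
      .getOrCreate()

    // Each <person> element becomes one row
    val df = spark.read
      .format("xml")
      .option("rowTag", "person")
      .load(xmlPath.toString)

    // Shows how the nested <Name> element was mapped to a struct column
    df.printSchema()

    // Pull the nested fields up to top-level columns with explicit aliases
    val flat = df.select(
      col("Id"),
      col("Age"),
      col("Name.FirstName").as("FirstName"),
      col("Name.LastName").as("LastName")
    )

    flat.show()

    spark.stop()
  }
}
```

Aliasing the nested fields makes the flattened column names explicit, which can help if the DataFrame is later written to a flat format such as CSV.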

Thanks. That is all for now!
