The input Avro file was taken from: https://github.com/Teradata/kylo/blob/master/samples/sample-data/avro/userdata1.avro
The SBT library dependencies are shown below for reference.
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"
libraryDependencies += "com.databricks" %% "spark-avro" % "3.2.0"
The Scala program is provided below.
import org.apache.spark.sql.{SaveMode, SparkSession}

object AvroToORCConverter extends App {
  // Point Hadoop at the winutils.exe directory (required on Windows).
  System.setProperty("hadoop.home.dir", "C:\\intellij.winutils")

  val spark = SparkSession.builder()
    .master("local")
    .appName("AvroToORCConverter")
    .config("spark.sql.orc.impl", "native") // use Spark's native ORC writer
    .getOrCreate()

  val inputFile = "C:\\data\\data.avro"
  val outputFile = "C:\\data\\out_data_avro2orc"

  // Read the Avro file through the Databricks spark-avro data source.
  val df = spark
    .read
    .format("com.databricks.spark.avro")
    .load(inputFile)

  // Write the DataFrame out as ORC, replacing any previous output.
  // Note: the "header" option applies only to CSV and has no effect on
  // ORC, so it is omitted here.
  df
    .write
    .mode(SaveMode.Overwrite)
    .orc(outputFile)
}
Running the program writes the converted ORC output to C:\data\out_data_avro2orc, a directory containing the part-*.orc files.
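As a quick sanity check (not part of the original listing), the ORC directory can be read back with the same SparkSession to confirm that the schema and data survived the conversion. A minimal sketch, assuming the output path used above:

// Verification sketch: read the ORC output back and inspect it.
val orcDf = spark.read.orc("C:\\data\\out_data_avro2orc")
orcDf.printSchema() // should list the same columns as the Avro schema
orcDf.show(5)       // display the first few converted rows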
That's all!