namaste-data: Convert ORC to Parquet using Scala

Sunday, July 26, 2020

Convert ORC to Parquet using Scala

This article shows how to convert an ORC file to the Parquet file format using a Spark DataFrame in Scala.

The SBT library dependencies are shown below for reference.

scalaVersion := "2.11.12"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"

The Scala program is provided below.

import org.apache.spark.sql.{SaveMode, SparkSession}

object ORCToParquetConverter extends App {
  // Local Spark session; "native" selects Spark's built-in ORC reader (available from Spark 2.3).
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("ORCToParquetConverter")
    .config("spark.sql.orc.impl", "native")
    .getOrCreate()

  val inputFile = "C:\\data\\data.orc"
  val outputFile = "C:\\data\\out_data_orc2parquet"

  // ORC files carry their own schema, so CSV-style options such as
  // "header" and "inferSchema" are not needed here.
  val df = spark.read
    .format("orc")
    .load(inputFile)

  // Write the DataFrame as Parquet, replacing any existing output.
  df.write
    .mode(SaveMode.Overwrite)
    .parquet(outputFile)

  spark.stop()
}

The converted output is written to the outputFile path as a directory containing one or more Parquet part files.
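
One way to check the conversion is to read both the ORC input and the Parquet output back into DataFrames and compare them. The following is only a minimal sketch: the object name VerifyOrcToParquet is hypothetical, the paths are assumed to be the same ones used in the converter, and comparing column names, type strings, and row counts is a quick sanity check, not a full row-by-row comparison.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper, not part of the original program.
object VerifyOrcToParquet extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("VerifyOrcToParquet")
    .getOrCreate()

  // Assumed to match the paths used by the converter.
  val inputFile = "C:\\data\\data.orc"
  val outputFile = "C:\\data\\out_data_orc2parquet"

  val orcDf = spark.read.orc(inputFile)
  val parquetDf = spark.read.parquet(outputFile)

  // Compare column names and type strings; catalogString leaves out
  // nullability, which can differ between the two readers.
  val orcColumns = orcDf.schema.map(f => (f.name, f.dataType.catalogString))
  val parquetColumns = parquetDf.schema.map(f => (f.name, f.dataType.catalogString))

  println(s"Schemas match: ${orcColumns == parquetColumns}")
  println(s"Row counts match: ${orcDf.count() == parquetDf.count()}")

  spark.stop()
}
```

If either check prints false, calling printSchema() on both DataFrames is a reasonable next step to see where they differ.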

That's all!
