Friday, July 24, 2020

Convert TAB Separated to ORC using Scala

In this article we will see how to convert a tab-separated (TSV) file to ORC format using a Spark DataFrame in Scala.

The input text file is shown below.
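
For anyone who wants to follow along without the original data, a stand-in input file could be generated with plain Scala as sketched here. The column names (id, name, city), the row values, and the SampleTsvWriter object are made up for illustration and are not the data used in this article.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths}

object SampleTsvWriter {
  // Hypothetical sample rows; the first row is the header line.
  val rows: Seq[Seq[String]] = Seq(
    Seq("id", "name", "city"),
    Seq("1", "Alice", "Chennai"),
    Seq("2", "Bob", "Mumbai")
  )

  // Join the fields of each row with a TAB and end every row with a newline.
  def toTsv(rows: Seq[Seq[String]]): String =
    rows.map(_.mkString("\t")).mkString("", "\n", "\n")

  // Write the sample rows to the given path as UTF-8 text.
  def write(path: Path): Path =
    Files.write(path, toTsv(rows).getBytes(StandardCharsets.UTF_8))

  def main(args: Array[String]): Unit = {
    // Target path is an assumption; pass another path as the first argument.
    val target = Paths.get(args.headOption.getOrElse("data.tsv"))
    write(target)
    println(s"Wrote ${rows.size - 1} data rows to $target")
  }
}
```

Point inputTextFile in the converter at the generated file to try the conversion end to end.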



The SBT library dependencies are shown below for reference.

scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"

The Scala program is provided below.

import org.apache.spark.sql.{SaveMode, SparkSession}

object TSVToORCConverter extends App {
  val spark = SparkSession.builder()
    .master("local")
    .appName("TSVToORCConverter")
    .config("spark.sql.orc.impl", "native")
    .getOrCreate()

  val inputTextFile = "C:\\data\\data.tsv"

  val outputORCFile = "C:\\data\\out_data_tsv"
  val df = spark.read
    .format("csv")
    .option("delimiter", "\t") // TAB-delimited file
    .option("header", "true")  // first line holds the column names
    .load(inputTextFile)

  // ORC keeps column names in its own schema, so no header option is needed
  df.write
    .mode(SaveMode.Overwrite)
    .orc(outputORCFile)
}
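
To check the result, the ORC output can be read back into a DataFrame. The following is a minimal sketch that assumes the same output path as the converter above; the ORCReadBack object and its readOrc helper are names introduced here for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ORCReadBack {
  // Load an ORC directory (or file) into a DataFrame.
  def readOrc(spark: SparkSession, path: String): DataFrame =
    spark.read.orc(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("ORCReadBack")
      .config("spark.sql.orc.impl", "native")
      .getOrCreate()

    // Assumed to be the directory written by TSVToORCConverter
    val orcDir = "C:\\data\\out_data_tsv"

    val orcDf = readOrc(spark, orcDir)
    orcDf.printSchema()               // column names come from the TSV header
    orcDf.show(10, truncate = false)  // first 10 rows, values not truncated
    println(s"Row count: ${orcDf.count()}")

    spark.stop()
  }
}
```

Because the CSV reader does not infer types by default, every column in the schema will be a string. Adding .option("inferSchema", "true") to the read in the converter would let Spark infer numeric and other types before the ORC data is written.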

The converted ORC file is shown below.



That's all!
