Saturday, July 25, 2020

Convert Any Character Delimited File to Parquet Format using Scala

In this article we will see how to convert any character-delimited file to the Parquet format using a Spark DataFrame in Scala.

The input text file is shown below.
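The original screenshot is not reproduced here; a pipe-delimited input file of the following shape would work (the column names and rows below are hypothetical, for illustration only):

```
id|name|city
1|Alice|London
2|Bob|Paris
3|Carol|Berlin
```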



The SBT library dependencies are shown below for reference.

scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"

The Scala program is provided below.

import org.apache.spark.sql.{SaveMode, SparkSession}

object DelimitedToParquetConverter {

  def main(args: Array[String]): Unit = {
    val inputFile = "C:\\data\\delimited.txt"
    val outputFile = "C:\\data\\out_data_delimited2parquet"
    val delimiter = "|" // Pipe character
    ConvertFile(inputFile, outputFile, delimiter)
  }

  def ConvertFile(inputFile: String, outputFile: String, delimiter: String): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("DelimitedToParquetConverter")
      .getOrCreate()

    val df = spark
      .read
      .format("csv")
      .option("delimiter", delimiter) // Pipe-delimited file
      .option("header", "true")      // First line contains column names
      .load(inputFile)

    df
      .write
      .mode(SaveMode.Overwrite)
      .parquet(outputFile) // Parquet embeds its schema; no header option is needed

    spark.stop()
  }
}
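To sanity-check the conversion, the Parquet output can be read back with the same SparkSession API. A minimal sketch, assuming the same output path as above (the object name is arbitrary):

```scala
import org.apache.spark.sql.SparkSession

object ParquetReadCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("ParquetReadCheck")
      .getOrCreate()

    // Parquet files carry their own schema, so no delimiter or
    // header options are required when reading them back.
    val df = spark.read.parquet("C:\\data\\out_data_delimited2parquet")
    df.printSchema()
    df.show()

    spark.stop()
  }
}
```

Because Parquet is columnar and self-describing, the round trip preserves column names and inferred types without any extra options.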

The converted Parquet file is shown below.



That's all!
