Let's connect
Let's connect

Table schemas in data pipelines Spark: How to handle large, nested & growing ones

Picture of Michał Mstowski, Big Data Software Engineer

Michał Mstowski

Big Data Software Engineer

12 minutes read

1_uipFZ0sga9fYPVPk1EqFmw

bash

| — products: array (nullable = true)
| | — element: struct (containsNull = true)
| | | — created: string (nullable = true)
| | | — description: string (nullable = true)
| | | — fulfilment: array (nullable = true)
| | | | — element: struct (containsNull = true)
| | | | | — id: string (nullable = true)
| | | | | — status: string (nullable = true)
| | | | | — quantity: struct (nullable = true)
| | | | | | — lineIds: array (nullable = true)
| | | | | | | — element: string (containsNull = true)
| | | | | | — number: string (nullable = true)
| | | | | | — measure: string (nullable = true)
| | | | | | — size: string (nullable = true)
| | | | | — products: array (nullable = true)
| | | | | | — element: struct (containsNull = true)
| | | | | | | — id: string (nullable = true)
| | | | | | | — quantity: struct (nullable = true)
| | | | | | | | — ids: array (nullable = true)
| | | | | | | | | — element: string (containsNull = true)
| | | | | | | | — number: string (nullable = true)
| | | | | | | | — measure: string (nullable = true)
| | | | | | | — reason: string (nullable = true)
| | | | | — tracking: struct (nullable = true)
| | | | | | — number: string (nullable = true)
| | | — gtin: string (nullable = true)
| | | — lines: array (nullable = true)
| | | | — element: struct (containsNull = true)
| | | | | — created: string (nullable = true)
| | | | | — id: string (nullable = true)
| | | | | — lastUpdated: string (nullable = true)

java

case class DataSource(???)
case class DataTarget(???)
 
def readData(source:String): Array[DataSource] = {
 // read data from externa source or from hdfs
}
 
def mapData(sourceDS: Dataset[DataSource]): Dataset[DataTarget] = {
 // change column names, add ingestion timestamp etc.
}
 
import spark.implicits._
val source = ???
val targeTable = ???
 
val sourceData = readData(source)
val targetDS = mapData(sourceData.toDS)
targetDS.writeTable(targetTable)
1_YZIYq3-NmgGjeamzHqLVEAAA

Liked the article?

Share it with others!

explore more on

Take the first step to a sustained competitive edge for your business

Let's connect

VirtusLab's work has met the mark several times over, and their latest project is no exception. The team is efficient, hard-working, and trustworthy. Customers can expect a proactive team that drives results.

Stephen Rooke
Stephen RookeDirector of Software Development @ Extreme Reach

VirtusLab's engineers are truly Strapi extensions experts. Their knowledge and expertise in the area of Strapi plugins gave us the opportunity to lift our multi-brand CMS implementation to a different level.

facile logo
Leonardo PoddaEngineering Manager @ Facile.it

VirtusLab has been an incredible partner since the early development of Scala 3, essential to a mature and stable Scala 3 ecosystem.

Martin_Odersky
Martin OderskyHead of Programming Research Group @ EPFL

The VirtusLab team's in-depth knowledge, understanding, and experience of technology have been invaluable to us in developing our product. The team is professional and delivers on time – we greatly appreciated this efficiency when working with them.

Michael_Grant
Michael GrantDirector of Development @ Cyber Sec Company