Let's connect
Let's connect

Table schemas in data pipelines Spark: How to handle large, nested & growing ones

Picture of Michał Mstowski, Big Data Software Engineer

Michał Mstowski

Big Data Software Engineer

12 minutes read

1_uipFZ0sga9fYPVPk1EqFmw

bash

| — products: array (nullable = true)
| | — element: struct (containsNull = true)
| | | — created: string (nullable = true)
| | | — description: string (nullable = true)
| | | — fulfilment: array (nullable = true)
| | | | — element: struct (containsNull = true)
| | | | | — id: string (nullable = true)
| | | | | — status: string (nullable = true)
| | | | | — quantity: struct (nullable = true)
| | | | | | — lineIds: array (nullable = true)
| | | | | | | — element: string (containsNull = true)
| | | | | | — number: string (nullable = true)
| | | | | | — measure: string (nullable = true)
| | | | | | — size: string (nullable = true)
| | | | | — products: array (nullable = true)
| | | | | | — element: struct (containsNull = true)
| | | | | | | — id: string (nullable = true)
| | | | | | | — quantity: struct (nullable = true)
| | | | | | | | — ids: array (nullable = true)
| | | | | | | | | — element: string (containsNull = true)
| | | | | | | | — number: string (nullable = true)
| | | | | | | | — measure: string (nullable = true)
| | | | | | | — reason: string (nullable = true)
| | | | | — tracking: struct (nullable = true)
| | | | | | — number: string (nullable = true)
| | | — gtin: string (nullable = true)
| | | — lines: array (nullable = true)
| | | | — element: struct (containsNull = true)
| | | | | — created: string (nullable = true)
| | | | | — id: string (nullable = true)
| | | | | — lastUpdated: string (nullable = true)

java

case class DataSource(???)
case class DataTarget(???)
 
def readData(source:String): Array[DataSource] = {
 // read data from externa source or from hdfs
}
 
def mapData(sourceDS: Dataset[DataSource]): Dataset[DataTarget] = {
 // change column names, add ingestion timestamp etc.
}
 
import spark.implicits._
val source = ???
val targeTable = ???
 
val sourceData = readData(source)
val targetDS = mapData(sourceData.toDS)
targetDS.writeTable(targetTable)
1_YZIYq3-NmgGjeamzHqLVEAAA

Liked the article?

Share it with others!

explore more on

Take the first step to a sustained competitive edge for your business

Get your free consultation

VirtusLab's work has met the mark several times over, and their latest project is no exception. The team is efficient, hard-working, and trustworthy. Customers can expect a proactive team that drives results.

Stephen Rooke
Stephen RookeDirector of Software Development @ Extreme Reach

VirtusLab's engineers are truly Strapi extensions experts. Their knowledge and expertise in the area of Strapi plugins gave us the opportunity to lift our multi-brand CMS implementation to a different level.

facile logo
Leonardo PoddaEngineering Manager @ Facile.it

VirtusLab has been an incredible partner since the early development of Scala 3, essential to a mature and stable Scala 3 ecosystem.

Martin_Odersky
Martin OderskyHead of Programming Research Group @ EPFL

The VirtusLab team's in-depth knowledge, understanding, and experience of technology have been invaluable to us in developing our product. The team is professional and delivers on time – we greatly appreciated this efficiency when working with them.

Michael_Grant
Michael GrantDirector of Development @ Cyber Sec Company