Get to know better #VLteam! – The Data Team

In this article, we want to shed some light on how we work in the Data Space and how we managed to use our Scala experience to become proficient in Apache Spark and extend our capabilities to Python and Machine Learning over the last couple of years.


Get to know the Data Team!

At VirtusLab we have a long history with the Scala language. In 2013 we organized the first Scala-centric meetup in Krakow, and since then we have been creating and contributing to multiple Scala libraries. Recently we even combined forces with the Scala Center and Martin Odersky to work on Scala 3. So it seemed obvious for us to focus on one of the largest Scala-based (and for quite some time even the most active) open source projects: Apache Spark. And so we did around 2016, which can be seen as the beginning of our Data Space department. 


After-work drinks on the roof of our Krakow office

Some data about The Data

To be somewhat objective, we asked some of our engineers from different projects to share their thoughts and impressions. But before that, let’s first look at some data about the Data Space.

Currently, our team consists of almost 50 engineers working across 10 projects for 3 external clients, mostly using 2 programming languages: Scala and Python. Our biggest open source contribution is the pandas-stubs library, with more than 95k downloads per month from PyPI. We co-organize or speak at multiple meetups and conferences across Europe, such as Sphere.it, Datamass, Data KRK, Tesco Data Mashup, Krakow Scala Users Group, Data Science Rzeszow and MLKRK.

As mentioned before, our space initially focused on Data Engineering work, mainly using Apache Spark and Scala on a large, on-prem Hadoop cluster to create multiple domain aggregates for one of the largest UK retailers. Apart from building new teams and projects in that area (such as the creation of new analytical data profiles or tools for exploring metadata on Hadoop), we extended our cooperation with the same client by creating systems that parallelise, scale and productionise Machine Learning Python code using PySpark. We were also given responsibility for creating new, efficient ML pipelines in Azure. As a part of that work, we gained a lot of expertise in Python and, as usual at VirtusLab, we worked with the community to improve the ecosystem by publishing new libraries. Many of our ML engineers have Scala experience, so we worked hard to make Python more stable and reliable (see the pandas-stubs library above, type safety FTW!).

As the next step in our evolution, we started cooperating with two completely different clients:

  • A modern and well-funded insurance company from the UK, for whom we are creating a generic, cloud-based (AWS EMR) framework built on Spark. We use that framework to ingest, transform, validate and share data from various external sources. As there is a lot of focus on Scala code quality and Functional Programming, we also cooperate with some world-renowned Scala consultancies.
  • A global automotive manufacturer and supplier, for whom we provide consultancy in the Machine Learning and Data Science areas. Recently, we optimised an extensive industrial line using ML.
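To give a rough flavour of the ingest, transform and validate steps mentioned above, here is a minimal, single-machine sketch in plain Python. It is not the client framework (which is built on Spark and Scala); the field names `policy_id` and `premium` and the functions `parse`, `validate` and `run_pipeline` are hypothetical and exist only for illustration.

```python
def parse(raw):
    """Transform step: normalise a raw record (hypothetical field names)."""
    return {
        "policy_id": raw["policy_id"].strip(),
        "premium": float(raw["premium"]),
    }


def validate(record):
    """Validation step: return a list of problems, empty when the record is fine."""
    problems = []
    if not record["policy_id"]:
        problems.append("missing policy_id")
    if record["premium"] < 0:
        problems.append("negative premium")
    return problems


def run_pipeline(raw_records):
    """Split the input into valid records and rejected ones with their reasons."""
    valid, rejected = [], []
    for raw in raw_records:
        record = parse(raw)
        problems = validate(record)
        if problems:
            rejected.append((raw, problems))
        else:
            valid.append(record)
    return valid, rejected


# "Ingested" data would normally come from an external source; here it is inlined.
raw = [
    {"policy_id": " P-1 ", "premium": "120.50"},
    {"policy_id": "", "premium": "-5"},
]
valid, rejected = run_pipeline(raw)
```

Keeping the steps as small pure functions that return values instead of mutating shared state is one way to follow the Functional Programming focus described above; rejected records are kept together with the reasons so they can be reported back to the data source.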

Apart from consultancy and working on open source projects, we also like to learn and teach about different aspects of Computer Science, Engineering and Machine Learning. We have a couple of certified Spark and Scala trainers, a couple of people who hold or are pursuing a PhD, and lots of people with various certifications (mostly Scala, Azure and Spark). Thanks to Virtusity (VirtusLab’s own training division), anyone from the team can get a certification or even become a trainer.

That’s enough of the measurable aspects of our work.


Maciej, Software Engineer

What are you working on currently?

I’m a software developer on a greenfield project that is delivering a new framework for onboarding data from various sources into the client’s platform. I get to work with many exciting technologies like Kubernetes, Kafka and Spark, and most of my tasks involve working on new features with a lot of freedom in how the solution is implemented, which of course means learning a lot of things the hard way. Also, since it’s a general framework to be used across the enterprise, there is an opportunity to learn about many of the domains our client is involved in.

What was your biggest challenge in the Data space?

I think the most challenging part was the start of my current project: we had to come up with the architecture, infrastructure and tooling (none of which was standardised in the company yet) while convincing the leadership that the advantages were worth the significant time investment.

What do you do in your free time?


Joanna, Machine Learning Engineer/Team Lead

What are you working on currently?

As a software engineer, my main focus is developing a framework used for data forecasting. I am also responsible for leading a small team. I do everything — from weekly work planning to ensuring we meet client needs.

What was your biggest challenge in the Data space?

Currently, our team is working on deploying our framework and data science pipelines onto an Azure/Databricks architecture. This includes adapting our designs and changing our workflows for the cloud environment.

What do you do in your free time?


Michał, Data Engineer

What are you working on currently?

Our team is getting all product-related data from many different sources into one analytical platform for a major grocery retailer. Technologies used: Scala, Spark, Kafka and Akka Streams.

What was your biggest challenge in the Data space?

One of the most interesting problems was combining multiple streams of events to create a history of product entities that is easy for analysts to use. It required performance tuning of Spark jobs to process billions of events in a timely fashion.
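The idea of folding several event streams into one per-product history can be illustrated with a small sketch. This is a simplified, single-machine version in plain Python, not the Spark jobs from the project; the `Event` shape, the field names and the `build_history` function are assumptions made up for this example, and each input stream is assumed to be already sorted by timestamp.

```python
from collections import defaultdict
from heapq import merge
from typing import NamedTuple


class Event(NamedTuple):
    timestamp: int    # event time, e.g. epoch seconds
    product_id: str
    field: str        # attribute changed by the event (hypothetical names)
    value: object


def build_history(*streams):
    """Merge time-sorted event streams and fold them into per-product histories.

    Each history is a list of (timestamp, snapshot) pairs, where the snapshot
    is the full known state of the product right after that event.
    """
    state = defaultdict(dict)      # latest known attributes per product
    history = defaultdict(list)    # ordered snapshots per product
    # heapq.merge requires every input stream to be sorted by the same key.
    for event in merge(*streams, key=lambda e: e.timestamp):
        state[event.product_id][event.field] = event.value
        snapshot = dict(state[event.product_id])  # copy, so later events do not alter it
        history[event.product_id].append((event.timestamp, snapshot))
    return dict(history)


prices = [Event(1, "p1", "price", 2.50), Event(4, "p1", "price", 2.20)]
names = [Event(2, "p1", "name", "Oat milk"), Event(3, "p2", "name", "Rye bread")]

history = build_history(prices, names)
```

An analyst can then read the state of a product at any point in time from a single list instead of joining the raw streams. At the scale of billions of events the same fold would have to be distributed and tuned, which is where the Spark work mentioned above comes in.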

What do you do in your free time?


Łukasz, Data Engineer

What are you working on currently?

I’m creating a large business aggregate using Spark.

What was your biggest challenge in the Data space?

As a data engineer, I don’t really have one particular hard challenge… I have many small ones, which all seem to be related. Together they create complexity and make a system or its data hard to understand. The biggest one, and something I continue to try to get better at, is decomplecting my problems: simplifying them and solving the smaller parts.

What do you do in your free time?


Wojciech, Machine Learning Engineer

What are you working on currently?

As a Machine Learning Engineer, I am responsible for keeping everything up and running as well as for developing new features. On a daily basis, we maintain a cross-platform, ML-driven project in the pricing domain for one of the biggest British grocery retailers. Additionally, I am working on industrial line optimisation for a German automotive supplier.

What was your biggest challenge in the Data space?

The most challenging thing was learning the cloud and moving to production in a short time. Finding an out-of-the-box way to apply and validate ML models for the production line was also a real challenge.

What do you do in your free time?

After work, I like to meet with friends, read books, and go to the cinema or theatre. From time to time, I like to play some video games (I have a soft spot for Gothic).

Fancy a Data job at VirtusLab?

Now that you know how we work and who we are, you should consider joining us in the Data Space, as we are growing considerably! Here are some open positions you may find interesting:

Written by

Mikołaj Kromka, Principal Software Engineer, Aug 12, 2021