Scaling up Python to parallel ML pipelines
We combined state-of-the-art Python tooling with programming practices and software development processes proven in other programming language ecosystems.
The client had a Python data science application for computing financial forecasts. Deploying a forecast for just one customer store out of thousands required many manual steps. The client’s team of data scientists needed to scale their Python machine learning models to multiple contexts, productionize new models quickly, and maintain the codebase on their own. Because Python’s ecosystem is less focused on production-grade solutions, building such a system with the client’s original approach would quickly have made maintenance expensive and development very slow or even impossible.
The development of new forecasts and the expansion of existing ones is an ongoing process, so we had to create an application architecture that makes it easy to build new components on top of already implemented functionality while keeping already productionized forecasts stable. To achieve that, we used state-of-the-art Python tools and brought in programming practices and software development processes from language ecosystems commonly used for production systems (such as Java, C#, and Scala).
We created a framework using Apache Spark’s Python API (PySpark) and Hadoop. It allows ML models for financial forecasts to be implemented and productionized rapidly (1). It also ensures their correctness to a degree uncommon in this environment, thanks to static type checking, workflow generation, and dependency injection, and makes it possible to apply the models to new problems (2). In addition, we sped up the computation of the models themselves (3). Applying state-of-the-art approaches to the software production process has multiplied the number of pipelines that can be maintained and developed in production (4).
- Rapid development of new ML models by data scientists: from weeks to days (a model with tests in a day, productionization in a matter of hours)
- New forecasting pipelines can be added in about 1-2 weeks (compared to months)
- Processing of ML models scaled up roughly 140 times (a radical speedup from 24 hours to 10 minutes in the initial version)
- 70+ production pipelines written in Python can be maintained and safely improved
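To make the ideas of static type checking, dependency injection, and workflow generation more concrete, here is a minimal sketch in plain Python. It is not the framework built for the client: all names (`SalesSource`, `InMemorySales`, `Step`, `build_workflow`, `forecast_pipeline`) are hypothetical, the "model" is a simple mean, and the standard library stands in for PySpark and Hadoop so that the example runs on its own.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Any, Callable, Dict, List, Protocol, Tuple


class SalesSource(Protocol):
    """Hypothetical interface: anything that returns sales history for a store."""

    def load(self, store_id: str) -> List[float]: ...


@dataclass(frozen=True)
class InMemorySales:
    """Test double that satisfies SalesSource; production code could read from Hadoop."""

    data: Dict[str, List[float]]

    def load(self, store_id: str) -> List[float]:
        return self.data[store_id]


@dataclass(frozen=True)
class Step:
    """One pipeline step, declared together with the steps it depends on."""

    name: str
    depends_on: Tuple[str, ...]
    run: Callable[[Dict[str, Any]], Any]


def build_workflow(steps: List[Step]) -> List[Step]:
    """Generate an execution order from the declared dependencies."""
    by_name = {step.name: step for step in steps}
    graph = {step.name: set(step.depends_on) for step in steps}
    return [by_name[name] for name in TopologicalSorter(graph).static_order()]


def forecast_pipeline(source: SalesSource, store_id: str) -> List[Step]:
    """The data source is injected, so the same pipeline runs in tests and in production."""
    return [
        # Steps are deliberately listed out of order; build_workflow sorts them.
        Step("forecast", ("mean",), lambda ctx: [ctx["mean"]] * 3),
        Step("history", (), lambda ctx: source.load(store_id)),
        Step("mean", ("history",), lambda ctx: sum(ctx["history"]) / len(ctx["history"])),
    ]


def execute(steps: List[Step]) -> Dict[str, Any]:
    """Run the steps in dependency order, collecting each result by step name."""
    ctx: Dict[str, Any] = {}
    for step in build_workflow(steps):
        ctx[step.name] = step.run(ctx)
    return ctx


source = InMemorySales({"store-1": [10.0, 20.0, 30.0]})
result = execute(forecast_pipeline(source, "store-1"))
print(result["forecast"])  # [20.0, 20.0, 20.0]
```

Because `forecast_pipeline` only requires something shaped like `SalesSource`, a static type checker such as mypy can reject a mismatched data source before the pipeline is deployed, and the execution order never has to be maintained by hand.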
Our main plan is to increase the scale of forecasts by an order of magnitude. We also want to use this battle-tested solution in the client’s other projects written in Python.
- Our solution can be adapted to any data science project using Python and Hadoop, regardless of the amount of data
- We have the knowledge and tools to develop very large Python-based production projects