



This is a story of four software engineers who spent two months optimizing the performance of a web application in the cloud, and who succeeded. How did we get there? The situation may sound familiar to many of you.
There is a web service that is part of a larger system in a big corporation. The service can handle the current load (below 100 RPS), but a significant increase in traffic (up to 1000-2000 RPS) is expected soon, as more parts of the system are migrated to a new service-based architecture. Our performance tests, as well as some incidents in the past, have shown that we would not get even close to the expected performance goal, which is to keep the P99 response time below 200 ms at the expected request rate.
Our solution to the problem? Take four software engineers out of their core teams and have them work for a couple of months exclusively on improving the performance of the application. Did this decision pay off? What were we able to figure out and fix in this time? This article aims to answer these questions.
First day of our work in the new team: all four engineers put their hands on the code and work full time analyzing performance bottlenecks and fixing them. Is this what happened? Wrong. Quite the opposite: we put the brakes on and spent over two weeks working exclusively on automating all the repeatable tasks: spinning up new test environments, running tests in a reliable and repeatable way, and collecting and exposing test results.
What was the alternative? In fact, we could have saved ourselves all this effort. We had two spare environments (pre-production and development), we had a Jenkins instance that could spin up additional environments, we had a basic performance test suite (OK, this one had to be improved anyway), and we had a monitoring infrastructure set up to collect logs from all running application instances. Why spend several weeks just rebuilding all of this? Because when doing any serious performance analysis, you will most likely need to run dozens or hundreds of tests, and the last thing you want is to constantly wait for a few shared resources to become available and to be able to run only one test at a time. (Note: each test would usually take many minutes to run and would have to be repeated after any problem or any change of the requirements.) Hundreds of test results, dozens of dimensions (different cluster sizes, computer architectures, multiple code improvements) – a human can keep track of all these test runs and analyze them, but, believe me, a computer is much better at it.
Eventually, we applied a couple of automation techniques. Note: the choice of technologies was driven by the business decisions our company had already made to buy particular solutions (the cloud provider, the monitoring tool), not by our personal preferences.
The table below shows a snippet of the monitoring dashboard we created. Each row in the table represents a specific test scenario, and each column represents a particular test execution identified by a unique number and a name we assigned. The table shows basic performance metrics per test execution and scenario; for example, “5133 ms – 160 rps” means that the 99th percentile of the execution time was 5133 milliseconds and the achieved response rate was 160 responses per second.
Fig. 1 Splunk table showing test results for two different test runs
Developing this test framework consumed several weeks of work by four engineers, and running tests in parallel required several application instances to be deployed (multiple cloud VMs per instance), all of which came at a significant cost. Did it pay off? It is hard to estimate how much engineering time (and frustration) was saved by being able to quickly set up, run, and analyze multiple tests with arbitrary parameters. Beyond that, achieving our goal of making the application meet its performance requirements, and saving money by choosing the right instance types for our VMs, would not have been possible in such a short time without the test framework in place.
As human beings, we love numbers. We need numbers to show that we know what we are doing and to prove that we are right or that someone else is wrong. As engineers, we really need numbers to be data-driven and make informed decisions. However, we must bear in mind that these cannot be just “any numbers”; they must be “the numbers”, otherwise they may cause more harm than good.
Our application was written following good engineering standards, and it collected and reported different types of monitoring data, including application processing time, database time, and total response time. So far so good, but why did the numbers reported by the application appear to differ by two orders of magnitude from those reported by Gatling? For example, our monitoring dashboard would happily inform us that we had met our performance goal of a P99 time below 200 ms, while the Gatling report showed response times constantly exceeding 1000 ms or even timing out. On the other hand, our logs complained about database response times as high as 1000 ms or 2000 ms, while the database (MongoDB) profiler logs did not report any operation exceeding 100 ms and the CPU consumption of the database instances stayed below 10%.
To understand what was actually going on, we first had to build a mental performance model of our application. Simplifying things a bit, request processing in our application can be described by the following pipeline.
Fig. 2 Request processing pipeline in the application
The first big revelation for us was that not all parts of the process were included in the reported request processing time. I/O communication was handled by the framework, so the measurements taken in the application code did not include the I/O time. Moreover, response preparation, which included HTTP compression, turned out to be much more time consuming than we originally thought; as we will see later, optimizing this part of the process brought us a significant performance gain. Regarding database processing times, we realized that we should not blame the database instances alone for these times being too high, as a bottleneck in the MongoDB client connection pool could add delays that we perceived and reported as database latency.
Later on, when optimizing individual parts of the processing pipeline, we could observe how bottlenecks shifted from one element to another. For example, by optimizing application performance we could increase the request processing rate and thereby put extra load on the database, which in turn increased the actual database processing time.
In order to get numbers we could really trust, we made several improvements to our monitoring framework.
“All performant applications are alike; each application beyond its breaking point is broken in its own way.” If Leo Tolstoy had been a software developer, he would have started “Anna Karenina” with this sentence. What happens if your application cannot handle the load you are putting on it? It depends. The response time may grow several times over, eventually timing out. The response time may stay lower, but without the expected response rate being reached. Requests may start failing due to insufficient resources. The database may start failing or increase its response times. Which of these outcomes is preferable? Should we call a test run more successful if the response times were over 2000 ms, or if response times were below 1000 ms but 2% of the requests failed? The most honest answer is “I don’t know” – both are simply bad; the application should be fixed or the load should be limited. It is safe to say that the results of tests executed beyond the breaking point of the application cannot be compared.
When establishing our performance goals and creating performance test scenarios, we often realized that the goals were unrealistic, at least until we came up with a proper performance improvement. How did we keep our goals while still being able to test our application? We applied a simple workaround by establishing two types of test scenarios, which we called the limit and the goal scenario. The goal scenario was our original scenario; the limit scenario used a test setup that allowed any decently performing version of the application to reach the established response rate. If we came up with a proper optimization, the test would pass in both scenarios; if we came up with something completely misguided, it would fail in both. In all other cases, we could use the limit scenario to compare different solutions in a reliable way.
For example, the scenario PerformanceTestScenario1Goal would run the simulation with a request rate of 1000 RPS, while the scenario PerformanceTestScenario1Limit would run the same scenario with a request rate of 500 RPS. Again, the open source version of Gatling proved sufficient to implement this approach in a simple way.
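To give an idea of what this looks like in practice, here is a minimal sketch of such a pair written with Gatling’s Java DSL; the class name, endpoint, duration, and rates are illustrative rather than taken from our actual test suite, and the goal/limit split is driven by a system property.

import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import java.time.Duration;
import io.gatling.javaapi.core.ScenarioBuilder;
import io.gatling.javaapi.core.Simulation;
import io.gatling.javaapi.http.HttpProtocolBuilder;

public class PerformanceTestScenario1 extends Simulation {

  // goal run: -DrequestRate=1000, limit run: -DrequestRate=500
  private static final double REQUEST_RATE =
      Double.parseDouble(System.getProperty("requestRate", "500"));

  private final HttpProtocolBuilder httpProtocol =
      http.baseUrl("http://localhost:8080"); // hypothetical service URL

  private final ScenarioBuilder scn = scenario("Scenario1")
      .exec(http("list stores").get("/stores?field1=value1&offset=0&limit=10")); // hypothetical endpoint

  public PerformanceTestScenario1() {
    setUp(
        // each virtual user sends one request, so users per second approximates requests per second
        scn.injectOpen(constantUsersPerSec(REQUEST_RATE).during(Duration.ofMinutes(5)))
    ).protocols(httpProtocol);
  }
}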
The database is an obligatory and sometimes even central part of almost every serious computer application. The key advantage of using a database is that it delegates the responsibility of data management, a complex and tedious task common to many application types, to a dedicated software component that will do it better than any solution handcrafted by you. But no matter how awesome your database may be, you cannot treat it like a magic black box that will always deliver maximum performance irrespective of your data model or query patterns. You should show your database something commonly referred to as mechanical sympathy, so that you get the maximum out of it and do not fall into one of the many pitfalls lurking around.
The first discovery that we made may seem obvious at first glance, but for us it was not obvious at all, and it gave us our first significant performance boost. It is elementary that to get reasonable performance from a query you may need to create the proper index(es) supporting that query. What may be less clear is that for a query spanning multiple fields you may need a compound index rather than multiple single-field indexes. For example, a MongoDB index { "field1" : 1, "field2": 1 } cannot be replaced by the two indexes { "field1" : 1 } and { "field2": 1 }. By introducing the missing compound index we were able to decrease the response times in a certain test scenario from 969 ms to 26 ms.
The second discovery came soon afterward and was closely related to the first one. By misinterpreting the MongoDB documentation about the key order in a compound index, I erroneously assumed that the field used as the sorting key in the query should be put in the first position of the compound index, i.e., that we should specify the index { "field2": 1, "field1" : 1 } if the query involves sorting by field2. When profiling the MongoDB query and performing further tests, we realized that the opposite was true. What really mattered was the key used for filtering, not sorting. Therefore, if the query filters on field1 and sorts by field2, then the optimal compound index definition is { "field1" : 1, "field2": 1 }.
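Translated into MongoDB Java driver calls, the index and the query it supports could look like the following sketch (collection and field names are the placeholders used above, not our real schema).

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

class CompoundIndexSketch {
  static void createIndexAndQuery(MongoCollection<Document> collection) {
    // one compound index with the filter key first and the sort key second;
    // two single-field indexes on field1 and field2 would not serve this query as well
    collection.createIndex(Indexes.ascending("field1", "field2"));

    // a query that filters on field1 and sorts by field2 can use the index above directly
    collection.find(Filters.eq("field1", "value1"))
        .sort(Sorts.ascending("field2"))
        .forEach(doc -> System.out.println(doc.toJson()));
  }
}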
The third discovery related to indexes revealed a problem in our data model. One of the limitations of compound indexes in MongoDB is that no more than one key in the index can be an array. This limitation prevented us from defining yet another index for a query over the fields scalarType, arrayType1, and arrayType2, because two of these fields were array fields. Had we been aware of this limitation of MongoDB when designing our data model, we might have arrived at an alternative design. The field arrayType1 in our system almost always had just one value; defining it as a scalar, and specifying an additional scalar field for the rare cases that required more than one value, would have allowed us to create an optimal compound index for the problematic query.
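A hedged sketch of that alternative design, again with the MongoDB Java driver: the field names arrayType1Main and arrayType1Extra are invented for illustration, the point being that only one array field (arrayType2) remains in the index key set.

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

class AlternativeModelIndexSketch {
  static void createIndex(MongoCollection<Document> stores) {
    // arrayType1 becomes a scalar (arrayType1Main); the rare additional value goes into a
    // separate scalar field (arrayType1Extra) that is not part of the index.
    // With arrayType2 as the only array field among the keys, MongoDB accepts the compound index.
    stores.createIndex(Indexes.ascending("scalarType", "arrayType1Main", "arrayType2"));
  }
}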
It is worth mentioning that certain databases impose even stricter limitations on indexes. For example, at the time of writing, CosmosDB does not allow any nested fields or arrays in either compound or unique indexes.
Another discovery that we made when profiling our MongoDB queries was that count queries are not necessarily faster than find queries. One of our most problematic service methods performed data listing with pagination: given offset, limit, and query parameters, the method would return a data chunk together with the total number of results. For example, the query field1=value1&field2=value2&offset=20&limit=10 would return items 20 to 29 together with the information that there were 173 items in total. Query profiling showed that providing the total number of stores (a count operation) was more time consuming than the data listing itself (a find operation).
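A minimal sketch of such a paginated listing with the MongoDB Java driver is shown below; the PagedResult type and the field names are illustrative.

import java.util.ArrayList;
import java.util.List;

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;
import org.bson.conversions.Bson;

class PagedListingSketch {

  record PagedResult(List<Document> items, long total) {}

  static PagedResult list(MongoCollection<Document> stores,
                          String value1, String value2, int offset, int limit) {
    Bson filter = Filters.and(Filters.eq("field1", value1), Filters.eq("field2", value2));

    List<Document> items = new ArrayList<>();
    stores.find(filter).skip(offset).limit(limit).into(items); // the "find" part

    long total = stores.countDocuments(filter); // the "count" part – surprisingly, the slower of the two
    return new PagedResult(items, total);
  }
}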
Unless you are a Jedi master of the highest rank, letting your actions be driven solely by intuition is a risky business. One of the lessons that any engineer working on a serious optimization task will learn is that every decision must be supported by data. In our work we encountered several situations in which the experimental results differed significantly from, or were even the exact opposite of, our original expectations.
One of the first cases in which I jumped to conclusions based on incomplete data was when we discovered the discrepancy between the times reported in the performance test results and by our application (see chapter “Trust your numbers”). Since at the same time we had also observed some performance issues on the load tester machine (Gatling was running on a slow computer and had trouble generating requests at 2000 RPS), I quickly concluded that the whole difference should be attributed to the client (the load tester) over-reporting response times while the server (our application) was doing just fine. A simple test performed by my colleague, who simulated our server with an nginx instance serving fake data, proved that Gatling was perfectly capable of handling even 4000 RPS. It was definitely our service that was causing the problems.
Another interesting case in which real data put my original assumptions to the test was when we had gained a basic understanding of how requests were processed in our application and had observed problems with service methods returning large response payloads. Basically, the HTTP response message could have a size of up to 50 KiB. My first thought was that increasing the compression level would help fix the issue: a higher compression rate would mean a smaller output, which would result in a shorter response time. Experiments showed that changing the compression level did indeed decrease the response times, but the parameter had to be decreased, not increased: a lower compression level meant less time spent compressing the output, which gave our application a real performance boost.
A typical computer system has dozens or hundreds of knobs and switches to be tuned. In the HTTP server settings, you can manipulate the data frame size, the output compression level, or the header size, among many others. The database client may let you modify the connection pool size or connection timeouts, and the framework on which your application is based will surely offer many other adjustments to the default settings.
When running performance tests you will commonly recognize that a particular part of the system (network interface, database, framework) is the performance bottleneck of a certain operation. Your natural reaction will be to look at the library settings, trying to find a certain parameter whose change may improve the application performance. Will it work?
Our experience is rather disappointing. When tuning our application, we tried changing the default values of several parameters: the HTTP output compression level, the number of parallel streams, the data chunk size, the MongoDB connection pool size, and the number of verticles (the level of parallelism in the toolkit we use). Out of all these attempts, only changing the compression level turned out to be successful and was applied in production. What was wrong with the other attempts? They either brought no benefit (or even made things worse), improved the performance of certain operations while degrading others, or showed some benefit but were too risky for us to apply.
Does this mean that you should stop messing with default values and treat them as a holy book that must never be touched? By no means. But this is a moment when having a powerful and reliable experimental framework comes in handy (see chapter “Automate everything”). With multiple parameters to be tuned in various ways and multiple operations to be tested, you will most likely need to perform dozens of experiments before you can assess whether a certain change should be applied. Just remember that the default values were established by some wise people based on extensive experience, and the chances are high that they are the best possible values in your case as well. Don’t get disappointed if you discover exactly that.
A disclaimer to start with: since we came to the task of performance tuning from our regular software development positions and were learning how to do it by actually doing it, our knowledge of the available analytical tools was rather limited. However, since we were able to achieve reasonable performance improvements by discovering the capabilities of our everyday tools and picking up one or two new ones, we thought that sharing our experience could still be beneficial. Having said that, let’s start.
As already mentioned in the chapter “Automate everything”, having a reliable monitoring system to collect and visualize performance data is a must. Configure the monitoring tool already used in your system to collect all the metrics that you need, and visualize the results of test runs side by side for comparison. Of all the tools we used, the monitoring dashboard was undoubtedly the most frequently used and probably also the most useful one.
For database monitoring, configure your database to log performance data that helps you assess whether the root cause of a given problem is actually in the database or somewhere else (as shown in the chapter “Trust your numbers”, the answer to this question is not always obvious). By default, MongoDB logs every operation that takes more than 100 ms. These logs were sufficient for us to analyze database performance, find missing indexes, and optimize queries.
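If you need to record more detail than the default slow-operation log, the profiler can be reconfigured at runtime. Here is a minimal sketch with the MongoDB Java driver; profiling level 1 records operations slower than slowms into the system.profile collection.

import com.mongodb.client.MongoDatabase;
import org.bson.Document;

class ProfilerSetupSketch {
  static void enableSlowOperationProfiling(MongoDatabase database) {
    // the "profile" command returns the previous settings
    Document previous = database.runCommand(new Document("profile", 1).append("slowms", 100));
    System.out.println("previous profiler settings: " + previous.toJson());
  }
}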
Fig. 3: Sample MongoDB profiler log
The figure above shows a log of an inefficient MongoDB query, which filters over three fields (region.isoCountryCode, types, facilities.type) while using an index on just one field ({ _id: 1 }). The reason this index is used is that a suitable compound index does not exist, so the database has to use a simple one instead. (As mentioned earlier in the chapter “Know your database”, MongoDB does not allow creating a compound index if more than one field is an array, and both types and facilities.type are arrays.)
Fig. 4: Experimenting with MongoDB indexes
The listing above shows a sample MongoDB query executed four times, each time using a different index. Depending on the index used, the database engine performs a different number of seeks and examines a different number of keys, which has a direct impact on query performance.
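An experiment like this can be reproduced by forcing each index with a hint and asking the server to explain the execution. The sketch below uses the MongoDB Java driver (FindIterable.explain() requires driver 4.2 or newer) and the placeholder field names from earlier; the hinted indexes must already exist on the collection.

import com.mongodb.ExplainVerbosity;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import org.bson.Document;
import org.bson.conversions.Bson;

class IndexExperimentSketch {
  static void compareIndexes(MongoCollection<Document> stores) {
    Bson filter = Filters.and(Filters.eq("field1", "value1"), Filters.eq("field2", "value2"));

    Bson[] hints = {
        Indexes.ascending("field1"),
        Indexes.ascending("field2"),
        Indexes.ascending("field1", "field2")
    };

    for (Bson hint : hints) {
      // run the same query with a forced index and collect execution statistics
      Document stats = stores.find(filter)
          .hint(hint)
          .explain(ExplainVerbosity.EXECUTION_STATS)
          .get("executionStats", Document.class);

      System.out.println(hint + " -> keys examined: " + stats.get("totalKeysExamined")
          + ", docs examined: " + stats.get("totalDocsExamined"));
    }
  }
}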
Probably the most advanced tool that we found useful in our research was async-profiler. Installed on the VM of an application instance, the tool can monitor the application for a set period of time and measure how much time is spent in each method. A typical way to visualize an application profile is a flame graph. A flame graph takes the form of a flame representing stacks of nested method calls, with the root method call at the bottom and the most deeply nested calls at the top. Each method call is represented by a horizontal bar whose length is proportional to the amount of time spent in that method.
Fig. 5 A flame graph of a sample application run
A flame graph can be analyzed by looking for unusually long horizontal bars, which may indicate that certain methods take more time than expected.
Async-profiler helped us find and fix two performance problems. The first one was related to JSON deserialization. The profiler made us aware that JSON deserialization was taking a surprisingly large share of the time, and a careful code analysis showed that we performed a two-step conversion between the BSON and Jackson representations, while it was possible to use the Jackson method com.fasterxml.jackson.databind.ObjectMapper.convertValue to perform the conversion in a single step. Thanks to this optimization, several of our service methods started reaching their performance goals.
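The change itself was small; a minimal sketch of the single-step conversion is shown below, with StoreDto as a hypothetical target type.

import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.bson.Document;

class BsonToDtoSketch {

  @JsonIgnoreProperties(ignoreUnknown = true) // the raw document carries many fields the DTO does not need
  static class StoreDto {
    public String name;
    public String address;
    public String type;
  }

  private static final ObjectMapper MAPPER = new ObjectMapper();

  static StoreDto toDto(Document mongoDocument) {
    // org.bson.Document implements Map<String, Object>, so Jackson can map it to the DTO directly,
    // without serializing to a JSON string first
    return MAPPER.convertValue(mongoDocument, StoreDto.class);
  }
}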
Fig. 6 HTTP output compression visualized in a flame graph
The second performance problem was related to HTTP output compression. As shown in the figure above, we noticed that over 40% of the request processing time was spent compressing the HTTP response. The problematic service method produced particularly large outputs (up to 50 KiB) consisting of raw UUIDs, which made them particularly hard to compress. Decreasing the HTTP compression level from 6 (the default value) to 4 significantly improved the performance of the method.
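In Vert.x, which our service is built on, such a change comes down to a single option on the HTTP server; the sketch below is illustrative (the handler and port are made up), not our production configuration.

import io.vertx.core.Vertx;
import io.vertx.core.http.HttpServerOptions;

class CompressionTuningSketch {
  public static void main(String[] args) {
    HttpServerOptions options = new HttpServerOptions()
        .setCompressionSupported(true) // compress responses when the client accepts it
        .setCompressionLevel(4);       // default is 6; 4 trades a slightly larger payload for less CPU time

    Vertx.vertx()
        .createHttpServer(options)
        .requestHandler(req -> req.response().end("{\"status\":\"ok\"}"))
        .listen(8080);
  }
}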
Additional insight into this case came from something as simple as the Unix top command, which can also serve as a profiling tool. When running the experiment on a quad-core processor, we observed the CPU usage reaching only 50% even though the service method was clearly missing its performance goals. The reason for this low resource utilization was that HTTP output compression was performed on the single event-loop thread and, as a consequence of Amdahl’s Law, could not be sped up by parallelization (unlike other parts of request processing, such as I/O or data processing). A possible solution would be to configure our application to use more than one event loop, but for the reasons discussed in the next chapter, we decided to hold back on that optimization for the time being.
One of my favorite quotes states that there may be fewer constraints in the problem domain than you first think, and that you can therefore often be creative when searching for possible solutions rather than sticking to the most conservative approach. Giving our actual limitations a second thought, we were able to come up with ideas that could eventually fix almost all of our performance problems. None of them could be applied yet, but they have been tested and remain an outcome of our research, to be potentially worked on in the future.
The first idea we had is common and widely applicable in many computer systems: static data is easy to cache. The problem we encountered with compound indexes in MongoDB prevented us from creating an index to optimize a count operation in a certain service method (see chapter “Know your database”). However, because our data was effectively static (we expected only a few updates per day) and the count operation was easy to cache (a single number stored per query), we concluded, and proved experimentally, that a simple cache would fix the performance problems of that method.
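A minimal sketch of such a count cache; the string key (a canonicalized query) and the database-backed loader are illustrative.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.ToLongFunction;

class CountCacheSketch {
  private final ConcurrentMap<String, Long> counts = new ConcurrentHashMap<>();
  private final ToLongFunction<String> countFromDatabase; // e.g. a MongoDB countDocuments call

  CountCacheSketch(ToLongFunction<String> countFromDatabase) {
    this.countFromDatabase = countFromDatabase;
  }

  long count(String canonicalQuery) {
    // run the expensive count only on a cache miss
    return counts.computeIfAbsent(canonicalQuery, countFromDatabase::applyAsLong);
  }

  void invalidate() {
    // data changes are rare, so dropping everything on a write is acceptable
    counts.clear();
  }
}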
Caching also turned out to be the right solution in the more complicated case of the find operation. A typical result of a find operation in our application could contain many kilobytes of data, which would make a cache consume too many resources given the large variety of possible queries. However, careful profiling of the problematic operation led us to another creative solution. As the results showed, the most time-consuming part of the operation was deserializing the MongoDB data, while communicating with the database and sending the response were reasonably fast. Given that the total size of our data was well below 1 GiB (another special characteristic of our application), we could simply load all the data into memory and use it to construct the response. Handling a request would still involve a database find operation (it is better to rely on the database’s query mechanisms than to reinvent them), but the results of that operation would contain only object ids, while the object bodies (already deserialized) would be served from memory.
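A hedged sketch of this idea: the database evaluates the query but returns only the ids, and the document bodies come from an in-memory map that is reloaded whenever the data changes. The collection handling and the reload strategy are illustrative.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Projections;
import org.bson.Document;
import org.bson.conversions.Bson;
import org.bson.types.ObjectId;

class PreloadedStoreCacheSketch {
  private final Map<ObjectId, Document> byId = new ConcurrentHashMap<>();
  private final MongoCollection<Document> stores;

  PreloadedStoreCacheSketch(MongoCollection<Document> stores) {
    this.stores = stores;
    reload();
  }

  final void reload() {
    // load and deserialize everything once; the total data size is well below 1 GiB
    byId.clear();
    for (Document doc : stores.find()) {
      byId.put(doc.getObjectId("_id"), doc);
    }
  }

  List<Document> query(Bson filter) {
    List<Document> result = new ArrayList<>();
    // let MongoDB evaluate the query, but fetch only the ids
    for (Document idOnly : stores.find(filter).projection(Projections.include("_id"))) {
      Document cached = byId.get(idOnly.getObjectId("_id"));
      if (cached != null) {
        result.add(cached);
      }
    }
    return result;
  }
}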
Experiments showed that constructing caches for both count and find would solve our remaining performance problems. OK, but what about cache invalidation? Our data was effectively static, but it could still be modified from time to time, and while being repopulated the caches could not be used, so we would suffer the original performance problems (P99 response time over 1000 ms). Again, our “box” turned out to be large enough to accept the risk: the performance requirement for our application was to handle sudden request spikes of over 1000 RPS, not a constantly high load (the average traffic was expected to be rather low). Given that the probability of these infrequent spikes coinciding with the even less frequent and relatively short cache repopulation periods was low, we could safely accept the risk and apply a solution involving a static cache.
Unfortunately, knowing the box can also mean accepting that the box is as small as (or even smaller than) we originally assumed. Yet another promising idea of ours had to be shelved for exactly that reason. The idea was to apply a basic database mechanism to speed up our queries: projection. A single domain object (usually a store) was represented in our database as a document of up to several kilobytes. Apart from basic store properties (name, address, type), the document contained dozens of deprecated properties used by legacy systems as well as large blobs of data, such as the history of the store’s opening hours. Since most requests asked only for the basic properties, an obvious optimization was to use projection to limit the database query to the requested fields; the reduction in the amount of data to be read, transferred, and deserialized would result in a significant performance improvement, which we were able to prove experimentally.
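A hedged sketch of what such a projection could look like with the MongoDB Java driver; the projected fields follow the basic properties mentioned above, and region.isoCountryCode is the filter field shown in the profiler log earlier.

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Projections;
import org.bson.Document;

class BasicPropertiesQuerySketch {
  static void printBasicProperties(MongoCollection<Document> stores, String countryCode) {
    stores.find(Filters.eq("region.isoCountryCode", countryCode))
        .projection(Projections.include("name", "address", "type")) // skip deprecated fields and large blobs
        .forEach(doc -> System.out.println(doc.toJson()));
  }
}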
The reason we failed to apply this solution lay in the design of the data access layer of our application. Due to certain design decisions regarding the data security implementation, we had opted for a generic data repository returning a common domain object for an arbitrary low-level query, rather than the approach more commonly used in DDD, with specialized repositories returning different object types for different abstract queries. Applying projections in our model would require redesigning the way data security is implemented in our system, so we decided to put it off as a potential future improvement.
Another improvement that we could potentially apply was related to the number of verticles. A verticle in the Vert.x toolkit is similar to an actor in the Actor Model and acts as a unit of parallelism: all the operations performed by a single verticle are guaranteed to be scheduled on a single event-loop thread and therefore execute sequentially. Since the very beginning, the number of verticles used in our service had been limited to just one, despite the service running on multi-core machines. Our experiments clearly showed that increasing the number of verticles would solve the performance problems of one particular service method. However, since our service had been developed on the assumption that the verticle code was single-threaded, we could not guarantee that it was truly thread-safe and could be safely parallelized by adding more verticles. Until the thread-safety of the service could be proven or otherwise guaranteed, we had to postpone this improvement as well.
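For reference, a hedged sketch of what the postponed change would amount to; StoreServiceVerticle is a hypothetical class name, and the verticle code would have to be proven thread-safe first.

import io.vertx.core.DeploymentOptions;
import io.vertx.core.Vertx;

class MultiVerticleDeploymentSketch {
  public static void main(String[] args) {
    DeploymentOptions options = new DeploymentOptions()
        .setInstances(Runtime.getRuntime().availableProcessors()); // one instance per core instead of the single default

    // each instance runs on its own event-loop assignment, so request handling can spread across more threads
    Vertx.vertx().deployVerticle("com.example.StoreServiceVerticle", options);
  }
}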
Two months passed, and our small team moved on to other challenges. What did we manage to achieve? The table below summarizes the performance improvements across the different scenarios.
Fig. 7 Comparison of performance results in 16 test scenarios at the beginning and the end of the performance team optimization
Out of the ten scenarios that had shown performance problems (seven of them critical), we were able to optimize seven and achieve visible improvements in the remaining three. Moreover, the longer-term improvements we proposed (caching) should fix the remaining issues (except perhaps for one particularly nasty scenario), as our performance experiments were able to show. The changes we applied to the production system helped us safely survive the increased load on our service, including some incidents in which our clients accidentally flooded us with requests. Apart from these measurable gains, the knowledge we gathered about our system and the technologies it uses would have been hard to obtain in any other way.
I can imagine that the path we took will not be applicable to everybody. Teams regularly suffer from acute workforce shortages and time pressure, performance gaps are often much wider and harder to bridge than in our system, or they are simply less relevant than bugs or missing features, so investing so much engineering time in fixing them may not make sense. However, if your planning horizon extends beyond the current fire to extinguish or the next urgent task due yesterday, investing some time in a thorough performance analysis and optimization of your application can be a valuable exercise, both for your team and for your own personal development.