Revisiting Scala Native performance


Wojciech Mazur

Tooling Software Developer

30-minute read

Scala Native is an ahead-of-time compiler and standalone runtime that allows compiling Scala code to machine code without the need for a JVM. The recent release of Scala Native 0.4.0 is a great opportunity to revisit the project and check how it compares to other technologies available in 2021. In this article, we will focus on its performance in comparison with execution on the JVM and with its closest competitor — Graal Native Image binaries. We will use the same set of Scala sources for each of them and check how the resulting programs behave in terms of performance and peak memory usage. These aspects are fundamental in the context of cloud and serverless runtime environments.

Setting up the environment

Scala Native

For Scala Native, I’ve decided to use a locally-built version from its then-latest commit 21bfe909, which I will refer to as 0.4.1-SNAPSHOT. I did this because it fixes performance regressions introduced in the 0.4.0 release compared to 0.4.0-M2; in some cases, they were responsible for up to 40% slower execution of the prepared binaries. For each run, I’ve used the default settings, including the Immix GC, with the only exceptions being changing the build mode to full release and setting Link Time Optimization to thin LTO.

Scala Native comes with 3 build modes:

  • debug — includes only a minimal set of optimizations; it is the default mode used by Scala Native
  • release-fast — takes longer and applies most of the optimizations that have a significant impact on final performance
  • release-full — applies all possible optimizations, but at the price of a much longer, non-parallelized optimization process.
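Assuming the standard sbt plugin setup, the settings I used can be expressed in `build.sbt` roughly like this (a sketch; check the Scala Native docs for the exact API of your plugin version):

```scala
import scala.scalanative.build._

// Build-time knobs used for the benchmarks in this article.
nativeConfig ~= {
  _.withMode(Mode.releaseFull) // all optimizations, at the cost of build time
   .withLTO(LTO.thin)          // thin Link Time Optimization
   .withGC(GC.immix)           // the default garbage collector, stated explicitly
}
```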

Scala Native needs LLVM installed; on my machine, I’m using LLVM 10. In my case, I did not need to install any additional libraries or tools but, according to the Scala Native documentation, on some platforms it might be necessary. If you’d like to create a Dockerized environment for working with Scala Native, you can base it on the Dockerfile used in the project’s CI. In most cases, it uses the oldest compatible versions of the needed libraries, so don’t hesitate to update them.

JVM

I’ve used two different Java distributions for compilation and execution of the JVM tests: AdoptOpenJDK and GraalVM Community Edition 21, both based on Java 11. For the convenience of running and re-running tests, I’ve used the `sbt-assembly` plugin to create fat jars.

In all cases, unless stated otherwise, I’ve limited the maximum and initial memory allocation pool to 2GB, which was still much more than needed for any benchmark. I have used the default stack size of 1MB; however, changes to this parameter had no visible effect on the final results, even in recursive benchmarks like N-queens.
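For reference, a benchmark run under these settings might be launched like this (the jar name is hypothetical):

```shell
# Fixed 2 GB heap (initial == max) and the default 1 MB thread stack size.
java -Xms2G -Xmx2G -Xss1M -jar benchmarks-assembly.jar
```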

Native Image

Graal Native Image is an ahead-of-time compiler like Scala Native, but it works in a slightly different way. This technology operates on already compiled Java bytecode instead of Scala sources, making it much more generic. It also comes with its own runtime called SubstrateVM, with variants for Java 8 and Java 11.

To be honest, this was my first attempt to use this tool, and in fact, I encountered a few problems that might be worth sharing, since I needed to spend some time setting it up properly to work with Scala 2.13.

The first issue I encountered was quite surprising to me. It might happen that Native Image is not able to create a truly standalone binary. In such a case, it may produce a kind of “hybrid binary” that needs a JRE somewhere on the path at runtime. You can only imagine how surprised I was when I tried to run such a program in a Docker container and received a message about missing Java. Native Image shows warnings when it creates a “hybrid binary”, but I think they’re easy to miss, especially in a CI environment with Java installed. From then on, I always passed the `--no-fallback` flag to Native Image to make sure that I would receive a genuinely standalone binary.

[Image: scala-native-1]

When I finally managed to create a standalone binary, another problem emerged, this time at runtime. Scala 2.13 and 2.12.12+ use some JVM constructs that are not fully supported by Native Image, which may lead to exceptions at runtime. Luckily, in my case, it happened at the very beginning of the main method’s initialization. The problem can be quite easily avoided if you’re using the `sbt-native-image` plugin — it includes a fix that rewrites some of the unsupported usages at compile time. Unfortunately, it was not as easy when I tried to use Native Image on a project that used `mill` as its build tool. In that case, it was easier and faster to migrate it to sbt.

[Image: scala-native-2]

The last problem I encountered happened when I tried to use an HTTP(S) client in my side project. If you want to use these protocols, you need to enable them explicitly with the `--enable-url-protocols=http,https` flag when using Native Image; otherwise, it will throw an exception at runtime. It’s not directly related to this text, but I think it’s worth sharing as well. It was strange to me that such usages could not have been reported at compile time. After dealing with these issues, my testing environment was ready and worked without any further problems.
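Putting both flags together, a Native Image build for such a project might look as follows (the jar and image names are hypothetical):

```shell
# --no-fallback           fail the build instead of emitting a JRE-dependent image
# --enable-url-protocols  needed for java.net HTTP(S) clients at run time
native-image --no-fallback --enable-url-protocols=http,https \
  -jar target/scala-2.13/app-assembly.jar app
```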

Since I had no license for the Enterprise Edition of Native Image, I was not able to further optimize the created binaries using profile-guided optimizations, which I had been excited about.

Platform specification

All tests were executed on my local machine with the setup listed below.

  • CPU Intel Core i7-8565U, 8 cores, 1.8–4.6 GHz
  • RAM 24GB DDR4, 2400MHz
  • OS Ubuntu 20.04.2 LTS
  • Scala 2.13.4
  • JDK AdoptOpenJDK 11.0.9
  • Graal Native Image 21.0.0.2 CE based on Java 11
  • Scala-Native 0.4.1-SNAPSHOT
    commit: 21bfe909c84ed45316531d13c1a7ce577c75da4a
  • CLang / LLVM 10.0.0

Scala Native benchmarks suite

Let’s start with the benchmarks suite included in the Scala Native project. In it, we can find plenty of classic, well-known benchmarks used across multiple languages and platforms, like Mandelbrot, KMeans, Delta Blue, or the N-queens problem. Of course, in this case, they’re all written in Scala 2. What’s quite important, at least for me, is the included set of tools and configurations that allows testing performance across multiple releases or other runtimes compilable from Scala sources. In total, it contains 17 benchmarks with single-threaded implementations.

For each benchmark, we perform 20 runs with 2000 iterations each. All runtimes share the same sources, which should allow us to compare actual runtime performance characteristics. The execution time of each iteration is measured with nanosecond accuracy and stored as plain text. You can find a link to these raw outputs at the end of this document.
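A minimal sketch of such a measurement loop (the names are mine, not the suite’s): each iteration is timed with `System.nanoTime` and collected for offline statistics.

```scala
object Bench {
  /** Runs `body` `iterations` times; returns per-iteration times in nanoseconds. */
  def measure(iterations: Int)(body: () => Any): Array[Long] = {
    val times = new Array[Long](iterations)
    var i = 0
    while (i < iterations) {
      val start = System.nanoTime()
      body()
      times(i) = System.nanoTime() - start
      i += 1
    }
    times
  }

  def main(args: Array[String]): Unit = {
    val times  = measure(2000)(() => (1 to 1000).sum)
    val steady = times.drop(500).sorted // discard warm-up iterations, as in this article
    println(s"p90: ${steady((steady.length * 0.9).toInt)} ns")
  }
}
```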

Benchmarks stability

I have observed that in each of the tested technologies, average execution times between consecutive runs can differ quite significantly. Even after discarding results from warm-up iterations, the relative standard deviation could reach up to 10%. On average, results were much more stable: ~2% for Scala Native and Native Image, and twice as much for execution on the JVM. As you can see in the chart below, in almost every benchmark there was at least one runtime that deviated significantly from the others. As an example, the Mandelbrot benchmark, which takes the most time of all the benchmarks, could differ by over 5% between runs when executed on GraalVM, while all other technologies did not exceed the threshold of 0.1%.

[Chart: benchmark]

The same applies when checking the stability of execution times based on the results from all benchmark runs. In this case, execution on OpenJDK gave the most stable results, but comparable with GraalVM and Scala Native, with differences no larger than 2%. In this comparison, Native Image got the worst result, with an average relative standard deviation of 13% in execution times, mostly due to enormous instability in the Rsc benchmark, reaching up to 160%. After excluding this single benchmark, it would be comparable with the other technologies, with an average relative standard deviation of 4% in execution time.
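The relative standard deviation used throughout this section is simply the standard deviation divided by the mean; a small helper makes the metric concrete:

```scala
object Stats {
  /** Relative standard deviation: population standard deviation over the mean. */
  def relativeStdDev(xs: Seq[Double]): Double = {
    val mean     = xs.sum / xs.size
    val variance = xs.map(x => (x - mean) * (x - mean)).sum / xs.size
    math.sqrt(variance) / mean
  }

  def main(args: Array[String]): Unit = {
    val runMeans = Seq(10.0, 10.5, 9.8, 10.2, 10.1) // hypothetical per-run averages, in ms
    println(f"RSD: ${relativeStdDev(runMeans) * 100}%.1f%%")
  }
}
```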

[Chart: relative standard deviation]

Performance after initial warmup

Since the average time needed for the execution of a single benchmark iteration may vary from tens of microseconds up to over a hundred milliseconds, all results are represented as relative values compared to Scala Native. In the chart below, you can see a comparison of Scala Native with the other technologies based on the 90th percentile of the execution times. To mitigate the effects of JVM warm-up and create a fair comparison with the other targets, the results of the first 500 iterations of each run were discarded.

In the next chart, you can see how Scala Native compares with each of the tested technologies. In most cases, binaries created using Scala Native gave better results than the programs created using other tools. For the majority of benchmarks, we can observe an improvement of at least 20% when compared to programs compiled using Native Image. What’s more, in some benchmarks the time needed for execution was even cut in half! Scala Native also managed to outperform execution on the JVM in 6 benchmarks. There were only a few cases where execution on the JVM was superior to the Scala Native binary.

However, we can observe that in a single benchmark (N-queens) execution was much slower than on the other runtimes. In this case, the JVM and Native Image assemblies were even 5 times faster! I can only guess that Scala Native might still contain some bug causing a large performance regression for recursive calls, which are heavily used in the N-queens benchmark. I hope such issues will be fixed quickly, as that may allow for a large overall speedup on this platform.

[Chart: relative performance difference]

Performance without warmup

Even though the presented results were quite optimistic for Scala Native, they do not show its true advantage over execution on the JVM: the lack of warm-up and instant peak performance. In multiple use cases, e.g. serverless environments, these two can be a real game-changer, potentially saving us tons of precious milliseconds otherwise wasted on class loading and just-in-time compilation on the JVM.

In the next chart, I’ve presented the results from the same benchmark runs, but this time without skipping the initial 500 warm-up iterations. In this scenario, Scala Native was on average 20% faster than execution on OpenJDK and 35% faster than GraalVM.

[Chart: relative performance difference, no warm-up]

Overall performance summary

Overall, if we sum up all the results and determine the average relative performance of each execution runtime based on these benchmarks, at first glance it might not look so optimistic. With only a 12% speedup compared to Native Image, Scala Native does not present itself well, especially since its competitor operates on bytecode instead of its own intermediate representation, which allows it to reuse most Java libraries.

On the other hand, if we assume that the gross error introduced in the N-queens benchmark is a bug that will be fixed soon, investing time and effort into Scala Native development looks a lot more promising. Such a scenario would result in an almost 40% performance advantage over the Native Image results, on a par with the JVM. What’s more, for short-lived applications Scala Native even now offers performance similar to GraalVM and manages to outperform OpenJDK. As soon as the problems with recursive calls are fixed, this project might be able to beat execution on the JVM by an additional 10 percentage points.

[Chart: average performance]

Peak memory usage

The next area I wanted to check was the peak memory usage of each of the targets. To measure this value, I have used the standard Linux tool `/bin/time -v`. Although it looks like the well-known shell built-in `time`, it adds a few features beyond measuring elapsed time. In its verbose mode, this little program shows a lot of additional information, one of which is the maximal amount of memory used by the specified process.
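A measurement of one benchmark binary then looks like this; the line of interest in the verbose output is “Maximum resident set size” (the binary name is hypothetical):

```shell
# -v prints detailed resource statistics after the process exits.
/bin/time -v ./benchmark-binary 2>&1 | grep "Maximum resident set size"
```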

It is not a surprise that both Scala Native and Native Image executables use a lot less memory than execution on the JVM. What’s much more interesting is how much memory we can save. Based on my results, in some benchmarks memory usage was even 100x lower than on the JVM. On average, execution on OpenJDK took 35x more memory, and on GraalVM 38x more, compared to Scala Native. The minimal amount of memory used by the JVM was ~50MB, which we can assume to be its minimal runtime memory overhead. In only two benchmarks did Scala Native exceed this threshold. Even in the most complex and memory-consuming Rsc benchmark, peak memory usage was 6 times smaller than on the JVM, and almost 4 times smaller than with Native Image. We can assume that in larger applications the difference between these two would be lower, but we would need some real-world examples to prove that.

I expected that memory usage with Native Image would be similar to Scala Native; however, it used on average 15 times more memory. In fact, in some cases, it even used more memory than execution on the JVM. In terms of memory usage, I think the winner is evident, with the impressive result going to Scala Native.

It’s hard to tell why the difference was so significant. If I were to guess, I would bet on two candidates. The first one is the garbage collector used by default — Immix GC. It’s a relatively new mark-region GC that has become the subject of multiple academic papers. It also has a concurrent variant called Commix, which may further increase performance in some cases. The second one, in my opinion, is the escape analysis performed in the optimizer. Based on its results, Scala Native can replace some class allocations for numerical types and pointers with primitive values, which may provide massive savings in terms of used memory.
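To illustrate the second candidate, here is a hand-made sketch (not actual optimizer output) of the kind of code that benefits: none of the intermediate `Complex` instances escape the method, so an optimizer with escape analysis may keep the two `Double` fields in registers instead of allocating objects on the heap.

```scala
object EscapeDemo {
  final case class Complex(re: Double, im: Double) {
    def +(that: Complex): Complex = Complex(re + that.re, im + that.im)
  }

  // Every Complex created here is short-lived and never escapes the loop,
  // making the allocations removable in principle.
  def sumParts(n: Int): Double = {
    var acc = Complex(0.0, 0.0)
    var i = 0
    while (i < n) {
      acc = acc + Complex(i.toDouble, -i.toDouble)
      i += 1
    }
    acc.re + acc.im
  }

  def main(args: Array[String]): Unit =
    println(sumParts(1000))
}
```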

[Chart: memory usage]

Binaries size

The last parameter I checked was the actual size of the final executable file, or the fat jar in the case of the JVM. Since the sources for all benchmarks were kept in a common directory, the resulting jar stays constant in all cases. In theory, we could have omitted this metric and focused only on executables, but this way we may treat it as a reference value.

When we take a look at the chart with sizes, we can see that in all cases but one, Scala Native produces executables on average 3 times smaller than Native Image and 2 times smaller than the JVM fat jar. However, in the latter case, we need to take into account all transitive dependencies. One of them is the Scala standard library, with a size of 14MB. If we did not take it into account and summed up only the size of the actual benchmark classes, most of them would not take more than a few kilobytes of storage. Only the Rsc benchmark needed 2MB of storage for its actual bytecode.

I guess that the difference between the sizes of executables created by Scala Native and Native Image might come from more aggressive dead code elimination, but I don’t have any actual proof of this theory. Only in one case was a binary created using Scala Native ~50% larger than the one created using Native Image. This difference might have been caused by excessive duplication used as one of the optimization techniques in Scala Native.

[Chart: executable vs fat jar size]

Third-party benchmarks

Currently, Scala Native is not yet very popular, and it’s hard to find open-source apps created with it that would allow for additional real-world comparisons. However, I’ve managed to find two projects containing cross-platform benchmarks. Let’s take a look at how Scala Native performs on them.

Databricks / sjsonnet

The first library is sjsonnet. It’s a Scala implementation of Google Jsonnet — a superset of the JSON format. One of its uses is providing centralized configurations that can be used with Terraform, Packer, or Kubernetes. This benchmark is mostly focused on reading files from disk and parsing and interpreting their content, and its measure of performance is the number of full read/interpret loops over 36 different files. For this benchmark, I’ve used only a single Java distribution, in this case OpenJDK; however, after I observed that it was using more than 1 core of my CPU, I added a test case where the Java process was limited to a single core.

As you can see in the chart below, this time I have not skipped the warm-up iterations. Based on this, we can observe that reaching peak performance took the JVM around 20 seconds for single-threaded execution and 5 seconds when it was limited in terms of CPU cores. After the initial warm-up, both reach the same performance level.

Both Scala Native and Native Image, as expected, reach their peak performance almost instantly after startup.

Overall, Scala Native reached on average 24% more iterations per second than Native Image, yet still almost 40% fewer than execution on the JVM.

[Chart: performance characteristics]

At the same time, Scala Native consumed only a fraction of the memory used by the other targets, with its peak memory usage at around 50MB, while execution on Native Image and on the single-core JVM took 5 times more, and the JVM with no core limit took even twice as much as those.

[Chart: max memory usage]

scalades

The last one on the list is scalades — a Scala Discrete Event Simulation library created by Bjorn Regnell and Christian Nyberg at Lund University. Compared to the previous benchmark, this one does not use any IO operations; instead, it’s primarily focused on passing signals and handling them in its queuing system. For each platform, I’ve used the default simulation with 9 customers created per second, simulating a period of 0.1 seconds. In the charts below, you can see the averaged results of 10 runs for each platform.

In this use case, Scala Native again showed astonishing performance. Over 10 executions, the JVM and the Native Image binary took from 50% to 80% more time. Additionally, by switching to the Commix GC, we were able to increase Scala Native’s performance by an additional 7%. Changing the GC also had an impact on peak memory usage, which was on average 3 times higher than with the default Immix GC; however, it still used 10x less memory than the binary created using Native Image.
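With the 0.4.x sbt plugin, switching the garbage collector is a one-line change in `build.sbt` (a sketch; see the Scala Native docs for the exact setting names in your version):

```scala
import scala.scalanative.build._

// Use the concurrent Commix GC instead of the default Immix.
nativeConfig ~= { _.withGC(GC.commix) }
```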

[Chart: time needed for simulation execution]

Summary

As you have seen, Scala Native seems to be an excellent target for executing Scala applications in terms of memory usage. In every test case, it allowed us to significantly reduce peak memory usage. It still struggles with some performance issues that need to be fixed, although in some cases it is already superior to the other runtime targets. Combined with its instant startup, this makes it a great candidate for serverless environments, allowing us to cut operational costs, as we would potentially pay less for memory provisioning and for time wasted on starting up the JVM. It also makes a lot of sense to use Scala Native as a target for CLI applications, especially as it offers much better performance than Native Image.

Personally, I’m very satisfied with the current status of Scala Native. Since the Scala Center took over leadership of the project last year, Scala Native has managed to catch up with the latest Scala distributions, allowing all benchmarks to run smoothly using Scala 2.13. It still needs some effort to make it suitable for commercial usage, but even now it shows great potential. The main issue Scala Native struggles with in terms of commercial usage is probably the inability to depend on Java libraries, such as numerous SDKs and network frameworks. This makes the creation of native applications a bit challenging. On the other hand, it allows for much better integration with native libraries written in C, C++, or Rust via the common C ABI. This enables relatively easy interoperability, and in some cases Java SDKs, e.g. the AWS SDK, could be replaced with bindings to native libraries given enough effort. It’s also missing proper support for first-class multithreading; however, that should change soon. Overall, I’m super excited and looking forward to seeing how this project develops in the future.
