How to mine Scala 3 compiler metadata using TASTy files

Andrzej Ratajczak

Kotlin, Scala Developer

15 minutes read

Natural language processing and machine learning keep gaining popularity, and at VirtusLab we wanted a closer look at what they could mean for Scala. As the leader in Scala development, we decided to run a few small machine-learning experiments on Scala 3.

The first obstacle was gathering training data in an efficient and scalable way. Scala 3, like many contemporary programming languages, lets us write compiler plugins that we could use for this. However, they are impractical at this scale, since they require us to:

  • gather source codes from different sources
  • set up projects
  • compile them

Doing this for a large number of libraries is tedious and nearly impossible. Decompilers, by contrast, parse binaries without needing the original project’s source code or build setup. In many languages, however, this approach is limited, because decompilation may produce similar but not identical source code.

Enter Scala 3, introducing a robust decompiler that grants access to nearly all compiler internals, mimicking the compilation process from source code. In this brief blog post, we share our experience of extracting a vast collection of Scala 3 compiler metadata and demonstrate how you can achieve the same in just 150 lines of code.

Explore TASTy files for comprehensive code analysis

The key is to use TASTy files, which are included in the jars uploaded to Maven Central alongside the class files. TASTy files are serialized binary files that contain the complete typed structure of the Scala source code. They store abundant metadata, such as Scaladoc comments, for all declared classes, objects, methods, and more. Because the contents are serialized abstract syntax trees (ASTs), we can navigate them as if we were compiling the source code.
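
To check that a downloaded jar really ships TASTy files, it is enough to list its entries. The snippet below is a minimal sketch that uses only the standard library; the helper name `tastyEntries` is our own and does not come from the original scripts.

```scala
import java.util.jar.JarFile
import scala.jdk.CollectionConverters.*

// Hypothetical helper (not from the original scripts): lists the .tasty entries of a jar.
def tastyEntries(jarPath: String): List[String] =
  val jar = new JarFile(jarPath)
  try
    jar.entries().asScala // java.util.Enumeration -> Scala Iterator
      .map(_.getName)
      .filter(_.endsWith(".tasty"))
      .toList // materialize the result before the jar is closed
  finally jar.close()
```

An empty result for a given jar would suggest that it was not published with TASTy files, so there is nothing for the inspector to read.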

TASTy files and the challenges

We aim to utilize the TASTy format. But how can we extract valuable information from it? Let’s break it down into smaller challenges:

  1. How can we obtain the list of Scala 3 libraries and their Maven coordinates?
  2. How can we efficiently fetch all of them, including their dependencies?
  3. How do we parse and retrieve the desired data?

In practice, these questions are simpler to answer than they appear. Let’s go through them in order.

1 Obtaining Scala 3 libraries and their Maven coordinates

Let’s turn to Scaladex, a comprehensive index of Scala libraries with a user-friendly search interface. Instead of relying on direct Maven integration or the internal Scaladex model, we scrape the necessary data from the Scaladex search pages. With the help of the jsoup library, we obtain the Maven coordinates of the newest Scala 3 libraries in just 40 lines of code.

The pipeline primarily involves a few transformations on the scraped HTML pages. Let’s have a look at the code itself:

scala

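// Imports assumed by this snippet: jsoup, scala-parallel-collections and the standard library
import org.jsoup.Jsoup
import org.jsoup.parser.Parser
import scala.collection.parallel.CollectionConverters.*
import scala.jdk.CollectionConverters.*
import scala.util.Try

// Walk the first 64 pages of Scala 3 search results in parallel
val elems = (1 to 64).par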
 .flatMap { page =>
   Jsoup
     .connect(
       s"https://index.scala-lang.org/search?sort=stars&languages=3.x&q=*&page=$page"
     )
     .get()
     .select("h4")
     .eachText
     .asScala
 }
 .flatMap { header =>
   Try(
     Jsoup
       .connect(s"https://index.scala-lang.org/$header/artifacts/version")
       .get()
   ).toOption
     .map { page =>
       val version = page.select(".head-last-version").text.trim
       page.select("option").eachText.asScala.map((_, (header, version)))
     }
 }
 .flatten
 .flatMap { case (name, (header, version)) =>
   Try {
     val text = Jsoup
       .connect(
         s"https://index.scala-lang.org/$header/artifacts/$name/$version?binary-versions=_3"
       )
       .get()
       .select("#copy-maven")
       .text
     Jsoup.parse(text, "", Parser.xmlParser())
   }.toOption
     .filter(_.select("artifactId").text.endsWith("_3"))
     .map { doc =>
       doc.select("groupId").text + ":" + doc
         .select("artifactId")
         .text + ":" + doc.select("version").text
     }
 }
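
The pipeline above yields plain strings of the form `groupId:artifactId:version`, while the next step needs the three parts separately. A minimal sketch of that split could look as follows; the `Coordinates` case class and the `parseCoordinates` helper are illustrative names, not part of the original scripts.

```scala
// Hypothetical model of one scraped Maven coordinate.
final case class Coordinates(organization: String, module: String, version: String)

// Splits "groupId:artifactId:version"; anything with a different shape is rejected.
def parseCoordinates(raw: String): Option[Coordinates] =
  raw.trim.split(":") match
    case Array(org, module, version)
        if org.nonEmpty && module.nonEmpty && version.nonEmpty =>
      Some(Coordinates(org, module, version))
    case _ => None
```

Returning an `Option` keeps malformed entries, for example a page where the Maven snippet was missing, from breaking the whole run.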

2 Fetching and downloading all libraries and dependencies

Once we have the library coordinates, we can download them using coursier, a handy Scala tool that saves us time. Coursier is a library with a user-friendly interface for fetching Maven packages. It is commonly utilized by Scala build tools or as a standalone tool in a terminal. The SDK API is easy to use, allowing us to obtain the desired jar file and its dependencies with a single call to Fetch():

scala

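// Assumes coursier's top-level aliases (Fetch, Dependency, Module, Organization, ModuleName)
import coursier.*

// `organization`, `module` and `version` are the three parts of a scraped coordinate
Fetch()
 .withRepositories(repositories)
 .withDependencies(
   Seq(
     Dependency(
       Module(Organization(organization), ModuleName(module)),
       version
     )
   )
 )
 .run() // resolves and downloads the jar together with its transitive dependencies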
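
Coursier hands back the downloaded jars as files, and the later steps use them in two shapes: as a list of paths for the TASTy Inspector and as one classpath string for the bytecode reader. The sketch below shows one way to prepare both. It assumes, as the inspector call later in this post does, that the first file is the requested library; `InspectionInput` and `toInspectionInput` are hypothetical names.

```scala
import java.io.File

// Hypothetical container: the jar to inspect plus the jars it depends on.
final case class InspectionInput(jar: String, dependencies: List[String]):
  // Single classpath string, joined with the platform's path separator.
  def classpathString: String =
    (jar :: dependencies).mkString(File.pathSeparator)

// Assumption: the first fetched file is the library that was requested.
def toInspectionInput(fetched: Seq[File]): Option[InspectionInput] =
  fetched.map(_.getAbsolutePath).toList match
    case jar :: deps => Some(InspectionInput(jar, deps))
    case Nil         => None
```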

3 Parsing and retrieving data with the TASTy Inspector

We will use TASTy Inspector, a Scala 3 decompiler tool, to read TASTy files. To collect methods and their Scaladoc comments, we need to override a simple procedure that acts as a callback for each file. 

Let’s define our custom inspector:

scala

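import java.io.{BufferedWriter, File, FileWriter}
import scala.quoted.*
import scala.tasty.inspector.*

class MyInspector(fileOutputName: String, classpath: String) extends Inspector: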
 val file = new File(fileOutputName)
 val bw = new BufferedWriter(new FileWriter(file))
 def inspect(using Quotes)(tastys: List[Tasty[quotes.type]]): Unit =
   import quotes.reflect.*
   object Traverser extends TreeAccumulator[List[DefDef]]:
     def foldTree(defdefs: List[DefDef], tree: Tree)(
         owner: Symbol
     ): List[DefDef] =
       val defdef = tree match
         case d: DefDef =>
           List(d)
         case tree =>
           Nil
       foldOverTree(defdefs ++ defdef, tree)(owner)
   end Traverser


   tastys
     .flatMap { tasty =>
       val tree = tasty.ast
       Traverser.foldTree(List.empty, tree)(tree.symbol)
     }
     .filter(_.symbol.docstring.nonEmpty)
     .flatMap { defdef =>
       val comment = Cleaner.clean(defdef.symbol.docstring.get).mkString(" ")
       Option.when(!comment.isBlank && defdef.rhs != None)(
         s"${astCode(defdef)}␟${byteCode(defdef)}␟${sourceCode(defdef, true)}␟${sourceCode(defdef, false)}␟${comment}\n"
       )
     }
     .foreach(bw.write)


   bw.close()


 extension (s: String)
   def removeNewLines: String =
     s.replaceAll("\\p{C}|\\s+|\\r$|\\\\t|\\\\n|\\\\r", " ")


 def astCode(using Quotes)(defdef: quotes.reflect.DefDef): String =
   Extractors.showTree(defdef).removeNewLines
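
To see what `removeNewLines` does to a snippet, the extension can be tried on its own. The pattern below is copied from the inspector above; the inputs in the usage note are made up for illustration.

```scala
extension (s: String)
  // Replaces control characters (including real newlines and tabs), runs of
  // whitespace, and the two-character escape sequences \t, \n and \r with a space.
  def removeNewLines: String =
    s.replaceAll("\\p{C}|\\s+|\\r$|\\\\t|\\\\n|\\\\r", " ")
```

For instance, `"a\tb".removeNewLines` (a real tab) and `"a\\nb".removeNewLines` (a backslash followed by `n`) both give `"a b"`. Since no line breaks survive, every method ends up on a single line of the output file.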

Once the inspector is defined, we can run it with a single call:

scala

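// `classpath` is assumed to hold the jar paths fetched by coursier, requested library first
TastyInspector.inspectAllTastyFiles(
 Nil,                   // no standalone .tasty files
 List(classpath.head),  // the jar to inspect
 classpath.tail.toList  // its dependencies
)(
 // the inspector takes a single classpath string, which byteCode later passes to BCEL
 new MyInspector(coordinates, classpath.mkString(File.pathSeparator))
)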

Note: Certain classes, like Cleaner for deobfuscating Scaladoc comments or custom Extractors, have been borrowed from the dotty repository and can be found in our repository.
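
Each line written by `MyInspector` holds five fields separated by the `␟` character: the AST, the bytecode, the source with fully-qualified names, the source with short names, and the cleaned comment. Reading the file back could look like the sketch below; the `MethodRecord` case class and the `parseRecord` helper are illustrative and not part of the original scripts.

```scala
// Hypothetical record type; the field order mirrors the string built in MyInspector.
final case class MethodRecord(
    ast: String,
    byteCode: String,
    fullNameSource: String,
    shortNameSource: String,
    comment: String
)

val FieldSeparator = "␟" // U+241F, the separator used when writing

// Parses one output line; lines with a different number of fields are rejected.
def parseRecord(line: String): Option[MethodRecord] =
  line.stripLineEnd.split(FieldSeparator, -1) match
    case Array(ast, byteCode, full, short, comment) =>
      Some(MethodRecord(ast, byteCode, full, short, comment))
    case _ => None
```

The limit of `-1` keeps empty trailing fields, so a record is not rejected only because its last field happens to be empty.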

How to extract source code and bytecode

To obtain the source code, we can use the compiler’s built-in printer, which converts a Tree back into Scala code. It incurs minimal cost since the necessary files are already loaded. The result is cleaned code with fully-qualified names resolved and with comments already stripped by the scanner and parser phases.

scala

def sourceCode(using Quotes)(
   defdef: quotes.reflect.DefDef,
   fullNames: Boolean
): String =
 val sourceCode = Try(
   SourceCode
     .showTree(defdef)(SyntaxHighlight.plain, fullNames)
     .removeNewLines
 )
 sourceCode.toOption.getOrElse("NO_SOURCECODE")

If recovering the source code fails with an internal error, we discard the faulty Tree and return the placeholder “NO_SOURCECODE”.

For bytecode, we opted to use the Apache BCEL library for simplicity. Since the classpath consists of the jars fetched by coursier, everything is readily available.

scala

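import org.apache.bcel.classfile.Utility
import org.apache.bcel.util.{ClassPath, SyntheticRepository}

def byteCode(using Quotes)(defdef: quotes.reflect.DefDef): String =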
 val reader = Try {
   SyntheticRepository
     .getInstance(ClassPath(classpath))
     .loadClass(defdef.symbol.owner.fullName.replaceAll("\\$\\.", "\\$"))
     .getMethods()
 }
 reader.toOption
   .flatMap {
     _.toList
       .find(_.getName == defdef.symbol.name)
       .map(_.getCode)
       .filter(_ != null)
       .map(x =>
         Utility.codeToString(x.getCode, x.getConstantPool, 0, -1, true)
       )
       .map(_.toString.removeNewLines)
   }
   .getOrElse("NO_BYTECODE")

Similar to source code recovery, in some cases, it was simpler to exclude unknown synthetic Scala classes and return “NO_BYTECODE” instead of searching for their correct names in bytecode class files.
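
The `replaceAll` call in `byteCode` is what maps an owner name onto a class-file name: it turns every `$.` into `$`, which matches how the JVM names a class nested in an object (`Outer$Inner`). Isolated as a helper with a hypothetical name, the rewrite looks like this:

```scala
// Hypothetical helper mirroring the replaceAll call in byteCode:
// every "$." in the owner's full name is collapsed to "$".
def toClassFileName(ownerFullName: String): String =
  ownerFullName.replaceAll("\\$\\.", "\\$")
```

Given `"com.example.Outer$.Inner"` it returns `"com.example.Outer$Inner"`, and names without `$.` pass through unchanged. Owners whose compiled names follow a different scheme, such as the synthetic classes mentioned above, are not covered by this rewrite and end up as “NO_BYTECODE”.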

Using these methods, you can automatically extract data from all Scala 3 libraries indexed by Scaladex.

Conclusion

We have had the opportunity to work with the Scala 3 decompiler mechanism and have found numerous benefits. For one project, we specifically required the ASTs of methods and their corresponding comments. 

However, the same mechanism can extract all kinds of data about code structure for statistical analysis or machine learning tasks. While using TastyInspector, we encountered internal errors that prevented certain libraries from being parsed into the Scala AST model. To ensure that all produced TASTy files can be read correctly, we suggest including these scripts as an additional step in the Scala Community Build.

The working scripts can be found in our repository, ScalaTastiesScrapper. Overall, our experience with the Scala 3 decompiler has given us valuable insights and opened up new possibilities for using it. Go try it out yourself.

Curated by

Sebastian Synowiec
