Welcome to the new Gradle Dependency Cache

Introduction

A key requirement of an enterprise build system is build reproducibility. Current local dependency caches, such as those implemented by Ivy or Maven, create many problems in this respect for repository-based enterprise builds. This has been the case regardless of the build system in use, be it Ant + Ivy, Gradle, Leiningen, SBT, or Maven.

With our new cache implementation, Gradle addresses many of these caching challenges. We are very excited about our investment in this part of the build system. In this document we want to share the reasons why we have implemented a new cache. First we would like to say a big thank you to Fred Simon from JFrog, co-founder of Artifactory, for starting the new cache project and implementing the initial version.

Problems with current caches

Status Quo

So far, local dependency caches don’t take the artifact origin (specifically, the URL or other source address) properly into account. The artifact is simply stored together with its metadata (e.g. pom.xml or ivy.xml). Regardless of the build and its repository configuration, the same artifacts are always returned, based on simple name matching. In the following sections, we will describe scenarios where this behavior leads to problems and how we are solving it in Gradle. We will also describe other common problems of current dependency caches and how Gradle improves on existing solutions.

Hiding problems due to repository changes

Imagine a new repository is introduced with a different URL and not all artifacts from the original repository have been properly migrated. For people who have already built the projects against the original repository, the build will still work, since the local cache provides all the artifacts. When a new developer checks out the projects, however, they won’t build and will fail with unresolved dependencies. The cache is hiding a configuration problem.

Jars with the same name might be different

Imagine a developer who has worked on project Foo and is now also working on project Bar. Foo and Bar use different repositories. Both use an in-house library with the artifact name superlib. Both the Foo and the Bar repository contain a superlib-1.0.jar, but the jars are not the same. This is a messy situation, but it is also a frequent reality in the enterprise. The developer now builds Bar, which uses the superlib-1.0.jar from Foo because the local cache returns it. The build fails during compilation or in the tests, and nobody knows why. The other Bar developers can’t reproduce the problem because they are not working on Foo. The cache is creating special behaviour which is hard to debug.

Multiple latest snapshots

Another scenario for the Foo and Bar projects from above is that both use snapshots of the latest version of Lucene. Their respective repositories contain different Lucene snapshots. The build master of Bar has uploaded the latest snapshot from yesterday because it has a new feature the Bar team desperately needs. Foo goes with a snapshot that is two weeks old, because the latest Lucene snapshots don’t work for them: they cause an out-of-memory exception, and Foo doesn’t need any of the newest features. The developer now builds Bar, which pulls the latest Lucene snapshot into the local cache. The next time she builds Foo, the tests fail with the out-of-memory exception described above. Her colleagues on the Foo team can’t reproduce the problem. The cache creates incorrect and difficult-to-debug behaviour, and the strategies you can apply with dynamic revision numbers are severely undermined by such cache behaviour.

Local builds are polluting the cache

Another scenario for the Foo and Bar projects using Lucene is the following: a developer is also working on the code base of the latest version of Lucene. He makes some changes to the codebase and builds it with Maven. He has a sample project that consumes his latest build of Lucene. To make the sample project work, he installs the necessary JARs into the local cache. Now the local Foo and Bar builds will also pick up the locally built Lucene snapshot. Again, the cache creates incorrect special behavior.

Concurrency behaviour

Common dependency caches easily get corrupted when multiple builds run in parallel; they are not safe for concurrent access.

The new Gradle Dependency Cache

The objectives for our new cache are:

  • Optimize local disk usage
  • Minimize bandwidth consumption and download time
  • Identify valid artifacts
  • Prevent the creation of corrupted jars
  • Enable concurrent access to the artifact cache
  • Identify locally built artifacts
  • Identify and maintain metadata on each artifact’s origin
  • Support resolver configuration changes

Cache Structure

The new dependency cache has a per-user store for artifacts (e.g. binaries like jars). In that store, one and only one artifact is stored per checksum. The metadata (e.g. pom.xml or ivy.xml) is stored in a per-repository cache which links to the corresponding artifacts. The name of the link is based on the artifact name described in the metadata. The actual file it links to (e.g. the jar) is identified solely by its checksum, much like the way Git points to a blob in its object database.
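Conceptually, the structure can be pictured with a small sketch like the one below. This is purely illustrative Groovy, not Gradle’s actual cache code or on-disk layout; the class and directory names are made up.

    import java.security.MessageDigest

    // Illustrative sketch of a checksum-keyed artifact store with
    // per-repository name links (made-up names, not Gradle internals).
    class ArtifactStore {
        File artifactDir        // per-user store: one file per checksum
        File repositoryCacheDir // per-repository caches: name -> checksum links

        String sha1(File file) {
            MessageDigest.getInstance('SHA-1').digest(file.bytes).encodeHex().toString()
        }

        // Store the binary once, keyed by its checksum.
        File store(File artifact) {
            def target = new File(artifactDir, sha1(artifact))
            if (!target.exists()) {
                artifactDir.mkdirs()
                target.bytes = artifact.bytes
            }
            return target
        }

        // Record, per repository, which artifact name resolves to which checksum.
        void link(String repositoryId, String artifactName, File storedArtifact) {
            def repoDir = new File(repositoryCacheDir, repositoryId)
            repoDir.mkdirs()
            new File(repoDir, artifactName).text = storedArtifact.name // the checksum
        }
    }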

Bandwidth Efficiency

Before downloading an artifact, Gradle first tries to determine the checksum of the artifact, for example by downloading the .sha1 file or, if Artifactory is used, by asking the repository manager. If the checksum can be retrieved, the artifact is only downloaded if no artifact with that checksum already exists in the local cache. If the checksum can’t be retrieved, the artifact is always downloaded, and then discarded if an identical artifact already exists.
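Roughly, the download decision could be sketched as follows, reusing the ArtifactStore sketch from above (again purely illustrative, not Gradle’s internal API; the .sha1 lookup follows the convention Maven-style repositories use for publishing checksums):

    // Illustrative checksum-first download decision (not Gradle internals).
    File fetchArtifact(String artifactUrl, ArtifactStore store, File workDir) {
        String remoteSha1 = null
        try {
            // Maven-style repositories publish the checksum next to the artifact.
            remoteSha1 = new URL(artifactUrl + '.sha1').text.trim()
        } catch (IOException ignored) {
            // No checksum available; we have to download the artifact itself.
        }

        if (remoteSha1) {
            def cached = new File(store.artifactDir, remoteSha1)
            if (cached.exists()) {
                return cached                          // cache hit: skip the download
            }
        }

        def downloaded = new File(workDir, 'download.tmp')
        downloaded.bytes = new URL(artifactUrl).bytes  // no usable checksum, or cache miss
        return store.store(downloaded)                 // store() de-duplicates by checksum
    }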

Origin Validity

As described above, for each repository there is a separate metadata cache. The repository is identified by its URL, type and layout. A build will fail if the required artifacts are not in the repository specified by the build, regardless of whether the local cache has already retrieved those artifacts from a different repository. For example, if you have changed the primary repository for your project, Gradle will check whether the new repository contains all the necessary artifacts and fail if it does not. It will not re-download artifacts that are already in the cache, though.
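For example, a repository is declared in the build script, and the combination of its URL, type and layout determines which per-repository metadata cache is used. The URL and coordinates below are placeholders, and the exact repository DSL syntax differs between Gradle versions:

    // build.gradle
    repositories {
        maven {
            // Changing this URL means Gradle validates the artifacts against the
            // new repository instead of silently serving them from the old cache.
            url 'https://repo.mycompany.example/releases'
        }
    }

    dependencies {
        compile 'com.mycompany:superlib:1.0'
    }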

Origin Validity isolates builds from each other in a way that no build tool has done before. It is a key feature for avoiding incorrect and surprising behavior in local builds.

Checksum Validity

Links with different names in different repository caches may point to the same artifact, and conversely, links with the same name in different repository caches may point to different artifacts. The job of the cache is to reflect the state of the repositories exactly, thus enabling reproducible builds independent of which projects are checked out and of how the cache has been used in the past.
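Schematically, with made-up names and shortened checksums:

    // Per-repository caches map artifact names to checksums in the per-user store.
    def fooRepoCache = ['superlib-1.0.jar': '0a1b2c']   // Foo's repository
    def barRepoCache = ['superlib-1.0.jar': '9f8e7d']   // same name, different artifact

    // Conversely, differently named entries can point to one and the same file
    // in the per-user store when their checksums are equal.
    def repoA = ['lucene-core-4.0-SNAPSHOT.jar': '3c4d5e']
    def repoB = ['lucene-core-20111003.jar': '3c4d5e']   // identical bits, stored once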

Checksum Validity also isolates builds from each other. It is a key feature for avoiding incorrect and surprising behaviour in local builds.

Concurrency

The cache is concurrency safe.

Conclusions

With the new Gradle cache, the local cache no longer hides problems or creates the mysterious and difficult-to-debug behavior that has been a challenge with many build tools. This new behavior is implemented in a bandwidth- and storage-efficient way. It enables reliable and reproducible enterprise builds, which is exactly what you should, and now can, expect of an advanced build tool such as Gradle.

Can’t wait to see this in action; will this be going into 1.0-milestone-5?

I’m pretty jazzed to see such a difficult problem tackled in my favorite build tool.

Yes. See also our release plan toward 1.0: http://forums.gradle.org/gradle/topics/on_our_way_to_gradle_1_0

It was a pleasure to read this article. I hope you will share such thoughts regularly on this forum. And the cache itself is very cool, too, of course.

While comparing 1.0-milestone-3 with the latest 1.0-milestone-5 nightly, dependency resolution is significantly slower (adds multiple minutes) in our build because we have several different repositories and now we’re downloading lots more .pom files and duplicate JARs.

Is it possible to tie a dependency to a given repository?

I have written up a bunch of details for Tim Berglund and hope that we can come up with some way to keep those build times down while incorporating these new changes.

@Eric, our next steps for improving dependency resolution performance are described in this thread on the dev list. You should end up with better performance than milestone-3, even with multiple repositories.

@Adam, that would rock! Dependency resolution slowness is the #1 complaint from my team switching to Gradle so far.

Hi, I have mixed feelings about whether these issues are really so important that we need yet another cache system (YACS). I don’t think any of those problems are so big we cannot cope with them another way. I already have ivy and maven caches on my box and really don’t need to have the same artifact in 3 places.

Hiding problems due to repository changes - this is indeed a configuration problem and has to be tackled by the sysadmin who migrates the repo. He has to be aware that he must run all builds which fetch artifacts from the new repo, and he must clean the local cache before that.

Jars with the same name might be different - if that occurs, the company should fix the naming/versioning of that artifact instead of hiding such a problem with an advanced cache system.

Multiple latest snapshots - the problem with snapshots is that the build is not reproducible. Offline mode is an option if I don’t want to update snapshots. Or depending on an actual nightly build or whatever is better if I really don’t want some snapshot to break my build. That’s what snapshots do.

Local builds are polluting the cache - the same: if I don’t want to break some build with snapshots, I’d rather make it depend on a specific build-number version. Or, which is probably more important, if I break the build of Foo and Bar, it’s a fail-fast indicator that I have to update my code because the API has changed, and that’s a good thing.

Concurrency behaviour - I think this can be incorporated into existing caches as well.

Juraj

Many thanks for your feedback. I can understand your mixed feelings regarding yet another cache. But for what we are trying to achieve with our cache, we don’t really have a choice.

Configuration changes: We prefer automated enforcement of consistency if possible.

Multiple Jars: Generally speaking, it is sometimes simply not possible in complex enterprise scenarios to unify a view across teams and products. There are many possible reasons for that. The philosophy of Gradle is to be the humble servant. If isolation is required, we want to enable it. We don’t want to be the arrogant tool that tells people what their world should look like without knowing their world.

Latest Snapshots: It enables failing fast instead of failing wrongly. Snapshots enforce integration pressure. But it might be important to be able to define a context for “latest”. Right now the context is just the artifact id. With the new cache we add the repository to the context. That way you can have different contexts for “latest” for the same artifact. Again, we think it is very important in the enterprise to be able to isolate projects and teams where necessary.

Local Builds: Again, I think this is a question of context.

I’m aware that some builds and companies will benefit more from those features than others. If you are always working on a single multi-project build, many of the dangers described in the posting won’t hit you. But when they do hit someone, they can become expensive, at least in terms of debugging time. So we think a more reliable cache is simply a good thing.

We also want to use the new cache for other things than just external dependencies. For example, in Gradle you can retrieve build scripts as source from a remote location. We want to provide the cache for any resource you may want to retrieve remotely. That is another reason why we need our own implementation.

Thank you for taking on the concurrency issue! This will allow us to remove our single-threaded fudge in Jenkins and speed things up considerably!

Thank you for tackling one of the more complex and time consuming aspects of our client builds.

I think this form of impromptu information to your user community is invaluable and fills an information void in the community. +1 for more of these info posts from Hans and team.

Does it address the fact that checking out source and updating dependencies are not atomic? This has always been an issue: developer B checks out an hour later, updates dependencies, and his build is broken, so he diffs his source with developer A’s and the source is the same, but the binaries coming from dependency management have changed.

i.e. it would almost be nice to run dependency management and check in the jars so every checkout is atomic. I can’t think of any other solutions, but I have seen this hit a few times.

NOTE: This is especially true when using the 1.0.+ syntax to get the latest versions of stuff with their bug fixes and such.

For best reproducibility, you will have to use fixed version numbers.
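For instance (made-up coordinates):

    dependencies {
        // Dynamic version: resolves to the newest matching 1.0.x at build time,
        // so two checkouts of the same source can end up with different binaries.
        compile 'org.example:somelib:1.0.+'

        // Fixed version: every checkout resolves to exactly the same artifact.
        compile 'org.example:somelib:1.0.3'
    }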

REALLY? Ivy used to generate the version numbers that ended up being used so you could still reproduce the builds. It generated an ivy.xml on publish with the REAL versions that were used at build time.

How would that help you to reproduce the build later? Wouldn’t you have to copy over the information from the ivy.xml into the build.gradle? What if you no longer had the ivy.xml (which typically isn’t under source control)? Then again, you’d need to keep the dependencies anyway to achieve reproducibility, so you might as well keep the ivy.xml.

So maybe you can get away with ranges when using Ivy repos (in terms of reproducibility; performance will suffer a lot). But as far as I know, you won’t currently get something similar for Maven repos.