
Advantages of monorepos

Here's a conversation I keep having:
Someone: Did you hear that Facebook/Google uses a giant monorepo? WTF!
Me: Yeah! It's really convenient, don't you think?
Someone: That's THE MOST RIDICULOUS THING I've ever heard. Don't FB and Google know what a terrible idea it is to put all your code in a single repo?
Me: I think engineers at FB and Google are probably familiar with using smaller repos (doesn't Junio Hamano work at Google?), and they still prefer a single huge repo for [reasons].
Someone: Oh that does sound pretty nice. I still think it's weird but I could see why someone would want that.
“[reasons]” is pretty long, so I'm writing this down in order to avoid repeating the same conversation over and over again.

Simplified organization

With multiple repos, you typically either have one project per repo, or an umbrella of related projects per repo, but that forces you to define what a “project” is for your particular team or company, and it sometimes forces you to split and merge repos for reasons that are pure overhead. For example, having to split a project because it's too big or has too much history for your VCS is not optimal.
With a monorepo, projects can be organized and grouped together in whatever way you find to be most logically consistent, and not just because your version control system forces you to organize things in a particular way. Using a single repo also reduces overhead from managing dependencies.
A side effect of the simplified organization is that it's easier to navigate projects. The monorepos I've used let you essentially navigate as if everything is on a networked file system, re-using the idiom that's used to navigate within projects. Multi-repo setups usually have two separate levels of navigation -- the filesystem idiom that's used inside projects, and then a meta-level for navigating between projects.
A side effect of that side effect is that, with monorepos, it's often the case that it's very easy to get a dev environment set up to run builds and tests. If you expect to be able to navigate between projects with the equivalent of cd, you also expect to be able to do cd; make. Since it seems weird for that to not work, it usually works, and whatever tooling effort is necessary to make it work gets done[1]. While it's technically possible to get that kind of ease with multiple repos, it's not as natural, which means that the necessary work isn't done as often.

Simplified dependencies

This probably goes without saying, but with multiple repos, you need to have some way of specifying and versioning dependencies between them. That sounds like it ought to be straightforward, but in practice, most solutions are cumbersome and involve a lot of overhead.
With a monorepo, it's easy to have one universal version number for all projects. Since atomic cross-project commits are possible, the repository can always be in a consistent state -- at commit #X, all project builds should work. Dependencies still need to be specified in the build system, but whether that's make Makefiles or bazel BUILD files, they can be checked into version control like everything else. And since there's just one version number, the Makefiles or BUILD files or whatever you choose don't need to specify version numbers.
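To make this concrete, here's a minimal sketch of what a bazel-style BUILD file for one project in a monorepo might look like (the target names and paths are hypothetical). Dependencies are plain repository paths; there's no version number anywhere:

# Hypothetical BUILD file for a crawler that lives in the monorepo.
cc_library(
    name = "crawler",
    srcs = ["crawler.cc"],
    deps = [
        "//common/net:http_client",    # another project in the same repo
        "//storage/datastore:client",  # referenced by path, not by version
    ],
)

Whatever //common/net contains at your current commit is, by definition, the version you depend on.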

Tooling

The simplification of navigation and dependencies makes it much easier to write tools. Instead of having tools that must understand relationships between repositories, as well as the nature of files within repositories, tools basically just need to be able to read files (including some file format that specifies dependencies between units within the repo).
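As an illustration, a cross-project tool can be little more than a directory walk. Here's a minimal Python sketch, assuming a made-up convention where each project lists its dependencies in a JSON file named DEPS (the filename and format are hypothetical, not any real system's):

import json
import os

def load_dependency_graph(repo_root):
    """Map each project path to the repo-relative paths it depends on."""
    graph = {}
    for dirpath, _dirnames, filenames in os.walk(repo_root):
        if "DEPS" in filenames:
            project = os.path.relpath(dirpath, repo_root)
            with open(os.path.join(dirpath, "DEPS")) as f:
                graph[project] = json.load(f)  # e.g. ["common/net", "storage/datastore"]
    return graph

def reverse_dependencies(graph, target):
    """Every project that would be affected by a change to `target`."""
    return [project for project, deps in graph.items() if target in deps]

The tool never has to know about repository boundaries, clone anything, or resolve versions; it just reads files.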
This simplicity sounds trivial, but consider this example from Christopher Van Arsdale of how easy builds can become:
The build system inside of Google makes it incredibly easy to build software using large modular blocks of code. You want a crawler? Add a few lines here. You need an RSS parser? Add a few more lines. A large distributed, fault tolerant datastore? Sure, add a few more lines. These are building blocks and services that are shared by many projects, and easy to integrate. … This sort of Lego-like development process does not happen as cleanly in the open source world. … As a result of this state of affairs (more speculation), there is a complexity barrier in open source that has not changed significantly in the last few years. This creates a gap between what is easily obtainable at a company like Google versus a[n] open sourced project.
The system that Van Arsdale is referring to is so convenient that, before it was open sourced, ex-Google engineers at Facebook and Twitter wrote their own versions of bazel in order to get the same benefits.
It's theoretically possible to create a build system that makes building anything, with any dependencies, simple without having a monorepo, but it's more effort, enough effort that I've never seen a system that does it seamlessly. Maven and sbt are pretty nice, in a way, but it's not uncommon to lose a lot of time tracking down and fixing version dependency issues. Systems like rbenv and virtualenv try to sidestep the problem, but they result in a proliferation of development environments. Using a monorepo where HEAD always points to a consistent and valid version removes the problem of tracking multiple repo versions entirely[2].
Build systems aren't the only things that benefit from running on a monorepo. For example, static analysis can run across project boundaries without any extra work. Many other things, like cross-project integration testing and code search, are also greatly simplified.

Cross-project changes

With lots of repos, making cross-repo changes is painful. It typically involves tedious manual coordination across each repo or hacky scripts. And even if the scripts work, there's the overhead of correctly updating cross-repo version dependencies. Refactoring an API that's used across tens of active internal projects will probably take a good chunk of a day. Refactoring an API that's used across thousands of active internal projects is hopeless.
With a monorepo, you just refactor the API and all of its callers in one commit. That's not always trivial, but it's much easier than it would be with lots of small repos. I've seen APIs with thousands of usages across hundreds of projects get refactored, and with a monorepo setup it's so easy that no one even thinks twice.
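For illustration, here's a toy Python codemod in that spirit: rename a function and every one of its callers, then commit the whole thing atomically. The names and commit message are made up, and a real codemod would parse the code rather than regex it:

import pathlib
import re
import subprocess

OLD, NEW = "fetch_page", "fetch_url"  # hypothetical API rename

def rename_everywhere(repo_root="."):
    """Rewrite every use of OLD to NEW across the whole repo."""
    for path in pathlib.Path(repo_root).rglob("*.py"):
        text = path.read_text()
        updated = re.sub(rf"\b{OLD}\b", NEW, text)
        if updated != text:
            path.write_text(updated)

rename_everywhere()
# One atomic commit covers the API and every caller at once.
subprocess.run(["git", "commit", "-am", f"Rename {OLD} to {NEW} everywhere"], check=True)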
Most people now consider it absurd to use a version control system like CVS, RCS, or ClearCase, where it's impossible to do a single atomic commit across multiple files, forcing people to either manually look at timestamps and commit messages or keep meta information around to determine if some particular set of cross-file changes are “really” atomic. SVN, hg, git, etc., solve the problem of atomic cross-file changes; monorepos solve the same problem across projects.
This isn't just useful for large-scale API refactorings. David Turner, who worked on Twitter's migration from many repos to a monorepo, gives this example of a small cross-cutting change and the overhead that separate releases would have imposed:
I needed to update [Project A], but to do that, I needed my colleague to fix one of its dependencies, [Project B]. The colleague, in turn, needed to fix [Project C]. If I had had to wait for C to do a release, and then B, before I could fix and deploy A, I might still be waiting. But since everything's in one repo, my colleague could make his change and commit, and then I could immediately make my change.
I guess I could do that if everything were linked by git versions, but my colleague would still have had to do two commits. And there's always the temptation to just pick a version and "stabilize" (meaning, stagnate). That's fine if you just have one project, but when you have a web of projects with interdependencies, it's not so good.
[In the other direction,] Forcing dependees to update is actually another benefit of a monorepo.
It's not just that making cross-project changes is easier, tracking them is easier, too. To do the equivalent of git bisect across multiple repos, you must be disciplined about using another tool to track meta information, and most projects simply don't do that. Even if they do, you now have two really different tools where one would have sufficed.
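To sketch what the single-tool version looks like: in a monorepo, bisecting a regression that spans projects is the stock git bisect workflow, with no extra metadata to track. A thin Python wrapper, where the test command is a hypothetical cross-project integration test:

import subprocess

def bisect(good, bad, test_cmd):
    """Find the commit that broke test_cmd between good and bad."""
    subprocess.run(["git", "bisect", "start", bad, good], check=True)
    # git checks out each candidate commit and runs test_cmd there;
    # exit code 0 means "good", non-zero means "bad".
    subprocess.run(["git", "bisect", "run", *test_cmd], check=True)
    subprocess.run(["git", "bisect", "reset"], check=True)

bisect("known-good-tag", "HEAD", ["make", "-C", "integration_tests", "test"])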

Mercurial and git are awesome; it's true

The most common response I've gotten to these points is that switching to either git or hg from either CVS or SVN is a huge productivity win. That's true. But a lot of that is because git and hg are superior in multiple respects (e.g., better merging), not because having small repos is better per se.
In fact, Twitter has been patching git and Facebook has been patching Mercurial in order to support giant monorepos.

Downsides

Of course, there are downsides to using a monorepo. I'm not going to discuss them because the downsides are already widely discussed. Monorepos aren't strictly superior to manyrepos. They're not strictly worse, either. My point isn't that you should definitely switch to a monorepo; it's merely that using a monorepo isn't totally unreasonable, that folks at places like Google, Facebook, Twitter, Digital Ocean, and Etsy might have good reasons for preferring a monorepo over hundreds or thousands or tens of thousands of smaller repos.


Thanks to Kamal Marhubi, David Turner, and Leah Hanson for extensive discussion on this topic. At least half of the ideas here come from them. Also, thanks to Leah Hanson, Mindy Preston, Chris Ball, Daniel Espeset, Joe Wilder, Nicolas Grilly, Giovanni Gherdovich, Paul Hammant, and Simon Thulbourn for finding typos and other mistakes in this post.

  1. This was even true at a hardware company I worked at, which created a monorepo by versioning things in RCS over NFS. Of course, you can't let people live-edit files in the central repository, so someone wrote a number of scripts that basically turned this into Perforce. I don't recommend this system, but even with an incredibly hacktastic monorepo, you still get a lot of the upsides of a monorepo. [return]
  2. At least as long as you have some mechanism for vendoring upstream dependencies. This works great for Google because Google writes a large fraction of the code it relies on, and has enough employees that tossing all external dependencies into the monorepo has a low cost amortized across all employees; I could imagine the same approach being too expensive for smaller companies. [return]
