The justbuild project
The justbuild generic build tool is the result of
my being asked in 2020
by my then employer to design a build system from scratch and to lead
its development by a small team. The project was open-sourced in
late 2022, with the first stable release in December of that same year.
I led the project at the technical level until release
1.6.0
in mid 2025.
My main considerations during the design and development were the following.
-
In a setup (like the one in a company) where many
developers work on the same project while sharing common
resources (in the same trust realm), remote execution
will be the dominant form of building, as otherwise the
same code will be built over and over again (once per
developer per change, instead of only a single time per
change). So let's make local builds (which every build
system has to support) mimic remote execution: for
every step, create a fresh directory, hard link in the
inputs, run the command, hard link the outputs out to a
content-addressable store (CAS), and dispose of the action
directory.
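To make this concrete, here is a minimal sketch of that execution model in Python (helper names and the CAS location are made up; this is a sketch of the idea, not the actual implementation):

    import hashlib, os, shutil, subprocess, tempfile

    CAS = "/tmp/example-cas"  # hypothetical location of the local CAS
    os.makedirs(CAS, exist_ok=True)

    def cas_path(blob_hash):
        return os.path.join(CAS, blob_hash)

    def run_action(inputs, command, outputs):
        # inputs: map from a path inside the action directory to the hash of a
        # blob already in the CAS; outputs: paths the command is expected to
        # produce, relative to the action directory.
        action_dir = tempfile.mkdtemp(prefix="action-")
        for rel_path, blob_hash in inputs.items():
            dst = os.path.join(action_dir, rel_path)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            os.link(cas_path(blob_hash), dst)          # hard link in the inputs
        subprocess.run(command, cwd=action_dir, check=True)
        result = {}
        for rel_path in outputs:                       # hard link out the outputs
            src = os.path.join(action_dir, rel_path)
            with open(src, "rb") as f:
                blob_hash = hashlib.sha256(f.read()).hexdigest()
            if not os.path.exists(cas_path(blob_hash)):
                os.link(src, cas_path(blob_hash))
            result[rel_path] = blob_hash
        shutil.rmtree(action_dir)                      # dispose of the action directory
        return result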
-
This model has the advantage that artifacts are just entries
in a CAS; there is no path attached to them. This has several
benefits, even if remote execution is never used.
-
When defining an action (i.e., a build step),
we can freely choose where the inputs should
be staged: shorter paths and command lines, enough
room for toolchains, etc., if needed (see the
sketch after this list of benefits).
-
It is not a problem if different actions have
outputs with the same name: no overlapping-outputs
check, no "output must be in the same
module" restriction to make the overlap checks feasible.
Also simpler output names, as we do not have to
worry about making them unique.
-
A particular consequence of the just-mentioned
flexibility is that there is no need for path
mangling when a repository is pulled in as an
external dependency of another repository. The
actions defined by a self-contained repository
are always the same, no matter whether it is
considered the main repository, pulled in under
the name foo, or pulled in under the
name bar.
-
No path mangling for configuration transitions either.
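To illustrate these points with a purely made-up action description (the field names are not justbuild's actual format): two unrelated actions can stage their inputs at the same convenient short paths and both call their output out.o, since each runs in its own directory and the results are only CAS entries.

    # Purely illustrative field names; not justbuild's actual action format.
    compile_foo = {
        "inputs": {"src.c": "<hash of foo.c>", "include": "<tree hash of headers>"},
        "cmd": ["cc", "-c", "src.c", "-Iinclude", "-o", "out.o"],
        "outputs": ["out.o"],
    }
    compile_bar = {
        # same staging layout, same output name, different source: no conflict
        "inputs": {"src.c": "<hash of bar.c>", "include": "<tree hash of headers>"},
        "cmd": ["cc", "-c", "src.c", "-Iinclude", "-o", "out.o"],
        "outputs": ["out.o"],
    }
    # Each out.o simply becomes another entry in the CAS, identified by its
    # content, not by any path.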
-
Without things being tied to paths, let's identify
actions by how they are defined, rather than
by where they are defined. This is not only
the more natural notion of equality (at least for a
mathematical logician), but also allows us to be more
relaxed, e.g., when it comes to configuration transitions:
if a certain part of the code base does not depend on
the variables of a transition, it will not add additional
actions to the action graph (remember that we don't do
path mangling for transitions).
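One way to make "identified by how they are defined" concrete is to key every action by a hash of its definition; the following is only a sketch of the idea, not the actual internal representation.

    import hashlib, json

    def action_id(inputs, cmd, env, outputs):
        # Two actions with identical definitions get the same identifier, no
        # matter in which module, repository, or configuration they arose.
        definition = {"inputs": inputs, "cmd": cmd, "env": env, "outputs": outputs}
        serialized = json.dumps(definition, sort_keys=True)
        return hashlib.sha256(serialized.encode()).hexdigest()

    # A configuration transition that does not touch any variable this part of
    # the code base depends on yields the very same definitions, hence the same
    # identifiers, and therefore no additional nodes in the action graph.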
-
Targets, too, are not bound to a place of
definition; hence we can use any identifier for them.
This gives a nice solution for things like
protobuf, where we want to define the dependency structure of
proto files without having to know for which programming
languages users will later need language-specific
generated code: our proto rules define abstract graph
nodes (with uninterpreted strings as rule names), and
language-specific code (which knows which rules can build
a proto library for that language) can depend on an anonymous
target named by the pair of an abstract graph
node and a rule binding for the abstract rule names. Now,
if different targets use the same proto for the same
language, they actually depend on the same target (named
by the same pair).
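As a sketch of the pairing (all names invented for illustration):

    import hashlib, json

    def name_of(description):
        # content-based name for a description, so equal descriptions coincide
        return hashlib.sha256(json.dumps(description, sort_keys=True).encode()).hexdigest()

    # Abstract graph node defined by the proto rules: just the dependency
    # structure of the proto files, with an uninterpreted string as node type.
    person_proto = {"node_type": "library", "srcs": ["person.proto"], "deps": []}

    # Binding of the abstract rule names, provided by the C++ rules: they know
    # which concrete rule turns such a node into a C++ proto library.
    cpp_binding = {"library": "compile proto library for C++"}

    # The anonymous target is named by the pair; every consumer that combines
    # the same node with the same binding depends on the same target, so the
    # generated code is analysed and built only once.
    anonymous_target = (name_of(person_proto), name_of(cpp_binding))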
-
Since I mentioned multiple repositories: we have to support
them (some companies have more repositories than others,
and in the open-source world it is natural to keep
repositories separate, at least per separate upstream). However,
agreeing on global naming is impossible, as experience
has shown. So let every repository choose what it would
like to call its dependencies, and we stitch things together
by binding these open names in a top-level graph. An
immediate advantage is that we can update from one
version of a library to another just by changing
the binding in the top-level graph, without renaming
labels everywhere within the repository using that
library, even if our project has to use several
versions of that library simultaneously.
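Conceptually, the top-level graph looks like this (shown as a Python dict mirroring a JSON description; this illustrates the idea, not the exact format of justbuild's multi-repository configuration):

    repo_graph = {
        "main": "app",
        "repositories": {
            "app": {
                # "app" picks its own names for its dependencies ...
                "bindings": {"rules": "rules-cc", "json": "json-v3"},
            },
            "tool": {
                # ... and another repository may use the same open name "json",
                # bound to a different version of that library.
                "bindings": {"rules": "rules-cc", "json": "json-v2"},
            },
            "rules-cc": {}, "json-v2": {}, "json-v3": {},
        },
    }
    # Upgrading "app" to a newer json is a change of one binding in this graph;
    # no label inside "app" is renamed, even though both versions coexist.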
-
This explicit binding also gives us a better overview of
the code structure (at the repository level). In particular,
we know all repositories a given repository transitively
depends upon. As we refrain from repository-name-based
path mangling, the value of a target only depends on
the target itself, the supported configuration parameters, and its
dependencies, but not on the consumer (as it should be!).
So if the transitively reachable repositories have not
changed, the target value is still the same. Hence we can cache
at that level; a sketch of such a cache key is given below. Moreover,
the analysis of a target
might depend on things being equal (ensuring there is
no staging conflict), but not on things being different;
so we actually cache the extensional projection, i.e.,
the result of the build.
-
A neat side effect is that in this way, at
least for subsequent builds, we keep the
target-graph small. Hence we can afford to redo
the analysis at every invocation; no daemons
in the background.
-
These repositories are logical ones; a single
git repository can still be split
into many such logical repositories. The tree
identifier of a subdirectory identifies the
content and can still be obtained quickly.
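Here is a sketch of what enters such a target-level cache key, under the simplifications just described (the real key derivation is more involved):

    import hashlib, json

    def target_cache_key(repos, target, config):
        # repos: the transitively reachable logical repositories, each described
        #        by the git tree identifier of its content
        # target: (repository, module, target name)
        # config: the effective configuration for that target
        # Nothing about the consumer enters the key, so the cached value, the
        # extensional result of the build (the output artifacts), can be reused
        # by every consumer.
        key = {"repos": repos, "target": list(target), "config": config}
        return hashlib.sha256(json.dumps(key, sort_keys=True).encode()).hexdigest()

    target_cache_key(
        repos={"json-v3": "<tree id>", "rules-cc": "<tree id>"},
        target=("json-v3", "", "json"),
        config={"ARCH": "x86_64", "DEBUG": False},
    )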
-
Now, if we use remote execution, then we download
our dependencies just to compute and walk the action
graph of the dependency a single time (because afterwards
things end up in the target-level cache). So let's make target-level
caching a service. That way, we don't even have
to download the dependencies at all; we can just ask
that service (using a small key, as repositories are
essentially described by their tree identifiers) to
give us an output description with references to the
artifacts that are in the CAS of the remote execution.
Other considerations I took into account.
-
People like to bikeshed about names; so let's
make the names of the target-description file, the
rules-description file, etc., configurable, of
course on a per-repository basis.
-
Since we only care about how things are defined, but
not where, the source files and target descriptions (as
well as the rules and the expressions) can come from
different source roots.
-
Target descriptions should be declarative. In particular,
we want to be able to read off the definition of a particular target
by simply parsing the file, without file-global
evaluation (without the need for output-conflict checks,
there is also no need for the tool to look at other
targets defined in the same file). Also, use a syntax
that does not look like a programming language. (In
the end, I chose JSON because every programming language can read and
write it.) To avoid magic symbols and magic names,
use structured target references and allow arbitrary
strings for the non-path parts (i.e., everything except
the module name, which is a path relative to the
repository root).
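For illustration, a target-description file in this spirit might look roughly as follows (written as a Python literal for the JSON it stands for; the exact field names depend on the rules used):

    targets_file = {
        "hello": {
            "type": ["@", "rules", "CC", "binary"],  # structured reference to a rule
            "name": ["hello"],
            "srcs": ["main.cpp"],
            "deps": ["greeting lib"],                # plain string: target in this module
        },
        "greeting lib": {                            # arbitrary strings as target names
            "type": ["@", "rules", "CC", "library"],
            "srcs": ["greeting.cpp"],
        },
    }
    # Each entry can be read off by parsing the file alone; no file-global
    # evaluation, and no other target in the file has to be inspected.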
-
If people really need computations to obtain target
descriptions (or rule definitions, or any other
root), then make sure that this is cached properly. Computed
roots also allow people to bring their own syntax
for everything, as long as they can write a parser
for their favourite syntax (again, any programming
language can write JSON). For example, projects
with a strict source-code layout can restrict their
build description to short dependency hints and compute
the full build description from essentially the directory
structure.
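For example, a generator for such a computed root could be as simple as the following sketch (made-up rule name, made-up layout convention):

    import json, os

    def generate_targets(src_root):
        # One library per directory, depending on the libraries of its
        # immediate subdirectories; the layout convention is invented here.
        targets = {}
        for directory, subdirs, files in os.walk(src_root):
            module = os.path.relpath(directory, src_root)
            targets[module] = {
                "type": "library",
                "srcs": sorted(f for f in files if f.endswith(".cpp")),
                "deps": sorted(
                    os.path.relpath(os.path.join(directory, d), src_root)
                    for d in subdirs
                ),
            }
        return targets

    # The generator emits plain JSON, i.e., exactly the kind of declarative
    # target description discussed above; it runs (and is cached) as part of
    # the build.
    print(json.dumps(generate_targets("src"), indent=2))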
-
People voluntarily use git. This gives us a
quick way to obtain a suitable identifier for an artifact
or tree, without actually having to read the object.
So let's use, as the default protocol, remote execution based
on git identifiers; however, we also support the plain
remote-execution protocol using the hashing it has had
from the beginning.
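As a reminder of how cheap such identifiers are: the git identifier of a file is just a hash of a short header followed by the content, and for content already tracked by git the identifier can be taken from the tree objects without reading the file at all.

    import hashlib

    def git_blob_id(content):
        # git hashes a blob as sha1("blob <size>\0" + content)
        header = b"blob " + str(len(content)).encode() + b"\0"
        return hashlib.sha1(header + content).hexdigest()

    git_blob_id(b"hello world\n")  # the same id git itself would assign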
I maintain a local
source mirror
that I update (fast-forward only) regularly, as long as I'm involved in
the project and can accept its technical development. The project, so far,
consists of the following repositories.
-
The main repository
justbuild.
Contains the sources of the build tool (just),
the repository fetching and setup tool (just-mr),
auxiliary tools for maintaining multi-repository descriptions, and
the documentation (man pages, description of concepts, tutorial).
-
Language-specific rules for
-
A demonstration of how to use
justbuild on nix
for local builds, making good use of the fact that
nix makes it easy to set up well-defined
dependencies and a well-defined environment.
-
A bootstrappable description of a
C/C++ toolchain
consisting of compilers with appropriate libraries and linting tools, as well
as some auxiliary programs (busybox, python,
make, cmake).
-
A justbuild description to build
static binaries
of just and just-mr, together with the hashes that
these binaries will have.
The project is also
packaged
in various distributions.
Selected talks.
PS: Naming is hard, and I know that the name for this project is not
well chosen. But that's what you get when a "fun" name
is used internally as a code name, knowing that a committee (of which
I was not a member) will decide on an official name before
the open-sourcing, and that committee then decides on the
temporary name.