Reproducible Node.js built-in snapshots, part 1 - Overview and Node.js fixes
In a recent effort to make the Node.js builds reproducible again (at least on Linux), a big remaining piece was the built-in snapshot. I wrote some notes about the journey of making the Node.js built-in startup snapshot reproducible again; hopefully they can help someone else who is trying to do the same thing (or my future self, since I have a bad memory).
Overview of the Node.js built-in snapshot and code cache
In Node.js, about half of its own code is written in JavaScript. Here is a recent summary from running cloc lib src in the Node.js source directory:
-----------------------------------------------------------------------------------
(There are also a lot of third-party dependencies included in the final executable. This is just the code maintained by Node.js itself).
As part of the startup optimization, during the build process, Node.js runs a few initialization scripts to generate a custom V8 startup snapshot containing all the JavaScript values essential to Node.js - for example, the process object and various properties in the global object. This V8 snapshot, along with some Node.js specific data, is then embedded into the final executable. The less commonly used JavaScript internals are only pre-compiled to generate V8 code cache (which mostly contains bytecode), but they aren’t actually loaded (executed) during the snapshot building process. This V8 code cache is also included in the custom Node.js snapshot data.
When the Node.js executable is run, Node.js can just deserialize a snapshotted heap to set up its essential internals, which is a lot faster than having to compile and execute the initialization scripts from scratch to initialize the heap.
$ hyperfine --warmup 3 "./node_main --no-node-snapshot ./test/fixtures/empty.js" "./node_main ./test/fixtures/empty.js"
When the user loads other built-in modules or internals that are not included in the snapshot, Node.js compiles these extra JavaScript internals using the embedded V8 code cache to at least reduce the compilation cost. Some time still needs to be spent on running the compiled code to set up the built-ins, but it’s a good compromise to avoid bloating the snapshot with less commonly used objects.
$ hyperfine --warmup 3 "./node_main --no-node-snapshot ./benchmark/fixtures/require-builtins.js" "./node_main ./benchmark/fixtures/require-builtins.js"
This optimization has been shipped since Node.js v12.5.0 and has been constantly improved. Recently, we’ve been looking into the reproducibility of Node.js executables again, and it turned out that the embedded snapshots and code cache have broken the reproducibility of Node.js, which wasn’t covered by any regression tests. So to make the Node.js executable reproducible again, we needed to fix the reproducibility of the built-in snapshot and the code cache first.
How the Node.js built-in snapshot is built
In ordinary release builds, a static library of Node.js’s own code without the snapshot and code cache, libnode, is first built and linked into a node_mksnapshot executable, together with an empty snapshot.
Then, node_mksnapshot is run to generate the built-in snapshot and code cache, and to write a C++ file containing the data defined as static literals. In a release build built by ninja, this step can be reproduced with:
$ out/Release/node_mksnapshot out/Release/gen/node_snapshot.cc
This file is then compiled and linked with libnode to produce the final node executable.
To print debugging logs from node_mksnapshot, there are two handy environment variables: NODE_DEBUG_NATIVE=MKSNAPSHOT for data generation and NODE_DEBUG_NATIVE=SNAPSHOT_SERDES for data serialization (or NODE_DEBUG_NATIVE=MKSNAPSHOT,SNAPSHOT_SERDES if you want logs for both).
The snapshot generation for release builds is currently implemented by node::SnapshotBuilder::GenerateAsSource. The generated code roughly looks like this:
// Data of the V8 snapshot blob encoded in octal string literals
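To make "encoded in octal string literals" more concrete, here is a minimal sketch of how raw bytes could be turned into such a literal. The helper name ToOctalLiteral and the fixed three-digit width are assumptions made for illustration; this is not the actual code in node::SnapshotBuilder::GenerateAsSource.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper (not a Node.js API): formats raw bytes as a quoted
// C++ string literal in which every byte is written as a three-digit
// octal escape, e.g. the byte 0x07 becomes \007.
std::string ToOctalLiteral(const std::vector<uint8_t>& bytes) {
  std::string out = "\"";
  char buf[5];  // backslash + 3 octal digits + terminating NUL
  for (uint8_t b : bytes) {
    std::snprintf(buf, sizeof(buf), "\\%03o", static_cast<unsigned>(b));
    out += buf;
  }
  out += '"';
  // Sanity check: two quotes plus four characters per byte.
  assert(out.size() == 4 * bytes.size() + 2);
  return out;
}
```

With an encoding like this, the generated text is a pure function of the bytes, so any difference between two generated files maps directly to differing bytes in the underlying data.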
So to find the unreproducible parts in the built-in snapshots, we can just do this:
$ ./configure --ninja
Variance in Node.js snapshot data
As described above, part of the Node.js built-in snapshot data comes from V8 and the rest comes from Node.js itself. While the V8 part is a binary blob and can be more challenging to grok, the Node.js part is written as C++ aggregate initializers, so it is easier to read.
During my investigation, the first moving part I found in the Node.js part of the snapshot was the order of the embedder objects, which is this part:
{
The Node.js built-in snapshot includes some embedder objects - mostly the binding objects that are returned by internalBinding() (the internal version of the legacy process.binding()). These are used to pass C++ function wrappers and objects created in C++ land to JavaScript land. The set of embedder objects created by the snapshot initialization scripts was constant across runs, which was good, but it turned out that the order in which they were serialized varied from run to run, breaking the reproducibility of the snapshots.
This wasn’t too hard to fix. The embedder objects were tracked by Node.js in a “cleanup queue” structure, which was implemented as a std::unordered_set<CleanupHookCallback> (unordered_set was used for faster insertion), where the CleanupHookCallback structure included an index recording the insertion order. Node.js did use the recorded insertion order to clean up the embedder objects during shutdown; we just forgot to do the same during serialization and were iterating directly over the std::unordered_set, whose iteration order is unspecified. The insertion order of the embedder objects during snapshot serialization was deterministic, so the fix was simple: just serialize them in insertion order.
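The idea behind the fix can be sketched as follows. This is a simplified illustration, not the actual Node.js code: the fields of CleanupHookCallback, the CleanupHookHash functor, and the CleanupQueue methods Add and GetOrdered are all hypothetical stand-ins, with a plain name replacing the real callback data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for Node.js's CleanupHookCallback. `name` replaces
// the real callback data; `insertion_order` mirrors the recorded index.
struct CleanupHookCallback {
  std::string name;
  uint64_t insertion_order;

  bool operator==(const CleanupHookCallback& other) const {
    return name == other.name;
  }
};

// Hashes on identity only, so the insertion index does not affect lookup.
struct CleanupHookHash {
  std::size_t operator()(const CleanupHookCallback& cb) const {
    return std::hash<std::string>{}(cb.name);
  }
};

class CleanupQueue {
 public:
  void Add(const std::string& name) {
    bool inserted =
        hooks_.insert(CleanupHookCallback{name, next_order_++}).second;
    assert(inserted && "each hook is expected to be registered once");
    (void)inserted;  // silence unused-variable warnings when NDEBUG is set
  }

  // Iterating hooks_ directly visits elements in an unspecified order.
  // Copying them out and sorting by the recorded index restores a
  // deterministic, insertion-ordered sequence for serialization.
  std::vector<CleanupHookCallback> GetOrdered() const {
    std::vector<CleanupHookCallback> ordered(hooks_.begin(), hooks_.end());
    std::sort(ordered.begin(), ordered.end(),
              [](const CleanupHookCallback& a, const CleanupHookCallback& b) {
                return a.insertion_order < b.insertion_order;
              });
    return ordered;
  }

 private:
  std::unordered_set<CleanupHookCallback, CleanupHookHash> hooks_;
  uint64_t next_order_ = 0;
};
```

The set keeps its fast insertion and removal, and the cost of sorting is only paid at the points where order matters, such as serialization.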
Up next: fixing V8 code cache and anatomy on V8 startup snapshot
After the fixes mentioned above, in the generated node_snapshot.cc, only the v8_snapshot_blob_data and the code cache data were changing. We’ll look into those parts in the next post.