Reproducible Node.js built-in snapshots, part 1 - Overview and Node.js fixes

In a recent effort to make the Node.js builds reproducible again (at least on Linux), a big remaining piece was the built-in snapshot. I wrote some notes about the journey of making the Node.js built-in startup snapshot reproducible again; hopefully they can help someone else who is trying to do the same thing (or my future self, since I have a bad memory).

Overview of the Node.js built-in snapshot and code cache

In Node.js, about half of its own code is written in JavaScript. Here is a recent summary from running cloc lib src in the Node.js source directory:

-----------------------------------------------------------------------------------
Language                         files          blank        comment           code
-----------------------------------------------------------------------------------
JavaScript                         324          15056          15646          93311
C++                                175          14919           6931          79563
C/C++ Header                       206           6676           5196          30202
Markdown                             3            391              0           1308
Python                               1              0              0            154
Windows Resource File                1              6             22             39
YAML                                 1              0              3             24
JSON                                 1              0              0             22
XML                                  1              0              5             12
Assembly                             1              0              2              6
-----------------------------------------------------------------------------------
SUM:                               714          37048          27805         204641
-----------------------------------------------------------------------------------

(There are also a lot of third-party dependencies included in the final executable. This is just the code maintained by Node.js itself).

As part of the startup optimization, during the build process, Node.js runs a few initialization scripts to generate a custom V8 startup snapshot containing all the JavaScript values essential to Node.js - for example, the process object and various properties in the global object. This V8 snapshot, along with some Node.js-specific data, is then embedded into the final executable. The less commonly used JavaScript internals are only pre-compiled to generate V8 code cache (which mostly contains bytecode); they are not actually loaded (executed) during the snapshot building process. The V8 code cache is also included in the custom Node.js snapshot data.

When the Node.js executable is run, Node.js can just deserialize a snapshotted heap to set up its essential internals, which is a lot faster than having to compile and execute the initialization scripts from scratch to initialize the heap.

$ hyperfine --warmup 3 "./node_main --no-node-snapshot ./test/fixtures/empty.js" "./node_main ./test/fixtures/empty.js"
Benchmark 1: ./node_main --no-node-snapshot ./test/fixtures/empty.js
  Time (mean ± σ):      53.1 ms ±   0.3 ms    [User: 43.6 ms, System: 11.6 ms]
  Range (min … max):    52.1 ms …  53.7 ms    56 runs

Benchmark 2: ./node_main ./test/fixtures/empty.js
  Time (mean ± σ):      17.8 ms ±   0.3 ms    [User: 9.3 ms, System: 8.6 ms]
  Range (min … max):    17.1 ms …  18.7 ms    154 runs

Summary
  './node_main ./test/fixtures/empty.js' ran
    2.98 ± 0.05 times faster than './node_main --no-node-snapshot ./test/fixtures/empty.js'

When the user loads other built-in modules or internals that are not included in the snapshot, Node.js compiles these extra JavaScript internals using the embedded V8 code cache to at least reduce the compilation cost. Some time still needs to be spent on running the compiled code to set up the built-ins, but it’s a good compromise to avoid bloating the snapshot with less commonly used objects.

$ hyperfine --warmup 3 "./node_main --no-node-snapshot ./benchmark/fixtures/require-builtins.js" "./node_main ./benchmark/fixtures/require-builtins.js"
Benchmark 1: ./node_main --no-node-snapshot ./benchmark/fixtures/require-builtins.js
  Time (mean ± σ):     101.1 ms ±   0.5 ms    [User: 88.1 ms, System: 16.0 ms]
  Range (min … max):    99.5 ms … 101.8 ms    29 runs

Benchmark 2: ./node_main ./benchmark/fixtures/require-builtins.js
  Time (mean ± σ):      35.1 ms ±   0.3 ms    [User: 24.1 ms, System: 12.4 ms]
  Range (min … max):    34.4 ms …  35.9 ms    83 runs

Summary
  './node_main ./benchmark/fixtures/require-builtins.js' ran
    2.88 ± 0.03 times faster than './node_main --no-node-snapshot ./benchmark/fixtures/require-builtins.js'

This optimization has shipped since Node.js v12.5.0 and has been constantly improved. Recently, we’ve been looking into the reproducibility of Node.js executables again, and it turned out that the embedded snapshot and code cache had broken it, a regression that no test covered. So to make the Node.js executable reproducible again, we needed to fix the reproducibility of the built-in snapshot and the code cache first.

How the Node.js built-in snapshot is built

In ordinary release builds, a static library of Node.js’s own code but without the snapshot and code cache, libnode, is first built and linked into a node_mksnapshot executable, together with an empty snapshot.

Then, node_mksnapshot is run to generate the built-in snapshot and code cache, and write a C++ file containing the data defined as static literals. In a release build built by ninja, this building process can be reproduced with:

$ out/Release/node_mksnapshot out/Release/gen/node_snapshot.cc

This file is then compiled and linked with libnode to produce the final node executable.

To print debugging logs from node_mksnapshot, there are two handy environment variables: NODE_DEBUG_NATIVE=MKSNAPSHOT for data generation and NODE_DEBUG_NATIVE=SNAPSHOT_SERDES for data serialization (or NODE_DEBUG_NATIVE=MKSNAPSHOT,SNAPSHOT_SERDES if you want logs for both).

The snapshot generation for release builds is currently implemented by node::SnapshotBuilder::GenerateAsSource. The generated code roughly looks like this:

// Data of the V8 snapshot blob encoded in octal string literals
static const char *v8_snapshot_blob_data = "...";
static const int v8_snapshot_blob_size = 1579172; // Snapshot blob size in bytes
// Code cache for lib/vm.js encoded in octal string literals
static const uint8_t *vm_cache_data = reinterpret_cast<const uint8_t *>("...");
// Code cache for lib/util.js in octal string literals
static const uint8_t *util_cache_data = reinterpret_cast<const uint8_t *>("...");

const SnapshotData snapshot_data {
  SnapshotData::DataOwnership::kNotOwned,

  // Metadata:
  {
    SnapshotMetadata::Type::kDefault, // type
    "23.0.0-pre", // node_version
    "arm64", // node_arch
    "darwin", // node_platform
    3850758994, // v8_cache_version_tag
    static_cast<SnapshotFlags>(0), // flags
  },

   // V8 snapshot blob data:
  { v8_snapshot_blob_data, v8_snapshot_blob_size },

  // Per-isolate data:
  {
    // snapshot indexes of per-isolate primitive values
    { 0, 1, ...  },
    {
      // Name, macro list index and snapshot index of per-isolate v8::Templates:
      { "async_wrap_ctor_template", 0, 451 },
      ...
    }
  },

  // Main environment data:
  {
    ...,
    {  // principal realm data:
      {
        // names of built-ins loaded in the snapshot
        "async_hooks",
        ...
      },
      {
        // Name, macro list index and snapshot index of main context persistent v8::Values
        { "async_hooks_after_function", 0, 13 },
        ...
      },
      {
        // Type name, cleanup queue index and snapshot index of embedder objects
        { "modules::BindingData", 0, 41 },
        { "mksnapshot::BindingData", 1, 43 },
        ...
      },
      56,  // context index
    },
  },

  // Pre-compiled cache for built-in modules:
  {
    // Built-in name, code cache data pointer and cache size
    { "vm", {vm_cache_data, 5144, } },
    { "util", {util_cache_data, 10936, } },
    ...
  }
};

const SnapshotData* SnapshotBuilder::GetEmbeddedSnapshotData() {
  return &snapshot_data;
}
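
To make the "octal string literals" mentioned in the comments above more concrete, here is a minimal sketch of how raw bytes could be rendered as such a literal. ToOctalLiteral is a hypothetical helper written for this post, not the function Node.js uses, and the fixed three-digit escape is an assumption chosen for simplicity:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper (not taken from the Node.js source): renders raw bytes
// as a quoted C++ string literal in which every byte is an octal escape.
std::string ToOctalLiteral(const std::vector<uint8_t>& bytes) {
  std::string out = "\"";
  char buf[5];  // backslash + 3 octal digits + terminating NUL
  for (uint8_t b : bytes) {
    // Always emit three digits: an octal escape consumes at most three
    // digits, so a digit that follows can never be absorbed into the escape.
    std::snprintf(buf, sizeof(buf), "\\%03o", static_cast<unsigned>(b));
    out += buf;
  }
  out += "\"";
  return out;
}
```

With this sketch, the bytes 0, 65 and 255 become "\000\101\377". Binary data like this can contain embedded zero bytes, which is why the generated code carries the sizes (such as v8_snapshot_blob_size) alongside the data pointers instead of relying on a terminator.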

Since the generated file is plain C++ source, we can find the unreproducible parts in the built-in snapshot by generating it twice and diffing the results:

$ ./configure --ninja
$ ninja -C out/Release node_mksnapshot
$ out/Release/node_mksnapshot ./a.cc
$ out/Release/node_mksnapshot ./b.cc
$ diff a.cc b.cc

Variance in Node.js snapshot data

As described above, part of the Node.js built-in snapshot data comes from V8 and the rest comes from Node.js itself. While the V8 part is a binary blob and can be more challenging to grok, the Node.js part is written as C++ aggregate initializers, so it is easier to read.

During my investigation, the first moving part I found in the Node.js portion of the snapshot was the order of the embedder objects - which is this part:

{
  // Type name, cleanup queue index and snapshot index of embedder objects
  { "modules::BindingData", 0, 41 },
  { "mksnapshot::BindingData", 1, 43 },
  ...
  // The order of the items in this list changed from run to run.
},

The Node.js built-in snapshot includes some embedder objects - mostly the binding objects that are returned by internalBinding() (the internal version of the legacy process.binding()). These are used to pass C++ function wrappers and objects created in C++ land to JavaScript land. The set of embedder objects created by the snapshot initialization scripts was constant across runs, which was good, but it turned out that the order in which they were serialized varied from run to run, breaking the reproducibility of the snapshots.

This wasn’t too hard to fix. The embedder objects were tracked by Node.js in a “cleanup queue” structure, which was implemented as a std::unordered_set<CleanupHookCallback> (unordered_set was used for faster insertion), where the CleanupHookCallback structure included an index recording the insertion order. Node.js did use the recorded insertion order to clean up the embedder objects during shutdown, but we forgot to do the same during serialization and were iterating over the std::unordered_set directly, whose iteration order is unspecified. The insertion order of the embedder objects during snapshot serialization was deterministic, so the fix was simple - just serialize them in insertion order.
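
The idea behind the fix can be sketched roughly as follows. This is a simplified illustration rather than the actual Node.js patch: CleanupEntry, CleanupEntryHash and InInsertionOrder are hypothetical names, and the real CleanupHookCallback carries more state than is modelled here.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for CleanupHookCallback: only the fields needed to
// illustrate the ordering problem are modelled.
struct CleanupEntry {
  std::string type_name;     // e.g. "modules::BindingData"
  uint64_t insertion_order;  // index recorded when the entry was added

  bool operator==(const CleanupEntry& other) const {
    return insertion_order == other.insertion_order;
  }
};

// Hash chosen for the sketch only; the real hash function may differ.
struct CleanupEntryHash {
  std::size_t operator()(const CleanupEntry& entry) const {
    return std::hash<uint64_t>{}(entry.insertion_order);
  }
};

using CleanupQueue = std::unordered_set<CleanupEntry, CleanupEntryHash>;

// Copies the entries out of the set and sorts them by the recorded index,
// so the result no longer depends on the set's unspecified iteration order.
std::vector<CleanupEntry> InInsertionOrder(const CleanupQueue& queue) {
  std::vector<CleanupEntry> ordered(queue.begin(), queue.end());
  std::sort(ordered.begin(), ordered.end(),
            [](const CleanupEntry& a, const CleanupEntry& b) {
              return a.insertion_order < b.insertion_order;
            });
  return ordered;
}
```

A serializer would then walk the vector returned by InInsertionOrder() instead of the set itself. Sorting a copy keeps the cheap insertion of the unordered_set on the hot path and pays the ordering cost only when a snapshot is built.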

Up next: fixing the V8 code cache and anatomy of the V8 startup snapshot

After the fixes mentioned above, in the generated node_snapshot.cc, only the v8_snapshot_blob_data and the code cache data were changing. We’ll look into those parts in the next post.