Reproducible Node.js built-in snapshots, part 3 - fixing V8 startup snapshot

In the previous posts we looked into how the Node.js executable was made a bit more reproducible after the Node.js snapshot data and the V8 code cache were made reproducible, and did a bit of anatomy on the unreproducible V8 snapshot blobs. In this post let’s see how we made the V8 snapshot reproducible and finally made the Node.js executable reproducible again on Linux.

Timestamps captured in the V8 startup snapshot

After digging into the V8 snapshot blob using the steps described in the previous post, the first moving bits that I identified in the snapshot were the time origin of the Performance API binding.

The Web Performance APIs in Node.js are partly implemented in JavaScript. To allow fast data passing between C++ and JavaScript, Node.js saves the timestamp at which the process starts (the time origin) in an array buffer. When this time origin value is needed (e.g. by performance.now()), the JavaScript internals read it from a Float64Array over this array buffer.

Obviously, this process startup timestamp changes from run to run. We did have some logic to reset the time origin in the array buffer after the snapshot is deserialized, so that e.g. the performance.now() calculation would be correct, but the unrefreshed timestamp was still captured within the snapshot, which introduced non-determinism. Other than the time origin, this array buffer was also used to pass calculated values between C++ and JavaScript, so if any of the snapshot initialization scripts called the Performance APIs that use this array buffer, the temporarily stored results would also end up in the snapshot.

So the fix was simple - in addition to refreshing the array buffer after snapshot deserialization, we also reset its contents to a deterministic value before snapshot serialization. Given how the Performance API implementation uses them, whatever values get snapshotted are thrown away and re-computed on demand anyway.
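
To make the reset-then-refresh pattern concrete, here is a minimal C++ sketch. The names (PerformanceState, ResetForSnapshot, RefreshTimeOrigin) are made up for illustration and are not the actual Node.js internals, and a plain std::array stands in for the array buffer shared with JavaScript.

```cpp
#include <array>
#include <chrono>
#include <cstddef>

// Hypothetical stand-in for the array buffer shared between C++ and
// JavaScript. This is NOT the actual Node.js data structure.
struct PerformanceState {
  enum Field : std::size_t { kTimeOrigin, kScratch, kFieldCount };
  std::array<double, kFieldCount> fields{};

  // Run after snapshot deserialization: record this process's time origin.
  void RefreshTimeOrigin() {
    auto now = std::chrono::steady_clock::now().time_since_epoch();
    fields[kTimeOrigin] =
        std::chrono::duration<double, std::milli>(now).count();
  }

  // Run right before snapshot serialization: overwrite every run-dependent
  // value (time origin and scratch results) with a fixed value.
  void ResetForSnapshot() { fields.fill(0.0); }
};
```

With this shape, two builds that call ResetForSnapshot() before serializing end up with the same bytes in the buffer no matter when they started, and RefreshTimeOrigin() restores a meaningful value once the snapshot is loaded.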

Embedder data in the V8 startup snapshot

After fixing the timestamps, the rest of the moving bits appeared to be all related to the embedder data stored in V8 heap objects.

There are two types of V8 embedder data that are used by Node.js:

Internal fields in v8::Objects. They can be either pointer fields or other V8 heap objects. Pointer fields are accessed with v8::Object::GetAlignedPointerFromInternalField() and v8::Object::SetAlignedPointerInInternalField(), while heap objects are accessed with v8::Object::GetInternalField() and v8::Object::SetInternalField().
Embedder data in v8::Contexts. Pointer fields are accessed with v8::Context::GetAlignedPointerFromEmbedderData() and v8::Context::SetAlignedPointerInEmbedderData(), while heap objects are accessed with v8::Context::GetEmbedderData() and v8::Context::SetEmbedderData().

Fixing embedder data in object internal fields

When serializing the V8 heap into a snapshot, V8 accepts a v8::SerializeInternalFieldsCallback for embedders to supply additional data (wrapped in v8::StartupData) for all the pointer internal fields it finds in the context. The embedder-supplied data would be copied into the V8 snapshot blob. At deserialization time, V8 would run another v8::DeserializeInternalFieldsCallback provided by the embedder on each pointer internal field serialized in the snapshot.

In the case of Node.js, embedder objects have the following internal fields (numbers are the indexes):

[ 0 ] embedder type
[ 1 ] counterpart C++ object
... (additional fields, most embedder objects don't have them)

The pointer field at index 0 pointed to something that allowed Node.js to identify whether the embedder object was managed by Oilpan or by Node.js’s BaseObject abstraction (note that this is subject to change in the near future, but it was still relevant at the time of the fixes). The pointer field at index 1 pointed to a counterpart C++ object - for example, the JS object returned by internalBinding('fs') had an internal pointer field at index 1 pointing to a node::fs::BindingData instance.

Before the V8 startup snapshot is generated using v8::SnapshotCreator::CreateBlob(), Node.js would do something like this:

realm->ForEachBaseObject([&](BaseObject* obj) {
  if (!obj->is_snapshotable()) return;
  SnapshotableObject* ptr = static_cast<SnapshotableObject*>(obj);
  // PrepareForSerialization() implemented by different embedder object
  // classes would
  // 1. Do some necessary cleanups (like the timestamp resets mentioned before)
  // 2. Allocate a `node::InternalFieldInfoBase` (usually called
  //    `InternalFieldInfo`)
  // 3. Use `v8::SnapshotCreator::AddData()` to get snapshot indexes to
  //    other JS values and save them in the `InternalFieldInfo`
  // 4. Keep this `InternalFieldInfo` alive until `ptr->Serialize()` is
  //    called in the V8 internal field serializer callback.
  if (ptr->PrepareForSerialization(context, creator)) {
    SnapshotIndex index = creator->AddData(context, obj->object());
    info->native_objects.push_back({type_name, i, index});
  }
});

Since the embedder objects all follow the same layout, Node.js doesn’t actually need the internal fields to be serialized individually. All the preparation and serialization can be done on a per-object basis.

When v8::SnapshotCreator::CreateBlob() is called, the v8::SerializeInternalFieldsCallback provided by Node.js would simply wrap the prepared InternalFieldInfo into a v8::StartupData and return it. Because the data is per-object, not per-field, we only returned this for the field at index 0. The serializer callback would simply return null data for all other fields.

This was where the non-determinism was introduced: per the API contract, if the callback returns null data for a field, the pointer field would be serialized verbatim, i.e. the memory addresses would be copied into the V8 snapshot blob. And the memory addresses changed from run to run. To fix this, we needed to return some non-null data for all the pointer fields in the callback, even if they don’t have any field-specific data to serialize. We went with a simple scheme:

For slot 0, return a copy of the EmbedderObjectType (essentially a uint8_t at the time) that was designated for all the embedder objects.
For slot 1, return the prepared InternalFieldInfo, since that’s where the object pointer is actually stored.
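
The scheme can be sketched roughly as follows. This is not the Node.js callback itself: StartupData here is a local struct that only mirrors the shape of v8::StartupData so that the example compiles without V8, and FakeEmbedderObject and SerializeSlot are hypothetical names. The sketch only models the two common slots.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Local mirror of the shape of v8::StartupData (bytes plus a size).
struct StartupData {
  const char* data;
  int raw_size;
};

// Hypothetical embedder object: two raw pointer slots plus the data that
// was prepared for serialization ahead of time.
struct FakeEmbedderObject {
  void* slots[2];    // [0] embedder type, [1] counterpart C++ object
  uint8_t type;      // stand-in for EmbedderObjectType
  std::string info;  // stand-in for the prepared InternalFieldInfo bytes
};

// Returns non-null, address-independent bytes for each modeled slot, so the
// raw pointers in `slots` never need to be copied verbatim.
// The caller releases the bytes with delete[].
StartupData SerializeSlot(const FakeEmbedderObject& obj, int index) {
  assert(index == 0 || index == 1);  // additional fields are not modeled
  if (index == 0) {
    char* copy = new char[1];
    copy[0] = static_cast<char>(obj.type);
    return {copy, 1};
  }
  int size = static_cast<int>(obj.info.size());
  char* copy = new char[obj.info.size()];
  std::memcpy(copy, obj.info.data(), obj.info.size());
  return {copy, size};
}
```

The point of the design is that nothing returned depends on where the objects happen to live in memory, so two objects of the same type with the same prepared data serialize to the same bytes in every run.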

After this, the number of unreproducible bits in the snapshot was reduced. There were still some remaining bits though.

Fixing embedder data in contexts

In the section above, we mentioned that Node.js also uses the embedder data slots in v8::Context. While V8 has provided a callback for serializing embedder data for object internal fields, the customization for serializing context data had been a remaining TODO. Without any API to customize context data serialization, V8 simply copied the context data slots verbatim into the snapshot. And in the case of Node.js, there were several pointers in the context data that would change from run to run, and the copies of them in the snapshot made the snapshot unreproducible.

So the first thing to do was to implement an API in V8 to allow customization of context data serialization, which I did in this V8 CL.

After the V8 CL rolled into Node.js, another follow-up was needed to implement the context data serialization. At the time of writing, there are only 4 pointer fields in a Node.js context:

ContextEmbedderIndex::kEnvironment: index 32
ContextEmbedderIndex::kContextifyContext: index 37
ContextEmbedderIndex::kRealm: index 38
ContextEmbedderIndex::kContextTag: index 39

(The comments in src/node_context_data.h document what these pointers are for).

These are reset (through node::Environment::AssignToContext()) soon after the context is deserialized, and in general the C++ constructs being pointed to are either constant or are always created separately in C++ at deserialization time, so there is no need to preserve any special data in the snapshot for them.

Initially I sent a PR to just implement the context data callback and return null data for all the pointer fields, which didn’t improve anything, because…by the time the V8 CL was backported, I had already forgotten about the “pointer fields would be serialized verbatim if null data is returned” API behavior that I had implemented myself (so that it’s consistent with the internal field serializer). Instead I went with my intuition and thought “pointer fields should be serialized as null if null data is returned”. Silly me eh (:S).

After recovering from my own silliness and changing the callback to just return the indexes as the custom data for the context pointer fields, the snapshot became reproducible on some platforms, but not on others. What was this platform-dependent unreproducibility coming from?

Fixing padding in copied structs

After another round of V8 snapshot blob grokking, I noticed that the moving bits were in the v8::StartupData returned by the serializer callbacks. As described above, most Node.js-specific data in embedder objects was serialized into an InternalField struct, which has the following layout:

[  uint8_t ] `type`: EmbedderObjectType of the object
[  size_t  ] `length`: Size of the struct in bytes 
[    ...   ] custom bytes, the size should be `length - sizeof(uint8_t) - sizeof(size_t)`

Per the API contract, V8 expects the v8::StartupData returned in the callbacks to carry the data in a const char*, and it calls delete [] on this pointer to release the memory. To convert an InternalField into v8::StartupData, Node.js would just allocate the memory for the InternalField using new [] and initialize it with placement new so that V8 could call delete [] on it later, like this:

// T is the InternalField class here
void* buf = ::operator new[](sizeof(T));
T* result = new (buf) T;

But since the first slot was a uint8_t, and the second slot was a size_t (which can be 4 or 8 bytes), there could be some padding in the underlying memory. The padding was also copied into the snapshot blob as part of the v8::StartupData. On some platforms/with some compilers, this padding would be non-deterministic.

Initially I fixed this by enlarging the enums and adding some static assertions to ensure that there was no padding, but this would add a bit of size overhead to the snapshot. Then I realized that since we were initializing the struct on a chunk of memory controlled by ourselves anyway, we could simply zero-initialize that chunk of memory before calling placement new on it, so that all padding bytes would be zeros and deterministic.

void* buf = ::operator new[](sizeof(T));
memset(buf, 0, sizeof(T));
T* result = new (buf) T;
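
For a more complete picture, here is a small self-contained sketch of the same idea. FieldHeader, kPaddingBytes and NewZeroedHeader are hypothetical names; the struct only imitates the first two members of the layout above (a uint8_t followed by a size_t) and is deliberately kept trivially constructible so that placement new does not touch the zeroed bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <new>

// Hypothetical stand-in for the header of the serialized struct.
struct FieldHeader {
  uint8_t type;
  size_t length;
};

// Bytes the compiler inserts between `type` and `length` for alignment.
constexpr size_t kPaddingBytes =
    offsetof(FieldHeader, length) - sizeof(uint8_t);

// Allocates a FieldHeader on zero-filled memory so that the padding bytes
// are always zero. The caller releases it with ::operator delete[].
char* NewZeroedHeader(uint8_t type) {
  void* buf = ::operator new[](sizeof(FieldHeader));
  std::memset(buf, 0, sizeof(FieldHeader));
  FieldHeader* header = new (buf) FieldHeader;
  // Member-wise assignment only writes the members, not the padding.
  header->type = type;
  header->length = sizeof(FieldHeader);
  return reinterpret_cast<char*>(header);
}
```

On a typical 64-bit platform kPaddingBytes is 7, so without the memset seven bytes of whatever the allocator returned would be copied into the snapshot for each struct.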

After fixing the padding in this PR, consecutively built snapshots finally became reproducible on all supported platforms tested in the Node.js CI. On Linux at least, this also made the consecutively built executables identical.

Fixing V8 array buffers

After merging the padding fixes and the reproducibility tests into the main branch, the snapshot reproducibility test was happily green in the CI for a couple of days, then it started to fail on some platforms. It turned out to be a regression introduced by a later backported V8 patch that removed the --harmony-rab-gsab flag and cleaned up some logic in code related to array buffers. That patch appeared to be just a cleanup that shouldn’t introduce observable changes, so why did it break the reproducibility?

To unbreak CI, I reverted this V8 patch from the main branch of Node.js for the time being, but the issue still needed to be fixed, since that patch would be rolled into Node.js with the next bulk V8 upgrade.

I started grokking the snapshot blob again using the steps described in the previous post, and noticed that the varying part came from an empty array buffer. In Node.js’s configuration of V8 (without sandbox or pointer compression), the memory layout of an array buffer looked like this on a 64-bit platform:

0  [  8 bytes  ] map pointer
8  [  8 bytes  ] properties pointer
16 [  8 bytes  ] elements pointer
24 [  8 bytes  ] detach key pointer
32 [  8 bytes  ] byte length
40 [  8 bytes  ] max byte length
48 [  8 bytes  ] backing store pointer
56 [  8 bytes  ] extension pointer
64 [  4 bytes  ] bit fields
68 [  4 bytes  ] padding
72 [  8 bytes  ] embedder pointer 1
80 [  8 bytes  ] embedder pointer 2

The snapshot serialization code would first serialize the initial 4 pointers to other heap objects. The rest of the array buffer doesn’t point to anything on the V8 heap, so it’d be directly copied into the snapshot. I noticed that bytes 8-16 of the copied bytes were the moving bits. Since the copied region starts at offset 32, right after the 4 pointers, these correspond to offsets 40-48 in the layout above, which is the max byte length field.
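
The offset arithmetic can be written down as a tiny lookup. The table below is just a transcription of the layout listed in this post for Node.js’s 64-bit configuration, not something read from V8 headers, and FieldAtRawOffset is a made-up helper.

```cpp
#include <cstddef>
#include <string>

struct LayoutEntry {
  const char* name;
  std::size_t offset;
  std::size_t size;
};

// Transcribed from the layout listed in this post (64-bit, no sandbox,
// no pointer compression).
const LayoutEntry kArrayBufferLayout[] = {
    {"map pointer", 0, 8},           {"properties pointer", 8, 8},
    {"elements pointer", 16, 8},     {"detach key pointer", 24, 8},
    {"byte length", 32, 8},          {"max byte length", 40, 8},
    {"backing store pointer", 48, 8}, {"extension pointer", 56, 8},
    {"bit fields", 64, 4},           {"padding", 68, 4},
    {"embedder pointer 1", 72, 8},   {"embedder pointer 2", 80, 8},
};

// The first 4 pointers (32 bytes) are serialized as heap references; the
// verbatim-copied bytes start right after them.
const std::size_t kRawCopyStart = 32;

// Maps an offset within the verbatim-copied bytes to the field it lands in.
// Returns an empty string if the offset is past the end of the object.
std::string FieldAtRawOffset(std::size_t raw_offset) {
  std::size_t absolute = kRawCopyStart + raw_offset;
  for (const LayoutEntry& entry : kArrayBufferLayout) {
    if (absolute >= entry.offset && absolute < entry.offset + entry.size) {
      return entry.name;
    }
  }
  return "";
}
```

With this table, any offset in the range 8 to 15 within the copied bytes resolves to the max byte length field, which matches what the diff of the snapshot blobs pointed at.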

I looked around in the V8 source code to figure out where this was initialized - well actually, it turned out that this wasn’t initialized at all when V8 allocated an empty array buffer. V8 did have some code in place to ensure that the empty array buffers were initialized and reproducible, but the max byte length field was missed since it was a later addition, and somehow this only surfaced after the --harmony-rab-gsab removal patch. So I submitted another V8 patch to initialize this field. With that in place, the --harmony-rab-gsab removal patch could be relanded without breaking the CI.

What’s next?

The fixes mentioned have been available since Node.js v22. Now that we know that at least consecutive builds on Linux are identical, there are some plans to add regression tests in the CI for the reproducibility of the entire executable.

The final executable can still vary a bit depending on the type of hardware being used for building it, but it seems to come from other third-party dependencies now, while Node.js’s own bits or its snapshot/code cache can be reproducible for more hardware configurations. Some have managed to make the executable reproducible by patching the build process of the third-party dependencies, and it may be possible to pull in similar changes into the default build to make it reproducible too.

As usual, thanks to Bloomberg and Igalia for supporting my work in Node.js & V8.