In the last post, I wrote about how I came to work on a memory leak in the vm compilation APIs in Node.js, attempted a fix, and reverted it because the fix turned out to be unsound and introduced a use-after-free. In this post I will write about how I got to another fix that ended up working.

A fresh look

As mentioned in the last post:

A few months later, when I moved along the path of cppgc migration and tried to figure out migration strategies for Node.js embedder objects with various memory models, I set my eyes on CompiledFnEntry again…

I started wondering: maybe migrating it to a cppgc-managed object could make a difference in the leak? Out of curiosity, I tried migrating CompiledFnEntry to a cppgc-managed object locally, but found that the leak persisted. Of course: the core of the fix had to be a reference from the v8::internal::Script associated with the compiled function to the callback setting object, and cppgc would not give us any extra API to deal with this missing piece. We would still have to wait for the host-defined options V8 CL to put the pieces together.

But my brain started wandering again.

The last time I looked into it, I stopped at:

The storage site of host-defined options was completely encapsulated by V8 and was subject to change, so it seemed unwise to develop a fix specific to that.

…sure, the host-defined options might not be stored in v8::internal::Script for much longer, but we did have precedents of temporarily exporting such opaque structs in V8’s API for lifetime management in Node.js anyway. Maybe the V8 team would be open to an upstream CL that allows us to retrieve an opaque struct corresponding to the v8::internal::Script associated with the compiled function? Then, with some other extensions to V8’s cppgc API, we could use cppgc::EphemeronPair to create a reference from this struct to a cppgc-managed node::CompiledFnEntry native object, making the reference discoverable by V8’s garbage collector…

…but that would be quite a bit of extra work. Let’s think without cppgc and see if we can reduce the amount of work. cppgc::EphemeronPair is equivalent to a key-value pair in a WeakMap. What if we could implement a special kind of WeakMap in V8’s API that allows non-JS objects (like the struct for v8::internal::Script) as keys, and JS objects (like the CompiledFnEntry JavaScript wrappers) as values?

…wait, this would just be equivalent to adding an internal field pointing from that public v8::internal::Script struct to the CompiledFnEntry wrapper (because of “WeakMap semantics”). What about just adding an API to V8 that allows setting internal fields in these opaque structs?
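
The “WeakMap semantics” here are ephemeron semantics: the value is kept alive only as long as the key is alive, even if the value points back to the key. A minimal JavaScript illustration:

const wm = new WeakMap();
let key = {};
// The value references the key, but this cycle does not leak: once `key`
// becomes unreachable from everywhere else, the garbage collector can
// collect both the key and the value.
wm.set(key, { backRef: key });
key = null;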

…speaking of internal/private fields, how about just using private symbols? Now, we couldn’t use them to store things in the opaque struct directly, because these structs are not proper JavaScript objects. But V8 does allow us to store symbols in these structs (because symbols are primitives). If only we could let the symbols keep our CompiledFnEntry wrappers alive somehow…

…oh, right, there is a symbols-as-WeakMap-keys proposal and it’s already implemented in V8, isn’t it? Instead of making the CompiledFnEntry wrappers the keys in our callback setting WeakMap, we could make them the values, keyed by these symbols.

…but the CompiledFnEntry wrappers were invented to be used as WeakMap keys. If we can just use symbols as keys now, why are we even keeping these wrappers? Why don’t we just get rid of CompiledFnEntry entirely, and use the callback setting objects as values in the WeakMap? No more problematic weak callbacks for strong global handles; V8’s garbage collector can fully understand the script -> symbol -> callback setting object -> closure -> user-provided callback link in the cycle.
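
Sketched in plain JavaScript (the names below are illustrative, not Node.js’s actual internals), the new model would look roughly like this, relying on the symbols-as-WeakMap-keys feature:

const callbackSettings = new WeakMap();

function compileFunction(code, importModuleDynamically) {
  // The symbol travels with the compiled script as its host-defined
  // option; symbols are primitives, so V8 can store them there.
  const hostDefinedOption = Symbol('vm dynamic import');
  callbackSettings.set(hostDefinedOption, { importModuleDynamically });
  // ...compile `code` with hostDefinedOption attached. When import() is
  // later called from the compiled code, V8 hands the symbol back, and
  // the callback is found via callbackSettings.get(hostDefinedOption).
}

Since the WeakMap entry is an ephemeron, the callback setting object lives exactly as long as the symbol does - that is, as long as the script is alive.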

Looks like we have a fix and we don’t have to wait for that V8 CL now!

Reworking the memory management

Recall from the last post that the original reference graph looked like this:

Now, switching to the symbol-based host-defined options, we could get rid of the CompiledFnEntry wrapper and the env->id_to_function_map completely:

In the new memory model, the cycle can be entirely understood by V8’s garbage collector, so the memory can be managed just right - no more leaks or use-after-frees!

Use-after-free vs. leak in vm.Script and vm.SourceTextModule

As mentioned in the last post, the importModuleDynamically callback option is also available to vm.Script (and its shortcuts, e.g. vm.runInThisContext()) as well as vm.SourceTextModule. Their implementations managed memory in a way similar to the original model of vm.compileFunction(). Would they also run into memory issues? Well, yes they did - when I worked on my first attempted fix, I looked around in the issue tracker and found a handful of bug reports (1 2 3 4 5 6…) about memory leaks in vm.SourceTextModule and use-after-free crashes in vm.Script. One of the more noticeable consequences was that since the REPL uses vm.Script to run code entered by users, import() in the REPL could crash quite easily under the original memory model.

Fixing vm.Script

In the case of vm.Script, the graph resembled the one that caused the regression in my unsound vm.compileFunction() fix. Consider this example:

const vm = require('node:vm');

function getNested() {
  let spt = new vm.Script(
    `(function nested() { return import('fs'); })`,
    {
      importModuleDynamically(specifier) { return import(specifier); }
    });

  // Evaluating the script returns the nested() function expression.
  const nested = spt.runInThisContext();
  return nested;
}
const nested = getNested();
nested(); // Calls import()

The graph looked like this:

Unlike in the case of vm.compileFunction(), where the compilation result is a JavaScript function, the compilation result of vm.Script is a v8::UnboundScript, which represents a script that can be bound to a context and executed later (e.g. with spt.runInThisContext()). This concept of scripts comes from the ECMA-262 specification but does not correspond to any user-accessible built-in JavaScript values. Therefore, unlike CompiledFnEntry, which was purely invented to serve as a WeakMap key, ContextifyScript (or its subclass vm.Script) was originally invented as an abstraction over v8::UnboundScript with a user-accessible JavaScript wrapper object, and then happened to also serve as a WeakMap key.

Here, when the context of getNested() was garbage collected, the graph on the upper right could also be garbage collected, since we didn’t set up enough references to keep it alive - and along with it, the node::ContextifyScript and its entry in the ID-to-Script map could be garbage collected too. Then, when the user invoked nested(), which called import(), Node.js tried to look up the node::ContextifyScript from the ID-to-Script map using the ID 0 provided through the host-defined options, couldn’t find it, and crashed due to a use-after-free, like what we saw in the regression caused by the unsound fix of vm.compileFunction().

Since we needed to keep ContextifyScript as an abstraction over v8::UnboundScript, we needed to extend the lifetime of the ContextifyScript wrapper and keep it alive while nested() was alive. Similar to the new fix for vm.compileFunction(), we could switch to a symbol ID for the host-defined options, first let nested() keep the callback setting object alive, and then add another reference from the callback setting object to the ContextifyScript wrapper. We needed to add this extra reference that vm.compileFunction() didn’t need, because in vm.compileFunction() the referrer (the compiled function) happened to be kept alive by the callback setting object via a closure. The callback setting for ContextifyScript, however, didn’t happen to have a closure like this, so this reference needed to be added manually. With that, we also no longer needed the ID-to-Script map for lookups in the C++ land, since the referrer could be retrieved together with the callback when we look them up from the WeakMap using the host-defined option symbol.
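
In the same illustrative JavaScript as the earlier sketch (again, these names are hypothetical, not Node.js’s actual internals), the registration would gain one extra link:

const callbackSettings = new WeakMap();

function registerScript(contextifyScriptWrapper, importModuleDynamically) {
  const hostDefinedOption = Symbol('vm script');
  callbackSettings.set(hostDefinedOption, {
    importModuleDynamically,
    // The extra reference that vm.compileFunction() didn't need: it keeps
    // the ContextifyScript wrapper alive for as long as the symbol (and
    // thus the compiled script) is alive.
    referrer: contextifyScriptWrapper,
  });
  return hostDefinedOption;
}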

Another leak?!

Some local testing showed that this change fixed the reported vm.Script crashes, so I opened another PR to fix both vm.compileFunction() and vm.Script. But having learned from the last regression, I decided to spend a bit more time analyzing the graph to see if the new model was truly sound this time around. And…uh oh, I found another leak.

Remember that we had to keep the ContextifyScript abstraction for v8::UnboundScript? The reference from node::ContextifyScript to v8::UnboundScript was actually created using a strong v8::Global. If you look at the diagram above again and pay attention to the edges surrounding node::ContextifyScript, you may notice a familiar pattern - node::ContextifyScript held a strong v8::Global to v8::UnboundScript, which kept the upper-right graph alive, while the destruction of node::ContextifyScript depended on a weak callback stemming from this upper-right graph. This is a pattern similar to the one that caused the original vm.compileFunction() memory leak!

At this point I noticed that this pattern also contributed to the leak in vm.SourceTextModule. There were some additional nuances, but those were more straightforward to fix. The most important part of the leak in vm.SourceTextModule still came from the weak callback pattern, so it seemed necessary to just find a proper fix for them all together.

Fixing both vm.Script and vm.SourceTextModule

To get rid of this leak, we had to again replace this strong v8::Global + out-of-band weak callback pattern with something that V8 understands. In this case, we could not simply make that v8::Global weak, because then nothing would keep the v8::UnboundScript alive, and we would get use-after-frees again. Instead, we needed to create a reference from the ContextifyScript wrapper to the v8::UnboundScript with every link of it discoverable by V8. As I happened to be somewhat familiar with how V8 manages the API objects, I knew that:

  1. A v8::UnboundScript is just a v8::internal::SharedFunctionInfo, which, like all the objects managed by the V8 heap, is a v8::internal::HeapObject.
  2. Our ContextifyScript wrapper is just a JavaScript object created as an embedder object, so it can reserve extra slots to store internal fields, which can be either v8::Values or raw pointers. In the case of a v8::Value, the reference is understood by V8’s garbage collector.
  3. v8::Value is just the subset of v8::internal::HeapObject that corresponds to user-accessible JavaScript values. v8::UnboundScript falls into the other category of “v8::internal::HeapObjects that do not correspond to user-accessible JavaScript values”. In the V8 API, these are usually represented as v8::Data. When I worked on the fix, v8::UnboundScript did not inherit from v8::Data, but it could very well do that - the inheritance in the public C++ API was probably only missing because no one had found it useful yet.
  4. The difference between a v8::Value and a v8::Data mostly exists for type checks that avoid exposing non-JS values to users. But in the case of internal fields for embedder objects, since the getters are only available in the embedder API anyway, it looked perfectly fine to relax the restriction and allow embedders to put v8::Data in internal fields.

Now, if we relax the restriction in the V8 C++ API a bit, and store the v8::UnboundScript inside node::ContextifyScript as an internal field, the graph should look like this:

The right-hand side of the graph is managed by V8, and once it goes away, the node::ContextifyScript goes away via deletion in the weak callback of the ContextifyScript wrapper. But while the v8::internal::Script is alive, the ContextifyScript wrapper is kept alive via the host-defined option symbol, so the node::ContextifyScript would not run into use-after-free either.

Upstreaming the prerequisite V8 CL

With all these in mind, I uploaded a V8 CL to relax the type restriction of the internal fields. The idea was uncontroversial (V8 probably would have allowed v8::Data in the first place if the API were developed today), but V8’s type check mechanisms made the corresponding getter a breaking change for embedders, who needed some trivial updates. V8 runs Blink’s and Node.js’s tests as integration tests in its CI, so I ended up having to first land migrations in Blink and Node.js to adapt to the new signatures in the API, and to request that the updates be rolled into the fork of Node.js used in V8’s CI, before landing that CL in V8. Then there was also a small bug in the V8 unit test I added for the new signature, so the patch ended up being reverted and then relanded with a fix in the test.

Putting the fix together

After the round trip of the V8 CL finished, I updated my Node.js PR and added extra tests to check that these APIs no longer leaked. The Node.js core CI seemed happy. To be extra careful, I also ran the PR through the Node.js CITGM CI - a tool Node.js uses to check ecosystem impact by running the tests of popular npm packages against a custom build of Node.js.

CITGM had not been in good shape. Even when it was run on the main branch, a couple of packages’ tests would usually fail due to flakes or false alarms, so finding regressions meant looking for new failures among 100+ existing failures, which was too much work for the CI to be usable. As a result, there was work in progress to shrink the list of packages being tested in the CI and only keep those that could pass on the main branch of Node.js. So I used that smaller list of npm packages to verify my PR. A couple of packages’ tests still failed, but after some investigation I confirmed that those tests also failed or flaked on the main branch of Node.js, and the failures looked unrelated to my PR, so I concluded that at least my PR didn’t introduce new regressions in these packages.

Now, just when I thought everything looked alright and the PR should be good to go, I was reminded by the maintainer of Jest (@SimenB) that it would be good to check for regressions with Jest. For context, Jest used to be in the CITGM list, but it had some flakes on the main branch of Node.js at that point, so it was excluded from the smaller list that I used to verify my PR. Jest used the vm APIs heavily and was affected by the use-after-free in vm.Script, so it would be a good target to check for regressions. I ran the Jest tests locally against the PR, and the failures I got were identical to the ones I got on the main branch of Node.js. @SimenB also verified that some known issues in Jest could be fixed by this PR, so things looked good!

…well actually, not so fast. After some more testing, we got some “bad news”, which I’ll talk about in the next post.