Fixing Node.js vm APIs, part 2 - reworking the memory management
In the last post, I wrote about how I came to work on a memory leak in the `vm` compilation APIs in Node.js, attempted a fix, and reverted it because the fix turned out to be unsound and introduced a use-after-free. In this post I will write about how I got to another fix that ended up working.
A fresh look
As mentioned in the last post:

> A few months later, when I moved along the path of cppgc migration and tried to figure out migration strategies for Node.js embedder objects with various memory models, I set my eyes on `CompiledFnEntry` again…
I started wondering: maybe migrating it to a cppgc-managed object could make a difference in the leak? Out of curiosity, I tried migrating `CompiledFnEntry` to a cppgc-managed object locally, but found that the leak persisted. Why, of course: the core of the fix should be the reference from the `v8::internal::Script` associated with the compiled function to the callback setting object, and cppgc would not give us any extra API to deal with this missing piece. We would still have to wait for the host-defined options V8 CL to put the pieces together.
But my brain started wandering again.
The last time I looked into it, I stopped at:

> The storage site of host-defined options was completely encapsulated by V8 and was subject to change, so it seemed unwise to develop a fix specific to that.
…sure, the host-defined options may soon no longer be stored in `v8::internal::Script`, but we did have some precedents of exporting these opaque structs temporarily in V8's API for lifetime management in Node.js anyway. Maybe the V8 team would be open to an upstream CL that allows us to retrieve an opaque struct corresponding to the `v8::internal::Script` associated with the compiled function? Then with some other extensions to V8's cppgc API, we could use `cppgc::EphemeronPair` to create a reference from this struct to a cppgc-managed `node::CompiledFnEntry` native object, making the reference discoverable by V8's garbage collector…
…but that would be quite a bit of extra work. Let's think without cppgc and see if we can reduce the amount of work. `cppgc::EphemeronPair` is equivalent to a key-value pair in a `WeakMap`. What if we could implement a special kind of `WeakMap` in V8's API that allows non-JS objects (like the struct for `v8::internal::Script`) as keys, and JS objects (like the `CompiledFnEntry` JavaScript wrappers) as values?
…wait, this would just be equivalent to adding an internal field from that public `v8::internal::Script` struct to the `CompiledFnEntry` wrapper (because of "WeakMap semantics"). What about just adding an API to V8 that allows setting internal fields in these opaque structs?
…speaking of internal/private fields, how about just using private symbols? Now, we couldn't use those to store things in the opaque struct directly, because these structs are not proper JavaScript objects. But V8 allows us to store symbols (because they are primitives) in these structs. If only we could let the symbols keep our `CompiledFnEntry` wrappers alive somehow…
…oh, right, there is a symbols-as-WeakMap-keys proposal and it's already implemented in V8, isn't it? Instead of making the `CompiledFnEntry` wrappers the keys in our callback setting `WeakMap`, we could make them values keyed by these symbols.
…the `CompiledFnEntry` wrappers were invented to be used as `WeakMap` keys. If we could just use symbols as keys now, why are we even keeping these wrappers? Why don't we just get rid of `CompiledFnEntry` entirely, and use the callback setting objects as values in the `WeakMap`? No more problematic weak callbacks for strong global handles; V8's garbage collector can fully understand the script -> symbol -> callback setting object -> closure -> user-provided callback link in the cycle.
Looks like we have a fix and we don’t have to wait for that V8 CL now!
Reworking the memory management
Recall from the last post that the original reference graph looked like this:
Now, switching to the symbol-based host-defined options, we could get rid of the `CompiledFnEntry` wrapper and the `env->id_to_function_map` completely:
In the new memory model, the cycle can be entirely understood by V8's garbage collector, so the memory can be managed just right - no more leaks or use-after-frees!
Use-after-free vs. leak in `vm.Script` and `vm.SourceTextModule`
As mentioned in the last post, the `importModuleDynamically` callback option is also available to `vm.Script` (and its shortcuts, e.g. `vm.runInThisContext()`) as well as `vm.SourceTextModule`. Their implementations managed memory in a way similar to the original model of `vm.compileFunction()`. Would they also run into memory issues? Well, yes they did - when I did my first attempted fix, I looked around in the issue tracker and found a handful of bug reports (1 2 3 4 5 6…) about memory leaks in `vm.SourceTextModule` and use-after-free crashes in `vm.Script`. One of the more noticeable consequences was that since the REPL uses `vm.Script` to run code entered by users, `import()` in the REPL could crash quite easily under the original memory model.
Fixing `vm.Script`
In the case of `vm.Script`, the graph resembled the one that caused the regression in my unsound `vm.compileFunction()` fix. Consider this example:
```js
function getNested() {
  const script = new vm.Script(
    'function nested() { return import("foo"); } nested;',
    { importModuleDynamically: callback });
  return script.runInThisContext();
}
const nested = getNested();
// ...after the script and its bookkeeping get garbage collected:
nested();  // triggers import(), which looks the script up by ID 0
```
The graph looked like this:
Unlike in the case of `vm.compileFunction()`, where the compilation result is a JavaScript function, the compilation result of `vm.Script` is a `v8::UnboundScript`, which represents a script that can be bound to a context and executed later (e.g. with `script.runInThisContext()`). This concept of scripts comes from the ECMA-262 specification but does not correspond to any user-accessible built-in JavaScript values. Therefore, unlike `CompiledFnEntry`, which was purely invented to serve as `WeakMap` keys, `ContextifyScript` (or its subclass `vm.Script`) was originally invented as an abstraction over `v8::UnboundScript` with a user-accessible JavaScript wrapper object, and then happened to also serve as `WeakMap` keys.
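To make the distinction concrete, here is roughly how the two concepts relate in the V8 embedder API - a simplified sketch with error handling elided, assuming the isolate and context are already set up:

```cpp
#include "v8.h"

v8::MaybeLocal<v8::Value> CompileAndRun(v8::Isolate* isolate,
                                        v8::Local<v8::Context> context,
                                        v8::Local<v8::String> code) {
  // Compile once, without binding the result to any particular context.
  v8::ScriptCompiler::Source source(code);
  v8::Local<v8::UnboundScript> unbound =
      v8::ScriptCompiler::CompileUnboundScript(isolate, &source)
          .ToLocalChecked();

  // Later - possibly multiple times, in different contexts - bind the
  // unbound script to the context currently entered on the isolate and
  // run it. This is roughly what script.runInThisContext() boils down to.
  v8::Context::Scope context_scope(context);
  v8::Local<v8::Script> script = unbound->BindToCurrentContext();
  return script->Run(context);
}
```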
In the graph above, when the context of `getNested()` was garbage collected, the graph on the upper right could also be garbage collected, since we didn't set up enough references to keep it alive, and along with it, the `node::ContextifyScript` and its entry in the ID-to-Script map could also be garbage collected. Then, when the user invoked `nested()`, which called `import()`, Node.js tried to look up the `node::ContextifyScript` from the ID-to-Script map using the ID 0 provided through the host-defined options, and couldn't find it, so it would crash due to a use-after-free, like what we saw in the regression caused by the unsound fix of `vm.compileFunction()`.
Since we need to keep `ContextifyScript` as an abstraction over `v8::UnboundScript`, we need to extend the lifetime of the `ContextifyScript` wrapper and keep it alive while `nested()` is alive. Similar to the new fix for `vm.compileFunction()`, we could switch to a symbol ID for the host-defined options and first let `nested()` keep the callback setting object alive, then add another reference from the callback setting object to the `ContextifyScript` wrapper. We needed to add this extra reference that `vm.compileFunction()` didn't need, because in `vm.compileFunction()` the referrer (the compiled function) happened to be kept alive by the callback setting object via a closure. The callback setting for `ContextifyScript`, however, didn't happen to have a closure like this, so this reference needed to be added manually. With that, we also no longer needed the ID-to-Script map for lookups in the C++ land, since the referrer could be retrieved together with the callback when we look them up from the `WeakMap` using the host-defined option symbol.
Another leak?!
Some local testing showed that this change fixed the reported `vm.Script` crashes, so I opened another PR to fix both `vm.compileFunction()` and `vm.Script`. But learning from the last regression, I decided to spend a bit more time analyzing the graph to see whether the new model was truly sound this time around. And…uh oh, I found another leak.
Remember that we had to keep the `ContextifyScript` abstraction for `v8::UnboundScript`? The reference from `node::ContextifyScript` to `v8::UnboundScript` was actually created using a strong `v8::Global`. If you look at the diagram above again, and pay attention to the edges surrounding `node::ContextifyScript`, you may notice a familiar pattern - `node::ContextifyScript` held a strong `v8::Global` to `v8::UnboundScript`, which kept the upper-right graph alive, while the destruction of `node::ContextifyScript` depended on a weak callback stemming from this upper-right graph. This is a pattern similar to the one that caused the original `vm.compileFunction()` memory leak!
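Stripped of the details, the pattern looked roughly like this (a simplified sketch, not the actual node_contextify code):

```cpp
#include "v8.h"

class ContextifyScript {
 public:
  ContextifyScript(v8::Isolate* isolate,
                   v8::Local<v8::Object> wrapper,
                   v8::Local<v8::UnboundScript> unbound_script)
      : wrapper_(isolate, wrapper),
        unbound_script_(isolate, unbound_script) {  // strong handle
    // The native object is only deleted in the weak callback of its
    // JavaScript wrapper...
    wrapper_.SetWeak(this, WeakCallback, v8::WeakCallbackType::kParameter);
  }

 private:
  static void WeakCallback(
      const v8::WeakCallbackInfo<ContextifyScript>& info) {
    delete info.GetParameter();
  }

  v8::Global<v8::Object> wrapper_;
  // ...but this strong handle keeps the upper-right graph alive behind
  // the garbage collector's back. If that graph in turn (transitively)
  // keeps the wrapper alive, the weak callback never fires, and nothing
  // in the cycle can ever be collected.
  v8::Global<v8::UnboundScript> unbound_script_;
};
```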
At this point I noticed that this pattern also contributed to the leak in `vm.SourceTextModule`. There were some additional nuances there, but those were more straightforward to fix. The most important part of the leak in `vm.SourceTextModule` still came from the weak callback pattern, so it seemed necessary to find a proper fix for them all together.
Fixing both `vm.Script` and `vm.SourceTextModule`
To get rid of this leak, we had to again replace this strong `v8::Global` + out-of-band weak callback pattern with something else that V8 understands. In this case, we could not simply make that `v8::Global` weak, because then no one would keep the `v8::UnboundScript` alive, and we would get use-after-frees again. Instead, we should create a reference from the `ContextifyScript` wrapper to the `v8::UnboundScript` with every link of it discoverable by V8. As I happened to be somewhat familiar with how V8 manages the API objects, I knew that:
- A `v8::UnboundScript` is just a `v8::internal::SharedFunctionInfo`, which, like all the objects managed by the V8 heap, is a `v8::internal::HeapObject`.
- Our `ContextifyScript` wrapper is just a JavaScript object created as an embedder object, so it can reserve extra slots to retain internal fields, either `v8::Value`s or pointers. In the case of a `v8::Value`, the reference is understood by V8's garbage collector. `v8::Value` is just the subset of `v8::internal::HeapObject` that corresponds to user-accessible JavaScript values. `v8::UnboundScript` falls into another category of "`v8::internal::HeapObject` that does not correspond to user-accessible JavaScript values". In the V8 API, these are usually represented as `v8::Data`. When I worked on the fix, `v8::UnboundScript` did not inherit from `v8::Data`, but it could very well do that - the inheritance was probably only missing from the public C++ API because no one had found it useful yet.
- The difference between a `v8::Value` and a `v8::Data` is mostly there for type checks that avoid exposing non-JS values to users. But in the case of internal fields of embedder objects, since the getters are only available in the embedder API anyway, it looked perfectly fine to relax the restriction and allow embedders to put `v8::Data` in internal fields, as sketched below.
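With that relaxation, storing and retrieving the `v8::UnboundScript` through an internal field could look roughly like this - a sketch assuming the updated API, where `kUnboundScriptSlot` is a made-up slot index rather than the actual Node.js layout:

```cpp
#include "v8.h"

// A made-up slot index for illustration; the embedder decides the layout
// of internal fields when it sets up its object template.
constexpr int kUnboundScriptSlot = 0;

void StoreUnboundScript(v8::Local<v8::Object> wrapper,
                        v8::Local<v8::UnboundScript> unbound_script) {
  // With the relaxed API, SetInternalField() accepts any v8::Data, not
  // just v8::Value - and this reference is fully visible to V8's GC.
  wrapper->SetInternalField(kUnboundScriptSlot, unbound_script);
}

v8::Local<v8::UnboundScript> LoadUnboundScript(
    v8::Local<v8::Object> wrapper) {
  // GetInternalField() returns a v8::Data that the embedder casts back
  // to the type it knows it stored there.
  v8::Local<v8::Data> field = wrapper->GetInternalField(kUnboundScriptSlot);
  return field.As<v8::UnboundScript>();
}
```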
Now, if we relax the restriction in the V8 C++ API a bit, and store the `v8::UnboundScript` inside `node::ContextifyScript` as an internal field, the graph should look like this:
The right-hand side of the graph is managed by V8, and once that goes away, the `node::ContextifyScript` goes away too, via the deletion in the weak callback of the `ContextifyScript` wrapper. But while the `v8::internal::Script` is alive, the `ContextifyScript` wrapper is kept alive via the host-defined option symbol, so the `node::ContextifyScript` object would not run into use-after-frees either.
Upstreaming the prerequisite V8 CL
With all these in mind, I uploaded a V8 CL to relax the type restriction on internal fields. The idea was uncontroversial (V8 probably would've allowed `v8::Data` in the first place if the API were developed today), but the V8 type check mechanisms made the corresponding getter a breaking change for embedders, who needed some trivial updates. V8 runs Blink's and Node.js's tests as integration tests in its CI, so I ended up having to do a migration in Blink and Node.js first to adapt them to the new signatures in the API, and to request that the updates be rolled into V8's CI fork of Node.js, before landing that CL in V8. Then there was also a small bug in the V8 unit test I added for the new signature, so the patch ended up being reverted first, and then relanded with a fix in the test.
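The updates themselves were mostly mechanical. For embedders that kept JavaScript values in internal fields, the change amounted to an extra cast after the getter - a sketch of the kind of update involved, not an actual diff from Blink or Node.js:

```cpp
// Before the CL: the getter returned a JavaScript value directly.
v8::Local<v8::Value> before = wrapper->GetInternalField(0);

// After the CL: it returns a v8::Data, and embedders that know the field
// holds a JavaScript value cast it back explicitly.
v8::Local<v8::Value> after = wrapper->GetInternalField(0).As<v8::Value>();
```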
Putting the fix together
After the round trips of the V8 CL finished, I updated my Node.js PR and added extra tests checking that these APIs no longer leaked. The Node.js core CI seemed happy. To be extra careful, I also ran the PR through the Node.js CITGM CI - a tool Node.js uses to check ecosystem impact by running the tests of popular npm packages against a custom build of Node.js.
CITGM had not been in good shape, though. Even when it was run on the main branch, there would usually be a couple of packages' tests failing due to flakes or false alarms, so finding regressions meant looking for new failures among 100+ existing failures, which was too much work for the CI to be usable. As a result, there was work in progress to shrink the list of packages tested in the CI and only keep those that could pass the tests on the main branch of Node.js. So I used that smaller list of npm packages to verify my PR. There were still a couple of packages' tests failing, but after some investigation I confirmed that those tests also failed or flaked on the main branch of Node.js, and the failures looked unrelated to my PR, so I concluded that at least my PR didn't introduce new regressions in these packages.
Now, just when I thought everything looked alright and the PR should be good to go, I was reminded by the maintainer of Jest (@SimenB) that it would be good to check for regressions with Jest. For context, Jest used to be on the CITGM list, but it had some flakes on the main branch of Node.js at that point, so it was excluded from the smaller list that I used to verify my PR. Jest used the `vm` APIs heavily, and was affected by the use-after-free in `vm.Script`, so it would be a good target to check for regressions. I ran the Jest tests locally against the PR, and the failures I got were identical to the ones I got on the main branch of Node.js. @SimenB also verified that some known Jest issues could be fixed by this PR, so things looked good!
…well actually, not so fast. After some more testing, we got some “bad news”, which I’ll talk about in the next post.