Fixing Node.js vm APIs, part 3 - verifying the fixes
In the last post I wrote about how a new memory management model in Node.js
vm APIs was designed to fix the long-standing leaks and use-after-free issues, and we finally put together a PR that the Node.js core CI and CITGM CI were happy with. The road to landing it and releasing it to LTS still turned out to be bumpy, however.
“Leak” detected by Jest’s leak detector
As mentioned in the last post, just when I thought the new fix should be good to go since we didn’t find any obvious regressions, we got some bad news from Jest. Specifically, it appeared that Jest’s
--detect-leaks found a leak when running tests locally with the PR. Was there a leak that we weren’t able to catch in our tests, but that Jest somehow detected? To find out what caused this failure, I looked into Jest’s leak detector.
At that time, Jest’s leak detector roughly looked like below - it used
FinalizationRegistry to register a callback that may be called when the monitored object is garbage collected, attempted a garbage collection using V8’s internal
gc() that is only available with
--expose-gc (to that end, Jest used
v8.setFlagsFromString() to temporarily expose that utility to the global scope), waited for a few ticks, and checked whether the
FinalizationRegistry callback had been invoked to determine if the monitored object was collected. If it wasn’t, Jest considered it leaked.
```js
const tick = require('util').promisify(setImmediate);
```
From my experience dealing with leak detection, there were two problems in this implementation that immediately stood out to me:
- FinalizationRegistry is not guaranteed to invoke the finalizer callback predictably when an object goes away, so this approach was subject to false positives. To quote the language proposal that added it:
> If an application or library depends on GC cleaning up a WeakRef or calling a finalizer in a timely, predictable manner, it’s likely to be disappointed: the cleanup may happen much later than expected, or not at all.
- This approach used gc() in an attempt to “collect all unreachable objects in the V8 heap”, but that was not what gc() was designed for. gc() and similar utilities are V8-internal APIs designed for V8 tests. They can be used to request the garbage collector to perform one round of work on the V8 heap, but the garbage collector is not designed to collect all unreachable objects in just one round of work. For that purpose, multiple rounds of work and additional tweaks to the garbage collection process may be needed, since some steps (e.g. invocation of weak callbacks) are not designed to run immediately. In addition, there are some extra nuances when it comes to using gc() to collect a graph that contains scripts. All of this meant that using gc() to “collect all unreachable objects in the V8 heap” was destined to be disappointing.
Investigating the potential “leak”
Now we knew that this might be a false alarm, but we still could not rule out a real leak. We should at least figure out what was retaining the monitored
environment when the leak detector verified
this._isReferenceBeingHeld, and see if we could disprove the leak by cutting that reference early. Fortunately, the failure could be reliably reproduced locally, so I modified the leak detector a bit, added a heap snapshot dump after
_runGarbageCollector() locally, and ran the reproduction again to analyze the state of the heap after
gc() was called. It turned out, however, that dumping a heap snapshot here simply made the failure go away.
Recall that I mentioned:
gc() was not intended to “collect all unreachable objects in the V8 heap”…
And that could be the cause of false positives in the leak detector? Well, as it turned out, the heap snapshot generation happened to do that job better. By design, heap snapshots are intended to capture what’s alive on the heap in a more accurate manner, so V8 tries to collect as many unreachable objects as possible to reduce the noise in the generated heap snapshot, including doing multiple rounds of garbage collection and invoking a lot more callbacks. As a result, by trying to capture a heap snapshot for analysis, we happened to force the more thorough garbage collection that the
LeakDetector needed, so the false positives were gone.
“Fixing” leak detection
As a workaround, Jest opened a PR to generate a heap snapshot as a last resort in
_runGarbageCollector(). While there could still be false positives with this approach (even the garbage collection triggered by heap snapshot generation is only done on a best-effort basis), at least it reduced the likelihood of false positives in their leak detector.
Regression test in Node.js
Now that we had confirmed that the “regression” reported by Jest was a false alarm, it was time to get the fix landed.
Like most bug fixes in Node.js, this needed a regression test. The PR went with an approach commonly used in Node.js core’s test suites:
- Limit the size of the old space (for details about the spaces in V8’s heap, see V8’s blog post about its garbage collector) to a small value (in this case, only 16 MB).
- Try allocating a lot of objects asynchronously (in this case, via
setTimeout) to give the GC a bit of room to kick in.
- When the object leaks, the test should crash with an Out-of-Memory failure, as V8 would not be able to purge enough memory to keep the heap within the given limit; this should reproduce reliably before the fix. When the object no longer leaks with the fix under verification, the crash should go away.
- For debugging purposes, --trace-gc can be used to observe whether the garbage collector is able to purge memory properly.
The regression test for
vm.Script, for example, looked like this. It crashed reliably before my fix, and no longer crashed locally or in the CI after my fix (at least when we ran the CI for the PR):
```js
// Flags: --max-old-space-size=16 --trace-gc
```
Similar tests were also added for other vm APIs that used to leak. The Node.js CI seemed to be happy with them along with the fix, so we finally landed the PR and closed a handful of issues that had bugged many users for years 🎉.
False positives in the CI
Just when I thought this was finally fixed…some other bad news came in again. (They just kept on coming, didn’t they?)
A few days after the PR landed, we noticed that the regression tests added for
vm.Script started flaking in the CI. They also failed somewhat more predictably in another PR that was upgrading V8 in Node.js to 11.8. Could there be another leak that we hadn’t noticed?
I couldn’t reproduce the flake locally with the V8 version on the main branch, but it was possible to reproduce it with the work-in-progress V8 11.8 upgrade. When I added a periodic
v8.writeHeapSnapshot() call in the test to analyze the crash, however, the crash went away again, and I could not find any evidence of a leak in the heap snapshots. So it seemed to be another false positive.
After some investigation, I concluded that this had something to do with the complexity of our graph:
- We rely on a WeakMap for the memory management to work correctly, and the WeakMap definitely ends up in the old space.
- The key (the host-defined option symbol) is retained by both the
v8::internal::Script and the referrer, and they participate in a cycle, part of which ends up in the new space.
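In plain JavaScript terms, the shape of that graph looks roughly like this (hypothetical stand-ins for the internal C++/V8 objects; symbols as WeakMap keys require Node.js 20+):

```js
// Hypothetical JS stand-ins for the internal objects involved; the
// real graph lives partly in C++. Symbols as WeakMap keys require
// Node.js 20+ (V8 11.x).
const registry = new WeakMap(); // ends up in the old space

function makeScript() {
  const hostDefinedOptionSymbol = Symbol('host-defined option'); // the key
  const script = { hostDefinedOptionSymbol }; // the script retains the key
  // The WeakMap value leads back to the script, forming a cycle that
  // only becomes collectable once the key is unreachable from user code.
  registry.set(hostDefinedOptionSymbol, { referrer: script });
  return script;
}
```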
The garbage collection for
WeakMap (or, ephemeron garbage collection) has been quite challenging to implement correctly in V8 (see this talk for an overview). And our model seemed to make it particularly hard for V8 to realize that the cycle could be collected once user code no longer referenced the
v8::internal::Script indirectly. Even though V8’s garbage collector could figure that out eventually, it needed multiple rounds of work. And in the newer version of V8, the garbage collection schedule was slightly altered (which happens all the time, of course), so simply allocating the objects asynchronously with a timer that timed out after 1ms put too much pressure on the garbage collector. The garbage collector just couldn’t keep up with the heap growth to purge the memory in time, leading to an Out-of-Memory crash.
If the false positive happened because we allocated too fast, could we slow it down a bit? Indeed, if we increased the timeout a bit to give more time for the GC to run, it was possible for V8’s garbage collector to keep up with the heap growth a bit more easily, but that would increase the time spent on running the test, which might lead to a timeout in the CI on slower machines. So, that wouldn’t be the best solution.
After some local experiments, I came up with another approach that could lower the chance of false positives and could finish the test in a reasonable amount of time:
async function checkIfCollectable(
To make use of the utility:
```js
// Flags: --max-old-space-size=16 --trace-gc
```
With this approach:
- We did make use of
FinalizationRegistry, but instead of relying on the absence of finalizer invocation to detect leaks - since it was not reliable in that regard - we used the presence of finalizer invocation to determine negatives. We created many samples (in this case, 2048) instead of counting on one sample to work. Any
vm.Script that could be collected would disprove the leak, so once we detected a collected
vm.Script, we could just let the test pass early, preventing the test from timing out.
- We generated a snapshot every 512 allocations - essentially pausing everything else and asking the garbage collector to work really hard to purge the memory. This was a hack and could be quite slow; real-world applications should never abuse heap snapshots this way, but we were only writing a test here, so it should be fine. And we only used it for the
vm.Script test, where this was necessary to deflake the test.
This was still not immune to false positives, but it did work well enough to deflake the tests in the Node.js CI, so we landed the changes and the tests haven’t flaked since.
On a side note: I introduced another possibly more reliable approach for leak detection in Node.js later, which I may write about in another blog post. For this series, let’s focus on the vm APIs.
Fixing the memory issues was not enough?
In some of the duplicate issues I found for the memory bugs, I saw people complaining that this was blocking them from upgrading away from Node.js 16, so I thought that the memory fixes should unblock the upgrade for them. It turned out, though, that this was just part of the puzzle. Some users may have been affected more by the memory problem, but there was another major blocker -
vm.Script compilation could be up to 100x slower than in v16 in some cases.
At that point, Node.js v16 had already reached End-of-Life. While fixing
vm was not really on my plate (recall that I only started working on the memory problems because I was working on something remotely related - the migration to
cppgc), I received enough requests from affected users to know that fixing this would be important to the ecosystem, and would be beneficial in the greater picture of my job anyway, so I decided to spend a few more cycles on it.
Let’s move on to the next post to talk about the other blocker.