Fixing Node.js vm APIs, part 3 - verifying the fixes

In the last post I wrote about how a new memory management model in Node.js vm APIs was designed to fix the long-standing leaks and use-after-free issues, and we finally put together a PR that the Node.js core CI and CITGM CI were happy with. The road to landing it and releasing it to LTS still turned out to be bumpy, however.

“Leak” detected by Jest’s leak detector

As mentioned in the last post, just when I thought the new fix should be good to go since we didn’t find any obvious regressions, we got some bad news from Jest. Specifically it appeared that Jest’s --detect-leaks found a leak when running tests locally with the PR. Was there a leak that we weren’t able to catch in our tests, but somehow detected by Jest? To find out what caused this failure, I looked into Jest’s leak detector.

At that time, Jest’s leak detector roughly looked like below - it used FinalizationRegistry to register a callback that may be called when the monitored object was garbage collected, and attempted a garbage collection using V8’s internal gc() that is only available with --expose-gc (to that end Jest used v8.setFlagsFromString() to temporarily expose that utility to the global scope), waited for a few ticks, and checked if the FinalizationRegistry was called to determine if the monitored object was collected. If it wasn’t, Jest considered it leaked.

const tick = require('util').promisify(setImmediate);

class LeakDetector {
  _isReferenceBeingHeld;
  _finalizationRegistry;

  constructor(value) {
    // When `_finalizationRegistry` is GCed the callback we set will no longer be called,
    this._finalizationRegistry = new FinalizationRegistry(() => {
      this._isReferenceBeingHeld = false;
    });
    this._finalizationRegistry.register(value, undefined);
    this._isReferenceBeingHeld = true;

    // Ensure value is not leaked by the closure created by the "weak" callback.
    value = null;
  }
  async isLeaking() {
    this._runGarbageCollector();

    // wait some ticks to allow GC to run properly
    for (let i = 0; i < 10; i++) {
      await tick();
    }
    return this._isReferenceBeingHeld;
  }
  _runGarbageCollector() {
    const isGarbageCollectorHidden = globalThis.gc == null;

    // GC is usually hidden, so we have to expose it before running.
    require('v8').setFlagsFromString('--expose-gc');
    require('vm').runInNewContext('gc')();

    // The GC was not initially exposed, so let's hide it again.
    if (isGarbageCollectorHidden) {
      require('v8').setFlagsFromString('--no-expose-gc');
    }
  }
}

// Create a detector for a Jest test environment, but don't keep it alive.
const detector = new LeakDetector(environment);
// Some time later, when environment is gone..
const leaked = await detector.isLeaking();

From my experience dealing with leak detection, there were two problems in this implementation that immediately stood out to me:

By design, FinalizationRegistry is not guaranteed to invoke the finalizer callback predictably when an object goes away, so this approach was subject to false positives. To quote the language proposal that added FinalizationRegistry to JavaScript:

If an application or library depends on GC cleaning up a WeakRef or calling a finalizer in a timely, predictable manner, it’s likely to be disappointed: the cleanup may happen much later than expected, or not at all.
This approach used gc() in an attempt to “collect all unreachable objects in the V8 heap”, but that was not what gc() was for. --expose-gc and gc() are V8 internal APIs that were designed for V8 tests. They could be used to request the garbage collector to perform one round of work on the V8 heap, but the garbage collector is not designed to “collect all unreachable objects in the V8 heap” with just one round of work. For that purpose, multiple rounds of work and additional tweaks to the garbage collection process may be needed since some steps (e.g. invocation of weak callbacks) are not designed to be run immediately. In addition, there are some additional nuances when it comes to using gc() to collect a graph that contains scripts. All these meant that using gc() to “collect all unreachable objects in the V8 heap” is destined to be disappointing.

Investigating the potential “leak”

Now we know that this might be a false alarm, but we could still not rule out a real leak. We should at least figure out what is retaining the monitored environment when the leak detector verifies this._isReferenceBeingHeld, and see if we can disprove the leak by cutting that reference early. Fortunately the failure could be reliably reproduced locally, so I modified the leak detector a bit, added a v8.writeHeapSnapshot() to _runGarbageCollector() locally, and ran the reproduction again to analyze the state of the heap after gc() was called. It turned out, however, that dumping a heap snapshot here simply made the failure go way.

Recall that I mentioned:

gc() was not intended to “collect all unreachable objects in the V8 heap”…

And that could be the cause of false positives in the leak detector? Well, as it turned out, the heap snapshot generation happened to do that job better. By design, heap snapshots are intended to capture what’s alive on the heap in a more accurate manner, so V8 would try to collect as much unreachable objects as possible to reduce the noise in the generated heap snapshot, including doing multiple rounds of garbage collection and invoking a lot more callbacks. As a result, by trying to capture a heap snapshot for analysis, we happened to force a more thorough garbage collection that the LeakDetector needed, so the false positives were gone.

“Fixing” leak detection

As a workaround, Jest opened a PR to generate a heap snapshot as last-resort in _runGarbageCollector(). While there could still be false positives with this approach (even the garbage collection triggered by heap snapshot generation is only done with best-effort), at least it reduced the likelihood of false positives in their leak detector.

Regression test in Node.js

Now that we confirmed that the “regression” reported by Jest was a false alarm, it’s time to get the fix landed.

Like most bug fixes in Node.js, this needed a regression test. The PR went with an approach commonly used in Node.js core’s test suites:

Limit the size of the old-space (on details about the spaces in V8’s heap, see v8’s blog post about its garbage collector) to a small value (in this case, only 16 MB).
Try allocating a lot of objects asynchronously (in this case, via setTimeout) to give the GC a bit of room to kick in.
When the object leaks, the test should crash with Out-of-Memory failures, as V8 would not be able to purge enough memory to keep the heap fit into the given limit, and this should reproduce reliably before the fix. When the object no longer leaks with the fix that we try to verify, the crash should go away.
For debugging purposes, --trace-gc can be used to observe if the garbage collector is able to purge memory properly.

For example, the regression test for vm.Script for example looked like this. It crashed reliably before my fix, and no longer crashed locally or in the CI after my fix (at least when we ran the CI for the PR):

// Flags: --max-old-space-size=16 --trace-gc
'use strict';

const vm = require('vm');
let count = 0;

function main() {
  // Try to reach the maximum old space size.
  new vm.Script(`"${Math.random().toString().repeat(512)}";`, {
    async importModuleDynamically() {},
  });
  if (count++ < 2 * 1024) {
    setTimeout(main, 1);
  }
}

main();

Similar tests were also added for other vm APIs that used to leak. The Node.js CI seemed to be happy with them along with the fix, so we finally landed this PR, and closed a handful of issues that have been bugged many users for years 🎉.

False positives in the CI

Just when I thought this was finally fixed…some other bad news came in again. (They just kept on coming, didn’t they?)

A few days after the PR landed, we started to notice that the regression tests added for vm.SourceTextModule and vm.Script started flaking in the CI. They also failed somewhat more predictably in another PR that was upgrading V8 in Node.js to 11.8. Could there be another leak that we didn’t notice?

I couldn’t reproduce the flake locally with the V8 version on the main branch, but it was possible to reproduce them using the work-in-progress V8 11.8 upgrade. When I added a v8.writeHeapSnapshot() periodically in the test to analyze the crash, however, the crash went away again, and I could not find any evidence of leak in the heap snapshot. So it seemed to be another false positive.

After some investigation, I concluded that this had something to do with the complexity of our graph:

We rely on a WeakMap which for the memory management to work correctly.
The WeakMap definitely ends up in the old space
The key (host-defined option symbol) is retained by both the v8::internal::Script as well as the referrer, and they participate in a cycle, part of which ends up in the new space.

The garbage collection for WeakMap (or, the ephemeron garbage collection) has been quite challenging to implement correctly in V8 (see this talk for an overview). And our model seemed to make it particularly hard for V8 to realize that the cycle could be collected when user code no longer referenced that v8::internal::Script indirectly. Even though V8’s garbage collector could figure that out eventually, it needed multiple rounds of work. And in the newer version of V8, the garbage collection schedule was slightly altered (which happens all the time of course), so simply allocating the objects asynchronously with a timer that timed out after 1ms was putting too much pressure on the garbage collector. The garbage collector just couldn’t keep up with the heap growth to purge the memory in time, leading to a Out-of-Memory crash.

So now we know that this was more of a garbage collector performance issue than a correctness issue, and this was specific to the way we put extreme pressure on the garbage collector. Real-world applications, however, would usually do more other operations - especially asynchronous ones - between the allocation of scripts, giving the garbage collector a bit more room to kick in, or allow it to do some work concurrently while JavaScript is executed on the main thread. So this performance issue is less likely to be problematic for real-world applications. Still, it was not great to have a flaky test, so we should find a more reliable way to verify the absence of a leak.

If the false positive happened because we allocated too fast, could we slow it down a bit? Indeed, if we increased the timeout a bit to give more time for the GC to run, it was possible for V8’s garbage collector to keep up with the heap growth a bit more easily, but that would increase the time spent on running the test, which might lead to a timeout in the CI on slower machines. So, that wouldn’t be the best solution.

After some local experiments, I came up with another approach that could lower the chance of false positives and could finish the test in a reasonable amount of time:

async function checkIfCollectable(
  fn, maxCount, generateSnapshotAt = Infinity, logEvery = 128) {
  let anyFinalized = false;
  let count = 0;

  const f = new FinalizationRegistry(() => { anyFinalized = true; });

  async function createObject() {
    const obj = await fn();
    f.register(obj);

    if (count++ < maxCount && !anyFinalized) {
      setImmediate(createObject);
    }
    if (count % generateSnapshotAt === 0) {
      require('v8').getHeapSnapshot().pause().read();
      console.log(`Generated heap snapshot at ${count}`);
    }

    if (count % logEvery === 0) {
      console.log(`Created ${count} objects`);
    }

    if (anyFinalized) {
      console.log(`Found finalized object at ${count}, stop testing`);
    }
  }

  createObject();
}

To make use of the utility:

// Flags: --max-old-space-size=16 --trace-gc

const vm = require('vm');
async function createContextifyScript() {
  // Try to reach the maximum old space size.
  return new vm.Script(`"${Math.random().toString().repeat(512)}";`, {
    async importModuleDynamically() {},
  });
}
// Create 2048 vm.Scripts, if none of them are collectable, they
// would not fit into 16 MB of old space so the test would crash by
// Out-of-memory failure. If we detect that any of them is collectable
// before all 2048 vm.Scripts are created, the test will pass early.
checkIfCollectable(createContextifyScript, 2048, 512);

With this approach:

We did make use of FinalizationRegistry, but instead of relying on the absence of the finalizer invocation to detect leaks - since it was not reliable in that regard - we used the presence of finalizer invocation to determine negatives. We created many samples (in this case, 2048) instead of just counting on one sample to work. Any vm.Script that could be collected can disprove the leak, so once we detected a collected vm.Script, we could just let the test pass early, thus preventing the test from timing out.
We generated a snapshot at every 512 allocations - essentially pausing everything else and asking the garbage collector to work really hard to purge the memory. This was a hack and could be quite slow, real world applications should never abuse heap snapshots this way, but we were only writing a test here, so it should be fine. And we only used it for the vm.SourceTextModule and vm.Script test where this was necessary to deflake the test.

This was still not immune to false positives, but it did work well enough to deflake the tests in the Node.js CI, so we landed the changes and the tests haven’t flaked since.

On a side note: I introduced another possibly more reliable approach for leak detection in Node.js later, which I may write about in another blog post. For this series, let’s focus on the vm APIs.

Fixing the memory issues was not enough?

In some of the duplicate issues I found for the memory bugs, I saw people complaining this was blocking them from upgrading away from Node.js 16, so I thought that the memory fixes should unblock the upgrade for them. It turned out though that this was just part of the puzzle. Some users may be affected more by the memory problem, but there was another major blocker - vm.Script compilation could be up to 100x slower in v16 in some cases.

At that point, Node.js v16 already reached End-of-Life. While fixing vm was not really on my plate (recall that I only started working on the memory problems because I was working on something remotely related - the migration to cppgc), I received enough requests from affected users to know that fixing this would be important to ecosystem, and would be beneficial in the greater picture of my job anyway, so I got to spend a few more cycles on this.

Let’s move on to the next post to talk about the other blocker.