
Virtual File System for Node.js#61478

Open
mcollina wants to merge 118 commits into nodejs:main from mcollina:vfs

Conversation

@mcollina
Member

@mcollina mcollina commented Jan 22, 2026

This PR adds a first-class virtual file system module (node:vfs) with a provider-based architecture that integrates with Node.js's fs module and module loader.

Key Features

  • Provider Architecture - Extensible design with pluggable providers:

    • MemoryProvider - In-memory file system with full read/write support
    • SEAProvider - Read-only access to Single Executable Application assets
    • VirtualProvider - Base class for creating custom providers
  • Standard fs API - Uses familiar writeFileSync, readFileSync, mkdirSync instead of custom methods

  • Mount Mode - The VFS mounts at a specific path prefix (e.g., /virtual), keeping a clear separation from the real filesystem

  • Module Loading - require() and import work seamlessly from virtual files

  • SEA Integration - Assets automatically mounted at /sea when running as a Single Executable Application

  • Full fs Support - readFile, stat, readdir, exists, streams, promises, glob, symlinks

Example

const vfs = require('node:vfs');
const fs = require('node:fs');

// Create a VFS with default MemoryProvider
const myVfs = vfs.create();

// Use standard fs-like API
myVfs.mkdirSync('/app');
myVfs.writeFileSync('/app/config.json', '{"debug": true}');
myVfs.writeFileSync('/app/module.js', 'module.exports = "hello"');

// Mount to make accessible via fs module
myVfs.mount('/virtual');

// Works with standard fs APIs
const config = JSON.parse(fs.readFileSync('/virtual/app/config.json', 'utf8'));
const mod = require('/virtual/app/module.js');

// Cleanup
myVfs.unmount();

SEA Usage

When running as a Single Executable Application, bundled assets are automatically available:

const fs = require('node:fs');

// Assets are automatically mounted at /sea - no setup required
const config = fs.readFileSync('/sea/config.json', 'utf8');
const template = fs.readFileSync('/sea/templates/index.html', 'utf8');

Public API

const vfs = require('node:vfs');

vfs.create([provider][, options])  // Create a VirtualFileSystem
vfs.VirtualFileSystem              // The main VFS class
vfs.VirtualProvider                // Base class for custom providers
vfs.MemoryProvider                 // In-memory provider
vfs.SEAProvider                    // SEA assets provider (read-only)

Disclaimer: I've used a significant amount of Claude Code tokens to create this PR. I've reviewed all changes myself.


Fixes #60021

@nodejs-github-bot
Collaborator

Review requested:

  • @nodejs/single-executable
  • @nodejs/test_runner

@nodejs-github-bot added labels on Jan 22, 2026: lib / src (Issues and PRs related to general changes in the lib or src directory), needs-ci (PRs that need a full CI run).
@avivkeller added labels on Jan 22, 2026: fs (Issues and PRs related to the fs subsystem / file system), module (Issues and PRs related to the module subsystem), semver-minor (PRs that contain new features and should be released in the next minor version), notable-change (PRs with changes that should be highlighted in changelogs), needs-benchmark-ci (PRs that need a benchmark CI run), test_runner (Issues and PRs related to the test runner subsystem).
@github-actions
Contributor

The notable-change label (PRs with changes that should be highlighted in changelogs) has been added by @avivkeller.

Please suggest a text for the release notes if you'd like to include a more detailed summary, then proceed to update the PR description with the text or a link to the notable change suggested text comment. Otherwise, the commit will be placed in the Other Notable Changes section.

@Ethan-Arrowood
Contributor

Nice! This is a great addition. Since it's such a large PR, this will take me some time to review. Will try to tackle it over the next week.

*/
existsSync(path) {
// Prepend prefix to path for VFS lookup
const fullPath = this.#prefix + (StringPrototypeStartsWith(path, '/') ? path : '/' + path);
Member

Can we use path.join?

validateObject(files, 'options.files');
}

const { VirtualFileSystem } = require('internal/vfs/virtual_fs');
Member

Shouldn't we import this at the top level / lazy load it at the top level?

ArrayPrototypePush(this.#mocks, {
__proto__: null,
ctx,
restore: restoreFS,
Member

Suggested change:
- restore: restoreFS,
+ restore: ctx.restore,

nit

* @param {object} [options] Optional configuration
*/
addFile(name, content, options) {
const path = this._directory.path + '/' + name;
Member

Can we use path.join?

let entry = current.getEntry(segment);
if (!entry) {
// Auto-create parent directory
const dirPath = '/' + segments.slice(0, i + 1).join('/');
Member

Let's use path.join

let entry = current.getEntry(segment);
if (!entry) {
// Auto-create parent directory
const parentPath = '/' + segments.slice(0, i + 1).join('/');
Member

path.join?

}
}
callback(null, content);
}).catch((err) => {
Member

Suggested change:
- }).catch((err) => {
+ }, (err) => {
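The suggested change is behavioral, not just stylistic: a chained .catch also intercepts exceptions thrown inside the fulfillment handler, while the two-argument .then(onFulfilled, onRejected) form does not. A minimal sketch (hypothetical handlers, not the PR's code):

```javascript
const work = () => Promise.resolve('content');

// A chained .catch also intercepts errors thrown by the success handler,
// so a callback invoked there could be called a second time with an error
// after it already received a value.
work()
  .then((value) => { throw new Error('thrown in success handler'); })
  .catch((err) => console.log('.catch sees:', err.message));

// The two-argument form only handles rejections of the original promise;
// an error thrown by the success handler propagates onward instead.
work()
  .then(
    (value) => { throw new Error('thrown in success handler'); },
    (err) => console.log('never reached: work() did not reject'),
  )
  .catch(() => { /* swallow so the sketch exits cleanly */ });
```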

Comment on lines +676 to +677
const bytesToRead = Math.min(length, available);
content.copy(buffer, offset, readPos, readPos + bytesToRead);
Member

Primordials?

}

callback(null, bytesToRead, buffer);
}).catch((err) => {
Member

Suggested change:
- }).catch((err) => {
+ }, (err) => {

@avivkeller
Member

Left an initial review, but like @Ethan-Arrowood said, it'll take time for a more in depth look

@joyeecheung
Member

joyeecheung commented Jan 22, 2026

It's nice to see some momentum in this area, though from a first glance it seems the design has largely overlooked the feedback from real-world use cases collected 4 years ago: https://github.com/nodejs/single-executable/blob/main/docs/virtual-file-system-requirements.md - I think it's worth checking that the API satisfies the constraints that users of this feature have provided, so as not to waste the work prior contributors did to gather them, or end up having to reinvent it later (possibly in a breaking manner) to satisfy these real-world requirements.

@codecov

codecov bot commented Jan 22, 2026

Codecov Report

❌ Patch coverage is 90.79489% with 828 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.66%. Comparing base (e0928d6) to head (bc19ec8).
⚠️ Report is 17 commits behind head on main.

Files with missing lines Patch % Lines
lib/internal/vfs/setup.js 83.25% 255 Missing ⚠️
lib/internal/vfs/providers/memory.js 82.25% 162 Missing ⚠️
lib/internal/vfs/watcher.js 86.59% 82 Missing and 3 partials ⚠️
lib/internal/vfs/providers/real.js 85.26% 56 Missing ⚠️
lib/internal/vfs/streams.js 82.20% 55 Missing ⚠️
lib/internal/vfs/file_system.js 96.58% 46 Missing ⚠️
lib/internal/vfs/provider.js 92.99% 36 Missing and 7 partials ⚠️
lib/internal/vfs/stats.js 84.88% 34 Missing ⚠️
src/node_sea.cc 64.28% 12 Missing and 8 partials ⚠️
lib/internal/vfs/file_handle.js 97.08% 15 Missing and 2 partials ⚠️
... and 10 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #61478      +/-   ##
==========================================
- Coverage   89.68%   89.66%   -0.02%     
==========================================
  Files         676      692      +16     
  Lines      206555   215350    +8795     
  Branches    39552    41167    +1615     
==========================================
+ Hits       185249   193096    +7847     
- Misses      13444    14371     +927     
- Partials     7862     7883      +21     
Files with missing lines Coverage Δ
lib/fs.js 98.53% <100.00%> (+0.34%) ⬆️
lib/internal/bootstrap/realm.js 96.21% <100.00%> (+<0.01%) ⬆️
lib/internal/fs/utils.js 99.68% <100.00%> (+<0.01%) ⬆️
lib/internal/modules/cjs/loader.js 98.20% <100.00%> (+0.05%) ⬆️
lib/internal/modules/esm/get_format.js 94.83% <100.00%> (ø)
lib/internal/modules/esm/load.js 91.47% <100.00%> (ø)
lib/internal/modules/esm/resolve.js 99.03% <100.00%> (-0.01%) ⬇️
lib/internal/modules/esm/translators.js 97.67% <100.00%> (+<0.01%) ⬆️
lib/internal/modules/helpers.js 98.73% <100.00%> (+0.01%) ⬆️
lib/internal/modules/package_json_reader.js 99.72% <100.00%> (+<0.01%) ⬆️
... and 26 more

... and 58 files with indirect coverage changes


@jimmywarting

jimmywarting commented Jan 22, 2026

And why not something like OPFS aka whatwg/fs?

const rootHandle = await navigator.storage.getDirectory()
await rootHandle.getFileHandle('config.json', { create: true })
fs.mount('/app', rootHandle) // to make it work with fs
fs.readFileSync('/app/config.json')

OR

const rootHandle = await navigator.storage.getDirectory()
await rootHandle.getFileHandle('config.json', { create: true })

fs.readFileSync('sandbox:/config.json')

fs.createVirtual seems like a competing specification

@mcollina force-pushed the vfs branch 3 times, most recently from 5e317de to 977cc3d on January 23, 2026 08:15
@mcollina
Member Author

And why not something like OPFS aka whatwg/fs?

I generally prefer to avoid interleaving with WHATWG specs for core functionality (e.g., SEA). In my experience, they tend to perform poorly on our codebase and remove a few degrees of flexibility. (I also don't find much fun in working on them, and I'm far less interested in contributing to that.)

On the implementation side, the core functionality of this feature would be identical (technically, it's missing the writes that OPFS supports), as we would need to touch all our internal fs methods anyway.

If this lands, we can certainly iterate on a WHATWG-compatible API for this, but I would not add this to this PR.

@juliangruber
Member

Small prior art: https://github.com/juliangruber/subfs

@mcollina force-pushed the vfs branch 2 times, most recently from 8d711c1 to 73c18cd on January 23, 2026 13:19
@Qard
Member

Qard commented Jan 23, 2026

I also worked on this a bit on the side recently: Qard@73b8fc6

That is very much in the chaotic ideation stage, with a bunch of LLM assistance to try some different ideas. The broader concept I was aiming for was a VirtualFileSystem type that actually implements the entire API surface of the fs module, accepting a Provider type to delegate the internals. A singular class would manage the whole cluster of fs-related types, so the fs module could actually just be fully converted to:

module.exports = new VirtualFileSystem(new LocalProvider())

I intended for it to be extensible for a bunch of different interesting scenarios, so there's also an S3 provider and a zip file provider there, mainly just to validate that the model can be applied to other varieties of storage systems effectively.
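The delegation model described above can be sketched in a few lines; the class and method names here (VirtualFileSystem, MemoryProvider, read/write) are illustrative placeholders, not the actual code from either branch:

```javascript
// A provider implements only primitive storage operations.
class MemoryProvider {
  #files = new Map();
  read(path) {
    if (!this.#files.has(path)) throw new Error(`ENOENT: ${path}`);
    return this.#files.get(path);
  }
  write(path, data) { this.#files.set(path, data); }
}

// The front end exposes an fs-like surface and delegates to the provider,
// so the same class could sit in front of memory, zip, S3, ... backends.
class VirtualFileSystem {
  #provider;
  constructor(provider) { this.#provider = provider; }
  readFileSync(path) { return this.#provider.read(path); }
  writeFileSync(path, data) { this.#provider.write(path, data); }
}

const vfs = new VirtualFileSystem(new MemoryProvider());
vfs.writeFileSync('/hello.txt', 'world');
console.log(vfs.readFileSync('/hello.txt')); // 'world'
```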

Keep in mind, like I said, the current state is very much just ideation in a branch I pushed up just now to share, but I think there are concepts for extensibility in there that we could consider to enable a whole ecosystem of flexible storage providers. 🙂

Personally, I would hope for something which could provide both read and write access through an abstraction with swappable backends of some variety, this way we could pass around these virtualized file systems like objects and let an ecosystem grow around accepting any generalized virtual file system for its storage backing. I think it'd be very nice for a lot of use cases like file uploads or archive management to be able to just treat them like any other readable and writable file system.

@jimmywarting

jimmywarting commented Jan 23, 2026

Personally, I would hope for something which could provide both read and write access through an abstraction with swappable backends of some variety, this way we could pass around these virtualized file systems like objects and let an ecosystem grow around accepting any generalized virtual file system for its storage backing. I think it'd be very nice for a lot of use cases like file uploads or archive management to be able to just treat them like any other readable and writable file system.

just a bit off topic... but this reminds me of why i created this feature request:
Blob.from() for creating virtual Blobs with custom backing storage

Would not lie, it would be cool if NodeJS also provided some type of static Blob.from function to create virtual lazy blobs. could live on fs.blobFrom for now...

example that would only work in NodeJS (based on how it works internally)

const size = 26

const blobPart = BlobFrom({
  size,
  stream (start, end) {
    // can either be sync or async (that resolves to a ReadableStream)
    // return new Response('abcdefghijklmnopqrstuvwxyz'.slice(start, end)).body
    // return new Blob(['abcdefghijklmnopqrstuvwxyz'.slice(start, end)]).stream()
    
    return fetch('https://httpbin.dev/range/' + size, {
      headers: {
        range: `bytes=${start}-${end - 1}`
      }
    }).then(r => r.body)
  }
})

blobPart.text().then(text => {
  console.log('a-z', text)
})

blobPart.slice(-3).text().then(text => {
  console.log('x-z', text)
})

const a = blobPart.slice(0, 6)
a.text().then(text => {
  console.log('a-f', text)
})

const b = a.slice(2, 4)
b.text().then(text => {
  console.log('c-d', text)
})
x-z xyz
a-z abcdefghijklmnopqrstuvwxyz
a-f abcdef
c-d cd
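For comparison, the built-in Blob (a global since Node.js 18) already implements the re-offsetting slice semantics the sketch above emulates:

```javascript
// Standard Blob slicing: negative indices count from the end, and
// slices of slices are re-offset against the original data.
const blob = new Blob(['abcdefghijklmnopqrstuvwxyz']);
const tail = blob.slice(-3);
const inner = blob.slice(0, 6).slice(2, 4);

Promise.all([tail.text(), inner.text()]).then(([t, i]) => {
  console.log('tail:', t);  // 'xyz'
  console.log('inner:', i); // 'cd'
});
```

What the built-in Blob lacks, and what the feature request above asks for, is lazy backing storage.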

An actual working PoC

(I would not rely on this unless it became officially supported by nodejs core - this is a hack)

const blob = new Blob()
const symbols = Object.getOwnPropertySymbols(blob)
const blobSymbol = symbols.map(s => [s.description, s])
const symbolMap = Object.fromEntries(blobSymbol)
const {
  kHandle,
  kLength,
} = symbolMap

function BlobFrom ({ size, stream }) {
  const blob = new Blob()
  if (size === 0) return blob

  blob[kLength] = size
  blob[kHandle] = {
    span: [0, size],

    getReader () {
      const [start, end] = this.span
      if (start === end) {
        return { pull: cb => cb(0) }
      }

      let reader

      return {
        async pull (cb) {
          reader ??= (await stream(start, end)).getReader()
          const {done, value} = await reader.read()
          cb(done ^ 1, value)
        }
      }
    },

    slice (start, end) {
      const [baseStart] = this.span

      return {
        span: [baseStart + start, baseStart + end],
        getReader: this.getReader,
        slice: this.slice,
      }
    }
  }

  return blob
}

currently problematic to do: new Blob([a, b]), new File([blobPart], 'alphabet.txt', { type: 'text/plain' })

We'd also need to handle clone, serialize & deserialize properly. If this were to be sent off to another worker, I would transfer a MessageChannel where the worker thread asks the main thread to hand back a transferable ReadableStream whenever it needs to read something.

But there are probably better ways to handle this internally in core, piping data directly to and from different destinations without having to touch the JS runtime - if only getReader could return the reader directly instead of needing to read from the ReadableStream using JS.

const fs = require('fs');
const assert = require('assert');

// Test that the VFS is automatically mounted at /sea
Contributor

On the edge case side, can we test/document what happens when switching drive letters in the working directory at runtime on Windows? My intuition is that it would break /sea reads because the normalized path would change from c:\sea\config.json to d:\sea\config.json (assuming we switch from C to D). But I'm not sure if that is the current or expected behavior.

return this.lstatSync(path, options);
}

readdirSync(path, options) {

Is recursive option handled for readdir and readdirSync? Just making sure because it's missing in the @platformatic/vfs package.

@indutny
Member

indutny commented Mar 17, 2026

Does this PR adhere to Developer's Certificate of Origin 1.1 described in CONTRIBUTING.md?

(a) Does not seem to apply:

 (a) The contribution was created in whole or in part by me and I
     have the right to submit it under the open source license
     indicated in the file; or

(b) Also doesn't apply because this is based on previous work, but no one can assert knowledge of the licensing terms of the code it is based on:

 (b) The contribution is based upon previous work that, to the best
     of my knowledge, is covered under an appropriate open source
     license and I have the right under that license to submit that
     work with modifications, whether created in whole or in part
     by me, under the same open source license (unless I am
     permitted to submit under a different license), as indicated
     in the file; or

(c) Does not apply:

 (c) The contribution was provided directly to me by some other
     person who certified (a), (b) or (c) and I have not modified
     it.

Disclaimer: I've used a significant amount of Claude Code tokens to create this PR. I've reviewed all changes myself.

I admire the aspiration behind the change, though

@mcollina
Member Author

@indutny, based on your question, you are asking if AI-assisted development is compatible with "Developer's Certificate of Origin 1.1". I'm not a lawyer, so I'm not prepared to answer that question from a legal standpoint.

My understanding is that this PR aligns with the LF's recommendations (https://www.linuxfoundation.org/legal/generative-ai) and with the discussion in (openjs-foundation/cross-project-council#1509 (comment)).

I would prefer to keep this thread technical. Would you mind opening an issue in the Node.js or TSC repository to discuss this matter?

Is this a hard block?

@mcollina
Member Author

If that's not clear, I assert that I've followed the DCO.

@indutny
Member

indutny commented Mar 17, 2026

I'm not a lawyer, so I'm not prepared to answer that question from a legal standpoint.

I'm neither, but if we find it hard to answer the questions in the Certificate of Origin from a legal standpoint, then at least it warrants further review and discussion.

I would prefer to keep this technical thought. Would you mind opening an issue in the Node.js or TSC repository to discuss this matter?

I believe this discussion to be in direct relation to the content and nature of this contribution. I don't mind starting a separate conversation elsewhere, but we might as well have it here since in my opinion it has to be resolved before this change can be merged.

Is this a hard block?

Although I'm not a fan of stating things strongly, yes. For me it is a hard block.


(Please don't take any of my comments as a personal criticism. You have my deep respect for both the work you're doing and the interactions we had)

@mcollina
Member Author

While anyone can see your points, this is my position on the DCO:

(a) The contribution was created in whole or in part by me and I
have the right to submit it under the open source license
indicated in the file; or

This contribution was created by me with massive help from @claude. The design of this feature is mostly mine, with a lot of inspiration from @Qard's earlier work, which he shared with me (https://github.com/Qard/node/tree/vfs) after I opened the PR (I'm open to adding him as a Co-Authored-By on this PR).

As with the majority of features in this project, the outcome is the result of all the contributors participating.

(b) The contribution is based upon previous work that, to the best
of my knowledge, is covered under an appropriate open source
license and I have the right under that license to submit that
work with modifications, whether created in whole or in part
by me, under the same open source license (unless I am
permitted to submit under a different license), as indicated
in the file; or

This contribution is massively based on Node.js itself. It exposes the same API as the fs module already present.

All of this can be seen from the massive commit history in this branch.


As it seems that you have a different position on this issue, I will bring this to the attention of the @nodejs/tsc for an official vote. On a side note, I will also officially raise the topic at the next OpenJS board meeting.


On a personal note, I think this issue raises a different question: whether AI-assisted development is recognized as a practice when contributing to Open Source. And what would be the long-term impact for projects not accepting AI-assisted contributions?

@joyeecheung
Member

joyeecheung commented Mar 17, 2026

FYI there's a different PR discussing policy regarding AI-assisted/AI-generated contributions: #62105

@indutny
Member

indutny commented Mar 17, 2026

This contribution is massively based on Node.js itself. It exposes the same API as the fs module already present.

I agree that part of Claude's training data is Node.js itself, but it is also well known that it contains both unlicensed source code and source code with incompatible licenses. Producing code with LLM tools requires effort on the author's side through writing prompts, but I view the generated code as unattributed and unlicensed material that shouldn't become part of Node.js (especially given the size and scope of this change).

As it seems that you have a different position on this issue, I will bring this to the attention of the @nodejs/tsc for an official vote. On a side note, I would also officially summon the topic at the next OpenJS board meeting.

I appreciate it!

@jasnell
Member

jasnell commented Mar 17, 2026

Does this PR adhere to Developer's Certificate of Origin 1.1 described in CONTRIBUTING.md? ...

Yes. The DCO is correctly applied here. It doesn't matter what tool we individually use to assist in writing the code. While AI agents are more advanced than the typical auto-complete / auto-suggest mechanisms in code editors, they fall into the same basic category of coding-assistant tools. The DCO does not assert that every line of code in the contribution was written by hand by the person opening the PR; otherwise we wouldn't be able to take tool-generated OpenSSL configurations or other automation-generated files in any PR.

There's no issue here and no reason to block. By opening the PR, @mcollina is asserting that he is responsible for the code and has reviewed it himself, which I trust.

Simplify srcIsDir checks in cp-sync.js and cp.js to one-liner with
optional chaining and nullish coalescing. Use isPromise() instead of
duck-typing in memory provider. Use FunctionPrototypeSymbolHasInstance
instead of instanceof in vfs.js.
@indutny
Member

indutny commented Mar 17, 2026

Yes. The DCO is correctly applied here. It doesn't matter what tool we individually use to assist in writing the code. While AI agents are more advanced that the typical auto-complete / auto-suggest mechanisms in code editors they fall into the same basic category of coding assistant tools.

By the same logic I could claim that cp -rf is a more primitive version of autocomplete and an assistive tool that I can use to submit contributions, but I don't think anyone would question that using cp -rf to copy code from a GPL-licensed repo would be in violation of the clauses of the DCO.


But anyway, I appreciate y'all initiating a vote and a formal discussion on this.

@jasnell
Member

jasnell commented Mar 17, 2026

By the same logic I could claim that cp -rf is a more primitive version...

That's a false equivalency. Auto-complete, templated generators, etc. aren't just copying. If you can point to any specific lines of code in this PR that are copied verbatim from any source that is incompatible with our license and doesn't meet common-sense reasonable-use standards (e.g. you can't copyright common patterns), then those should be called out specifically.

@indutny
Member

indutny commented Mar 17, 2026

If you can point to any specific lines of code in this PR that are copied verbatim from any source that is incompatible with our license and doesn't meet the common sense reasonable use standards (e.g. you can't copyright common patterns) then those should be called out specifically.

Doesn't the DCO exist partly to make sure that the burden of identifying potential license violations lies on the submitter of the proposed change? Is "copied verbatim" the minimum bar for plagiarism that we set for contributions? Would exactly the same code with changed variable names or swapped if/else clauses be considered a different implementation?

I think we can agree that at the very least these are highly debatable issues unlike the contributions written without LLM assistance that we have been encouraging and merging since the beginning of this project.

@jasnell
Member

jasnell commented Mar 17, 2026

The key part of the DCO assertion is this (in bold): "The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file"

In other words, the contributor is asserting they have the right to submit. If agent output turns out to include material that violates someone's copyright, the contributor made a false certification; just like if they cp -rf'd GPL code and signed the DCO anyway. The DCO doesn't prevent bad behavior. It makes the contributor legally accountable for it. That's the same whether the tool is cp, some template generator, or an AI agent.

Take the use of AI out of it completely. What if I paid someone to write the code for me and then I opened the PR as my own submission. That would still be ok under the DCO and I would be just as liable for the IP assertion as when I write the code myself. The key bit is whether I have the right to submit it, not what tool was used.

@indutny
Member

indutny commented Mar 17, 2026

The key part of the DCO assertion is this (in bold): "The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file"

I agree.

In other words, the contributor is asserting they have the right to submit. If agent output turns out to include material that violates someone's copyright, the contributor made a false certification; just like if they cp -rf'd GPL code and signed the DCO anyway. The DCO doesn't prevent bad behavior. It makes the contributor legally accountable for it. That's the same whether the tool is cp, some template generator, or an AI agent.

You are right, but there is also the aspect where the certification was made, yet the contributed code is in clear violation (e.g. has copy-pasted copyright headers incompatible with the license). This would be (and always was) grounds for rejecting the contribution despite the certification.

Take the use of AI out of it completely. What if I paid someone to write the code for me and then I opened the PR as my own submission. That would still be ok under the DCO and I would be just as liable for the IP assertion as when I write the code myself. The key bit is whether I have the right to submit it, not what tool was used.

If the reviewer knows about this misrepresentation, then it is at least ethically wrong to accept the contribution.

@rginn

rginn commented Mar 17, 2026

I checked with legal and the foundation is fine with the DCO on AI-assisted contributions. We’ll work on getting this documented.

@syrusakbary

If the Node.js community would not like to merge the improvement because it is AI-driven, we will be very happy to have it on Edge.js @mcollina ❤️

I'm hoping we can all find alignment that writing code with AI assistance is becoming the norm now. If Node.js doesn't embrace it fully, it may be left behind (simply because it may lack good iteration speed vs other projects).

@jasnell
Member

jasnell commented Mar 17, 2026

@indutny:

You are right, but there is also an aspect of it where the certification was done, but the contributed code is in clear violation (e.g. has copy-pasted copyright headers incompatible with the license). This would be (and always were) grounds for rejecting the contribution despite the present certification.

Yes. Which goes back to what I said previously: if there's any specific part of this contribution that appears to have been inappropriately copied from code with an incompatible license, then that should be called out. Blocking as a general rule because it might-be-but-we-don't-know-for-sure is not valid.

@indutny
Member

indutny commented Mar 17, 2026

Yes. Which goes back to what I said previously: if there's any specific part of this contribution that appears to have been inappropriately copied from code with an incompatible license, then that should be called out.

This is what I'm doing here: calling out LLM-generated code as a re-phrasing of other software's code. It is known that LLMs produce verbatim copies of work they are trained on, so I cannot rule out that this has happened in this pull request.

@jasnell
Member

jasnell commented Mar 17, 2026

Code Provenance Analysis: PR #61478 (Virtual File System)

Background

PR #61478 adds a Virtual File System (~19k lines across 76 files) to Node.js. The author (@mcollina) disclosed it was "created by me with massive help from @claude." A reviewer raised concerns that LLM-generated code might contain verbatim or near-verbatim copies from copyrighted code with incompatible licenses.

Methodology

Every new file and every modification in the PR was read and analyzed for:

  1. License headers, attribution comments, or references to external projects
  2. Code style inconsistencies suggesting code pasted from different sources
  3. Structural and algorithmic matches against the most relevant existing VFS/mock-fs libraries: memfs, mock-fs, unionfs, fs-monkey, and BrowserFS (graceful-fs and Yarn PnP were considered but excluded — graceful-fs is an EMFILE-retry wrapper with no VFS functionality, and PnP uses a zip-manifest approach with no shared design surface)
  4. Variable/class/function names distinctive to any known library
  5. Imports of external packages
  6. Non-Node.js idioms indicating a foreign origin

Accounting for Linter Normalization

Node.js has one of the most aggressive linters in open source. It forces:

  • All built-in API calls through primordials (50+ builtins rewritten; e.g., Array.from() becomes ArrayFrom)
  • SafeMap/SafeSet instead of Map/Set
  • Array destructuring banned (const [a,b] = x becomes const {0:a, 1:b} = x)
  • All errors through internal/errors (no raw new Error())
  • ~60 globals banned (must use explicit require())
  • __proto__: null on property descriptors and async return values (and on all object literals within test_runner/)
  • Strict formatting (2-space indent, single quotes, semicolons, trailing commas, 120-char lines)

This means surface-level style conformity with Node.js conventions is NOT evidence of originality -- the linter would force any code into that style. The analysis therefore focuses on linter-proof structural indicators: algorithm structure, data model choices, method decomposition, architecture, constants, edge case handling, and integration points.

Findings

No evidence of copied code was found.


1. Algorithm Structure Differs from All Known Libraries

Path resolution (memory.js#lookupEntry) uses recursive descent with explicit depth tracking (depth + 1 on each recursive call). memfs uses iterative array rewriting (resets loop index i = 0 and rewrites the steps array in-place). These are fundamentally different control flow structures that no linter can transform between.

Write buffer expansion (file_handle.js) uses exact-fit allocation (Buffer.alloc(writePos + length)). memfs uses geometric doubling (capacity * 2). This is a different algorithm with different performance characteristics (O(n) per append vs amortized O(1)). A copy would preserve the allocation strategy.
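The contrast between the two allocation strategies can be sketched as follows. This is an illustrative reconstruction, not the PR's or memfs's actual code; the function names are invented for the example.

```javascript
'use strict';

// Exact-fit allocation (the strategy attributed to the PR): every append
// allocates a buffer sized to exactly the new total, so each append pays
// an O(n) copy.
function appendExactFit(buf, chunk) {
  const next = Buffer.alloc(buf.length + chunk.length);
  buf.copy(next, 0);
  chunk.copy(next, buf.length);
  return next;
}

// Geometric doubling (the memfs-style strategy): capacity grows by 2x when
// exhausted, so the copies amortize to O(1) per appended byte.
function makeGrowable() {
  let data = Buffer.alloc(16);
  let length = 0;
  return {
    append(chunk) {
      if (length + chunk.length > data.length) {
        let capacity = data.length;
        while (capacity < length + chunk.length) capacity *= 2;
        const next = Buffer.alloc(capacity);
        data.copy(next, 0, 0, length);
        data = next;
      }
      chunk.copy(data, length);
      length += chunk.length;
    },
    toBuffer() { return data.subarray(0, length); },
  };
}
```

A copy from memfs would be expected to preserve the doubling loop; switching to exact-fit changes both the code shape and the asymptotic behavior.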

Symlink loop detection: The PR implements proper ELOOP detection with kMaxSymlinkDepth = 40 (matching Linux kernel's MAXSYMLINKS). memfs either lacks loop detection entirely (newer codebase) or uses a different limit of 100 (older codebase). The choice of 40 specifically indicates alignment with the Linux kernel, not with any JS library.
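A minimal sketch of the recursive-descent-with-depth-counter shape described above (illustrative only; the entry shape `{ children, symlinkTarget }` and the name `lookup` are assumptions, not the PR's API):

```javascript
'use strict';

const kMaxSymlinkDepth = 40; // matches Linux kernel MAXSYMLINKS

function lookup(entry, parts, depth = 0) {
  if (depth > kMaxSymlinkDepth) {
    const err = new Error('ELOOP: too many symbolic links encountered');
    err.code = 'ELOOP';
    throw err;
  }
  // Following a symlink does not consume a path component, but costs one
  // unit of depth — so a cycle terminates with ELOOP instead of hanging.
  if (entry.symlinkTarget !== undefined) {
    return lookup(entry.symlinkTarget, parts, depth + 1);
  }
  if (parts.length === 0) return entry;
  const child = entry.children.get(parts[0]);
  if (child === undefined) return null;
  // Recursive descent: each component resolved by a recursive call that
  // carries the depth forward explicitly.
  return lookup(child, parts.slice(1), depth + 1);
}
```

The memfs-style alternative is a flat loop over a `steps` array that rewrites the array in place and resets the loop index on a symlink hit; no mechanical transformation turns one shape into the other.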

| Algorithm | Same as memfs? | Key Difference |
| --- | --- | --- |
| Path resolution | No | Recursive descent vs iterative restart |
| File read | Mostly similar | Same Buffer.copy core; memfs has more bounds checks, atime |
| File write | No | Exact-fit vs geometric doubling allocation |
| Directory listing | Similar | Same iterate-and-construct (only correct approach for Map) |
| Symlink resolution | No | PR has ELOOP (depth 40); memfs has none or uses 100 |
| mkdtemp suffix | Similar | Both use Math.random; PR has larger charset |

2. Data Model Is Structurally Different from All Known Libraries

| Feature | This PR | memfs | mock-fs | BrowserFS |
| --- | --- | --- | --- | --- |
| Core abstraction | Single MemoryEntry class | Two-class Link/Node (inode model) | Item hierarchy with inheritance | Inode + separate data |
| Symlink | target field on same class | Separate Link class | Separate SymbolicLink subclass | Separate inode type |
| Children | Map on entry (same object) | Map on Link (separate from Node, includes ./..) | Map on DirectoryItem subclass | Separate DirInode |
| Lazy population | populate callback | None | None | None |
| Dynamic content | contentProvider function | None | None | None |

The contentProvider and populate patterns are unique to this implementation. No known library has these concepts. They are designed specifically for the SEA (Single Executable Application) use case.
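To make the two patterns concrete, here is a hypothetical illustration of what lazy population and on-demand content could look like. The property names `populate` and `contentProvider` come from the analysis above, but the surrounding API shape is invented for this sketch and may differ from the PR:

```javascript
'use strict';

// A directory entry whose children are filled in only on first access,
// and whose file content is computed on demand (e.g. decompressing a SEA
// asset) rather than stored eagerly.
const entry = {
  children: null,
  populate(map) {
    map.set('greeting.txt', {
      contentProvider: () => Buffer.from('hello from a virtual file'),
    });
  },
};

function readdir(dir) {
  if (dir.children === null) {
    dir.children = new Map();
    dir.populate(dir.children); // lazy: runs at most once
  }
  return [...dir.children.keys()];
}

function readFile(dir, name) {
  readdir(dir); // ensure the directory has been populated
  return dir.children.get(name).contentProvider();
}
```

Neither hook has an analogue in memfs, mock-fs, or BrowserFS, which all store children and content eagerly.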

3. Architecture Has No Precedent

The handler-registration pattern (vfsState.handlers checked inline inside lib/fs.js) is fundamentally different from every known approach:

| Library | Interception Mechanism |
| --- | --- |
| This PR | Inline handler null-check inside each fs.* method body |
| memfs | Standalone volume object; doesn't touch fs at all |
| mock-fs | Replaces C++ fs binding layer |
| fs-monkey | Monkey-patches fs module exports from outside |
| unionfs | Chains multiple fs-like objects with try/catch ENOENT |

The mount-point routing + overlay mode + virtual cwd + module loader hooks + Symbol.dispose combination exists in no other library. No linter can create or mask an architectural pattern.
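The inline null-check pattern can be sketched like this (illustrative names; `vfsState` appears in the analysis above, but the rest of the shape is an assumption):

```javascript
'use strict';

// Registration point checked by every fs.* wrapper. null means "no VFS
// active", so the fast path is a single comparison.
const vfsState = { handlers: null };

// Stand-in for the real implementation that the wrapper falls through to.
function realReadFileSync(path) {
  return `real:${path}`;
}

// The pattern: each fs.* method body does an inline null-check and routes
// to the registered handler when one exists — no monkey-patching of module
// exports, no replacement of the C++ binding layer.
function readFileSync(path, options) {
  if (vfsState.handlers !== null) {
    const handler = vfsState.handlers.readFileSync;
    if (handler !== undefined) return handler(path, options);
  }
  return realReadFileSync(path, options);
}
```

Because the check lives inside fs itself, mounting and unmounting reduces to setting or clearing `vfsState.handlers`, which is structurally unlike the external patching used by fs-monkey or mock-fs.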

4. Specific Constants Match Node.js Core, Not Any Library

  • kMaxSymlinkDepth = 40 -- matches Linux kernel MAXSYMLINKS. memfs either has no limit or uses a different limit (100 in older versions) depending on codebase version. mock-fs has no limit.
  • 5007ms default watchFile interval -- matches Node.js's own fs.watchFile default (a legacy value inherited from libev).
  • Virtual FD range starting at 10_000 -- doesn't match any library.
  • Float64Array(18) stats layout -- matches the FsStatsOffset enum in src/node_file.h. No external library uses this internal format.

5. Deliberate Omissions Inconsistent with Copying

If code were copied from a mature library and adapted, you'd expect it to retain capabilities from the source. This implementation lacks:

  • Inode numbers (always 0) -- memfs, mock-fs, BrowserFS all track these
  • Permission enforcement -- memfs has basic permission checks
  • O_EXCL, O_TRUNC, O_NOFOLLOW flag support -- memfs handles these
  • atime updates on read -- memfs does this
  • Geometric buffer allocation -- memfs does this
  • . and .. directory entries -- memfs includes these

These omissions are consistent with a minimal implementation built for a specific purpose, not with code copied from a more complete library.

6. No Distinctive Naming from Any Known Library

Zero occurrences of: vol, volume, Link (as class), Node (as fs node), patchFs, patchRequire, binding (as VFS concept), getItem, use() (as mount), IFS, createFsFromVolume, or any other distinctive identifier from the compared libraries.

7. No License Headers, Attribution, URLs, or External References

Zero copyright notices, license headers (MIT, Apache, BSD, ISC, SPDX), @author/@license/@source tags, URLs, or TODO comments referencing external projects in any VFS file. Note: this is absence of a specific red flag rather than strong positive evidence of originality — AI-generated code would typically not reproduce license headers from training data regardless of whether the output was influenced by copyrighted source.

8. Integration Uses Node.js-Internal APIs That Don't Exist Externally

Several integration points use APIs that only exist inside Node.js core, making direct copy-paste from external libraries impossible for these sections (though an AI trained on Node.js core could generate code using these APIs independently):

  • getStatsFromBinding() with the 18-element Float64Array layout matching FsStatsOffset in src/node_file.h
  • serializePackageJSON() producing the exact 6-element tuple format consumed by deserializePackageJSON in package_json_reader.js (mimicking the format normally produced by the C++ readPackageJSON binding)
  • legacyMainResolve extension index mapping (0-6 for main+ext, 7-9 for index.ext)
  • internalBinding('modules') for module resolution hooks
  • UVException with UV_ENOENT, UV_EISDIR, etc. from internalBinding('uv')

9. Vestigial Code Indicates Iterative Development, Not Copy-Paste

The dead ternary at file_system.js:282-284 (both branches identical) is likely either an AI-generation artifact or a refactoring leftover — the git history shows extensive path-handling refactors (e.g., "replace custom path helpers with standard path module," "fix Windows path handling"). Either way, it is an artifact of iterative development on this specific codebase, not a vestige from an external library. The repetitive findVFSFor* functions in setup.js (15+ functions with the same boilerplate) are consistent with AI-assisted rapid generation. All share the same vfsState.handlers-based dispatch pattern that is specific to this PR's architecture and couldn't originate from an external library.

Conclusion

After thorough analysis accounting for linter normalization, no evidence was found that this PR contains code copied verbatim or near-verbatim from any other open-source project.

The algorithm structures, data models, architectural patterns, constant values, capability omissions, user-defined identifiers (class/method/variable names, which the linter does not alter), and internal API usage are all inconsistent with code derived from memfs, mock-fs, unionfs, fs-monkey, BrowserFS, or any other known VFS library. The evidence points to original code purpose-built for Node.js core integration, with AI assistance for rapid generation of repetitive boilerplate.

The reviewer's concern that "it cannot be confirmed" the code is clean is technically true of any contribution -- human-written or AI-assisted. But the specific, concrete, linter-proof structural evidence here shows that this code does not resemble any known VFS library from which it could have been copied.

Update: Corrected a few factual errors:

  1. 5007 is not prime — changed "a deliberately prime number" to "a legacy value inherited from libev" (which is what Node.js's own source comment says).
  2. FSReqCallback::StatBuffer doesn't exist — changed to the actual name: FsStatsOffset enum in src/node_file.h.
  3. deserializePackageJSON is JS, not C++ — corrected to say it's in package_json_reader.js and the C++ side is the readPackageJSON binding that normally produces the tuple format.

@indutny
Copy link
Member

indutny commented Mar 17, 2026

@jasnell you can't be serious... responding to concerns about LLM use for creating a large code change with several pages of LLM analysis of the PR?

@jasnell
Copy link
Member

jasnell commented Mar 17, 2026

Is anything about the analysis incorrect? I'm happy to make corrections in it as necessary.

@indutny
Copy link
Member

indutny commented Mar 17, 2026

I'll let others decide. I hope the comments provided here express my point of view succinctly, so I'll try to avoid thrashing this PR further :-)

P0 fixes:
- setup.js: use fileURLToPath() instead of path.pathname for file: URLs
- file_handle.js: enforce read/write permissions and exclusive flags
- memory.js: prevent auto-creation of parent dirs in openSync
- memory.js: validate destination before removing source in renameSync

P1 fixes:
- memory.js: convert numeric O_* flags to string equivalents
- dir.js: add callback support to read() and close()
- watcher.js: fix unwatchFile leak by clearing internal listeners set
- provider.js: pass options.flag through writeFile/appendFile
- memory.js: implement recursive readdir
- watcher.js: poll children for directory watch, rescan for new files
- streams.js: support fd and start options in read/write streams

P2 fixes:
- provider.js: check R_OK/W_OK/X_OK permission bits in access()
- provider.js: check COPYFILE_EXCL in copyFile()
- stats.js: add createZeroStats() without S_IFREG for watchFile
- errors.js: add createEACCES error factory

Labels

  • fs: Issues and PRs related to the fs subsystem / file system.
  • lib / src: Issues and PRs related to general changes in the lib or src directory.
  • module: Issues and PRs related to the module subsystem.
  • needs-benchmark-ci: PRs that need a benchmark CI run.
  • needs-ci: PRs that need a full CI run.
  • notable-change: PRs with changes that should be highlighted in changelogs.
  • semver-minor: PRs that contain new features and should be released in the next minor version.
  • test_runner: Issues and PRs related to the test runner subsystem.
  • tsc-agenda: Issues and PRs to discuss during the meetings of the TSC.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement VFS (Virtual File System) Hooks for Single Executable Applications