Skip to content

fix(knowledge): infer MIME type from file extension in create/upsert tools#3651

Merged
waleedlatif1 merged 6 commits intostagingfrom
waleedlatif1/kb-mime-type-fix
Mar 18, 2026
Merged

fix(knowledge): infer MIME type from file extension in create/upsert tools#3651
waleedlatif1 merged 6 commits intostagingfrom
waleedlatif1/kb-mime-type-fix

Conversation

@waleedlatif1
Copy link
Collaborator

@waleedlatif1 waleedlatif1 commented Mar 18, 2026

Summary

  • Both create_document and upsert_document tools previously forced .txt extension and text/plain MIME type regardless of the document name
  • Now infers the correct MIME type from the file extension (html, md, csv, json, yaml, xml) and only defaults to .txt when no extension is given
  • Adds shared inferDocumentFileInfo helper in tools/knowledge/types.ts used by both tools

Test plan

  • Create a document named report.html — verify filename stays report.html and MIME is text/html
  • Create a document named data.csv — verify filename stays data.csv and MIME is text/csv
  • Create a document named notes (no extension) — verify filename becomes notes.txt and MIME is text/plain
  • Upsert a document named config.json — verify correct MIME type propagation

…tools

Both create_document and upsert_document forced .txt extension and
text/plain MIME type regardless of the document name. Now the tools
infer the correct MIME type from the file extension (html, md, csv,
json, yaml, xml) and only default to .txt when no extension is given.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cursor
Copy link

cursor bot commented Mar 18, 2026

You have used all Bugbot PR reviews included in your free trial for your GitHub account on this workspace.

To continue using Bugbot reviews, enable Bugbot for your team in the Cursor dashboard.

@vercel
Copy link

vercel bot commented Mar 18, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Mar 18, 2026 11:53am

Request Review

…oads

Replace duplicate EXTENSION_MIME_MAP and getMimeTypeFromExtension with
the existing, more comprehensive version from lib/uploads/utils/file-utils.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 18, 2026

Greptile Summary

This PR fixes a bug where both create_document and upsert_document tools forced text/plain / .txt regardless of the document's actual name, making it impossible to correctly store HTML, CSV, JSON, YAML, XML, or Markdown documents. The fix introduces a shared inferDocumentFileInfo helper in tools/knowledge/types.ts that infers the MIME type from the file extension via the existing getMimeTypeFromExtension utility, guards against binary types with an explicit TEXT_COMPATIBLE_MIME_TYPES allowlist, and normalises unrecognised or missing extensions to .txt. The create_document tool also picks up several secondary improvements: the TextEncoder is now instantiated only once (previously called twice), the size guard now checks UTF-8 byte count instead of character count (correct for multibyte text), and the browser-side btoa path uses a loop instead of a spread operator (avoiding a potential stack-overflow for content close to the 1 MB limit).

Key changes:

  • types.ts: New inferDocumentFileInfo function with TEXT_COMPATIBLE_MIME_TYPES allowlist (text/plain, text/html, text/markdown, text/csv, application/json, application/xml, application/x-yaml); reuses getMimeTypeFromExtension from @/lib/uploads/utils/file-utils rather than duplicating the extension map.
  • create_document.ts: Switches to inferDocumentFileInfo; cleans up double TextEncoder call; fixes byte-vs-character size check; safer btoa loop.
  • upsert_document.ts: Switches to inferDocumentFileInfo; removes the inline .txt append.

Migration note: The upsert-by-filename path may silently create a duplicate document for any document that was previously stored under the old logic (e.g., data.csv would have been stored as data.csv.txt). With the new inferred filename data.csv, the upsert will not find the old record and will insert a new one instead of updating it. This is an inherent trade-off of correcting the previous bug, but users with pre-existing mis-named documents should be aware they may accumulate stale records.

Confidence Score: 4/5

  • Safe to merge — the logic is correct and all previous review concerns are addressed; the only caveat is a silent migration side-effect in upsert-by-filename for pre-existing documents.
  • The core inferDocumentFileInfo helper is well-tested against known edge cases (dotfiles, unrecognised extensions, binary MIME types), the allowlist is tight, and the secondary fixes to create_document.ts are straightforward improvements. Score is 4 rather than 5 because the upsert tool may silently create duplicate documents for records stored under the old incorrect naming scheme, which is a behavioral breaking change that users would need to be informed about.
  • apps/sim/tools/knowledge/upsert_document.ts — callers relying on upsert-by-filename for documents previously created with the old code should be aware of the migration impact.

Important Files Changed

Filename Overview
apps/sim/tools/knowledge/types.ts Adds inferDocumentFileInfo helper with a text-compatible MIME allowlist. Previous review concerns (dotfiles, binary MIME types, unrecognized extensions) are all addressed. The logic correctly handles .env.env.txt, report.v2report.txt, and maps recognized text extensions to their proper MIME types.
apps/sim/tools/knowledge/create_document.ts Integrates inferDocumentFileInfo for dynamic MIME/filename inference. Also cleans up a pre-existing double TextEncoder instantiation and replaces the spread-based btoa (stack overflow risk on large buffers) with a safer loop. The size check now correctly uses UTF-8 byte count instead of character count, which is more accurate for multibyte text.
apps/sim/tools/knowledge/upsert_document.ts Integrates inferDocumentFileInfo consistently with create_document.ts. Documents previously upserted by filename under the old code (where non-.txt extensions were renamed to .txt) will no longer be matched by the new inferred filename, silently creating a duplicate instead of updating — an inherent migration trade-off of the bug fix.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([documentName]) --> B[getFileExtension]
    B --> C{ext is non-empty?}
    C -- No --> G
    C -- Yes --> D[getUploadMimeType ext]
    D --> E{MIME in\nTEXT_COMPATIBLE_MIME_TYPES?}
    E -- Yes --> F[Return original filename\n+ recognized MIME type]
    E -- No --> G{ext non-empty AND\ndocumentName has a dot?}
    G -- Yes --> H[base = name up to last dot\neg. report.v2 → report]
    G -- No --> I[base = full documentName\neg. notes → notes]
    H --> J{base is empty?\neg. .env}
    I --> K[Return base + .txt\ntext/plain]
    J -- Yes --> L[Return documentName + .txt\neg. .env → .env.txt]
    J -- No --> K
Loading

Last reviewed commit: "fix(knowledge): hand..."

…ate_document

Same fixes as upsert_document: use loop-based String.fromCharCode
instead of spread, consolidate duplicate TextEncoder calls, and
check byte length instead of character length for 1MB limit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@waleedlatif1
Copy link
Collaborator Author

@greptile

…FileInfo

Use an explicit allowlist instead of only checking for octet-stream,
preventing binary MIME types (image/jpeg, audio/mpeg, etc.) from
leaking through when a user names a document with a binary extension.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@waleedlatif1
Copy link
Collaborator Author

@greptile

… extensions

- Remove application/pdf and application/rtf from TEXT_COMPATIBLE_MIME_TYPES
  since these tools pass plain text content, not binary
- Normalize unrecognized extensions (e.g. report.v2) to .txt instead of
  preserving the original extension with text/plain MIME type

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@waleedlatif1
Copy link
Collaborator Author

@greptile

Dotfiles like .env would produce an empty base, resulting in '.txt'.
Now falls back to the original name so .env becomes .env.txt.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@waleedlatif1
Copy link
Collaborator Author

@greptile

@waleedlatif1 waleedlatif1 merged commit ff5d90e into staging Mar 18, 2026
11 checks passed
@waleedlatif1 waleedlatif1 deleted the waleedlatif1/kb-mime-type-fix branch March 18, 2026 17:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant