fix: preserve raw chunk content in chunk APIs by JosefAschauer · Pull Request #13485 · infiniflow/ragflow

JosefAschauer · 2026-03-09T15:03:37Z

Summary

Preserve raw chunk content in chunk list APIs and return search snippets separately as highlights.

Root Cause

When chunk search used keywords, the API replaced the stored raw chunk body with the highlight/snippet value. That exposed tokenized or stemmed text to users instead of the original content.

Changes

always return raw content_with_weight / content
attach the search snippet as a separate highlight field
trim and normalize the highlight output
add targeted regression coverage for both web and SDK chunk list routes
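
The changes listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: build_chunk_item, normalize_snippet, and the dictionary shapes are hypothetical names chosen for the example, normalize_snippet only stands in for whatever normalization the real routes apply, and the sample data is invented.

```python
import re


def normalize_snippet(snippet: str) -> str:
    """Hypothetical stand-in for the real normalization: collapse whitespace runs, then trim."""
    return re.sub(r"\s+", " ", snippet).strip()


def build_chunk_item(chunk_id: str, raw_content: str, question: str, highlights: dict) -> dict:
    """Assemble one chunk-list entry (assumed shape, for illustration only)."""
    # The raw stored body is always returned, whether or not a search ran.
    item = {"id": chunk_id, "content": raw_content}
    # The snippet goes in its own field, and only when a query produced one.
    if question and chunk_id in highlights:
        item["highlight"] = normalize_snippet(highlights[chunk_id])
    return item


# Invented sample data: one stored chunk and one search snippet for it.
hits = {"c1": "  the <em>cat</em>   sat  "}
with_query = build_chunk_item("c1", "The cats sat on the mat.", "cat", hits)
without_query = build_chunk_item("c1", "The cats sat on the mat.", "", hits)
```

In this sketch, with_query keeps the original sentence in content and carries the cleaned snippet in highlight, while without_query has no highlight key at all.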

Validation

env PYTHONPATH=/home/jo/rf/ragflow .venv/bin/python -m pytest --noconftest -W ignore::UserWarning test/testcases/test_web_api/test_chunk_app/test_chunk_routes_unit.py::test_list_chunk_exception_branches_unit
env PYTHONPATH=/home/jo/rf/ragflow .venv/bin/python -m pytest --noconftest -W ignore::UserWarning test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py::TestDocRoutesUnit::test_list_chunks_branches
manually verified that chunk views show raw content while keyword hits appear in highlight

JosefAschauer · 2026-03-09T15:05:26Z

Related PRs for this workstream:

fix: avoid empty doc filter in knowledge retrieval (#13484): fixes a separate generic chat retrieval issue where an empty selected-document list became an active empty doc_id filter.
feat: add qdrant doc store backend (#13486): adds the Qdrant document-store backend.

I split these intentionally so the two generic fixes can be reviewed and merged independently of the Qdrant backend work.

gambletan

This is a well-structured change that separates raw content from highlighted content in the chunk APIs. A few specific observations:

API behavior change — potential breaking change: Previously, content_with_weight (in chunk_app.py) and content (in doc.py SDK) would return the highlighted version when a search query was present. Now they always return the raw content, with highlights in a separate highlight field. This is semantically cleaner, but any existing API consumers that relied on the highlighted markup being in the content field will break silently — they'll get raw text instead of highlighted text without any error. Is there a versioning strategy or deprecation notice planned?
The .strip() on highlight: Both endpoints apply .strip() after remove_redundant_spaces(). The original code didn't strip. This is a minor behavioral difference — just noting it for consistency awareness.
Conditional field presence: The highlight key is only added when the condition (question and id in sres.highlight) is true. This means the field is absent (not null) when there's no search query or when the search produced no highlight for that chunk. This is fine for dynamic languages like Python/JS, but consumers using strict schema validation (e.g., Pydantic models, TypeScript interfaces) will need to mark highlight as optional. Consider whether it should always be present (as null or "") for schema consistency.
Test coverage is thorough: The new test in test_doc_sdk_routes_unit.py (lines 883-913) correctly verifies that content contains raw text while highlight contains the cleaned highlighted text. The chunk_app test also validates the same pattern. Good coverage of the new behavior.
Test scaffolding additions (_StubLLMFactoriesService, _StubFileService, MinerUParser stub): These seem like necessary fixes for test imports that were broken independently of this PR. Might be cleaner as a separate commit, but not a blocker.

Overall the change is an improvement in API design. The main concern is the breaking change for existing consumers of the content/content_with_weight fields.
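
For consumers affected by the points above (raw text now in content, highlight possibly absent), a tolerant client-side reader might look like the sketch below. The ChunkView dataclass and parse_chunk helper are hypothetical; only the content and highlight field names come from this PR, and the payloads are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChunkView:
    content: str                      # raw chunk text, always present
    highlight: Optional[str] = None   # search snippet; None when the key is absent

    def display_text(self) -> str:
        # Prefer the snippet for search results, fall back to the raw text.
        return self.highlight if self.highlight else self.content


def parse_chunk(payload: dict) -> ChunkView:
    # .get() tolerates both a missing key and an explicit null.
    return ChunkView(content=payload["content"], highlight=payload.get("highlight"))


plain = parse_chunk({"content": "raw text"})
searched = parse_chunk({"content": "raw text", "highlight": "<em>raw</em> text"})
```

A client that previously rendered the content field directly would switch to something like display_text() to keep showing highlighted markup in search views.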

JosefAschauer · 2026-03-11T07:58:49Z

Rebased onto current main and resolved the merge conflict, so the PR is clean again.

On the behavior change: agreed that this is visible to API consumers. I still think the previous behavior was the bug, because content / content_with_weight stopped being reliable raw chunk text whenever a search query was present. This change makes that contract explicit: raw chunk text stays in content / content_with_weight, and search snippet text is returned separately in highlight.

There is not a versioned surface for these endpoints today, so I did not add a separate deprecation layer in this PR. If maintainers want, I can follow up with a short API note documenting the change.

JosefAschauer · 2026-03-14T07:51:30Z

Thanks for the detailed review. On the remaining points: the .strip() is intentional and only trims boundary whitespace from the derived highlight snippet; it does not affect the raw content / content_with_weight fields. On highlight presence, I left it absent intentionally to distinguish “no search/highlight context” from an empty snippet, but I can switch that to always emit highlight: null if maintainers would prefer schema-stable output. On the test scaffolding, agreed those are test-harness shims rather than part of the API change itself; I kept them in this PR because the route tests needed the imports stabilized, but I can split them if that would be preferable.
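
The two highlight-presence options described here (absent key versus always emitting null) can be compared with a small sketch. The with_highlight helper and its schema_stable flag are hypothetical and exist only for this illustration; they are not part of the PR.

```python
import json
from typing import Optional


def with_highlight(item: dict, snippet: Optional[str], schema_stable: bool) -> dict:
    """Return a copy of item with the highlight field applied (hypothetical helper)."""
    out = dict(item)
    if snippet is not None:
        out["highlight"] = snippet.strip()   # trims only the derived snippet
    elif schema_stable:
        out["highlight"] = None              # serialized as JSON null
    return out


base = {"content": "  raw text  "}
absent = json.dumps(with_highlight(base, None, schema_stable=False))
stable = json.dumps(with_highlight(base, None, schema_stable=True))
empty = json.dumps(with_highlight(base, "  ", schema_stable=False))
```

With the key left absent, a client can tell "no search context" (no key) apart from an empty snippet (empty string). With schema_stable enabled, null takes over the "no search context" meaning. In both cases the raw content keeps its original whitespace.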

On the breaking change: I consider the previous behavior (returning tokenized/stemmed text as content) a bug rather than a feature contract — no consumer should rely on receiving mangled text. But happy to add a note in the changelog or release notes if the team tracks API changes that way.

dosubot bot added the size:M, 🐖api, 🐞 bug, and 🧪 test labels Mar 9, 2026

This was referenced Mar 9, 2026

fix: avoid empty doc filter in knowledge retrieval #13484

Merged

feat: add qdrant doc store backend #13486

Open

[Feature Request]: Support Qdrant as vector DB in RagFlow #6546

Open

gambletan reviewed Mar 10, 2026

fix: preserve raw chunk content in chunk APIs

ecc1779

JosefAschauer force-pushed the fix/preserve-raw-chunk-content branch from ea73904 to ecc1779 on March 11, 2026 07:58

test: trim route import shims

2a25f62

JosefAschauer wants to merge 2 commits into infiniflow:main from JosefAschauer:fix/preserve-raw-chunk-content
