fix: preserve raw chunk content in chunk APIs #13485

Open
JosefAschauer wants to merge 2 commits into infiniflow:main from JosefAschauer:fix/preserve-raw-chunk-content

@JosefAschauer
Contributor

Summary

Preserve raw chunk content in chunk list APIs and return search snippets separately as highlights.

Root Cause

When chunk search used keywords, the API replaced the stored raw chunk body with the highlight/snippet value. That exposed tokenized or stemmed text to users instead of the original content.

Changes

  • always return raw content_with_weight / content
  • attach the search snippet as a separate highlight field
  • trim and normalize the highlight output
  • add targeted regression coverage for both web and SDK chunk list routes
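
The intended response shape can be sketched as follows. This is a minimal illustration, not the code from this PR: `build_chunk_payload` and its parameters are hypothetical names, and `remove_redundant_spaces` here is a simplified stand-in for the project's helper of the same name.

```python
import re


def remove_redundant_spaces(text: str) -> str:
    """Simplified stand-in (assumption): collapse whitespace runs to one space."""
    return re.sub(r"\s+", " ", text)


def build_chunk_payload(chunk_id: str, stored: dict, highlights: dict, question: str) -> dict:
    """Hypothetical helper: build one chunk entry for a list response.

    The raw stored text is always returned unchanged; the search snippet,
    if any, goes into a separate `highlight` key.
    """
    payload = {
        "chunk_id": chunk_id,
        # Raw chunk body, never overwritten by the snippet.
        "content_with_weight": stored["content_with_weight"],
    }
    # Attach the snippet only when a query was given and a hit exists.
    if question and chunk_id in highlights:
        payload["highlight"] = remove_redundant_spaces(highlights[chunk_id]).strip()
    return payload
```

With a keyword query the entry carries both fields; without one, `highlight` is simply absent and the raw text is unaffected either way.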

Validation

  • env PYTHONPATH=/home/jo/rf/ragflow .venv/bin/python -m pytest --noconftest -W ignore::UserWarning test/testcases/test_web_api/test_chunk_app/test_chunk_routes_unit.py::test_list_chunk_exception_branches_unit
  • env PYTHONPATH=/home/jo/rf/ragflow .venv/bin/python -m pytest --noconftest -W ignore::UserWarning test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py::TestDocRoutesUnit::test_list_chunks_branches
  • manually verified that chunk views show raw content while keyword hits appear in highlight

@dosubot (bot) added labels on Mar 9, 2026:

  • size:M: This PR changes 30-99 lines, ignoring generated files.
  • 🐖api: The modified files are located under directory 'api/apps/sdk'
  • 🐞 bug: Something isn't working, pull request that fix bug.
  • 🧪 test: Pull requests that update test cases.

@JosefAschauer
Contributor Author

Related PRs for this workstream:

I split these intentionally so the two generic fixes can be reviewed and merged independently of the Qdrant backend work.

Contributor

@gambletan gambletan left a comment

This is a well-structured change that separates raw content from highlighted content in the chunk APIs. A few specific observations:

  1. API behavior change — potential breaking change: Previously, content_with_weight (in chunk_app.py) and content (in doc.py SDK) would return the highlighted version when a search query was present. Now they always return the raw content, with highlights in a separate highlight field. This is semantically cleaner, but any existing API consumers that relied on the highlighted markup being in the content field will break silently — they'll get raw text instead of highlighted text without any error. Is there a versioning strategy or deprecation notice planned?

  2. The .strip() on highlight: Both endpoints apply .strip() after remove_redundant_spaces(). The original code didn't strip. This is a minor behavioral difference; I'm noting it only so the change is made knowingly.

  3. Conditional field presence: The highlight key is only added when question and id in sres.highlight. This means the field is absent (not null) when there's no search query. This is fine for dynamic languages like Python/JS, but consumers using strict schema validation (e.g., Pydantic models, TypeScript interfaces) will need to mark highlight as optional. Consider whether it should always be present (as null or "") for schema consistency.

  4. Test coverage is thorough: The new test in test_doc_sdk_routes_unit.py (lines 883-913) correctly verifies that content contains raw text while highlight contains the cleaned highlighted text. The chunk_app test also validates the same pattern. Good coverage of the new behavior.

  5. Test scaffolding additions (_StubLLMFactoriesService, _StubFileService, MinerUParser stub): These seem like necessary fixes for test imports that were broken independently of this PR. Might be cleaner as a separate commit, but not a blocker.

Overall the change is an improvement in API design. The main concern is the breaking change for existing consumers of the content/content_with_weight fields.
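
To illustrate the consumer-side impact of points 1 and 3, here is a rough sketch of how a client might adapt. It assumes the response shape described in this PR (raw text in `content`, optional snippet in `highlight`); the function name `display_text` is hypothetical and not part of any SDK.

```python
def display_text(chunk: dict) -> str:
    """Hypothetical client helper: pick the text to render for one chunk.

    Uses the search snippet when it is present and non-empty, otherwise
    falls back to the raw chunk text. `dict.get` treats an absent key and
    an explicit null (None) the same way, so the client works whether the
    server omits `highlight` or emits it as null.
    """
    snippet = chunk.get("highlight")
    return snippet if snippet else chunk["content"]
```

A client written this way is unaffected by whether the maintainers keep the field conditional or decide to always emit it.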

@JosefAschauer force-pushed the fix/preserve-raw-chunk-content branch from ea73904 to ecc1779 on March 11, 2026 07:58
@JosefAschauer
Contributor Author

Rebased onto current main and resolved the merge conflict, so the PR is clean again.

On the behavior change: agreed that this is visible to API consumers. I still think the previous behavior was the bug, because content / content_with_weight stopped being reliable raw chunk text whenever a search query was present. This change makes that contract explicit: raw chunk text stays in content / content_with_weight, and search snippet text is returned separately in highlight.

There is not a versioned surface for these endpoints today, so I did not add a separate deprecation layer in this PR. If maintainers want, I can follow up with a short API note documenting the change.

@JosefAschauer
Contributor Author

JosefAschauer commented Mar 14, 2026

Thanks for the detailed review. On the remaining points: the .strip() is intentional and only trims boundary whitespace from the derived highlight snippet; it does not affect the raw content / content_with_weight fields. On highlight presence, I left it absent intentionally to distinguish “no search/highlight context” from an empty snippet, but I can switch that to always emit highlight: null if maintainers would prefer schema-stable output. On the test scaffolding, agreed those are test-harness shims rather than part of the API change itself; I kept them in this PR because the route tests needed the imports stabilized, but I can split them if that would be preferable.

On the breaking change: I consider the previous behavior (returning tokenized/stemmed text as content) a bug rather than a feature contract — no consumer should rely on receiving mangled text. But happy to add a note in the changelog or release notes if the team tracks API changes that way.

