fix: preserve raw chunk content in chunk APIs#13485
fix: preserve raw chunk content in chunk APIs#13485JosefAschauer wants to merge 2 commits intoinfiniflow:mainfrom
Conversation
|
Related PRs for this workstream:
I split these intentionally so the two generic fixes can be reviewed and merged independently of the Qdrant backend work. |
gambletan
left a comment
There was a problem hiding this comment.
This is a well-structured change that separates raw content from highlighted content in the chunk APIs. A few specific observations:
-
API behavior change — potential breaking change: Previously,
content_with_weight(inchunk_app.py) andcontent(indoc.pySDK) would return the highlighted version when a search query was present. Now they always return the raw content, with highlights in a separatehighlightfield. This is semantically cleaner, but any existing API consumers that relied on the highlighted markup being in thecontentfield will break silently — they'll get raw text instead of highlighted text without any error. Is there a versioning strategy or deprecation notice planned? -
The
.strip()on highlight: Both endpoints apply.strip()afterremove_redundant_spaces(). The original code didn't strip. This is a minor behavioral difference — just noting it for consistency awareness. -
Conditional field presence: The
highlightkey is only added whenquestion and id in sres.highlight. This means the field is absent (notnull) when there's no search query. This is fine for dynamic languages like Python/JS, but consumers using strict schema validation (e.g., Pydantic models, TypeScript interfaces) will need to markhighlightas optional. Consider whether it should always be present (asnullor"") for schema consistency. -
Test coverage is thorough: The new test in
test_doc_sdk_routes_unit.py(lines 883-913) correctly verifies thatcontentcontains raw text whilehighlightcontains the cleaned highlighted text. The chunk_app test also validates the same pattern. Good coverage of the new behavior. -
Test scaffolding additions (
_StubLLMFactoriesService,_StubFileService,MinerUParserstub): These seem like necessary fixes for test imports that were broken independently of this PR. Might be cleaner as a separate commit, but not a blocker.
Overall the change is an improvement in API design. The main concern is the breaking change for existing consumers of the content/content_with_weight fields.
ea73904 to
ecc1779
Compare
|
Rebased onto current On the behavior change: agreed that this is visible to API consumers. I still think the previous behavior was the bug, because There is not a versioned surface for these endpoints today, so I did not add a separate deprecation layer in this PR. If maintainers want, I can follow up with a short API note documenting the change. |
|
Thanks for the detailed review. On the remaining points: the On the breaking change: I consider the previous behavior (returning tokenized/stemmed text as content) a bug rather than a feature contract — no consumer should rely on receiving mangled text. But happy to add a note in the changelog or release notes if the team tracks API changes that way. |
Summary
Preserve raw chunk content in chunk list APIs and return search snippets separately as highlights.
Root Cause
When chunk search used keywords, the API replaced the stored raw chunk body with the highlight/snippet value. That exposed tokenized or stemmed text to users instead of the original content.
Changes
content_with_weight/contenthighlightfieldValidation
env PYTHONPATH=/home/jo/rf/ragflow .venv/bin/python -m pytest --noconftest -W ignore::UserWarning test/testcases/test_web_api/test_chunk_app/test_chunk_routes_unit.py::test_list_chunk_exception_branches_unitenv PYTHONPATH=/home/jo/rf/ragflow .venv/bin/python -m pytest --noconftest -W ignore::UserWarning test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py::TestDocRoutesUnit::test_list_chunks_brancheshighlight