Feat: support epub parsing by seroperson · Pull Request #13650 · infiniflow/ragflow

seroperson · 2026-03-17T07:47:18Z

What problem does this PR solve?

Adds native support for EPUB files. EPUB content is extracted in spine (reading) order and parsed using the existing HTML parser. No new dependencies required.
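For readers unfamiliar with the format, spine-order extraction can be sketched with the standard library alone, which is consistent with the "no new dependencies" note. This is a minimal illustration, not the code from this PR: the function name `spine_documents` is hypothetical, and it assumes a well-formed archive whose hrefs are not percent-encoded.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the OCF container file and the OPF package document.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


def spine_documents(epub_bytes: bytes) -> list[tuple[str, bytes]]:
    """Return (archive path, raw bytes) for each content document in spine order.

    Hypothetical helper for illustration; not the PR's implementation.
    """
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        # 1. META-INF/container.xml points at the OPF package document.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", NS)
        opf_path = rootfile.attrib["full-path"]
        opf_dir = posixpath.dirname(opf_path)

        # 2. The manifest maps item ids to hrefs relative to the OPF file.
        package = ET.fromstring(zf.read(opf_path))
        manifest = {
            item.attrib["id"]: item.attrib["href"]
            for item in package.findall("opf:manifest/opf:item", NS)
        }

        # 3. The spine lists item ids in reading order.
        docs = []
        for itemref in package.findall("opf:spine/opf:itemref", NS):
            href = manifest.get(itemref.attrib["idref"])
            if href is None:
                continue  # spine entry without a manifest item: skip it
            path = posixpath.normpath(posixpath.join(opf_dir, href))
            docs.append((path, zf.read(path)))
        return docs
```

Each returned document could then be handed to an HTML parser in order, which is the approach the description above outlines.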

Type of change

New Feature (non-breaking change which adds functionality)

To check this parser manually:

uv run --python 3.12 python -c "
from deepdoc.parser import EpubParser

with open('$HOME/some_epub_book.epub', 'rb') as f:
  data = f.read()

sections = EpubParser()(None, binary=data, chunk_token_num=512)
print(f'Got {len(sections)} sections')
for i, s in enumerate(sections[:5]):
  print(f'\n--- Section {i} ---')
  print(s[:200])
"

Closes infiniflow#1398

codecov · 2026-03-17T08:24:02Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 46.75%. Comparing base (1db5409) to head (0bfbcbb).
⚠️ Report is 7 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #13650      +/-   ##
==========================================
- Coverage   46.93%   46.75%   -0.18%     
==========================================
  Files          45       44       -1     
  Lines        9891    10005     +114     
  Branches      112      112              
==========================================
+ Hits         4642     4678      +36     
- Misses       5231     5309      +78     
  Partials       18       18


Copilot

Pull request overview

Adds EPUB ingestion support to the parsing pipeline by introducing a dedicated DeepDoc EPUB parser and wiring it into both the “naive” chunking path and the flow Parser component.

Changes:

Introduce RAGFlowEpubParser to extract XHTML content from EPUBs (spine-ordered, with a fallback scan) and chunk via the existing HTML parser.
Register EPUB as a supported type in the flow parser (rag/flow/parser/parser.py) and naive chunker (rag/app/naive.py), and expose EpubParser from deepdoc.parser.
Extend file type detection to recognize .epub, and add unit tests for EPUB parsing behavior.
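The "fallback scan" mentioned in the first bullet is not detailed in this summary. One plausible reading, sketched below under that assumption, is that when the package metadata is missing or unusable the parser lists the archive and picks up HTML-like entries directly. The suffix set, the sort-by-path ordering, and the name `fallback_documents` are all assumptions made for illustration.

```python
import io
import zipfile

# Assumed set of content-document extensions; the PR may use a different list.
CONTENT_SUFFIXES = (".xhtml", ".html", ".htm")


def fallback_documents(epub_bytes: bytes) -> list[str]:
    """List HTML-like archive entries, sorted by path (hypothetical fallback)."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        return sorted(
            name
            for name in zf.namelist()
            if name.lower().endswith(CONTENT_SUFFIXES)
        )
```

Sorting by path is only an approximation of reading order, which is presumably why spine order is preferred whenever the OPF file can be read.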

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.


File	Description
`deepdoc/parser/epub_parser.py`	New EPUB parser implementation (spine ordering + fallback).
`deepdoc/parser/__init__.py`	Exports `EpubParser` from the parser package.
`rag/flow/parser/parser.py`	Adds EPUB support to flow parser configuration and invocation map.
`rag/app/naive.py`	Adds `.epub` handling branch in `chunk()` and imports `EpubParser`.
`api/utils/file_utils.py`	Recognizes `.epub` in `filename_type()` extension matching.
`test/unit_test/deepdoc/parser/test_epub_parser.py`	Adds unit tests covering EPUB parsing, ordering, and fallbacks.
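Tests of this kind usually need small EPUB fixtures. Below is a sketch of how one might be built in memory with `zipfile`, so no binary files have to be checked in. The helper name `make_epub` and its signature are hypothetical and are not taken from `test_epub_parser.py`.

```python
import io
import zipfile

CONTAINER_XML = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""


def make_epub(chapters: dict[str, str], spine: list[str]) -> bytes:
    """Build a minimal in-memory EPUB (hypothetical test helper).

    chapters maps an item id to its body text; spine lists ids in reading order.
    """
    items = "".join(
        f'<item id="{name}" href="{name}.xhtml" media-type="application/xhtml+xml"/>'
        for name in chapters
    )
    refs = "".join(f'<itemref idref="{name}"/>' for name in spine)
    opf = (
        '<package xmlns="http://www.idpf.org/2007/opf" version="3.0">'
        f"<manifest>{items}</manifest><spine>{refs}</spine></package>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # The OCF spec expects "mimetype" as the first entry, stored uncompressed.
        zf.writestr("mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML)
        zf.writestr("content.opf", opf)
        for name, body in chapters.items():
            zf.writestr(f"{name}.xhtml", f"<html><body><p>{body}</p></body></html>")
    return buf.getvalue()
```

Passing a spine that differs from the manifest order, for example `make_epub({"ch1": "one", "ch2": "two"}, ["ch2", "ch1"])`, gives a fixture for checking ordering. A fallback case could be simulated by writing a similar archive without `content.opf`.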



feat: .epub parsing

0ee6efa

Closes infiniflow#1398

dosubot bot added the size:L (100-499 changed lines), 🌈 python, 💞 feature, and 🧪 test labels Mar 17, 2026

yingfeng requested a review from yongtenglei March 17, 2026 07:57

yingfeng added the ci label Mar 17, 2026

yingfeng marked this pull request as draft March 17, 2026 07:57

yingfeng marked this pull request as ready for review March 17, 2026 07:57

yingfeng changed the title ~~feat: .epub parsing~~ Feat: support epub parsing Mar 17, 2026

yingfeng requested a review from Copilot March 17, 2026 08:53

Copilot started reviewing on behalf of yingfeng March 17, 2026 08:54

Copilot AI reviewed Mar 17, 2026


dosubot bot added the size:XL (500-999 changed lines) label and removed the size:L label Mar 17, 2026

Applying Mr.Copilot edits

0bfbcbb

seroperson force-pushed the i1398-epub-filetype-support branch from 5429e70 to 0bfbcbb Compare March 17, 2026 10:08

yongtenglei approved these changes Mar 17, 2026

seroperson wants to merge 2 commits into infiniflow:main from seroperson:i1398-epub-filetype-support