Feat: support epub parsing #13650

Open
seroperson wants to merge 2 commits into infiniflow:main from seroperson:i1398-epub-filetype-support

@seroperson seroperson commented Mar 17, 2026

Closes #1398

What problem does this PR solve?

Adds native support for EPUB files. EPUB content is extracted in spine (reading) order and parsed using the existing HTML parser. No new dependencies required.

Type of change

  • New Feature (non-breaking change which adds functionality)

To check this parser manually:

uv run --python 3.12 python -c "
from deepdoc.parser import EpubParser

with open('$HOME/some_epub_book.epub', 'rb') as f:
  data = f.read()

sections = EpubParser()(None, binary=data, chunk_token_num=512)
print(f'Got {len(sections)} sections')
for i, s in enumerate(sections[:5]):
  print(f'\n--- Section {i} ---')
  print(s[:200])
"
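If no EPUB file is at hand for the manual check above, a small one can be generated. The following is a minimal sketch, not part of this PR: `build_minimal_epub` and the `OEBPS/` layout are hypothetical choices made for illustration, and the result is a bare-bones container (mimetype entry, `container.xml`, an OPF with manifest and spine, one XHTML file per chapter) that is not guaranteed to pass a strict EPUB validator.

```python
import html
import io
import zipfile

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

OPF_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="uid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="uid">sample-epub</dc:identifier>
    <dc:title>Sample</dc:title>
    <dc:language>en</dc:language>
  </metadata>
  <manifest>
{manifest}
  </manifest>
  <spine>
{spine}
  </spine>
</package>
"""

CHAPTER_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>{title}</title></head>
  <body><h1>{title}</h1><p>{body}</p></body>
</html>
"""


def build_minimal_epub(chapters):
    """Return EPUB bytes for a list of (title, body) pairs given in reading order.

    Hypothetical helper for local testing only.
    """
    manifest, spine = [], []
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # The mimetype entry comes first and is stored uncompressed.
        zf.writestr("mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML)
        for i, (title, body) in enumerate(chapters, start=1):
            item_id, href = f"ch{i}", f"ch{i}.xhtml"
            manifest.append(
                f'    <item id="{item_id}" href="{href}" media-type="application/xhtml+xml"/>'
            )
            spine.append(f'    <itemref idref="{item_id}"/>')
            page = CHAPTER_TEMPLATE.format(title=html.escape(title), body=html.escape(body))
            zf.writestr(f"OEBPS/{href}", page)
        opf = OPF_TEMPLATE.format(manifest="\n".join(manifest), spine="\n".join(spine))
        zf.writestr("OEBPS/content.opf", opf)
    return buf.getvalue()


epub_bytes = build_minimal_epub([("Chapter One", "First text."), ("Chapter Two", "Second text.")])
print(f"Built a sample EPUB of {len(epub_bytes)} bytes")
```

Writing `epub_bytes` to a file gives something to point the command above at; the bytes could presumably also be passed directly as the `binary` argument.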

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. 🌈 python Pull requests that update Python code 💞 feature Feature request, pull request that fulfills a new feature. 🧪 test Pull requests that update test cases. labels Mar 17, 2026
@yingfeng yingfeng requested a review from yongtenglei March 17, 2026 07:57
@yingfeng yingfeng added the ci Continue Integration label Mar 17, 2026
@yingfeng yingfeng marked this pull request as draft March 17, 2026 07:57
@yingfeng yingfeng marked this pull request as ready for review March 17, 2026 07:57
@yingfeng yingfeng changed the title feat: .epub parsing Feat: support epub parsing Mar 17, 2026
codecov bot commented Mar 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 46.75%. Comparing base (1db5409) to head (0bfbcbb).
⚠️ Report is 7 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #13650      +/-   ##
==========================================
- Coverage   46.93%   46.75%   -0.18%     
==========================================
  Files          45       44       -1     
  Lines        9891    10005     +114     
  Branches      112      112              
==========================================
+ Hits         4642     4678      +36     
- Misses       5231     5309      +78     
  Partials       18       18              

Copilot AI left a comment

Pull request overview

Adds EPUB ingestion support to the parsing pipeline by introducing a dedicated DeepDoc EPUB parser and wiring it into both the “naive” chunking path and the flow Parser component.

Changes:

  • Introduce RAGFlowEpubParser to extract XHTML content from EPUBs (spine-ordered, with a fallback scan) and chunk via the existing HTML parser.
  • Register EPUB as a supported type in the flow parser (rag/flow/parser/parser.py) and naive chunker (rag/app/naive.py), and expose EpubParser from deepdoc.parser.
  • Extend file type detection to recognize .epub, and add unit tests for EPUB parsing behavior.
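The spine-ordered extraction with a fallback scan described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code of `RAGFlowEpubParser`: the function names `xhtml_in_reading_order` and `_spine_paths` are hypothetical, and the fallback here simply sorts archive names, which may differ from what the PR does.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from urllib.parse import unquote

# XML namespaces used by the EPUB container file and the OPF package document.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


def _spine_paths(zf):
    """Resolve spine itemrefs to archive paths (hypothetical helper)."""
    container = ET.fromstring(zf.read("META-INF/container.xml"))
    rootfile = container.find("c:rootfiles/c:rootfile", NS)
    if rootfile is None:
        return []
    opf_path = rootfile.get("full-path", "")
    opf_dir = posixpath.dirname(opf_path)
    package = ET.fromstring(zf.read(opf_path))
    # The manifest maps item ids to hrefs relative to the OPF file.
    manifest = {
        item.get("id"): item.get("href")
        for item in package.findall("opf:manifest/opf:item", NS)
    }
    paths = []
    # The spine lists item ids in reading order.
    for itemref in package.findall("opf:spine/opf:itemref", NS):
        href = manifest.get(itemref.get("idref"))
        if href:
            paths.append(posixpath.normpath(posixpath.join(opf_dir, unquote(href))))
    return paths


def xhtml_in_reading_order(epub_bytes):
    """Return (path, content) pairs in spine order, or by name if no usable spine."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        names = set(zf.namelist())
        try:
            paths = _spine_paths(zf)
        except (KeyError, ET.ParseError):
            # Missing or malformed container.xml / OPF: fall through to the scan.
            paths = []
        paths = [p for p in paths if p in names]
        if not paths:
            # Fallback: take every (X)HTML file in the archive, sorted by name.
            paths = sorted(
                n for n in names if n.lower().endswith((".xhtml", ".html", ".htm"))
            )
        return [(p, zf.read(p)) for p in paths]
```

Each returned document could then be handed to an HTML parser for chunking. The sketch does not filter spine items by media type, so a spine that references non-XHTML resources would need an extra check.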

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| deepdoc/parser/epub_parser.py | New EPUB parser implementation (spine ordering + fallback). |
| deepdoc/parser/__init__.py | Exports EpubParser from the parser package. |
| rag/flow/parser/parser.py | Adds EPUB support to flow parser configuration and invocation map. |
| rag/app/naive.py | Adds .epub handling branch in chunk() and imports EpubParser. |
| api/utils/file_utils.py | Recognizes .epub in filename_type() extension matching. |
| test/unit_test/deepdoc/parser/test_epub_parser.py | Adds unit tests covering EPUB parsing, ordering, and fallbacks. |

@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Mar 17, 2026
@seroperson seroperson force-pushed the i1398-epub-filetype-support branch from 5429e70 to 0bfbcbb Compare March 17, 2026 10:08
Labels

ci Continue Integration 💞 feature Feature request, pull request that fulfills a new feature. 🌈 python Pull requests that update Python code size:XL This PR changes 500-999 lines, ignoring generated files. 🧪 test Pull requests that update test cases.

Development

Successfully merging this pull request may close these issues.

[Feature Request]: Native support of Epub filetype

4 participants