Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Updated Sep 12, 2025 - Python
A multithreaded 🕸️ web crawler that recursively crawls a website and creates a 🔽 markdown file for each page, designed for LLM RAG
Export Atlassian Confluence pages as markdown files.
Multimodal document parser for high quality data understanding and extraction
URL to Markdown API is a service that converts web content into clean, structured Markdown format through a simple HTTP GET request. It's built using FastAPI and the MarkItDown library, offering a straightforward way to convert various content types (web pages, YouTube videos, PDFs, documents) into Markdown that's optimized for Large Language Mod
✅ Parse your browser's exported HTML bookmark file to Markdown.
Turn any of a supported list of file types (e.g. .docx) into a Markdown-structured text file. Also optionally defangs indicators and extracts text from images. Built for threat intel use cases.
Convert any document format into LLM-ready data format (markdown) with advanced intelligent document processing capabilities powered by pre-trained models.
Python script to convert Google Keep HTML note exports into Markdown (.md) files suitable for importing into Joplin.
Tooling for extracting content from the old Geotribu website (web scraping, conversion to Markdown...)
A CLI tool to fetch a webpage's main content and print it as Markdown
A simplified online encyclopedia with Markdown-formatted entries. Powered by Django.
Website scraper that extracts text, converts it to Markdown (.md) files, and organizes the output into a directory structure
A powerful CLI tool that mirrors entire websites or local directories and converts them into a single, clean Markdown file. Perfect for generating LLM context (RAG), offline documentation reading, and web archiving.
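indexdoc-converter is a document conversion library written in Python. Its core function is to efficiently convert mainstream office documents and web files to Markdown. Supported formats by file type: Word documents: .docx; Excel-style spreadsheets: .xlsx, .xls, .ods, .csv, .tsv; web files: .html, .mhtml, .htm, and web page URLs; PowerPoint presentations: .pptx. The library is published on PyPI (Python Package Index) and can be installed quickly with the pip package manager.
🔖 Medium saved articles to Markdown with Freedium paywall bypass. Perfect for creating datasets, developing RAG projects, and archiving reading lists.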
Convert Telegram messages.html exports into individual Markdown files.
Extract text from images using a robust OCR model designed for accuracy and efficiency in varied visual contexts.