content-extraction

JavaScript
4221
12 hours ago

A fork of Dragnet that also extracts author, headline, date, and keywords from context, with built-in metadata extraction, all in one package

HTML
292
3 months ago

Readability2 converts HTML to plain text.

TypeScript
108
7 years ago

Next.js template for seamless PDF parsing using pdf2json and FilePond. Ideal for developers seeking a ready-to-use solution for PDF content extraction in Next.js projects.

TypeScript
63
2 years ago

Pure Ruby implementation of the Boilerpipe content extraction algorithm, tuned for online articles

Ruby
43
4 years ago

DOM Based Content Extraction via Text Density

Rust
35
3 months ago

Web content extraction using machine learning

HTML
34
4 years ago

🔍 Model Context Protocol (MCP) tool for parsing websites using the Jina.ai Reader

JavaScript
31
4 months ago

Make PDF Files Accessible, Extract Data from PDF, Convert PDF to HTML, Fill-in PDF Form, Stamp PDF and more...

C++
20
5 months ago

Tool that extracts the text from web article URLs and provides word frequencies, entity recognition, automatic summaries, and more

Python
20
7 years ago

Benson turns a list of URLs into mp3s of the contents of each web page - take control over your reading backlog!

Python
14
10 months ago

This repository houses a Python application for extracting YouTube video transcripts and summarizing their content.

Python
14
2 years ago

Seize is a lightweight Node or browser web-page content extractor inspired by arc90 readability and Safari Reader

HTML
12
8 years ago

The Ultimate Web Content Extraction & Conversion Tool for AI/LLM Applications. Convert almost any web content into clean Markdown with intelligent AI processing.

TypeScript
9
1 month ago

📸 Crawell: a free browser extension for one-click image and article extraction, Markdown conversion, and bulk download, with 100% local processing.

TypeScript
8
20 days ago

A powerful MCP server extension providing web search and content extraction capabilities. Integrates DuckDuckGo search functionality and URL content extraction into your MCP environment, enabling AI assistants to search the web and extract webpage content programmatically.

JavaScript
7
2 months ago