content-extraction

A fork of Dragnet that also extracts the author, headline, date, and keywords from context, with built-in metadata extraction, all in one package

HTML
275
1 year ago

Readability2 converts HTML to plain text.

TypeScript
109
6 years ago

Next.js template for seamless PDF parsing using pdf2json and FilePond. Ideal for developers seeking a ready-to-use solution for PDF content extraction in Next.js projects.

TypeScript
59
1 year ago

Pure Ruby implementation of the Boilerpipe content extraction algorithm, tuned for online articles

Ruby
43
4 years ago

Web content extraction using machine learning

HTML
33
4 years ago

DOM Based Content Extraction via Text Density

Rust
28
1 month ago

🔍 Model Context Protocol (MCP) tool for parsing websites using the Jina.ai Reader

JavaScript
25
15 days ago

Make PDF Files Accessible, Extract Data from PDF, Convert PDF to HTML, Fill-in PDF Form, Stamp PDF and more...

C++
20
1 month ago

Tool that extracts the text from web article URLs and provides word frequencies, entity recognition, automatic summaries, and more

Python
20
6 years ago

Benson turns a list of URLs into mp3s of the contents of each web page - take control over your reading backlog!

Python
14
6 months ago

This repository houses a Python application for extracting YouTube video transcripts and summarizing their content.

Python
13
2 years ago

Seize is a light Node or browser web-page content extractor inspired by arc90 readability and Safari Reader

HTML
12
8 years ago

This Python-based repository hosts a sophisticated service designed for scraping web articles and converting them into Markdown format. The core functionality of this service includes extracting the main content of articles, such as headlines, key paragraphs, and associated images, and then seamlessly transforming this content into well-structured…

Python
5
1 year ago

This repository is an implementation of 📄 DOM-based content extraction via text density. Tested on Korean web pages.

Go
5
4 days ago

A web application that scrapes web pages, extracts main content, and uses OpenLLaMA to convert the content into specified formats.

HTML
4
4 months ago