web-crawler

The Web Data API for AI - Turn entire websites into LLM-ready markdown or structured data 🔥

AI crawler Markdown scraper html-to-markdown LLM scraping web-crawler ai-scraping webscraping web-scraping web-data web-data-extraction ai-agents data-extraction ai-crawler ai-search web-scraper web-search

TypeScript

61337

4962

4 hours ago

ScrapeGraphAI / Scrapegraph-ai

Python scraper based on AI

scraping scraping-python automated-scraper LLM AI web-crawler web-scraping ai-scraping crawler html-to-markdown Markdown rag

Python

21409

1832

6 hours ago

apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

web-scraping web-crawling npm headless-chrome Puppeteer automation apify scraping crawling crawler headless scraper web-crawler JavaScript Node.js Playwright TypeScript

TypeScript

19685

1018

10 hours ago

crawlab-team / crawlab

Distributed web crawler admin platform for managing spiders, regardless of language or framework.

webcrawler scrapy crawlab spiders-management Go scrapyd-ui spider crawler webspider web-crawler Docker platform crawling-tasks

11990

1871

6 days ago

ssssssss-team / spider-flow

A next-generation crawler platform that defines crawl workflows graphically, so crawlers can be built without writing code.

spider crawler jsoup xpath web-spider webspider webcrawler web-crawler spider-flow

Java

10969

2126

2 years ago

BruceDone / awesome-crawler

A collection of awesome web crawlers and spiders in different languages

web-crawler crawler web-scraper spider scraper Awesome Lists

6958

732

1 year ago

apify / crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

apify automation beautifulsoup crawler crawling headless headless-chrome pip Playwright Python scraper scraping web-crawler web-crawling web-scraping Hacktoberfest

Python

6757

482

1 day ago

adithya-s-k / omniparse

Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks

OCR omniparser parse-server parser-library vision-transformer web-crawler

Python

6698

526

4 months ago

firecrawl / firecrawl-mcp-server

🔥 Official Firecrawl MCP Server - Adds powerful web scraping and search to Cursor, Claude and any other LLM clients.

batch-processing claude content-extraction data-collection firecrawl firecrawl-ai llm-tools mcp-server model-context-protocol search-api web-crawler web-scraping javascript-rendering mcp

JavaScript

4645

492

1 day ago

apache / nutch

Apache Nutch is an extensible and scalable web crawler

Java nutch web-crawler crawling hadoop apache

Java

3074

1261

16 days ago

sjdirect / abot

Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.

C# crawler web-crawler Parsing spider spiders pluggable Unit testing netcore netcore2 netcore3 netstandard20 cross-platform

2289

560

1 year ago

jasonxtn / Argus

The Ultimate Information Gathering Toolkit

dns-lookup information-gathering OSINT recon-tools reconnaissance virustotal web-crawler whois-lookup

Python

2274

249

1 year ago

xianhu / PSpider

A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560

crawler spider Python proxies web-spider multi-threading web-crawler python-spider multiprocessing

Python

1838

501

3 years ago

MarginaliaSearch / MarginaliaSearch

Internet search engine for text-oriented websites. Indexing the small, old and weird web.

search-engine no-cloud small-web internet-search indexer language-processing web-crawler alt-search self-hosted Java

HTML

1497

12 hours ago

gildas-lormeau / single-file-cli

CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)

CLI Node.js single-file web-archiving web-scraper web-scraping archiving scraping-websites crawler web-crawler Deno Dockerfile

JavaScript

992

4 months ago

Algebra-FUN / WeReadScan

A crawler that scans purchased books on WeRead (微信读书) and downloads them as local PDFs

Selenium weread web-crawler book-downloader

Python

975

169

2 years ago

apache / stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm

web-crawler distributed Java crawler

Java

933

265

6 days ago

webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container

crawler crawling warc web-archiving web-crawler

TypeScript

882

115

1 day ago

postmodern / spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.

spider Ruby crawler Web scraper web-scraping web-spider web-crawler web-scraper

Ruby

825

109

3 months ago

scrapfly / scrapfly-scrapers

Scalable Python web scraping scripts for 40+ popular domains

crawling Python crawler scraping web-scraping web-scraping-python antibot automation crawling-python datascraping proxies python-scraper scraper scraping-python spider twitter-scraper web-crawler webscraper webscraping

Python

665

154

1 day ago