# crawling

scrapy/scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.

Python
54932
10 days ago

Elegant Scraper and Crawler Framework for Golang

Go
24047
22 days ago
apify/crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

TypeScript
17482
1 day ago

newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:

HTML
14497
1 month ago
Makefile
6970
4 months ago

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

Python
5532
2 days ago
hakluke/hakrawler

Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application

Go
4647
4 months ago

Apache Nutch is an extensible and scalable web crawler

Java
3004
22 days ago
D4Vinci/Scrapling

🕷️ An undetectable, powerful, flexible, high-performance Python library that makes Web Scraping simple and easy again!

Python
2914
1 day ago
ai-robots-txt/ai.robots.txt

A list of AI agents and robots to block.

Python
2403
15 hours ago

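蓝天采集器 is a free, open-source crawler system. Data collection only requires editing rules by point and click. It runs locally, on virtual hosts, or on cloud servers, and can collect almost any type of web page. It integrates seamlessly with a wide range of CMS site-building programs, publishes data in real time without logging in, and runs fully automatically with no manual intervention. Among web big-data collection software, it is a fully cross-platform, cloud-based crawler system.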

PHP
1990
25 days ago
NateScarlet/holiday-cn

📅🇨🇳 Statutory public holiday data for China, scraped automatically every day from State Council announcements

Python
1442
3 days ago

The complete web scraping toolkit for PHP.

PHP
1401
5 days ago