#crawling

scrapy/scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.

Python
58425
1 day ago

Elegant Scraper and Crawler Framework for Golang

Go
24706
4 days ago
apify/crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

TypeScript
19685
10 hours ago

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:

HTML
14800
2 months ago
D4Vinci/Scrapling

🕷️ An undetectable, powerful, flexible, high-performance Python library that makes web scraping easy and effortless, as it should be!

Python
7418
3 days ago

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

Python
6757
1 day ago
hakluke/hakrawler

Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application.

Go
4874
9 months ago
ai-robots-txt/ai.robots.txt

A list of AI agents and robots to block.

Python
3120
6 days ago

Apache Nutch is an extensible and scalable web crawler.

Java
3074
16 days ago

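蓝天采集器 is a free, open-source crawler system. Data can be collected just by editing rules through point-and-click selection. It runs locally, on virtual hosting, or on a cloud server, can collect almost any type of web page, integrates seamlessly with all kinds of CMS site-building software, and publishes data in real time without login, fully automatically and with no manual intervention. Among web big-data collection tools, it is a fully cross-platform, cloud-based crawler system.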

PHP
2035
1 month ago
NateScarlet/holiday-cn

📅🇨🇳 Chinese statutory public holiday data, scraped automatically every day from State Council announcements

Python
1622
5 days ago

The complete web scraping toolkit for PHP.

PHP
1425
12 days ago