
# crawling

scrapy/scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.

Python
57989
1 day ago

Elegant Scraper and Crawler Framework for Golang

Go
24551
2 months ago
apify/crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

TypeScript
18781
3 hours ago

newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:

HTML
14713
6 days ago
D4Vinci/Scrapling

🕷️ An undetectable, powerful, flexible, high-performance Python library to make Web Scraping Easy and Effortless as it should be!

Python
6453
3 days ago

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

Python
6181
8 hours ago
hakluke/hakrawler

Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application

Go
4838
8 months ago

Apache Nutch is an extensible and scalable web crawler

Java
3057
1 month ago
ai-robots-txt/ai.robots.txt

A list of AI agents and robots to block.

Python
2990
5 days ago

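蓝天采集器 is a free, open-source crawler system. You collect data simply by pointing and clicking to edit rules. It runs locally, on a virtual host, or on a cloud server, can collect almost any type of web page, integrates seamlessly with all kinds of CMS site builders, publishes data in real time without requiring login, and is fully automatic with no manual intervention. Among web big-data collection software, it is a fully cross-platform cloud crawler system.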

PHP
2022
2 months ago
NateScarlet/holiday-cn

📅🇨🇳 China statutory public holiday data, automatically scraped daily from State Council announcements

Python
1583
21 hours ago

The complete web scraping toolkit for PHP.

PHP
1420
1 day ago