#scraping

scrapy/scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.

Python
54931
10 days ago

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.

TypeScript
36264
2 minutes ago

AIHawk aims to ease the job hunt by automating the job application process. Using artificial intelligence, it enables users to apply for multiple jobs in a tailored way.

Python
27959
1 month ago

Elegant Scraper and Crawler Framework for Golang

Go
24047
22 days ago
apify/crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

TypeScript
17481
1 day ago

A scalable web crawler framework for Java.

Java
11538
15 days ago

Custom Selenium Chromedriver | Zero-Config | Passes ALL bot mitigation systems (like Distil / Imperva / DataDome / Cloudflare IUAM)

Python
11015
10 months ago

Tabula is a tool for liberating data tables trapped inside PDF files

CSS
6998
1 month ago
Makefile
6970
4 months ago
alirezamika/autoscraper
Python
6723
6 months ago

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

Python
5532
2 days ago

Mechanize is a Ruby library that makes automated web interaction easy.

Ruby
4416
3 months ago

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Python
4145
1 month ago

Collection of useful data science topics along with articles, videos, and code

Jupyter Notebook
4087
12 hours ago

Up-to-date simple useragent faker with real world database

Python
3857
5 days ago