scraping

The Web Data API for AI - Turn entire websites into LLM-ready markdown or structured data 🔥

AI crawler Markdown scraper html-to-markdown LLM scraping web-crawler ai-scraping webscraping web-scraping web-data web-data-extraction ai-agents data-extraction ai-crawler ai-search web-scraper web-search

TypeScript

61337

4962

4 hours ago

scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.

Python scraping crawling framework crawler Hacktoberfest web-scraping web-scraping-python

Python

58425

11075

1 day ago

feder-cr / Jobs_Applier_AI_Agent_AIHawk

AIHawk aims to ease the job hunt by automating the job application process. Using artificial intelligence, it enables users to apply for multiple jobs in a tailored way.

automation Bot ChatGPT gpt job jobsearch jobseeker opeai Python resume scraper scraping application-resume Selenium Chrome human-resources jobs agent AI

Python

28875

4382

4 months ago

gocolly / colly

Elegant Scraper and Crawler Framework for Golang

Go scraper framework crawler scraping crawling spider

24706

1833

4 days ago

ScrapeGraphAI / Scrapegraph-ai

Python scraper based on AI

scraping scraping-python automated-scraper LLM AI web-crawler web-scraping ai-scraping crawler html-to-markdown Markdown rag

Python

21409

1832

6 hours ago

apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

web-scraping web-crawling npm headless-chrome Puppeteer automation apify scraping crawling crawler headless scraper web-crawler JavaScript Node.js Playwright TypeScript

TypeScript

19685

1018

10 hours ago

soxoj / maigret

🕵️‍♂️ Collect a dossier on a person by username from thousands of sites

OSINT social-network identification socmint sherlock investigation namechecker Python Open Source Cybersecurity scraping osint-python redteam blueteam osint-framework CLI reconnaissance pentesting

Python

17673

1225

3 days ago

psf / requests-html

Pythonic HTML Parsing for Humans™

HTML scraping Python requests HTTP kennethreitz lxml pyquery css-selectors beautifulsoup

Python

13851

997

1 year ago

ultrafunkamsterdam / undetected-chromedriver

Custom Selenium Chromedriver | Zero-Config | Passes ALL bot mitigation systems (like Distil / Imperva / DataDome / CloudFlare IUAM)

chromedriver Selenium webdriver Chrome anti-detection anti-bot distil browser automation scraping Python captcha navigator Testing Cloudflare cloudflare-bypass bot-detection

Python

11816

1285

3 months ago

code4craft / webmagic

A scalable web crawler framework for Java.

crawler Java scraping framework

Java

11643

4164

1 month ago

D4Vinci / Scrapling

🕷️ An undetectable, powerful, flexible, high-performance Python library to make Web Scraping Easy and Effortless as it should be!

crawler crawling crawling-python Playwright Python scraping selectors stealth-game web-scraper web-scraping web-scraping-python webscraping xpath automation AI ai-scraping data data-extraction mcp mcp-server

Python

7418

417

3 days ago

lorien / awesome-web-scraping

List of libraries, tools and APIs for web scraping and data processing.

web-scraping captcha-recaptcha crawling crawling-python scraping scraping-framework scraping-python scraping-tool webscraping crawler spider

Makefile

7357

825

9 months ago

tabulapdf / tabula

Tabula is a tool for liberating data tables trapped inside PDF files

pdf CSV excel tables scraping

CSS

7205

677

7 months ago

alirezamika / autoscraper

A Smart, Automatic, Fast and Lightweight Web Scraper for Python

scraping scraper scrape webscraping crawler web-scraping AI Python webautomation automation machine-learning

Python

6984

711

4 months ago

apify / crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

apify automation beautifulsoup crawler crawling headless headless-chrome pip Playwright Python scraper scraping web-crawler web-crawling web-scraping Hacktoberfest

Python

6757

482

1 day ago