web-crawling

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

web-scraping web-crawling npm headless-chrome Puppeteer automation apify scraping crawling crawler headless scraper web-crawler JavaScript Node.js Playwright TypeScript

TypeScript

19685

1018

10 hours ago

apify / crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

apify automation beautifulsoup crawler crawling headless headless-chrome pip Playwright Python scraper scraping web-crawler web-crawling web-scraping Hacktoberfest

Python

6757

482

1 day ago

omkarcloud / botasaurus

Python

3089

258

5 hours ago

brightdata / brightdata-mcp

A powerful Model Context Protocol (MCP) server that provides an all-in-one solution for public web access.

llm mcp modelcontextprotocol scraping ai-agents browser-automation data-collection data-extraction mcp-server structured-data web-crawling web-data web-scraping

JavaScript

1389

187

10 days ago

cxcscmu / Craw4LLM

Official repository for "Craw4LLM: Efficient Web Crawling for LLM Pretraining"

crawler crawling large-language-models llm pre-training pretraining web-crawler web-crawling

Python

637

7 months ago

scrapehero-code / amazon-scraper

A simple web scraper to extract Product Data and Pricing from Amazon

web-scraping web-crawling

Python

405

160

2 years ago

crwlrsoft / crawler

Library for Rapid (Web) Crawler and Scraper Development

crawling PHP scraper scraping scraping-websites web-crawler web-crawling web-scraping Hacktoberfest crawler web-scraper

PHP

366

2 months ago

spyboy-productions / omnisci3nt

Omnisci3nt: See What They’ve Tried to Hide. Extract deep intelligence from any domain. From subdomains to SSL certs, archived secrets to exposed ports, Omnisci3nt gives you the full picture in seconds.

ip-lookup port-scanning ssl-certificate subdomain-enumeration web-crawling web-reconnaissance whois OSINT vulnerability-scanner

Python

301

2 months ago

godkingjay / selenium-twitter-scraper

This is a Twitter Scraper which uses Selenium for scraping tweets. It is capable of scraping tweets from home, user profile, hashtag, query or search, and advanced searches.

scraper X (Twitter) twitter-scraper web-crawling Hacktoberfest hacktoberfest-accepted collaborate Selenium

Jupyter Notebook

288

6 months ago