web-crawling

apify/crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

TypeScript
17482
2 days ago

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

Python
5534
3 days ago

Official repository for "Craw4LLM: Efficient Web Crawling for LLM Pretraining"

Python
612
2 months ago

A simple web scraper to extract Product Data and Pricing from Amazon

Python
390
2 years ago

Machine Learning Model for Sport Predictions (Football, Basketball, Baseball, Hockey, Soccer & Tennis)

Jupyter Notebook
263
8 years ago

This is a Twitter Scraper which uses Selenium for scraping tweets. It is capable of scraping tweets from home, user profile, hashtag, query or search, and advanced searches.

Jupyter Notebook
252
8 days ago

A simple but powerful web crawler library for .NET

C#
251
1 year ago

Omnisci3nt: See What They've Tried to Hide. Extract deep intelligence from any domain. From subdomains to SSL certs, archived secrets to exposed ports, Omnisci3nt gives you the full picture in seconds.

Python
238
5 days ago

⚡ Ayakashi.io - The next generation web scraping framework

TypeScript
213
2 years ago

Scrapy Training companion code

Python
174
6 years ago

A web crawling framework written in Kotlin

Kotlin
128
4 years ago

💵 💰 🇧🇷 Information on official daily rates for inflation, Selic, savings (Poupança), Dollar, Dollar PTAX, Euro, and Euro PTAX from the Banco Central do Brasil website

Python
124
3 years ago

Set up a free and scalable Scrapyd cluster for distributed web crawling with just a few clicks.

Python
123
5 years ago

Parser and database to index the terpene profile of different strains of Cannabis from online databases

Python
118
2 years ago

JAW: A Graph-based Security Analysis Framework for Client-side JavaScript

JavaScript
105
4 months ago

Simple robots.txt template. Keeps unwanted robots out (disallow). Whitelists legitimate user-agents (allow). Useful for all websites.

86
2 months ago