web-crawling
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
The All in One Framework to Build Undefeatable Scrapers
Official repository for "Craw4LLM: Efficient Web Crawling for LLM Pretraining"
A simple web scraper to extract Product Data and Pricing from Amazon
Library for Rapid (Web) Crawler and Scraper Development
Machine Learning Model for Sport Predictions (Football, Basketball, Baseball, Hockey, Soccer & Tennis)
A Twitter scraper that uses Selenium to scrape tweets from the home timeline, user profiles, hashtags, search queries, and advanced searches.
A simple but powerful web crawler library for .NET
Omnisci3nt: See What They’ve Tried to Hide. Extract deep intelligence from any domain. From subdomains to SSL certs, archived secrets to exposed ports, Omnisci3nt gives you the full picture in seconds.
⚡ Ayakashi.io - The next generation web scraping framework
A tool for scraping emails, social media accounts, and much more information from websites using Google Search Results.
Scrapy Training companion code
A web crawling framework written in Kotlin
💵 💰 🇧🇷 Information on official daily rates for inflation, Selic, savings (Poupança), the dollar, the PTAX dollar, the euro, and the PTAX euro, taken from the Banco Central do Brasil website
Set up a free and scalable Scrapyd cluster for distributed web crawling with just a few clicks.
Parser and database to index the terpene profile of different strains of Cannabis from online databases
A web crawling programming language
JAW: A Graph-based Security Analysis Framework for Client-side JavaScript
Simple robots.txt template. Keeps unwanted robots out (disallow) and whitelists (allow) legitimate user agents. Useful for all websites.
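As a rough illustration of the allow/disallow idea in the robots.txt entry above, the sketch below checks a small policy with Python's standard `urllib.robotparser`. The robots.txt content, the user-agent names, and the `example.com` URL are assumptions made up for this example; they are not taken from the listed repository.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: allow one named crawler, keep every other robot out.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content held in memory instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The whitelisted agent may fetch the page; an unlisted agent falls
# through to the catch-all "*" group and is refused.
allowed_listed = parser.can_fetch("Googlebot", "https://example.com/page")
allowed_unlisted = parser.can_fetch("SomeOtherBot", "https://example.com/page")
print(allowed_listed, allowed_unlisted)
```

Note that robots.txt is advisory: it only affects crawlers that choose to honor it, so a check like this belongs in the crawler itself.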