webscraper

Self-hosted webscraper.

Open Source self-hosted webscraper Docker helm Kubernetes Playwright Python scraping web-scraper web-scrapers web-scraping webscraping

TypeScript

4230

192

1 month ago

anaskhan96 / soup

Web Scraper in Go, similar to BeautifulSoup

Go webscraper webscraping beautifulsoup web-scraper html-node

2213

167

2 years ago

any4ai / AnyCrawl

AnyCrawl 🚀: A Node.js/TypeScript crawler that turns websites into LLM-ready data and extracts structured SERP results from Google/Bing/Baidu/etc. Native multi-threading for bulk processing.

aitools crawl scrape webscraper ai-scraping data html-to-markdown rag scraping

TypeScript

1956

177

19 hours ago

benibela / xidel

Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.

xquery XML HTML JSON xpath CLI HTTP Web REST API css-selector wget cURL httpie webscraper webscraping scraper datascraping data-processing

Pascal

812

6 months ago

scrapfly / scrapfly-scrapers

Scalable Python web scraping scripts for 40+ popular domains

crawling Python crawler scraping web-scraping web-scraping-python antibot automation crawling-python datascraping proxies python-scraper scraper scraping-python spider twitter-scraper web-crawler webscraper webscraping

Python

606

142

1 day ago

rootVIII / proxy_requests

A class that uses scraped proxies to make HTTP GET/POST requests (Python requests)

Python requests-module requests proxy proxy-server proxy-list webscraping webscraper recursion HTTP http-proxy python-requests

Python

391

5 years ago

salimk / Rcrawler

An R web crawler and scraper

R crawler scraper webcrawler webscraping webscraper webscrapping crawlers

355

3 years ago

onepointAI / onepoint

An AI assistant tool that integrates coding, writing, and reading functions. For better alternatives see https://monica.im/desktop

AI Electron ChatGPT all-in-one macOS toolkit React webscraper Code reading gpt-35-turbo

TypeScript

313

2 years ago

toby-p / rightmove_webscraper.py

Python class to scrape data from rightmove.co.uk and return listings in a pandas DataFrame object

webscraper pandas pandas-dataframe CSV Python data-science data-analysis data-mining

Python

272

117

2 years ago

intergalacticalvariable / reader

📚 This is an adapted version of Jina AI's Reader for local deployment using Docker. Convert any URL to an LLM-friendly input with a simple prefix http://127.0.0.1:3000/https://website-to-scrape.com/

Docker LLM proxy rag scraper self-hosted webscraper webscraping website-screenshot website-screenshot-capturer

TypeScript

242

1 month ago

serpapi / lego-ai-parser

Lego AI Parser is an open-source application that uses OpenAI to parse visible text of HTML elements.

AI classification datascience gpt-3 HTML machine-learning openai Parser Parsing parser-library Python scraper tool Web app webscraper webscraping

Python

236

1 year ago

TBosak / mkfd

RSS feed builder created with Bun 🥖 and Hono 🔥. Builds feeds from webpages, email folders, and REST API calls.

Bun feed Hono RSS TypeScript contributors-welcome help-wanted rss-generator scraper self-hosted webscraper Docker Dockerfile dockerhub

TypeScript

183

7 days ago

mehmetozkaya / DotnetCrawler

DotnetCrawler is a straightforward, lightweight web crawling/scraping library with Entity Framework Core output, based on .NET Core. The library is designed like other strong crawler libraries such as WebMagic and Scrapy, but is extendable to fit your custom requirements. Medium link: https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c

.NET crawler crawling scraping scrapy entity-framework-core ddd-architecture C# webcrawler webscraping webscraper htmlagilitypack

178

3 years ago