#robots-txt

Polite, slim and concurrent web crawler.

Go
2052
4 years ago

A simple and flexible web crawler that follows the robots.txt policies and crawl delays.

Go
791
4 years ago

🤖 The largest directory for AI-ready documentation and tools implementing the proposed llms.txt standard

TypeScript
574
4 days ago

Tame the robots crawling and indexing your Nuxt site.

TypeScript
494
2 days ago

The robots.txt exclusion protocol implementation for the Go language

Go
277
3 years ago

A simple but powerful web crawler library for .NET

C#
251
2 years ago

Determine if a page may be crawled from robots.txt, robots meta tags and robot headers

PHP
249
16 days ago

A set of reusable Java components that implement functionality common to any web crawler

Java
248
12 days ago

Ultimate Website Sitemap Parser

Python
225
25 days ago

Opt-Out tool to check Copyright reservations in a way that even machines can understand.

Python
194
2 years ago

Open-source, Python-based SEO web crawler

Python
181
2 years ago

NodeJS robots.txt parser with support for wildcard (*) matching.

JavaScript
160
1 year ago

Known tags and settings suggested to opt out of having your content used for AI training.

HTML
156
1 year ago

Makes it easy to add robots.txt, sitemap and web app manifest during build to your Astro app.

TypeScript
125
2 years ago

grobotstxt is a native Go port of Google's robots.txt parser and matcher library.

Go
113
4 years ago

Gatsby plugin that automatically creates robots.txt for your site

JavaScript
105
2 years ago

🤖 A curated list of websites that restrict access to AI Agents, AI crawlers and GPTs

Python
93
4 days ago

ScrapeGPT is a RAG-based Telegram bot designed to scrape and analyze websites, then answer questions based on the scraped content. The bot uses retrieval-augmented generation and web scraping to return natural-language answers to the user's queries.

Python
87
2 years ago