Repository navigation

robots-txt

Website
Wikipedia

PuerkitoBio / gocrawl

Polite, slim and concurrent web crawler.

爬虫 robots-txt

2052

194

4 年前

eliasdabbas / advertools

advertools - online marketing productivity and analysis tools

marketing advertising Python keywords twitter-api 搜索引擎优化 (SEO)social-media YouTube robots-txt scrapy Logging

Python

1274

234

12 天前

PuerkitoBio / fetchbot

A simple and flexible web crawler that follows the robots.txt policies and crawl delays.

爬虫 robots-txt

791

4 年前

thedaviddias / llms-txt-hub

🤖 The largest directory for AI-ready documentation and tools implementing the proposed llms.txt standard

directory 大语言模型 Next robots-txt Supabase cursor cursor-ai

TypeScript

574

333

4 天前

nuxt-modules / robots

Tame the robots crawling and indexing your Nuxt site.

Nuxt.js Vue.js nuxt-module robots-txt ssr

TypeScript

494

2 天前

temoto / robotstxt

The robots.txt exclusion protocol implementation for Go language

Go golang-library robots-txt Web production-ready go-library

277

3 年前

TurnerSoftware / InfinityCrawler

A simple but powerful web crawler library for .NET

爬虫 web-crawler web-crawling robots-txt spider

251

2 年前

spatie / robots-txt

Determine if a page may be crawled from robots.txt, robots meta tags and robot headers

PHP robots-txt 爬虫

PHP

249

16 天前

crawler-commons / crawler-commons

A set of reusable Java components that implement functionality common to any web crawler

web-crawler Java robots-txt Open Source Library

Java

248

12 天前

GateNLP / ultimate-sitemap-parser

Ultimate Website Sitemap Parser

Python sitemap sitemap-xml robots-txt xml-sitemap

Python

225

25 天前

alexjc / weboptout

Opt-Out tool to check Copyright reservations in a way that even machines can understand.

command-line-tool robots-txt webscraping terms-of-service DataOps copyright

Python

194

2 年前

beb7 / gflare-tk

Open-Source Python Based SEO Web Crawler

搜索引擎优化 (SEO)爬虫 scraper Python tkinter robots-txt

Python

181

2 年前

samclarke / robots-parser

NodeJS robots.txt parser with support for wildcard (*) matching.

user-agent JavaScript Node.js robots-txt

JavaScript

160

1 年前

healsdata / ai-training-opt-out

Known tags and settings suggested to opt out of having your content used for AI training.

人工智能 meta robots-txt

HTML

156

1 年前

alextim / astro-lib

Makes it easy to add robots.txt, sitemap and web app manifest during build to your Astro app.

Astro 搜索引擎优化 (SEO)robots-txt sitemap sitemap-xml

TypeScript

125

2 年前

seantomburke / sitemapper

Parse through any sitemap in Node.js

sitemap sitemap-xml Parsing JavaScript 爬虫 crawling indexing robots-txt 搜索引擎优化 (SEO)Web XML

TypeScript

124

2 个月前

jimsmart / grobotstxt

grobotstxt is a native Go port of Google's robots.txt parser and matcher library.

Go robots-txt

113

4 年前

mdreizin / gatsby-plugin-robots-txt

Gatsby plugin that automatically creates robots.txt for your site

gatsby gatsby-plugin robots-txt

JavaScript

105

2 年前

samber / the-great-gpt-firewall

🤖 A curated list of websites that restrict access to AI Agents, AI crawlers and GPTs

agent anthropic blocklist censorship 爬虫 genai generative-ai gpt gpt-4 大语言模型 openai robots-txt user-agent firewall

Python

4 天前

LexiestLeszek / scrapeGPT

ScrapeGPT is a RAG-based Telegram bot designed to scrape and analyze websites, then answer questions based on the scraped content. The bot utilizes Retrieval Augmented Generation and webscraping to return natural language answers to the user's queries.

爬虫 huggingface large-language-models 大语言模型 ollama proxy rag retrieval-augmented-generation robots-txt scraper Telegram website-scraper

Python

2 年前