Building an AI Bot Access Checker: When Robots.txt Lies
Build an AI bot access checker that reveals when crawlers are blocked despite robots.txt saying 'allow'.
Your robots.txt says "allow" but AI bots still can't access your site. Sound familiar?
I kept running into this exact problem during client SEO audits. Website after website had robots.txt files that welcomed AI crawlers with open arms, but the bots were getting blocked at the infrastructure level – CDNs, firewalls, rate limiters doing their job a little too well.
The disconnect was costing businesses real opportunities. While everyone's debating whether to allow or block AI crawlers, many sites are accidentally blocking them without even knowing it.
Building on Open Source
This story actually starts with Serge Bezborodov, CTO at JetOctopus.com, who nailed the problem perfectly in a recent post:
"AI bots crawling is the new hot topic in Technical SEO. There are two sides to the story: we don't want our content scraped and reused, but we want traffic from GPT and other AI chats."
Serge hit on something crucial. Back in early 2023, the trend was clear: block everything in robots.txt except Google and Bing. Fast forward to today, and the conversation has shifted completely. Traffic from AI chatbots may not be huge yet, but it's often highly convertible.
So Serge built a Python script to solve exactly this problem. A simple tool that checks if AI bots are actually allowed to crawl your site by making real requests with AI bot user agents and checking robots.txt, status codes, and meta tags.
The beautiful part? He open-sourced it with this invitation: "The script is stupidly simple. Feel free to tweak it, add more bots, and make it your own. Claude Code is a great help if you want to modify it. 😅"
Challenge accepted. I built a tool that tests what's actually happening versus what the policies say. Today, I'm walking you through exactly how I built the AI Bot Access Checker – and how you can build your own.
The Problem: Policy vs. Reality
Here's what I was seeing in client audits:
Robots.txt: "User-agent: GPTBot" → "Allow: /"
Reality: GPTBot gets a 403 Forbidden response
Result: Zero AI visibility despite "allowing" access
The culprits? CDN-level blocking rules, firewall configurations that treat bot traffic as suspicious, and geographic restrictions that AI companies' servers hit.
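You can see the mismatch for yourself with two raw requests before building anything. Here's a minimal sketch – example.com is a placeholder, and the GPTBot string is shortened for readability:

# Minimal illustration of the disconnect (example.com is a placeholder).
import requests

site = "https://example.com"
gptbot_ua = "Mozilla/5.0; compatible; GPTBot/1.1; +https://openai.com/gptbot"

# 1. What the policy says
print(requests.get(f"{site}/robots.txt", timeout=10).text)  # e.g. "User-agent: GPTBot" / "Allow: /"

# 2. What actually happens when GPTBot shows up
resp = requests.get(site, headers={"User-Agent": gptbot_ua}, timeout=10)
print(resp.status_code)  # a CDN or WAF rule can still return 403 here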
This isn't just a technical curiosity. AI-powered search is becoming real traffic. Perplexity, ChatGPT's browsing mode, Claude's web access – they're all driving actual visitors to websites. Miss the crawling phase, miss the traffic.
What We're Building: Real-World Bot Testing
The tool tests seven different AI bot user agents against any website:
OpenAI Ecosystem:
GPTBot: Web crawler for improving AI models
ChatGPT-User: Real-time browsing sessions
OAI-SearchBot: OpenAI's search product crawler
Anthropic Ecosystem:
ClaudeBot: General Anthropic crawler
Claude-User: Real-time Claude web access
Perplexity Ecosystem:
PerplexityBot: Perplexity's web crawler
Perplexity-User: Real-time answer generation
For each bot, we test actual HTTP access, robots.txt compliance, robots meta tag restrictions, and response times. No guessing – real data.
Note: Bot purposes are based on official documentation where available and observed behavior patterns. Companies rarely provide complete technical specifications for their crawlers.
The Architecture: Simple But Effective
Here's the high-level approach:
User Input (URL)
↓
URL Validation & Normalization
↓
Robots.txt Fetch & Parse
↓
Bot Testing Loop:
- For each AI bot user agent
- Make HTTP request with bot headers
- Check response status
- Parse page metadata
- Record timing
↓
Results Analysis & Display
The beauty is in the simplicity. No complex infrastructure needed – just systematic testing with the exact user agents that AI companies use.
The Core Implementation
Bot User Agent Dictionary
First, we define the exact user agents that AI companies use. This is critical – we're testing the real strings, not approximations:
AI_BOT_AGENTS = {
    "OpenAI": {
        "OAI-SearchBot": "OAI-SearchBot/1.0; +https://openai.com/searchbot",
        "ChatGPT-User": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot",
        "GPTBot": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot",
    },
    "Anthropic": {
        "ClaudeBot": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)",
        "Claude-User": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Claude-User/1.0; +Claude-User@anthropic.com)",
    },
    "Perplexity": {
        "PerplexityBot": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)",
        "Perplexity-User": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexity-user)",
    },
}
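As a quick sanity check, this small, illustrative snippet flattens the dictionary into the seven (company, bot, user agent) combinations the testing loop will iterate over:

# Flatten the nested dictionary into (company, bot_name, user_agent) tuples.
ALL_BOTS = [
    (company, bot_name, user_agent)
    for company, agents in AI_BOT_AGENTS.items()
    for bot_name, user_agent in agents.items()
]

for company, bot_name, _ in ALL_BOTS:
    print(f"{company}: {bot_name}")
# Expect seven entries across OpenAI, Anthropic, and Perplexity.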
Pro tip: These user agent strings change occasionally. Cloudflare maintains an up-to-date list (though they delisted ClaudeBot recently). Thanks to Jimmy Kropelin for sharing that resource.
Robots.txt Analysis
Before testing bot access, we fetch and parse the robots.txt file:
import requests
from urllib.parse import urlparse
from protego import Protego  # pip install protego

def fetch_robots_txt(url):
    try:
        parsed = urlparse(url)
        robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
        response = requests.get(robots_url, timeout=10)
        if response.status_code == 200:
            return Protego.parse(response.text)  # Protego handles robots.txt parsing
        return None
    except Exception:
        return None

def check_robots_permission(robots_parser, user_agent, url):
    if not robots_parser:
        return True  # No robots.txt means no restrictions
    try:
        return robots_parser.can_fetch(url, user_agent)
    except Exception:
        return False  # Error parsing means we assume blocked
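To see how the two helpers fit together before the full testing loop, here's a quick, illustrative check for a single bot (the URL is a placeholder):

# Illustrative only: example.com stands in for whatever site you're auditing.
robots_parser = fetch_robots_txt("https://example.com/")
gptbot_ua = AI_BOT_AGENTS["OpenAI"]["GPTBot"]

if check_robots_permission(robots_parser, gptbot_ua, "https://example.com/"):
    print("robots.txt allows GPTBot")
else:
    print("robots.txt blocks GPTBot")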
The Bot Testing Engine
Here's where it all happens – systematic testing of each bot's actual access:
import time

def analyze_url_for_ai_bots(url):
    results = {"url": url, "timestamp": time.time(), "bots": []}

    # Fetch robots.txt once for all bots
    robots_parser = fetch_robots_txt(url)

    for company, agents in AI_BOT_AGENTS.items():
        for bot_name, user_agent in agents.items():
            bot_result = {
                "company": company,
                "bot_name": bot_name,
                "user_agent": user_agent
            }
            try:
                # Check robots.txt permission
                robots_allowed = check_robots_permission(robots_parser, user_agent, url)

                # Make request with bot user agent
                headers = {"User-Agent": user_agent}
                start_time = time.time()
                response = requests.get(url, headers=headers, timeout=30)
                load_time = time.time() - start_time

                # Extract page metadata
                title, robots_meta, has_noindex = extract_page_metadata(response.text)

                # Determine overall access status
                is_fully_allowed = (
                    response.status_code == 200
                    and robots_allowed
                    and not has_noindex
                )

                bot_result.update({
                    "status": "allowed" if is_fully_allowed else "blocked",
                    "http_status": response.status_code,
                    "robots_txt": "allowed" if robots_allowed else "blocked",
                    "robots_meta": robots_meta,
                    "response_time": round(load_time, 2),
                    "error": None
                })
            except requests.exceptions.RequestException as e:
                # Handle connection errors - often indicates bot-specific blocking
                bot_result.update({
                    "status": "error",
                    "error": "Connection refused by server",
                    "http_status": None,
                    "response_time": None
                })

            results["bots"].append(bot_result)

    return results
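If you want to exercise the function from a Python shell, a loop over the results is enough – the URL below is just a placeholder:

# Illustrative run against a placeholder URL.
report = analyze_url_for_ai_bots("https://example.com/")
for bot in report["bots"]:
    print(
        f'{bot["company"]:>10} {bot["bot_name"]:<16} '
        f'{bot["status"]:<7} HTTP {bot.get("http_status")}'
    )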
Page Metadata Extraction
We also check for robots meta tags that might block indexing:
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def extract_page_metadata(html_content):
    try:
        soup = BeautifulSoup(html_content, "html.parser")

        # Extract page title
        title_tag = soup.find("title")
        title = title_tag.get_text().strip() if title_tag else "No title found"

        # Extract robots meta tag
        robots_tag = soup.find("meta", attrs={"name": "robots"})
        robots_content = robots_tag.get("content", "").strip() if robots_tag else ""

        if not robots_content:
            robots_meta = "No robots meta tag"
            has_noindex = False
        else:
            robots_meta = robots_content
            has_noindex = "noindex" in robots_content.lower()

        return title, robots_meta, has_noindex
    except Exception:
        return "Parse error", "Parse error", False
The Web Interface: Making It Accessible
The backend analysis is just half the story. To make this useful for SEO professionals and marketers, I built a clean web interface.
Flask Routes
@app.route("/ai-bot-analyzer")
def ai_bot_analyzer():
return render_template("ai_bot_analyzer.html")
@app.route("/api/analyze-ai-bots", methods=["POST"])
def analyze_ai_bots():
try:
data = request.json
url = data.get("url", "").strip()
# Add protocol if missing
if not url.startswith(("http://", "https://")):
url = "https://" + url
if not validate_url(url):
return jsonify({"success": False, "error": "Please provide a valid URL"}), 400
# Analyze the URL
results = analyze_url_for_ai_bots(url)
return jsonify({"success": True, "results": results})
except Exception as e:
return jsonify({"success": False, "error": "An error occurred during analysis"}), 500
Frontend JavaScript Logic
The frontend handles form submission and results display:
// Form submission with validation
form.addEventListener('submit', async function(e) {
    e.preventDefault();
    const url = urlInput.value.trim();

    // Validate URL format
    const validation = validateUrl(url);
    if (!validation.valid) {
        showError(validation.message);
        return;
    }

    // Show loading state
    showLoadingState();

    try {
        const response = await fetch('/api/analyze-ai-bots', {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify({ url: url })
        });
        const data = await response.json();

        if (data.success) {
            displayResults(data.results);
        } else {
            showError(data.error);
        }
    } catch (error) {
        showError('Failed to analyze URL. Please check your connection and try again.');
    } finally {
        hideLoadingState();
    }
});
// Results display with status badges
function displayResults(results) {
    results.bots.forEach(bot => {
        const row = document.createElement('tr');
        const statusCell = row.insertCell();  // first cell holds the status badge

        // Status badge logic
        if (bot.status === 'allowed') {
            statusCell.innerHTML = '<span class="badge bg-success">Allowed</span>';
        } else if (bot.status === 'blocked') {
            statusCell.innerHTML = '<span class="badge bg-danger">Blocked</span>';
            if (bot.blocking_reason) {
                statusCell.innerHTML += `<br><small class="text-muted">${bot.blocking_reason}</small>`;
            }
        } else if (bot.status === 'error') {
            statusCell.innerHTML = '<span class="badge bg-secondary">Error</span>';
        }

        // Additional cells for HTTP status, robots.txt, response time...
        // ...then append the row to the results table body.
    });
}
Real-World Results: What I Discovered
After building and testing this tool across dozens of websites, here's what I found:
The 403 Paradox: Most common issue – robots.txt says "allow" but bots get HTTP 403. This screams CDN or firewall blocking.
Geographic Blocking: Some sites block requests from certain regions where AI companies run their crawlers.
Rate Limiting: Aggressive rate limiting sometimes blocks AI bots entirely, even though they typically crawl at reasonable speeds.
Testing My Own Results
When I launched the tool, I tested one of my own sites: bedtimestories.pro. The results were encouraging – all seven AI bots showed "Allowed" status with clean 200 responses and sub-0.2s response times.
But here's the interesting part: when internationally recognized SEO expert Aleyda Solis shared the tool with her audience, I saw an immediate spike in traffic and usage – a clear sign that this was a much-needed tool in the SEO community.
My own testing data shows that roughly 30% of websites have at least one AI bot blocked at the infrastructure level, despite having permissive robots.txt files.
Implementation Tips for Your Own Tool
Error Handling Strategies
# Different types of connection errors need different handling
except requests.exceptions.RequestException as e:
    error_message = str(e)
    if "RemoteDisconnected" in error_message:
        error_message = "Connection refused by server"
    elif "timeout" in error_message.lower():
        error_message = "Request timeout"
    elif "connection" in error_message.lower():
        error_message = "Connection error"
    else:
        error_message = "Request failed"
Rate Limiting Considerations
@limiter.limit("10 per minute") # Prevent abuse while allowing legitimate testing
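The decorator above assumes a Flask-Limiter instance is wired up; a minimal setup looks roughly like this (the limit values here are illustrative, not the production configuration):

# Minimal Flask-Limiter wiring assumed by the decorator above.
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

limiter = Limiter(
    get_remote_address,          # rate-limit by client IP
    app=app,
    default_limits=["100 per hour"],  # illustrative default
)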
URL Validation
def validate_url(url):
    try:
        parsed = urlparse(url)
        return parsed.scheme in ("http", "https") and bool(parsed.netloc)
    except Exception:
        return False
The Business Impact
This isn't just a technical exercise. The tool has generated real consulting opportunities. When SEOs and marketing teams see their AI bot access status, they often realize they need help optimizing their infrastructure.
More importantly, it's become a conversation starter about AI-powered SEO strategy. The question isn't just "should we allow AI bots?" anymore – it's "are we actually allowing them, even when we think we are?"
Your Next Steps
Build Your Own Version: Start with the core bot testing logic. You can extend it with additional AI bots or specific testing scenarios for your clients.
Test Your Properties: Use this approach to audit your own websites and client sites. You might be surprised by what you find.
Monitor Changes: AI bot user agents evolve. Set up monitoring to track when new bots appear or existing ones change their strings – a minimal re-check sketch follows this list.
Educational Content: If you're building for clients, include educational content about why AI bot access matters for modern SEO strategy.
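For the monitoring idea above, a bare-bones recurring check can reuse analyze_url_for_ai_bots and diff against the previous run. This sketch assumes you persist results to a local JSON file (the filename is arbitrary):

# Sketch: compare today's bot access against the previous run.
# "baseline.json" is an arbitrary local file, not part of the original tool.
import json
import os

def check_for_changes(url, baseline_path="baseline.json"):
    current = {
        b["bot_name"]: b["status"] for b in analyze_url_for_ai_bots(url)["bots"]
    }
    previous = {}
    if os.path.exists(baseline_path):
        with open(baseline_path) as f:
            previous = json.load(f)

    for bot, status in current.items():
        if previous.get(bot) != status:
            print(f"{bot}: {previous.get(bot, 'new')} -> {status}")

    with open(baseline_path, "w") as f:
        json.dump(current, f)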
The Bigger Picture
We're still in the early stages of AI-powered search. The companies that get their AI bot access strategy right now – allowing the right bots while maintaining security – will have an advantage as this traffic becomes more significant.
Building tools like this isn't just about solving immediate technical problems. It's about understanding where the search landscape is heading and positioning yourself as the expert who can navigate these changes.
The infrastructure gap between robots.txt policy and real-world access is just one example of how AI is creating new SEO challenges. The marketers and developers who can identify and solve these problems will be the ones businesses turn to as AI reshapes digital marketing.
Want to test your own sites? Try the AI Bot Access Checker at maxbraglia.com/ai-bot-analyzer. And if you're implementing AI solutions for your business, let's talk about what's actually possible versus what sounds good in demos.