# 📊 TLD Price Tracking System - Implementation Plan

## Overview

This document describes the plan for implementing an automated TLD price tracking system that collects price data over 12 months and presents it accurately.

**🎯 Focus: 100% free & independent**

---

## 🔍 How does the domain price ecosystem work?

### Understanding the price chain
```
HOW DOMAIN PRICES COME ABOUT

1️⃣ REGISTRY (e.g. Verisign for .com)
   • operates the TLD technically
   • sets the WHOLESALE price (regulated by ICANN)
   • .com = $9.59 wholesale (2024)
   • these prices are PUBLIC in the ICANN agreements!
        ↓
2️⃣ REGISTRAR (Namecheap, GoDaddy, Cloudflare ...)
   • buys domains at the wholesale price
   • sells to end customers with a MARGIN
   • .com = $10-15 retail (depending on the registrar)
   • prices are PUBLICLY visible on their websites
        ↓
3️⃣ AGGREGATORS (TLD-List.com, DomComp ...)
   HOW THEY GET THE DATA:

   a) 💰 AFFILIATE PROGRAMS (main source!)
      → registrars give affiliates access to price feeds
      → TLD-List earns a commission per referral
      → free, but requires traffic/registration

   b) 🔗 RESELLER APIs
      → as a reseller you get API access
      → requires a minimum balance (~$100-500)

   c) 🌐 WEB SCRAPING
      → automatically read public pricing pages
      → technically simple, legally a grey area

   d) 📋 PUBLIC REGISTRY DATA
      → ICANN publishes the wholesale prices
      → basis for price calculations
```
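To make the chain concrete: a registrar's retail price is roughly the public wholesale price plus the fixed ICANN transaction fee plus the registrar's own margin. A tiny illustration - the $9.59 wholesale figure comes from this plan, the ICANN fee is the current $0.18 per registration-year, and the margin is a made-up example value:

```python
# Illustrative only: rough decomposition of a .com retail price.
wholesale = 9.59   # .com wholesale price (2024), public in the ICANN/Verisign agreement
icann_fee = 0.18   # fixed ICANN transaction fee per registration-year
margin = 1.50      # registrar markup - example value, varies per registrar

retail_estimate = wholesale + icann_fee + margin
print(f"Estimated .com retail price: ${retail_estimate:.2f}")  # -> $11.27
```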
### Why can TLD-List & co. show all the prices?

| Method | How it works | Usable for us? |
|--------|--------------|----------------|
| **Affiliate programs** | Registrars hand price feeds to partners who bring them traffic | ⭐ YES - free, requires sign-up |
| **Reseller status** | Resellers get API access | ⚠️ Requires a minimum deposit |
| **Web scraping** | Read publicly visible pricing pages | ⭐ YES - possible right away |
| **Direct partnerships** | Business deals with registrars | ❌ Only for big players |

---
## 🎯 Goals

1. **Automated data collection** - a cron job crawls prices from several registrars daily/weekly
2. **Historical data** - 12 months of price history per TLD and registrar
3. **Real market data** - no generated/simulated data
4. **Local & server** - works identically in both environments
5. **🆓 Free** - no API keys, no external dependencies

---
## 🛠️ Our options (from simple to complex)

### Option A: Web scraping (immediately actionable) ⭐ RECOMMENDED

```
Effort: 2-3 days | Cost: $0 | Reliability: ⭐⭐⭐⭐
```

**How it works:**
- Registrars publish their prices openly on their websites
- We read those pages automatically (BeautifulSoup)
- Aggregator sites such as TLD-List.com have already collected everything in one place

**Sources:**

| Source | What we get | Difficulty |
|--------|-------------|------------|
| TLD-List.com | ~1500 TLDs, 50+ registrars | ⭐ Very easy |
| Porkbun.com | Direct prices | ⭐ Very easy |
| Spaceship.com | Direct prices | ⭐ Easy |

---
### Option B: Affiliate programs (best long-term solution) ⭐⭐

```
Effort: 1 week | Cost: $0 | Reliability: ⭐⭐⭐⭐⭐
```

**How it works:**
- Registrars such as Namecheap, Porkbun, etc. run affiliate programs
- As an affiliate you get access to structured price feeds
- You even earn a commission when someone buys through your link!

**Available affiliate programs:**

| Registrar | Affiliate program | Price feed? |
|-----------|-------------------|-------------|
| **Namecheap** | namecheap.com/support/affiliates | ✅ CSV/XML feed |
| **Porkbun** | porkbun.com/affiliate | ✅ API access |
| **Dynadot** | dynadot.com/community/affiliate | ✅ Price API |
| **NameSilo** | namesilo.com/affiliate | ✅ Bulk prices |

---
### Option C: Reseller API (most professional solution)

```
Effort: 1-2 weeks | Cost: $100-500 deposit | Reliability: ⭐⭐⭐⭐⭐
```

**How it works:**
- Become a reseller with a wholesaler (ResellerClub, eNom, OpenSRS)
- You get full API access to all TLD prices
- A one-time minimum deposit is required

**Reseller platforms:**

| Platform | Minimum deposit | API quality |
|----------|-----------------|-------------|
| ResellerClub | ~$100 | ⭐⭐⭐⭐⭐ |
| eNom | ~$250 | ⭐⭐⭐⭐ |
| OpenSRS | ~$500 | ⭐⭐⭐⭐⭐ |

---
### Option D: Official registry data (for wholesale prices)

```
Effort: 1 day | Cost: $0 | Scope: wholesale prices only
```

**How it works:**
- ICANN publishes the registry agreements, including wholesale prices
- Verisign (.com), PIR (.org), etc. - all prices are public
- Retail price = wholesale + registrar margin

**Public sources:**

| Source | URL | Data |
|--------|-----|------|
| ICANN Contracts | icann.org/resources/agreements | Wholesale prices |
| IANA Root DB | iana.org/domains/root/db | TLD list + registry |
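
As a cheap sanity check for option D, the official TLD list can also be pulled from IANA's machine-readable file. That URL is not mentioned above, but it is a stable, public IANA endpoint; a minimal sketch:

```python
import httpx

# Public, machine-readable TLD list maintained by IANA (names only, no prices)
IANA_TLD_LIST = "https://data.iana.org/TLD/tlds-alpha-by-domain.txt"

async def fetch_iana_tlds() -> list[str]:
    """Return the official list of TLDs, lower-cased, for validating scraped data."""
    async with httpx.AsyncClient() as client:
        response = await client.get(IANA_TLD_LIST, timeout=30)
    # The first line is a version/date comment; the remaining lines are upper-case TLDs
    return [line.lower() for line in response.text.splitlines() if line and not line.startswith("#")]
```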

---

## 📊 Data sources (100% free)

### ⭐ Tier 1: Aggregator sites (BEST OPTION - one source for everything!)

| Source | URL | Advantages | Scraping |
|--------|-----|------------|----------|
| **TLD-List.com** | https://tld-list.com/ | All TLDs, all registrars, comparison tables | ⭐ Very easy |
| **DomComp** | https://www.domcomp.com/ | Price comparison, historical data | Easy |

**Why TLD-List.com is the best choice:**
- ✅ ~1500 TLDs covered
- ✅ 50+ registrars compared
- ✅ Clean HTML structure (easy to parse)
- ✅ Updated regularly
- ✅ No login/auth required
- ✅ No API rate limits

### Tier 2: Direct registrar pricing pages (backup/validation)

| Registrar | URL | Format | Difficulty |
|-----------|-----|--------|------------|
| **Porkbun** | https://porkbun.com/products/domains | HTML table | ⭐ Very easy |
| **Namecheap** | https://www.namecheap.com/domains/domain-pricing/ | JS-rendered | Medium |
| **Cloudflare** | https://www.cloudflare.com/products/registrar/ | "At-cost" list | Easy |
| **Spaceship** | https://www.spaceship.com/pricing/ | HTML table | Easy |
| **NameSilo** | https://www.namesilo.com/pricing | HTML table | Easy |

### Tier 3: Official sources (validation/reference)

| Source | URL | Data |
|--------|-----|------|
| **IANA** | https://www.iana.org/domains/root/db | Official TLD list |
| **ICANN** | https://www.icann.org/resources/pages/registries | Registry information |

---
## 🏗️ System architecture (API-free)

```
TLD Price Tracking System (100% free)

APScheduler (cron jobs)
  - daily:  03:00 UTC
  - weekly: Sunday
        │
        ▼
Web Scraper Service
  ├─ TLD-List.com scraper   (main source - all TLDs)
  ├─ Porkbun scraper        (backup)
  └─ Spaceship scraper      (backup)
        │
        ▼
Data aggregation & validation
  - cross-check prices from multiple sources
  - detect outliers (>20% change = warning)
  - calculate confidence score
        │
        ▼
SQLite/PostgreSQL
  - tld_prices  (history)
  - tld_info    (metadata)
  - scrape_logs (audit trail)
        │
        ▼
FastAPI backend
  - GET  /tld-prices/overview        (cached + fresh)
  - GET  /tld-prices/{tld}/history   (12 months of real data)
  - GET  /tld-prices/trending        (real trends from the DB)
  - POST /admin/scrape               (manual trigger)
```
### Why web scraping is the best solution

| Aspect | APIs | Web scraping |
|--------|------|--------------|
| **Cost** | Sometimes paid | ✅ Free |
| **API keys** | Required | ✅ Not needed |
| **Rate limits** | Strict | ✅ Under our control |
| **Dependency** | On the provider | ✅ Independent |
| **Stability** | API changes | HTML changes (rare) |
| **Coverage** | One registrar only | ✅ All of them via an aggregator |

---
## 📁 New files & structure

```
backend/
├── app/
│   ├── services/
│   │   ├── tld_scraper/          # NEW: scraper package
│   │   │   ├── __init__.py
│   │   │   ├── base.py           # Base class for scrapers
│   │   │   ├── tld_list.py       # TLD-List.com scraper (main source)
│   │   │   ├── porkbun.py        # Porkbun scraper (backup)
│   │   │   ├── spaceship.py      # Spaceship scraper (backup)
│   │   │   ├── validator.py      # Cross-validation
│   │   │   └── aggregator.py     # Combines all sources
│   │   │
│   │   └── scheduler.py          # NEW: APScheduler service
│   │
│   ├── models/
│   │   ├── tld_price.py          # Already exists ✓
│   │   └── scrape_log.py         # NEW: audit logs
│   │
│   └── api/
│       └── tld_prices.py         # Adapt to serve real DB data
│
├── scripts/
│   ├── seed_initial_prices.py    # NEW: seed initial data
│   └── manual_scrape.py          # NEW: manual scrape
│
└── .env                          # Only scraper settings (no API keys!)
```

---
## 🔧 Configuration (no API keys needed!)

### `.env` additions

```env
# ===== TLD Price Scraper (100% free) =====

# Scraper settings
SCRAPER_ENABLED=true
SCRAPER_SCHEDULE_DAILY_HOUR=3          # Time of day (UTC)
SCRAPER_SCHEDULE_WEEKLY_DAY=sun        # Full scrape
SCRAPER_MAX_RETRIES=3
SCRAPER_TIMEOUT_SECONDS=30
SCRAPER_DELAY_BETWEEN_REQUESTS=2       # Seconds between requests

# User-agent rotation (optional, for stealth)
SCRAPER_USER_AGENTS="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"

# Enable/disable sources
SCRAPER_SOURCE_TLDLIST=true            # Main source
SCRAPER_SOURCE_PORKBUN=true            # Backup
SCRAPER_SOURCE_SPACESHIP=true          # Backup

# Validation
SCRAPER_MAX_PRICE_CHANGE_PERCENT=20    # Warn on >20% change
SCRAPER_MIN_PRICE_USD=0.50             # Minimum valid price
SCRAPER_MAX_PRICE_USD=500              # Maximum valid price
```
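How these variables get loaded is not specified in this plan. Assuming the backend uses pydantic-settings (an assumption - it may just as well read `os.environ` directly), the settings object could look like this:

```python
# Sketch only - assumes pydantic-settings; field names mirror the .env keys above
from pydantic_settings import BaseSettings, SettingsConfigDict

class ScraperSettings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    scraper_enabled: bool = True
    scraper_schedule_daily_hour: int = 3
    scraper_max_retries: int = 3
    scraper_timeout_seconds: int = 30
    scraper_delay_between_requests: float = 2.0
    scraper_max_price_change_percent: float = 20.0
    scraper_min_price_usd: float = 0.50
    scraper_max_price_usd: float = 500.0

settings = ScraperSettings()  # env var names match the field names (case-insensitive)
```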
---

## 📅 Crawl strategy

### When to crawl?

| Frequency | Time | Why |
|-----------|------|-----|
| **Daily** | 03:00 UTC | Low server load, fresh data for the new day |
| **Weekly** | Sunday 03:00 | Full crawl of all TLDs and registrars |
| **On demand** | Manual trigger | An admin can crawl manually |

### What to crawl?

| Priority | TLDs | Frequency |
|----------|------|-----------|
| **High** | com, net, org, io, co, ai, app, dev | Daily |
| **Medium** | info, biz, xyz, me, cc, tv | 2x per week |
| **Low** | Everything else (~500 TLDs) | Weekly |

How these tiers can be mapped onto individual scheduler jobs is sketched at the end of this section.

### Data per crawl

```python
{
    "tld": "com",
    "registrar": "cloudflare",
    "registration_price": 10.44,
    "renewal_price": 10.44,
    "transfer_price": 10.44,
    "promo_price": None,
    "currency": "USD",
    "recorded_at": "2024-12-08T03:00:00Z",
    "source": "api",     # or "scrape"
    "confidence": 1.0    # 0.0-1.0
}
```
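A hedged sketch of the priority tiers as separate scheduler jobs. It assumes `scrape_tld_prices` (the job function referenced in `scheduler.py` below) accepts an optional list of TLDs, which this plan has not specified; the weekday choices for the medium tier are arbitrary:

```python
from apscheduler.triggers.cron import CronTrigger

# Tier lists taken from the table above
HIGH_PRIORITY = ["com", "net", "org", "io", "co", "ai", "app", "dev"]
MEDIUM_PRIORITY = ["info", "biz", "xyz", "me", "cc", "tv"]

def setup_priority_jobs(scheduler):
    # High priority: every day at 03:00 UTC
    scheduler.add_job(scrape_tld_prices, CronTrigger(hour=3), kwargs={"tlds": HIGH_PRIORITY},
                      id="scrape_high", replace_existing=True)
    # Medium priority: twice a week
    scheduler.add_job(scrape_tld_prices, CronTrigger(day_of_week="tue,fri", hour=3),
                      kwargs={"tlds": MEDIUM_PRIORITY}, id="scrape_medium", replace_existing=True)
    # Everything else: full run on Sunday (tlds=None = all TLDs)
    scheduler.add_job(scrape_tld_prices, CronTrigger(day_of_week="sun", hour=3),
                      kwargs={"tlds": None}, id="scrape_all", replace_existing=True)
```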
---

## 🗃️ Database schema extension

### New model: `CrawlLog`

```python
class CrawlLog(Base):
    """Audit trail for crawler activity."""

    __tablename__ = "crawl_logs"

    id: Mapped[int] = mapped_column(primary_key=True)

    # Crawl info
    started_at: Mapped[datetime]
    completed_at: Mapped[datetime | None]
    status: Mapped[str]  # running, success, partial, failed

    # Statistics
    tlds_crawled: Mapped[int] = mapped_column(default=0)
    registrars_crawled: Mapped[int] = mapped_column(default=0)
    prices_collected: Mapped[int] = mapped_column(default=0)
    errors_count: Mapped[int] = mapped_column(default=0)

    # Details
    error_details: Mapped[str | None] = mapped_column(Text, nullable=True)
    source_breakdown: Mapped[str | None] = mapped_column(Text, nullable=True)  # JSON
```

### Extension of `TLDPrice`

```python
# Additional fields
source: Mapped[str] = mapped_column(String(20), default="api")  # api, scrape, manual
confidence: Mapped[float] = mapped_column(Float, default=1.0)  # 0.0-1.0
crawl_log_id: Mapped[int | None] = mapped_column(ForeignKey("crawl_logs.id"), nullable=True)
```
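The next-steps list calls for a database migration for these fields. Assuming Alembic is used (an assumption - revision IDs and exact column options would come from the real migration), it would look roughly like this:

```python
# Hedged sketch of the Alembic migration for the new TLDPrice columns
import sqlalchemy as sa
from alembic import op

def upgrade() -> None:
    # batch_alter_table keeps this compatible with SQLite as well
    with op.batch_alter_table("tld_prices") as batch:
        batch.add_column(sa.Column("source", sa.String(20), nullable=False, server_default="api"))
        batch.add_column(sa.Column("confidence", sa.Float(), nullable=False, server_default="1.0"))
        batch.add_column(sa.Column("crawl_log_id", sa.Integer(), sa.ForeignKey("crawl_logs.id"), nullable=True))

def downgrade() -> None:
    with op.batch_alter_table("tld_prices") as batch:
        batch.drop_column("crawl_log_id")
        batch.drop_column("confidence")
        batch.drop_column("source")
```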
---

## 🔧 Implementation (web scraping)

### Phase 1: Base infrastructure (day 1)

1. **Create the scheduler service** (how it is wired into the FastAPI app is sketched after step 2)
```python
# backend/app/services/scheduler.py
from apscheduler.schedulers.asyncio import AsyncIOScheduler
from apscheduler.triggers.cron import CronTrigger

scheduler = AsyncIOScheduler()

def setup_scheduler():
    # Daily scrape at 03:00 UTC
    scheduler.add_job(
        scrape_tld_prices,
        CronTrigger(hour=3, minute=0),
        id="daily_tld_scrape",
        replace_existing=True
    )
    scheduler.start()
```

2. **Scraper base class**
```python
# backend/app/services/tld_scraper/base.py
from abc import ABC, abstractmethod
from dataclasses import dataclass

import httpx

@dataclass
class TLDPriceData:
    tld: str
    registrar: str
    registration_price: float
    renewal_price: float | None
    transfer_price: float | None
    currency: str = "USD"
    source: str = "scrape"
    confidence: float = 1.0  # set by the aggregator during validation

class BaseTLDScraper(ABC):
    name: str
    base_url: str

    @abstractmethod
    async def scrape(self) -> list[TLDPriceData]:
        """Scrape prices from the source."""
        pass

    async def health_check(self) -> bool:
        """Check if the source is accessible."""
        async with httpx.AsyncClient() as client:
            response = await client.get(self.base_url, timeout=10)
            return response.status_code == 200
```
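The plan runs APScheduler inside the same container as the API, so the scheduler has to be started with the app. A minimal wiring sketch, assuming the FastAPI lifespan pattern (module paths follow the structure above; this is not the final `main.py`):

```python
# backend/app/main.py (sketch) - start/stop the scheduler with the API process
from contextlib import asynccontextmanager

from fastapi import FastAPI

from app.services.scheduler import scheduler, setup_scheduler

@asynccontextmanager
async def lifespan(app: FastAPI):
    setup_scheduler()               # registers the cron jobs and starts APScheduler
    yield
    scheduler.shutdown(wait=False)  # stop cleanly when the app shuts down

app = FastAPI(lifespan=lifespan)
```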
### Phase 2: TLD-List.com scraper (day 2) - MAIN SOURCE

**Why TLD-List.com?**
- Aggregates prices from 50+ registrars
- ~1500 TLDs covered
- Clean HTML table structure
- No JavaScript rendering required

```python
# backend/app/services/tld_scraper/tld_list.py
from bs4 import BeautifulSoup
import httpx

class TLDListScraper(BaseTLDScraper):
    """Scraper for tld-list.com - the best free source."""

    name = "tld-list"
    base_url = "https://tld-list.com"

    # URLs for the different TLD categories
    ENDPOINTS = {
        "all": "/tlds-from-a-z/",
        "new": "/new-tlds/",
        "cheapest": "/cheapest-tlds/",
    }

    async def scrape(self) -> list[TLDPriceData]:
        results = []

        async with httpx.AsyncClient() as client:
            # Scrape all TLDs
            response = await client.get(
                f"{self.base_url}{self.ENDPOINTS['all']}",
                headers={"User-Agent": self.get_user_agent()},
                timeout=30
            )

            soup = BeautifulSoup(response.text, "lxml")

            # Find the TLD table
            table = soup.find("table", {"class": "tld-table"})
            if not table:
                raise ScraperError("TLD table not found")

            for row in table.find_all("tr")[1:]:  # Skip header
                cells = row.find_all("td")
                if len(cells) >= 4:
                    tld = cells[0].text.strip().lstrip(".")

                    # Extract prices from the individual registrars
                    price_cell = cells[1]  # Registration price
                    registrar_link = price_cell.find("a")

                    if registrar_link:
                        price = self.parse_price(price_cell.text)
                        registrar = registrar_link.get("data-registrar", "unknown")

                        results.append(TLDPriceData(
                            tld=tld,
                            registrar=registrar,
                            registration_price=price,
                            renewal_price=self.parse_price(cells[2].text),
                            transfer_price=self.parse_price(cells[3].text),
                        ))

        return results

    def parse_price(self, text: str) -> float | None:
        """Parse a price from text like '$9.99' or '€8.50'."""
        import re
        match = re.search(r'[\$€£]?([\d,]+\.?\d*)', text.replace(",", ""))
        return float(match.group(1)) if match else None

    def get_user_agent(self) -> str:
        """Rotate user agents to avoid detection."""
        import random
        agents = [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
        ]
        return random.choice(agents)
```
### Phase 3: Backup scrapers (day 3)

```python
# backend/app/services/tld_scraper/porkbun.py
from bs4 import BeautifulSoup
import httpx

class PorkbunScraper(BaseTLDScraper):
    """Backup: scrape directly from Porkbun."""

    name = "porkbun"
    base_url = "https://porkbun.com/products/domains"

    async def scrape(self) -> list[TLDPriceData]:
        async with httpx.AsyncClient() as client:
            response = await client.get(
                self.base_url,
                headers={"User-Agent": self.get_user_agent()},
                timeout=30
            )

        soup = BeautifulSoup(response.text, "lxml")
        results = []

        # Porkbun has a clean table structure
        # parse_price() / get_user_agent() are the shared helpers shown above
        for tld_div in soup.find_all("div", {"class": "tldRow"}):
            tld = tld_div.find("span", {"class": "tld"}).text.strip()
            price = tld_div.find("span", {"class": "price"}).text.strip()

            results.append(TLDPriceData(
                tld=tld.lstrip("."),
                registrar="porkbun",
                registration_price=self.parse_price(price),
                renewal_price=None,  # Requires a separate request
                transfer_price=None,
            ))

        return results
```
### Phase 4: Aggregator & validation (day 4)

```python
# backend/app/services/tld_scraper/aggregator.py
import json
from datetime import datetime

from sqlalchemy.ext.asyncio import AsyncSession

class TLDPriceAggregator:
    """Combines all scrapers and validates the data."""

    def __init__(self):
        self.scrapers = [
            TLDListScraper(),    # Main source
            PorkbunScraper(),    # Backup
            SpaceshipScraper(),  # Backup
        ]

    async def run_full_scrape(self, db: AsyncSession) -> CrawlLog:
        log = CrawlLog(started_at=datetime.utcnow(), status="running")
        all_prices: dict[str, list[TLDPriceData]] = {}
        errors: list[str] = []
        source_stats: dict[str, int] = {}

        for scraper in self.scrapers:
            try:
                prices = await scraper.scrape()

                for price in prices:
                    key = f"{price.tld}_{price.registrar}"
                    if key not in all_prices:
                        all_prices[key] = []
                    all_prices[key].append(price)

                source_stats[scraper.name] = len(prices)
            except Exception as e:
                errors.append(f"{scraper.name}: {str(e)}")

        # Validation: cross-check between the sources
        validated_prices = self.validate_prices(all_prices)

        # Persist to the DB
        await self.save_to_db(db, validated_prices)

        log.prices_collected = len(validated_prices)
        log.errors_count = len(errors)
        log.error_details = "\n".join(errors) or None
        log.source_breakdown = json.dumps(source_stats)
        log.completed_at = datetime.utcnow()
        log.status = "success" if not errors else "partial"

        return log

    def validate_prices(self, prices: dict) -> list[TLDPriceData]:
        """Cross-validate prices from multiple sources."""
        validated = []

        for key, price_list in prices.items():
            if len(price_list) == 1:
                # Only one source - use it with lower confidence
                price = price_list[0]
                price.confidence = 0.7
                validated.append(price)
            else:
                # Multiple sources - take the average
                avg_price = sum(p.registration_price for p in price_list) / len(price_list)

                # Check whether the prices agree (max 10% deviation)
                is_consistent = all(
                    abs(p.registration_price - avg_price) / avg_price < 0.1
                    for p in price_list
                )

                result = price_list[0]
                result.registration_price = avg_price
                result.confidence = 0.95 if is_consistent else 0.8
                validated.append(result)

        return validated
```

---
## 🖥️ Local vs server

### Local (development)

```bash
# .env.local
SCRAPER_ENABLED=true
SCRAPER_SCHEDULE_DAILY_HOUR=*   # Every hour, for testing
DATABASE_URL=sqlite+aiosqlite:///./domainwatch.db
```

```bash
# Manual test
curl -X POST http://localhost:8000/api/v1/admin/crawl-prices
```
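The endpoint behind that curl call is not spelled out in this plan; a hedged sketch of what it could look like (the router prefix, import paths and the `get_db` dependency are assumptions, not the final API):

```python
# Sketch of the manual trigger endpoint - names are assumptions
from fastapi import APIRouter, Depends
from sqlalchemy.ext.asyncio import AsyncSession

from app.api.deps import get_db  # assumed location of the session dependency
from app.services.tld_scraper.aggregator import TLDPriceAggregator

router = APIRouter(prefix="/admin", tags=["admin"])

@router.post("/crawl-prices")
async def trigger_crawl(db: AsyncSession = Depends(get_db)):
    """Run a full scrape immediately instead of waiting for the cron job."""
    log = await TLDPriceAggregator().run_full_scrape(db)
    return {"status": log.status, "prices_collected": log.prices_collected}
```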
### Server (production)

```bash
# .env
SCRAPER_ENABLED=true
SCRAPER_SCHEDULE_DAILY_HOUR=3
DATABASE_URL=postgresql+asyncpg://user:pass@db:5432/pounce

# Docker Compose
services:
  backend:
    environment:
      - SCRAPER_ENABLED=true
      # APScheduler runs in the same container
```

### Systemd service (without Docker)

```ini
# /etc/systemd/system/pounce-crawler.service
[Unit]
Description=Pounce TLD Price Crawler
After=network.target

[Service]
Type=simple
User=pounce
WorkingDirectory=/opt/pounce/backend
ExecStart=/opt/pounce/backend/venv/bin/python -m app.services.scheduler
Restart=always

[Install]
WantedBy=multi-user.target
```

---
## 📈 API endpoints (adapted)

```python
# Real historical data instead of generated data

@router.get("/{tld}/history")
async def get_tld_price_history(tld: str, db: Database, days: int = 365):
    """Real 12-month data from the database."""

    result = await db.execute(
        select(TLDPrice)
        .where(TLDPrice.tld == tld)
        .where(TLDPrice.recorded_at >= datetime.utcnow() - timedelta(days=days))
        .order_by(TLDPrice.recorded_at)
    )
    prices = result.scalars().all()

    # Group by date and compute the average
    return aggregate_daily_prices(prices)
```
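`aggregate_daily_prices` is referenced above but not defined anywhere in this plan; a possible shape, with the response format being an assumption:

```python
# Hedged sketch of the helper used above - groups rows per day and averages across registrars
from collections import defaultdict
from datetime import date

def aggregate_daily_prices(prices) -> list[dict]:
    per_day: dict[date, list[float]] = defaultdict(list)
    for price in prices:
        per_day[price.recorded_at.date()].append(price.registration_price)
    return [
        {
            "date": day.isoformat(),
            "avg_price": round(sum(values) / len(values), 2),
            "samples": len(values),
        }
        for day, values in sorted(per_day.items())
    ]
```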
---

## 📦 Dependencies

### New requirements.txt entries

```txt
# Web scraping (main method - FREE!)
beautifulsoup4>=4.12.0
lxml>=5.0.0

# Optional: for JavaScript-heavy pages (only if needed)
# playwright>=1.40.0  # Large, enable only when required

# Rate limiting (respectful scraping)
aiolimiter>=1.1.0

# Already present:
# httpx>=0.28.0 ✓
# apscheduler>=3.10.4 ✓
```

**Total size of the new dependencies: ~2 MB** (minimal!)

---

## 🚀 Implementation timeline (faster without APIs!)

| Phase | Duration | Description |
|-------|----------|-------------|
| **1** | 1 day | Scheduler + DB schema + base scraper |
| **2** | 1 day | TLD-List.com scraper (main source) |
| **3** | 0.5 day | Porkbun + Spaceship backup scrapers |
| **4** | 0.5 day | Aggregator + validation |
| **5** | 1 day | API endpoints + frontend adjustments |
| **6** | 1 day | Testing + initial scrape |

**Total: ~5 days** (faster than with APIs!)

---
## ⚠️ Important notes

### Respectful scraping

```python
import asyncio
import random

from aiolimiter import AsyncLimiter

# Max 30 requests per minute (respectful)
scrape_limiter = AsyncLimiter(30, 60)

async def scrape_with_limit(url: str):
    async with scrape_limiter:
        # Random delay between requests
        await asyncio.sleep(random.uniform(1, 3))
        return await make_request(url)  # make_request = the scraper's HTTP helper
```

### Robots.txt compliance

```python
# Check before scraping
import urllib.robotparser
import httpx

async def check_robots_txt(base_url: str, path: str = "/") -> bool:
    """Check if scraping the given path is allowed."""
    async with httpx.AsyncClient() as client:
        response = await client.get(f"{base_url}/robots.txt", timeout=10)
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(response.text.splitlines())
    # TLD-List.com allows scraping (no Disallow on the relevant paths)
    return parser.can_fetch("*", f"{base_url}{path}")
```

### Error handling

```python
class ScraperError(Exception):
    """Base exception for scraper errors."""
    pass

class HTMLStructureChanged(ScraperError):
    """The website structure changed - the scraper needs to be updated."""
    pass

class RateLimitDetected(ScraperError):
    """Too many requests - wait and retry."""
    retry_after: int = 300  # 5 minutes
```

### Data quality

- **Confidence score** (matching the aggregator above):
  - 0.95 = multiple sources agree
  - 0.80 = multiple sources, but with noticeable deviation
  - 0.70 = only one source
- **Outlier detection**: warn when a price changes by more than 20% within 24h
- **Validation**: a price must lie between $0.50 and $500
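
A small sketch of these two checks, using the thresholds from the `.env` section (the function names are illustrative):

```python
def is_price_outlier(new_price: float, previous_price: float, max_change_percent: float = 20.0) -> bool:
    """True if the price moved more than the allowed percentage since the last crawl."""
    if previous_price <= 0:
        return False  # no sane baseline to compare against
    change_percent = abs(new_price - previous_price) / previous_price * 100
    return change_percent > max_change_percent

def is_price_plausible(price: float, min_usd: float = 0.50, max_usd: float = 500.0) -> bool:
    """Sanity bounds from SCRAPER_MIN_PRICE_USD / SCRAPER_MAX_PRICE_USD."""
    return min_usd <= price <= max_usd
```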

---

## 🔒 Security & best practices

1. **No API keys needed** ✅ (web scraping is 100% free)
2. **User-agent rotation**: rotating browser identities
3. **Rate limiting**: max 30 req/min, 2-3 seconds of delay
4. **Robots.txt**: always respect it
5. **Backup**: daily database backup
6. **Monitoring**: alert on scraper failures (HTML changes)

---

## 🧪 Robustness strategies
### Detecting HTML structure changes

```python
async def validate_page_structure(soup: BeautifulSoup) -> bool:
    """Check whether the expected elements still exist."""
    # Structural anchor the scraper relies on
    if not soup.find("table", {"class": "tld-table"}):
        return False
    # Column headers are matched by their text, not by an attribute
    for header_text in ("TLD", "Price"):
        if not soup.find("th", string=header_text):
            return False
    return True
```

### Fallback chain

```
TLD-List.com (main source)
        ↓ (on failure)
Porkbun (backup 1)
        ↓ (on failure)
Spaceship (backup 2)
        ↓ (on failure)
Use the last known data from the DB + alert
```
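A sketch of that chain in code. The logging setup is standard library; how the final alert is sent is left open here, since the plan only says "alert":

```python
import logging

logger = logging.getLogger(__name__)

async def scrape_with_fallback(scrapers: list[BaseTLDScraper]) -> list[TLDPriceData]:
    """Try the sources in order; if all fail, the API keeps serving the last data in the DB."""
    for scraper in scrapers:  # e.g. [TLDListScraper(), PorkbunScraper(), SpaceshipScraper()]
        try:
            prices = await scraper.scrape()
            if prices:
                return prices
            logger.warning("Scraper %s returned no data", scraper.name)
        except ScraperError as exc:
            logger.warning("Scraper %s failed: %s", scraper.name, exc)
    # All sources failed: fall back to the last known DB data and notify an admin (alerting TBD)
    return []
```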

---

## ✅ Next steps (no API setup needed!)

1. [ ] Database migration for the new fields
2. [ ] Implement the scheduler service
3. [ ] Develop the TLD-List.com scraper
4. [ ] Backup scrapers (Porkbun, Spaceship)
5. [ ] Aggregator + validation
6. [ ] Adapt the API endpoints
7. [ ] Run the initial scrape
8. [ ] Update the frontend to use real historical data

---

## 💰 Cost overview

| Item | Cost |
|------|------|
| API keys | $0 ✅ |
| External services | $0 ✅ |
| Server resources | Minimal (1x daily, ~100 requests) |
| Maintenance effort | ~1h/month (check for HTML changes) |

**Total: $0/month** 🎉

---

**Shall I start the implementation?**