add some more makelaars, and some more infra

2026-04-03 15:49:42 +02:00
parent 26d9d936f4
commit 17b35d1997
9 changed files with 928 additions and 70 deletions
--- a/add_scraper_context.md
+++ b/add_scraper_context.md
@@ -0,0 +1,358 @@
+# Huizenbot — Agent Context for Adding Routes
+
+## Project Overview
+
+**Huizenbot** is a periodic scraper of real estate broker websites in Delft and Schiedam (Netherlands). It:
+- Fetches property listings from broker websites
+- Saves new ones to SQLite with `RawListing` schema
+- Calculates travel times (bike + public transit) to two work locations
+- Sends push notifications via Home Assistant webhook (with email fallback)
+
+**Your role:** You will add new broker routes (scrapers) to the `adapters/` directory. A human will:
+1. Select a broker from the list
+2. Help you investigate the broker's website
+3. For API-based brokers: develop curl requests to test
+4. For HTML scrapers: develop parsing logic using BeautifulSoup
+5. Run `tests/test_adapters.py` to validate
+6. Merge your code snippets into the codebase
+
+---
+
+## Key Schema: RawListing
+
+**Location:** `src/huizenbot.py` (lines 29–52)
+
+This is the data model you must populate. All fields except `url` are optional:
+
+```python
+@dataclass
+class RawListing:
+    url: str                          # REQUIRED — the listing URL
+    
+    source_makelaar: str = ""         # Name of the broker (e.g., "bjornd", "vdaal")
+    datum_aanmelding: str | None = None  # ISO 8601 date if available
+    status: str = "beschikbaar"       # enum: beschikbaar | onder_bod | verkocht
+    
+    # Location
+    adres: str | None = None          # Street address (e.g., "Binnenwatersloot 3")
+    postcode: str | None = None       # Dutch postcode (e.g., "2611CA")
+    stad: str | None = None           # City (e.g., "Delft")
+    
+    # Property details
+    prijs: int | None = None          # Price in euros (integer, no float)
+    woningtype: str | None = None     # Type (e.g., "appartement", "tussenwoning")
+    woonoppervlak: int | None = None  # Living space in m²
+    perceeloppervlak: int | None = None  # Plot size in m² (NULL for apartments)
+    kamers: int | None = None         # Number of rooms
+    slaapkamers: int | None = None    # Number of bedrooms
+    bouwjaar: int | None = None       # Build year
+    energielabel: str | None = None   # Energy label (e.g., "A", "B")
+    
+    # Media
+    hero_image_url: str | None = None # Main photo URL
+    
+    # Extra data (broker-specific fields)
+    extra: dict[str, Any] = field(default_factory=dict)  # Arbitrary JSON data
+```
+
+**DB Upsert:** A listing is inserted the first time it is seen, with `id = sha256(url)`. On subsequent runs only `last_seen` and `status` are updated. Travel times are calculated only on the first insert.
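+
+The upsert rule above can be sketched as follows. This is only an illustration: the `listings` table, its columns, and the function names are assumptions, not the actual schema in `huizenbot.py`.
+
+```python
+import hashlib
+import sqlite3
+from datetime import datetime, timezone
+
+
+def listing_id(url: str) -> str:
+    """Stable primary key: hex SHA-256 digest of the listing URL."""
+    return hashlib.sha256(url.encode("utf-8")).hexdigest()
+
+
+def upsert_listing(conn: sqlite3.Connection, url: str, status: str) -> bool:
+    """Insert a new listing or refresh an existing one.
+
+    Returns True on first insert (the point where travel times would be
+    calculated) and False when only last_seen/status were updated.
+    """
+    now = datetime.now(timezone.utc).isoformat()
+    lid = listing_id(url)
+    exists = conn.execute("SELECT 1 FROM listings WHERE id = ?", (lid,)).fetchone()
+    if exists is None:
+        conn.execute(
+            "INSERT INTO listings (id, url, status, first_seen, last_seen) "
+            "VALUES (?, ?, ?, ?, ?)",
+            (lid, url, status, now, now),
+        )
+        return True
+    conn.execute(
+        "UPDATE listings SET last_seen = ?, status = ? WHERE id = ?",
+        (now, status, lid),
+    )
+    return False
+
+
+# Hypothetical minimal table, in memory, just to exercise the sketch.
+conn = sqlite3.connect(":memory:")
+conn.execute(
+    "CREATE TABLE listings ("
+    "id TEXT PRIMARY KEY, url TEXT, status TEXT, first_seen TEXT, last_seen TEXT)"
+)
+```
+
+Calling `upsert_listing` twice with the same URL inserts once and then only touches `last_seen` and `status`, which is why an adapter can safely return the same listing on every run.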
+
+---
+
+## Adapter Structure
+
+Adapters live in `src/adapters/` and are organized by type:
+
+### Two Adapter Types
+
+#### 1. **API-based** (`src/adapters/api.py`)
+For brokers with REST/JSON endpoints.
+
+**Pattern:**
+```python
+def fetch_bjornd() -> list[RawListing]:
+    data = fetch_json("https://...", params={...}, headers={...})
+    listings = []
+    for item in data:
+        # Filter / validate
+        if item.get("status") in _SKIP:
+            continue
+        if (item.get("price") or 0) > config.MAX_PRICE:  # a missing price is treated as 0
+            continue
+        
+        listings.append(RawListing(
+            url=item["url"],
+            source_makelaar="bjornd",
+            adres=item.get("address"),
+            postcode=item.get("zipcode"),
+            # ... etc
+        ))
+    
+    log.info("bjornd: %d listings", len(listings))
+    return listings
+```
+
+**Helpers available:**
+- `fetch_json(url, *, params=None, headers=None)` — GET with User-Agent, timeout, Retry-After handling
+- Built-in logging via `log = logging.getLogger("huizenbot.api")`
+
+#### 2. **SSR/HTML-based** (`src/adapters/ssr.py`)
+For brokers with server-side rendered HTML.
+
+**Pattern:**
+```python
+def fetch_vdaal() -> list[RawListing]:
+    soup = fetch_soup("https://vdaalmakelaardij.nl/aanbod")
+    listings = []
+    
+    for card in soup.select(".property-card"):
+        try:
+            url = card.select_one("a[href]")["href"]
+            if not url.startswith("http"):
+                url = VDAAL_BASE + url
+            
+            adres = _text(card, ".address-selector")
+            postcode = _extract_postcode(adres)
+            prijs = parse_prijs(_text(card, ".price"))
+            
+            listings.append(RawListing(
+                url=url,
+                source_makelaar="vdaal",
+                adres=adres,
+                postcode=postcode,
+                stad=_infer_stad(postcode),
+                prijs=prijs,
+                # ... etc
+            ))
+        except Exception as e:
+            log.warning("Parse error: %s", e)
+    
+    log.info("vdaal: %d listings", len(listings))
+    return listings
+```
+
+**Helpers available:**
+- `fetch_soup(url, *, params=None)` — GET with BeautifulSoup, Retry-After handling
+- `parse_prijs(text)` — Extract price from strings like "€ 325.000 k.k." → 325000
+- `parse_m2(text)` — Extract area from "87 m²" → 87
+- `_text(soup, selector)` — Get inner text from element
+- `_src(soup, selector)` — Get src or data-src attribute
+- `_extract_postcode(text)` — Regex postcode from any text
+- `_infer_stad(postcode)` — Simple lookup: 2600–2629 → Delft, 3100–3135 → Schiedam
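+
+To make the documented behaviour of the parsing helpers concrete, here is a rough stand-alone sketch. The `sketch_*` functions are hypothetical re-implementations written from the descriptions above; the real helpers in `ssr.py` may differ in edge cases.
+
+```python
+from __future__ import annotations
+
+import re
+
+# Dutch postcode: 4 digits, optional space, 2 letters.
+_POSTCODE_RE = re.compile(r"\b(\d{4})\s?([A-Za-z]{2})\b")
+
+
+def sketch_parse_prijs(text: str | None) -> int | None:
+    """Return the first number in the text as whole euros, or None."""
+    if not text:
+        return None
+    match = re.search(r"\d[\d.]*", text)
+    if match is None:
+        return None
+    # Dots are thousands separators in Dutch price notation.
+    return int(match.group(0).replace(".", ""))
+
+
+def sketch_extract_postcode(text: str | None) -> str | None:
+    """Find a postcode anywhere in the text and normalise it to '2611CA'."""
+    if not text:
+        return None
+    match = _POSTCODE_RE.search(text)
+    if match is None:
+        return None
+    return match.group(1) + match.group(2).upper()
+
+
+def sketch_infer_stad(postcode: str | None) -> str | None:
+    """Map the numeric part of a postcode to a city, per the ranges above."""
+    if not postcode or not postcode[:4].isdigit():
+        return None
+    number = int(postcode[:4])
+    if 2600 <= number <= 2629:
+        return "Delft"
+    if 3100 <= number <= 3135:
+        return "Schiedam"
+    return None
+
+
+print(sketch_parse_prijs("€ 325.000 k.k."))
+print(sketch_extract_postcode("Binnenwatersloot 3, 2611 CA Delft"))
+print(sketch_infer_stad("2611CA"))
+```
+
+Returning `None` rather than raising keeps a single unparseable field from discarding an otherwise usable listing.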
+
+---
+
+## Registration
+
+Both `api.py` and `ssr.py` have a `SCRAPERS` dict at the bottom:
+
+```python
+# api.py
+SCRAPERS = {
+    'bjornd': fetch_bjornd,
+    'your_broker': fetch_your_broker,  # ← Add here
+}
+
+# ssr.py
+SCRAPERS = {
+    'bjornd_demo': fetch_bjornd_demo,
+    'your_broker': fetch_your_broker,  # ← Add here
+}
+```
+
+The `src/adapters/__init__.py` merges both dicts, so the runner picks up all registered adapters automatically.
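+
+One possible shape for that merge is sketched below. The stand-in dicts and `merge_scrapers` are hypothetical, and raising on a duplicate name is a design choice of this sketch, not confirmed behaviour of the real `__init__.py`.
+
+```python
+from typing import Callable
+
+# Stand-ins for api.SCRAPERS and ssr.SCRAPERS (the real ones map names to fetch functions).
+api_scrapers: dict[str, Callable[[], list]] = {"bjornd": lambda: []}
+ssr_scrapers: dict[str, Callable[[], list]] = {"bjornd_demo": lambda: []}
+
+
+def merge_scrapers(*registries: dict[str, Callable[[], list]]) -> dict[str, Callable[[], list]]:
+    """Combine adapter registries, refusing to silently overwrite a name."""
+    merged: dict[str, Callable[[], list]] = {}
+    for registry in registries:
+        for name, fetch in registry.items():
+            if name in merged:
+                raise ValueError(f"duplicate adapter name: {name}")
+            merged[name] = fetch
+    return merged
+
+
+SCRAPERS = merge_scrapers(api_scrapers, ssr_scrapers)
+print(sorted(SCRAPERS))
+```
+
+Whatever the real merge does, giving each adapter a unique key across both files avoids one adapter shadowing another.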
+
+---
+
+## Testing Workflow
+
+### 1. Understand the Website
+The human will help you:
+- Identify the broker's API endpoint (or the HTML structure)
+- Check for a `robots.txt` or rate limit headers
+- Write exploratory curl requests (for APIs) or BeautifulSoup inspections
+
+### 2. Develop & Test Locally
+- Add your scraper function to the appropriate file (`api.py` or `ssr.py`)
+- Register it in the `SCRAPERS` dict
+- The human updates `tests/test_adapters.py` to point to your adapter:
+  ```python
+  ADAPTER = SCRAPERS['your_broker_name']
+  ```
+- Run the test:
+  ```bash
+  cd tests && python test_adapters.py
+  ```
+- The test prints listings in a simple format so you can validate output
+
+### 3. Merge Code
+Once validated, the human will **copy your inline code snippets** into the main codebase. You produce **easily pasteable functions**, not entire files.
+
+---
+
+## Config & Constants
+
+**Location:** `src/config.py`
+
+Key values you may reference:
+- `MAX_PRICE = 300_000` — Price filter (your scraper can skip listings above this)
+- `USER_AGENT = "Huizenbot/1.0 (+mark@kalsbeek.dev) persoonlijk gebruik"` — Used in all HTTP headers
+- `MARK_WERK_POSTCODE`, `MICHELLE_WERK_POSTCODE` — Work postcodes for travel time calculation
+
+Secrets (API keys, webhook URLs) are **environment variables**, not in config.
+
+---
+
+## CMS Detection Tool
+
+Before investigating a broker's HTML manually, ask the human to run `autoscraper.py` from the project root:
+```bash
+python autoscraper.py listings <listings-url>
+python autoscraper.py details <detail-page-url>
+```
+
+If the broker uses a known CMS, the tool prints the exact code to add — no further investigation needed. Currently detected CMSes:
+
+- **Realworks** → prints a ready-to-paste `fetch_realworks(...)` one-liner for `ssr.py`
+
+If the CMS is unknown, the tool prints structural diagnostics (card selectors, field patterns, pagination) to guide manual adapter development.
+
+---
+
+## Important Notes
+
+### Status Mapping
+Brokers use different status strings. Always map to one of:
+- `"beschikbaar"` — Available for sale
+- `"onder_bod"` — Under offer
+- `"verkocht"` — Sold
+
+Example from api.py:
+```python
+_STATUS_MAP = {
+    "available": "beschikbaar",
+    "under_bid": "onder_bod",
+    "sold": "verkocht",
+}
+status = _STATUS_MAP.get(item.get("status"), "beschikbaar")
+```
+
+### Postcode Extraction
+Always aim for the **Dutch postcode format** (4 digits + 2 letters, e.g., `"2611CA"`). The travel time calculation depends on it. If a broker only provides the address string, use `_extract_postcode(address)`.
+
+### Price Handling
+Prices are **integers** (euros), never floats. Use `parse_prijs()` for HTML.
+
+### Image URLs
+Store the hero/main image URL in `hero_image_url`. This appears in Home Assistant notifications.
+
+### Extra Data
+If a broker provides extra fields that don't fit the schema (e.g., balcony, garden, orientation), store them in the `extra` dict:
+```python
+listings.append(RawListing(
+    url=...,
+    ...
+    extra={
+        "balcony": item.get("has_balcony"),
+        "garden": item.get("has_garden"),
+        "custom_field": item.get("something_else"),
+    }
+))
+```
+
+The database stores this as JSON in the `extra` column.
+
+### Error Handling
+- Wrap individual listing parsing in try/except to continue on one bad listing
+- Log parse failures at warning level, not error level (broker HTML changes often, so these are expected)
+- Let HTTP errors bubble up (the runner catches them at the adapter level)
+
+### Rate Limiting & Ethics
+- Both `fetch_json()` and `fetch_soup()` handle 429 Retry-After automatically
+- Nominatim (geocoding) has a 1 req/s limiter built into `huizenbot.py`
+- Never spawn parallel requests without the human's approval
+- Always use the `USER_AGENT` header (includes contact info for respectful scraping)
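+
+For intuition about what "handle 429 Retry-After" involves, the sketch below turns a `Retry-After` header value into a wait time. It is not the code inside `fetch_json()` or `fetch_soup()`; `retry_after_seconds`, its default, and its cap are assumptions chosen for illustration.
+
+```python
+from __future__ import annotations
+
+from datetime import datetime, timezone
+from email.utils import parsedate_to_datetime
+
+
+def retry_after_seconds(
+    header: str | None,
+    *,
+    now: datetime | None = None,
+    default: float = 5.0,
+    cap: float = 120.0,
+) -> float:
+    """Seconds to wait before retrying, given a Retry-After header value.
+
+    The header is either a number of seconds or an HTTP date. Missing or
+    unparseable values fall back to `default`; long waits are capped.
+    """
+    if not header:
+        return default
+    header = header.strip()
+    if header.isdigit():
+        return min(float(header), cap)
+    try:
+        when = parsedate_to_datetime(header)
+    except (TypeError, ValueError):
+        return default
+    if when.tzinfo is None:
+        when = when.replace(tzinfo=timezone.utc)
+    current = now or datetime.now(timezone.utc)
+    return min(max((when - current).total_seconds(), 0.0), cap)
+
+
+print(retry_after_seconds("10"))
+```
+
+A caller would sleep for the returned number of seconds before retrying the same request once, rather than firing additional requests in parallel.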
+
+---
+
+## Example: Adding "Van Daal" (API-based)
+
+### Scenario
+The human finds that Van Daal (vandaalmakelaardij.nl) has a JSON API at:
+```
+https://api.vandaal.nl/listings?city=delft&status=available
+```
+
+### Your Code (add to api.py)
+
+```python
+# Van Daal
+# --------
+_VANDAAL_BASE = "https://www.vandaalmakelaardij.nl"
+_VANDAAL_API = "https://api.vandaal.nl/listings"
+
+_VANDAAL_STATUS_MAP = {
+    "available": "beschikbaar",
+    "under_offer": "onder_bod",
+    "sold": "verkocht",
+}
+
+def fetch_vandaal() -> list[RawListing]:
+    listings = []
+    for city in ["delft", "schiedam"]:
+        data = fetch_json(
+            _VANDAAL_API,
+            params={"city": city, "status": "available"}
+        )
+        
+        for item in data.get("listings", []):
+            if (item.get("price") or 0) > config.MAX_PRICE:  # also safe when price is null
+                continue
+            
+            listings.append(RawListing(
+                url=item["url"],
+                source_makelaar="vandaal",
+                status=_VANDAAL_STATUS_MAP.get(item.get("status"), "beschikbaar"),
+                adres=item.get("address"),
+                postcode=item.get("postcode"),
+                stad=item.get("city"),
+                prijs=item.get("price"),
+                woningtype=item.get("type"),
+                woonoppervlak=item.get("living_area"),
+                slaapkamers=item.get("bedrooms"),
+                hero_image_url=item.get("image_url"),
+            ))
+    
+    log.info("vandaal: %d listings", len(listings))
+    return listings
+```
+
+### Register in SCRAPERS (in api.py)
+```python
+SCRAPERS = {
+    'bjornd': fetch_bjornd,
+    'vandaal': fetch_vandaal,  # ← Add this
+}
+```
+
+### Test
+Human updates `test_adapters.py`:
+```python
+ADAPTER = SCRAPERS['vandaal']
+```
+
+Then runs:
+```bash
+cd tests && python test_adapters.py
+```
+
+If all looks good, the human copies the `fetch_vandaal()` function into the real `api.py` and adds it to `SCRAPERS`.
+
+---
+
+## Summary
+
+1. **You receive** an adapter request and the investigation results (API endpoint or HTML structure)
+2. **You write** a clean, self-contained scraper function that returns `list[RawListing]`
+3. **You register** it in the appropriate `SCRAPERS` dict
+4. **The human tests** it with `test_adapters.py` and validates output
+5. **The human merges** your code into the production files
+
+Keep code simple, use the provided helpers, populate `RawListing` fields as best you can, and always set `source_makelaar` and `url` correctly.