huizenbot/add_scraper_context.md

# Huizenbot — Agent Context for Adding Routes

## Project Overview

**Huizenbot** is a periodic scraper of real estate broker websites in Delft and Schiedam (Netherlands). It:
- Fetches property listings from broker websites
- Saves new ones to SQLite with `RawListing` schema
- Calculates travel times (bike + public transit) to two work locations
- Sends push notifications via Home Assistant webhook (with email fallback)

**Your role:** You will add new broker routes (scrapers) to the `adapters/` directory. A human will:
1. Select a broker from the list
2. Help you investigate the broker's website
3. For API-based brokers: develop curl requests to test
4. For HTML scrapers: develop parsing logic using BeautifulSoup
5. Run `tests/test_adapters.py` to validate
6. Merge your code snippets into the codebase

---

## Key Schema: RawListing

**Location:** `src/huizenbot.py` (lines 29–52)

This is the data model you must populate. All fields except `url` are optional:

```python
@dataclass
class RawListing:
    url: str                          # REQUIRED — the listing URL

    source_makelaar: str = ""         # Name of the broker (e.g., "bjornd", "vdaal")
    datum_aanmelding: str | None = None  # ISO 8601 date if available
    status: str = "beschikbaar"       # enum: beschikbaar | onder_bod | verkocht

    # Location
    adres: str | None = None          # Street address (e.g., "Binnenwatersloot 3")
    postcode: str | None = None       # Dutch postcode (e.g., "2611CA")
    stad: str | None = None           # City (e.g., "Delft")

    # Property details
    prijs: int | None = None          # Price in euros (integer, no float)
    woningtype: str | None = None     # Type (e.g., "appartement", "tussenwoning")
    woonoppervlak: int | None = None  # Living space in m²
    perceeloppervlak: int | None = None  # Plot size in m² (NULL for apartments)
    kamers: int | None = None         # Number of rooms
    slaapkamers: int | None = None    # Number of bedrooms
    bouwjaar: int | None = None       # Build year
    energielabel: str | None = None   # Energy label (e.g., "A", "B")

    # Media
    hero_image_url: str | None = None # Main photo URL

    # Extra data (broker-specific fields)
    extra: dict[str, Any] = field(default_factory=dict)  # Arbitrary JSON data
```

**DB Upsert:** A listing is inserted the first time it is seen (with `id = sha256(url)`). On subsequent runs only `last_seen` and `status` are updated. Travel times are calculated only on first insert.
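
To make the upsert rule concrete, here is a minimal sketch using only the standard library. The table layout, the `first_seen` column, the hex-digest form of the id, and the function names are assumptions for illustration; the real schema and upsert code live in `src/huizenbot.py` and may differ.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone


def listing_id(url: str) -> str:
    """Stable primary key: hex SHA-256 of the listing URL (hex form is an assumption)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def upsert(conn: sqlite3.Connection, url: str, status: str) -> bool:
    """Insert a new listing, or refresh last_seen/status. Returns True on first insert."""
    now = datetime.now(timezone.utc).isoformat()
    lid = listing_id(url)
    exists = conn.execute("SELECT 1 FROM listings WHERE id = ?", (lid,)).fetchone()
    if exists:
        # Known listing: touch only last_seen and status, leave everything else alone.
        conn.execute(
            "UPDATE listings SET last_seen = ?, status = ? WHERE id = ?",
            (now, status, lid),
        )
        return False
    conn.execute(
        "INSERT INTO listings (id, url, status, first_seen, last_seen) "
        "VALUES (?, ?, ?, ?, ?)",
        (lid, url, status, now, now),
    )
    return True  # caller would compute travel times only in this case


# Hypothetical, simplified table for the sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings ("
    "id TEXT PRIMARY KEY, url TEXT, status TEXT, first_seen TEXT, last_seen TEXT)"
)
```

The practical consequence for adapters: a changed `url` for the same property produces a new row, so keep listing URLs stable (no tracking parameters or session tokens).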

---

## Adapter Structure

Adapters live in `src/adapters/` and are organized by type:

### Two Adapter Types

#### 1. **API-based** (`src/adapters/api.py`)
For brokers with REST/JSON endpoints.

**Pattern:**
```python
def fetch_bjornd() -> list[RawListing]:
    data = fetch_json("https://...", params={...}, headers={...})
    listings = []
    for item in data:
        # Filter / validate
        if item.get("status") in _SKIP:
            continue
        # `or 0` guards against a missing/None price, which would raise TypeError
        if (item.get("price") or 0) > config.MAX_PRICE:
            continue

        listings.append(RawListing(
            url=item["url"],
            source_makelaar="bjornd",
            adres=item.get("address"),
            postcode=item.get("zipcode"),
            # ... etc
        ))

    log.info("bjornd: %d listings", len(listings))
    return listings
```

**Helpers available:**
- `fetch_json(url, *, params=None, headers=None)` — GET with User-Agent, timeout, Retry-After handling
- Built-in logging via `log = logging.getLogger("huizenbot.api")`

#### 2. **SSR/HTML-based** (`src/adapters/ssr.py`)
For brokers with server-side rendered HTML.

**Pattern:**
```python
def fetch_vdaal() -> list[RawListing]:
    soup = fetch_soup("https://vdaalmakelaardij.nl/aanbod")
    listings = []

    for card in soup.select(".property-card"):
        try:
            url = card.select_one("a[href]")["href"]
            if not url.startswith("http"):
                url = VDAAL_BASE + url

            adres = _text(card, ".address-selector")
            postcode = _extract_postcode(adres)
            prijs = parse_prijs(_text(card, ".price"))

            listings.append(RawListing(
                url=url,
                source_makelaar="vdaal",
                adres=adres,
                postcode=postcode,
                stad=_infer_stad(postcode),
                prijs=prijs,
                # ... etc
            ))
        except Exception as e:
            log.warning("Parse error: %s", e)

    log.info("vdaal: %d listings", len(listings))
    return listings
```

**Helpers available:**
- `fetch_soup(url, *, params=None)` — GET with BeautifulSoup, Retry-After handling
- `parse_prijs(text)` — Extract price from strings like "€ 325.000 k.k." → 325000
- `parse_m2(text)` — Extract area from "87 m²" → 87
- `_text(soup, selector)` — Get inner text from element
- `_src(soup, selector)` — Get src or data-src attribute
- `_extract_postcode(text)` — Regex postcode from any text
- `_infer_stad(postcode)` — Simple lookup: 2600–2629 → Delft, 3100–3135 → Schiedam (Den Haag not in this helper; use the city value from the broker directly)
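
To illustrate the input/output behaviour described above, the following are hypothetical stand-ins, not the project's actual helpers (those live in `ssr.py` and should be used in adapters). The `_sketch` names and the regular expressions are assumptions; the postcode ranges are taken from the description of `_infer_stad` above.

```python
from __future__ import annotations

import re


def parse_prijs_sketch(text: str | None) -> int | None:
    """'€ 325.000 k.k.' -> 325000; None when no number is present."""
    if not text:
        return None
    # Prefer dot-grouped thousands ("325.000"), fall back to plain digits ("325000").
    m = re.search(r"\d{1,3}(?:\.\d{3})+|\d+", text)
    return int(m.group(0).replace(".", "")) if m else None


def parse_m2_sketch(text: str | None) -> int | None:
    """'87 m²' -> 87; decimals after a comma are dropped."""
    m = re.search(r"(\d[\d.]*)(?:,\d+)?\s*m[²2]", text or "")
    return int(m.group(1).replace(".", "")) if m else None


def extract_postcode_sketch(text: str | None) -> str | None:
    """Find a Dutch postcode anywhere in the text and normalise it to '2611CA'."""
    m = re.search(r"\b(\d{4})\s?([A-Za-z]{2})\b", text or "")
    return (m.group(1) + m.group(2).upper()) if m else None


def infer_stad_sketch(postcode: str | None) -> str | None:
    """Map the numeric part of a postcode to a city using the documented ranges."""
    if not postcode or not postcode[:4].isdigit():
        return None
    number = int(postcode[:4])
    if 2600 <= number <= 2629:
        return "Delft"
    if 3100 <= number <= 3135:
        return "Schiedam"
    return None  # e.g. Den Haag: take the city from the broker instead
```

All four return `None` instead of raising on unparseable input, which fits the optional `RawListing` fields.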

---

## Registration

Both `api.py` and `ssr.py` have a `SCRAPERS` dict at the bottom:

```python
# api.py
SCRAPERS = {
    'bjornd': fetch_bjornd,
    'your_broker': fetch_your_broker,  # ← Add here
}

# ssr.py
SCRAPERS = {
    'bjornd_demo': fetch_bjornd_demo,
    'your_broker': fetch_your_broker,  # ← Add here
}
```

The `src/adapters/__init__.py` merges both dicts, so the runner picks up all registered adapters automatically.
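
How `__init__.py` performs the merge is not shown here. If it is a plain dict union, a key that appears in both files would be silently overwritten, so adapter names should be unique across `api.py` and `ssr.py`. A minimal sketch of a merge that makes such collisions loud (the function name `merge_scrapers` is hypothetical):

```python
from __future__ import annotations

from typing import Callable

Scraper = Callable[[], list]


def merge_scrapers(*registries: dict[str, Scraper]) -> dict[str, Scraper]:
    """Combine several SCRAPERS dicts, refusing duplicate adapter names."""
    merged: dict[str, Scraper] = {}
    for registry in registries:
        for name, fetch in registry.items():
            if name in merged:
                raise ValueError(f"duplicate adapter name: {name!r}")
            merged[name] = fetch
    return merged
```

Usage would look like `SCRAPERS = merge_scrapers(api.SCRAPERS, ssr.SCRAPERS)`.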

---

## Testing Workflow

### 1. Understand the Website
The human will help you:
- Identify the broker's API endpoint (or the HTML structure)
- Check for a `robots.txt` or rate limit headers
- Write exploratory curl requests (for APIs) or BeautifulSoup inspections

### 2. Develop & Test Locally
- Add your scraper function to the appropriate file (`api.py` or `ssr.py`)
- Register it in the `SCRAPERS` dict
- The human updates `tests/test_adapters.py` to point to your adapter:
  ```python
  ADAPTER = SCRAPERS['your_broker_name']
  ```
- Run the test:
  ```bash
  cd tests && python test_adapters.py
  ```
- The test prints listings in a simple format so you can validate output

### 3. Merge Code
Once validated, the human will **copy your inline code snippets** into the main codebase. You produce **easily pasteable functions**, not entire files.

---

## Config & Constants

**Location:** `src/config.py`

Key values you may reference:
- `MAX_PRICE = 300_000` — Price filter (your scraper can skip listings above this)
- `USER_AGENT = "Huizenbot/1.0 (+mark@kalsbeek.dev) persoonlijk gebruik"` — Used in all HTTP headers
- `MARK_WERK_POSTCODE`, `MICHELLE_WERK_POSTCODE` — Work postcodes for travel time calculation

Secrets (API keys, webhook URLs) are **environment variables**, not in config.

---

## Platform / CMS Quick Identification

Before investigating a broker's HTML manually, check for known platforms in this order:

### 1. OG Online / realtime-listings (API — fastest)
Check if `https://<base>/nl/realtime-listings/consumer` returns JSON (with header `X-Requested-With: XMLHttpRequest`). If yes, this is a 10-line addition to `api.py`. Known brokers: bjornd, moerman, vandaal, elzenaar, doen.

Fields: `isSales`, `statusOrig`, `salesPrice`, `address`, `zipcode`, `city`, `rooms`, `bedrooms`, `livingSurface`, `plotSurface`, `dateOfConstruction`, `energyLabel`, `type`, `photo`, `url`.

Add a `_CITIES` set to filter by city if the broker covers a wide area. Skip statuses `"rented"` and `"rented_ur"`.
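
The field list above maps onto `RawListing` roughly as follows. This is a sketch under assumptions: that `url` and `photo` may be relative (hence `urljoin`), that numeric fields may arrive as strings, and that `dateOfConstruction` holds a year. The helper names and the contents of `_CITIES` are hypothetical; verify every field against a real response before relying on it. Status mapping is left out because the `statusOrig` values differ per broker.

```python
from __future__ import annotations

import re
from typing import Any
from urllib.parse import urljoin

_SKIP_STATUSES = {"rented", "rented_ur"}
_CITIES = {"delft", "schiedam"}  # assumption: lowercase names of cities to keep


def _to_int(value: Any) -> int | None:
    """Best-effort integer conversion; None when missing or malformed."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def _clean_postcode(raw: str | None) -> str | None:
    m = re.search(r"\d{4}\s*[A-Z]{2}", (raw or "").upper())
    return re.sub(r"\s+", "", m.group(0)) if m else None


def map_og_item(item: dict[str, Any], base_url: str) -> dict[str, Any] | None:
    """Return RawListing keyword arguments, or None if the item should be skipped."""
    if not item.get("isSales") or not item.get("url"):
        return None
    if item.get("statusOrig") in _SKIP_STATUSES:
        return None
    city = (item.get("city") or "").strip()
    if city.lower() not in _CITIES:
        return None
    photo = item.get("photo")
    return {
        "url": urljoin(base_url, item["url"]),
        "adres": item.get("address"),
        "postcode": _clean_postcode(item.get("zipcode")),
        "stad": city,
        "prijs": _to_int(item.get("salesPrice")),
        "woningtype": item.get("type"),
        "woonoppervlak": _to_int(item.get("livingSurface")),
        "perceeloppervlak": _to_int(item.get("plotSurface")),
        "kamers": _to_int(item.get("rooms")),
        "slaapkamers": _to_int(item.get("bedrooms")),
        "bouwjaar": _to_int(item.get("dateOfConstruction")),
        "energielabel": item.get("energyLabel"),
        "hero_image_url": urljoin(base_url, photo) if photo else None,
    }
```

An adapter would then build each listing with `RawListing(source_makelaar="...", **kwargs)` for every non-`None` result.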

### 2. Realworks CMS (SSR — one-liner)
Run `autoscraper.py` or check HTML for `li.aanbodEntry`. If detected:
```python
def fetch_mybroker() -> list[RawListing]:
    return fetch_realworks("https://www.mybroker.nl", "mybroker")
```

### 3. SURE WordPress Plugin (SSR — ~50 lines)
Check HTML for `sure-` CSS classes or `?sure_koop_huur=koop` filter. Two card variants:
- `a.card-house` (single dash) — e.g. Olsthoorn
- `a.card--house` (double dash) — e.g. Borgdorff

Both use `?sure_koop_huur=koop` to filter buy listings and `/page/{N}/` for pagination. The detail page always has `#kenmerken li span span` pairs with labels like `status`, `soort woonhuis`/`soort woning`/`soort bouw`, `bouwjaar`, `gebruiksoppervlakte wonen`, `perceeloppervlakte`, `aantal slaapkamers`, `energielabel`. Postcode is often **not** available on the detail page.

Terminate pagination when `len(cards) < expected_per_page` (typically 15 for SURE).
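
The termination rule can be sketched as a small generator. `fetch_page` is a hypothetical callable that returns the list of cards for page `N` (in a real adapter it would wrap `fetch_soup` and a card selector); `max_pages` is an added safety cap, not something the source specifies.

```python
from typing import Callable, Iterator


def iter_cards(
    fetch_page: Callable[[int], list],
    per_page: int = 15,
    max_pages: int = 50,
) -> Iterator:
    """Yield cards page by page; stop after the first page that is not full."""
    for page in range(1, max_pages + 1):
        cards = fetch_page(page)
        yield from cards
        if len(cards) < per_page:
            break  # short (or empty) page means this was the last one
```

One caveat of this rule: when the total is an exact multiple of `per_page`, one extra request is made for a page that is empty or missing, so `fetch_page` should return an empty list in that case rather than raise.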

### 4. Unknown CMS
Run the autoscraper tool:
```bash
python autoscraper.py listings <listings-url>
python autoscraper.py details <detail-page-url>
```
It prints structural diagnostics (card selectors, field patterns, pagination) to guide manual adapter development.

## Important Notes

Don't treat detail pages as optional; we always want all the info!

### Status Mapping
Brokers use different status strings. Always map to one of:
- `"beschikbaar"` — Available for sale
- `"onder_bod"` — Under offer
- `"verkocht"` — Sold

Example from api.py:
```python
_STATUS_MAP = {
    "available": "beschikbaar",
    "under_bid": "onder_bod",
    "sold": "verkocht",
}
status = _STATUS_MAP.get(item.get("status"), "beschikbaar")
```

### Postcode Extraction
Always aim for the **Dutch postcode format** (4 digits + 2 letters, e.g., `"2611CA"`). The travel time calculation depends on it. If a broker only provides the address string, use `_extract_postcode(address)`.

If a postcode field contains extra text (e.g., `"2522GW Den Haag"`), extract cleanly with:
```python
import re

# Match 4 digits + 2 letters, allowing optional whitespace in between
m = re.search(r"\d{4}\s*[A-Z]{2}", raw.upper())
postcode = m.group(0).replace(" ", "") if m else None
```
Never just `.replace(" ", "")` — that produces garbage like `"2522GWDenHaag"`.

### Price Handling
Prices are **integers** (euros), never floats. Use `parse_prijs()` for HTML.
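
For API sources the price may arrive as an int, a float, or a numeric string. A minimal coercion sketch follows; `to_euros` is a hypothetical name, and formatted text such as `"€ 325.000"` is deliberately rejected here because that is what `parse_prijs()` is for.

```python
from __future__ import annotations

import math
from typing import Any


def to_euros(value: Any) -> int | None:
    """Coerce an API price to whole euros; None when it cannot be trusted."""
    if value is None or isinstance(value, bool):
        return None
    if isinstance(value, int):
        return value
    if isinstance(value, float):
        return int(round(value)) if math.isfinite(value) else None
    text = str(value).strip()
    # Only plain digit strings; anything formatted belongs in parse_prijs().
    return int(text) if text.isdigit() else None
```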

### Image URLs
Store the hero/main image URL in `hero_image_url`. This appears in Home Assistant notifications.

### Extra Data
If a broker provides extra fields that don't fit the schema (e.g., balcony, garden, orientation), store them in the `extra` dict:
```python
listings.append(RawListing(
    url=item["url"],
    # ... other fields
    extra={
        "balcony": item.get("has_balcony"),
        "garden": item.get("has_garden"),
        "custom_field": item.get("something_else"),
    }
))
```

The database stores this as JSON in the `extra` column.

### Error Handling
- Wrap individual listing parsing in try/except to continue on one bad listing
- Log parse failures at warning level, not error (broker HTML changes over time, so occasional failures are expected)
- Let HTTP errors bubble up (the runner catches them at the adapter level)
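
The first two points can be sketched as one small loop. `parse_all` and `parse_one` are hypothetical names, and the logger name is a placeholder; the fetch call stays outside this loop so HTTP errors still propagate to the runner.

```python
import logging
from typing import Any, Callable, Iterable

log = logging.getLogger("huizenbot.sketch")  # placeholder logger name


def parse_all(items: Iterable[Any], parse_one: Callable[[Any], Any]) -> list:
    """Apply parse_one to every item; log and skip the ones that raise."""
    parsed = []
    for index, item in enumerate(items):
        try:
            parsed.append(parse_one(item))
        except Exception as exc:  # broad on purpose: one bad card must not abort the run
            log.warning("skipping item %d: %s", index, exc)
    return parsed
```

For example, `parse_all([{"url": "a"}, {}, {"url": "b"}], lambda d: d["url"])` returns `["a", "b"]` and logs one warning for the item without a `url`.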

### Rate Limiting & Ethics
- Both `fetch_json()` and `fetch_soup()` handle 429 Retry-After automatically
- Nominatim (geocoding) has a 1 req/s limiter built into `huizenbot.py`
- Never spawn parallel requests without the human's approval
- Always use the `USER_AGENT` header (includes contact info for respectful scraping)
- Don't keep curling the same endpoint. Save the response once to `<broker-name>.dump` and search it with `rg`. You can also pipe it through `bsprettify.py` first and search the prettified output.
- Don't over-investigate pagination — confirm card count on page 1, assume it's consistent across pages, move on. Never fetch multiple pages just to verify the per-page count.
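
For background on the first bullet: a `Retry-After` header carries either a number of seconds or an HTTP date. The sketch below shows one way to turn it into a wait time. It is not the project's implementation (`fetch_json()` and `fetch_soup()` already handle this); the function name and the 5-second default are assumptions.

```python
from __future__ import annotations

from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(
    header_value: str | None,
    default: float = 5.0,
    now: datetime | None = None,
) -> float:
    """Seconds to wait for a Retry-After value; `default` when absent or unparseable."""
    if not header_value:
        return default
    value = header_value.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```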

---

## Example: Adding "Van Daal" (API-based)

### Scenario
The human finds that Van Daal (vandaalmakelaardij.nl) has a JSON API at:
```
https://api.vandaal.nl/listings?city=delft&status=available
```

### Your Code (add to api.py)

```python
# Van Daal
# --------
_VANDAAL_BASE = "https://www.vandaalmakelaardij.nl"
_VANDAAL_API = "https://api.vandaal.nl/listings"

_VANDAAL_STATUS_MAP = {
    "available": "beschikbaar",
    "under_offer": "onder_bod",
    "sold": "verkocht",
}

def fetch_vandaal() -> list[RawListing]:
    listings = []
    for city in ["delft", "schiedam"]:
        data = fetch_json(
            _VANDAAL_API,
            params={"city": city, "status": "available"}
        )

        for item in data.get("listings", []):
            # `or 0` also covers an explicit null price in the JSON
            if (item.get("price") or 0) > config.MAX_PRICE:
                continue

            listings.append(RawListing(
                url=item["url"],
                source_makelaar="vandaal",
                status=_VANDAAL_STATUS_MAP.get(item.get("status"), "beschikbaar"),
                adres=item.get("address"),
                postcode=item.get("postcode"),
                stad=item.get("city"),
                prijs=item.get("price"),
                woningtype=item.get("type"),
                woonoppervlak=item.get("living_area"),
                slaapkamers=item.get("bedrooms"),
                hero_image_url=item.get("image_url"),
            ))

    log.info("vandaal: %d listings", len(listings))
    return listings
```

### Register in SCRAPERS (in api.py)
```python
SCRAPERS = {
    'bjornd': fetch_bjornd,
    'vandaal': fetch_vandaal,  # ← Add this
}
```

### Test
Human updates `test_adapters.py`:
```python
ADAPTER = SCRAPERS['vandaal']
```

Then runs:
```bash
cd tests && python test_adapters.py
```

If all looks good, the human copies the `fetch_vandaal()` function into the real `api.py` and adds it to `SCRAPERS`.

---

## Summary

1. **You receive** an adapter request + investigation results (API endpoint or HTML structure)
2. **You write** a clean, self-contained scraper function that returns `list[RawListing]`
3. **You register** it in the appropriate `SCRAPERS` dict
4. **The human tests** it with `test_adapters.py` and validates output
5. **The human merges** your code into the production files

Keep code simple, use the provided helpers, populate `RawListing` fields as best you can, and always set `source_makelaar` and `url` correctly.