huizenbot/add_scraper_context.md
Mark Kalsbeek f74e9bcfb0 refactor: split ssr.py into package, enrich OG Online detail pages, fix travel upsert
- Split src/adapters/ssr.py (2160 LOC) into ssr/ package grouped by CMS:
  realworks.py, sure.py, schiedam.py, denhaag.py, overige.py
- Add _og_detail() to api.py; all OG Online scrapers now fall back to
  detail page fetch when energielabel/bouwjaar are missing from the API
- Fix run() to recalculate travel times for existing listings where
  fiets_mark IS NULL; upsert() now writes travel cols on existing rows too
- Update tests/cache.py to patch fetch_soup in every ssr submodule
- Update docs to reflect new package structure and mark API enrichment TODO done

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 23:39:35 +02:00


Huizenbot — Agent Context for Adding Routes

Project Overview

Huizenbot is a periodic scraper of real estate broker websites in Delft and Schiedam (Netherlands). It:

  • Fetches property listings from broker websites
  • Saves new ones to SQLite using the RawListing schema
  • Calculates travel times (bike + public transit) to two work locations
  • Sends push notifications via Home Assistant webhook (with email fallback)

Your role: You will add new broker routes (scrapers) to the adapters/ directory. A human will:

  1. Select a broker from the list
  2. Help you investigate the broker's website
  3. For API-based brokers: develop curl requests to test
  4. For HTML scrapers: develop parsing logic using BeautifulSoup
  5. Run tests/test_adapters.py to validate
  6. Merge your code snippets into the codebase

Key Schema: RawListing

Location: src/huizenbot.py (lines 29–52)

This is the data model you must populate. All fields except url are optional:

@dataclass
class RawListing:
    url: str                          # REQUIRED — the listing URL
    
    source_makelaar: str = ""         # Name of the broker (e.g., "bjornd", "vdaal")
    datum_aanmelding: str | None = None  # ISO 8601 date if available
    status: str = "beschikbaar"       # enum: beschikbaar | onder_bod | verkocht
    
    # Location
    adres: str | None = None          # Street address (e.g., "Binnenwatersloot 3")
    postcode: str | None = None       # Dutch postcode (e.g., "2611CA")
    stad: str | None = None           # City (e.g., "Delft")
    
    # Property details
    prijs: int | None = None          # Price in euros (integer, no float)
    woningtype: str | None = None     # Type (e.g., "appartement", "tussenwoning")
    woonoppervlak: int | None = None  # Living space in m²
    perceeloppervlak: int | None = None  # Plot size in m² (NULL for apartments)
    kamers: int | None = None         # Number of rooms
    slaapkamers: int | None = None    # Number of bedrooms
    bouwjaar: int | None = None       # Build year
    energielabel: str | None = None   # Energy label (e.g., "A", "B")
    
    # Media
    hero_image_url: str | None = None # Main photo URL
    
    # Extra data (broker-specific fields)
    extra: dict[str, Any] = field(default_factory=dict)  # Arbitrary JSON data

DB Upsert: A listing is inserted on first run (with id = sha256(url)); on subsequent runs only last_seen and status are updated. Travel times are calculated on first insert, and run() recalculates them for existing listings where fiets_mark IS NULL.
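
As a rough illustration of the insert-or-refresh behaviour described above, here is a minimal sketch using the standard sqlite3 module. The table name listings, the first_seen column, and the function names are assumptions made for this example; the real schema and upsert() in huizenbot.py may differ.

```python
import hashlib
import sqlite3


def listing_id(url: str) -> str:
    """Stable primary key: hex SHA-256 digest of the listing URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def upsert(conn: sqlite3.Connection, url: str, status: str, seen: str) -> bool:
    """Insert a new listing, or refresh status/last_seen. Returns True if new."""
    lid = listing_id(url)
    exists = conn.execute(
        "SELECT 1 FROM listings WHERE id = ?", (lid,)
    ).fetchone()
    if exists is None:
        conn.execute(
            "INSERT INTO listings (id, url, status, first_seen, last_seen) "
            "VALUES (?, ?, ?, ?, ?)",
            (lid, url, status, seen, seen),
        )
        return True
    # Existing row: only status and last_seen change in this sketch.
    conn.execute(
        "UPDATE listings SET status = ?, last_seen = ? WHERE id = ?",
        (status, seen, lid),
    )
    return False


# Hypothetical in-memory table, just enough columns to show the idea.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings ("
    "id TEXT PRIMARY KEY, url TEXT, status TEXT, "
    "first_seen TEXT, last_seen TEXT)"
)
```

Because the id is derived from the URL, a scraper that returns a different URL for the same property (tracking parameters, trailing slash) would create a duplicate row, which is one reason to normalise url before building the RawListing.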


Adapter Structure

Adapters live in src/adapters/ and are organized by type:

Two Adapter Types

1. API-based (src/adapters/api.py)

For brokers with REST/JSON endpoints.

Pattern:

def fetch_bjornd() -> list[RawListing]:
    data = fetch_json("https://...", params={...}, headers={...})
    listings = []
    for item in data:
        # Filter / validate
        if item.get("status") in _SKIP:
            continue
        if (item.get("price") or 0) > config.MAX_PRICE:
            continue
        
        listings.append(RawListing(
            url=item["url"],
            source_makelaar="bjornd",
            adres=item.get("address"),
            postcode=item.get("zipcode"),
            # ... etc
        ))
    
    log.info("bjornd: %d listings", len(listings))
    return listings

Helpers available:

  • fetch_json(url, *, params=None, headers=None) — GET with User-Agent, timeout, Retry-After handling
  • Built-in logging via log = logging.getLogger("huizenbot.api")
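
For illustration, here is a minimal sketch of how a Retry-After header value could be turned into a wait time. This is not the code behind fetch_json(); the function name retry_after_seconds and the 5-second default are assumptions made for this example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(
    value: Optional[str],
    now: Optional[datetime] = None,
    default: float = 5.0,
) -> float:
    """Convert a Retry-After header (delay in seconds or HTTP-date) to seconds."""
    if not value:
        return default
    value = value.strip()
    if value.isdigit():
        # Form 1: a non-negative integer number of seconds.
        return float(value)
    try:
        # Form 2: an HTTP-date such as "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means we can retry immediately.
    return max(0.0, (when - now).total_seconds())
```

Passing now explicitly keeps the function deterministic, which makes it easy to check both header forms without waiting on a real server.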

2. SSR/HTML-based (src/adapters/ssr/ package)

For brokers with server-side rendered HTML. The package is split by CMS platform:

  • realworks.py — Realworks CMS (li/div.aanbodEntry cards + span.kenmerk detail)
  • sure.py — SURE WordPress plugin (/wonen?sure_koop_huur=koop + #kenmerken detail)
  • schiedam.py — Custom Schiedam scrapers (diverse platforms)
  • denhaag.py — Den Haag scrapers (diverse platforms)
  • overige.py — Other / multi-city scrapers (OG Online WP, Elementor)

Pattern:

def fetch_vdaal() -> list[RawListing]:
    soup = fetch_soup("https://vdaalmakelaardij.nl/aanbod")
    listings = []
    
    for card in soup.select(".property-card"):
        try:
            url = card.select_one("a[href]")["href"]
            if not url.startswith("http"):
                url = VDAAL_BASE + url
            
            adres = _text(card, ".address-selector")
            postcode = _extract_postcode(adres)
            prijs = parse_prijs(_text(card, ".price"))
            
            listings.append(RawListing(
                url=url,
                source_makelaar="vdaal",
                adres=adres,
                postcode=postcode,
                stad=_infer_stad(postcode),
                prijs=prijs,
                # ... etc
            ))
        except Exception as e:
            log.warning("Parse error: %s", e)
    
    log.info("vdaal: %d listings", len(listings))
    return listings

Helpers available:

  • fetch_soup(url, *, params=None) — GET with BeautifulSoup, Retry-After handling
  • parse_prijs(text) — Extract price from strings like "€ 325.000 k.k." → 325000
  • parse_m2(text) — Extract area from "87 m²" → 87
  • _text(soup, selector) — Get inner text from element
  • _src(soup, selector) — Get src or data-src attribute
  • _extract_postcode(text) — Regex postcode from any text
  • _infer_stad(postcode) — Simple lookup: 2600–2629 → Delft, 3100–3135 → Schiedam (Den Haag is not covered by this helper; use the city value from the broker directly)

Registration

API scrapers (src/adapters/api.py): Add your function and register in the SCRAPERS dict at the bottom of the file.

SSR scrapers: Add your function to the appropriate submodule (realworks.py, sure.py, schiedam.py, denhaag.py, or overige.py), then import it in src/adapters/ssr/__init__.py and add it to the SCRAPERS dict there.

# api.py — SCRAPERS dict
SCRAPERS = {
    'bjornd': fetch_bjornd,
    'your_broker': fetch_your_broker,  # ← Add here
}

# ssr/__init__.py — import + register
from .realworks import fetch_your_broker   # ← import from the right submodule

SCRAPERS = {
    ...
    'your_broker': fetch_your_broker,  # ← Add here
}

The src/adapters/__init__.py merges both dicts, so the runner picks up all registered adapters automatically.
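
A minimal sketch of what such a merge could look like is shown below. The helper name merge_scrapers and the duplicate-name check are assumptions added for illustration; the actual __init__.py may simply combine the two dicts.

```python
from typing import Callable, Dict, List

Scraper = Callable[[], List]


def merge_scrapers(*registries: Dict[str, Scraper]) -> Dict[str, Scraper]:
    """Combine several SCRAPERS dicts, refusing silently overwritten names."""
    merged: Dict[str, Scraper] = {}
    for registry in registries:
        for name, fn in registry.items():
            if name in merged:
                raise ValueError(f"duplicate adapter name: {name!r}")
            merged[name] = fn
    return merged


# Stand-ins for api.SCRAPERS and ssr.SCRAPERS.
API_SCRAPERS = {"bjornd": lambda: []}
SSR_SCRAPERS = {"vdaal": lambda: []}
SCRAPERS = merge_scrapers(API_SCRAPERS, SSR_SCRAPERS)
```

Whatever the real merge does, adapter names need to be unique across api.py and the ssr/ package, since a plain dict merge would let the later entry replace the earlier one without warning.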


Testing Workflow

1. Understand the Website

The human will help you:

  • Identify the broker's API endpoint (or the HTML structure)
  • Check for a robots.txt or rate limit headers
  • Write exploratory curl requests (for APIs) or BeautifulSoup inspections

2. Develop & Test Locally

  • Add your scraper function to the appropriate file (api.py or the right ssr/ submodule)
  • Register it in the SCRAPERS dict
  • The human updates tests/test_adapters.py to point to your adapter:
    ADAPTER = SCRAPERS['your_broker_name']
    
  • Run the test:
    cd tests && python test_adapters.py
    
  • The test prints listings in a simple format so you can validate output

3. Merge Code

Once validated, the human will copy your inline code snippets into the main codebase. You produce easily pasteable functions, not entire files.


Config & Constants

Location: src/config.py

Key values you may reference:

  • MAX_PRICE = 300_000 — Price filter (your scraper can skip listings above this)
  • USER_AGENT = "Huizenbot/1.0 (+mark@kalsbeek.dev) persoonlijk gebruik" — Used in all HTTP headers
  • MARK_WERK_POSTCODE, MICHELLE_WERK_POSTCODE — Work postcodes for travel time calculation

Secrets (API keys, webhook URLs) are environment variables, not in config.


Platform / CMS Quick Identification

Before investigating a broker's HTML manually, check for known platforms in this order:

1. OG Online / realtime-listings (API — fastest)

File: src/adapters/api.py

Check if https://<base>/nl/realtime-listings/consumer returns JSON (with header X-Requested-With: XMLHttpRequest). If yes, this is a 10-line addition to api.py. Known brokers: bjornd, moerman, vandaal, elzenaar, doen.

Fields: isSales, statusOrig, salesPrice, address, zipcode, city, rooms, bedrooms, livingSurface, plotSurface, dateOfConstruction, energyLabel, type, photo, url.

Add a _CITIES set to filter by city if the broker covers a wide area. Skip statuses "rented" and "rented_ur".
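
The field list above maps fairly directly onto RawListing. The sketch below shows one possible mapping as a plain function that returns keyword arguments. It assumes isSales is a boolean, that url may be site-relative, and that dateOfConstruction can be passed through as the build year; none of that is confirmed here, so check a real response first. The name og_item_to_fields and the _CITIES contents are hypothetical.

```python
from typing import Optional

_SKIP_STATUSES = {"rented", "rented_ur"}
_CITIES = {"delft", "schiedam"}  # hypothetical: adjust per broker


def og_item_to_fields(item: dict, base_url: str) -> Optional[dict]:
    """Map one realtime-listings item to RawListing kwargs, or None to skip."""
    if not item.get("isSales"):
        return None  # rental listing (assumes a boolean flag)
    if item.get("statusOrig") in _SKIP_STATUSES:
        return None
    city = (item.get("city") or "").strip()
    if city.lower() not in _CITIES:
        return None
    url = item.get("url") or ""
    if not url:
        return None  # url is the only required RawListing field
    if url.startswith("/"):
        url = base_url.rstrip("/") + url
    zipcode = (item.get("zipcode") or "").replace(" ", "").upper() or None
    return {
        "url": url,
        "adres": item.get("address"),
        "postcode": zipcode,
        "stad": city,
        "prijs": item.get("salesPrice"),
        "woningtype": item.get("type"),
        "woonoppervlak": item.get("livingSurface"),
        "perceeloppervlak": item.get("plotSurface"),
        "kamers": item.get("rooms"),
        "slaapkamers": item.get("bedrooms"),
        "bouwjaar": item.get("dateOfConstruction"),  # assumed to be a year
        "energielabel": item.get("energyLabel"),
        "hero_image_url": item.get("photo"),
    }
```

An adapter would then call RawListing(source_makelaar="...", **fields) for every non-None result, and add the status mapping and price filter on top.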

2. Realworks CMS (SSR — one liner)

File: src/adapters/ssr/realworks.py

Run autoscraper.py or check HTML for li.aanbodEntry. If detected:

def fetch_mybroker() -> list[RawListing]:
    return fetch_realworks("https://www.mybroker.nl", "mybroker")

3. SURE WordPress Plugin (SSR — ~50 lines)

File: src/adapters/ssr/sure.py

Check HTML for sure- CSS classes or ?sure_koop_huur=koop filter. Two card variants:

  • a.card-house (single dash) — e.g. Olsthoorn
  • a.card--house (double dash) — e.g. Borgdorff

Both use ?sure_koop_huur=koop to filter buy listings and /page/{N}/ pagination. Detail page always has #kenmerken li span span pairs with labels like status, soort woonhuis/soort woning/soort bouw, bouwjaar, gebruiksoppervlakte wonen, perceeloppervlakte, aantal slaapkamers, energielabel. Postcode is often not available on the detail page.

Terminate pagination when len(cards) < expected_per_page (typically 15 for SURE).
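
The termination rule can be sketched as a small generator. The fetch_page callable stands in for "fetch page N and return its cards"; it, the paginate name, and the max_pages safety cap are assumptions made for this example.

```python
from typing import Callable, Iterator, List


def paginate(
    fetch_page: Callable[[int], List],
    per_page: int = 15,
    max_pages: int = 50,
) -> Iterator:
    """Yield cards page by page; stop after the first short page."""
    for page in range(1, max_pages + 1):
        cards = fetch_page(page)
        yield from cards
        if len(cards) < per_page:
            break  # a short (or empty) page is the last one


# Fake site with 32 cards, 15 per page, to show the stopping behaviour.
calls: List[int] = []


def fake_fetch(page: int) -> List[int]:
    calls.append(page)
    start = (page - 1) * 15
    return list(range(start, min(start + 15, 32)))


cards = list(paginate(fake_fetch))
```

With 32 cards the loop requests pages 1, 2, and 3 and stops on the 2-card page. If the total is an exact multiple of per_page, the rule costs one extra request that returns an empty page.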

4. Unknown CMS

File: src/adapters/ssr/schiedam.py, denhaag.py, or overige.py depending on city — or add a new file if needed.

Run the autoscraper tool:

python autoscraper.py listings <listings-url>
python autoscraper.py details <detail-page-url>

It prints structural diagnostics (card selectors, field patterns, pagination) to guide manual adapter development.

Important Notes

Don't treat detail pages as optional: we always want all the info.

Status Mapping

Brokers use different status strings. Always map to one of:

  • "beschikbaar" — Available for sale
  • "onder_bod" — Under offer
  • "verkocht" — Sold

Example from api.py:

_STATUS_MAP = {
    "available": "beschikbaar",
    "under_bid": "onder_bod",
    "sold": "verkocht",
}
status = _STATUS_MAP.get(item.get("status"), "beschikbaar")

Postcode Extraction

Always aim for the Dutch postcode format (4 digits + 2 letters, e.g., "2611CA"). The travel time calculation depends on it. If a broker only provides the address string, use _extract_postcode(address).

If a postcode field contains extra text (e.g., "2522GW Den Haag"), extract cleanly with:

m = re.search(r"\d{4}\s*[A-Z]{2}", raw.upper())
postcode = m.group(0).replace(" ", "") if m else None

Never just .replace(" ", "") — that produces garbage like "2522GWDenHaag".

Price Handling

Prices are integers (euros), never floats. Use parse_prijs() for HTML.

Image URLs

Store the hero/main image URL in hero_image_url. This appears in Home Assistant notifications.

Extra Data

If a broker provides extra fields that don't fit the schema (e.g., balcony, garden, orientation), store them in the extra dict:

listings.append(RawListing(
    url=...,
    ...
    extra={
        "balcony": item.get("has_balcony"),
        "garden": item.get("has_garden"),
        "custom_field": item.get("something_else"),
    }
))

The database stores this as JSON in the extra column.

Error Handling

  • Wrap individual listing parsing in try/except so that one bad listing doesn't abort the whole scrape
  • Log parse failures as warnings, not errors (broker HTML changes over time)
  • Let HTTP errors bubble up (the runner catches them at the adapter level)

Rate Limiting & Ethics

  • Both fetch_json() and fetch_soup() handle 429 Retry-After automatically
  • Nominatim (geocoding) has a 1 req/s limiter built into huizenbot.py
  • Never spawn parallel requests without the human's approval
  • Always use the USER_AGENT header (includes contact info for respectful scraping)
  • Don't keep curling the same endpoint: save the response once to a .dump file and rg through it to find what you need. You can also pipe it through bsprettify.py and rg the prettified output.
  • Don't over-investigate pagination — confirm card count on page 1, assume it's consistent across pages, move on. Never fetch multiple pages just to verify the per-page count.
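
As an illustration of the kind of 1 req/s limiter mentioned above, here is a minimal sketch. It is not the limiter in huizenbot.py; the class name MinIntervalLimiter and its injectable clock and sleep parameters are assumptions made for this example.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(
        self,
        min_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Block until the interval has elapsed; return the seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        # Record the time at which the caller is released.
        self._last = self._clock()
        return delay
```

Calling limiter.wait() right before each request keeps sequential traffic under the limit. The sketch is not thread-safe, which fits the rule above about not issuing parallel requests.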

Example: Adding "Van Daal" (API-based)

Scenario

The human finds that Van Daal (vandaalmakelaardij.nl) has a JSON API at:

https://api.vandaal.nl/listings?city=delft&status=available

Your Code (add to api.py)

# Van Daal
# --------
_VANDAAL_BASE = "https://www.vandaalmakelaardij.nl"
_VANDAAL_API = "https://api.vandaal.nl/listings"

_VANDAAL_STATUS_MAP = {
    "available": "beschikbaar",
    "under_offer": "onder_bod",
    "sold": "verkocht",
}

def fetch_vandaal() -> list[RawListing]:
    listings = []
    for city in ["delft", "schiedam"]:
        data = fetch_json(
            _VANDAAL_API,
            params={"city": city, "status": "available"}
        )
        
        for item in data.get("listings", []):
            if (item.get("price") or 0) > config.MAX_PRICE:
                continue
            
            listings.append(RawListing(
                url=item["url"],
                source_makelaar="vandaal",
                status=_VANDAAL_STATUS_MAP.get(item.get("status"), "beschikbaar"),
                adres=item.get("address"),
                postcode=item.get("postcode"),
                stad=item.get("city"),
                prijs=item.get("price"),
                woningtype=item.get("type"),
                woonoppervlak=item.get("living_area"),
                slaapkamers=item.get("bedrooms"),
                hero_image_url=item.get("image_url"),
            ))
    
    log.info("vandaal: %d listings", len(listings))
    return listings

Register in SCRAPERS (in api.py)

SCRAPERS = {
    'bjornd': fetch_bjornd,
    'vandaal': fetch_vandaal,  # ← Add this
}

Test

Human updates test_adapters.py:

ADAPTER = SCRAPERS['vandaal']

Then runs:

cd tests && python test_adapters.py

If all looks good, the human copies the fetch_vandaal() function into the real api.py and adds it to SCRAPERS.


Summary

  1. You receive an adapter request + investigation results (API endpoint or HTML structure)
  2. You write a clean, self-contained scraper function that returns list[RawListing]
  3. You register it in the appropriate SCRAPERS dict
  4. The human tests it with test_adapters.py and validates output
  5. The human merges your code into the production files

Keep code simple, use the provided helpers, populate RawListing fields as best you can, and always set source_makelaar and url correctly.