# Huizenbot: Agent Context for Adding Routes

## Project Overview

**Huizenbot** is a periodic scraper of real estate broker websites in Delft and Schiedam (Netherlands). It:

- Fetches property listings from broker websites
- Saves new ones to SQLite using the `RawListing` schema
- Calculates travel times (bike + public transit) to two work locations
- Sends push notifications via a Home Assistant webhook (with email fallback)

**Your role:** You will add new broker routes (scrapers) to the `src/adapters/` directory. A human will:

1. Select a broker from the list
2. Help you investigate the broker's website
3. For API-based brokers: develop curl requests to test
4. For HTML scrapers: develop parsing logic using BeautifulSoup
5. Run `tests/test_adapters.py` to validate
6. Merge your code snippets into the codebase

---

## Key Schema: RawListing

**Location:** `src/huizenbot.py` (lines 29–52)

This is the data model you must populate. All fields except `url` are optional:

```python
@dataclass
class RawListing:
    url: str                                  # REQUIRED: the listing URL
    source_makelaar: str = ""                 # Name of the broker (e.g., "bjornd", "vdaal")
    datum_aanmelding: str | None = None       # ISO 8601 date if available
    status: str = "beschikbaar"               # enum: beschikbaar | onder_bod | verkocht

    # Location
    adres: str | None = None                  # Street address (e.g., "Binnenwatersloot 3")
    postcode: str | None = None               # Dutch postcode (e.g., "2611CA")
    stad: str | None = None                   # City (e.g., "Delft")

    # Property details
    prijs: int | None = None                  # Price in euros (integer, no float)
    woningtype: str | None = None             # Type (e.g., "appartement", "tussenwoning")
    woonoppervlak: int | None = None          # Living space in m²
    perceeloppervlak: int | None = None       # Plot size in m² (NULL for apartments)
    kamers: int | None = None                 # Number of rooms
    slaapkamers: int | None = None            # Number of bedrooms
    bouwjaar: int | None = None               # Build year
    energielabel: str | None = None           # Energy label (e.g., "A", "B")

    # Media
    hero_image_url: str | None = None         # Main photo URL

    # Extra data (broker-specific fields)
    extra: dict[str, Any] = field(default_factory=dict)  # Arbitrary JSON data
```

**DB Upsert:** The listing is inserted on first run (with `id = sha256(url)`). On subsequent runs only `last_seen` and `status` are updated. Travel times are calculated only on first insert.

---

## Adapter Structure

Adapters live in `src/adapters/` and are organized by type:

### Two Adapter Types

#### 1. **API-based** (`src/adapters/api.py`)

For brokers with REST/JSON endpoints.

**Pattern:**

```python
def fetch_bjornd() -> list[RawListing]:
    data = fetch_json("https://...", params={...}, headers={...})
    listings = []
    for item in data:
        # Filter / validate
        if item.get("status") in _SKIP:
            continue
        # Treat a missing price as 0 so the comparison never sees None
        if (item.get("price") or 0) > config.MAX_PRICE:
            continue
        listings.append(RawListing(
            url=item["url"],
            source_makelaar="bjornd",
            adres=item.get("address"),
            postcode=item.get("zipcode"),
            # ... etc
        ))
    log.info("bjornd: %d listings", len(listings))
    return listings
```

**Helpers available:**

- `fetch_json(url, *, params=None, headers=None)`: GET with User-Agent, timeout, Retry-After handling
- Built-in logging via `log = logging.getLogger("huizenbot.api")`

#### 2. **SSR/HTML-based** (`src/adapters/ssr/` package)

For brokers with server-side rendered HTML.
The package is split by CMS platform:

- `realworks.py`: Realworks CMS (li/div.aanbodEntry cards + span.kenmerk detail)
- `sure.py`: SURE WordPress plugin (/wonen?sure_koop_huur=koop + #kenmerken detail)
- `schiedam.py`: Custom Schiedam scrapers (diverse platforms)
- `denhaag.py`: Den Haag scrapers (diverse platforms)
- `overige.py`: Other / multi-city scrapers (OG Online WP, Elementor)

**Pattern:**

```python
def fetch_vdaal() -> list[RawListing]:
    soup = fetch_soup("https://vdaalmakelaardij.nl/aanbod")
    listings = []
    for card in soup.select(".property-card"):
        try:
            url = card.select_one("a[href]")["href"]
            if not url.startswith("http"):
                url = VDAAL_BASE + url
            adres = _text(card, ".address-selector")
            postcode = _extract_postcode(adres)
            prijs = parse_prijs(_text(card, ".price"))
            listings.append(RawListing(
                url=url,
                source_makelaar="vdaal",
                adres=adres,
                postcode=postcode,
                stad=_infer_stad(postcode),
                prijs=prijs,
                # ... etc
            ))
        except Exception as e:
            log.warning("Parse error: %s", e)
    log.info("vdaal: %d listings", len(listings))
    return listings
```

**Helpers available:**

- `fetch_soup(url, *, params=None)`: GET with BeautifulSoup, Retry-After handling
- `parse_prijs(text)`: Extract price from strings like "€ 325.000 k.k." → 325000
- `parse_m2(text)`: Extract area from "87 m²" → 87
- `_text(soup, selector)`: Get inner text from element
- `_src(soup, selector)`: Get src or data-src attribute
- `_extract_postcode(text)`: Regex postcode from any text
- `_infer_stad(postcode)`: Simple lookup: 2600–2629 → Delft, 3100–3135 → Schiedam (Den Haag is not in this helper; use the city value from the broker directly)

---

## Registration

**API scrapers** (`src/adapters/api.py`): Add your function and register it in the `SCRAPERS` dict at the bottom of the file.

**SSR scrapers**: Add your function to the appropriate submodule (`realworks.py`, `sure.py`, `schiedam.py`, `denhaag.py`, or `overige.py`), then import it in `src/adapters/ssr/__init__.py` and add it to the `SCRAPERS` dict there.

```python
# api.py: SCRAPERS dict
SCRAPERS = {
    'bjornd': fetch_bjornd,
    'your_broker': fetch_your_broker,  # ← Add here
}

# ssr/__init__.py: import + register
from .realworks import fetch_your_broker  # ← import from the right submodule

SCRAPERS = {
    # ... existing entries ...
    'your_broker': fetch_your_broker,  # ← Add here
}
```

The `src/adapters/__init__.py` merges both dicts, so the runner picks up all registered adapters automatically.

---

## Testing Workflow

### 1. Understand the Website

The human will help you:

- Identify the broker's API endpoint (or the HTML structure)
- Check for a `robots.txt` or rate limit headers
- Write exploratory curl requests (for APIs) or BeautifulSoup inspections

### 2. Develop & Test Locally

- Add your scraper function to the appropriate file (`api.py` or the right `ssr/` submodule)
- Register it in the `SCRAPERS` dict
- The human updates `tests/test_adapters.py` to point to your adapter:
  ```python
  ADAPTER = SCRAPERS['your_broker_name']
  ```
- Run the test:
  ```bash
  cd tests && python test_adapters.py
  ```
- The test prints listings in a simple format so you can validate output

### 3. Merge Code

Once validated, the human will **copy your inline code snippets** into the main codebase. You produce **easily pasteable functions**, not entire files.

---

## Config & Constants

**Location:** `src/config.py`

Key values you may reference:

- `MAX_PRICE = 300_000`: Price filter (your scraper can skip listings above this)
- `USER_AGENT = "Huizenbot/1.0 (+mark@kalsbeek.dev) persoonlijk gebruik"`: Used in all HTTP headers
- `MARK_WERK_POSTCODE`, `MICHELLE_WERK_POSTCODE`: Work postcodes for travel time calculation

Secrets (API keys, webhook URLs) are **environment variables**, not in config.
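
The `MAX_PRICE` filter can be applied defensively, since a broker may omit the price for some listings. The following is a minimal sketch, not code from the repository: `within_budget` is a hypothetical helper name, the constant is redefined locally only to keep the snippet self-contained, and keeping listings without a price is an assumption.

```python
# Local copy of the documented value; in the project this would be config.MAX_PRICE.
MAX_PRICE = 300_000


def within_budget(price, max_price=MAX_PRICE):
    """Return True when a listing should be kept by the price filter."""
    if price is None:
        # Assumption: a listing without a published price is kept,
        # so a human can still look at it.
        return True
    return price <= max_price


# Example: filter a few hypothetical API items.
items = [{"price": 295_000}, {"price": 450_000}, {"price": None}]
kept = [item for item in items if within_budget(item.get("price"))]
```

Comparing `None` with an integer raises `TypeError` in Python 3, which is why the missing-price case is handled before the comparison.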

---

## Platform / CMS Quick Identification

Before investigating a broker's HTML manually, check for known platforms in this order:

### 1. OG Online / realtime-listings (API, fastest)

**File:** `src/adapters/api.py`

Check if `https://<broker-domain>/nl/realtime-listings/consumer` returns JSON (with header `X-Requested-With: XMLHttpRequest`). If yes, this is a 10-line addition to `api.py`.

Known brokers: bjornd, moerman, vandaal, elzenaar, doen.

Fields: `isSales`, `statusOrig`, `salesPrice`, `address`, `zipcode`, `city`, `rooms`, `bedrooms`, `livingSurface`, `plotSurface`, `dateOfConstruction`, `energyLabel`, `type`, `photo`, `url`.

Add a `_CITIES` set to filter by city if the broker covers a wide area. Skip statuses `"rented"` and `"rented_ur"`.

### 2. Realworks CMS (SSR, one-liner)

**File:** `src/adapters/ssr/realworks.py`

Run `autoscraper.py` or check the HTML for `li.aanbodEntry`. If detected:

```python
def fetch_mybroker() -> list[RawListing]:
    return fetch_realworks("https://www.mybroker.nl", "mybroker")
```

### 3. SURE WordPress Plugin (SSR, ~50 lines)

**File:** `src/adapters/ssr/sure.py`

Check the HTML for `sure-` CSS classes or the `?sure_koop_huur=koop` filter. There are two card variants:

- `a.card-house` (single dash), e.g. Olsthoorn
- `a.card--house` (double dash), e.g. Borgdorff

Both use `?sure_koop_huur=koop` to filter buy listings and `/page/{N}/` pagination. The detail page always has `#kenmerken li span span` pairs with labels like `status`, `soort woonhuis`/`soort woning`/`soort bouw`, `bouwjaar`, `gebruiksoppervlakte wonen`, `perceeloppervlakte`, `aantal slaapkamers`, `energielabel`. The postcode is often **not** available on the detail page. Terminate pagination when `len(cards) < expected_per_page` (typically 15 for SURE).

### 4. Unknown CMS

**File:** `src/adapters/ssr/schiedam.py`, `denhaag.py`, or `overige.py` depending on city, or add a new file if needed.

Run the autoscraper tool:

```bash
python autoscraper.py listings
python autoscraper.py details
```

It prints structural diagnostics (card selectors, field patterns, pagination) to guide manual adapter development.

## Important Notes

Don't treat detail pages as optional: we always want all the info!

### Status Mapping

Brokers use different status strings. Always map to one of:

- `"beschikbaar"`: Available for sale
- `"onder_bod"`: Under offer
- `"verkocht"`: Sold

Example from api.py:

```python
_STATUS_MAP = {
    "available": "beschikbaar",
    "under_bid": "onder_bod",
    "sold": "verkocht",
}
status = _STATUS_MAP.get(item.get("status"), "beschikbaar")
```

### Postcode Extraction

Always aim for the **Dutch postcode format** (4 digits + 2 letters, e.g., `"2611CA"`). The travel time calculation depends on it.

If a broker only provides the address string, use `_extract_postcode(address)`.

If a postcode field contains extra text (e.g., `"2522GW Den Haag"`), extract it cleanly with:

```python
import re

# raw is the broker's postcode field, e.g. "2522GW Den Haag"
m = re.search(r"\d{4}\s*[A-Z]{2}", raw.upper())
postcode = m.group(0).replace(" ", "") if m else None
```

Never just `.replace(" ", "")`: that produces garbage like `"2522GWDenHaag"`.

### Price Handling

Prices are **integers** (euros), never floats. Use `parse_prijs()` for HTML.

### Image URLs

Store the hero/main image URL in `hero_image_url`. This appears in Home Assistant notifications.

### Extra Data

If a broker provides extra fields that don't fit the schema (e.g., balcony, garden, orientation), store them in the `extra` dict:

```python
listings.append(RawListing(
    url=...,
    # ... other fields ...
    extra={
        "balcony": item.get("has_balcony"),
        "garden": item.get("has_garden"),
        "custom_field": item.get("something_else"),
    }
))
```

The database stores this as JSON in the `extra` column.
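
The status, postcode, and extra-data notes above can be combined into a few small normalizers. This is a sketch under stated assumptions, not the project's helpers: `map_status`, `clean_postcode`, and `compact_extra` are hypothetical names, the status strings are the illustrative ones from the example map, and dropping `None` values from `extra` is a design choice rather than a documented requirement.

```python
import json
import re

# Illustrative broker-to-internal status map; real brokers use other strings.
_STATUS_MAP = {
    "available": "beschikbaar",
    "under_bid": "onder_bod",
    "sold": "verkocht",
}

# Four digits, optional whitespace, two letters (applied to upper-cased text).
_POSTCODE_RE = re.compile(r"(\d{4})\s*([A-Z]{2})")


def map_status(raw):
    """Map a broker status string to an internal one, defaulting to 'beschikbaar'."""
    key = (raw or "").strip().lower()
    return _STATUS_MAP.get(key, "beschikbaar")


def clean_postcode(raw):
    """Return a postcode like '2611CA', or None when the text contains none."""
    m = _POSTCODE_RE.search((raw or "").upper())
    return m.group(1) + m.group(2) if m else None


def compact_extra(fields):
    """Drop keys whose value is None so the stored JSON stays small."""
    return {k: v for k, v in fields.items() if v is not None}


# Example: what would end up serialized for the extra column.
extra_json = json.dumps(compact_extra({"balcony": True, "garden": None}))
```

Using capture groups for the digits and letters avoids a separate whitespace-stripping step on the matched text.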

### Error Handling

- Wrap individual listing parsing in try/except so one bad listing does not stop the rest
- Log parse problems as warnings, not errors (broker HTML changes over time)
- Let HTTP errors bubble up (the runner catches them at the adapter level)

### Rate Limiting & Ethics

- Both `fetch_json()` and `fetch_soup()` handle 429 Retry-After automatically
- Nominatim (geocoding) has a 1 req/s limiter built into `huizenbot.py`
- Never spawn parallel requests without the human's approval
- Always use the `USER_AGENT` header (includes contact info for respectful scraping)
- Don't keep curling the same endpoint: save the response to a `.dump` file and `rg` through it to find what you need. You can also pipe it through `bsprettify.py` and `rg` that output.
- Don't over-investigate pagination: confirm the card count on page 1, assume it's consistent across pages, and move on. Never fetch multiple pages just to verify the per-page count.

---

## Example: Adding "Van Daal" (API-based)

### Scenario

The human finds that Van Daal (vandaalmakelaardij.nl) has a JSON API at:

```
https://api.vandaal.nl/listings?city=delft&status=available
```

### Your Code (add to api.py)

```python
# Van Daal
# --------
_VANDAAL_BASE = "https://www.vandaalmakelaardij.nl"
_VANDAAL_API = "https://api.vandaal.nl/listings"
_VANDAAL_STATUS_MAP = {
    "available": "beschikbaar",
    "under_offer": "onder_bod",
    "sold": "verkocht",
}


def fetch_vandaal() -> list[RawListing]:
    listings = []
    for city in ["delft", "schiedam"]:
        data = fetch_json(
            _VANDAAL_API,
            params={"city": city, "status": "available"}
        )
        for item in data.get("listings", []):
            # Treat a missing or null price as 0 so the comparison never sees None
            if (item.get("price") or 0) > config.MAX_PRICE:
                continue
            listings.append(RawListing(
                url=item["url"],
                source_makelaar="vandaal",
                # Map the broker's status string to one of the three internal values
                status=_VANDAAL_STATUS_MAP.get(item.get("status"), "beschikbaar"),
                adres=item.get("address"),
                postcode=item.get("postcode"),
                stad=item.get("city"),
                prijs=item.get("price"),
                woningtype=item.get("type"),
                woonoppervlak=item.get("living_area"),
                slaapkamers=item.get("bedrooms"),
                hero_image_url=item.get("image_url"),
            ))
    log.info("vandaal: %d listings", len(listings))
    return listings
```

### Register in SCRAPERS (in api.py)

```python
SCRAPERS = {
    'bjornd': fetch_bjornd,
    'vandaal': fetch_vandaal,  # ← Add this
}
```

### Test

The human updates `test_adapters.py`:

```python
ADAPTER = SCRAPERS['vandaal']
```

Then runs:

```bash
cd tests && python test_adapters.py
```

If all looks good, the human copies the `fetch_vandaal()` function into the real `api.py` and adds it to `SCRAPERS`.

---

## Summary

1. **You receive** an adapter request + investigation results (API endpoint or HTML structure)
2. **You write** a clean, self-contained scraper function that returns `list[RawListing]`
3. **You register** it in the appropriate `SCRAPERS` dict
4. **The human tests** it with `test_adapters.py` and validates output
5. **The human merges** your code into the production files

Keep code simple, use the provided helpers, populate `RawListing` fields as best you can, and always set `source_makelaar` and `url` correctly.
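
Before handing a function over, the rules in this document can be checked mechanically on a few sample results. The sketch below is only an illustration: `Listing` is a stand-in carrying a handful of `RawListing` fields (not the real class), and `check_listing` is a hypothetical helper that does not exist in the repository.

```python
import re
from dataclasses import dataclass
from typing import Optional

_STATUSES = {"beschikbaar", "onder_bod", "verkocht"}
_POSTCODE_RE = re.compile(r"\d{4}[A-Z]{2}")


@dataclass
class Listing:
    """Stand-in with a few RawListing fields; not the real class."""
    url: str
    source_makelaar: str = ""
    status: str = "beschikbaar"
    postcode: Optional[str] = None
    prijs: Optional[int] = None


def check_listing(item):
    """Return a list of human-readable problems; an empty list means it looks fine."""
    problems = []
    if not item.url.startswith(("http://", "https://")):
        problems.append("url is not absolute")
    if not item.source_makelaar:
        problems.append("source_makelaar is empty")
    if item.status not in _STATUSES:
        problems.append(f"unknown status: {item.status!r}")
    if item.postcode is not None and not _POSTCODE_RE.fullmatch(item.postcode):
        problems.append(f"malformed postcode: {item.postcode!r}")
    if item.prijs is not None and not isinstance(item.prijs, int):
        problems.append("prijs is not an integer")
    return problems


# Example: one clean listing and one that breaks several rules.
good = Listing(url="https://example.com/x", source_makelaar="vandaal",
               postcode="2611CA", prijs=295000)
bad = Listing(url="/aanbod/x", status="sold",
              postcode="2522GWDenHaag", prijs=295000.0)
```

Returning a list of problems instead of raising keeps the check usable inside the per-listing try/except pattern, where a warning is preferred over a hard failure.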