Huizenbot — Agent Context for Adding Routes
Project Overview
Huizenbot is a periodic scraper of real estate broker websites in Delft and Schiedam (Netherlands). It:
- Fetches property listings from broker websites
- Saves new ones to SQLite using the RawListing schema
- Calculates travel times (bike + public transit) to two work locations
- Sends push notifications via Home Assistant webhook (with email fallback)
Your role: You will add new broker routes (scrapers) to the adapters/ directory. A human will:
- Select a broker from the list
- Help you investigate the broker's website
- For API-based brokers: develop curl requests to test
- For HTML scrapers: develop parsing logic using BeautifulSoup
- Run tests/test_adapters.py to validate
- Merge your code snippets into the codebase
Key Schema: RawListing
Location: src/huizenbot.py (lines 29–52)
This is the data model you must populate. All fields except url are optional:
@dataclass
class RawListing:
    url: str                                  # REQUIRED: the listing URL
    source_makelaar: str = ""                 # Name of the broker (e.g., "bjornd", "vdaal")
    datum_aanmelding: str | None = None       # ISO 8601 date if available
    status: str = "beschikbaar"               # enum: beschikbaar | onder_bod | verkocht

    # Location
    adres: str | None = None                  # Street address (e.g., "Binnenwatersloot 3")
    postcode: str | None = None               # Dutch postcode (e.g., "2611CA")
    stad: str | None = None                   # City (e.g., "Delft")

    # Property details
    prijs: int | None = None                  # Price in euros (integer, no float)
    woningtype: str | None = None             # Type (e.g., "appartement", "tussenwoning")
    woonoppervlak: int | None = None          # Living space in m²
    perceeloppervlak: int | None = None       # Plot size in m² (NULL for apartments)
    kamers: int | None = None                 # Number of rooms
    slaapkamers: int | None = None            # Number of bedrooms
    bouwjaar: int | None = None               # Build year
    energielabel: str | None = None           # Energy label (e.g., "A", "B")

    # Media
    hero_image_url: str | None = None         # Main photo URL

    # Extra data (broker-specific fields)
    extra: dict[str, Any] = field(default_factory=dict)  # Arbitrary JSON data
DB Upsert: A listing is inserted the first time it is seen (with id = sha256(url)). On subsequent runs only its last_seen and status columns are updated. Travel times are calculated only on first insert.
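The upsert behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/huizenbot.py: the table name `listings`, the trimmed-down column set (including `first_seen`) and the function names are assumptions made for the example.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone

# Hypothetical, trimmed-down table; the real schema has many more columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS listings (
    id         TEXT PRIMARY KEY,
    url        TEXT NOT NULL,
    status     TEXT NOT NULL,
    first_seen TEXT NOT NULL,
    last_seen  TEXT NOT NULL
)
"""


def listing_id(url: str) -> str:
    """Stable ID: the sha256 hex digest of the listing URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def upsert(conn: sqlite3.Connection, url: str, status: str) -> bool:
    """Insert a new listing, or refresh last_seen/status. Returns True if new."""
    now = datetime.now(timezone.utc).isoformat()
    lid = listing_id(url)
    exists = conn.execute("SELECT 1 FROM listings WHERE id = ?", (lid,)).fetchone()
    if exists:
        # Known listing: only last_seen and status change.
        conn.execute(
            "UPDATE listings SET last_seen = ?, status = ? WHERE id = ?",
            (now, status, lid),
        )
        return False
    conn.execute(
        "INSERT INTO listings (id, url, status, first_seen, last_seen) "
        "VALUES (?, ?, ?, ?, ?)",
        (lid, url, status, now, now),
    )
    return True


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
first = upsert(conn, "https://example.com/woning/1", "beschikbaar")
second = upsert(conn, "https://example.com/woning/1", "onder_bod")
```

The boolean return value is one possible way for a runner to decide whether a listing is new and therefore needs travel times and a notification.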
Adapter Structure
Adapters live in src/adapters/ and are organized by type:
Two Adapter Types
1. API-based (src/adapters/api.py)
For brokers with REST/JSON endpoints.
Pattern:
def fetch_bjornd() -> list[RawListing]:
    data = fetch_json("https://...", params={...}, headers={...})
    listings = []
    for item in data:
        # Filter / validate
        if item.get("status") in _SKIP:
            continue
        # Treat a missing or null price as 0 so the comparison cannot raise TypeError
        if (item.get("price") or 0) > config.MAX_PRICE:
            continue
        listings.append(RawListing(
            url=item["url"],
            source_makelaar="bjornd",
            adres=item.get("address"),
            postcode=item.get("zipcode"),
            # ... etc
        ))
    log.info("bjornd: %d listings", len(listings))
    return listings
Helpers available:
- fetch_json(url, *, params=None, headers=None): GET with User-Agent, timeout, Retry-After handling
- Built-in logging via log = logging.getLogger("huizenbot.api")
2. SSR/HTML-based (src/adapters/ssr/ package)
For brokers with server-side rendered HTML. The package is split by CMS platform:
- realworks.py: Realworks CMS (li/div.aanbodEntry cards + span.kenmerk detail)
- sure.py: SURE WordPress plugin (/wonen?sure_koop_huur=koop + #kenmerken detail)
- schiedam.py: Custom Schiedam scrapers (diverse platforms)
- denhaag.py: Den Haag scrapers (diverse platforms)
- overige.py: Other / multi-city scrapers (OG Online WP, Elementor)
Pattern:
def fetch_vdaal() -> list[RawListing]:
    soup = fetch_soup("https://vdaalmakelaardij.nl/aanbod")
    listings = []
    for card in soup.select(".property-card"):
        try:
            url = card.select_one("a[href]")["href"]
            if not url.startswith("http"):
                url = VDAAL_BASE + url
            adres = _text(card, ".address-selector")
            postcode = _extract_postcode(adres)
            prijs = parse_prijs(_text(card, ".price"))
            listings.append(RawListing(
                url=url,
                source_makelaar="vdaal",
                adres=adres,
                postcode=postcode,
                stad=_infer_stad(postcode),
                prijs=prijs,
                # ... etc
            ))
        except Exception as e:
            log.warning("Parse error: %s", e)
    log.info("vdaal: %d listings", len(listings))
    return listings
Helpers available:
- fetch_soup(url, *, params=None): GET with BeautifulSoup, Retry-After handling
- parse_prijs(text): Extract price from strings like "€ 325.000 k.k." → 325000
- parse_m2(text): Extract area from "87 m²" → 87
- _text(soup, selector): Get inner text from element
- _src(soup, selector): Get src or data-src attribute
- _extract_postcode(text): Regex postcode from any text
- _infer_stad(postcode): Simple lookup: 2600–2629 → Delft, 3100–3135 → Schiedam (Den Haag not in this helper; use the city value from the broker directly)
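To make the expected inputs and outputs of the parsing helpers concrete, here is a rough sketch of how such helpers could behave. These are hypothetical re-implementations written for illustration (hence the `sketch_` prefix); the real functions in the ssr package may use different regexes and handle more edge cases.

```python
from __future__ import annotations

import re

_POSTCODE_RE = re.compile(r"\b(\d{4})\s?([A-Z]{2})\b")


def sketch_parse_prijs(text: str | None) -> int | None:
    """Turn a price string such as '€ 325.000 k.k.' into 325000."""
    if not text:
        return None
    m = re.search(r"\d[\d.]*", text)
    if not m:
        return None
    # Dutch prices use '.' as the thousands separator.
    return int(m.group(0).replace(".", ""))


def sketch_parse_m2(text: str | None) -> int | None:
    """Turn an area string such as '87 m²' into 87."""
    if not text:
        return None
    m = re.search(r"(\d+)\s*m", text)
    return int(m.group(1)) if m else None


def sketch_extract_postcode(text: str | None) -> str | None:
    """Find a Dutch postcode (4 digits + 2 letters) anywhere in the text."""
    if not text:
        return None
    m = _POSTCODE_RE.search(text.upper())
    return m.group(1) + m.group(2) if m else None


def sketch_infer_stad(postcode: str | None) -> str | None:
    """Map the numeric postcode range to a city; None when unknown."""
    if not postcode or not postcode[:4].isdigit():
        return None
    digits = int(postcode[:4])
    if 2600 <= digits <= 2629:
        return "Delft"
    if 3100 <= digits <= 3135:
        return "Schiedam"
    return None


prijs = sketch_parse_prijs("€ 325.000 k.k.")
opp = sketch_parse_m2("87 m²")
pc = sketch_extract_postcode("Binnenwatersloot 3, 2611 CA Delft")
stad = sketch_infer_stad(pc)
```

Every helper returns None instead of raising on missing input, which fits the all-optional RawListing fields.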
Registration
API scrapers (src/adapters/api.py): Add your function and register in the SCRAPERS dict at the bottom of the file.
SSR scrapers: Add your function to the appropriate submodule (realworks.py, sure.py, schiedam.py, denhaag.py, or overige.py), then import it in src/adapters/ssr/__init__.py and add it to the SCRAPERS dict there.
# api.py: SCRAPERS dict
SCRAPERS = {
    'bjornd': fetch_bjornd,
    'your_broker': fetch_your_broker,  # ← Add here
}

# ssr/__init__.py: import + register
from .realworks import fetch_your_broker  # ← import from the right submodule
SCRAPERS = {
    ...
    'your_broker': fetch_your_broker,  # ← Add here
}
The src/adapters/__init__.py merges both dicts, so the runner picks up all registered adapters automatically.
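As an illustration of what merging the two registries could look like, here is a small self-contained sketch. The dummy fetch functions, the `merge_scrapers` helper and the duplicate-name check are assumptions for the example; the actual src/adapters/__init__.py may simply combine the two dicts.

```python
from typing import Callable, Dict, List


def fetch_api_example() -> List[dict]:
    """Stand-in for an API scraper such as fetch_bjornd."""
    return []


def fetch_ssr_example() -> List[dict]:
    """Stand-in for an SSR scraper such as fetch_vdaal."""
    return []


# Stand-ins for api.SCRAPERS and ssr.SCRAPERS.
API_SCRAPERS: Dict[str, Callable[[], list]] = {"api_example": fetch_api_example}
SSR_SCRAPERS: Dict[str, Callable[[], list]] = {"ssr_example": fetch_ssr_example}


def merge_scrapers(*registries: Dict[str, Callable[[], list]]) -> Dict[str, Callable[[], list]]:
    """Combine registries, refusing silently overwritten broker names."""
    merged: Dict[str, Callable[[], list]] = {}
    for registry in registries:
        for name, func in registry.items():
            if name in merged:
                raise ValueError(f"duplicate scraper name: {name}")
            merged[name] = func
    return merged


SCRAPERS = merge_scrapers(API_SCRAPERS, SSR_SCRAPERS)
```

A plain `{**API_SCRAPERS, **SSR_SCRAPERS}` would also work. The explicit check only guards against two adapters registering under the same key.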
Testing Workflow
1. Understand the Website
The human will help you:
- Identify the broker's API endpoint (or the HTML structure)
- Check for a robots.txt or rate limit headers
- Write exploratory curl requests (for APIs) or BeautifulSoup inspections
2. Develop & Test Locally
- Add your scraper function to the appropriate file (api.py or the right ssr/ submodule)
- Register it in the SCRAPERS dict
- The human updates tests/test_adapters.py to point to your adapter: ADAPTER = SCRAPERS['your_broker_name']
- Run the test: cd tests && python test_adapters.py
- The test prints listings in a simple format so you can validate output
3. Merge Code
Once validated, the human will copy your inline code snippets into the main codebase. You produce easily pasteable functions, not entire files.
Config & Constants
Location: src/config.py
Key values you may reference:
- MAX_PRICE = 300_000: Price filter (your scraper can skip listings above this)
- USER_AGENT = "Huizenbot/1.0 (+mark@kalsbeek.dev) persoonlijk gebruik": Used in all HTTP headers
- MARK_WERK_POSTCODE, MICHELLE_WERK_POSTCODE: Work postcodes for travel time calculation
Secrets (API keys, webhook URLs) are environment variables, not in config.
Platform / CMS Quick Identification
Before investigating a broker's HTML manually, check for known platforms in this order:
1. OG Online / realtime-listings (API — fastest)
File: src/adapters/api.py
Check if https://<base>/nl/realtime-listings/consumer returns JSON (with header X-Requested-With: XMLHttpRequest). If yes, this is a 10-line addition to api.py. Known brokers: bjornd, moerman, vandaal, elzenaar, doen.
Fields: isSales, statusOrig, salesPrice, address, zipcode, city, rooms, bedrooms, livingSurface, plotSurface, dateOfConstruction, energyLabel, type, photo, url.
Add a _CITIES set to filter by city if the broker covers a wide area. Skip statuses "rented" and "rented_ur".
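The filtering and field mapping for a realtime-listings item might look roughly like the sketch below. Only the field names come from the list above. The `Listing` stand-in class, the `map_item` helper, the relative-URL handling, the sample payload and the "available" status value are assumptions for illustration and have not been checked against a real broker response.

```python
from __future__ import annotations

from dataclasses import dataclass

_SKIP = {"rented", "rented_ur"}
_CITIES = {"delft", "schiedam"}   # hypothetical city filter
MAX_PRICE = 300_000               # mirrors config.MAX_PRICE


@dataclass
class Listing:
    """Trimmed stand-in for RawListing so the sketch runs on its own."""
    url: str
    adres: str | None = None
    postcode: str | None = None
    stad: str | None = None
    prijs: int | None = None
    kamers: int | None = None
    slaapkamers: int | None = None
    woonoppervlak: int | None = None


def _to_int(value) -> int | None:
    """Coerce API values to int; None when missing or malformed."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def map_item(item: dict, base: str) -> Listing | None:
    """Return a Listing for a usable sales item, or None to skip it."""
    if not item.get("isSales") or item.get("statusOrig") in _SKIP:
        return None
    if (item.get("city") or "").lower() not in _CITIES:
        return None
    prijs = _to_int(item.get("salesPrice"))
    if prijs is not None and prijs > MAX_PRICE:
        return None
    url = item.get("url") or ""
    if url.startswith("/"):  # assumption: the API may return relative URLs
        url = base + url
    return Listing(
        url=url,
        adres=item.get("address"),
        postcode=item.get("zipcode"),
        stad=item.get("city"),
        prijs=prijs,
        kamers=_to_int(item.get("rooms")),
        slaapkamers=_to_int(item.get("bedrooms")),
        woonoppervlak=_to_int(item.get("livingSurface")),
    )


# Invented sample payload, for illustration only.
sample = [
    {"isSales": True, "statusOrig": "available", "salesPrice": 275000,
     "address": "Voorbeeldstraat 1", "zipcode": "2611CA", "city": "Delft",
     "rooms": 3, "bedrooms": 2, "livingSurface": 70, "url": "/nl/woning/voorbeeld-1"},
    {"isSales": False, "statusOrig": "rented", "city": "Delft", "url": "/nl/woning/voorbeeld-2"},
    {"isSales": True, "statusOrig": "available", "salesPrice": 450000,
     "city": "Delft", "url": "/nl/woning/voorbeeld-3"},
]
mapped = (map_item(item, "https://example.com") for item in sample)
listings = [listing for listing in mapped if listing is not None]
```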
2. Realworks CMS (SSR — one liner)
File: src/adapters/ssr/realworks.py
Run autoscraper.py or check HTML for li.aanbodEntry. If detected:
def fetch_mybroker() -> list[RawListing]:
    return fetch_realworks("https://www.mybroker.nl", "mybroker")
3. SURE WordPress Plugin (SSR — ~50 lines)
File: src/adapters/ssr/sure.py
Check HTML for sure- CSS classes or ?sure_koop_huur=koop filter. Two card variants:
- a.card-house (single dash), e.g. Olsthoorn
- a.card--house (double dash), e.g. Borgdorff
Both use ?sure_koop_huur=koop to filter buy listings and /page/{N}/ pagination. Detail page always has #kenmerken li span span pairs with labels like status, soort woonhuis/soort woning/soort bouw, bouwjaar, gebruiksoppervlakte wonen, perceeloppervlakte, aantal slaapkamers, energielabel. Postcode is often not available on the detail page.
Terminate pagination when len(cards) < expected_per_page (typically 15 for SURE).
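The termination rule can be sketched as a small loop that stops at the first short page. The `paginate` helper, its `fetch_cards` callback and the `max_pages` safety cap are hypothetical names introduced here; in a real adapter the callback would call fetch_soup on the /page/{N}/ URL and select the card elements.

```python
from typing import Callable, List


def paginate(
    fetch_cards: Callable[[int], list],
    per_page: int = 15,
    max_pages: int = 50,
) -> list:
    """Collect cards page by page; a short page means it was the last one."""
    all_cards: list = []
    for page in range(1, max_pages + 1):
        cards = fetch_cards(page)
        all_cards.extend(cards)
        if len(cards) < per_page:
            break
    return all_cards


# Fake source with 15 + 15 + 4 cards, standing in for real HTTP fetches.
_pages = {1: ["card"] * 15, 2: ["card"] * 15, 3: ["card"] * 4}
calls: List[int] = []


def fake_fetch(page: int) -> list:
    calls.append(page)
    return _pages.get(page, [])


cards = paginate(fake_fetch)
```

If the total is an exact multiple of the page size, this rule costs one extra request that returns an empty page. The cap keeps a misbehaving site from causing an endless loop.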
4. Unknown CMS
File: src/adapters/ssr/schiedam.py, denhaag.py, or overige.py depending on city — or add a new file if needed.
Run the autoscraper tool:
python autoscraper.py listings <listings-url>
python autoscraper.py details <detail-page-url>
It prints structural diagnostics (card selectors, field patterns, pagination) to guide manual adapter development.
Important Notes
Don't treat detail pages as optional: we always want all the info.
Status Mapping
Brokers use different status strings. Always map to one of:
- "beschikbaar": Available for sale
- "onder_bod": Under offer
- "verkocht": Sold
Example from api.py:
_STATUS_MAP = {
    "available": "beschikbaar",
    "under_bid": "onder_bod",
    "sold": "verkocht",
}
status = _STATUS_MAP.get(item.get("status"), "beschikbaar")
Postcode Extraction
Always aim for the Dutch postcode format (4 digits + 2 letters, e.g., "2611CA"). The travel time calculation depends on it. If a broker only provides the address string, use _extract_postcode(address).
If a postcode field contains extra text (e.g., "2522GW Den Haag"), extract cleanly with:
m = re.search(r"\d{4}\s*[A-Z]{2}", raw.upper())
postcode = m.group(0).replace(" ", "") if m else None
Never just .replace(" ", "") — that produces garbage like "2522GWDenHaag".
Price Handling
Prices are integers (euros), never floats. Use parse_prijs() for HTML.
Image URLs
Store the hero/main image URL in hero_image_url. This appears in Home Assistant notifications.
Extra Data
If a broker provides extra fields that don't fit the schema (e.g., balcony, garden, orientation), store them in the extra dict:
listings.append(RawListing(
    url=...,
    ...
    extra={
        "balcony": item.get("has_balcony"),
        "garden": item.get("has_garden"),
        "custom_field": item.get("something_else"),
    }
))
The database stores this as JSON in the extra column.
Error Handling
- Wrap individual listing parsing in try/except to continue on one bad listing
- Log parse warnings, not errors (brokers' HTML changes)
- Let HTTP errors bubble up (the runner catches them at the adapter level)
Rate Limiting & Ethics
- Both fetch_json() and fetch_soup() handle 429 Retry-After automatically
- Nominatim (geocoding) has a 1 req/s limiter built into huizenbot.py
- Never spawn parallel requests without the human's approval
- Always use the USER_AGENT header (includes contact info for respectful scraping)
- Don't keep curling the same endpoint: save the response to a .dump file once and search it with rg. You can also pipe it through bsprettify.py first and rg the prettified output.
- Don't over-investigate pagination: confirm the card count on page 1, assume it's consistent across pages, and move on. Never fetch multiple pages just to verify the per-page count.
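For readers curious how the Retry-After handling and the 1 req/s limiter mentioned above could work, here is a hedged sketch. It is not the implementation in fetch_json(), fetch_soup() or huizenbot.py; the function and class names, the default wait and the cap are assumptions chosen for the example.

```python
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None, default=5.0, cap=120.0):
    """Convert a Retry-After header (seconds or HTTP date) to a wait time."""
    if not value:
        return default
    value = value.strip()
    if value.isdigit():
        wait = float(value)
    else:
        try:
            target = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return default  # unparseable header: fall back to the default
        if target.tzinfo is None:
            target = target.replace(tzinfo=timezone.utc)
        now = now or datetime.now(timezone.utc)
        wait = (target - now).total_seconds()
    return max(0.0, min(wait, cap))


class MinIntervalLimiter:
    """Ensure at least `interval` seconds pass between consecutive calls."""

    def __init__(self, interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self._interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self._interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Demonstration with a fake clock so nothing actually sleeps.
_t = [0.0]
slept = []


def _fake_sleep(seconds):
    slept.append(seconds)
    _t[0] += seconds


limiter = MinIntervalLimiter(1.0, clock=lambda: _t[0], sleep=_fake_sleep)
limiter.wait()      # first call passes immediately
_t[0] += 0.25       # pretend 250 ms of work happened
limiter.wait()      # second call waits out the rest of the second
```

Injecting the clock and sleep functions keeps the limiter testable without real delays.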
Example: Adding "Van Daal" (API-based)
Scenario
The human finds that Van Daal (vandaalmakelaardij.nl) has a JSON API at:
https://api.vandaal.nl/listings?city=delft&status=available
Your Code (add to api.py)
# Van Daal
# --------
_VANDAAL_BASE = "https://www.vandaalmakelaardij.nl"
_VANDAAL_API = "https://api.vandaal.nl/listings"
_VANDAAL_STATUS_MAP = {
    "available": "beschikbaar",
    "under_offer": "onder_bod",
    "sold": "verkocht",
}

def fetch_vandaal() -> list[RawListing]:
    listings = []
    for city in ["delft", "schiedam"]:
        data = fetch_json(
            _VANDAAL_API,
            params={"city": city, "status": "available"}
        )
        for item in data.get("listings", []):
            if item.get("price", 0) > config.MAX_PRICE:
                continue
            listings.append(RawListing(
                url=item["url"],
                source_makelaar="vandaal",
                adres=item.get("address"),
                postcode=item.get("postcode"),
                stad=item.get("city"),
                prijs=item.get("price"),
                woningtype=item.get("type"),
                woonoppervlak=item.get("living_area"),
                slaapkamers=item.get("bedrooms"),
                hero_image_url=item.get("image_url"),
            ))
    log.info("vandaal: %d listings", len(listings))
    return listings
Register in SCRAPERS (in api.py)
SCRAPERS = {
    'bjornd': fetch_bjornd,
    'vandaal': fetch_vandaal,  # ← Add this
}
Test
Human updates test_adapters.py:
ADAPTER = SCRAPERS['vandaal']
Then runs:
cd tests && python test_adapters.py
If all looks good, the human copies the fetch_vandaal() function into the real api.py and adds it to SCRAPERS.
Summary
- You receive an adapter request + investigation results (API endpoint or HTML structure)
- You write a clean, self-contained scraper function that returns list[RawListing]
- You register it in the appropriate SCRAPERS dict
- The human tests it with test_adapters.py and validates output
- The human merges your code into the production files
Keep code simple, use the provided helpers, populate RawListing fields as best you can, and always set source_makelaar and url correctly.