Huizenbot — Agent Context for Adding Routes
Project Overview
Huizenbot is a periodic scraper of real estate broker websites in Delft and Schiedam (Netherlands). It:
- Fetches property listings from broker websites
- Saves new ones to SQLite using the RawListing schema
- Calculates travel times (bike + public transit) to two work locations
- Sends push notifications via Home Assistant webhook (with email fallback)
Your role: You will add new broker routes (scrapers) to the adapters/ directory. A human will:
- Select a broker from the list
- Help you investigate the broker's website
- For API-based brokers: develop curl requests to test
- For HTML scrapers: develop parsing logic using BeautifulSoup
- Run tests/test_adapters.py to validate
- Merge your code snippets into the codebase
Key Schema: RawListing
Location: src/huizenbot.py (lines 29–52)
This is the data model you must populate. All fields except url are optional:
@dataclass
class RawListing:
    url: str  # REQUIRED: the listing URL
    source_makelaar: str = ""  # Name of the broker (e.g., "bjornd", "vdaal")
    datum_aanmelding: str | None = None  # ISO 8601 date if available
    status: str = "beschikbaar"  # enum: beschikbaar | onder_bod | verkocht
    # Location
    adres: str | None = None  # Street address (e.g., "Binnenwatersloot 3")
    postcode: str | None = None  # Dutch postcode (e.g., "2611CA")
    stad: str | None = None  # City (e.g., "Delft")
    # Property details
    prijs: int | None = None  # Price in euros (integer, no float)
    woningtype: str | None = None  # Type (e.g., "appartement", "tussenwoning")
    woonoppervlak: int | None = None  # Living space in m²
    perceeloppervlak: int | None = None  # Plot size in m² (NULL for apartments)
    kamers: int | None = None  # Number of rooms
    slaapkamers: int | None = None  # Number of bedrooms
    bouwjaar: int | None = None  # Build year
    energielabel: str | None = None  # Energy label (e.g., "A", "B")
    # Media
    hero_image_url: str | None = None  # Main photo URL
    # Extra data (broker-specific fields)
    extra: dict[str, Any] = field(default_factory=dict)  # Arbitrary JSON data
DB Upsert: A listing is inserted the first time it is seen (with id = sha256(url)). On subsequent runs only last_seen and status are updated. Travel times are calculated only on first insert.
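The upsert behaviour described above can be sketched as follows. This is a minimal illustration, not the code in huizenbot.py: the table name listings, the first_seen column, and the helper names listing_id and upsert_listing are assumptions made for the example.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone


def listing_id(url: str) -> str:
    """Stable primary key: hex SHA-256 digest of the listing URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def upsert_listing(conn: sqlite3.Connection, url: str, status: str) -> bool:
    """Insert a new listing, or refresh last_seen/status. Returns True on first insert."""
    now = datetime.now(timezone.utc).isoformat()
    lid = listing_id(url)
    cur = conn.execute(
        "INSERT OR IGNORE INTO listings (id, url, status, first_seen, last_seen) "
        "VALUES (?, ?, ?, ?, ?)",
        (lid, url, status, now, now),
    )
    if cur.rowcount == 1:
        return True  # new row: this is the point where travel times would be calculated
    # Known listing: only last_seen and status change
    conn.execute(
        "UPDATE listings SET last_seen = ?, status = ? WHERE id = ?",
        (now, status, lid),
    )
    return False


# Hypothetical schema, in memory, just for the demonstration
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings (id TEXT PRIMARY KEY, url TEXT, status TEXT, "
    "first_seen TEXT, last_seen TEXT)"
)
```

Because the id is derived from the URL alone, a broker that changes a listing's URL will produce a new row, so adapters should emit one canonical URL per listing.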
Adapter Structure
Adapters live in src/adapters/ and are organized by type:
Two Adapter Types
1. API-based (src/adapters/api.py)
For brokers with REST/JSON endpoints.
Pattern:
def fetch_bjornd() -> list[RawListing]:
    data = fetch_json("https://...", params={...}, headers={...})
    listings = []
    for item in data:
        # Filter / validate
        if item.get("status") in _SKIP:
            continue
        # A missing price yields None; treat it as 0 so the comparison cannot raise TypeError
        if (item.get("price") or 0) > config.MAX_PRICE:
            continue
        listings.append(RawListing(
            url=item["url"],
            source_makelaar="bjornd",
            adres=item.get("address"),
            postcode=item.get("zipcode"),
            # ... etc
        ))
    log.info("bjornd: %d listings", len(listings))
    return listings
Helpers available:
- fetch_json(url, *, params=None, headers=None): GET with User-Agent, timeout, Retry-After handling
- Built-in logging via log = logging.getLogger("huizenbot.api")
2. SSR/HTML-based (src/adapters/ssr.py)
For brokers with server-side rendered HTML.
Pattern:
def fetch_vdaal() -> list[RawListing]:
    soup = fetch_soup("https://vdaalmakelaardij.nl/aanbod")
    listings = []
    for card in soup.select(".property-card"):
        try:
            url = card.select_one("a[href]")["href"]
            if not url.startswith("http"):
                url = VDAAL_BASE + url
            adres = _text(card, ".address-selector")
            postcode = _extract_postcode(adres)
            prijs = parse_prijs(_text(card, ".price"))
            listings.append(RawListing(
                url=url,
                source_makelaar="vdaal",
                adres=adres,
                postcode=postcode,
                stad=_infer_stad(postcode),
                prijs=prijs,
                # ... etc
            ))
        except Exception as e:
            log.warning("Parse error: %s", e)
    log.info("vdaal: %d listings", len(listings))
    return listings
Helpers available:
- fetch_soup(url, *, params=None): GET with BeautifulSoup, Retry-After handling
- parse_prijs(text): extract price from strings like "€ 325.000 k.k." → 325000
- parse_m2(text): extract area from "87 m²" → 87
- _text(soup, selector): get inner text from element
- _src(soup, selector): get src or data-src attribute
- _extract_postcode(text): regex postcode from any text
- _infer_stad(postcode): simple lookup: 2600–2629 → Delft, 3100–3135 → Schiedam
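To make the parsing helpers' contracts concrete, here is a rough sketch of how they might behave. These are not the implementations in ssr.py: the _sketch function names and the regular expressions are assumptions, chosen only to reproduce the input/output examples listed above.

```python
import re
from typing import Optional

# Four digits, optional space, two letters. This can also match text such as
# "1998 en", so apply it to address strings rather than whole pages.
_POSTCODE_RE = re.compile(r"\b(\d{4})\s?([A-Za-z]{2})\b")


def parse_prijs_sketch(text: Optional[str]) -> Optional[int]:
    """'€ 325.000 k.k.' -> 325000; None when the text contains no digits."""
    match = re.search(r"\d[\d.]*", text or "")
    if not match:
        return None
    # Dutch prices use '.' as a thousands separator
    return int(match.group().replace(".", ""))


def parse_m2_sketch(text: Optional[str]) -> Optional[int]:
    """'87 m²' -> 87."""
    match = re.search(r"(\d+)\s*m", text or "")
    return int(match.group(1)) if match else None


def extract_postcode_sketch(text: Optional[str]) -> Optional[str]:
    """Find '2611 CA' or '2611CA' in free text and normalise it to '2611CA'."""
    match = _POSTCODE_RE.search(text or "")
    return f"{match.group(1)}{match.group(2).upper()}" if match else None


def infer_stad_sketch(postcode: Optional[str]) -> Optional[str]:
    """Map the numeric part of a postcode to a city, using the ranges listed above."""
    if not postcode or not postcode[:4].isdigit():
        return None
    number = int(postcode[:4])
    if 2600 <= number <= 2629:
        return "Delft"
    if 3100 <= number <= 3135:
        return "Schiedam"
    return None
```

All four return None instead of raising when the input does not match, which fits a schema where every field except url is optional.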
Registration
Both api.py and ssr.py have a SCRAPERS dict at the bottom:
# api.py
SCRAPERS = {
    'bjornd': fetch_bjornd,
    'your_broker': fetch_your_broker,  # ← Add here
}

# ssr.py
SCRAPERS = {
    'bjornd_demo': fetch_bjornd_demo,
    'your_broker': fetch_your_broker,  # ← Add here
}
The src/adapters/__init__.py merges both dicts, so the runner picks up all registered adapters automatically.
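How the two dicts are merged is not shown here, so the following is only a sketch of one way to do it. The function name merge_scrapers and the duplicate-name check are assumptions. A plain dict merge would silently let one adapter shadow another that uses the same key, which is why the sketch raises instead.

```python
from typing import Callable, Dict, List

Fetcher = Callable[[], List]  # an adapter function: no arguments, returns a list of listings


def merge_scrapers(*registries: Dict[str, Fetcher]) -> Dict[str, Fetcher]:
    """Combine several SCRAPERS dicts, refusing duplicate adapter names."""
    merged: Dict[str, Fetcher] = {}
    for registry in registries:
        for name, fetch in registry.items():
            if name in merged:
                raise ValueError(f"duplicate adapter name: {name!r}")
            merged[name] = fetch
    return merged
```

In practice this means a broker should be registered in exactly one of api.py or ssr.py, under a unique key.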
Testing Workflow
1. Understand the Website
The human will help you:
- Identify the broker's API endpoint (or the HTML structure)
- Check for a robots.txt or rate limit headers
- Write exploratory curl requests (for APIs) or BeautifulSoup inspections
2. Develop & Test Locally
- Add your scraper function to the appropriate file (api.py or ssr.py)
- Register it in the SCRAPERS dict
- The human updates tests/test_adapters.py to point to your adapter: ADAPTER = SCRAPERS['your_broker_name']
- Run the test: cd tests && python test_adapters.py
- The test prints listings in a simple format so you can validate output
3. Merge Code
Once validated, the human will copy your inline code snippets into the main codebase. You produce easily pasteable functions, not entire files.
Config & Constants
Location: src/config.py
Key values you may reference:
- MAX_PRICE = 300_000: price filter (your scraper can skip listings above this)
- USER_AGENT = "Huizenbot/1.0 (+mark@kalsbeek.dev) persoonlijk gebruik": used in all HTTP headers
- MARK_WERK_POSTCODE, MICHELLE_WERK_POSTCODE: work postcodes for travel time calculation
Secrets (API keys, webhook URLs) are environment variables, not in config.
CMS Detection Tool
Before investigating a broker's HTML manually, ask the human to run autoscraper.py from the project root:
python autoscraper.py listings <listings-url>
python autoscraper.py details <detail-page-url>
If the broker uses a known CMS, the tool prints the exact code to add — no further investigation needed. Currently detected CMSes:
- Realworks → prints a ready-to-paste fetch_realworks(...) one-liner for ssr.py
If the CMS is unknown, the tool prints structural diagnostics (card selectors, field patterns, pagination) to guide manual adapter development.
Important Notes
Don't treat detail pages as optional: we always want all the info!
Status Mapping
Brokers use different status strings. Always map to one of:
- "beschikbaar": available for sale
- "onder_bod": under offer
- "verkocht": sold
Example from api.py:
_STATUS_MAP = {
    "available": "beschikbaar",
    "under_bid": "onder_bod",
    "sold": "verkocht",
}
status = _STATUS_MAP.get(item.get("status"), "beschikbaar")
Postcode Extraction
Always aim for the Dutch postcode format (4 digits + 2 letters, e.g., "2611CA"). The travel time calculation depends on it. If a broker only provides the address string, use _extract_postcode(address).
Price Handling
Prices are integers (euros), never floats. Use parse_prijs() for HTML.
Image URLs
Store the hero/main image URL in hero_image_url. This appears in Home Assistant notifications.
Extra Data
If a broker provides extra fields that don't fit the schema (e.g., balcony, garden, orientation), store them in the extra dict:
listings.append(RawListing(
    url=...,
    ...
    extra={
        "balcony": item.get("has_balcony"),
        "garden": item.get("has_garden"),
        "custom_field": item.get("something_else"),
    }
))
The database stores this as JSON in the extra column.
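Since extra ends up as JSON, its values must be JSON-serialisable (strings, numbers, booleans, lists, dicts). The database layer's own serialisation is not shown here. The helper below, extra_to_json, is a hypothetical sketch of the idea, and dropping unset values is a suggestion, not documented behaviour.

```python
import json


def extra_to_json(extra: dict) -> str:
    """Drop unset (None) values and serialise with a stable key order."""
    cleaned = {key: value for key, value in extra.items() if value is not None}
    # ensure_ascii=False keeps Dutch text such as "m²" readable in the database
    return json.dumps(cleaned, ensure_ascii=False, sort_keys=True)
```

For example, `extra_to_json({"garden": True, "balcony": None, "orientation": "zuid"})` yields `{"garden": true, "orientation": "zuid"}`.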
Error Handling
- Wrap the parsing of each individual listing in try/except so one bad listing doesn't abort the rest
- Log parse failures as warnings, not errors (broker HTML changes over time)
- Let HTTP errors bubble up (the runner catches them at the adapter level)
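The per-listing pattern from the rules above can be factored into a small generic loop. This is a sketch only: parse_all and the logger name huizenbot.sketch are hypothetical and not part of the codebase.

```python
import logging
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")

log = logging.getLogger("huizenbot.sketch")  # hypothetical logger name


def parse_all(items: Iterable[T], parse_one: Callable[[T], R], source: str) -> List[R]:
    """Apply parse_one to every item; log and skip the items that fail."""
    results: List[R] = []
    for index, item in enumerate(items):
        try:
            results.append(parse_one(item))
        except Exception as exc:  # one bad listing must not abort the whole adapter
            log.warning("%s: skipping item %d: %s", source, index, exc)
    return results
```

Only the parsing step sits inside the try block. The HTTP fetch happens before the loop, so network errors still propagate to the runner.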
Rate Limiting & Ethics
- Both fetch_json() and fetch_soup() handle 429 Retry-After automatically
- Nominatim (geocoding) has a 1 req/s limiter built into huizenbot.py
- Never spawn parallel requests without the human's approval
- Always use the USER_AGENT header (includes contact info for respectful scraping)
- Don't keep curling the same endpoint: save the response to a .dump file and rg through it to find what you need. You can also pipe it through bsprettify.py first and rg the prettified output.
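For orientation, the two mechanisms mentioned above (a minimum interval between requests and honouring Retry-After) might look roughly like this. It is a sketch, not the code in huizenbot.py or the fetch helpers: MinIntervalLimiter, retry_after_seconds, and the 5-second default are assumptions.

```python
import time


class MinIntervalLimiter:
    """Enforce a minimum number of seconds between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._sleep = sleep
        self._last = None  # time at which the previous call was released

    def wait(self):
        """Sleep if the previous call was too recent; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


def retry_after_seconds(header_value, default=5.0):
    """Read a Retry-After header given in seconds; use default for anything else.

    The header may also carry an HTTP date, which this sketch does not parse.
    """
    try:
        return max(0.0, float(header_value))
    except (TypeError, ValueError):
        return default
```

A 1 req/s limit corresponds to `MinIntervalLimiter(1.0)` with `wait()` called before every request.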
Example: Adding "Van Daal" (API-based)
Scenario
The human finds that Van Daal (vandaalmakelaardij.nl) has a JSON API at:
https://api.vandaal.nl/listings?city=delft&status=available
Your Code (add to api.py)
# Van Daal
# --------
_VANDAAL_BASE = "https://www.vandaalmakelaardij.nl"
_VANDAAL_API = "https://api.vandaal.nl/listings"
_VANDAAL_STATUS_MAP = {
    "available": "beschikbaar",
    "under_offer": "onder_bod",
    "sold": "verkocht",
}

def fetch_vandaal() -> list[RawListing]:
    listings = []
    for city in ["delft", "schiedam"]:
        data = fetch_json(
            _VANDAAL_API,
            params={"city": city, "status": "available"},
        )
        for item in data.get("listings", []):
            # "or 0" guards against an explicit null price in the JSON
            if (item.get("price") or 0) > config.MAX_PRICE:
                continue
            listings.append(RawListing(
                url=item["url"],
                source_makelaar="vandaal",
                # Map the broker's status string onto the schema enum
                status=_VANDAAL_STATUS_MAP.get(item.get("status"), "beschikbaar"),
                adres=item.get("address"),
                postcode=item.get("postcode"),
                stad=item.get("city"),
                prijs=item.get("price"),
                woningtype=item.get("type"),
                woonoppervlak=item.get("living_area"),
                slaapkamers=item.get("bedrooms"),
                hero_image_url=item.get("image_url"),
            ))
    log.info("vandaal: %d listings", len(listings))
    return listings
Register in SCRAPERS (in api.py)
SCRAPERS = {
    'bjornd': fetch_bjornd,
    'vandaal': fetch_vandaal,  # ← Add this
}
Test
Human updates test_adapters.py:
ADAPTER = SCRAPERS['vandaal']
Then runs:
cd tests && python test_adapters.py
If all looks good, the human copies the fetch_vandaal() function into the real api.py and adds it to SCRAPERS.
Summary
- You receive an adapter request + investigation results (API endpoint or HTML structure)
- You write a clean, self-contained scraper function that returns list[RawListing]
- You register it in the appropriate SCRAPERS dict
- The human tests it with test_adapters.py and validates output
- The human merges your code into the production files
Keep code simple, use the provided helpers, populate RawListing fields as best you can, and always set source_makelaar and url correctly.