Markk116/huizenbot


Mark Kalsbeek b35025b9cb ever onwards

2026-04-03 16:58:57 +02:00


Huizenbot — Agent Context for Adding Routes

Project Overview

Huizenbot is a periodic scraper of real estate broker websites in Delft and Schiedam (Netherlands). It:

Fetches property listings from broker websites
Saves new ones to SQLite with RawListing schema
Calculates travel times (bike + public transit) to two work locations
Sends push notifications via Home Assistant webhook (with email fallback)

Your role: You will add new broker routes (scrapers) to the src/adapters/ directory. A human will:

Select a broker from the list
Help you investigate the broker's website
For API-based brokers: develop curl requests to test
For HTML scrapers: develop parsing logic using BeautifulSoup
Run tests/test_adapters.py to validate
Merge your code snippets into the codebase

Key Schema: RawListing

Location: src/huizenbot.py (lines 29–52)

This is the data model you must populate. All fields except url are optional:

@dataclass
class RawListing:
    url: str                          # REQUIRED — the listing URL
    
    source_makelaar: str = ""         # Name of the broker (e.g., "bjornd", "vdaal")
    datum_aanmelding: str | None = None  # ISO 8601 date if available
    status: str = "beschikbaar"       # enum: beschikbaar | onder_bod | verkocht
    
    # Location
    adres: str | None = None          # Street address (e.g., "Binnenwatersloot 3")
    postcode: str | None = None       # Dutch postcode (e.g., "2611CA")
    stad: str | None = None           # City (e.g., "Delft")
    
    # Property details
    prijs: int | None = None          # Price in euros (integer, no float)
    woningtype: str | None = None     # Type (e.g., "appartement", "tussenwoning")
    woonoppervlak: int | None = None  # Living space in m²
    perceeloppervlak: int | None = None  # Plot size in m² (NULL for apartments)
    kamers: int | None = None         # Number of rooms
    slaapkamers: int | None = None    # Number of bedrooms
    bouwjaar: int | None = None       # Build year
    energielabel: str | None = None   # Energy label (e.g., "A", "B")
    
    # Media
    hero_image_url: str | None = None # Main photo URL
    
    # Extra data (broker-specific fields)
    extra: dict[str, Any] = field(default_factory=dict)  # Arbitrary JSON data

DB Upsert: A listing is inserted the first time it is seen, with id = sha256(url). On subsequent runs only last_seen and status are updated. Travel times are calculated only on the first insert.
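To make the upsert rule concrete, here is a minimal sketch using the standard sqlite3 module. The table name, the column set (including first_seen) and the function names are assumptions for illustration only; the real schema and logic live in src/huizenbot.py and may differ.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone


def listing_id(url: str) -> str:
    """Stable primary key: hex SHA-256 digest of the listing URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def upsert(conn: sqlite3.Connection, url: str, status: str) -> bool:
    """Insert a new listing, or refresh last_seen/status of a known one.

    Returns True only on first insert, which is when the caller would
    compute travel times.
    """
    now = datetime.now(timezone.utc).isoformat()
    lid = listing_id(url)
    known = conn.execute("SELECT 1 FROM listings WHERE id = ?", (lid,)).fetchone()
    if known is None:
        conn.execute(
            "INSERT INTO listings (id, url, status, first_seen, last_seen) "
            "VALUES (?, ?, ?, ?, ?)",
            (lid, url, status, now, now),
        )
        return True
    # Known listing: touch only status and last_seen, nothing else.
    conn.execute(
        "UPDATE listings SET status = ?, last_seen = ? WHERE id = ?",
        (status, now, lid),
    )
    return False


# Hypothetical, heavily reduced schema for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings ("
    "id TEXT PRIMARY KEY, url TEXT, status TEXT, first_seen TEXT, last_seen TEXT)"
)
```

The practical consequence for adapters: the URL must be stable across runs, because a changed URL (for example a new tracking parameter) produces a new id and therefore a duplicate listing.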

Adapter Structure

Adapters live in src/adapters/ and are organized by type:

Two Adapter Types

1. API-based (`src/adapters/api.py`)

For brokers with REST/JSON endpoints.

Pattern:

def fetch_bjornd() -> list[RawListing]:
    data = fetch_json("https://...", params={...}, headers={...})
    listings = []
    for item in data:
        # Filter / validate
        if item.get("status") in _SKIP:
            continue
        # A missing or null price must not crash the comparison
        if (item.get("price") or 0) > config.MAX_PRICE:
            continue
        
        listings.append(RawListing(
            url=item["url"],
            source_makelaar="bjornd",
            adres=item.get("address"),
            postcode=item.get("zipcode"),
            # ... etc
        ))
    
    log.info("bjornd: %d listings", len(listings))
    return listings

Helpers available:

fetch_json(url, *, params=None, headers=None) — GET with User-Agent, timeout, Retry-After handling
Built-in logging via log = logging.getLogger("huizenbot.api")

2. SSR/HTML-based (`src/adapters/ssr.py`)

For brokers with server-side rendered HTML.

Pattern:

def fetch_vdaal() -> list[RawListing]:
    soup = fetch_soup("https://vdaalmakelaardij.nl/aanbod")
    listings = []
    
    for card in soup.select(".property-card"):
        try:
            url = card.select_one("a[href]")["href"]
            if not url.startswith("http"):
                url = VDAAL_BASE + url
            
            adres = _text(card, ".address-selector")
            postcode = _extract_postcode(adres)
            prijs = parse_prijs(_text(card, ".price"))
            
            listings.append(RawListing(
                url=url,
                source_makelaar="vdaal",
                adres=adres,
                postcode=postcode,
                stad=_infer_stad(postcode),
                prijs=prijs,
                # ... etc
            ))
        except Exception as e:
            log.warning("Parse error: %s", e)
    
    log.info("vdaal: %d listings", len(listings))
    return listings

Helpers available:

fetch_soup(url, *, params=None) — GET with BeautifulSoup, Retry-After handling
parse_prijs(text) — Extract price from strings like "€ 325.000 k.k." → 325000
parse_m2(text) — Extract area from "87 m²" → 87
_text(soup, selector) — Get inner text from element
_src(soup, selector) — Get src or data-src attribute
_extract_postcode(text) — Regex postcode from any text
_infer_stad(postcode) — Simple lookup: 2600–2629 → Delft, 3100–3135 → Schiedam
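For orientation, the two parsing helpers could behave roughly as sketched below. This is not the code from ssr.py: the regexes and the None-on-failure behaviour are assumptions, so check the real helpers before relying on edge cases.

```python
import re
from typing import Optional


def parse_prijs(text: Optional[str]) -> Optional[int]:
    """Extract a euro amount with Dutch thousands separators as an int."""
    if not text:
        return None
    match = re.search(r"\d[\d.]*", text)  # first run of digits and dots
    if match is None:
        return None  # e.g. "Prijs op aanvraag"
    return int(match.group().replace(".", ""))


def parse_m2(text: Optional[str]) -> Optional[int]:
    """Extract an area in square metres from text such as '87 m²'."""
    if not text:
        return None
    match = re.search(r"(\d+)\s*m(?:²|2)", text)
    return int(match.group(1)) if match else None
```

Both return None instead of raising when nothing matches, which fits the optional fields of RawListing.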

Registration

Both api.py and ssr.py have a SCRAPERS dict at the bottom:

# api.py
SCRAPERS = {
    'bjornd': fetch_bjornd,
    'your_broker': fetch_your_broker,  # ← Add here
}

# ssr.py
SCRAPERS = {
    'bjornd_demo': fetch_bjornd_demo,
    'your_broker': fetch_your_broker,  # ← Add here
}

The src/adapters/__init__.py merges both dicts, so the runner picks up all registered adapters automatically.
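A minimal sketch of what such a merge might look like follows. The function name merge_scrapers and the duplicate-name check are assumptions added for illustration; the actual __init__.py may simply combine the two dicts.

```python
from typing import Callable, Dict


def merge_scrapers(*registries: Dict[str, Callable]) -> Dict[str, Callable]:
    """Combine several SCRAPERS dicts into one, refusing duplicate names."""
    merged: Dict[str, Callable] = {}
    for registry in registries:
        for name, fetch in registry.items():
            if name in merged:
                # With a plain dict merge, the later entry would silently win.
                raise ValueError(f"duplicate adapter name: {name}")
            merged[name] = fetch
    return merged
```

Whatever the real implementation does, it is safest to pick an adapter key that is unique across api.py and ssr.py.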

Testing Workflow

1. Understand the Website

The human will help you:

Identify the broker's API endpoint (or the HTML structure)
Check for a robots.txt or rate limit headers
Write exploratory curl requests (for APIs) or BeautifulSoup inspections

2. Develop & Test Locally

Add your scraper function to the appropriate file (api.py or ssr.py)
Register it in the SCRAPERS dict
The human updates tests/test_adapters.py to point to your adapter:
```
ADAPTER = SCRAPERS['your_broker_name']
```
Run the test:
```
cd tests && python test_adapters.py
```
The test prints listings in a simple format so you can validate output

3. Merge Code

Once validated, the human will copy your inline code snippets into the main codebase. You produce easily pasteable functions, not entire files.

Config & Constants

Location: src/config.py

Key values you may reference:

MAX_PRICE = 300_000 — Price filter (your scraper can skip listings above this)
USER_AGENT = "Huizenbot/1.0 (+mark@kalsbeek.dev) persoonlijk gebruik" — Used in all HTTP headers
MARK_WERK_POSTCODE, MICHELLE_WERK_POSTCODE — Work postcodes for travel time calculation

Secrets (API keys, webhook URLs) are environment variables, not in config.

CMS Detection Tool

Before investigating a broker's HTML manually, ask the human to run autoscraper.py from the project root:

python autoscraper.py listings <listings-url>
python autoscraper.py details <detail-page-url>

If the broker uses a known CMS, the tool prints the exact code to add — no further investigation needed. Currently detected CMSes:

Realworks → prints a ready-to-paste fetch_realworks(...) one-liner for ssr.py

If the CMS is unknown, the tool prints structural diagnostics (card selectors, field patterns, pagination) to guide manual adapter development.

Important Notes

Detail pages are not optional: always fetch them, because we want all the available information for every listing.

Status Mapping

Brokers use different status strings. Always map to one of:

"beschikbaar" — Available for sale
"onder_bod" — Under offer
"verkocht" — Sold

Example from api.py:

_STATUS_MAP = {
    "available": "beschikbaar",
    "under_bid": "onder_bod",
    "sold": "verkocht",
}
status = _STATUS_MAP.get(item.get("status"), "beschikbaar")

Postcode Extraction

Always aim for the Dutch postcode format (4 digits + 2 letters, e.g., "2611CA"). The travel time calculation depends on it. If a broker only provides the address string, use _extract_postcode(address).
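As an illustration of the normalisation this implies, the sketch below pulls a postcode out of free text and maps it to a city using the ranges listed for _infer_stad. The function names (without the leading underscore) and the exact regex are assumptions; the real helpers in ssr.py may be more lenient or stricter.

```python
import re
from typing import Optional

# Four digits (first one non-zero), an optional space, two uppercase letters.
# Requiring uppercase avoids matching text such as "1234 in".
_POSTCODE_RE = re.compile(r"\b([1-9]\d{3})\s?([A-Z]{2})\b")


def extract_postcode(text: Optional[str]) -> Optional[str]:
    """Return the first postcode in text, normalised to e.g. '2611CA'."""
    if not text:
        return None
    match = _POSTCODE_RE.search(text)
    return match.group(1) + match.group(2) if match else None


def infer_stad(postcode: Optional[str]) -> Optional[str]:
    """Map a postcode to a city using the ranges documented above."""
    if not postcode or not postcode[:4].isdigit():
        return None
    number = int(postcode[:4])
    if 2600 <= number <= 2629:
        return "Delft"
    if 3100 <= number <= 3135:
        return "Schiedam"
    return None
```

Storing the postcode without the space keeps it in the same form as the "2611CA" example in the schema.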

Price Handling

Prices are integers (euros), never floats. Use parse_prijs() for HTML.

Image URLs

Store the hero/main image URL in hero_image_url. This appears in Home Assistant notifications.

Extra Data

If a broker provides extra fields that don't fit the schema (e.g., balcony, garden, orientation), store them in the extra dict:

listings.append(RawListing(
    url=...,
    ...
    extra={
        "balcony": item.get("has_balcony"),
        "garden": item.get("has_garden"),
        "custom_field": item.get("something_else"),
    }
))

The database stores this as JSON in the extra column.
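The sketch below shows one way the extra dict could be serialised before it is written to that column. The helper name encode_extra and the choice to drop None values are assumptions for illustration, not the project's actual behaviour.

```python
import json
from typing import Any, Dict


def encode_extra(extra: Dict[str, Any]) -> str:
    """Serialise broker-specific fields to a JSON string for the extra column."""
    # Dropping None values keeps the stored JSON small (an assumption of this sketch).
    cleaned = {key: value for key, value in extra.items() if value is not None}
    return json.dumps(cleaned, ensure_ascii=False, sort_keys=True)
```

In practice this means every value in extra must be JSON-serialisable: plain strings, numbers, booleans, lists and dicts, not datetime objects or BeautifulSoup tags.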

Error Handling

Wrap individual listing parsing in try/except to continue on one bad listing
Log parse failures as warnings, not errors (broker HTML changes over time, so occasional failures are expected)
Let HTTP errors bubble up (the runner catches them at the adapter level)

Rate Limiting & Ethics

Both fetch_json() and fetch_soup() handle 429 Retry-After automatically
Nominatim (geocoding) has a 1 req/s limiter built into huizenbot.py
Never spawn parallel requests without the human's approval
Always use the USER_AGENT header (includes contact info for respectful scraping)
Don't keep curling the same endpoint. Save the response to a .dump file once, then rg through it to find what you need. You can also pipe it through bsprettify.py and rg the prettified output.
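You should not need to implement Retry-After handling yourself, but it helps to know what it involves. The sketch below converts a Retry-After header value (either a number of seconds or an HTTP date) into a wait time. The function name, the default and the cap are assumptions, and the real fetch_json() / fetch_soup() may handle this differently.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(
    header_value: Optional[str],
    now: Optional[datetime] = None,
    default: float = 5.0,
    cap: float = 60.0,
) -> float:
    """Convert a Retry-After header into a number of seconds to sleep."""
    if not header_value:
        return default
    value = header_value.strip()
    if value.isdigit():
        # Delta-seconds form, e.g. "Retry-After: 30"
        return min(float(value), cap)
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return min(max((when - now).total_seconds(), 0.0), cap)
```

A caller would sleep for the returned number of seconds before retrying once; capping the wait keeps a misbehaving server from stalling the whole run.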

Example: Adding "Van Daal" (API-based)

Scenario

The human finds that Van Daal (vandaalmakelaardij.nl) has a JSON API at:

https://api.vandaal.nl/listings?city=delft&status=available

Your Code (add to api.py)

# Van Daal
# --------
_VANDAAL_BASE = "https://www.vandaalmakelaardij.nl"
_VANDAAL_API = "https://api.vandaal.nl/listings"

_VANDAAL_STATUS_MAP = {
    "available": "beschikbaar",
    "under_offer": "onder_bod",
    "sold": "verkocht",
}

def fetch_vandaal() -> list[RawListing]:
    listings = []
    for city in ["delft", "schiedam"]:
        data = fetch_json(
            _VANDAAL_API,
            params={"city": city, "status": "available"}
        )
        
        for item in data.get("listings", []):
            # "or 0" also covers an explicit null price in the JSON
            if (item.get("price") or 0) > config.MAX_PRICE:
                continue
            
            listings.append(RawListing(
                url=item["url"],
                source_makelaar="vandaal",
                status=_VANDAAL_STATUS_MAP.get(item.get("status"), "beschikbaar"),
                adres=item.get("address"),
                postcode=item.get("postcode"),
                stad=item.get("city"),
                prijs=item.get("price"),
                woningtype=item.get("type"),
                woonoppervlak=item.get("living_area"),
                slaapkamers=item.get("bedrooms"),
                hero_image_url=item.get("image_url"),
            ))
    
    log.info("vandaal: %d listings", len(listings))
    return listings

Register in SCRAPERS (in api.py)

SCRAPERS = {
    'bjornd': fetch_bjornd,
    'vandaal': fetch_vandaal,  # ← Add this
}

Test

Human updates test_adapters.py:

ADAPTER = SCRAPERS['vandaal']

Then runs:

cd tests && python test_adapters.py

If all looks good, the human copies the fetch_vandaal() function into the real api.py and adds it to SCRAPERS.

Summary

You receive an adapter request + investigation results (API endpoint or HTML structure)
You write a clean, self-contained scraper function that returns list[RawListing]
You register it in the appropriate SCRAPERS dict
The human tests it with test_adapters.py and validates output
The human merges your code into the production files

Keep code simple, use the provided helpers, populate RawListing fields as best you can, and always set source_makelaar and url correctly.
