# Huizenbot — Agent Context for Adding Routes
## Project Overview
**Huizenbot** is a periodic scraper of real estate broker websites in Delft and Schiedam (Netherlands). It:
- Fetches property listings from broker websites
- Saves new ones to SQLite with `RawListing` schema
- Calculates travel times (bike + public transit) to two work locations
- Sends push notifications via Home Assistant webhook (with email fallback)
**Your role:** You will add new broker routes (scrapers) to the `adapters/` directory. A human will:
1. Select a broker from the list
2. Help you investigate the broker's website
3. For API-based brokers: develop curl requests to test
4. For HTML scrapers: develop parsing logic using BeautifulSoup
5. Run `tests/test_adapters.py` to validate
6. Merge your code snippets into the codebase
---
## Key Schema: RawListing
**Location:** `src/huizenbot.py` (lines 29-52)
This is the data model you must populate. All fields except `url` are optional:
```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class RawListing:
    url: str  # REQUIRED: the listing URL
    source_makelaar: str = ""  # Name of the broker (e.g., "bjornd", "vdaal")
    datum_aanmelding: str | None = None  # ISO 8601 date if available
    status: str = "beschikbaar"  # enum: beschikbaar | onder_bod | verkocht
    # Location
    adres: str | None = None  # Street address (e.g., "Binnenwatersloot 3")
    postcode: str | None = None  # Dutch postcode (e.g., "2611CA")
    stad: str | None = None  # City (e.g., "Delft")
    # Property details
    prijs: int | None = None  # Price in euros (integer, no float)
    woningtype: str | None = None  # Type (e.g., "appartement", "tussenwoning")
    woonoppervlak: int | None = None  # Living space in m²
    perceeloppervlak: int | None = None  # Plot size in m² (NULL for apartments)
    kamers: int | None = None  # Number of rooms
    slaapkamers: int | None = None  # Number of bedrooms
    bouwjaar: int | None = None  # Build year
    energielabel: str | None = None  # Energy label (e.g., "A", "B")
    # Media
    hero_image_url: str | None = None  # Main photo URL
    # Extra data (broker-specific fields)
    extra: dict[str, Any] = field(default_factory=dict)  # Arbitrary JSON data
```
**DB Upsert:** A listing is inserted the first time it is seen (with `id = sha256(url)`). On subsequent runs only `last_seen` and `status` are updated. Travel times are calculated only on the first insert.
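The upsert behaviour described above can be sketched with the standard library's `sqlite3` module. This is a minimal illustration under stated assumptions, not the code in `src/huizenbot.py`: the table layout, the reduced column set, and the names `make_db`, `listing_id`, and `upsert_listing` are all hypothetical.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone
from typing import Optional


def make_db() -> sqlite3.Connection:
    """In-memory database with a hypothetical, reduced `listings` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE listings ("
        "id TEXT PRIMARY KEY, url TEXT, status TEXT, prijs INTEGER, "
        "first_seen TEXT, last_seen TEXT)"
    )
    return conn


def listing_id(url: str) -> str:
    """Stable primary key: hex SHA-256 digest of the listing URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def upsert_listing(
    conn: sqlite3.Connection, url: str, status: str, prijs: Optional[int]
) -> bool:
    """Insert a new listing, or refresh only last_seen/status.

    Returns True on first insert, which is the moment travel times
    would be calculated.
    """
    now = datetime.now(timezone.utc).isoformat()
    lid = listing_id(url)
    seen = conn.execute("SELECT 1 FROM listings WHERE id = ?", (lid,)).fetchone()
    if seen:
        conn.execute(
            "UPDATE listings SET last_seen = ?, status = ? WHERE id = ?",
            (now, status, lid),
        )
        return False
    conn.execute(
        "INSERT INTO listings (id, url, status, prijs, first_seen, last_seen) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (lid, url, status, prijs, now, now),
    )
    return True
```

One consequence for adapters: because the key is derived from the URL, an adapter should emit the same canonical URL for a listing on every run. A URL that varies (tracking parameters, session tokens) would register as a new listing each time.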
---
## Adapter Structure
Adapters live in `src/adapters/` and are organized by type:
### Two Adapter Types
#### 1. **API-based** (`src/adapters/api.py`)
For brokers with REST/JSON endpoints.
**Pattern:**
```python
def fetch_bjornd() -> list[RawListing]:
    data = fetch_json("https://...", params={...}, headers={...})
    listings = []
    for item in data:
        # Filter / validate
        if item.get("status") in _SKIP:
            continue
        # Treat a missing price as 0 so the comparison never sees None
        if (item.get("price") or 0) > config.MAX_PRICE:
            continue
        listings.append(RawListing(
            url=item["url"],
            source_makelaar="bjornd",
            adres=item.get("address"),
            postcode=item.get("zipcode"),
            # ... etc
        ))
    log.info("bjornd: %d listings", len(listings))
    return listings
```
**Helpers available:**
- `fetch_json(url, *, params=None, headers=None)` — GET with User-Agent, timeout, Retry-After handling
- Built-in logging via `log = logging.getLogger("huizenbot.api")`
#### 2. **SSR/HTML-based** (`src/adapters/ssr.py`)
For brokers with server-side rendered HTML.
**Pattern:**
```python
def fetch_vdaal() -> list[RawListing]:
    soup = fetch_soup("https://vdaalmakelaardij.nl/aanbod")
    listings = []
    for card in soup.select(".property-card"):
        try:
            url = card.select_one("a[href]")["href"]
            if not url.startswith("http"):
                url = VDAAL_BASE + url
            adres = _text(card, ".address-selector")
            postcode = _extract_postcode(adres)
            prijs = parse_prijs(_text(card, ".price"))
            listings.append(RawListing(
                url=url,
                source_makelaar="vdaal",
                adres=adres,
                postcode=postcode,
                stad=_infer_stad(postcode),
                prijs=prijs,
                # ... etc
            ))
        except Exception as e:
            log.warning("Parse error: %s", e)
    log.info("vdaal: %d listings", len(listings))
    return listings
```
**Helpers available:**
- `fetch_soup(url, *, params=None)` — GET with BeautifulSoup, Retry-After handling
- `parse_prijs(text)` — Extract price from strings like "€ 325.000 k.k." → 325000
- `parse_m2(text)` — Extract area from "87 m²" → 87
- `_text(soup, selector)` — Get inner text from element
- `_src(soup, selector)` — Get src or data-src attribute
- `_extract_postcode(text)` — Regex postcode from any text
- `_infer_stad(postcode)` — Simple lookup: 2600-2629 → Delft, 3100-3135 → Schiedam (Den Haag is not covered by this helper; use the city value from the broker directly)
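To make the expected behaviour of the parsing helpers concrete, here is a rough sketch of what they might look like. The real implementations in `ssr.py` are not shown in this document and may differ; the `_sketch` suffixes mark these as hypothetical stand-ins, and the regular expressions are assumptions based only on the examples in the list above.

```python
import re
from typing import Optional

# Dutch notation uses dots as thousands separators: "325.000" means 325000.
_NUMBER = r"\d{1,3}(?:\.\d{3})+|\d+"


def parse_prijs_sketch(text: Optional[str]) -> Optional[int]:
    """Return the first number in a price string as an int, or None."""
    if not text:
        return None
    m = re.search(_NUMBER, text)
    return int(m.group(0).replace(".", "")) if m else None


def parse_m2_sketch(text: Optional[str]) -> Optional[int]:
    """Return the number directly preceding 'm²' (or 'm2'), or None."""
    if not text:
        return None
    m = re.search(rf"({_NUMBER})\s*m(?:²|2)", text)
    return int(m.group(1).replace(".", "")) if m else None


def extract_postcode_sketch(text: Optional[str]) -> Optional[str]:
    """Find a 4-digit + 2-letter postcode and normalise it to '2611CA' form."""
    if not text:
        return None
    m = re.search(r"\b(\d{4})\s*([A-Za-z]{2})\b", text)
    return (m.group(1) + m.group(2)).upper() if m else None


def infer_stad_sketch(postcode: Optional[str]) -> Optional[str]:
    """Map the numeric part of a postcode to Delft or Schiedam, else None."""
    if not postcode or not postcode[:4].isdigit():
        return None
    number = int(postcode[:4])
    if 2600 <= number <= 2629:
        return "Delft"
    if 3100 <= number <= 3135:
        return "Schiedam"
    return None
```

All four return `None` rather than raising on missing or unparseable input, which matches the schema: every `RawListing` field except `url` is optional.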
---
## Registration
Both `api.py` and `ssr.py` have a `SCRAPERS` dict at the bottom:
```python
# api.py
SCRAPERS = {
'bjornd': fetch_bjornd,
'your_broker': fetch_your_broker, # ← Add here
}
# ssr.py
SCRAPERS = {
'bjornd_demo': fetch_bjornd_demo,
'your_broker': fetch_your_broker, # ← Add here
}
```
The `src/adapters/__init__.py` merges both dicts, so the runner picks up all registered adapters automatically.
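The merge step could look roughly like the sketch below. The actual contents of `src/adapters/__init__.py` are not shown in this document, so the function name `merge_scrapers` and the duplicate-name check are assumptions added for illustration.

```python
from typing import Callable


def merge_scrapers(*registries: dict[str, Callable]) -> dict[str, Callable]:
    """Combine several SCRAPERS dicts into one, refusing duplicate names.

    A plain `{**api, **ssr}` merge would let a later dict silently replace
    an earlier entry with the same key; this sketch raises instead.
    """
    merged: dict[str, Callable] = {}
    for registry in registries:
        overlap = merged.keys() & registry.keys()
        if overlap:
            raise ValueError(f"duplicate scraper names: {sorted(overlap)}")
        merged.update(registry)
    return merged


# Stand-ins for api.SCRAPERS and ssr.SCRAPERS (hypothetical contents).
api_scrapers = {"bjornd": lambda: []}
ssr_scrapers = {"bjornd_demo": lambda: []}
all_scrapers = merge_scrapers(api_scrapers, ssr_scrapers)
```

Whatever the real merge does, the practical takeaway is to pick a registry key that is unique across both files, and to register each broker in only one of them.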
---
## Testing Workflow
### 1. Understand the Website
The human will help you:
- Identify the broker's API endpoint (or the HTML structure)
- Check for a `robots.txt` or rate limit headers
- Write exploratory curl requests (for APIs) or BeautifulSoup inspections
### 2. Develop & Test Locally
- Add your scraper function to the appropriate file (`api.py` or `ssr.py`)
- Register it in the `SCRAPERS` dict
- The human updates `tests/test_adapters.py` to point to your adapter:
```python
ADAPTER = SCRAPERS['your_broker_name']
```
- Run the test:
```bash
cd tests && python test_adapters.py
```
- The test prints listings in a simple format so you can validate output
### 3. Merge Code
Once validated, the human will **copy your inline code snippets** into the main codebase. You produce **easily pasteable functions**, not entire files.
---
## Config & Constants
**Location:** `src/config.py`
Key values you may reference:
- `MAX_PRICE = 300_000` — Price filter (your scraper can skip listings above this)
- `USER_AGENT = "Huizenbot/1.0 (+mark@kalsbeek.dev) persoonlijk gebruik"` — Used in all HTTP headers
- `MARK_WERK_POSTCODE`, `MICHELLE_WERK_POSTCODE` — Work postcodes for travel time calculation
Secrets (API keys, webhook URLs) are **environment variables**, not in config.
---
## Platform / CMS Quick Identification
Before investigating a broker's HTML manually, check for known platforms in this order:
### 1. OG Online / realtime-listings (API — fastest)
Check if `https://<base>/nl/realtime-listings/consumer` returns JSON (with header `X-Requested-With: XMLHttpRequest`). If yes, this is a 10-line addition to `api.py`. Known brokers: bjornd, moerman, vandaal, elzenaar, doen.
Fields: `isSales`, `statusOrig`, `salesPrice`, `address`, `zipcode`, `city`, `rooms`, `bedrooms`, `livingSurface`, `plotSurface`, `dateOfConstruction`, `energyLabel`, `type`, `photo`, `url`.
Add a `_CITIES` set to filter by city if the broker covers a wide area. Skip statuses `"rented"` and `"rented_ur"`.
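As a rough guide, mapping one realtime-listings item onto `RawListing` keyword arguments might look like this. The field names come from the list above; everything about their value shapes is an assumption (for example that `salesPrice` is numeric, that `url` may be relative, and that `dateOfConstruction` is a plain year). `map_realtime_item` is a hypothetical helper, and status mapping is left out because the possible `statusOrig` values are not listed here.

```python
from typing import Any, Optional

_SKIP_STATUSES = {"rented", "rented_ur"}
_CITIES = {"delft", "schiedam"}  # lowercase for case-insensitive matching
MAX_PRICE = 300_000  # mirrors config.MAX_PRICE


def map_realtime_item(
    item: dict[str, Any], base_url: str, makelaar: str
) -> Optional[dict[str, Any]]:
    """Return RawListing keyword arguments, or None if the item is filtered out."""
    if not item.get("isSales"):
        return None  # rental listing
    if item.get("statusOrig") in _SKIP_STATUSES:
        return None
    if (item.get("city") or "").lower() not in _CITIES:
        return None
    prijs = item.get("salesPrice")
    if prijs is not None and prijs > MAX_PRICE:
        return None
    url = item.get("url")
    if not url:
        return None  # url is the only required field
    if not url.startswith("http"):
        url = base_url.rstrip("/") + "/" + url.lstrip("/")
    bouwjaar = item.get("dateOfConstruction")
    return {
        "url": url,
        "source_makelaar": makelaar,
        "adres": item.get("address"),
        "postcode": item.get("zipcode"),
        "stad": item.get("city"),
        "prijs": int(prijs) if prijs is not None else None,
        "woningtype": item.get("type"),
        "woonoppervlak": item.get("livingSurface"),
        "perceeloppervlak": item.get("plotSurface"),
        "kamers": item.get("rooms"),
        "slaapkamers": item.get("bedrooms"),
        # Only accept a plain integer year; other shapes need inspection first.
        "bouwjaar": bouwjaar if isinstance(bouwjaar, int) else None,
        "energielabel": item.get("energyLabel"),
        "hero_image_url": item.get("photo"),
    }
```

In an adapter the result would be used as `RawListing(**kwargs)` for every item where the function returns a dict. Check one real response before relying on any of the assumed value shapes.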
### 2. Realworks CMS (SSR — one liner)
Run `autoscraper.py` or check HTML for `li.aanbodEntry`. If detected:
```python
def fetch_mybroker() -> list[RawListing]:
    return fetch_realworks("https://www.mybroker.nl", "mybroker")
```
### 3. SURE WordPress Plugin (SSR — ~50 lines)
Check HTML for `sure-` CSS classes or `?sure_koop_huur=koop` filter. Two card variants:
- `a.card-house` (single dash) — e.g. Olsthoorn
- `a.card--house` (double dash) — e.g. Borgdorff
Both use `?sure_koop_huur=koop` to filter buy listings and `/page/{N}/` pagination. Detail page always has `#kenmerken li span span` pairs with labels like `status`, `soort woonhuis`/`soort woning`/`soort bouw`, `bouwjaar`, `gebruiksoppervlakte wonen`, `perceeloppervlakte`, `aantal slaapkamers`, `energielabel`. Postcode is often **not** available on the detail page.
Terminate pagination when `len(cards) < expected_per_page` (typically 15 for SURE).
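The pagination rule and the label/value pairing can be sketched without touching the network. In the sketch below, `fetch_page` is an injected callable standing in for "fetch page N and return its cards", so the loop can be exercised offline. The helper names, the `max_pages` safety cap, and the way the `/page/{N}/` path and the query string are combined are assumptions, not confirmed behaviour of any SURE site.

```python
from typing import Any, Callable, Iterator


def sure_page_url(listing_url: str, page: int) -> str:
    """Build a page URL with the buy filter (assumed path + query layout)."""
    return f"{listing_url.rstrip('/')}/page/{page}/?sure_koop_huur=koop"


def iter_sure_cards(
    fetch_page: Callable[[int], list[Any]],
    per_page: int = 15,
    max_pages: int = 50,
) -> Iterator[Any]:
    """Yield cards page by page and stop after the first short page."""
    for page in range(1, max_pages + 1):
        cards = fetch_page(page)
        yield from cards
        if len(cards) < per_page:
            break  # a short (or empty) page is the last one


def pair_kenmerken(texts: list[str]) -> dict[str, str]:
    """Turn a flat [label, value, label, value, ...] list into a dict.

    Assumes the `#kenmerken li span span` elements alternate strictly
    between label and value; labels are lowercased for lookup.
    """
    cleaned = [t.strip() for t in texts]
    return {label.lower(): value for label, value in zip(cleaned[0::2], cleaned[1::2])}
```

If the total number of listings happens to be an exact multiple of `per_page`, this loop fetches one extra, empty page before stopping. That costs a single request and avoids guessing the page count up front.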
### 4. Unknown CMS
Run the autoscraper tool:
```bash
python autoscraper.py listings <listings-url>
python autoscraper.py details <detail-page-url>
```
It prints structural diagnostics (card selectors, field patterns, pagination) to guide manual adapter development.
## Important Notes
Don't treat detail pages as optional: we always want all the info.
### Status Mapping
Brokers use different status strings. Always map to one of:
- `"beschikbaar"` — Available for sale
- `"onder_bod"` — Under offer
- `"verkocht"` — Sold
Example from api.py:
```python
_STATUS_MAP = {
    "available": "beschikbaar",
    "under_bid": "onder_bod",
    "sold": "verkocht",
}
status = _STATUS_MAP.get(item.get("status"), "beschikbaar")
```
### Postcode Extraction
Always aim for the **Dutch postcode format** (4 digits + 2 letters, e.g., `"2611CA"`). The travel time calculation depends on it. If a broker only provides the address string, use `_extract_postcode(address)`.
If a postcode field contains extra text (e.g., `"2522GW Den Haag"`), extract cleanly with:
```python
m = re.search(r"\d{4}\s*[A-Z]{2}", raw.upper())
postcode = m.group(0).replace(" ", "") if m else None
```
Never just `.replace(" ", "")` — that produces garbage like `"2522GWDenHaag"`.
### Price Handling
Prices are **integers** (euros), never floats. Use `parse_prijs()` for HTML.
### Image URLs
Store the hero/main image URL in `hero_image_url`. This appears in Home Assistant notifications.
### Extra Data
If a broker provides extra fields that don't fit the schema (e.g., balcony, garden, orientation), store them in the `extra` dict:
```python
listings.append(RawListing(
    url=...,
    # ... other fields ...
    extra={
        "balcony": item.get("has_balcony"),
        "garden": item.get("has_garden"),
        "custom_field": item.get("something_else"),
    },
))
```
The database stores this as JSON in the `extra` column.
### Error Handling
- Wrap individual listing parsing in try/except to continue on one bad listing
- Log parse failures as warnings, not errors (broker HTML changes over time)
- Let HTTP errors bubble up (the runner catches them at the adapter level)
### Rate Limiting & Ethics
- Both `fetch_json()` and `fetch_soup()` handle 429 Retry-After automatically
- Nominatim (geocoding) has a 1 req/s limiter built into `huizenbot.py`
- Never spawn parallel requests without the human's approval
- Always use the `USER_AGENT` header (includes contact info for respectful scraping)
- Don't keep curling the same endpoint. Fetch it once, save the response to `<broker-name>.dump`, and search that file with `rg`. You can also pipe the response through `bsprettify.py` first and run `rg` on the prettified output.
- Don't over-investigate pagination — confirm card count on page 1, assume it's consistent across pages, move on. Never fetch multiple pages just to verify the per-page count.
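For context on the first bullet: a `Retry-After` header can carry either a number of seconds or an HTTP date. The sketch below shows one way to turn either form into a wait time. It illustrates the header format only; how `fetch_json()` and `fetch_soup()` actually implement this is not shown in this document, and the function name and the `cap` parameter are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(
    header_value: Optional[str],
    now: Optional[datetime] = None,
    cap: float = 120.0,
) -> float:
    """Convert a Retry-After header value into seconds to wait (0..cap)."""
    if not header_value:
        return 0.0
    value = header_value.strip()
    if value.isdigit():  # delay-seconds form, e.g. "30"
        return min(float(value), cap)
    try:  # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return 0.0  # unparseable header: do not wait
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return min(max((when - now).total_seconds(), 0.0), cap)
```

The cap keeps a misbehaving server from stalling a whole run. Since the shared helpers already cover this, an adapter should not add its own retry loop on top.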
---
## Example: Adding "Van Daal" (API-based)
### Scenario
The human finds that Van Daal (vandaalmakelaardij.nl) has a JSON API at:
```
https://api.vandaal.nl/listings?city=delft&status=available
```
### Your Code (add to api.py)
```python
# Van Daal
# --------
_VANDAAL_BASE = "https://www.vandaalmakelaardij.nl"
_VANDAAL_API = "https://api.vandaal.nl/listings"
_VANDAAL_STATUS_MAP = {
    "available": "beschikbaar",
    "under_offer": "onder_bod",
    "sold": "verkocht",
}


def fetch_vandaal() -> list[RawListing]:
    listings = []
    for city in ["delft", "schiedam"]:
        data = fetch_json(
            _VANDAAL_API,
            params={"city": city, "status": "available"},
        )
        for item in data.get("listings", []):
            # `or 0` guards against an explicit null price in the JSON
            if (item.get("price") or 0) > config.MAX_PRICE:
                continue
            listings.append(RawListing(
                url=item["url"],
                source_makelaar="vandaal",
                # Unknown or missing statuses fall back to "beschikbaar"
                status=_VANDAAL_STATUS_MAP.get(item.get("status"), "beschikbaar"),
                adres=item.get("address"),
                postcode=item.get("postcode"),
                stad=item.get("city"),
                prijs=item.get("price"),
                woningtype=item.get("type"),
                woonoppervlak=item.get("living_area"),
                slaapkamers=item.get("bedrooms"),
                hero_image_url=item.get("image_url"),
            ))
    log.info("vandaal: %d listings", len(listings))
    return listings
```
### Register in SCRAPERS (in api.py)
```python
SCRAPERS = {
    'bjornd': fetch_bjornd,
    'vandaal': fetch_vandaal,  # ← Add this
}
```
### Test
Human updates `test_adapters.py`:
```python
ADAPTER = SCRAPERS['vandaal']
```
Then runs:
```bash
cd tests && python test_adapters.py
```
If all looks good, the human copies the `fetch_vandaal()` function into the real `api.py` and adds it to `SCRAPERS`.
---
## Summary
1. **You receive** an adapter request + investigation results (API endpoint or HTML structure)
2. **You write** a clean, self-contained scraper function that returns `list[RawListing]`
3. **You register** it in the appropriate `SCRAPERS` dict
4. **The human tests** it with `test_adapters.py` and validates output
5. **The human merges** your code into the production files
Keep code simple, use the provided helpers, populate `RawListing` fields as best you can, and always set `source_makelaar` and `url` correctly.