# Huizenbot — Agent Context for Adding Routes

## Project Overview

**Huizenbot** is a periodic scraper of real estate broker websites in Delft and Schiedam (Netherlands). It:
- Fetches property listings from broker websites
- Saves new ones to SQLite with `RawListing` schema
- Calculates travel times (bike + public transit) to two work locations
- Sends push notifications via Home Assistant webhook (with email fallback)

**Your role:** You will add new broker routes (scrapers) to the `adapters/` directory. A human will:
1. Select a broker from the list
2. Help you investigate the broker's website
3. For API-based brokers: develop curl requests to test
4. For HTML scrapers: develop parsing logic using BeautifulSoup
5. Run `tests/test_adapters.py` to validate
6. Merge your code snippets into the codebase

---

## Key Schema: RawListing

**Location:** `src/huizenbot.py` (lines 29–52)

This is the data model you must populate. All fields except `url` are optional:

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class RawListing:
    url: str                                # REQUIRED — the listing URL

    source_makelaar: str = ""               # Name of the broker (e.g., "bjornd", "vdaal")
    datum_aanmelding: str | None = None     # ISO 8601 date if available
    status: str = "beschikbaar"             # enum: beschikbaar | onder_bod | verkocht

    # Location
    adres: str | None = None                # Street address (e.g., "Binnenwatersloot 3")
    postcode: str | None = None             # Dutch postcode (e.g., "2611CA")
    stad: str | None = None                 # City (e.g., "Delft")

    # Property details
    prijs: int | None = None                # Price in euros (integer, no float)
    woningtype: str | None = None           # Type (e.g., "appartement", "tussenwoning")
    woonoppervlak: int | None = None        # Living space in m²
    perceeloppervlak: int | None = None     # Plot size in m² (NULL for apartments)
    kamers: int | None = None               # Number of rooms
    slaapkamers: int | None = None          # Number of bedrooms
    bouwjaar: int | None = None             # Build year
    energielabel: str | None = None         # Energy label (e.g., "A", "B")

    # Media
    hero_image_url: str | None = None       # Main photo URL

    # Extra data (broker-specific fields)
    extra: dict[str, Any] = field(default_factory=dict)  # Arbitrary JSON data
```

**DB Upsert:** A listing is inserted the first time it is seen (with `id = sha256(url)`). On subsequent runs only `last_seen` and `status` are updated. Travel times are calculated only on first insert.
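
The upsert behaviour described above can be sketched as follows. This is only an illustration, not the code in `huizenbot.py`: the table name `listings`, the `first_seen` column, and the function names are assumptions, and it relies on SQLite's `ON CONFLICT ... DO UPDATE` clause (SQLite 3.24 or newer).

```python
import hashlib
import sqlite3


def listing_id(url: str) -> str:
    """Stable primary key: hex SHA-256 digest of the listing URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def upsert(conn: sqlite3.Connection, url: str, status: str, seen: str) -> bool:
    """Insert a new listing or refresh status/last_seen. Returns True on first insert."""
    lid = listing_id(url)
    is_new = conn.execute(
        "SELECT 1 FROM listings WHERE id = ?", (lid,)
    ).fetchone() is None
    conn.execute(
        """
        INSERT INTO listings (id, url, status, first_seen, last_seen)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT(id) DO UPDATE SET
            status = excluded.status,
            last_seen = excluded.last_seen
        """,
        (lid, url, status, seen, seen),
    )
    return is_new


# Hypothetical in-memory table, just enough columns to show the behaviour.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings ("
    "id TEXT PRIMARY KEY, url TEXT, status TEXT, first_seen TEXT, last_seen TEXT)"
)
first = upsert(conn, "https://example.com/woning/1", "beschikbaar", "2024-01-01")
second = upsert(conn, "https://example.com/woning/1", "onder_bod", "2024-01-02")
row = conn.execute("SELECT status, first_seen, last_seen FROM listings").fetchone()
```

The boolean return value is where a caller could decide whether to compute travel times, since that work is only needed for listings seen for the first time.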

---

## Adapter Structure

Adapters live in `src/adapters/` and are organized by type:

### Two Adapter Types

#### 1. **API-based** (`src/adapters/api.py`)
For brokers with REST/JSON endpoints.

**Pattern:**
```python
def fetch_bjornd() -> list[RawListing]:
    data = fetch_json("https://...", params={...}, headers={...})
    listings = []
    for item in data:
        # Filter / validate
        if item.get("status") in _SKIP:
            continue
        # A missing price must not be compared: None > int raises TypeError
        price = item.get("price")
        if price is not None and price > config.MAX_PRICE:
            continue

        listings.append(RawListing(
            url=item["url"],
            source_makelaar="bjornd",
            adres=item.get("address"),
            postcode=item.get("zipcode"),
            # ... etc
        ))

    log.info("bjornd: %d listings", len(listings))
    return listings
```

**Helpers available:**
- `fetch_json(url, *, params=None, headers=None)` — GET with User-Agent, timeout, Retry-After handling
- Built-in logging via `log = logging.getLogger("huizenbot.api")`

#### 2. **SSR/HTML-based** (`src/adapters/ssr.py`)
For brokers with server-side rendered HTML.

**Pattern:**
```python
def fetch_vdaal() -> list[RawListing]:
    soup = fetch_soup("https://vdaalmakelaardij.nl/aanbod")
    listings = []

    for card in soup.select(".property-card"):
        try:
            url = card.select_one("a[href]")["href"]
            if not url.startswith("http"):
                url = VDAAL_BASE + url

            adres = _text(card, ".address-selector")
            postcode = _extract_postcode(adres)
            prijs = parse_prijs(_text(card, ".price"))

            listings.append(RawListing(
                url=url,
                source_makelaar="vdaal",
                adres=adres,
                postcode=postcode,
                stad=_infer_stad(postcode),
                prijs=prijs,
                # ... etc
            ))
        except Exception as e:
            log.warning("Parse error: %s", e)

    log.info("vdaal: %d listings", len(listings))
    return listings
```

**Helpers available:**
- `fetch_soup(url, *, params=None)` — GET with BeautifulSoup, Retry-After handling
- `parse_prijs(text)` — Extract price from strings like "€ 325.000 k.k." → 325000
- `parse_m2(text)` — Extract area from "87 m²" → 87
- `_text(soup, selector)` — Get inner text from element
- `_src(soup, selector)` — Get src or data-src attribute
- `_extract_postcode(text)` — Regex postcode from any text
- `_infer_stad(postcode)` — Simple lookup: 2600–2629 → Delft, 3100–3135 → Schiedam

---

## Registration

Both `api.py` and `ssr.py` have a `SCRAPERS` dict at the bottom:

```python
# api.py
SCRAPERS = {
    'bjornd': fetch_bjornd,
    'your_broker': fetch_your_broker,  # ← Add here
}

# ssr.py
SCRAPERS = {
    'bjornd_demo': fetch_bjornd_demo,
    'your_broker': fetch_your_broker,  # ← Add here
}
```

The `src/adapters/__init__.py` merges both dicts, so the runner picks up all registered adapters automatically.
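
How the merge is implemented is not shown here, so the following is only one plausible shape for it. `API_SCRAPERS`, `SSR_SCRAPERS`, and `merge_scrapers` are hypothetical names, and the duplicate check is an assumption rather than documented behaviour. It does illustrate why a broker key should be unique across both files.

```python
from typing import Callable

# Stand-ins for the SCRAPERS dicts exported by api.py and ssr.py.
API_SCRAPERS: dict[str, Callable[[], list]] = {"bjornd": lambda: []}
SSR_SCRAPERS: dict[str, Callable[[], list]] = {"vdaal": lambda: []}


def merge_scrapers(*registries: dict[str, Callable[[], list]]) -> dict[str, Callable[[], list]]:
    """Combine adapter registries, refusing duplicate broker keys."""
    merged: dict[str, Callable[[], list]] = {}
    for registry in registries:
        for name, func in registry.items():
            if name in merged:
                raise ValueError(f"duplicate adapter name: {name}")
            merged[name] = func
    return merged


SCRAPERS = merge_scrapers(API_SCRAPERS, SSR_SCRAPERS)
```

A plain `{**a, **b}` merge would also work, but it silently lets the later dict win when the same key appears twice.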

---

## Testing Workflow

### 1. Understand the Website
The human will help you:
- Identify the broker's API endpoint (or the HTML structure)
- Check for a `robots.txt` or rate limit headers
- Write exploratory curl requests (for APIs) or BeautifulSoup inspections

### 2. Develop & Test Locally
- Add your scraper function to the appropriate file (`api.py` or `ssr.py`)
- Register it in the `SCRAPERS` dict
- The human updates `tests/test_adapters.py` to point to your adapter:
  ```python
  ADAPTER = SCRAPERS['your_broker_name']
  ```
- Run the test:
  ```bash
  cd tests && python test_adapters.py
  ```
- The test prints listings in a simple format so you can validate output

### 3. Merge Code
Once validated, the human will **copy your inline code snippets** into the main codebase. You produce **easily pasteable functions**, not entire files.

---

## Config & Constants

**Location:** `src/config.py`

Key values you may reference:
- `MAX_PRICE = 300_000` — Price filter (your scraper can skip listings above this)
- `USER_AGENT = "Huizenbot/1.0 (+mark@kalsbeek.dev) persoonlijk gebruik"` — Sent as the `User-Agent` header on every HTTP request
- `MARK_WERK_POSTCODE`, `MICHELLE_WERK_POSTCODE` — Work postcodes for travel time calculation

Secrets (API keys, webhook URLs) are read from **environment variables** and are not stored in `config.py`.
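
If an adapter ever needs a secret (a broker API key, for example), reading it could look like the sketch below. `require_env` is a hypothetical helper and the variable name in the usage comment is made up. Check how the existing code reads its secrets before introducing anything new.

```python
import os


def require_env(name: str) -> str:
    """Return a non-empty environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"missing required environment variable: {name}")
    return value


# Usage (hypothetical variable name):
# api_key = require_env("BROKER_API_KEY")
```

Failing early with the variable name in the message is usually easier to debug than a `KeyError` or an unexplained 401 later in the run.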

---

## CMS Detection Tool

Before investigating a broker's HTML manually, ask the human to run `autoscraper.py` from the project root:
```bash
python autoscraper.py listings <listings-url>
python autoscraper.py details <detail-page-url>
```

If the broker uses a known CMS, the tool prints the exact code to add and no further investigation is needed. Currently detected CMSes:

- **Realworks** → prints a ready-to-paste `fetch_realworks(...)` one-liner for `ssr.py`

If the CMS is unknown, the tool prints structural diagnostics (card selectors, field patterns, pagination) to guide manual adapter development.

## Important Notes

Detail pages are not optional: always fetch them, because we want every field the broker publishes.

### Status Mapping
Brokers use different status strings. Always map to one of:
- `"beschikbaar"` — Available for sale
- `"onder_bod"` — Under offer
- `"verkocht"` — Sold

Example from api.py:
```python
_STATUS_MAP = {
    "available": "beschikbaar",
    "under_bid": "onder_bod",
    "sold": "verkocht",
}
status = _STATUS_MAP.get(item.get("status"), "beschikbaar")
```

### Postcode Extraction
Always aim for the **Dutch postcode format** (4 digits + 2 letters, e.g., `"2611CA"`). The travel time calculation depends on it. If a broker only provides the address string, use `_extract_postcode(address)`.

### Price Handling
Prices are **integers** (euros), never floats. Use `parse_prijs()` for HTML.

### Image URLs
Store the hero/main image URL in `hero_image_url`. This appears in Home Assistant notifications.

### Extra Data
If a broker provides extra fields that don't fit the schema (e.g., balcony, garden, orientation), store them in the `extra` dict:
```python
listings.append(RawListing(
    url=...,
    # ... other fields
    extra={
        "balcony": item.get("has_balcony"),
        "garden": item.get("has_garden"),
        "custom_field": item.get("something_else"),
    },
))
```

The database stores this as JSON in the `extra` column.

### Error Handling
- Wrap the parsing of each individual listing in try/except so that one bad listing does not abort the whole adapter
- Log parse failures as warnings, not errors (brokers' HTML changes)
- Let HTTP errors bubble up (the runner catches them at the adapter level)

### Rate Limiting & Ethics
- Both `fetch_json()` and `fetch_soup()` handle 429 Retry-After automatically
- Nominatim (geocoding) has a 1 req/s limiter built into `huizenbot.py`
- Never spawn parallel requests without the human's approval
- Always use the `USER_AGENT` header (includes contact info for respectful scraping)
- Don't curl the same endpoint repeatedly. Save the response once to `<broker-name>.dump` and search it with `rg`. You can also pipe it through `bsprettify.py` first and `rg` the prettified output.

---

## Example: Adding "Van Daal" (API-based)

### Scenario
The human finds that Van Daal (vandaalmakelaardij.nl) has a JSON API at:
```
https://api.vandaal.nl/listings?city=delft&status=available
```

### Your Code (add to api.py)

```python
# Van Daal
# --------
_VANDAAL_BASE = "https://www.vandaalmakelaardij.nl"
_VANDAAL_API = "https://api.vandaal.nl/listings"

_VANDAAL_STATUS_MAP = {
    "available": "beschikbaar",
    "under_offer": "onder_bod",
    "sold": "verkocht",
}

def fetch_vandaal() -> list[RawListing]:
    listings = []
    for city in ["delft", "schiedam"]:
        data = fetch_json(
            _VANDAAL_API,
            params={"city": city, "status": "available"},
        )

        for item in data.get("listings", []):
            # `or 0` also covers an explicit null price in the JSON
            if (item.get("price") or 0) > config.MAX_PRICE:
                continue

            listings.append(RawListing(
                url=item["url"],
                source_makelaar="vandaal",
                # Map the broker's status string onto the schema's enum
                status=_VANDAAL_STATUS_MAP.get(item.get("status"), "beschikbaar"),
                adres=item.get("address"),
                postcode=item.get("postcode"),
                stad=item.get("city"),
                prijs=item.get("price"),
                woningtype=item.get("type"),
                woonoppervlak=item.get("living_area"),
                slaapkamers=item.get("bedrooms"),
                hero_image_url=item.get("image_url"),
            ))

    log.info("vandaal: %d listings", len(listings))
    return listings
```

### Register in SCRAPERS (in api.py)
```python
SCRAPERS = {
    'bjornd': fetch_bjornd,
    'vandaal': fetch_vandaal,  # ← Add this
}
```

### Test
Human updates `test_adapters.py`:
```python
ADAPTER = SCRAPERS['vandaal']
```

Then runs:
```bash
cd tests && python test_adapters.py
```

If all looks good, the human copies the `fetch_vandaal()` function into the real `api.py` and adds it to `SCRAPERS`.

---

## Summary

1. **You receive** an adapter request + investigation results (API endpoint or HTML structure)
2. **You write** a clean, self-contained scraper function that returns `list[RawListing]`
3. **You register** it in the appropriate `SCRAPERS` dict
4. **The human tests** it with `test_adapters.py` and validates output
5. **The human merges** your code into the production files

Keep code simple, use the provided helpers, populate `RawListing` fields as best you can, and always set `source_makelaar` and `url` correctly.