refactor: split ssr.py into package, enrich OG Online detail pages, fix travel upsert

- Split src/adapters/ssr.py (2160 LOC) into ssr/ package grouped by CMS:
  realworks.py, sure.py, schiedam.py, denhaag.py, overige.py
- Add _og_detail() to api.py; all OG Online scrapers now fall back to
  detail page fetch when energielabel/bouwjaar are missing from the API
- Fix run() to recalculate travel times for existing listings where
  fiets_mark IS NULL; upsert() now writes travel cols on existing rows too
- Update tests/cache.py to patch fetch_soup in every ssr submodule
- Update docs to reflect new package structure and mark API enrichment TODO done

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 23:39:35 +02:00
parent 1011d9cf87
commit f74e9bcfb0
14 changed files with 2478 additions and 2199 deletions


@@ -96,8 +96,13 @@ def fetch_bjornd() -> list[RawListing]:
- `fetch_json(url, *, params=None, headers=None)` — GET with User-Agent, timeout, Retry-After handling
- Built-in logging via `log = logging.getLogger("huizenbot.api")`
-#### 2. **SSR/HTML-based** (`src/adapters/ssr.py`)
-For brokers with server-side rendered HTML.
+#### 2. **SSR/HTML-based** (`src/adapters/ssr/` package)
+For brokers with server-side rendered HTML. The package is split by CMS platform:
+- `realworks.py` — Realworks CMS (li/div.aanbodEntry cards + span.kenmerk detail)
+- `sure.py` — SURE WordPress plugin (/wonen?sure_koop_huur=koop + #kenmerken detail)
+- `schiedam.py` — Custom Schiedam scrapers (diverse platforms)
+- `denhaag.py` — Den Haag scrapers (diverse platforms)
+- `overige.py` — Other / multi-city scrapers (OG Online WP, Elementor)
**Pattern:**
```python
@@ -144,18 +149,22 @@ def fetch_vdaal() -> list[RawListing]:
## Registration
-Both `api.py` and `ssr.py` have a `SCRAPERS` dict at the bottom:
+**API scrapers** (`src/adapters/api.py`): Add your function and register in the `SCRAPERS` dict at the bottom of the file.
+**SSR scrapers**: Add your function to the appropriate submodule (`realworks.py`, `sure.py`, `schiedam.py`, `denhaag.py`, or `overige.py`), then import it in `src/adapters/ssr/__init__.py` and add it to the `SCRAPERS` dict there.
```python
-# api.py
+# api.py — SCRAPERS dict
SCRAPERS = {
'bjornd': fetch_bjornd,
'your_broker': fetch_your_broker, # ← Add here
}
-# ssr.py
+# ssr/__init__.py — import + register
+from .realworks import fetch_your_broker # ← import from the right submodule
SCRAPERS = {
'bjornd_demo': fetch_bjornd_demo,
...
'your_broker': fetch_your_broker, # ← Add here
}
```
@@ -173,7 +182,7 @@ The human will help you:
- Write exploratory curl requests (for APIs) or BeautifulSoup inspections
### 2. Develop & Test Locally
-- Add your scraper function to the appropriate file (`api.py` or `ssr.py`)
+- Add your scraper function to the appropriate file (`api.py` or the right `ssr/` submodule)
- Register it in the `SCRAPERS` dict
- The human updates `tests/test_adapters.py` to point to your adapter:
```python
@@ -208,6 +217,8 @@ Secrets (API keys, webhook URLs) are **environment variables**, not in config.
Before investigating a broker's HTML manually, check for known platforms in this order:
### 1. OG Online / realtime-listings (API — fastest)
+**File:** `src/adapters/api.py`
Check if `https://<base>/nl/realtime-listings/consumer` returns JSON (with header `X-Requested-With: XMLHttpRequest`). If yes, this is a 10-line addition to `api.py`. Known brokers: bjornd, moerman, vandaal, elzenaar, doen.
Fields: `isSales`, `statusOrig`, `salesPrice`, `address`, `zipcode`, `city`, `rooms`, `bedrooms`, `livingSurface`, `plotSurface`, `dateOfConstruction`, `energyLabel`, `type`, `photo`, `url`.
@@ -215,6 +226,8 @@ Fields: `isSales`, `statusOrig`, `salesPrice`, `address`, `zipcode`, `city`, `ro
Add a `_CITIES` set to filter by city if the broker covers a wide area. Skip statuses `"rented"` and `"rented_ur"`.
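The city and status filter described here can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: `keep_listing`, the sample `_CITIES` values, the `"available"` status, and the sample payload are hypothetical, and filtering on `isSales` is an assumption about how buy listings are selected. Only the field names `isSales`, `statusOrig`, and `city` and the two skipped statuses come from the text.

```python
from typing import Any

# Hypothetical city whitelist; the real set depends on the broker's coverage.
_CITIES = {"schiedam", "den haag"}

# The two statuses the text says to skip.
_SKIP_STATUSES = {"rented", "rented_ur"}


def keep_listing(item: dict[str, Any], cities: set[str] = _CITIES) -> bool:
    """Return True for a buy listing in a wanted city that is not rented."""
    # Assumption: isSales marks buy (as opposed to rental) listings.
    if not item.get("isSales"):
        return False
    if item.get("statusOrig") in _SKIP_STATUSES:
        return False
    # Compare case-insensitively so "Schiedam" and "schiedam" both match.
    city = str(item.get("city") or "").strip().lower()
    return city in cities


# Hypothetical payload shaped like the documented fields.
sample = [
    {"isSales": True, "statusOrig": "available", "city": "Schiedam"},
    {"isSales": True, "statusOrig": "rented", "city": "Schiedam"},
    {"isSales": True, "statusOrig": "available", "city": "Utrecht"},
    {"isSales": False, "statusOrig": "available", "city": "Den Haag"},
]
kept = [item for item in sample if keep_listing(item)]
```

Normalizing the city name before the set lookup keeps the filter robust against casing differences between brokers.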
### 2. Realworks CMS (SSR — one liner)
+**File:** `src/adapters/ssr/realworks.py`
Run `autoscraper.py` or check HTML for `li.aanbodEntry`. If detected:
```python
def fetch_mybroker() -> list[RawListing]:
@@ -222,6 +235,8 @@ def fetch_mybroker() -> list[RawListing]:
```
### 3. SURE WordPress Plugin (SSR — ~50 lines)
+**File:** `src/adapters/ssr/sure.py`
Check HTML for `sure-` CSS classes or `?sure_koop_huur=koop` filter. Two card variants:
- `a.card-house` (single dash) — e.g. Olsthoorn
- `a.card--house` (double dash) — e.g. Borgdorff
@@ -231,6 +246,8 @@ Both use `?sure_koop_huur=koop` to filter buy listings and `/page/{N}/` paginati
Terminate pagination when `len(cards) < expected_per_page` (typically 15 for SURE).
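The termination rule can be sketched as a small loop. This is an illustration only: `collect_pages`, the injected `fetch_cards` callable, and the `max_pages` safety cap are hypothetical names, and the real scrapers presumably fetch `/page/{N}/` and parse cards from HTML rather than receiving them from a stand-in function.

```python
from typing import Callable


def collect_pages(
    fetch_cards: Callable[[int], list[str]],
    expected_per_page: int = 15,
    max_pages: int = 50,
) -> list[str]:
    """Collect cards page by page until a short page signals the end."""
    cards: list[str] = []
    for page in range(1, max_pages + 1):
        page_cards = fetch_cards(page)
        cards.extend(page_cards)
        # A page with fewer cards than expected is the last one.
        if len(page_cards) < expected_per_page:
            break
    return cards


# Stand-in for a real fetch: pages of 15, 15, and 4 cards.
requested: list[int] = []


def fake_fetch(page: int) -> list[str]:
    requested.append(page)
    sizes = {1: 15, 2: 15, 3: 4}
    return [f"card-{page}-{i}" for i in range(sizes.get(page, 0))]


all_cards = collect_pages(fake_fetch)
```

When the total is an exact multiple of the page size, this rule costs one extra request: the following page comes back empty, which is also shorter than expected, and the loop stops there.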
### 4. Unknown CMS
+**File:** `src/adapters/ssr/schiedam.py`, `denhaag.py`, or `overige.py` depending on city — or add a new file if needed.
Run the autoscraper tool:
```bash
python autoscraper.py listings <listings-url>