refactor: split ssr.py into package, enrich OG Online detail pages, fix travel upsert

- Split src/adapters/ssr.py (2160 LOC) into ssr/ package grouped by CMS: realworks.py, sure.py, schiedam.py, denhaag.py, overige.py
- Add _og_detail() to api.py; all OG Online scrapers now fall back to detail page fetch when energielabel/bouwjaar are missing from the API
- Fix run() to recalculate travel times for existing listings where fiets_mark IS NULL; upsert() now writes travel cols on existing rows too
- Update tests/cache.py to patch fetch_soup in every ssr submodule
- Update docs to reflect new package structure and mark API enrichment TODO done

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
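
The detail-page fallback mentioned in the commit message can be pictured roughly as follows. This is a minimal sketch, not the actual `_og_detail()` from `api.py`: the `enrich` helper, the dict-based listing, and the `fetch_detail` callable are hypothetical stand-ins, and only the two field names come from the message itself.

```python
from typing import Callable

# Field names taken from the commit message; the real RawListing may use others.
ENRICH_FIELDS = ("energielabel", "bouwjaar")


def enrich(listing: dict, fetch_detail: Callable[[str], dict]) -> dict:
    """Fill missing fields from the detail page, fetching only when needed."""
    missing = [f for f in ENRICH_FIELDS if listing.get(f) in (None, "")]
    if not missing:
        return listing  # API data was complete, so no extra request is made

    detail = fetch_detail(listing["url"])  # hypothetical detail-page scraper
    merged = dict(listing)  # copy so the caller's dict is not mutated
    for field in missing:
        if detail.get(field) not in (None, ""):
            merged[field] = detail[field]
    return merged
```

Values already returned by the API are never overwritten in this sketch; the detail page is consulted only for fields that are absent.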
@@ -96,8 +96,13 @@ def fetch_bjornd() -> list[RawListing]:
 - `fetch_json(url, *, params=None, headers=None)` — GET with User-Agent, timeout, Retry-After handling
 - Built-in logging via `log = logging.getLogger("huizenbot.api")`
 
-#### 2. **SSR/HTML-based** (`src/adapters/ssr.py`)
-For brokers with server-side rendered HTML.
+#### 2. **SSR/HTML-based** (`src/adapters/ssr/` package)
+For brokers with server-side rendered HTML. The package is split by CMS platform:
+- `realworks.py` — Realworks CMS (li/div.aanbodEntry cards + span.kenmerk detail)
+- `sure.py` — SURE WordPress plugin (/wonen?sure_koop_huur=koop + #kenmerken detail)
+- `schiedam.py` — Custom Schiedam scrapers (diverse platforms)
+- `denhaag.py` — Den Haag scrapers (diverse platforms)
+- `overige.py` — Other / multi-city scrapers (OG Online WP, Elementor)
 
 **Pattern:**
 ```python
@@ -144,18 +149,22 @@ def fetch_vdaal() -> list[RawListing]:
 
 ## Registration
 
-Both `api.py` and `ssr.py` have a `SCRAPERS` dict at the bottom:
+**API scrapers** (`src/adapters/api.py`): Add your function and register in the `SCRAPERS` dict at the bottom of the file.
+
+**SSR scrapers**: Add your function to the appropriate submodule (`realworks.py`, `sure.py`, `schiedam.py`, `denhaag.py`, or `overige.py`), then import it in `src/adapters/ssr/__init__.py` and add it to the `SCRAPERS` dict there.
 
 ```python
-# api.py
+# api.py — SCRAPERS dict
 SCRAPERS = {
     'bjornd': fetch_bjornd,
     'your_broker': fetch_your_broker,  # ← Add here
 }
 
-# ssr.py
+# ssr/__init__.py — import + register
+from .realworks import fetch_your_broker  # ← import from the right submodule
+
 SCRAPERS = {
     'bjornd_demo': fetch_bjornd_demo,
     ...
     'your_broker': fetch_your_broker,  # ← Add here
 }
 ```
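
To illustrate why registration matters, here is a minimal sketch of how a runner might consume a `SCRAPERS` dict. The `run_all` helper, the placeholder fetch functions, and the choice to log and skip a failing scraper are assumptions made for illustration; the project's real `run()` is not shown in this diff and may behave differently.

```python
import logging
from typing import Callable

log = logging.getLogger("huizenbot.example")  # hypothetical logger name


def fetch_bjornd_demo() -> list[dict]:
    """Placeholder scraper standing in for a real adapter."""
    return [{"address": "Demo 1"}]


def fetch_your_broker() -> list[dict]:
    """Placeholder for the newly added scraper."""
    return []


SCRAPERS: dict[str, Callable[[], list[dict]]] = {
    "bjornd_demo": fetch_bjornd_demo,
    "your_broker": fetch_your_broker,  # only registered scrapers get called
}


def run_all(scrapers: dict[str, Callable[[], list[dict]]]) -> dict[str, list[dict]]:
    """Call every registered scraper; one failure does not stop the rest."""
    results: dict[str, list[dict]] = {}
    for name, fetch in scrapers.items():
        try:
            results[name] = fetch()
        except Exception:  # assumption: log and continue with the other brokers
            log.exception("scraper %s failed", name)
            results[name] = []
    return results
```

A function that exists in a submodule but is missing from the dict would never be reached by a loop like this, which is why both the import and the dict entry are needed.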
@@ -173,7 +182,7 @@ The human will help you:
 - Write exploratory curl requests (for APIs) or BeautifulSoup inspections
 
 ### 2. Develop & Test Locally
-- Add your scraper function to the appropriate file (`api.py` or `ssr.py`)
+- Add your scraper function to the appropriate file (`api.py` or the right `ssr/` submodule)
 - Register it in the `SCRAPERS` dict
 - The human updates `tests/test_adapters.py` to point to your adapter:
 ```python
@@ -208,6 +217,8 @@ Secrets (API keys, webhook URLs) are **environment variables**, not in config.
 Before investigating a broker's HTML manually, check for known platforms in this order:
 
 ### 1. OG Online / realtime-listings (API — fastest)
+**File:** `src/adapters/api.py`
+
 Check if `https://<base>/nl/realtime-listings/consumer` returns JSON (with header `X-Requested-With: XMLHttpRequest`). If yes, this is a 10-line addition to `api.py`. Known brokers: bjornd, moerman, vandaal, elzenaar, doen.
 
 Fields: `isSales`, `statusOrig`, `salesPrice`, `address`, `zipcode`, `city`, `rooms`, `bedrooms`, `livingSurface`, `plotSurface`, `dateOfConstruction`, `energyLabel`, `type`, `photo`, `url`.
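
The endpoint check described above could be scripted along these lines. This is a sketch using only the standard library rather than the project's own `fetch_json`; the `looks_like_og_online` name and the injectable `get` parameter are hypothetical, added so the probe can be exercised without network access.

```python
import json
import urllib.request
from typing import Callable


def _http_get(url: str, headers: dict) -> bytes:
    """Plain GET via urllib; raises OSError subclasses on network/HTTP errors."""
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read()


def looks_like_og_online(base: str, get: Callable[[str, dict], bytes] = _http_get) -> bool:
    """Return True if the realtime-listings endpoint answers with JSON."""
    url = f"https://{base}/nl/realtime-listings/consumer"
    try:
        body = get(url, {"X-Requested-With": "XMLHttpRequest"})
        data = json.loads(body)
    except (OSError, ValueError):  # network failure, HTTP error, or non-JSON body
        return False
    return isinstance(data, (list, dict))
```

A `True` result only suggests the platform; the returned items still need to be inspected for the fields listed above.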
@@ -215,6 +226,8 @@ Fields: `isSales`, `statusOrig`, `salesPrice`, `address`, `zipcode`, `city`, `ro
 Add a `_CITIES` set to filter by city if the broker covers a wide area. Skip statuses `"rented"` and `"rented_ur"`.
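
A minimal sketch of that filter, under two assumptions that the text does not confirm: that the status to check lives in the `statusOrig` field, and that city names can be compared case-insensitively. The city names in `_CITIES` and the `keep` helper are illustrative only.

```python
# Hypothetical city set; use the cities the bot actually targets.
_CITIES = {"schiedam", "vlaardingen"}
_SKIP_STATUSES = {"rented", "rented_ur"}


def keep(item: dict) -> bool:
    """Decide whether one realtime-listings item should become a listing."""
    status = str(item.get("statusOrig") or "").lower()  # assumed status field
    if status in _SKIP_STATUSES:
        return False
    city = str(item.get("city") or "").strip().lower()
    return city in _CITIES


def filter_items(items: list[dict]) -> list[dict]:
    """Apply the city and status filter to a full API response."""
    return [item for item in items if keep(item)]
```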
 
 ### 2. Realworks CMS (SSR — one liner)
+**File:** `src/adapters/ssr/realworks.py`
+
 Run `autoscraper.py` or check HTML for `li.aanbodEntry`. If detected:
 ```python
 def fetch_mybroker() -> list[RawListing]:
@@ -222,6 +235,8 @@ def fetch_mybroker() -> list[RawListing]:
 ```
 
 ### 3. SURE WordPress Plugin (SSR — ~50 lines)
+**File:** `src/adapters/ssr/sure.py`
+
 Check HTML for `sure-` CSS classes or `?sure_koop_huur=koop` filter. Two card variants:
 - `a.card-house` (single dash) — e.g. Olsthoorn
 - `a.card--house` (double dash) — e.g. Borgdorff
@@ -231,6 +246,8 @@ Both use `?sure_koop_huur=koop` to filter buy listings and `/page/{N}/` paginati
 Terminate pagination when `len(cards) < expected_per_page` (typically 15 for SURE).
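
The termination rule can be sketched as a small loop. `fetch_cards` is a hypothetical callable that returns the parsed cards for page N (for example by requesting `/page/{N}/` and selecting the card elements), and the `max_pages` safety cap is an added assumption, not something the docs prescribe.

```python
from typing import Callable


def fetch_all_pages(
    fetch_cards: Callable[[int], list],
    expected_per_page: int = 15,  # typical page size for SURE sites
    max_pages: int = 50,  # assumed safety cap against endless loops
) -> list:
    """Collect cards page by page until a short page signals the end."""
    all_cards: list = []
    for page in range(1, max_pages + 1):
        cards = fetch_cards(page)
        all_cards.extend(cards)
        if len(cards) < expected_per_page:
            break  # a partially filled (or empty) page is the last one
    return all_cards
```

When the total number of listings is an exact multiple of the page size, this loop makes one extra request that returns an empty page and then stops.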
 
 ### 4. Unknown CMS
+**File:** `src/adapters/ssr/schiedam.py`, `denhaag.py`, or `overige.py` depending on city — or add a new file if needed.
+
 Run the autoscraper tool:
 ```bash
 python autoscraper.py listings <listings-url>