docs: update scraper prompt with OG Online, SURE, Realworks patterns and postcode tip

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 23:32:37 +02:00
parent 75c5b6f26d
commit 7096220203
2 changed files with 76 additions and 12 deletions


@@ -138,7 +138,7 @@ def fetch_vdaal() -> list[RawListing]:
- `_text(soup, selector)` — Get inner text from element
- `_src(soup, selector)` — Get src or data-src attribute
- `_extract_postcode(text)` — Regex postcode from any text
- `_infer_stad(postcode)` — Simple lookup: 2600–2629 → Delft, 3100–3135 → Schiedam
- `_infer_stad(postcode)` — Simple lookup: 2600–2629 → Delft, 3100–3135 → Schiedam (Den Haag not in this helper; use the city value from the broker directly)
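
For orientation, a range lookup of this kind could look roughly like the sketch below. It is not the project's actual `_infer_stad`. The name `infer_stad_sketch`, the `_STAD_RANGES` table, and the reading of the ranges as 2600 to 2629 and 3100 to 3135 are all assumptions made for illustration.

```python
from typing import Optional

# Hypothetical table: (first postcode number, last postcode number, city).
_STAD_RANGES = [
    (2600, 2629, "Delft"),
    (3100, 3135, "Schiedam"),
]


def infer_stad_sketch(postcode: str) -> Optional[str]:
    """Return the city for a Dutch postcode, or None if it is not in the table."""
    digits = postcode.strip()[:4]
    if len(digits) != 4 or not digits.isdigit():
        return None
    number = int(digits)
    for first, last, stad in _STAD_RANGES:
        if first <= number <= last:
            return stad
    # Cities outside the table (e.g. Den Haag) fall through to None,
    # so the caller should use the city value supplied by the broker.
    return None
```

With this sketch, `infer_stad_sketch("2611CA")` gives `"Delft"`, and a Den Haag postcode such as `"2522GW"` gives `None`.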
---
@@ -203,19 +203,40 @@ Secrets (API keys, webhook URLs) are **environment variables**, not in config.
---
## CMS Detection Tool
## Platform / CMS Quick Identification
Before investigating a broker's HTML manually, prod the human in the loop to run `autoscraper.py` from the project root:
Before investigating a broker's HTML manually, check for known platforms in this order:
### 1. OG Online / realtime-listings (API — fastest)
Check if `https://<base>/nl/realtime-listings/consumer` returns JSON (with header `X-Requested-With: XMLHttpRequest`). If yes, this is a 10-line addition to `api.py`. Known brokers: bjornd, moerman, vandaal, elzenaar, doen.
Fields: `isSales`, `statusOrig`, `salesPrice`, `address`, `zipcode`, `city`, `rooms`, `bedrooms`, `livingSurface`, `plotSurface`, `dateOfConstruction`, `energyLabel`, `type`, `photo`, `url`.
Add a `_CITIES` set to filter by city if the broker covers a wide area. Skip statuses `"rented"` and `"rented_ur"`.
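
A minimal sketch of the probe and the filter described above, using only the standard library. The helper names (`build_probe_request`, `looks_like_og_online`, `keep_listing`) are hypothetical and not part of `api.py`. It also assumes that the rented statuses appear in the `statusOrig` field, which the notes above do not state explicitly.

```python
import json
import urllib.request
from typing import Optional

# Statuses to skip, as listed in the notes above.
_SKIP_STATUSES = {"rented", "rented_ur"}


def build_probe_request(base_url: str) -> urllib.request.Request:
    """Build (but do not send) the request used to probe for OG Online."""
    url = base_url.rstrip("/") + "/nl/realtime-listings/consumer"
    return urllib.request.Request(
        url, headers={"X-Requested-With": "XMLHttpRequest"}
    )


def looks_like_og_online(body: bytes) -> bool:
    """True if the response body parses as JSON (an HTML page will not)."""
    try:
        json.loads(body)
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        return False
    return True


def keep_listing(item: dict, cities: Optional[set[str]] = None) -> bool:
    """Apply the status skip and the optional city filter to one listing."""
    # Assumption: the status values live in `statusOrig`.
    if item.get("statusOrig") in _SKIP_STATUSES:
        return False
    if cities is not None and item.get("city") not in cities:
        return False
    return True
```

Sending the request is left out on purpose. In the real adapter it should also carry the `USER_AGENT` header and respect the rate-limit rules in this document.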
### 2. Realworks CMS (SSR — one liner)
Run `autoscraper.py` or check HTML for `li.aanbodEntry`. If detected:
```python
def fetch_mybroker() -> list[RawListing]:
return fetch_realworks("https://www.mybroker.nl", "mybroker")
```
### 3. SURE WordPress Plugin (SSR — ~50 lines)
Check HTML for `sure-` CSS classes or `?sure_koop_huur=koop` filter. Two card variants:
- `a.card-house` (single dash) — e.g. Olsthoorn
- `a.card--house` (double dash) — e.g. Borgdorff
Both use `?sure_koop_huur=koop` to filter buy listings and `/page/{N}/` pagination. Detail page always has `#kenmerken li span span` pairs with labels like `status`, `soort woonhuis`/`soort woning`/`soort bouw`, `bouwjaar`, `gebruiksoppervlakte wonen`, `perceeloppervlakte`, `aantal slaapkamers`, `energielabel`. Postcode is often **not** available on the detail page.
Terminate pagination when `len(cards) < expected_per_page` (typically 15 for SURE).
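
The pagination rule can be sketched as follows. `page_url` and `collect_cards` are hypothetical names. `listings_url` is assumed to be the broker's listings overview URL, and `fetch_cards` stands in for whatever function downloads one page and returns its cards (an empty list for a missing page). The sketch also assumes page 1 is the overview URL without a `/page/1/` segment.

```python
from typing import Callable


def page_url(listings_url: str, page: int) -> str:
    """Build the URL for page N of a SURE listings overview (buy listings)."""
    base = listings_url.rstrip("/")
    if page <= 1:
        return f"{base}/?sure_koop_huur=koop"
    return f"{base}/page/{page}/?sure_koop_huur=koop"


def collect_cards(
    fetch_cards: Callable[[int], list],
    expected_per_page: int = 15,
    max_pages: int = 50,
) -> list:
    """Fetch pages in order and stop after the first short page."""
    cards: list = []
    for page in range(1, max_pages + 1):
        batch = fetch_cards(page)
        cards.extend(batch)
        # A short (or empty) page means this was the last one.
        if len(batch) < expected_per_page:
            break
    return cards
```

The `max_pages` cap is an extra safeguard not mentioned in the notes. It keeps the loop bounded if a site keeps returning full pages.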
### 4. Unknown CMS
Run the autoscraper tool:
```bash
python autoscraper.py listings <listings-url>
python autoscraper.py details <detail-page-url>
```
If the broker uses a known CMS, the tool prints the exact code to add — no further investigation needed. Currently detected CMSes:
- **Realworks** → prints a ready-to-paste `fetch_realworks(...)` one-liner for `ssr.py`
If the CMS is unknown, the tool prints structural diagnostics (card selectors, field patterns, pagination) to guide manual adapter development.
It prints structural diagnostics (card selectors, field patterns, pagination) to guide manual adapter development.
## Important Notes
@@ -240,6 +261,13 @@ status = _STATUS_MAP.get(item.get("status"), "beschikbaar")
### Postcode Extraction
Always aim for the **Dutch postcode format** (4 digits + 2 letters, e.g., `"2611CA"`). The travel time calculation depends on it. If a broker only provides the address string, use `_extract_postcode(address)`.
If a postcode field contains extra text (e.g., `"2522GW Den Haag"`), extract cleanly with:
```python
m = re.search(r"\d{4}\s*[A-Z]{2}", raw.upper())
postcode = m.group(0).replace(" ", "") if m else None
```
Never just `.replace(" ", "")` — that produces garbage like `"2522GWDenHaag"`.
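
Wrapped in a helper, the same extraction might look like the sketch below. `clean_postcode` is a hypothetical name, not an existing function in the project. It strips any whitespace inside the match rather than only plain spaces, which is a small deviation from the snippet above.

```python
import re
from typing import Optional

_POSTCODE_RE = re.compile(r"\d{4}\s*[A-Z]{2}")


def clean_postcode(raw: str) -> Optional[str]:
    """Return a postcode like '2611CA' from messy input, or None if absent."""
    m = _POSTCODE_RE.search(raw.upper())
    if m is None:
        return None
    # Remove whitespace between the digits and the letters.
    return re.sub(r"\s+", "", m.group(0))


print(clean_postcode("2522GW Den Haag"))
print(clean_postcode("2611 ca"))
print(clean_postcode("Den Haag"))
```

Because the input is uppercased first, any four digits followed by two letters will match. That makes the helper suitable for postcode or address fields, but less so for arbitrary free text.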
### Price Handling
Prices are **integers** (euros), never floats. Use `parse_prijs()` for HTML.
@@ -272,7 +300,8 @@ The database stores this as JSON in the `extra` column.
- Nominatim (geocoding) has a 1 req/s limiter built into `huizenbot.py`
- Never spawn parallel requests without the human's approval
- Always use the `USER_AGENT` header (includes contact info for respectful scraping)
- Don't keep curling the same endpoint. Save the response once to `<broker-name>.dump` and `rg` through it to find what you need. You can also pipe it through `bsprettify.py` first and `rg` the prettified output.
- Don't over-investigate pagination — confirm card count on page 1, assume it's consistent across pages, move on. Never fetch multiple pages just to verify the per-page count.
---


@@ -1,4 +1,39 @@
# SSR
# OG Online / realtime-listings (fastest — API)
Check out the add_scraper_context.md, let's add a new scraper.
**Broker:** [name]
**Base URL:** [e.g. https://www.mybroker.nl]
**Cities to include:** [e.g. {"Den Haag", "Voorburg"} — omit if broker is single-city]
_(No further investigation needed — OG Online platform is fully understood.)_
# Realworks CMS (one-liner — SSR)
Check out the add_scraper_context.md, let's add a new scraper.
**Broker:** [name]
**Base URL:** [e.g. https://www.mybroker.nl]
_(No further investigation needed — Realworks platform is fully understood.)_
# SURE WordPress Plugin (SSR)
Check out the add_scraper_context.md, let's add a new scraper.
**Broker:** [name]
**Base URL:** [e.g. https://www.mybroker.nl]
**Card selector:** [a.card-house or a.card--house]
**City filter:** [city name(s) to include, or "single city — no filter needed"]
**Cards per page:** [e.g. 15]
_(Detail page always uses #kenmerken li span span — no further investigation needed.)_
# SSR (custom)
Check out the add_scraper_context.md, let's add a new scraper.
**Broker:** [name]
@@ -16,7 +51,7 @@ Check out the add_scraper_context.md, let's add a new scraper.
**Notes:** [auth, JS rendering, price filter in URL, etc.]
# API
# API (custom)
Check out the add_scraper_context.md, let's add a new scraper.