docs: update scraper prompt with OG Online, SURE, Realworks patterns and postcode tip

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 23:32:37 +02:00
parent 75c5b6f26d
commit 7096220203
2 changed files with 76 additions and 12 deletions


@@ -138,7 +138,7 @@ def fetch_vdaal() -> list[RawListing]:
- `_text(soup, selector)` — Get inner text from element - `_text(soup, selector)` — Get inner text from element
- `_src(soup, selector)` — Get src or data-src attribute - `_src(soup, selector)` — Get src or data-src attribute
- `_extract_postcode(text)` — Regex postcode from any text - `_extract_postcode(text)` — Regex postcode from any text
- `_infer_stad(postcode)` — Simple lookup: 2600–2629 → Delft, 3100–3135 → Schiedam - `_infer_stad(postcode)` — Simple lookup: 2600–2629 → Delft, 3100–3135 → Schiedam (Den Haag not in this helper; use the city value from the broker directly)
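
A minimal sketch of the kind of range lookup `_infer_stad` is described as doing, assuming the ranges read 2600 to 2629 for Delft and 3100 to 3135 for Schiedam. The name `infer_stad_sketch` and the `_RANGES` table are hypothetical; this is not the project's actual helper.

```python
from typing import Optional

# Hypothetical range table (inclusive bounds), assumed from the helper's description.
_RANGES = [
    (2600, 2629, "Delft"),
    (3100, 3135, "Schiedam"),
]


def infer_stad_sketch(postcode: str) -> Optional[str]:
    """Return the city for a Dutch postcode, or None if it falls outside the known ranges."""
    digits = postcode.strip()[:4]
    if not digits.isdigit():
        return None
    number = int(digits)
    for low, high, stad in _RANGES:
        if low <= number <= high:
            return stad
    # Den Haag and other cities are not covered; use the broker's own city value instead.
    return None
```

Returning `None` for anything outside the table keeps the caller responsible for falling back to the city supplied by the broker.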
--- ---
@@ -203,19 +203,40 @@ Secrets (API keys, webhook URLs) are **environment variables**, not in config.
--- ---
## CMS Detection Tool ## Platform / CMS Quick Identification
Before investigating a broker's HTML manually, prod the human in the loop to run `autoscraper.py` from the project root: Before investigating a broker's HTML manually, check for known platforms in this order:
### 1. OG Online / realtime-listings (API — fastest)
Check if `https://<base>/nl/realtime-listings/consumer` returns JSON (with header `X-Requested-With: XMLHttpRequest`). If yes, this is a 10-line addition to `api.py`. Known brokers: bjornd, moerman, vandaal, elzenaar, doen.
Fields: `isSales`, `statusOrig`, `salesPrice`, `address`, `zipcode`, `city`, `rooms`, `bedrooms`, `livingSurface`, `plotSurface`, `dateOfConstruction`, `energyLabel`, `type`, `photo`, `url`.
Add a `_CITIES` set to filter by city if the broker covers a wide area. Skip statuses `"rented"` and `"rented_ur"`.
### 2. Realworks CMS (SSR — one liner)
Run `autoscraper.py` or check HTML for `li.aanbodEntry`. If detected:
```python
def fetch_mybroker() -> list[RawListing]:
    return fetch_realworks("https://www.mybroker.nl", "mybroker")
```
### 3. SURE WordPress Plugin (SSR — ~50 lines)
Check HTML for `sure-` CSS classes or `?sure_koop_huur=koop` filter. Two card variants:
- `a.card-house` (single dash) — e.g. Olsthoorn
- `a.card--house` (double dash) — e.g. Borgdorff
Both use `?sure_koop_huur=koop` to filter buy listings and `/page/{N}/` pagination. Detail page always has `#kenmerken li span span` pairs with labels like `status`, `soort woonhuis`/`soort woning`/`soort bouw`, `bouwjaar`, `gebruiksoppervlakte wonen`, `perceeloppervlakte`, `aantal slaapkamers`, `energielabel`. Postcode is often **not** available on the detail page.
Terminate pagination when `len(cards) < expected_per_page` (typically 15 for SURE).
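
A minimal sketch of that termination rule, with the page fetcher injected so the loop stays independent of any HTTP or parsing code. `collect_cards`, `fetch_page`, and `max_pages` are hypothetical names introduced for illustration; in a real adapter `fetch_page(n)` would request `/page/{n}/` and return the parsed cards.

```python
from typing import Callable, List


def collect_cards(
    fetch_page: Callable[[int], List[str]],
    expected_per_page: int = 15,
    max_pages: int = 50,
) -> List[str]:
    """Collect cards page by page, stopping at the first page that is not full."""
    cards: List[str] = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        cards.extend(batch)
        # A short (or empty) page means this was the last one.
        if len(batch) < expected_per_page:
            break
    return cards
```

The `max_pages` cap is a safety net, added here as an assumption, so a site that always returns full pages cannot cause an endless loop.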
### 4. Unknown CMS
Run the autoscraper tool:
```bash ```bash
python autoscraper.py listings <listings-url> python autoscraper.py listings <listings-url>
python autoscraper.py details <detail-page-url> python autoscraper.py details <detail-page-url>
``` ```
It prints structural diagnostics (card selectors, field patterns, pagination) to guide manual adapter development.
If the broker uses a known CMS, the tool prints the exact code to add — no further investigation needed. Currently detected CMSes:
- **Realworks** → prints a ready-to-paste `fetch_realworks(...)` one-liner for `ssr.py`
If the CMS is unknown, the tool prints structural diagnostics (card selectors, field patterns, pagination) to guide manual adapter development.
## Important Notes ## Important Notes
@@ -240,6 +261,13 @@ status = _STATUS_MAP.get(item.get("status"), "beschikbaar")
### Postcode Extraction ### Postcode Extraction
Always aim for the **Dutch postcode format** (4 digits + 2 letters, e.g., `"2611CA"`). The travel time calculation depends on it. If a broker only provides the address string, use `_extract_postcode(address)`. Always aim for the **Dutch postcode format** (4 digits + 2 letters, e.g., `"2611CA"`). The travel time calculation depends on it. If a broker only provides the address string, use `_extract_postcode(address)`.
If a postcode field contains extra text (e.g., `"2522GW Den Haag"`), extract cleanly with:
```python
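import re
# Match 4 digits, optional whitespace, 2 letters; trailing text such as the city name is ignored.
m = re.search(r"\d{4}\s*[A-Z]{2}", raw.upper())
postcode = m.group(0).replace(" ", "") if m else None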
```
Never just `.replace(" ", "")` — that produces garbage like `"2522GWDenHaag"`.
### Price Handling ### Price Handling
Prices are **integers** (euros), never floats. Use `parse_prijs()` for HTML. Prices are **integers** (euros), never floats. Use `parse_prijs()` for HTML.
@@ -273,6 +301,7 @@ The database stores this as JSON in the `extra` column.
- Never spawn parallel requests without the human's approval - Never spawn parallel requests without the human's approval
- Always use the `USER_AGENT` header (includes contact info for respectful scraping) - Always use the `USER_AGENT` header (includes contact info for respectful scraping)
- Don't keep curling the same endpoint; pipe the response to `<name makelaar>.dump` and then `rg` through it to find what you need. You can also pipe it through `bsprettify.py` and then `rg` that. - Don't keep curling the same endpoint; pipe the response to `<name makelaar>.dump` and then `rg` through it to find what you need. You can also pipe it through `bsprettify.py` and then `rg` that.
- Don't over-investigate pagination — confirm card count on page 1, assume it's consistent across pages, move on. Never fetch multiple pages just to verify the per-page count.
--- ---


@@ -1,4 +1,39 @@
# SSR # OG Online / realtime-listings (fastest — API)
Check out the add_scraper_context.md, let's add a new scraper.
**Broker:** [name]
**Base URL:** [e.g. https://www.mybroker.nl]
**Cities to include:** [e.g. {"Den Haag", "Voorburg"} — omit if broker is single-city]
_(No further investigation needed — OG Online platform is fully understood.)_
# Realworks CMS (one-liner — SSR)
Check out the add_scraper_context.md, let's add a new scraper.
**Broker:** [name]
**Base URL:** [e.g. https://www.mybroker.nl]
_(No further investigation needed — Realworks platform is fully understood.)_
# SURE WordPress Plugin (SSR)
Check out the add_scraper_context.md, let's add a new scraper.
**Broker:** [name]
**Base URL:** [e.g. https://www.mybroker.nl]
**Card selector:** [a.card-house or a.card--house]
**City filter:** [city name(s) to include, or "single city — no filter needed"]
**Cards per page:** [e.g. 15]
_(Detail page always uses #kenmerken li span span — no further investigation needed.)_
# SSR (custom)
Check out the add_scraper_context.md, let's add a new scraper. Check out the add_scraper_context.md, let's add a new scraper.
**Broker:** [name] **Broker:** [name]
@@ -16,7 +51,7 @@ Check out the add_scraper_context.md, let's add a new scraper.
**Notes:** [auth, JS rendering, price filter in URL, etc.] **Notes:** [auth, JS rendering, price filter in URL, etc.]
# API # API (custom)
Check out the add_scraper_context.md, let's add a new scraper. Check out the add_scraper_context.md, let's add a new scraper.