From 709622020384567f9ee72245d3589201c4d33935 Mon Sep 17 00:00:00 2001
From: Mark Kalsbeek
Date: Sat, 4 Apr 2026 23:32:37 +0200
Subject: [PATCH] docs: update scraper prompt with OG Online, SURE, Realworks patterns and postcode tip

Co-Authored-By: Claude Sonnet 4.6
---
 add_scraper_context.md | 49 +++++++++++++++++++++++++++++++++---------
 new_scraper_prompt.md  | 39 +++++++++++++++++++++++++++++++--
 2 files changed, 76 insertions(+), 12 deletions(-)

diff --git a/add_scraper_context.md b/add_scraper_context.md
index f2eaab5..52b8a90 100644
--- a/add_scraper_context.md
+++ b/add_scraper_context.md
@@ -138,7 +138,7 @@ def fetch_vdaal() -> list[RawListing]:
 - `_text(soup, selector)` — Get inner text from element
 - `_src(soup, selector)` — Get src or data-src attribute
 - `_extract_postcode(text)` — Regex postcode from any text
-- `_infer_stad(postcode)` — Simple lookup: 2600–2629 → Delft, 3100–3135 → Schiedam
+- `_infer_stad(postcode)` — Simple lookup: 2600–2629 → Delft, 3100–3135 → Schiedam (Den Haag not in this helper; use the city value from the broker directly)
 
 ---
 
@@ -203,19 +203,40 @@ Secrets (API keys, webhook URLs) are **environment variables**, not in config.
 
 ---
 
-## CMS Detection Tool
+## Platform / CMS Quick Identification
 
-Before investigating a broker's HTML manually, prod the human in the loop to run `autoscraper.py` from the project root:
+Before investigating a broker's HTML manually, check for known platforms in this order:
+
+### 1. OG Online / realtime-listings (API — fastest)
+Check if `https:///nl/realtime-listings/consumer` returns JSON (with header `X-Requested-With: XMLHttpRequest`). If yes, this is a 10-line addition to `api.py`. Known brokers: bjornd, moerman, vandaal, elzenaar, doen.
+
+Fields: `isSales`, `statusOrig`, `salesPrice`, `address`, `zipcode`, `city`, `rooms`, `bedrooms`, `livingSurface`, `plotSurface`, `dateOfConstruction`, `energyLabel`, `type`, `photo`, `url`.
+
+Add a `_CITIES` set to filter by city if the broker covers a wide area. Skip statuses `"rented"` and `"rented_ur"`.
+
+### 2. Realworks CMS (SSR — one liner)
+Run `autoscraper.py` or check HTML for `li.aanbodEntry`. If detected:
+```python
+def fetch_mybroker() -> list[RawListing]:
+    return fetch_realworks("https://www.mybroker.nl", "mybroker")
+```
+
+### 3. SURE WordPress Plugin (SSR — ~50 lines)
+Check HTML for `sure-` CSS classes or `?sure_koop_huur=koop` filter. Two card variants:
+- `a.card-house` (single dash) — e.g. Olsthoorn
+- `a.card--house` (double dash) — e.g. Borgdorff
+
+Both use `?sure_koop_huur=koop` to filter buy listings and `/page/{N}/` pagination. Detail page always has `#kenmerken li span span` pairs with labels like `status`, `soort woonhuis`/`soort woning`/`soort bouw`, `bouwjaar`, `gebruiksoppervlakte wonen`, `perceeloppervlakte`, `aantal slaapkamers`, `energielabel`. Postcode is often **not** available on the detail page.
+
+Terminate pagination when `len(cards) < expected_per_page` (typically 15 for SURE).
+
+### 4. Unknown CMS
+Run the autoscraper tool:
 ```bash
 python autoscraper.py listings
 python autoscraper.py details
 ```
-
-If the broker uses a known CMS, the tool prints the exact code to add — no further investigation needed. Currently detected CMSes:
-
-- **Realworks** → prints a ready-to-paste `fetch_realworks(...)` one-liner for `ssr.py`
-
-If the CMS is unknown, the tool prints structural diagnostics (card selectors, field patterns, pagination) to guide manual adapter development.
+It prints structural diagnostics (card selectors, field patterns, pagination) to guide manual adapter development.
 
 ## Important Notes
 
@@ -240,6 +261,13 @@ status = _STATUS_MAP.get(item.get("status"), "beschikbaar")
 
 ### Postcode Extraction
 Always aim for the **Dutch postcode format** (4 digits + 2 letters, e.g., `"2611CA"`). The travel time calculation depends on it. If a broker only provides the address string, use `_extract_postcode(address)`.
+If a postcode field contains extra text (e.g., `"2522GW Den Haag"`), extract cleanly with:
+```python
+m = re.search(r"\d{4}\s*[A-Z]{2}", raw.upper())
+postcode = m.group(0).replace(" ", "") if m else None
+```
+Never just `.replace(" ", "")` — that produces garbage like `"2522GWDenHaag"`.
+
 ### Price Handling
 Prices are **integers** (euros), never floats. Use `parse_prijs()` for HTML.
 
@@ -272,7 +300,8 @@ The database stores this as JSON in the `extra` column.
 - Nominatim (geocoding) has a 1 req/s limiter built into `huizenbot.py`
 - Never spawn parallel requests without the human's approval
 - Always use the `USER_AGENT` header (includes contact info for respectful scraping)
-- Don't keep curling the same endpoint, pipe it to a .dump and then rg through it to find what you need. Can also pipe it through the bsprettify.py and then rg that.
+- Don't keep curling the same endpoint, pipe it to a .dump and then rg through it to find what you need. Can also pipe it through the bsprettify.py and then rg that.
+- Don't over-investigate pagination — confirm card count on page 1, assume it's consistent across pages, move on. Never fetch multiple pages just to verify the per-page count.
 
 ---
 
diff --git a/new_scraper_prompt.md b/new_scraper_prompt.md
index b7db7d8..2e7d1b0 100644
--- a/new_scraper_prompt.md
+++ b/new_scraper_prompt.md
@@ -1,4 +1,39 @@
-# SSR
+# OG Online / realtime-listings (fastest — API)
+
+Check out the add_scraper_context.md, let's add a new scraper.
+
+**Broker:** [name]
+**Base URL:** [e.g. https://www.mybroker.nl]
+**Cities to include:** [e.g. {"Den Haag", "Voorburg"} — omit if broker is single-city]
+
+_(No further investigation needed — OG Online platform is fully understood.)_
+
+
+# Realworks CMS (one-liner — SSR)
+
+Check out the add_scraper_context.md, let's add a new scraper.
+
+**Broker:** [name]
+**Base URL:** [e.g. https://www.mybroker.nl]
+
+_(No further investigation needed — Realworks platform is fully understood.)_
+
+
+# SURE WordPress Plugin (SSR)
+
+Check out the add_scraper_context.md, let's add a new scraper.
+
+**Broker:** [name]
+**Base URL:** [e.g. https://www.mybroker.nl]
+**Card selector:** [a.card-house or a.card--house]
+**City filter:** [city name(s) to include, or "single city — no filter needed"]
+**Cards per page:** [e.g. 15]
+
+_(Detail page always uses #kenmerken li span span — no further investigation needed.)_
+
+
+# SSR (custom)
+
 Check out the add_scraper_context.md, let's add a new scraper.
 
 **Broker:** [name]
@@ -16,7 +51,7 @@ Check out the add_scraper_context.md, let's add a new scraper.
 **Notes:** [auth, JS rendering, price filter in URL, etc.]
 
 
-# API
+# API (custom)
 Check out the add_scraper_context.md, let's add a new scraper.