docs: update scraper prompt with OG Online, SURE, Realworks patterns and postcode tip
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -138,7 +138,7 @@ def fetch_vdaal() -> list[RawListing]:
- `_text(soup, selector)` - Get inner text from element
- `_src(soup, selector)` - Get `src` or `data-src` attribute
- `_extract_postcode(text)` - Regex postcode from any text
- `_infer_stad(postcode)` - Simple lookup: 2600–2629 → Delft, 3100–3135 → Schiedam
- `_infer_stad(postcode)` - Simple lookup: 2600–2629 → Delft, 3100–3135 → Schiedam (Den Haag is not in this helper; use the city value from the broker directly)

---
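As a rough illustration of the range lookup described for `_infer_stad`, here is a minimal sketch. The function name `infer_stad_sketch` and the `_RANGES` table are hypothetical and not taken from the repository; only the two postcode ranges come from the helper description.

```python
import re
from typing import Optional

# Hypothetical table: only these two ranges are stated in the helper description.
_RANGES = [
    (2600, 2629, "Delft"),
    (3100, 3135, "Schiedam"),
]


def infer_stad_sketch(postcode: Optional[str]) -> Optional[str]:
    """Map the 4-digit part of a Dutch postcode to a city, or None if unknown."""
    if not postcode:
        return None
    m = re.match(r"[0-9]{4}", postcode.strip())
    if not m:
        return None
    number = int(m.group(0))
    for low, high, stad in _RANGES:
        if low <= number <= high:
            return stad
    # Outside the known ranges (e.g. Den Haag): the caller should fall back
    # to the city value supplied by the broker.
    return None
```

Returning `None` rather than guessing keeps the fallback to the broker's own city value explicit at the call site.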
@@ -203,19 +203,40 @@ Secrets (API keys, webhook URLs) are **environment variables**, not in config.

---

## CMS Detection Tool
## Platform / CMS Quick Identification

Before investigating a broker's HTML manually, prod the human in the loop to run `autoscraper.py` from the project root:
Before investigating a broker's HTML manually, check for known platforms in this order:

### 1. OG Online / realtime-listings (API, fastest)
Check whether `https://<base>/nl/realtime-listings/consumer` returns JSON when requested with the header `X-Requested-With: XMLHttpRequest`. If it does, this is a 10-line addition to `api.py`. Known brokers: bjornd, moerman, vandaal, elzenaar, doen.

Fields: `isSales`, `statusOrig`, `salesPrice`, `address`, `zipcode`, `city`, `rooms`, `bedrooms`, `livingSurface`, `plotSurface`, `dateOfConstruction`, `energyLabel`, `type`, `photo`, `url`.

Add a `_CITIES` set to filter by city if the broker covers a wide area. Skip listings with status `"rented"` or `"rented_ur"`.
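The endpoint check and the city/status filter could look roughly like the sketch below. It is not the code in `api.py`: the function names, the example `_CITIES` values, the use of `statusOrig` as the status field, and the `isSales` check are all assumptions to verify against a real response.

```python
import json
import urllib.request

# Assumed example values; set per broker.
_CITIES = {"Den Haag", "Voorburg"}
_SKIP_STATUSES = {"rented", "rented_ur"}


def fetch_listings_json(base_url: str, user_agent: str):
    """GET the realtime-listings endpoint with the XHR header and decode the JSON body."""
    req = urllib.request.Request(
        f"{base_url}/nl/realtime-listings/consumer",
        headers={"X-Requested-With": "XMLHttpRequest", "User-Agent": user_agent},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def keep_listing(item: dict) -> bool:
    """Keep sale listings in the configured cities that are not rented.

    Assumption: the status value lives in `statusOrig`.
    """
    if not item.get("isSales"):
        return False
    if item.get("statusOrig") in _SKIP_STATUSES:
        return False
    return item.get("city") in _CITIES
```

Keeping the filter separate from the HTTP call means it can be exercised on a saved response without hitting the broker's site again.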

### 2. Realworks CMS (SSR, one-liner)
Run `autoscraper.py` or check the HTML for `li.aanbodEntry`. If detected:
```python
def fetch_mybroker() -> list[RawListing]:
    return fetch_realworks("https://www.mybroker.nl", "mybroker")
```

### 3. SURE WordPress Plugin (SSR, ~50 lines)
Check the HTML for `sure-` CSS classes or the `?sure_koop_huur=koop` filter. There are two card variants:
- `a.card-house` (single dash), e.g. Olsthoorn
- `a.card--house` (double dash), e.g. Borgdorff

Both use `?sure_koop_huur=koop` to filter buy listings and `/page/{N}/` pagination. The detail page always has `#kenmerken li span span` pairs with labels like `status`, `soort woonhuis`/`soort woning`/`soort bouw`, `bouwjaar`, `gebruiksoppervlakte wonen`, `perceeloppervlakte`, `aantal slaapkamers`, `energielabel`. The postcode is often **not** available on the detail page.
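To illustrate how the `#kenmerken` label/value pairs might be turned into fields, here is a small sketch. It assumes the span texts have already been extracted in document order and alternate label, value, label, value; the target key names and the function name are hypothetical.

```python
# Labels as listed above; the target key names on the right are hypothetical.
_LABEL_TO_KEY = {
    "status": "status",
    "soort woonhuis": "type",
    "soort woning": "type",
    "soort bouw": "type",
    "bouwjaar": "bouwjaar",
    "gebruiksoppervlakte wonen": "woonoppervlakte",
    "perceeloppervlakte": "perceeloppervlakte",
    "aantal slaapkamers": "slaapkamers",
    "energielabel": "energielabel",
}


def kenmerken_to_dict(span_texts: list[str]) -> dict[str, str]:
    """Pair up alternating label/value texts and keep only the known labels."""
    result: dict[str, str] = {}
    # Even positions are labels, odd positions are values; an unpaired
    # trailing label is silently dropped by zip().
    for label, value in zip(span_texts[0::2], span_texts[1::2]):
        key = _LABEL_TO_KEY.get(label.strip().lower())
        if key and key not in result:  # first occurrence wins
            result[key] = value.strip()
    return result
```

Unknown labels are ignored rather than raising, since brokers tend to add their own rows to this list.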

Stop paginating when a page returns fewer cards than a full page, i.e. when `len(cards) < expected_per_page` (typically 15 for SURE).
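A minimal sketch of that pagination loop follows. The names `page_url` and `collect_cards`, the injected `fetch_cards` callable, the `max_pages` safety cap, and the assumption that page 1 has no `/page/1/` suffix are all illustrative and not taken from the repository.

```python
from typing import Callable


def page_url(listings_url: str, page: int) -> str:
    """Build the buy-listings URL for a page.

    Assumes `listings_url` ends with "/" and that page 1 needs no suffix.
    """
    suffix = "" if page == 1 else f"page/{page}/"
    return f"{listings_url}{suffix}?sure_koop_huur=koop"


def collect_cards(
    fetch_cards: Callable[[str], list],
    listings_url: str,
    expected_per_page: int = 15,
    max_pages: int = 50,
) -> list:
    """Fetch pages in order until one returns fewer cards than a full page."""
    cards: list = []
    for page in range(1, max_pages + 1):
        batch = fetch_cards(page_url(listings_url, page))
        cards.extend(batch)
        if len(batch) < expected_per_page:
            break  # short or empty page: this was the last one
    return cards
```

If the final page happens to be exactly full, the loop makes one extra request that returns no cards and then stops.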

### 4. Unknown CMS
Run the autoscraper tool:
```bash
python autoscraper.py listings <listings-url>
python autoscraper.py details <detail-page-url>
```

If the broker uses a known CMS, the tool prints the exact code to add, so no further investigation is needed. Currently detected CMSes:

- **Realworks** → prints a ready-to-paste `fetch_realworks(...)` one-liner for `ssr.py`

If the CMS is unknown, the tool prints structural diagnostics (card selectors, field patterns, pagination) to guide manual adapter development.
It prints structural diagnostics (card selectors, field patterns, pagination) to guide manual adapter development.
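The HTML-based checks for Realworks and SURE can be sketched as a small heuristic detector. This is not how `autoscraper.py` works internally (that is not shown here); the function names are hypothetical, the regexes only handle double-quoted `class` attributes, and the OG Online check is left out because it needs an HTTP request rather than HTML inspection.

```python
import re
from typing import Optional

_REALWORKS = re.compile(r'class="[^"]*\baanbodEntry\b')
_SURE = re.compile(r'class="[^"]*\b(?:sure-|card--?house\b)|sure_koop_huur=')


def detect_platform(html: str) -> str:
    """Return "realworks", "sure" or "unknown" based on markers in the HTML."""
    if _REALWORKS.search(html):
        return "realworks"
    if _SURE.search(html):
        return "sure"
    return "unknown"


def sure_card_selector(html: str) -> Optional[str]:
    """Tell the two SURE card variants apart; None if neither is present."""
    if re.search(r'class="[^"]*\bcard--house\b', html):
        return "a.card--house"
    if re.search(r'class="[^"]*\bcard-house\b', html):
        return "a.card-house"
    return None
```

A result of `"unknown"` is the cue to fall back to the `autoscraper.py` diagnostics.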

## Important Notes
@@ -240,6 +261,13 @@ status = _STATUS_MAP.get(item.get("status"), "beschikbaar")
### Postcode Extraction
Always aim for the **Dutch postcode format** (4 digits + 2 letters, e.g., `"2611CA"`). The travel time calculation depends on it. If a broker only provides the address string, use `_extract_postcode(address)`.

If a postcode field contains extra text (e.g., `"2522GW Den Haag"`), extract cleanly with:
```python
import re

# The \b anchors stop a following word from being read as the letter pair
# (without them, "1234 Den Haag" would yield "1234DE").
m = re.search(r"\b(\d{4})\s*([A-Z]{2})\b", raw.upper())
postcode = m.group(1) + m.group(2) if m else None
```
Never just `.replace(" ", "")`: that produces garbage like `"2522GWDenHaag"`.

### Price Handling
Prices are **integers** (euros), never floats. Use `parse_prijs()` for HTML.
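For illustration only, here is what turning a Dutch price string into an integer might look like. This is a hypothetical stand-in named `parse_prijs_sketch`, not the repository's `parse_prijs()`, whose actual behaviour is not shown here; it assumes dots are thousands separators.

```python
import re
from typing import Optional


def parse_prijs_sketch(text: str) -> Optional[int]:
    """Extract a euro amount such as "€ 450.000 k.k." as an int, or None."""
    # Prefer a dot-grouped number (450.000); otherwise take a plain digit run.
    m = re.search(r"\d{1,3}(?:\.\d{3})+|\d+", text)
    if not m:
        return None
    return int(m.group(0).replace(".", ""))
```

Returning `None` for strings like "Prijs op aanvraag" lets the caller decide whether to skip the listing or store it without a price.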
@@ -272,7 +300,8 @@ The database stores this as JSON in the `extra` column.
- Nominatim (geocoding) has a 1 req/s limiter built into `huizenbot.py`
- Never spawn parallel requests without the human's approval
- Always use the `USER_AGENT` header (includes contact info for respectful scraping)
- Don't keep curling the same endpoint. Pipe the response to `<broker-name>.dump` once, then `rg` through that file to find what you need. You can also pipe it through `bsprettify.py` first and `rg` the prettified output.
- Don't over-investigate pagination: confirm the card count on page 1, assume it is consistent across pages, and move on. Never fetch multiple pages just to verify the per-page count.

---
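The "fetch once, then search the dump" habit from the notes can also be expressed in Python. The sketch below is an assumption-laden illustration: `fetch_once` is a hypothetical helper, and the `fetch` callable stands in for whatever request function (with the `USER_AGENT` header) the project actually uses.

```python
from pathlib import Path
from typing import Callable


def fetch_once(url: str, dump_path: Path, fetch: Callable[[str], str]) -> str:
    """Return the saved dump if it exists; otherwise fetch once and save it."""
    if dump_path.exists():
        return dump_path.read_text(encoding="utf-8")
    body = fetch(url)
    dump_path.write_text(body, encoding="utf-8")
    return body
```

After the first call, all further inspection (for example `rg aanbodEntry mybroker.dump`) works on the local file and puts no extra load on the broker's site.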
@@ -1,4 +1,39 @@
# SSR
# OG Online / realtime-listings (fastest, API)

Check out the add_scraper_context.md, let's add a new scraper.

**Broker:** [name]
**Base URL:** [e.g. https://www.mybroker.nl]
**Cities to include:** [e.g. {"Den Haag", "Voorburg"}; omit if the broker is single-city]

_(No further investigation needed: the OG Online platform is fully understood.)_


# Realworks CMS (one-liner, SSR)

Check out the add_scraper_context.md, let's add a new scraper.

**Broker:** [name]
**Base URL:** [e.g. https://www.mybroker.nl]

_(No further investigation needed: the Realworks platform is fully understood.)_


# SURE WordPress Plugin (SSR)

Check out the add_scraper_context.md, let's add a new scraper.

**Broker:** [name]
**Base URL:** [e.g. https://www.mybroker.nl]
**Card selector:** [a.card-house or a.card--house]
**City filter:** [city name(s) to include, or "single city: no filter needed"]
**Cards per page:** [e.g. 15]

_(The detail page always uses #kenmerken li span span, so no further investigation is needed.)_


# SSR (custom)

Check out the add_scraper_context.md, let's add a new scraper.

**Broker:** [name]
@@ -16,7 +51,7 @@ Check out the add_scraper_context.md, let's add a new scraper.
**Notes:** [auth, JS rendering, price filter in URL, etc.]


# API
# API (custom)

Check out the add_scraper_context.md, let's add a new scraper.