Compare commits
3 Commits
bfd69e3542...7096220203

| Author | SHA1 | Date |
|---|---|---|
| | 7096220203 | |
| | 75c5b6f26d | |
| | 6beae1133b | |
@@ -138,7 +138,7 @@ def fetch_vdaal() -> list[RawListing]:
- `_text(soup, selector)` — Get inner text from element
- `_src(soup, selector)` — Get src or data-src attribute
- `_extract_postcode(text)` — Regex postcode from any text
- `_infer_stad(postcode)` — Simple lookup: 2600–2629 → Delft, 3100–3135 → Schiedam
- `_infer_stad(postcode)` — Simple lookup: 2600–2629 → Delft, 3100–3135 → Schiedam (Den Haag not in this helper; use the city value from the broker directly)
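
The range lookup described for `_infer_stad` can be sketched as follows. This is a minimal illustration, not the project's helper: the name `infer_stad_sketch` and the `_RANGES` table are hypothetical, and only the two postcode ranges come from the bullet above.

```python
from typing import Optional

# (first, last, city) ranges on the 4-digit part of the postcode,
# taken from the helper description; the table name is hypothetical.
_RANGES = [
    (2600, 2629, "Delft"),
    (3100, 3135, "Schiedam"),
]


def infer_stad_sketch(postcode: Optional[str]) -> Optional[str]:
    """Return the city for a Dutch postcode, or None if it is not in a known range."""
    if not postcode or not postcode[:4].isdigit():
        return None
    number = int(postcode[:4])
    for first, last, city in _RANGES:
        if first <= number <= last:
            return city
    # Den Haag and other cities are deliberately not covered here.
    return None
```

A postcode outside both ranges (for example a Den Haag one) yields `None`, which is why the city value supplied by the broker should be used in that case.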

---

@@ -203,19 +203,40 @@ Secrets (API keys, webhook URLs) are **environment variables**, not in config.

---

## CMS Detection Tool
## Platform / CMS Quick Identification

Before investigating a broker's HTML manually, ask the human in the loop to run `autoscraper.py` from the project root:
Before investigating a broker's HTML manually, check for known platforms in this order:

### 1. OG Online / realtime-listings (API — fastest)
Check whether `https://<base>/nl/realtime-listings/consumer` returns JSON when requested with the header `X-Requested-With: XMLHttpRequest`. If it does, adding the broker is roughly a 10-line addition to `api.py`. Known brokers: bjornd, moerman, vandaal, elzenaar, doen.

Fields: `isSales`, `statusOrig`, `salesPrice`, `address`, `zipcode`, `city`, `rooms`, `bedrooms`, `livingSurface`, `plotSurface`, `dateOfConstruction`, `energyLabel`, `type`, `photo`, `url`.

Add a `_CITIES` set to filter by city if the broker covers a wide area. Skip statuses `"rented"` and `"rented_ur"`.

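
The filtering rules for OG Online items can be sketched as a small predicate. This is a hedged sketch, not code from `api.py`: the function name `keep_og_item`, its parameters, and the sample items are hypothetical; only the field names and the skipped statuses come from the text above.

```python
from typing import Any, Optional

# Statuses the text says to skip.
SKIP_STATUSES = {"rented", "rented_ur"}


def keep_og_item(
    item: dict[str, Any],
    cities: Optional[set[str]] = None,
    max_price: Optional[int] = None,
) -> bool:
    """Return True if an OG Online item is a sale listing worth keeping."""
    if not item.get("isSales"):
        return False
    if item.get("statusOrig") in SKIP_STATUSES:
        return False
    if cities is not None and item.get("city") not in cities:
        return False
    # salesPrice may be missing or null; treat both as 0 for the comparison.
    price = item.get("salesPrice") or 0
    if max_price is not None and price > max_price:
        return False
    return True


# Made-up sample payload for illustration only.
sample = [
    {"isSales": True, "statusOrig": "available", "city": "Den Haag",
     "salesPrice": 400000, "address": "Laan 1"},
    {"isSales": False, "statusOrig": "available", "city": "Den Haag",
     "salesPrice": 1500, "address": "Laan 2"},
    {"isSales": True, "statusOrig": "rented", "city": "Den Haag",
     "salesPrice": 300000, "address": "Laan 3"},
    {"isSales": True, "statusOrig": "available", "city": "Monster",
     "salesPrice": 350000, "address": "Laan 4"},
]
kept = [i["address"] for i in sample
        if keep_og_item(i, cities={"Den Haag"}, max_price=500000)]
```

Passing `cities=None` skips the city filter, which matches the single-city case where no `_CITIES` set is needed.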
### 2. Realworks CMS (SSR — one liner)
Run `autoscraper.py` or check HTML for `li.aanbodEntry`. If detected:
```python
def fetch_mybroker() -> list[RawListing]:
    return fetch_realworks("https://www.mybroker.nl", "mybroker")
```

### 3. SURE WordPress Plugin (SSR — ~50 lines)
Check HTML for `sure-` CSS classes or `?sure_koop_huur=koop` filter. Two card variants:
- `a.card-house` (single dash) — e.g. Olsthoorn
- `a.card--house` (double dash) — e.g. Borgdorff

Both use `?sure_koop_huur=koop` to filter buy listings and `/page/{N}/` pagination. Detail page always has `#kenmerken li span span` pairs with labels like `status`, `soort woonhuis`/`soort woning`/`soort bouw`, `bouwjaar`, `gebruiksoppervlakte wonen`, `perceeloppervlakte`, `aantal slaapkamers`, `energielabel`. Postcode is often **not** available on the detail page.

Terminate pagination when `len(cards) < expected_per_page` (typically 15 for SURE).

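
The termination rule for SURE pagination can be sketched in isolation. This is a minimal sketch under stated assumptions: `collect_pages` and the injected `fetch_cards` callable are hypothetical names, and the fake fetcher stands in for a real page request so the loop can be run offline.

```python
from typing import Callable


def collect_pages(fetch_cards: Callable[[int], list], per_page: int = 15) -> list:
    """Collect cards page by page until a short (or empty) page is seen."""
    collected = []
    page = 1
    while True:
        cards = fetch_cards(page)
        collected.extend(cards)
        # A page with fewer cards than expected is the last one.
        if len(cards) < per_page:
            break
        page += 1
    return collected


# Fake site with 32 cards, 15 per page, to exercise the loop offline.
_all_cards = list(range(32))
pages_requested = []


def fake_fetch(page: int) -> list:
    pages_requested.append(page)
    start = (page - 1) * 15
    return _all_cards[start:start + 15]


result = collect_pages(fake_fetch)
```

When the total is an exact multiple of the page size, the loop requests one extra page, which comes back empty and ends the loop.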
### 4. Unknown CMS
Run the autoscraper tool:
```bash
python autoscraper.py listings <listings-url>
python autoscraper.py details <detail-page-url>
```

If the broker uses a known CMS, the tool prints the exact code to add — no further investigation needed. Currently detected CMSes:

- **Realworks** → prints a ready-to-paste `fetch_realworks(...)` one-liner for `ssr.py`

If the CMS is unknown, the tool prints structural diagnostics (card selectors, field patterns, pagination) to guide manual adapter development.
It prints structural diagnostics (card selectors, field patterns, pagination) to guide manual adapter development.

## Important Notes

@@ -240,6 +261,13 @@ status = _STATUS_MAP.get(item.get("status"), "beschikbaar")
### Postcode Extraction
Always aim for the **Dutch postcode format** (4 digits + 2 letters, e.g., `"2611CA"`). The travel time calculation depends on it. If a broker only provides the address string, use `_extract_postcode(address)`.

If a postcode field contains extra text (e.g., `"2522GW Den Haag"`), extract cleanly with:
```python
m = re.search(r"\d{4}\s*[A-Z]{2}", raw.upper())
postcode = m.group(0).replace(" ", "") if m else None
```
Never just `.replace(" ", "")` — that produces garbage like `"2522GWDenHaag"`.

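
The extraction snippet can be wrapped in a small function so the `None` and lowercase cases are handled in one place. This is a sketch: `clean_postcode` is a hypothetical name and is not the project's `_extract_postcode`; the regex is the one given in the text.

```python
import re
from typing import Optional

# 4 digits, optional whitespace, 2 letters (applied to upper-cased input).
_POSTCODE_RE = re.compile(r"\d{4}\s*[A-Z]{2}")


def clean_postcode(raw: Optional[str]) -> Optional[str]:
    """Return a compact Dutch postcode such as '2522GW', or None if none is found."""
    if not raw:
        return None
    m = _POSTCODE_RE.search(raw.upper())
    return m.group(0).replace(" ", "") if m else None
```

Note that the pattern has no word boundary after the two letters, so a string like `"1234 Amsterdam"` would still produce a match; it is best applied to fields that are expected to contain a postcode.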
### Price Handling
Prices are **integers** (euros), never floats. Use `parse_prijs()` for HTML.

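
To illustrate the integer-only rule, here is a hedged sketch of how a price string from HTML could be reduced to whole euros. It is not the project's `parse_prijs()`, whose implementation is not shown here; `parse_price_sketch` is a hypothetical stand-in and the sample strings are made up.

```python
import re
from typing import Optional


def parse_price_sketch(text: Optional[str]) -> Optional[int]:
    """Return the price in whole euros as an int, or None if no number is present."""
    if not text:
        return None
    # Take the first run of digits, dots and spaces; stop at a comma so that
    # any decimal part ("425.000,50") is dropped rather than appended.
    m = re.search(r"\d[\d.\s]*", text)
    if not m:
        return None
    digits = re.sub(r"\D", "", m.group(0))
    return int(digits) if digits else None
```

Dots are treated as Dutch thousands separators, so `"€ 425.000 k.k."` becomes `425000` and never a float.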
@@ -272,7 +300,8 @@ The database stores this as JSON in the `extra` column.
- Nominatim (geocoding) has a 1 req/s limiter built into `huizenbot.py`
- Never spawn parallel requests without the human's approval
- Always use the `USER_AGENT` header (includes contact info for respectful scraping)
- Don't keep curling the same endpoint; pipe the response to a `<broker-name>.dump` file and then `rg` through it to find what you need. You can also pipe it through `bsprettify.py` and `rg` that output.
- Don't over-investigate pagination — confirm card count on page 1, assume it's consistent across pages, move on. Never fetch multiple pages just to verify the per-page count.
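
A 1 request per second limit like the one mentioned for Nominatim can be sketched as a minimum-interval limiter. This is an illustration only and not the limiter in `huizenbot.py`: the class name `MinIntervalLimiter` and the injectable `clock`/`sleep` parameters are assumptions, added so the behaviour can be checked without real waiting.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Block so that successive calls to wait() are at least `interval` seconds apart."""

    def __init__(
        self,
        interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Demonstration with a fake clock so nothing actually sleeps.
fake_now = [0.0]
slept = []


def fake_clock() -> float:
    return fake_now[0]


def fake_sleep(seconds: float) -> None:
    slept.append(seconds)
    fake_now[0] += seconds


limiter = MinIntervalLimiter(1.0, clock=fake_clock, sleep=fake_sleep)
limiter.wait()        # first call never sleeps
fake_now[0] += 0.25   # pretend 0.25 s of work happened
limiter.wait()        # sleeps the remaining 0.75 s
```

Calling `limiter.wait()` immediately before each request keeps sequential requests at or below the configured rate.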

---

25
makelaars.md
@@ -1,4 +1,4 @@
# Selling brokers (verkoopmakelaars) Delft & Schiedam
# Selling brokers (verkoopmakelaars) Delft, Leiden, Den Haag & Schiedam

## Delft

@@ -13,13 +13,17 @@
| [x] | ZO makelaars | zomakelaars.nl | Van Foreestweg 4 |
| [ ] | Marloes Makelaars | — | Maerten Trompstraat 28 |
| [ ] | Makelaarskantoor J.E. Mouthaan | — | Julianalaan 43 |
| [ ] | Olsthoorn Makelaars Delft | olsthoornmakelaars.nl | Noordeinde 51 |
| [ ] | Post Makelaardij (formerly Bayense) | postmakelaardij.nl | Spoorsingel 1a |
| [ ] | Morris NVM Makelaars | morrismakelaardij.nl | — |
| [x] | Olsthoorn Makelaars Delft | olsthoornmakelaars.nl | Noordeinde 51 |
| [x] | Post Makelaardij (formerly Bayense) | postmakelaardij.nl | Spoorsingel 1a |
| [x] | Morris NVM Makelaars | morrismakelaardij.nl | — |
| [ ] | Prinsenstad Makelaardij | — | — |
| [ ] | Oude Delft Makelaardij | — | — |
| [ ] | Dijksman Woningmakelaars | — | — |
| [ ] | CORPOwonen | — | — |
| [ ] | Bergklis Makelaars | bergklis.nl | — |
| [ ] | Van Gulden Makelaardij | vanguldenmakelaardij.nl | Zaïrestraat 1 |
| [ ] | Van der Togt Makelaardij | vdtmakelaardij.nl | — (Voorburg, active in Delft) |

## Schiedam

@@ -38,6 +42,19 @@
| [x] | Schieland Borsboom NVM Makelaars | schielandborsboom.nl | (Rotterdam, active in Schiedam) |

## Den Haag

| Done | Name | Website | Address |
|------|------|---------|---------|
| [skip] | Yuvam Makelaardij | yuvammakelaardij.nl | — (connection refused) |
| [x] | 88 Makelaars | 88makelaars.nl | — |
| [skip] | DIVA Makelaars | divamakelaars.nl | — (only Maartensdijk, not Den Haag) |
| [x] | Elzenaar NVM Makelaars | elzenaar.com | — |
| [skip] | Frisia Makelaars | frisiamakelaars.nl | — (SPA/Vue, no API) |
| [x] | Borgdorff Makelaars | borgdorff.nl | — (Den Haag branch) |
| [skip] | SMASH Makelaars | smashmakelaars.nl | — (too small, no API) |
| [x] | DOEN NVM Makelaars | doenmakelaars.com | Doezastraat 30 (Leiden, also active in Den Haag) |

## Leiden

| Done | Name | Website | Address |

@@ -1,4 +1,39 @@
# SSR
# OG Online / realtime-listings (fastest — API)

Check out the add_scraper_context.md, let's add a new scraper.

**Broker:** [name]
**Base URL:** [e.g. https://www.mybroker.nl]
**Cities to include:** [e.g. {"Den Haag", "Voorburg"} — omit if broker is single-city]

_(No further investigation needed — OG Online platform is fully understood.)_


# Realworks CMS (one-liner — SSR)

Check out the add_scraper_context.md, let's add a new scraper.

**Broker:** [name]
**Base URL:** [e.g. https://www.mybroker.nl]

_(No further investigation needed — Realworks platform is fully understood.)_


# SURE WordPress Plugin (SSR)

Check out the add_scraper_context.md, let's add a new scraper.

**Broker:** [name]
**Base URL:** [e.g. https://www.mybroker.nl]
**Card selector:** [a.card-house or a.card--house]
**City filter:** [city name(s) to include, or "single city — no filter needed"]
**Cards per page:** [e.g. 15]

_(Detail page always uses #kenmerken li span span — no further investigation needed.)_


# SSR (custom)

Check out the add_scraper_context.md, let's add a new scraper.

**Broker:** [name]
@@ -16,7 +51,7 @@ Check out the add_scraper_context.md, let's add a new scraper.
**Notes:** [auth, JS rendering, price filter in URL, etc.]


# API
# API (custom)

Check out the add_scraper_context.md, let's add a new scraper.


@@ -307,6 +307,135 @@ def fetch_vandaal() -> list[RawListing]:
    return listings


# ---------------------------------------------------------------------------
# Elzenaar NVM Makelaars (Den Haag) — OG Online platform
# ---------------------------------------------------------------------------
# Same platform as bjornd/moerman/vandaal.

_ELZENAAR_BASE = "https://www.elzenaar.com"
_ELZENAAR_SKIP = {"rented", "rented_ur"}
_ELZENAAR_CITIES = {"Den Haag", "Voorburg", "Rijswijk"}

_ELZENAAR_STATUS_MAP = {
    "available": "beschikbaar",
    "under_bid": "onder_bod",
    "under_option": "onder_bod",
    "sold": "verkocht",
    "sold_ur": "verkocht",
}


def fetch_elzenaar() -> list[RawListing]:
    data = fetch_json(
        f"{_ELZENAAR_BASE}/nl/realtime-listings/consumer",
        headers={"X-Requested-With": "XMLHttpRequest"},
    )

    listings = []
    for item in data:
        if not item.get("isSales"):
            continue
        if item.get("statusOrig") in _ELZENAAR_SKIP:
            continue
        if item.get("city") not in _ELZENAAR_CITIES:
            continue
        # salesPrice can be present but null; treat that as 0 for the comparison
        if (item.get("salesPrice") or 0) > config.MAX_PRICE:
            continue

        postcode = (item.get("zipcode") or "").replace(" ", "") or None
        # 0 or missing plot surface both become None
        perceel = item.get("plotSurface") or None

        # str() guards against the year arriving as an int instead of a string
        raw_year = str(item.get("dateOfConstruction") or "")
        bouwjaar = int(raw_year) if raw_year.isdigit() else None

        listings.append(RawListing(
            url=_ELZENAAR_BASE + item["url"],
            source_makelaar="elzenaar",
            status=_ELZENAAR_STATUS_MAP.get(item.get("statusOrig", ""), "beschikbaar"),
            adres=item.get("address") or None,
            postcode=postcode,
            stad=item.get("city") or None,
            prijs=item.get("salesPrice") or None,
            woningtype=item.get("type") or None,
            woonoppervlak=item.get("livingSurface") or None,
            perceeloppervlak=perceel,
            kamers=item.get("rooms") or None,
            slaapkamers=item.get("bedrooms") or None,
            bouwjaar=bouwjaar,
            energielabel=item.get("energyLabel") or None,
            hero_image_url=item.get("photo") or None,
        ))

    log.info("elzenaar: %d koopwoningen opgehaald", len(listings))
    return listings


# ---------------------------------------------------------------------------
# DOEN NVM Makelaars (Den Haag / Leiden / Voorburg) — OG Online platform
# ---------------------------------------------------------------------------

_DOEN_BASE = "https://www.doenmakelaars.com"
_DOEN_SKIP = {"rented", "rented_ur"}
_DOEN_CITIES = {"Den Haag", "Leiden", "Voorburg", "Leidschendam", "Rijswijk", "Wassenaar", "Zoetermeer"}

_DOEN_STATUS_MAP = {
    "available": "beschikbaar",
    "under_bid": "onder_bod",
    "under_option": "onder_bod",
    "sold": "verkocht",
    "sold_ur": "verkocht",
}


def fetch_doen() -> list[RawListing]:
    data = fetch_json(
        f"{_DOEN_BASE}/nl/realtime-listings/consumer",
        headers={"X-Requested-With": "XMLHttpRequest"},
    )

    listings = []
    for item in data:
        if not item.get("isSales"):
            continue
        if item.get("statusOrig") in _DOEN_SKIP:
            continue
        if item.get("city") not in _DOEN_CITIES:
            continue
        # salesPrice can be present but null; treat that as 0 for the comparison
        if (item.get("salesPrice") or 0) > config.MAX_PRICE:
            continue

        postcode = (item.get("zipcode") or "").replace(" ", "") or None
        # 0 or missing plot surface both become None
        perceel = item.get("plotSurface") or None

        # str() guards against the year arriving as an int instead of a string
        raw_year = str(item.get("dateOfConstruction") or "")
        bouwjaar = int(raw_year) if raw_year.isdigit() else None

        listings.append(RawListing(
            url=_DOEN_BASE + item["url"],
            source_makelaar="doen",
            status=_DOEN_STATUS_MAP.get(item.get("statusOrig", ""), "beschikbaar"),
            adres=item.get("address") or None,
            postcode=postcode,
            stad=item.get("city") or None,
            prijs=item.get("salesPrice") or None,
            woningtype=item.get("type") or None,
            woonoppervlak=item.get("livingSurface") or None,
            perceeloppervlak=perceel,
            kamers=item.get("rooms") or None,
            slaapkamers=item.get("bedrooms") or None,
            bouwjaar=bouwjaar,
            energielabel=item.get("energyLabel") or None,
            hero_image_url=item.get("photo") or None,
        ))

    log.info("doen: %d koopwoningen opgehaald", len(listings))
    return listings


# ---------------------------------------------------------------------------
# SCRAPERS: export all active API adapters here
# ---------------------------------------------------------------------------
@@ -316,4 +445,6 @@ SCRAPERS = {
    'ooms': fetch_ooms,
    'moerman': fetch_moerman,
    'vandaal': fetch_vandaal,
    'elzenaar': fetch_elzenaar,
    'doen': fetch_doen,
}

@@ -1292,6 +1292,588 @@ def fetch_roepman() -> list[RawListing]:
    return listings


# ---------------------------------------------------------------------------
# Post Makelaardij (formerly Bayense), Delft & surroundings
# ---------------------------------------------------------------------------
# Custom Tailwind CSS site; covers Delft, Pijnacker, Rijswijk etc.
# Filter for Delft only.

_POST_BASE = "https://www.postmakelaardij.nl"

_POST_STATUS_MAP = {
    "te koop": "beschikbaar",
    "onder bod": "onder_bod",
    "verkocht": "verkocht",
}


def _post_detail(detail_url: str) -> dict:
    """Fetch Post Makelaardij detail page and extract kenmerken."""
    try:
        soup = fetch_soup(detail_url)

        # Energielabel from CSS class: energielabel-{letter}
        energielabel = None
        for el in soup.select('[class]'):
            for cls in el.get('class', []):
                if cls.startswith('energielabel-'):
                    energielabel = cls.replace('energielabel-', '').upper()
                    break
            if energielabel:
                break

        # Woonoppervlak, perceeloppervlak, slaapkamers from icon spans
        woonoppervlak = None
        perceeloppervlak = None
        slaapkamers = None
        for span in soup.select('span.object-info-icon-text'):
            txt = span.get_text(strip=True)
            if 'slaapkamer' in txt:
                m = re.search(r'(\d+)', txt)
                slaapkamers = int(m.group(1)) if m else None
            elif 'perceel' in txt:
                perceeloppervlak = parse_m2(txt)
            elif 'm²' in txt or 'm2' in txt:
                woonoppervlak = parse_m2(txt)

        return {
            "woonoppervlak": woonoppervlak,
            "perceeloppervlak": perceeloppervlak,
            "slaapkamers": slaapkamers,
            "energielabel": energielabel,
        }
    except Exception as e:
        log.warning("post: detail fetch fout %s: %s", detail_url, e)
        return {}


def fetch_post() -> list[RawListing]:
    """Fetch Post Makelaardij listings; only Delft, only koop."""
    listings = []
    page = 1

    while True:
        url = f"{_POST_BASE}/woningaanbod/koop?page={page}"
        soup = fetch_soup(url)
        cards = soup.select("article")
        if not cards:
            break

        for card in cards:
            try:
                # URL — first link in image slider
                a_tag = card.select_one("a[href]")
                if not a_tag:
                    continue
                href = a_tag["href"]
                detail_url = href if href.startswith("http") else _POST_BASE + href

                # Postcode + city from span.custom-postcode-text
                pc_el = card.select_one("span.custom-postcode-text")
                if not pc_el:
                    continue
                pc_parts = pc_el.get_text(strip=True).split()
                if len(pc_parts) < 3:
                    continue
                postcode = pc_parts[0] + pc_parts[1]  # "2613BD"
                stad = " ".join(pc_parts[2:])  # "Delft"

                # Filter: only Delft
                if stad.lower() != "delft":
                    continue

                # Price — filter early
                prijs = parse_prijs(_text(card, "span.price-block"))
                if prijs and prijs > config.MAX_PRICE:
                    continue

                # Status from span.status text
                status_text = (_text(card, "span.status") or "").lower()
                status = _POST_STATUS_MAP.get(status_text, "beschikbaar")

                # Address
                adres = _text(card, "h4.custom-address-text")

                # Hero: first img in article; .get() avoids dropping the
                # listing when the tag has no src attribute
                img = card.select_one("img")
                hero = img.get("src") if img else None

                kk = _post_detail(detail_url)

                listings.append(RawListing(
                    url=detail_url,
                    source_makelaar="post",
                    status=status,
                    adres=adres,
                    postcode=postcode,
                    stad=stad,
                    prijs=prijs,
                    hero_image_url=hero,
                    woonoppervlak=kk.get("woonoppervlak"),
                    perceeloppervlak=kk.get("perceeloppervlak"),
                    slaapkamers=kk.get("slaapkamers"),
                    energielabel=kk.get("energielabel"),
                ))
                if config.APP_ENV == "dev":
                    break
            except Exception as e:
                log.warning("post: parse fout: %s", e)

        if len(cards) < 12:
            break
        page += 1

    log.info("post: %d listings opgehaald", len(listings))
    return listings


# ---------------------------------------------------------------------------
# Morris NVM Makelaars (Delft) — Realworks CMS
# ---------------------------------------------------------------------------

def fetch_morris() -> list[RawListing]:
    return fetch_realworks("https://www.morrismakelaardij.nl", "morris")


# ---------------------------------------------------------------------------
# Olsthoorn Makelaars Delft (SURE WordPress plugin)
# ---------------------------------------------------------------------------
# Covers Delft, Den Haag, Naaldwijk etc — we filter for Delft only.
# Detail page has no postcode; leave as None.

_OLSTHOORN_BASE = "https://www.olsthoornmakelaars.nl"

_OLSTHOORN_STATUS_MAP = {
    "badge-available": "beschikbaar",
    "badge-bid": "onder_bod",
    "badge-option": "onder_bod",
    "badge-sold": "verkocht",
}

_OLSTHOORN_DETAIL_STATUS_MAP = {
    "beschikbaar": "beschikbaar",
    "onder bod": "onder_bod",
    "onder optie": "onder_bod",
    "verkocht": "verkocht",
}


def _olsthoorn_detail(detail_url: str) -> dict:
    """Fetch Olsthoorn detail page; extract kenmerken from #kenmerken li pairs."""
    try:
        soup = fetch_soup(detail_url)
        kv: dict[str, str] = {}
        for li in soup.select("#kenmerken li"):
            spans = li.select("span")
            if len(spans) >= 2:
                label = spans[0].get_text(strip=True).lower()
                value = spans[1].get_text(strip=True)
                kv[label] = value
        return {
            "status": kv.get("status", "").lower(),
            "woningtype": kv.get("soort object") or kv.get("soort woning") or kv.get("soort bouw"),
            "bouwjaar": kv.get("bouwjaar"),
            "woonoppervlak": kv.get("gebruiksoppervlakte"),
            "perceeloppervlak": kv.get("perceeloppervlakte"),
            "kamers": kv.get("aantal kamers"),
            "slaapkamers": kv.get("aantal slaapkamers"),
            "energielabel": kv.get("energielabel"),
        }
    except Exception as e:
        log.warning("olsthoorn: detail fetch fout %s: %s", detail_url, e)
        return {}


def fetch_olsthoorn() -> list[RawListing]:
    """Fetch Olsthoorn Makelaars listings; only Delft, only koop."""
    listings = []
    page = 1

    while True:
        if page == 1:
            url = f"{_OLSTHOORN_BASE}/wonen?sure_koop_huur=koop"
        else:
            url = f"{_OLSTHOORN_BASE}/wonen/page/{page}/?sure_koop_huur=koop"

        soup = fetch_soup(url)
        cards = soup.select("a.card-house")
        if not cards:
            break

        for card in cards:
            try:
                href = card.get("href", "")
                if not href:
                    continue
                detail_url = href if href.startswith("http") else _OLSTHOORN_BASE + href

                # Filter: only Delft
                stad_el = card.select_one("h2.card__title")
                stad = stad_el.get_text(strip=True) if stad_el else None
                if not stad or stad.lower() != "delft":
                    continue

                # Price from bold tag — filter early before detail fetch
                prijs_b = card.select_one("b")
                prijs = parse_prijs(prijs_b.get_text() if prijs_b else None)
                if prijs and prijs > config.MAX_PRICE:
                    continue

                # Status from badge class on label span
                label_span = card.select_one("span.card-house__label")
                status = "beschikbaar"
                if label_span:
                    for cls in label_span.get("class", []):
                        if cls in _OLSTHOORN_STATUS_MAP:
                            status = _OLSTHOORN_STATUS_MAP[cls]
                            break

                # Address: first direct-child <p> of .short--info (collapse internal whitespace)
                adres_p = card.select("div.short--info > p")
                if adres_p:
                    adres = " ".join(adres_p[0].get_text().split())
                else:
                    adres = None

                # Hero image: largest source srcset
                src_tag = card.select_one('picture source[media="(min-width:1024px)"]')
                hero = src_tag.get("data-srcset") if src_tag else None
                if hero and not hero.startswith("http"):
                    hero = _OLSTHOORN_BASE + hero

                # Woonoppervlak + kamers + energielabel from card data icons
                woonoppervlak_card = None
                kamers_card = None
                energielabel_card = None
                for data_div in card.select("div.data"):
                    inner = data_div.select_one("span.date__inner")
                    if not inner:
                        continue
                    txt = inner.get_text(strip=True)
                    if data_div.select_one("i.icon-sizes"):
                        woonoppervlak_card = parse_m2(txt)
                    elif data_div.select_one("i.icon-door"):
                        m = re.search(r"(\d+)", txt)
                        kamers_card = int(m.group(1)) if m else None
                    elif data_div.select_one("i.icon-energylabel"):
                        energielabel_card = txt or None

                kk = _olsthoorn_detail(detail_url)

                # Refine status from detail page
                detail_status = _OLSTHOORN_DETAIL_STATUS_MAP.get(kk.get("status", ""), "")
                if detail_status:
                    status = detail_status

                listings.append(RawListing(
                    url=detail_url,
                    source_makelaar="olsthoorn",
                    status=status,
                    adres=adres,
                    postcode=None,  # not exposed by broker
                    stad=stad,
                    prijs=prijs,
                    hero_image_url=hero,
                    woningtype=kk.get("woningtype"),
                    bouwjaar=int(kk["bouwjaar"]) if kk.get("bouwjaar") else None,
                    woonoppervlak=parse_m2(kk.get("woonoppervlak")) or woonoppervlak_card,
                    perceeloppervlak=parse_m2(kk.get("perceeloppervlak")),
                    kamers=int(kk["kamers"]) if kk.get("kamers") else kamers_card,
                    slaapkamers=int(kk["slaapkamers"]) if kk.get("slaapkamers") else None,
                    energielabel=kk.get("energielabel") or energielabel_card,
                ))
                if config.APP_ENV == "dev":
                    break
            except Exception as e:
                log.warning("olsthoorn: parse fout: %s", e)

        if len(cards) < 15:
            break
        page += 1

    log.info("olsthoorn: %d listings opgehaald", len(listings))
    return listings


# ---------------------------------------------------------------------------
# 88 Makelaars (Den Haag) — Custom WordPress theme
# ---------------------------------------------------------------------------
# Cards on /ons-aanbod/page/{N}/; details in div.listing_detail kv pairs.

_88_BASE = "https://88makelaars.nl"

_88_STATUS_MAP = {
    "te koop": "beschikbaar",
    "beschikbaar": "beschikbaar",
    "onder bod": "onder_bod",
    "onder optie": "onder_bod",
    "verkocht onder voorbehoud": "verkocht",
    "verkocht": "verkocht",
}


def _88makelaars_detail(detail_url: str) -> dict:
    """Fetch 88makelaars detail page; extract kenmerken from div.listing_detail kv pairs."""
    try:
        soup = fetch_soup(detail_url)
        kv: dict[str, str] = {}
        for div in soup.select("div.listing_detail"):
            txt = div.get_text(strip=True)
            if ":" in txt:
                label, _, value = txt.partition(":")
                kv[label.strip().lower()] = value.strip()
        raw_pc = kv.get("postcode") or ""
        pc_match = re.search(r"\d{4}\s*[A-Z]{2}", raw_pc.upper())
        postcode = pc_match.group(0).replace(" ", "") if pc_match else None
        return {
            "postcode": postcode,
            "slaapkamers": kv.get("slaapkamers"),
            "woonoppervlak": kv.get("woning grootte"),
            "energielabel": kv.get("energieklasse"),
            "woningtype": kv.get("soort woning"),
        }
    except Exception as e:
        log.warning("88makelaars: detail fetch fout %s: %s", detail_url, e)
        return {}


def fetch_88makelaars() -> list[RawListing]:
    """Fetch 88 Makelaars listings (Den Haag only)."""
    listings = []
    page = 1

    while True:
        if page == 1:
            url = f"{_88_BASE}/ons-aanbod/"
        else:
            url = f"{_88_BASE}/ons-aanbod/page/{page}/"
        soup = fetch_soup(url)
        cards = soup.select("div.property_listing")
        if not cards:
            break

        for card in cards:
            try:
                # URL from carousel
                a_tag = card.select_one(".property_unit_carousel a[href]")
                if not a_tag:
                    continue
                detail_url = a_tag["href"]
                if not detail_url.startswith("http"):
                    detail_url = _88_BASE + detail_url

                # City — last link in property_location_image
                loc_links = card.select(".property_location_image a")
                stad = loc_links[-1].get_text(strip=True) if loc_links else None
                if not stad or stad.lower() != "den haag":
                    continue

                # Price
                prijs = parse_prijs(_text(card, ".listing_unit_price_wrapper"))
                if prijs and prijs > config.MAX_PRICE:
                    continue

                # Status
                status_text = (_text(card, ".ribbon-inside") or "").lower()
                status = _88_STATUS_MAP.get(status_text, "beschikbaar")

                # Address
                adres = _text(card, "h4 a") or _text(card, "h4")

                # Surface + rooms
                woonoppervlak_card = parse_m2(_text(card, "span.infosize"))
                kamers_card = None
                rooms_txt = _text(card, "span.inforoom")
                if rooms_txt:
                    m = re.search(r"(\d+)", rooms_txt)
                    kamers_card = int(m.group(1)) if m else None

                # Hero: first active carousel image
                img = card.select_one(".item.active img")
                hero = (img.get("src") or img.get("data-original")) if img else None

                kk = _88makelaars_detail(detail_url)

                listings.append(RawListing(
                    url=detail_url,
                    source_makelaar="88makelaars",
                    status=status,
                    adres=adres,
                    postcode=kk.get("postcode"),
                    stad="Den Haag",
                    prijs=prijs,
                    hero_image_url=hero,
                    woningtype=kk.get("woningtype"),
                    woonoppervlak=parse_m2(kk.get("woonoppervlak")) or woonoppervlak_card,
                    kamers=kamers_card,
                    slaapkamers=int(kk["slaapkamers"]) if kk.get("slaapkamers") else None,
                    energielabel=kk.get("energielabel"),
                ))
                if config.APP_ENV == "dev":
                    break
            except Exception as e:
                log.warning("88makelaars: parse fout: %s", e)

        if len(cards) < 10:
            break
        page += 1

    log.info("88makelaars: %d listings opgehaald", len(listings))
    return listings


# ---------------------------------------------------------------------------
# Borgdorff Makelaars (Den Haag / Westland) — SURE WordPress plugin
# ---------------------------------------------------------------------------
# Covers Den Haag ('s-gravenhage), Monster, Naaldwijk etc. Filter for Den Haag.
# Same SURE plugin as Schieland Borsboom but uses a.card--house (double dash).
# No postcode on detail page.

_BORGDORFF_BASE = "https://www.borgdorff.nl"
_BORGDORFF_DEN_HAAG = {"'s-gravenhage", "den haag"}

_BORGDORFF_BADGE_MAP = {
    "badge--info": "beschikbaar",
    "badge--warning": "onder_bod",
    "badge--danger": "verkocht",
}


def _borgdorff_detail(detail_url: str) -> dict:
    """Fetch Borgdorff detail page; extract #kenmerken li span pairs."""
    try:
        soup = fetch_soup(detail_url)
        kv: dict[str, str] = {}
        for li in soup.select("#kenmerken li"):
            spans = li.select("span")
            if len(spans) >= 2:
                label = spans[0].get_text(strip=True).lower()
                value = spans[1].get_text(strip=True)
                kv[label] = value
        return {
            "status": kv.get("status", "").lower(),
            "woningtype": kv.get("soort woonhuis") or kv.get("soort woning") or kv.get("soort bouw"),
            "bouwjaar": kv.get("bouwjaar"),
            "woonoppervlak": kv.get("gebruiksoppervlakte wonen") or kv.get("gebruiksoppervlakte"),
            "perceeloppervlak": kv.get("perceeloppervlakte"),
            "slaapkamers": kv.get("aantal slaapkamers"),
            "energielabel": kv.get("energielabel"),
        }
    except Exception as e:
        log.warning("borgdorff: detail fetch fout %s: %s", detail_url, e)
        return {}


def fetch_borgdorff() -> list[RawListing]:
    """Fetch Borgdorff listings; only Den Haag / 's-gravenhage, only koop."""
    listings = []
    page = 1

    while True:
        if page == 1:
            url = f"{_BORGDORFF_BASE}/wonen?sure_koop_huur=koop"
        else:
            url = f"{_BORGDORFF_BASE}/wonen/page/{page}/?sure_koop_huur=koop"

        soup = fetch_soup(url)
        cards = soup.select("a.card--house")
        if not cards:
            break

        for card in cards:
            try:
                href = card.get("href", "")
                if not href:
                    continue
                detail_url = href if href.startswith("http") else _BORGDORFF_BASE + href

                # Filter: only Den Haag
                stad_el = card.select_one("p.lead-two")
                stad = stad_el.get_text(strip=True) if stad_el else None
                if not stad or stad.lower() not in _BORGDORFF_DEN_HAAG:
                    continue

                # Price — filter early
                prijs = parse_prijs(_text(card, "p.strong"))
                if prijs and prijs > config.MAX_PRICE:
                    continue

                # Status from badge class
                label_span = card.select_one("span.card-house__label")
                status = "beschikbaar"
                if label_span:
                    for cls in label_span.get("class", []):
                        if cls in _BORGDORFF_BADGE_MAP:
                            status = _BORGDORFF_BADGE_MAP[cls]
                            break

                # Address
                adres = _text(card, "h4")

                # Hero: largest source srcset
                src_tag = card.select_one('picture source[media="(min-width:1280px)"]')
                hero = src_tag.get("srcset") if src_tag else None
                if not hero:
                    img = card.select_one("img[data-src]")
                    hero = img.get("data-src") if img else None
                if hero and not hero.startswith("http"):
                    hero = _BORGDORFF_BASE + hero

                # Surface + bedrooms from data icons
                woonoppervlak_card = None
                slaapkamers_card = None
                for data_div in card.select("div.data"):
                    inner = data_div.select_one("p.small")
                    if not inner:
                        continue
                    txt = inner.get_text(strip=True)
                    if data_div.select_one("i.icon-surface"):
                        woonoppervlak_card = parse_m2(txt)
                    elif data_div.select_one("i.icon-bed"):
                        m = re.search(r"(\d+)", txt)
                        slaapkamers_card = int(m.group(1)) if m else None

                kk = _borgdorff_detail(detail_url)

                # Refine status from detail page
                detail_status_map = {
                    "beschikbaar": "beschikbaar",
                    "onder bod": "onder_bod",
                    "onder optie": "onder_bod",
                    "verkocht": "verkocht",
                }
                if kk.get("status"):
                    status = detail_status_map.get(kk["status"], status)

                listings.append(RawListing(
                    url=detail_url,
                    source_makelaar="borgdorff",
                    status=status,
                    adres=adres,
                    postcode=None,  # not exposed by broker
                    stad=stad,
                    prijs=prijs,
                    hero_image_url=hero,
                    woningtype=kk.get("woningtype"),
                    bouwjaar=int(kk["bouwjaar"]) if kk.get("bouwjaar") else None,
                    woonoppervlak=parse_m2(kk.get("woonoppervlak")) or woonoppervlak_card,
                    perceeloppervlak=parse_m2(kk.get("perceeloppervlak")),
                    slaapkamers=int(kk["slaapkamers"]) if kk.get("slaapkamers") else slaapkamers_card,
                    energielabel=kk.get("energielabel"),
                ))
                if config.APP_ENV == "dev":
                    break
            except Exception as e:
                log.warning("borgdorff: parse fout: %s", e)

        if len(cards) < 15:
            break
        page += 1

    log.info("borgdorff: %d listings opgehaald", len(listings))
    return listings


# ---------------------------------------------------------------------------
# SCRAPERS: export all active SSR adapters here
# ---------------------------------------------------------------------------
@@ -1309,4 +1891,9 @@ SCRAPERS = {
    'vwmakelaars': fetch_vwmakelaars,
    'roepman': fetch_roepman,
    'zomakelaars': fetch_zomakelaars,
    'post': fetch_post,
    'morris': fetch_morris,
    'olsthoorn': fetch_olsthoorn,
    '88makelaars': fetch_88makelaars,
    'borgdorff': fetch_borgdorff,
}

@@ -16,7 +16,7 @@ logging.basicConfig(
)

# --- change this to test a different adapter ---
ADAPTER = SCRAPERS['zomakelaars']
ADAPTER = SCRAPERS['post']

if __name__ == "__main__":
    print(f"Testing adapter: {ADAPTER.__name__}")
