refactoring
README.md
# Kleinanzeigen Boosted

***WIP***

A web-based map visualization tool for searching and exploring listings from kleinanzeigen.de with real-time geographic display on OpenStreetMap.

## Features

- 🗺️ Interactive map visualization
- 🔍 Advanced search with price range (more options in future)
- 📍 Automatic geocoding of listings via Nominatim API
- ⚡ Parallel scraping with concurrent workers
- 📊 Prometheus-compatible metrics endpoint
- 🎯 Real-time progress tracking with ETA
- 💾 ZIP code caching to minimize API calls
- 🌐 User location display on map

## Architecture

- **Backend**: Flask API server with multi-threaded scraping
- **Frontend**: Vanilla JavaScript with Leaflet.js for maps
- **Data Sources**: kleinanzeigen.de, OpenStreetMap/Nominatim

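The multi-threaded scraping in the backend could look roughly like the sketch below. It is not taken from the project's code: `scrape_in_batches` and `fetch` are hypothetical names, and the `max_workers` and `rate_limit_delay` parameters only mirror the config keys of the same name.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def scrape_in_batches(urls, fetch, max_workers=5, rate_limit_delay=0.5):
    """Fetch URLs in batches of `max_workers`, pausing between batches.

    `fetch` is a hypothetical callable that takes one URL and returns the
    parsed listing; it stands in for the real scraping function.
    """
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for start in range(0, len(urls), max_workers):
            batch = urls[start:start + max_workers]
            # pool.map preserves input order, so results line up with urls.
            results.extend(pool.map(fetch, batch))
            if start + max_workers < len(urls):
                time.sleep(rate_limit_delay)  # pause before the next batch
    return results
```

Batching keeps at most `max_workers` requests in flight and inserts one delay per batch, which is one plausible reading of "delay between batches" in the configuration.
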
## Requirements

### Python Packages

```bash
pip install flask flask-cors beautifulsoup4 lxml urllib3 requests
```

### System Requirements

- Python 3.8+
- nginx (for production deployment)

## Installation

### 1. Create System User

```bash
mkdir -p /home/kleinanzeigenscraper/
useradd --system -K MAIL_DIR=/dev/null kleinanzeigenscraper -d /home/kleinanzeigenscraper
chown -R kleinanzeigenscraper:kleinanzeigenscraper /home/kleinanzeigenscraper
```

### 2. Clone Repository

```bash
cd /home/kleinanzeigenscraper/
mkdir git
cd git
git clone https://git.mosad.xyz/localhorst/kleinanzeigen-boosted.git
cd kleinanzeigen-boosted
git checkout main
```

### 3. Install Dependencies

```bash
pip install flask flask-cors beautifulsoup4 lxml urllib3 requests
```

### 4. Configure Application

Create `config.json`:

```json
{
  "server": {
    "host": "127.0.0.1",
    "port": 5000,
    "debug": false
  },
  "scraping": {
    "session_timeout": 300,
    "listings_per_page": 25,
    "max_workers": 5,
    "min_workers": 2,
    "rate_limit_delay": 0.5,
    "geocoding_delay": 1.0
  },
  "cache": {
    "zip_cache_file": "zip_cache.json"
  },
  "apis": {
    "nominatim": {
      "url": "https://nominatim.openstreetmap.org/search",
      "user_agent": "kleinanzeigen-scraper"
    },
    "kleinanzeigen": {
      "base_url": "https://www.kleinanzeigen.de"
    }
  },
  "user_agents": [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
  ]
}
```

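How the backend actually reads this file is not shown here. As a minimal sketch, assuming missing keys fall back to the defaults listed under "Configuration Options", a loader might overlay the file on a defaults dictionary. `DEFAULTS`, `merge`, and `load_config` are hypothetical names, not part of the project.

```python
import json
from pathlib import Path

# Defaults copied from the "Configuration Options" list in this README.
DEFAULTS = {
    "server": {"host": "0.0.0.0", "port": 5000, "debug": False},
    "scraping": {
        "session_timeout": 300,
        "listings_per_page": 25,
        "max_workers": 4,
        "min_workers": 2,
        "rate_limit_delay": 0.5,
        "geocoding_delay": 1.0,
    },
    "cache": {"zip_cache_file": "zip_cache.json"},
}


def merge(defaults, overrides):
    """Recursively overlay `overrides` on `defaults` without mutating either."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(path="config.json"):
    """Return DEFAULTS overlaid with the JSON file at `path`, if it exists."""
    file = Path(path)
    overrides = json.loads(file.read_text(encoding="utf-8")) if file.exists() else {}
    return merge(DEFAULTS, overrides)
```

With the sample file above, `load_config()["server"]["host"]` would be `"127.0.0.1"` and `max_workers` would be `5`, since values in the file take precedence over the defaults.
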
### 5. Create Systemd Service

Create `/lib/systemd/system/kleinanzeigenscraper.service`:

```ini
[Unit]
Description=Kleinanzeigen Scraper API
After=network.target systemd-networkd-wait-online.service

[Service]
Type=simple
User=kleinanzeigenscraper
WorkingDirectory=/home/kleinanzeigenscraper/git/kleinanzeigen-boosted/backend/
ExecStart=/usr/bin/python3 scrape_proxy.py
Restart=on-failure
RestartSec=10
StandardOutput=append:/var/log/kleinanzeigenscraper.log
StandardError=append:/var/log/kleinanzeigenscraper.log

[Install]
WantedBy=multi-user.target
```

### 6. Enable and Start Service

```bash
systemctl daemon-reload
systemctl enable kleinanzeigenscraper.service
systemctl start kleinanzeigenscraper.service
systemctl status kleinanzeigenscraper.service
```

### 7. Configure nginx Reverse Proxy

Create `/etc/nginx/sites-available/kleinanzeigenscraper`:

```nginx
server {
    listen 80;
    server_name your-domain.com;

    # Redirect HTTP to HTTPS
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name your-domain.com;

    ssl_certificate /path/to/ssl/cert.pem;
    ssl_certificate_key /path/to/ssl/key.pem;

    # Security headers
    add_header X-Frame-Options "SAMEORIGIN" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-XSS-Protection "1; mode=block" always;

    location / {
        proxy_pass http://127.0.0.1:5000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_read_timeout 300;
    }

    location /api/ {
        proxy_pass http://127.0.0.1:5000/api/;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_read_timeout 300;
    }
}
```

Enable site:

```bash
ln -s /etc/nginx/sites-available/kleinanzeigenscraper /etc/nginx/sites-enabled/
nginx -t
systemctl reload nginx
```

## API Endpoints

### `POST /api/search`
Start a new search session.

**Request Body:**
```json
{
  "search_term": "Fahrrad",
  "num_listings": 25,
  "min_price": 0,
  "max_price": 1000
}
```

**Response:**
```json
{
  "session_id": "uuid-string",
  "total": 25
}
```

### `GET /api/scrape/<session_id>`
Get the next scraped listing from an active session.

**Response:**
```json
{
  "complete": false,
  "listing": {
    "title": "Mountain Bike",
    "price": 450,
    "id": 123456,
    "zip_code": "76593",
    "address": "Gernsbach",
    "date_added": "2025-11-20",
    "image": "https://...",
    "url": "https://...",
    "lat": 48.7634,
    "lon": 8.3344
  },
  "progress": {
    "current": 5,
    "total": 25
  }
}
```

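The search and scrape endpoints are meant to be used together: start a session, then poll for listings one at a time. The sketch below shows one way a client might do that with only the standard library. It assumes the final reply carries `"complete": true` and that a reply may omit `listing`; neither is confirmed by the examples above. `http_json`, `collect_listings`, and `max_polls` are hypothetical names.

```python
import json
import urllib.request


def http_json(method, url, payload=None):
    """Send a JSON request with the standard library and decode the reply."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    request = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def collect_listings(base_url, search, http=http_json, max_polls=1000):
    """Start a search, then poll the scrape endpoint until it reports completion."""
    session = http("POST", f"{base_url}/api/search", search)
    scrape_url = f"{base_url}/api/scrape/{session['session_id']}"
    listings = []
    for _ in range(max_polls):  # guard against a session that never completes
        reply = http("GET", scrape_url)
        if reply.get("listing"):
            listings.append(reply["listing"])
        if reply.get("complete"):
            break
    return listings
```

A call such as `collect_listings("http://localhost:5000", {"search_term": "Fahrrad", "num_listings": 25, "min_price": 0, "max_price": 1000})` would return the listings as a list of dictionaries. The `http` parameter exists so the transport can be replaced in tests.
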
### `POST /api/scrape/<session_id>/cancel`
Cancel an active scraping session and delete cached listings.

**Response:**
```json
{
  "cancelled": true,
  "message": "Session deleted"
}
```

### `GET /api/health`
Health check endpoint.

**Response:**
```json
{
  "status": "ok"
}
```

### `GET /api/metrics`
Prometheus-compatible metrics endpoint.

**Response** (text/plain):
```
# HELP search_requests_total Total number of search requests
# TYPE search_requests_total counter
search_requests_total 42

# HELP scrape_requests_total Total number of scrape requests
# TYPE scrape_requests_total counter
scrape_requests_total 1050

# HELP uptime_seconds Application uptime in seconds
# TYPE uptime_seconds gauge
uptime_seconds 86400

# HELP active_sessions Number of active scraping sessions
# TYPE active_sessions gauge
active_sessions 2

# HELP cache_size Number of cached ZIP codes
# TYPE cache_size gauge
zip_code_cache_size 150

# HELP kleinanzeigen_http_responses_total HTTP responses from kleinanzeigen.de
# TYPE kleinanzeigen_http_responses_total counter
kleinanzeigen_http_responses_total{code="200"} 1000
kleinanzeigen_http_responses_total{code="error"} 5

# HELP nominatim_http_responses_total HTTP responses from Nominatim API
# TYPE nominatim_http_responses_total counter
nominatim_http_responses_total{code="200"} 150
```

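For quick checks without a Prometheus server, the text output above can be read into a dictionary. The following is a simplified sketch, not a full parser for the exposition format: it ignores timestamps and does not handle label values containing `}`. `parse_metrics` is a hypothetical helper name.

```python
import re

# One sample line: metric name, optional {labels}, then a numeric value.
SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)$')


def parse_metrics(text):
    """Map each sample (name plus any label set) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE.match(line)
        if match:
            name, labels, value = match.groups()
            samples[name + (labels or "")] = float(value)
    return samples
```

Labelled samples are keyed by their full text, so the error counter from the example would be available as `parse_metrics(body)['kleinanzeigen_http_responses_total{code="error"}']`.
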
## Configuration Options

### Server Configuration
- `host`: Bind address (default: 0.0.0.0)
- `port`: Port number (default: 5000)
- `debug`: Debug mode (default: false)

### Scraping Configuration
- `session_timeout`: Session expiry in seconds (default: 300)
- `listings_per_page`: Listings per page on kleinanzeigen.de (default: 25)
- `max_workers`: Maximum number of parallel scraping threads (default: 4)
- `min_workers`: Minimum number of parallel scraping threads (default: 2)
- `rate_limit_delay`: Delay between batches in seconds (default: 0.5)
- `geocoding_delay`: Delay between geocoding requests in seconds (default: 1.0)

### Cache Configuration
- `zip_cache_file`: Path to ZIP code cache file (default: zip_cache.json)

## Monitoring

View logs:
```bash
tail -f /var/log/kleinanzeigenscraper.log
```

Check service status:
```bash
systemctl status kleinanzeigenscraper.service
```

Monitor metrics (Prometheus):
```bash
curl http://localhost:5000/api/metrics
```

## Development

Run in debug mode:
```bash
python3 scrape_proxy.py
```

Frontend files are located in `web/`:
- `index.html` - Main HTML file
- `css/style.css` - Stylesheet
- `js/config.js` - Configuration
- `js/map.js` - Map functions
- `js/ui.js` - UI functions
- `js/api.js` - API communication
- `js/app.js` - Main application

## License

This project is provided as-is for educational purposes. Respect kleinanzeigen.de's terms of service and robots.txt when using this tool.

## Credits

Built with:
- Flask (Python web framework)
- Leaflet.js (Interactive maps)
- BeautifulSoup4 (HTML parsing)
- OpenStreetMap & Nominatim (Geocoding)