---
title: Sosse
created: 2026-06-08
updated: 2026-06-08
type: app
tags: [catalogue, monitoring, app-marathon3-batch-b]
confidence: medium
contested: false
sources: [https://selfh.st/apps/?tag=monitoring&app=sosse]
---

# 🕷️ Sosse

> Web crawler and archive for self-hosters: a homemade Wayback Machine with periodic snapshots and full-text search.

## 📋 General Information

| Field | Value |
| :--- | :--- |
| **Website** | (community) |
| **GitHub** | (community/sosse) |
| **License** | MIT |
| **Language** | Python (Django) |
| **GitHub stars** | <500 ⭐ |
| **Category** | [[cat-monitoring\|Monitoring]] |

## 📝 Description

**Sosse** is a self-hosted web crawler that archives pages, captures screenshots, and indexes their content for full-text search.

Compared with **ArchiveBox** or the **Wayback Machine**, Sosse is designed as a **proactive monitoring and archiving tool**: keyword tracking, alerts when a page changes, and diffs between snapshots.

Who it is for: archivists, researchers, and developers or indie hackers who want to track how competitors' web pages evolve.
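
The change alerts and diffs mentioned above can be pictured with a small sketch. This is not Sosse's own code: the function names (`content_hash`, `snapshot_diff`, `matched_keywords`) and the sample pages are hypothetical, and the sketch only assumes that two text snapshots of the same page are available. It uses Python's standard-library `hashlib` and `difflib`.

```python
import difflib
import hashlib


def content_hash(text: str) -> str:
    """Return a stable fingerprint of a page's text content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def snapshot_diff(old: str, new: str) -> list[str]:
    """Return unified-diff lines between two snapshots, or [] if unchanged."""
    # Cheap check first: identical fingerprints mean nothing changed.
    if content_hash(old) == content_hash(new):
        return []
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )


def matched_keywords(text: str, keywords: list[str]) -> list[str]:
    """Return the watched keywords found in the text (case-insensitive)."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]


if __name__ == "__main__":
    # Hypothetical snapshots of a competitor's pricing page.
    previous = "Pricing\nBasic plan: 10 EUR\n"
    current = "Pricing\nBasic plan: 12 EUR\n"
    for line in snapshot_diff(previous, current):
        print(line)
    print(matched_keywords(current, ["pricing", "enterprise"]))
```

A non-empty diff is the natural trigger for an alert, and the keyword match narrows alerts to the changes worth reading.
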
## 🚀 Installation

### Docker Compose (recommended)

```yaml
# Secrets are redacted (***). The password in DATABASE_URL must match
# POSTGRES_PASSWORD, and "changeMe" must be replaced before deploying.
services:
  sosse:
    image: ghcr.io/community/sosse:latest
    container_name: sosse
    restart: unless-stopped
    depends_on:
      - sosse-db  # start the database before the web app
    environment:
      - DJANGO_SECRET_KEY=***
      - DATABASE_URL=postgres://sosse:***@sosse-db:5432/sosse
    volumes:
      - sosse-data:/data
    labels:
      - traefik.enable=true
      - traefik.http.routers.sosse.rule=Host(`sosse.example.com`)
      - traefik.http.routers.sosse.entrypoints=websecure
      - traefik.http.routers.sosse.tls.certresolver=letsencrypt
      - traefik.http.services.sosse.loadbalancer.server.port=8000

  sosse-db:
    image: postgres:16-alpine
    container_name: sosse-db
    restart: unless-stopped
    environment:
      POSTGRES_USER: sosse
      POSTGRES_PASSWORD: changeMe
      POSTGRES_DB: sosse
    volumes:
      - sosse-db:/var/lib/postgresql/data

  sosse-crawler:
    image: ghcr.io/community/sosse-crawler:latest
    container_name: sosse-crawler
    restart: unless-stopped
    depends_on:
      - sosse

volumes:
  sosse-data:
  sosse-db:
```

## 🔄 Alternatives

### Open Source

- **ArchiveBox**: full-featured web archiving, popular.
- **Wallabag**: read-it-later service (not a crawler).
- **Browsertrix Cloud**: high-fidelity crawler producing WACZ archives.
- **SingleFile**: browser extension that saves a single page.

### Proprietary

- **Wayback Machine (Internet Archive)**: cloud-hosted, opaque.
- **Archive.today**: cloud-hosted, manual snapshots.
- **Hunchly**: OSINT investigation tool, paid.

## 🔐 Security

- **Crawl scope**: restrict crawling to an allowlist of domains and honor `robots.txt`.
- **Storage**: keep snapshots on an encrypted disk.
- **HTTPS**: mandatory.
- **PII**: anonymize any snapshots that are made public.

## 📚 Resources

- [GitHub](https://github.com/search?q=sosse+crawler)
- [ArchiveBox docs](https://github.com/ArchiveBox/ArchiveBox) (reference)

## Related Pages

- [[cat-monitoring]]: Monitoring category
- [[app-archivebox]]: Competitor
- [[recettes-docker-compose]]: Docker templates