---
title: Sosse
created: 2026-06-08
updated: 2026-06-08
type: app
tags: [catalogue, monitoring, app-marathon3-batch-b]
confidence: medium
contested: false
sources: [https://selfh.st/apps/?tag=monitoring&app=sosse]
---

# 🕷️ Sosse

> Web crawler and archiver for self-hosters: a home-grown Wayback Machine with periodic snapshots and full-text search.

## 📋 General Information

| Field | Value |
| :--- | :--- |
| **Website** | (community) |
| **GitHub** | (community/sosse) |
| **License** | MIT |
| **Language** | Python (Django) |
| **GitHub stars** | <500 ⭐ |
| **Category** | [[cat-monitoring\|Monitoring]] |

## 📝 Description

**Sosse** is a self-hosted web crawler that archives pages, captures screenshots, and indexes their content for full-text search. Unlike **ArchiveBox** or the **Wayback Machine**, Sosse is designed as a **proactive monitoring and archiving tool**: keyword tracking, alerts when a page changes, and diffs. It is aimed at archivists, researchers, and developers or indie hackers who want to track how competitors' web pages evolve.

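To make the "alert when a page changes" idea concrete, here is a minimal Python sketch of change detection between two snapshots of the same page. It is not Sosse's implementation: the function names `fingerprint`, `page_diff`, and `matched_keywords` are hypothetical, and the sketch assumes snapshots are available as plain text.

```python
import difflib
import hashlib


def fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest, a cheap way to detect any change."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def page_diff(old: str, new: str) -> list[str]:
    """Return a unified diff between two snapshots, or [] if unchanged."""
    if fingerprint(old) == fingerprint(new):
        return []
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )


def matched_keywords(text: str, keywords: list[str]) -> list[str]:
    """Return the watched keywords found in the text (case-insensitive)."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]


if __name__ == "__main__":
    # Hypothetical snapshots of a competitor's pricing page.
    old = "Pricing\nBasic plan: 10 EUR"
    new = "Pricing\nBasic plan: 12 EUR"
    for line in page_diff(old, new):
        print(line)
    print(matched_keywords(new, ["pricing", "beta"]))
```

Comparing hashes first avoids computing a diff for the common case where nothing changed; the diff is only built when an alert would actually be raised.
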
## 🚀 Installation

### Docker Compose (recommended)

```yaml
version: '3.8'

services:
  # Web application (Django)
  sosse:
    image: ghcr.io/community/sosse:latest
    container_name: sosse
    restart: unless-stopped
    environment:
      - DJANGO_SECRET_KEY=***
      # The password in this URL must match POSTGRES_PASSWORD below
      - DATABASE_URL=postgres://sosse:***@sosse-db:5432/sosse
    volumes:
      - sosse-data:/data
    labels:
      # Traefik routing: HTTPS for sosse.example.com, forwarded to port 8000
      - traefik.enable=true
      - traefik.http.routers.sosse.rule=Host(`sosse.example.com`)
      - traefik.http.routers.sosse.entrypoints=websecure
      - traefik.http.routers.sosse.tls.certresolver=letsencrypt
      - traefik.http.services.sosse.loadbalancer.server.port=8000

  # PostgreSQL database
  sosse-db:
    image: postgres:16-alpine
    container_name: sosse-db
    restart: unless-stopped
    environment:
      POSTGRES_USER: sosse
      POSTGRES_PASSWORD: changeMe  # placeholder: replace before deploying
      POSTGRES_DB: sosse
    volumes:
      - sosse-db:/var/lib/postgresql/data

  # Crawler service
  sosse-crawler:
    image: ghcr.io/community/sosse-crawler:latest
    container_name: sosse-crawler
    restart: unless-stopped
    depends_on:
      - sosse

volumes:
  sosse-data:
  sosse-db:
```

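The Compose file above leaves `DJANGO_SECRET_KEY` and the database password as placeholders. One possible way to fill them is to generate random values with Python's standard `secrets` module, as sketched below. The helper name `generate_secret` and the length of 50 are assumptions; the alphabet is restricted to letters and digits so the password can be pasted into the `postgres://` URL without percent-encoding.

```python
import secrets
import string


def generate_secret(length: int = 50) -> str:
    """Return a random alphanumeric string from a cryptographic RNG."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


if __name__ == "__main__":
    # Print lines that can be copied into the Compose environment sections.
    print(f"DJANGO_SECRET_KEY={generate_secret()}")
    print(f"POSTGRES_PASSWORD={generate_secret(32)}")
```

The same generated password must be used both in `POSTGRES_PASSWORD` and inside `DATABASE_URL`.
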
## 🔄 Alternatives

### Open Source
- **ArchiveBox**: full-featured web archiving, widely used.
- **Wallabag**: read-it-later service (not a crawler).
- **Browsertrix Cloud**: high-fidelity crawler producing WACZ archives.
- **SingleFile**: browser extension that saves a single page.

### Proprietary
- **Wayback Machine (IA)**: cloud-hosted, opaque.
- **Archive.today**: cloud-hosted, manual snapshots.
- **Hunchly**: OSINT investigation tool, paid.

## 🔐 Security
- **Crawl scope**: allowlist the domains to crawl and honor robots.txt.
- **Storage**: keep snapshots on an encrypted disk.
- **HTTPS**: mandatory.
- **PII**: anonymize public snapshots.

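The crawl-scope rule can be sketched in Python with the standard library: a URL is fetched only if its host is on an allowlist and robots.txt permits it. This is an illustration, not Sosse's own mechanism; `in_scope`, `may_crawl`, `robots_from_text`, and the user agent `sosse-sketch` are hypothetical names. For brevity a single robots.txt is applied to every host, whereas a real crawler would load one per host.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def in_scope(url: str, allowed: set[str]) -> bool:
    """True if the URL's host is an allowed domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed)


def robots_from_text(text: str) -> RobotFileParser:
    """Build a robots.txt parser from already-downloaded file content."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def may_crawl(
    url: str,
    allowed: set[str],
    robots: RobotFileParser,
    user_agent: str = "sosse-sketch",
) -> bool:
    """Combine the domain allowlist with the robots.txt rules."""
    return in_scope(url, allowed) and robots.can_fetch(user_agent, url)


if __name__ == "__main__":
    allowed = {"example.com"}  # hypothetical allowlist
    robots = robots_from_text("User-agent: *\nDisallow: /private/")
    for url in (
        "https://example.com/page",
        "https://example.com/private/report",
        "https://other.org/page",
    ):
        print(url, may_crawl(url, allowed, robots))
```

Matching on `host == d` or a `"." + d` suffix keeps lookalike domains such as `notexample.com` out of scope.
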
## 📚 Resources
- [GitHub](https://github.com/search?q=sosse+crawler)
- [ArchiveBox docs](https://github.com/ArchiveBox/ArchiveBox) (reference)

## Related Pages
- [[cat-monitoring]]: Monitoring category
- [[app-archivebox]]: Competitor
- [[recettes-docker-compose]]: Docker templates