How We Indexed 1.14 Million Court Decisions in 3 Weeks
People sometimes ask how a young company has a database more comprehensive than what established providers have built over years. The answer is unspectacular: we did the obvious thing that nobody does. We went directly to the source.
This article is not marketing. It is an experience report: what we built, which decisions we made, and what went wrong.
Day 1: The Realisation
Swiss law is public. Completely. Free of charge.
Fedlex publishes all federal statutes. The 26 cantons publish their statutes. The Federal Supreme Court publishes its decisions. The Federal Administrative Court publishes its decisions. 115 courts publish online.
Everything is there. On official government servers. In structured formats. No paywall. No licence agreement.
Yet law firms and companies pay thousands of francs per year for access to databases that fundamentally do the same thing: collect this public data, process it, and make it searchable. The processing has value. The monopoly on the data itself does not exist.
We asked ourselves: what happens if we take the same public sources, but with modern infrastructure? Not 2005 technology. Instead: PostgreSQL with full-text search in four languages. Vector embeddings for semantic search. A citation graph with 1.42 million edges. All on a single server.
Week 1: Scraping
The Sources
| Source | Type | Volume | Format |
|---|---|---|---|
| Fedlex | Federal statutes | 4,800+ | XML/HTML |
| 26 cantonal portals | Cantonal statutes | 23,000+ | HTML (26 different formats) |
| BGer | Federal Supreme Court decisions | ~70,000 | HTML |
| BVGer | Federal Administrative Court | 91,582 | HTML |
| Cantonal courts | Cantonal decisions | ~980,000 | HTML (113 different formats) |
| SHAB | Official Gazette | 2,500,000 | XML |
| FINMA | Regulation | 27 tables | HTML/PDF |
The 26-Canton Challenge
Every canton has its own portal. Its own HTML format. Its own URL structure. No API. No unified interface.
Zurich delivers clean HTML with consistent structure. Bern uses a CMS from the 2000s with deeply nested frames. Appenzell Innerrhoden has a static website probably maintained by hand.
We wrote a separate parser for each canton. 26 parsers. Some with 50 lines of code, some with 500. The work was not intellectually demanding. It demanded patience.
Lesson 1: The bottleneck in data capture is not technology. It is the willingness to work through 26 different HTML formats without taking shortcuts.
Scraper Architecture
We deliberately built simply:
- Python scripts. No framework. No Scrapy. No Selenium.
- requests for HTTP. BeautifulSoup for HTML parsing. psycopg2 for PostgreSQL.
- Each source its own script. No attempt to build a universal abstraction.
- Incremental updates: each scraper remembers where it left off. On the next run, it only fetches new entries.
Lesson 2: Generic scraping frameworks save time when sources are similar. When every source is a special case, a simple script per source is faster to write and easier to debug.
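The incremental-update rule above fits in a few lines. This is a sketch of the idea, not our actual code; the entry shape and function name are illustrative.

```python
from datetime import date

def select_new_entries(entries, last_seen):
    """Keep only entries published after the stored high-water mark.

    `entries` is what a per-source parser produces (dicts with a
    `published` date); `last_seen` is the mark persisted after the
    previous run. Returns the new entries plus the updated mark.
    """
    new = [e for e in entries if e["published"] > last_seen]
    # If nothing is new, the mark stays where it was.
    mark = max((e["published"] for e in new), default=last_seen)
    return new, mark
```

On each run the scraper loads the mark from the database, fetches the source's index page, and persists the new mark only after every new entry has been stored, so a crash mid-run simply re-fetches on the next night.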
Frequency
Everything runs at night. A cron job at 02:00 starts the pipeline. Fedlex first, then cantons alphabetically, then courts, then SHAB. Total runtime: 2-4 hours, depending on server response times.
We do not overload servers. Maximum parallelism: 2 concurrent requests per source. Pauses between requests. We are guests on public servers and behave accordingly.
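The pause-between-requests rule reduces to a small wrapper. A minimal sketch, with the HTTP call injected so the politeness logic stays testable; the two-concurrent-requests cap sits one level up, where sources are scheduled.

```python
import time

def polite_fetch_all(urls, fetch, pause=1.0):
    """Fetch URLs one after another, pausing between requests.

    `fetch` is whatever performs the actual HTTP call (in a real
    scraper, a thin wrapper around an HTTP client). Injecting it
    keeps this function free of network dependencies.
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(pause)  # be a polite guest on public servers
        results.append(fetch(url))
    return results
```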
Week 2: Structuring and Indexing
The Schema
Having raw data is worthless if it is not structured. Our database schema:
- laws: 27,795 statutes with metadata (SR number, title, entry into force date, canton, language, status)
- law_units: 2.02 million law units (articles, paragraphs, items). Each unit references its statute. Each has the full text in up to four languages.
- decisions: 1.14 million court decisions with metadata (court, date, docket number, legal area)
- citations: 1.42 million citation edges. Which decision cites which article? Which article is cited by which decisions?
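Condensed as DDL, the four tables look roughly like this. Column names are illustrative, the production schema carries more metadata, and decision-to-decision citation edges are omitted for brevity:

```sql
-- Illustrative sketch, not the full production schema
CREATE TABLE laws (
    id        bigserial PRIMARY KEY,
    sr_number text,            -- systematic (SR) number
    canton    text,            -- NULL for federal statutes
    title     text NOT NULL,
    in_force  date,
    status    text
);

CREATE TABLE law_units (
    id      bigserial PRIMARY KEY,
    law_id  bigint REFERENCES laws(id),
    label   text,              -- e.g. 'Art. 336c'
    text_de text, text_fr text, text_it text, text_en text
);

CREATE TABLE decisions (
    id         bigserial PRIMARY KEY,
    court      text NOT NULL,
    docket     text,
    decided_on date,
    legal_area text,
    body       text
);

CREATE TABLE citations (
    decision_id bigint REFERENCES decisions(id),
    law_unit_id bigint REFERENCES law_units(id),
    PRIMARY KEY (decision_id, law_unit_id)
);
```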
Full-Text Search
PostgreSQL has built-in full-text search. For four languages, we needed four different text search configurations:
- german for DE
- french for FR
- italian for IT
- english for EN
Each law unit has a tsvector index per language. Search uses ts_rank for relevance sorting. For most queries, this is sufficiently fast: under 200ms for full-text search across 2 million entries.
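A ranked query against the German configuration looks roughly like this; the tsv_de column name is an assumption for the sketch, not our actual schema:

```sql
-- German full-text search for 'Kündigungsschutz', ranked by relevance
SELECT lu.id, lu.label, ts_rank(lu.tsv_de, q) AS rank
FROM law_units lu,
     to_tsquery('german', 'Kündigungsschutz') AS q
WHERE lu.tsv_de @@ q
ORDER BY rank DESC
LIMIT 20;
```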
Lesson 3: PostgreSQL full-text search is underestimated. For structured text corpora with known languages, it delivers 90% of the quality of Elasticsearch at 10% of the operational complexity.
Embeddings
Full-text search finds words. Semantic search finds concepts.
Example: a search for “dismissal protection during illness” should also find decisions mentioning “blocking period” and “Art. 336c CO” without containing the term “dismissal protection.”
We converted all 2.02 million law units and 1.14 million decisions into 384-dimensional vectors using the all-MiniLM-L6-v2 model. The model is small (80 MB), fast, and multilingual. Not the best embedding model on the market, but entirely sufficient for an initial index.
The vectors sit in PostgreSQL with the pgvector extension. HNSW index with m=16 and ef_construction=64. Approximate nearest neighbour search across 3 million vectors in under 50ms.
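In pgvector terms, the setup described above is roughly the following; table and column names are illustrative:

```sql
CREATE EXTENSION IF NOT EXISTS vector;

ALTER TABLE law_units ADD COLUMN embedding vector(384);

-- HNSW index with the parameters mentioned above
CREATE INDEX ON law_units
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Approximate nearest neighbours to a query embedding
SELECT id, label
FROM law_units
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'::vector  -- a 384-dim vector in practice
LIMIT 10;
```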
Lesson 4: Start with the simplest model that works. A poor embedding that is live beats a perfect embedding still in evaluation.
The Citation Graph
Court decisions cite statutes. Statutes reference other statutes. Later decisions confirm, refine, or overturn earlier decisions.
These relationships are the real value. Not the text itself, which is public. But the connections.
We extract citations automatically from decision texts. RegEx patterns for: “Art. 336c CO,” “BGE 148 III 25,” “BVGer A-1234/2025,” cantonal decision references. 1.42 million edges. Stored in a PostgreSQL table. Queryable with recursive CTEs.
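The extraction itself is ordinary regular expressions. A trimmed sketch for two of the citation types; the real patterns handle far more variants, abbreviations, and all three decision languages, and the function name here is hypothetical:

```python
import re

# Trimmed patterns; production covers more statutes and variants
PATTERNS = {
    "article": re.compile(r"Art\.\s*(\d+[a-z]?)\s+(CO|OR|ZGB|StGB)"),
    "bge":     re.compile(r"BGE\s+(\d+)\s+([IV]+)\s+(\d+)"),
}

def extract_citations(text):
    """Return (kind, matched_string) pairs found in a decision text."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((kind, m.group(0)))
    return hits
```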
What this enables:
- Which decisions cite a specific article?
- How has case law on an article evolved over time?
- Are there contradictions between courts?
- Which articles are cited most frequently (and are thus most contested)?
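Following a chain of decision-to-decision citations is one recursive CTE. This sketch assumes a hypothetical decision_citations(citing_id, cited_id) edge table:

```sql
-- All decisions that directly or indirectly cite decision 42,
-- up to three hops away
WITH RECURSIVE chain AS (
    SELECT citing_id, 1 AS depth
    FROM decision_citations
    WHERE cited_id = 42
  UNION
    SELECT dc.citing_id, c.depth + 1
    FROM decision_citations dc
    JOIN chain c ON dc.cited_id = c.citing_id
    WHERE c.depth < 3
)
SELECT DISTINCT citing_id FROM chain;
```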
Lesson 5: The citation graph is not a feature. It is the product. Everything else is infrastructure.
Week 3: Quality Assurance
What Went Wrong
Problem 1: 34,000 structural nodes without text. Law units like “Second Title” or “Third Section” are structural elements, not content. We initially indexed them, which polluted search results. Solution: identify these nodes and exclude them from the embedding pipeline. 34,000 entries cleaned.
Problem 2: ~12,000 decisions without text. Some cantonal courts publish only headnotes, not full texts. The scrapers created these entries but with empty text fields. Solution: marked, not deleted. The metadata (court, date, citations) still has value.
Problem 3: Encoding chaos with French texts. Two cantonal portals deliver Latin-1 instead of UTF-8. Accents became question marks. Solution: explicit encoding detection with chardet before parsing.
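The core of the fix is "never decode blindly." Production uses chardet; the minimal stdlib version of the same idea is a try-and-fall-back decode, shown here as a simplified stand-in:

```python
def decode_portal_bytes(raw: bytes) -> str:
    """Decode scraped bytes, preferring UTF-8.

    Latin-1 maps every possible byte, so the fallback cannot fail --
    which is also why it must come last, never first.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")
```

chardet goes further: it also distinguishes Latin-1 from close cousins like Windows-1252, which a fixed fallback cannot.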
Problem 4: Duplicates in cantonal statutes. Some cantons publish the same statute under different URLs (consolidated version and amendment act). Our deduplication is based on SR number + canton + entry into force date. This caught 98% of duplicates. The remaining 2% were manually cleaned.
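Deduplication on that composite key is a window function. An illustrative sketch with assumed column names (sr_number, canton, in_force):

```sql
-- Keep the lowest id per (sr_number, canton, in_force) group,
-- delete the rest as duplicates
WITH ranked AS (
    SELECT id,
           row_number() OVER (
               PARTITION BY sr_number, canton, in_force
               ORDER BY id
           ) AS rn
    FROM laws
)
DELETE FROM laws
WHERE id IN (SELECT id FROM ranked WHERE rn > 1);
```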
What We Got Right
- No data invented. Every entry in our database has a source URL pointing to an official government server. Not a single data point comes from a third-party source.
- Incremental, not monolithic. Each scraper can be restarted individually. If one cantonal portal is offline, the rest continues.
- Everything in one database. PostgreSQL. No Elasticsearch cluster, no Redis instance, no separate vector store. One database that does everything. Less infrastructure, fewer failure points.
The Numbers
| Metric | Value |
|---|---|
| Statutes | 27,795 |
| Law units | 2,020,000 |
| Court decisions | 1,140,000 |
| Citation edges | 1,420,000 |
| SHAB entries | 2,500,000 |
| FINMA tables | 27 |
| Database size | 41 GB |
| Development time | 3 weeks |
| Infrastructure costs | CHF 0 (local server) |
| Running costs | ~CHF 15/month (AI inference for alerts) |
What This Means
We do not own exclusive data. We own a pipeline that unlocks public data better than any existing provider. That is no secret. The data sources are open to everyone.
The advantage lies in the execution: 26 cantons correctly parsed, 115 courts covered, 1.42 million citation edges extracted, four languages indexed, everything searchable in a single query. Anyone willing to spend three weeks writing parsers for 26 different HTML formats can replicate it.
Most are not.
This article is for informational purposes and does not constitute legal advice.