How We Indexed 1.14 Million Court Decisions in 3 Weeks

A look behind the scenes: how Mont Virtua built the largest source-verified Swiss legal database. Technical decisions, mistakes, and lessons.


People sometimes ask how a young company has a database more comprehensive than what established providers have built over years. The answer is unspectacular: we did the obvious thing that nobody does. We went directly to the source.

This article is not marketing. It is an experience report: what we built, which decisions we made, and what went wrong.

Day 1: The Realisation

Swiss law is public. Completely. Free of charge.

Fedlex publishes all federal statutes. The 26 cantons publish their statutes. The Federal Supreme Court publishes its decisions. The Federal Administrative Court publishes its decisions. 115 courts publish online.

Everything is there. On official government servers. In structured formats. No paywall. No licence agreement.

Yet law firms and companies pay thousands of francs per year for access to databases that fundamentally do the same thing: collect this public data, process it, and make it searchable. The processing has value. The monopoly on the data itself does not exist.

We asked ourselves: what happens if we take the same public sources, but with modern infrastructure? Not 2005 technology. Instead: PostgreSQL with full-text search in four languages. Vector embeddings for semantic search. A citation graph with 1.42 million edges. All on a single server.

Week 1: Scraping

The Sources

Source               Type                              Volume      Format
Fedlex               Federal statutes                  4,800+      XML/HTML
26 cantonal portals  Cantonal statutes                 23,000+     HTML (26 different formats)
BGer                 Federal Supreme Court decisions   ~70,000     HTML
BVGer                Federal Administrative Court      91,582      HTML
Cantonal courts      Cantonal decisions                ~980,000    HTML (113 different formats)
SHAB                 Official Gazette                  2,500,000   XML
FINMA                Regulation                        27 tables   HTML/PDF

The 26-Canton Challenge

Every canton has its own portal. Its own HTML format. Its own URL structure. No API. No unified interface.

Zurich delivers clean HTML with consistent structure. Bern uses a CMS from the 2000s with deeply nested frames. Appenzell Innerrhoden has a static website probably maintained by hand.

We wrote a separate parser for each canton. 26 parsers. Some with 50 lines of code, some with 500. The work was not intellectually demanding. It demanded patience.

Lesson 1: The bottleneck in data capture is not technology. It is the willingness to work through 26 different HTML formats without taking shortcuts.

Scraper Architecture

We deliberately built simply:

  • Python scripts. No framework. No Scrapy. No Selenium.
  • requests for HTTP. BeautifulSoup for HTML parsing. psycopg2 for PostgreSQL.
  • Each source its own script. No attempt to build a universal abstraction.
  • Incremental updates: each scraper remembers where it left off. On the next run, it only fetches new entries.

Lesson 2: Generic scraping frameworks save time when sources are similar. When every source is a special case, a simple script per source is faster to write and easier to debug.
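The per-source pattern with incremental updates can be sketched roughly like this. It is a dependency-free simplification (the real scripts use requests and BeautifulSoup); the state-file name and function signatures are illustrative, not the actual code.

```python
import json
from pathlib import Path

STATE_FILE = Path("state_zh.json")  # one state file per source (illustrative name)

def load_state() -> dict:
    """Resume point from the previous run; empty on the first run."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"last_seen_id": None}

def save_state(state: dict) -> None:
    STATE_FILE.write_text(json.dumps(state))

def scrape(fetch_listing, parse_entry, store) -> int:
    """Generic per-source loop: fetch the listing starting after the last
    entry seen, parse and store each new entry, then persist the resume
    point so the next run only fetches what is new."""
    state = load_state()
    new_ids = []
    for entry_id, html in fetch_listing(after=state["last_seen_id"]):
        store(parse_entry(html))
        new_ids.append(entry_id)
    if new_ids:
        state["last_seen_id"] = new_ids[-1]
        save_state(state)
    return len(new_ids)
```

The point of the injectable `fetch_listing` and `parse_entry` is that the loop stays identical across sources while each canton supplies its own parsing quirks.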

Frequency

Everything runs at night. A cron job at 02:00 starts the pipeline. Fedlex first, then cantons alphabetically, then courts, then SHAB. Total runtime: 2-4 hours, depending on server response times.

We do not overload servers. Maximum parallelism: 2 concurrent requests per source. Pauses between requests. We are guests on public servers and behave accordingly.
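The pauses between requests can be enforced with a small wrapper like the following. This is a sketch of the idea, not the production code; the one-second delay is illustrative.

```python
import time

class PoliteFetcher:
    """Enforces a minimum pause between consecutive requests to one source."""

    def __init__(self, fetch, min_delay: float = 1.0):
        self.fetch = fetch            # e.g. requests.get in the real scripts
        self.min_delay = min_delay    # seconds between requests (illustrative)
        self._last = 0.0

    def get(self, url: str):
        # Sleep just long enough that min_delay has passed since the last call.
        wait = self._last + self.min_delay - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        self._last = time.monotonic()
        return self.fetch(url)
```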

Week 2: Structuring and Indexing

The Schema

Having raw data is worthless if it is not structured. Our database schema:

  • laws: 27,795 statutes with metadata (SR number, title, entry into force date, canton, language, status)
  • law_units: 2.02 million law units (articles, paragraphs, items). Each unit references its statute. Each has the full text in up to four languages.
  • decisions: 1.14 million court decisions with metadata (court, date, docket number, legal area)
  • citations: 1.42 million citation edges. Which decision cites which article? Which article is cited by which decisions?

PostgreSQL has built-in full-text search. For four languages, we needed four different text search configurations:

  • german for DE
  • french for FR
  • italian for IT
  • english for EN

Each law unit has a tsvector index per language. Search uses ts_rank for relevance sorting. For most queries, this is sufficiently fast: under 200ms for full-text search across 2 million entries.
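In SQL terms, the German configuration looks roughly like this. Table and column names (`law_units`, `text_de`) are assumptions for illustration, not the production schema.

```python
# Expression index over the German text column; one such index per language.
CREATE_INDEX = """
CREATE INDEX law_units_fts_de
    ON law_units
    USING gin (to_tsvector('german', text_de));
"""

# Full-text query with ts_rank for relevance sorting.
SEARCH = """
SELECT id, title,
       ts_rank(to_tsvector('german', text_de),
               plainto_tsquery('german', %s)) AS rank
FROM law_units
WHERE to_tsvector('german', text_de) @@ plainto_tsquery('german', %s)
ORDER BY rank DESC
LIMIT 20;
"""
```

Because the `WHERE` clause uses the same `to_tsvector('german', text_de)` expression as the index, PostgreSQL can satisfy the match from the GIN index instead of scanning two million rows.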

Lesson 3: PostgreSQL full-text search is underestimated. For structured text corpora with known languages, it delivers 90% of the quality of Elasticsearch at 10% of the operational complexity.

Embeddings

Full-text search finds words. Semantic search finds concepts.

Example: a search for “dismissal protection during illness” should also find decisions mentioning “blocking period” and “Art. 336c CO” without containing the term “dismissal protection.”

We converted all 2.02 million law units and 1.14 million decisions into 384-dimensional vectors using the all-MiniLM-L6-v2 model. The model is small (80 MB), fast, and multilingual. Not the best embedding model on the market, but entirely sufficient for an initial index.

The vectors sit in PostgreSQL with the pgvector extension. HNSW index with m=16 and ef_construction=64. Approximate nearest neighbour search across 3 million vectors in under 50ms.
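The pgvector side can be sketched as follows. Again, table and column names are assumptions; the choice of cosine distance (`<=>`) is one reasonable option for normalised sentence embeddings, not necessarily what runs in production.

```python
# One-time setup: extension, a 384-dimensional vector column, HNSW index.
SETUP = """
CREATE EXTENSION IF NOT EXISTS vector;
ALTER TABLE law_units ADD COLUMN embedding vector(384);
CREATE INDEX law_units_embedding_hnsw
    ON law_units
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
"""

# Approximate nearest-neighbour query: %s is the query embedding.
NEAREST = """
SELECT id, title
FROM law_units
ORDER BY embedding <=> %s::vector
LIMIT 10;
"""
```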

Lesson 4: Start with the simplest model that works. A poor embedding that is live beats a perfect embedding still in evaluation.

The Citation Graph

Court decisions cite statutes. Statutes reference other statutes. Later decisions confirm, refine, or overturn earlier decisions.

These relationships are the real value. Not the text itself, which is public. But the connections.

We extract citations automatically from decision texts. RegEx patterns for: “Art. 336c CO,” “BGE 148 III 25,” “BVGer A-1234/2025,” cantonal decision references. 1.42 million edges. Stored in a PostgreSQL table. Queryable with recursive CTEs.
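A sketch of the extraction step: three of the patterns, heavily simplified. Real coverage needs many more variants (multilingual statute abbreviations, citation ranges, joined citations), so treat these regexes as illustrative.

```python
import re

CITATION_PATTERNS = {
    "statute": re.compile(r"Art\.\s*\d+[a-z]?\s+[A-Z][A-Za-z]+"),  # Art. 336c CO
    "bge":     re.compile(r"BGE\s+\d{1,3}\s+[IV]+\s+\d+"),         # BGE 148 III 25
    "bvger":   re.compile(r"BVGer\s+[A-F]-\d+/\d{4}"),             # BVGer A-1234/2025
}

def extract_citations(text: str) -> list[tuple[str, str]]:
    """Return (kind, citation) edges found in a decision text."""
    edges = []
    for kind, pattern in CITATION_PATTERNS.items():
        for match in pattern.finditer(text):
            edges.append((kind, match.group(0)))
    return edges
```

Each extracted string is then resolved to a database ID and written to the edge table; unresolvable citations are the interesting failure cases.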

What this enables:

  • Which decisions cite a specific article?
  • How has case law on an article evolved over time?
  • Are there contradictions between courts?
  • Which articles are cited most frequently (and are thus most contested)?
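The first of these questions extends naturally to citation chains via a recursive CTE. A sketch with assumed column names (`citations(source_id, target_id)`, where a source cites a target): it finds decisions citing an article directly, then decisions citing those decisions, to a bounded depth.

```python
# Depth-limited walk of the citation graph around one article.
CHAIN = """
WITH RECURSIVE chain AS (
    SELECT c.source_id, c.target_id, 1 AS depth
    FROM citations c
    WHERE c.target_id = %(article_id)s
  UNION ALL
    SELECT c.source_id, c.target_id, chain.depth + 1
    FROM citations c
    JOIN chain ON c.target_id = chain.source_id
    WHERE chain.depth < 3
)
SELECT DISTINCT source_id, depth FROM chain;
"""
```

The depth bound matters: without it, cycles in the graph (decisions citing each other) would make the recursion unbounded.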

Lesson 5: The citation graph is not a feature. It is the product. Everything else is infrastructure.

Week 3: Quality Assurance

What Went Wrong

Problem 1: 34,000 structural nodes without text. Law units like “Second Title” or “Third Section” are structural elements, not content. We initially indexed them, which polluted search results. Solution: identify these nodes and exclude them from the embedding pipeline. 34,000 entries cleaned.
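The filter for Problem 1 amounts to a heuristic along these lines. The regex and field names are illustrative, not the production rule.

```python
import re

# Heading-only units like "Second Title" or "Third Section" carry no content.
STRUCTURAL = re.compile(
    r"^(First|Second|Third|Fourth|Fifth|\d+\w*)\s+(Title|Section|Chapter|Part)\b",
    re.IGNORECASE,
)

def is_structural(unit: dict) -> bool:
    """True for units that should be excluded from the embedding pipeline."""
    body = (unit.get("text") or "").strip()
    return not body or bool(STRUCTURAL.match(unit.get("title", "").strip()))
```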

Problem 2: ~12,000 decisions without text. Some cantonal courts publish only headnotes, not full texts. The scrapers created these entries but with empty text fields. Solution: marked, not deleted. The metadata (court, date, citations) still has value.

Problem 3: Encoding chaos with French texts. Two cantonal portals deliver Latin-1 instead of UTF-8. Accents became question marks. Solution: explicit encoding detection with chardet before parsing.
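The pipeline uses chardet for detection; the core of the fallback idea can be shown without the dependency. A minimal sketch: try UTF-8 first, and only then Latin-1, which maps every byte and therefore never raises (though the result may still need review).

```python
def decode_response(raw: bytes) -> str:
    """Decode scraped bytes: UTF-8 if valid, otherwise Latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")
```

The order matters: decoding UTF-8 bytes as Latin-1 silently produces mojibake, while decoding Latin-1 bytes as UTF-8 fails loudly, so UTF-8-first is the safe direction.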

Problem 4: Duplicates in cantonal statutes. Some cantons publish the same statute under different URLs (consolidated version and amendment act). Our deduplication is based on SR number + canton + entry into force date. This caught 98% of duplicates. The remaining 2% were manually cleaned.
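The deduplication key described above boils down to a first-seen-wins pass over the composite key. Field names are illustrative.

```python
def dedupe(statutes: list[dict]) -> list[dict]:
    """Keep the first statute seen per (SR number, canton, entry-into-force
    date); later entries with the same key are treated as duplicates."""
    seen = set()
    unique = []
    for statute in statutes:
        key = (statute["sr_number"], statute["canton"], statute["in_force_date"])
        if key not in seen:
            seen.add(key)
            unique.append(statute)
    return unique
```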

What We Got Right

  • No data invented. Every entry in our database has a source URL pointing to an official government server. Not a single data point comes from a third-party source.
  • Incremental, not monolithic. Each scraper can be restarted individually. If one cantonal portal is offline, the rest continues.
  • Everything in one database. PostgreSQL. No Elasticsearch cluster, no Redis instance, no separate vector store. One database that does everything. Less infrastructure, fewer failure points.

The Numbers

Metric                 Value
Statutes               27,795
Law units              2,020,000
Court decisions        1,140,000
Citation edges         1,420,000
SHAB entries           2,500,000
FINMA tables           27
Database size          41 GB
Development time       3 weeks
Infrastructure costs   CHF 0 (local server)
Running costs          ~CHF 15/month (AI inference for alerts)

What This Means

We do not own exclusive data. We own a pipeline that unlocks public data better than any existing provider. That is no secret. The data sources are open to everyone.

The advantage lies in the execution: 26 cantons correctly parsed, 115 courts covered, 1.42 million citation edges extracted, four languages indexed, everything searchable in a single query. Anyone willing to spend three weeks writing parsers for 26 different HTML formats can replicate this.

Most are not.


This article is for informational purposes and does not constitute legal advice.
