Better European language models, aligned with European values
Open models are trained on English first, everything else second. We think we can do better: higher quality output for European languages, more aligned with how Europeans communicate and what we value. This is our proof of concept, starting with Dutch. Let's be honest: this is a fine-tune, not a new foundation model. But a good fine-tune on the right data, properly measured, is a useful starting point.
How we measure
Two tracks: automated benchmarks and human evaluation.
EuroEval (euroeval.com) is a benchmarking framework for European languages. It tests reading comprehension, summarization, NER, and linguistic quality across 12+ languages including Dutch. It's maintained by the Alexandra Institute, funded by the EU's TrustLLM project, and comes with a Python package so anyone can reproduce the results.
Arena eval: we also run a blind human comparison. Same prompt goes to the baseline and our tuned model. Dutch speakers see both outputs side by side (anonymized) and pick which one is better. This gives us something benchmarks can't: "do real people actually prefer this model's Dutch?"
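To pin down what "blind" means here, a minimal sketch of the shuffling and tallying logic; the function and field names are our own, not from any existing arena tool:

```python
import random

def make_arena_item(prompt: str, baseline_out: str, tuned_out: str) -> dict:
    """One blind comparison: shuffle the two outputs so the rater
    cannot tell which model produced which answer."""
    outputs = [("baseline", baseline_out), ("tuned", tuned_out)]
    random.shuffle(outputs)
    return {
        "prompt": prompt,
        "option_a": outputs[0][1],
        "option_b": outputs[1][1],
        # Kept server-side only, never shown to the rater.
        "key": {"a": outputs[0][0], "b": outputs[1][0]},
    }

def record_vote(item: dict, choice: str, tally: dict) -> None:
    """choice is 'a' or 'b'; unblind it and count the real winner."""
    winner = item["key"][choice]
    tally[winner] = tally.get(winner, 0) + 1
```

The headline number is then simply tuned-model wins over total votes.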
Starting point
We run EuroEval on a few strong open models first and pick the best one as our base. Current shortlist:
- Llama 3.1 8B Instruct
- Qwen2.5 7B Instruct
- Gemma 2 9B Instruct
Whichever scores best on Dutch becomes the base for fine-tuning.
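Concretely, the baseline pass is a loop over the shortlist. A sketch, assuming EuroEval's documented `Benchmarker` interface and its `language` argument; check euroeval.com for the exact signature in the current release:

```python
from euroeval import Benchmarker

SHORTLIST = [
    "meta-llama/Llama-3.1-8B-Instruct",
    "Qwen/Qwen2.5-7B-Instruct",
    "google/gemma-2-9b-it",
]

# Restrict to the Dutch benchmark suite (assumed argument name).
benchmarker = Benchmarker(language="nl")

for model_id in SHORTLIST:
    # Runs the reading comprehension, summarization, NER, and
    # linguistic-quality tasks and writes scores to a local
    # results file, so the comparison is reproducible.
    benchmarker.benchmark(model=model_id)
```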
Training data
- Parliamentary debate transcripts from the Tweede Kamer and EU Parliament. Official plenary records, publicly available.
- Europarl parallel corpus for cross-lingual alignment.
- Government publications: Rijksoverheid docs, legal texts, policy briefs.
- Clean Dutch web text (CommonCrawl/OSCAR-derived, deduplicated, PII-filtered).
- Instruction data generated from the above: summarization, translation, Q&A, entity extraction (a record sketch follows below).
No copyrighted books, no unauthorized subtitles. Every source gets a license check.
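For the instruction data, each source document becomes one or more chat-style records with provenance attached. A minimal sketch of the summarization case; the field names, loader, and Dutch prompt wording are our choices, not a fixed standard:

```python
import json

def to_summarization_example(title: str, text: str, summary: str) -> dict:
    # The user turn asks, in Dutch: "Summarize the following report:".
    return {
        "messages": [
            {"role": "user",
             "content": f"Vat het volgende verslag samen:\n\n{title}\n\n{text}"},
            {"role": "assistant", "content": summary},
        ],
        "source": "tweede_kamer",  # provenance, so the license check stays auditable
    }

# Placeholder record; real ones come from the plenary archives.
records = [{"title": "Plenair verslag (voorbeeld)", "text": "...", "summary": "..."}]

with open("instructions.jsonl", "w", encoding="utf-8") as f:
    for r in records:
        f.write(json.dumps(to_summarization_example(**r), ensure_ascii=False) + "\n")
```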
Method
QLoRA fine-tuning on EU-hosted GPUs (a single A100 is enough for a 7-9B model). The steps, with a minimal training sketch after the list:
- Run EuroEval baselines.
- QLoRA instruction tuning, mostly Dutch data with a small multilingual stability mix.
- Optional: short continued pretraining on raw Dutch text if fluency needs work.
- Re-run EuroEval + arena eval. Publish deltas.
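The training sketch mentioned above, assuming the standard Hugging Face stack (transformers, peft, bitsandbytes, trl) and the JSONL format from the data section; the hyperparameters are illustrative starting points, not tuned values:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

BASE = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder: whichever model wins the baselines

# 4-bit NF4 quantization: the "Q" in QLoRA. The frozen base model
# fits in A100 memory; only the LoRA adapters are trained.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(BASE)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Mostly-Dutch instruction data plus a small multilingual stability mix.
dataset = load_dataset("json", data_files="instructions.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,  # `tokenizer=` in older trl releases
    peft_config=lora,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="qlora-nl",
        num_train_epochs=2,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        bf16=True,
    ),
)
trainer.train()
```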
Cost
- Evaluation: ~2-8 GPU-hours per model, i.e. €5.50-22 each at the reference rate below.
- QLoRA training: ~7-28 GPU-hours per run, i.e. €19.25-77.
- Continued pretraining (optional): €400+ for a full billion-token run.
Realistic total: €200-500 including iteration. All compute on EU providers. Price reference: OVHcloud A100 at €2.75/hour.
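The per-item numbers above are just GPU-hours times the reference rate; a short sanity check:

```python
RATE = 2.75  # EUR per A100-hour (OVHcloud reference price)

for label, lo, hi in [("eval per model", 2, 8), ("QLoRA run", 7, 28)]:
    print(f"{label}: EUR {lo * RATE:.2f}-{hi * RATE:.2f}")
# eval per model: EUR 5.50-22.00
# QLoRA run: EUR 19.25-77.00
```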
Timeline
Volunteer project, not a corporate sprint.
- Month 1: Set up EuroEval, run baselines, collect data sources.
- Months 2-3: Data pipeline + first training experiments.
- Months 3-4: Iterate on data mix and training. Run arena eval.
- Months 4-5: Publish adapters, dataset card, recipe, and evaluation report.
Deliverables
- EuroEval scores for baselines and our model (reproducible).
- Arena eval results with Dutch speakers.
- Adapters (or full weights) on Hugging Face.
- Training recipe, dataset card, and a write-up of what worked and what didn't.