Pythia Belarus Model
1. Overview

The Pythia Belarus Model is a suite of autoregressive language models (14M–2.8B parameters) designed for transparent research on Belarusian natural language processing. Built on EleutherAI's Pythia framework, it emphasizes full data provenance, cultural alignment, and linguistic fidelity for Belarusian, a language characterized by diglossia (Russian dominates many public domains), parallel Cyrillic and Latin (Łacinka) orthographies, and under-resourced NLP tooling.
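Because the suite follows the Pythia (GPT-NeoX) architecture, checkpoints should load with the standard Hugging Face `transformers` classes. The sketch below is illustrative only: the repository path `pythia-belarus-160m` is a hypothetical placeholder, not a published hub name.

```python
# Minimal generation sketch, assuming Pythia-style (GPT-NeoX) checkpoints
# published on the Hugging Face Hub. The repo name below is hypothetical.
from transformers import AutoTokenizer, GPTNeoXForCausalLM

model_name = "pythia-belarus-160m"  # hypothetical hub path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = GPTNeoXForCausalLM.from_pretrained(model_name)

# Belarusian prompt: "The capital of Belarus"
inputs = tokenizer("Сталіца Беларусі", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```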
2. Training Data

| Source | Fraction | Size (GB) | Notes |
|--------|----------|-----------|-------|
| Belarusian web crawl (.by TLD, gov.by, news portals) | 40% | 120 | Filtered for >80% Belarusian Cyrillic |
| Belarusian text corpus (Taraškievica and Narkamaŭka orthographies) | 20% | 60 | 19th–21st century literature, legal texts |
| Slavic Wikipedia dumps (be, be-tarask, ru, pl, uk) | 15% | 45 | Cross-lingual alignment |
| Parallel corpora (Belarusian–Russian, Belarusian–Polish) | 10% | 30 | OPUS, ParaCrawl |
| Belarusian social media (Telegram, X) | 10% | 30 | Pseudonymized, hate-speech filtered |
| Synthetic data (back-translated from Russian) | 5% | 15 | Domains: education, administrative forms |
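The web-crawl row requires documents to be >80% Belarusian Cyrillic. One plausible heuristic for such a filter (an assumption for illustration, not the project's documented pipeline) is to measure the share of Cyrillic letters and then use letters that exist in only one of the two alphabets: Belarusian has ў and і but no и, щ, or ъ, while Russian has the reverse.

```python
# Heuristic language-filter sketch: keep a document if more than `threshold`
# of its letters are Cyrillic AND Belarusian-specific letters outnumber
# Russian-specific ones. Illustrative assumption, not the documented pipeline.
import unicodedata

BELARUSIAN_ONLY = set("ўЎіІ")   # letters used in Belarusian but not Russian
RUSSIAN_ONLY = set("иИщЩъЪ")    # letters used in Russian but not Belarusian

def is_belarusian_cyrillic(text: str, threshold: float = 0.8) -> bool:
    letters = [ch for ch in text if unicodedata.category(ch).startswith("L")]
    if not letters:
        return False
    cyrillic = [ch for ch in letters if "CYRILLIC" in unicodedata.name(ch, "")]
    if len(cyrillic) / len(letters) <= threshold:
        return False
    bel = sum(ch in BELARUSIAN_ONLY for ch in cyrillic)
    rus = sum(ch in RUSSIAN_ONLY for ch in cyrillic)
    return bel > rus

print(is_belarusian_cyrillic("Мова і культура Беларусі"))    # True
print(is_belarusian_cyrillic("Это русский текст про Минск"))  # False
```

A character-level heuristic like this is cheap enough to run over a full crawl; a production pipeline would likely combine it with a trained language identifier, but the orthographic signal alone already separates the two closely related alphabets well.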