Soro: A Lightweight Foundation Model and Chatbot for Tajik
Soro is a new family of Tajik-specialized conversational large language models (LLMs) designed for deployment under the tight compute and connectivity constraints found in Tajikistan. Built from open-weight Gemma 3 checkpoints, Soro is trained in two stages. First, it receives Tajik-only continual pretraining on a curated 1.9-billion-token corpus of filtered web text, PDF documents, and curriculum-aligned educational materials. Second, it undergoes supervised instruction tuning on 40,000 Tajik teacher-style examples. Because standard benchmarks cover Tajik poorly, the researchers also introduce a new suite of Tajik benchmarks, open-sourced on Hugging Face, that covers general knowledge, linguistic competence, and school and university entrance exam domains. On these benchmarks, Soro substantially outperforms same-size Gemma 3 baselines while retaining strong English performance on standard datasets. The project further shows that FP8 and INT4 quantization of Soro preserves most of the Tajik-language gains while reducing memory requirements for edge deployment. The work supports an ongoing education-sector pilot and a planned scale-out across schools in Tajikistan.
For developers, the work provides access to specialized, resource-efficient LLMs and open benchmarks for a low-resource language, which can serve as a basis for building Tajik-language applications.
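To make the memory effect of FP8 and INT4 quantization concrete, the following is a minimal back-of-the-envelope sketch, not the project's own tooling. It estimates the memory needed for the weights alone at different precisions. The parameter counts in the loop are hypothetical placeholders rather than figures from the source, and the estimate ignores quantization metadata (scales, zero-points), activations, and the KV cache, so real footprints will be somewhat larger.

```python
# Approximate bytes needed to store one weight at each precision.
# BF16 is assumed here as the unquantized baseline; the source does not state it.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: float, precision: str) -> float:
    """Return the approximate weight memory in GiB for a given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30


if __name__ == "__main__":
    # Hypothetical model sizes chosen only for illustration.
    for num_params in (1e9, 4e9):
        for precision in ("bf16", "fp8", "int4"):
            gib = weight_memory_gib(num_params, precision)
            print(f"{num_params / 1e9:.0f}B params @ {precision}: {gib:.2f} GiB")
```

Under these assumptions, FP8 halves the weight memory relative to BF16 and INT4 cuts it to a quarter, which is the kind of reduction that matters for edge deployment.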