Estimin3n: SOTA Open-Source Multimodal Kazakh Audio/Text-to-Text LLM
Uteulin Bekzat Dastanuly
Independent Researcher
https://huggingface.co/govnejri
October 1, 2025

Abstract

The development of automatic speech recognition (ASR) systems for low-resource languages remains a significant challenge in natural language processing. This paper presents Estimin3n, a novel multimodal Audio/Text-to-Text model specifically designed for Kazakh speech recognition. Built upon Google’s Gemma-3n E4B-IT architecture, Estimin3n employs advanced parameter-efficient fine-tuning techniques, including Low-Rank Adaptation (LoRA) and 4-bit quantization, to achieve effective performance while maintaining computational efficiency. The model was trained using Supervised Fine-Tuning (SFT) on the Kazakh Speech Corpus 2 (KSC2) dataset, incorporating both language and audio components through targeted adaptation. Our experimental results demonstrate that Estimin3n achieves a Word Error Rate (WER) of 12.2% and a Character Error Rate (CER) of 4.3% on the KSC2 test set, establishing competitive performance for Kazakh speech recognition. Additionally, the model scores 47.4% on the KazMMLU benchmark, indicating that substantial general language understanding is preserved alongside the speech recognition specialization. The implementation utilizes memory-efficient training strategies, including 4-bit NormalFloat quantization and an 8-bit AdamW optimizer, enabling training on consumer-grade hardware while preserving model quality. This work contributes to the advancement of multilingual AI by providing an open-source, end-to-end multimodal solution for Kazakh language processing, addressing a critical gap in ASR technologies for Turkic languages.


Introduction

The rapid advancement of large language models has predominantly benefited high-resource languages such as English, Chinese, and Spanish, while low-resource languages continue to face significant technological barriers. Kazakh, a Turkic language spoken by over 14 million people primarily in Kazakhstan and surrounding regions, exemplifies this challenge. Despite its substantial speaker base and official status in Kazakhstan, Kazakh remains underrepresented in modern artificial intelligence systems, particularly in automatic speech recognition and multimodal language processing.

The scarcity of robust ASR systems for Kazakh presents multifaceted challenges. Traditional ASR approaches require extensive labeled speech corpora, sophisticated acoustic modeling, and language-specific optimization—resources that are often limited for low-resource languages. Moreover, existing multilingual models, while capable of handling numerous languages, frequently underperform on languages with limited training representation, complex morphology, and unique phonological characteristics that distinguish Kazakh from more commonly studied languages.

Recent developments in multimodal large language models have demonstrated remarkable capabilities in processing diverse input modalities including text, images, and audio. Models such as GPT-4o and Gemini have shown that unified architectures can effectively handle cross-modal tasks, suggesting new paradigms for addressing low-resource language challenges. However, these state-of-the-art models remain computationally expensive, require substantial resources for deployment, and often exhibit suboptimal performance on specialized tasks involving underrepresented languages.


The emergence of parameter-efficient fine-tuning techniques, particularly Low-Rank Adaptation (LoRA) and quantization methods, has opened new possibilities for adapting large pre-trained models to specific domains and languages with reduced computational overhead. These approaches enable researchers and practitioners to leverage the knowledge encoded in foundation models while tailoring them to specialized applications through targeted adaptation of a small subset of parameters.

This paper introduces Estimin3n, a multimodal Audio/Text-to-Text model specifically developed for Kazakh speech recognition. Our approach builds upon Google’s Gemma-3n E4B-IT architecture, employing sophisticated parameter-efficient fine-tuning strategies to create an effective, open-source solution for Kazakh language processing. The model represents a significant advancement in multimodal AI for Turkic languages, demonstrating that targeted adaptation of existing architectures can yield competitive performance while maintaining computational efficiency.


Our contributions encompass several key areas. First, we present the first open-source multimodal Audio/Text-to-Text model specifically optimized for Kazakh speech recognition, addressing a critical gap in language technology for Central Asian languages. Second, we demonstrate the effectiveness of combining LoRA adaptation with 4-bit quantization for multimodal model training, achieving substantial memory efficiency improvements without significant performance degradation. Third, we provide comprehensive experimental validation on both speech recognition metrics and general language understanding benchmarks, establishing baseline performance for future research in Kazakh language AI.

The development of Estimin3n involved extensive experimentation with parameter-efficient training techniques, requiring approximately 69 hours of training on a single NVIDIA RTX 3090 GPU with 24GB of VRAM. Our approach successfully combines advanced quantization techniques with targeted LoRA adaptation, enabling effective model training within computational constraints typical of academic and independent research environments. The resulting model achieves competitive performance metrics while maintaining compatibility with consumer-grade hardware for inference and deployment.


Related Work


Automatic Speech Recognition for Low-Resource Languages

The development of automatic speech recognition systems for low-resource languages has been a longstanding challenge in speech technology. Traditional approaches relied heavily on Hidden Markov Models and Gaussian Mixture Models, which required substantial amounts of phonetically transcribed data and language-specific acoustic modeling. Early work in Kazakh ASR demonstrated the fundamental challenges associated with the language’s complex morphology, vowel harmony, and limited availability of annotated speech corpora.

Recent advances in neural speech recognition have introduced new paradigms for low-resource language modeling. Wav2Vec2.0 and its multilingual variant Wav2Vec2-XLSR have shown promising results for various underrepresented languages through self-supervised pre-training on large multilingual speech corpora followed by fine-tuning on target language data. Research by Mussakhojayeva et al. demonstrated that Wav2Vec2-XLSR fine-tuned on Kazakh data achieved character error rates as low as 1.9% and word error rates of 8.9% on specific test sets, establishing strong baselines for Kazakh speech recognition performance.

The Universal Speech Model (USM) developed by Google represents another significant advancement in multilingual speech recognition, demonstrating the potential for large-scale self-supervised learning approaches. However, these models typically require substantial computational resources for training and deployment, limiting their accessibility for researchers and practitioners working with specific low-resource languages. Additionally, most existing approaches focus exclusively on speech-to-text transcription rather than integrating speech recognition capabilities within broader multimodal language understanding frameworks.

Transfer learning approaches have shown particular promise for low-resource speech recognition. Studies have demonstrated that models pre-trained on high-resource languages such as Russian can be effectively adapted to Kazakh, leveraging phonetic similarities between related languages. This approach has proven especially relevant for Kazakh-Russian bilingual scenarios, which are common in Kazakhstan’s multilingual environment.


Multimodal Large Language Models

The emergence of multimodal large language models represents a paradigm shift in artificial intelligence, enabling unified architectures capable of processing and generating content across diverse modalities. GPT-4V and GPT-4o have demonstrated remarkable capabilities in handling text, images, and audio inputs within single conversational interfaces, establishing new benchmarks for multimodal AI performance.

Google’s Gemini family of models has further advanced the field through native multimodal training, processing text, images, audio, and video inputs through unified transformer architectures. These models demonstrate that multimodal training can improve performance on individual modalities compared to specialized single-modal models, suggesting synergistic effects from cross-modal learning.

Recent developments in open-source multimodal models have democratized access to advanced AI capabilities. Models such as LLaVA, InstructBLIP, and the Flamingo family have shown that effective multimodal models can be developed through careful combination of pre-trained vision and language components. However, most existing open-source multimodal models focus primarily on vision-language tasks, with limited attention to audio processing and speech recognition capabilities.

The Gemma family of models, including the recently released Gemma-3n variants, represents Google’s commitment to open-source AI development. Gemma-3n specifically introduces native audio processing capabilities through integration with Universal Speech Model encoders, enabling end-to-end speech recognition and audio understanding tasks. The architecture’s support for multiple input modalities makes it an ideal foundation for developing specialized multimodal applications for low-resource languages.


Parameter-Efficient Fine-Tuning Techniques

Parameter-efficient fine-tuning has emerged as a crucial technique for adapting large pre-trained models to specific tasks while minimizing computational requirements and avoiding catastrophic forgetting. Low-Rank Adaptation (LoRA) represents one of the most successful approaches in this domain, decomposing weight updates into low-rank matrices that can be trained efficiently while keeping the majority of model parameters frozen. The theoretical foundation of LoRA rests on the hypothesis that adaptation to new tasks occurs in a low-rank subspace of the full parameter space. By constraining weight updates to low-rank factorizations, LoRA achieves substantial parameter reduction while maintaining model performance. Recent extensions of LoRA, including AdaLoRA and LoRA with different rank assignments across layers, have further improved the efficiency and effectiveness of this approach.

Quantization techniques have gained prominence as complementary approaches to parameter-efficient fine-tuning. 4-bit quantization methods, particularly the NormalFloat 4 (NF4) data type introduced in the QLoRA paper, enable dramatic memory reduction during both training and inference. The combination of LoRA with 4-bit quantization, known as QLoRA, has demonstrated that large language models can be fine-tuned effectively on consumer-grade hardware without significant performance degradation.
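
For concreteness, the standard LoRA update described above can be written as follows, where W_0 is a frozen pretrained weight matrix and only the low-rank factors A and B are trained (notation follows the original LoRA formulation; the alpha/r scaling matches the configuration used later for Estimin3n):

h = W_0 x + \frac{\alpha}{r} B A x, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)

With the rank r = 8 used for Estimin3n, each adapted d × k matrix adds only r(d + k) trainable parameters, which is the source of the roughly 0.1% trainable-parameter overhead reported in the Methodology section.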

Recent work has extended parameter-efficient fine-tuning to multimodal architectures, addressing the unique challenges associated with adapting models that process multiple input modalities. These approaches must carefully balance adaptation across different modal components while maintaining the model’s ability to perform cross-modal reasoning and generation tasks.


Kazakh Language Technology

Kazakhstan’s multilingual environment, with Kazakh as the state language alongside Russian as an official language and English as an international communication medium, creates unique requirements for language technology development. The ISSAI research institute at Nazarbayev University has been instrumental in developing Kazakh language resources, including the Kazakh Speech Corpus (KSC) and its expanded version KSC2, which contains over 1,200 hours of transcribed speech data.

The development of Kazakh large language models has gained momentum with the recent introduction of ISSAI KazLLM, available in 8-billion and 70-billion parameter versions. These models, built on the Llama architecture and trained on over 150 billion tokens in Kazakh, Russian, English, and Turkish, represent significant progress in Kazakh natural language processing. Performance evaluations on the KazMMLU benchmark demonstrate competitive results compared to international models, with the 70-billion parameter version achieving scores exceeding 62% on Kazakh language tasks.

Despite these advances, a significant gap remains in multimodal AI capabilities for Kazakh. Existing models focus primarily on text generation and understanding, with limited integration of speech recognition and audio processing capabilities. This limitation restricts the development of comprehensive AI assistants and interactive systems that can serve Kazakhstan’s multilingual population effectively.

The challenge of code-switching between Kazakh and Russian presents additional complexity for AI system development. Many Kazakh speakers regularly switch between languages within conversations, requiring AI systems capable of handling multilingual inputs and generating appropriate responses that respect linguistic and cultural contexts. Current multimodal models have limited capabilities in handling such code-switching scenarios, particularly for language pairs involving Turkic and Slavic languages.


Methodology


Base Model Architecture

Estimin3n is built upon Google’s Gemma-3n E4B-IT architecture, a state-of-the-art multimodal transformer model designed for efficient on-device deployment. The Gemma-3n family represents a significant advancement in mobile-first AI architectures, incorporating the MatFormer (Matryoshka Transformer) design that enables nested parameter optimization and selective activation patterns. The E4B variant contains approximately 8 billion parameters with 4 billion effective parameters activated during inference, achieved through the model’s sophisticated parameter efficiency mechanisms. The architecture integrates several key innovations that make it particularly suitable for multimodal applications. The MatFormer design enables elastic inference patterns, allowing the model to dynamically adjust computational requirements based on task complexity and available resources. This flexibility proves essential for speech recognition tasks, which may require varying levels of computational intensity depending on audio quality, speaker characteristics, and linguistic complexity.

Per-Layer Embedding (PLE) caching represents another crucial architectural feature, enabling efficient parameter management during both training and inference. This technique allows a significant portion of the model’s parameters to be stored in CPU memory while maintaining core transformer weights in accelerator memory, substantially reducing VRAM requirements without impacting performance. For Estimin3n development, PLE caching proved instrumental in enabling training within the memory constraints of consumer-grade GPUs.

The native audio processing capabilities of Gemma-3n derive from integration with Universal Speech Model components, providing robust speech encoding and tokenization mechanisms. The audio encoder generates tokens at approximately 6.25 tokens per second of input audio, creating dense representations that capture both acoustic and linguistic information; a 30-second utterance therefore maps to roughly 188 audio tokens. This tokenization rate balances representational fidelity with computational efficiency, enabling effective processing of speech inputs up to 30 seconds in length within the model’s context window limitations.

The multimodal architecture processes audio inputs through dedicated encoding pathways that are subsequently integrated with the model’s text processing components through cross-attention mechanisms. This design enables the model to perform joint reasoning over audio and text inputs, supporting complex tasks such as speech recognition, audio captioning, and multimodal dialogue understanding. The integration preserves the model’s text-only capabilities while adding sophisticated audio processing functionality.


Dataset Specification

The training of Estimin3n utilized the Kazakh Speech Corpus 2 (KSC2), the most comprehensive open-source speech dataset available for the Kazakh language. KSC2 represents a substantial expansion of previous Kazakh speech resources, containing approximately 1,200 hours of high-quality transcribed speech data comprising over 600,000 utterances from diverse speakers across different regions, age groups, and gender distributions.

The dataset encompasses multiple domains and speaking styles to ensure robust model performance across varied use cases. Training data includes read speech from literature, news broadcasts, parliamentary proceedings, podcast conversations, and spontaneous speech samples. This diversity ensures that Estimin3n can handle various acoustic conditions, speaking rates, and linguistic registers commonly encountered in real-world applications.

A particularly valuable aspect of KSC2 is its inclusion of code-switching utterances between Kazakh and Russian, reflecting natural language use patterns among bilingual speakers in Kazakhstan. These samples provide crucial training data for developing models capable of handling multilingual inputs, a common requirement for AI systems deployed in Kazakhstan’s multilingual environment. The code-switching data comprises approximately 15% of the total corpus, providing substantial exposure to this linguistically complex phenomenon.

Audio quality standards in KSC2 maintain professional recording specifications with sampling rates of 16kHz and consistent amplitude normalization. All recordings underwent quality assurance procedures including manual transcription verification by native Kazakh speakers and automated quality checks for audio integrity. This rigorous curation process ensures that training data meets the standards required for high-performance speech recognition model development.

The dataset’s transcriptions follow standardized orthographic conventions for Kazakh, using Cyrillic script with proper handling of Kazakh-specific characters and diacritical marks. Transcription guidelines address common challenges in Kazakh orthography, including vowel harmony representation, morphological variation, and loanword integration from Russian and other languages. This standardization facilitates consistent model training and evaluation across different research groups and applications.


Parameter-Efficient Fine-Tuning Strategy

The development of Estimin3n employed a sophisticated parameter-efficient fine-tuning approach combining Low-Rank Adaptation with advanced quantization techniques. This methodology enabled effective model adaptation while maintaining computational feasibility within research constraints. The fine-tuning strategy encompassed three primary components: model quantization, LoRA adaptation, and supervised fine-tuning optimization.

4-bit quantization using the NormalFloat 4 (NF4) data type provided the foundation for memory-efficient training. NF4 quantization reduces memory requirements by approximately 75% compared to full-precision training while preserving model performance through theoretically-motivated quantization schemes designed for normally-distributed weights. The quantization process utilized double quantization techniques, applying secondary quantization to the quantization constants themselves and achieving additional memory savings of approximately 0.4 bits per parameter.
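
A minimal sketch of this quantization setup, assuming the Hugging Face transformers and bitsandbytes libraries; the base checkpoint name and compute dtype are assumptions, and the exact loading call for Gemma-3n may differ across library versions:

import torch
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization, as described above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat 4 data type
    bnb_4bit_use_double_quant=True,        # re-quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16, # compute dtype for dequantized matmuls
)
# The config is then passed when loading the base model, e.g. (assumed repo id):
# base_model = AutoModel.from_pretrained("google/gemma-3n-E4B-it",
#                                         quantization_config=bnb_config)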

The LoRA adaptation strategy targeted both language and audio processing components of the Gemma-3n architecture. Language model components received LoRA adaptations in self-attention layers including the query projection (q_proj), key projection (k_proj), value projection (v_proj), and output projection (o_proj) matrices. Additionally, feed-forward network components including the gate projection (gate_proj), up projection (up_proj), and down projection (down_proj) layers incorporated LoRA adaptations to enable comprehensive language model fine-tuning.

Audio processing components received specialized LoRA adaptations targeting critical speech recognition pathways. The embedding projection layer, which transforms audio encoder outputs into the language model’s representation space, received LoRA adaptation with rank 8 to enable effective cross-modal alignment. Post-processing layers (post) and linear transformation components (linear_start, linear_end) also incorporated LoRA adaptations to facilitate optimal audio-to-text generation performance.

The LoRA configuration employed rank 8 decompositions with an alpha scaling factor of 16, providing a balance between adaptation capacity and parameter efficiency. This configuration results in approximately 0.1% additional trainable parameters compared to the base model while enabling comprehensive adaptation across both modalities. The scaling factor alpha controls the magnitude of LoRA updates, with the 2:1 ratio of alpha to rank providing stable training dynamics and effective adaptation strength.

Gradient checkpointing techniques further enhanced memory efficiency by recomputing intermediate activations during backpropagation rather than storing them in memory. This approach trades computational efficiency for memory savings, enabling training of larger models within memory constraints. The implementation utilized PyTorch’s gradient checkpointing functionality with reentrant checkpointing disabled to avoid potential memory fragmentation issues.
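
A hedged sketch of the corresponding adapter configuration, assuming the peft library; the audio-pathway module names (embedding projection, post, linear_start, linear_end) are transcribed from the description above and would need to be matched against the actual Gemma-3n module paths:

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,            # rank of the low-rank decomposition
    lora_alpha=16,  # 2:1 alpha-to-rank scaling
    lora_dropout=0.0,
    bias="none",
    target_modules=[
        # language model attention and feed-forward projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        # audio pathway projections (names as described in the text)
        "embedding_projection", "post", "linear_start", "linear_end",
    ],
)

# `base_model` is the 4-bit quantized Gemma-3n loaded with the config sketched earlier.
model = get_peft_model(base_model, lora_config)
# Non-reentrant gradient checkpointing, as noted above.
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})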


Training Configuration and Optimization

The supervised fine-tuning process employed advanced optimization techniques designed to maximize training efficiency while maintaining model stability. The training configuration utilized an 8-bit AdamW optimizer, reducing optimizer state memory requirements by approximately 50% compared to standard 32-bit optimization while preserving convergence characteristics through careful gradient scaling and numerical precision management.

Data formatting followed a conversational structure with system, user, and assistant roles to align with the model’s instruction-following training paradigm. Each training example consisted of a system message establishing the transcription task context, a user message containing the audio input with a transcription request, and an assistant message providing the ground-truth transcription. This format enables the model to understand speech recognition as a conversational task while maintaining compatibility with its pre-trained instruction-following capabilities.
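
An illustrative training example in this conversational format; the field names, prompt wording, file path, and transcript are placeholders introduced here for illustration, since the paper does not list the exact schema, which also depends on the processor version used:

# One SFT example: system / user (audio + request) / assistant (reference transcript).
example = {
    "messages": [
        {"role": "system",
         "content": [{"type": "text", "text": "You are a Kazakh speech transcription assistant."}]},
        {"role": "user",
         "content": [{"type": "audio", "audio": "ksc2/utt_000123.wav"},
                     {"type": "text", "text": "Transcribe this audio into Kazakh."}]},
        {"role": "assistant",
         "content": [{"type": "text", "text": "Қазақстан - кең байтақ ел."}]},
    ]
}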

Batch processing utilized micro-batching with gradient accumulation to maximize training efficiency within memory constraints. The effective batch size of 2 (batch size 1 with 2 gradient accumulation steps) provided stable gradient estimates while remaining within VRAM limitations. Dynamic padding within batches optimized memory utilization by minimizing unnecessary computation on padding tokens while maintaining consistent sequence lengths for efficient matrix operations.

Learning rate scheduling employed a cosine annealing schedule with warm-up to ensure stable convergence throughout the training process. The maximum learning rate of 5e-5 was selected through preliminary experiments as providing optimal convergence speed without inducing training instability. The warm-up period comprising 10% of total training steps enabled smooth optimization initialization while the cosine decay schedule promoted convergence to high-quality local minima.

Gradient clipping with maximum norm 0.3 provided additional training stability by preventing gradient explosion events that could destabilize the quantized training process. Weight decay regularization with coefficient 0.01 encouraged parameter sparsity and improved model generalization performance. These regularization techniques proved particularly important when training quantized models, which can exhibit different optimization dynamics compared to full-precision training.

The training process required approximately 69 hours on a single NVIDIA RTX 3090 GPU with 24GB of VRAM, demonstrating the computational feasibility of the approach for research environments with limited resources. Model checkpointing every 100 training steps enabled recovery from potential hardware failures, while monitoring of training metrics facilitated early detection of potential optimization issues.
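
A sketch of the corresponding trainer configuration, assuming TRL’s SFTConfig on top of transformers; hyperparameter names follow the standard TrainingArguments API, and values not stated in the paper (precision flag, logging cadence, output directory) are assumptions:

from trl import SFTConfig

training_args = SFTConfig(
    output_dir="estimin3n-sft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=2,  # effective batch size of 2
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,               # 10% of steps used for warm-up
    max_grad_norm=0.3,              # gradient clipping
    weight_decay=0.01,
    optim="adamw_8bit",             # 8-bit AdamW optimizer states
    bf16=True,                      # assumed mixed-precision setting
    gradient_checkpointing=True,
    save_steps=100,                 # checkpoint every 100 steps
    logging_steps=10,               # assumed logging cadence
)

The resulting config would then be passed to an SFTTrainer together with the adapted model and the conversational dataset sketched earlier.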


Experimental Setup and Results


Training Infrastructure and Implementation

The experimental implementation of Estimin3n leveraged memory- and compute-efficient training techniques to maximize utilization of the available hardware. The training infrastructure consisted of a single NVIDIA RTX 3090 GPU with 24GB of VRAM. This configuration provided sufficient computational capacity for multimodal model training while representing hardware accessible to many research institutions and independent researchers.

The software stack integrated several cutting-edge frameworks optimized for efficient large language model training. The Unsloth library provided accelerated implementations of attention mechanisms and gradient computation routines, achieving substantial speed improvements compared to standard PyTorch implementations. Integration with the Transformers library ensured compatibility with the broader ecosystem of pretrained models and evaluation tools while maintaining access to the latest architectural innovations.

Memory management strategies proved crucial for enabling training within hardware constraints. The implementation utilized mixed-precision training with automatic loss scaling to maintain numerical stability while reducing memory footprint. Dynamic loss scaling techniques automatically adjusted scaling factors based on gradient magnitude statistics, preventing both underflow and overflow conditions that could destabilize quantized training processes.

Data loading and preprocessing optimizations minimized computational overhead during training. Audio data underwent real-time preprocessing including resampling to 16kHz, amplitude normalization, and dynamic range compression to ensure consistent input characteristics. Parallel data loading with multiple worker processes prevented I/O bottlenecks from impacting training throughput, while careful memory management avoided excessive RAM utilization during data preprocessing.
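
A minimal sketch of this style of preprocessing, assuming torchaudio; the normalization step is illustrative, since the paper does not specify the exact transforms or compression parameters:

import torch
import torchaudio
import torchaudio.functional as F

def preprocess(path: str, target_sr: int = 16_000) -> torch.Tensor:
    # Load, downmix to mono, and resample to 16 kHz.
    waveform, sr = torchaudio.load(path)
    waveform = waveform.mean(dim=0, keepdim=True)
    if sr != target_sr:
        waveform = F.resample(waveform, orig_freq=sr, new_freq=target_sr)
    # Peak-normalize amplitude (illustrative stand-in for the corpus normalization).
    waveform = waveform / waveform.abs().max().clamp(min=1e-8)
    return waveform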


Evaluation Metrics and Benchmarks

The evaluation of Estimin3n employed comprehensive metrics addressing both speech recognition performance and general language understanding capabilities. For speech recognition assessment, Word Error Rate (WER) and Character Error Rate (CER) provided standard measures of transcription accuracy. These metrics capture different aspects of model performance, with WER focusing on lexical accuracy and CER emphasizing phonetic and orthographic precision.

WER calculation followed standard speech recognition evaluation protocols, utilizing dynamic programming alignment between generated transcriptions and ground-truth references. The metric accounts for substitutions, insertions, and deletions at the word level, providing insights into the model’s ability to recognize complete lexical units. Normalization procedures handled punctuation, capitalization, and number formatting consistently across all evaluations.

CER evaluation complemented WER analysis by examining transcription accuracy at the character level, providing finer-grained insights into model performance. This metric proves particularly valuable for morphologically complex languages such as Kazakh, where word-level errors may result from morphological rather than phonetic recognition failures. Character-level analysis enables identification of specific phonetic patterns or orthographic conventions that present challenges for the model.
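
Both metrics can be computed with a standard scoring library such as jiwer; a small, hedged example with placeholder Kazakh sentences (the text normalization applied before scoring in our evaluation is not reproduced here):

import jiwer

# WER = (substitutions + deletions + insertions) / number of reference words;
# CER is the same edit-distance ratio computed over characters.
reference  = ["бүгін ауа райы жақсы"]
hypothesis = ["бүгін ауа райы жақсы болады"]

print("WER:", jiwer.wer(reference, hypothesis))
print("CER:", jiwer.cer(reference, hypothesis))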

The KazMMLU benchmark evaluation assessed the model’s general language understanding capabilities across diverse domains including science, technology, engineering, mathematics, humanities, and social sciences. This evaluation provides crucial context for understanding how the model’s speech recognition specialization affects its broader language processing abilities. The benchmark comprises over 10,000 questions in Kazakh covering various difficulty levels and subject areas.

Additional qualitative evaluations examined the model’s performance on challenging scenarios including code-switching between Kazakh and Russian, recognition of technical terminology, and handling of varied acoustic conditions. These evaluations provide insights into real-world deployment scenarios and identify areas requiring future research and development efforts.


Speech Recognition Performance

The speech recognition evaluation of Estimin3n demonstrates competitive performance on the KSC2 test set, achieving a Word Error Rate of 12.2% and a Character Error Rate of 4.3%. These results establish strong baseline performance for multimodal Kazakh speech recognition while highlighting the effectiveness of the parameter-efficient fine-tuning approach employed in model development.

The WER performance of 12.2% represents a significant achievement for a multimodal model trained with parameter-efficient techniques on limited hardware resources. Comparative analysis with existing Kazakh ASR systems reveals that while specialized speech-only models such as fine-tuned Wav2Vec2-XLSR achieve lower error rates, Estimin3n’s multimodal capabilities and efficient architecture provide unique advantages for integrated applications requiring both speech and text processing capabilities.

Character Error Rate performance of 4.3% indicates strong phonetic and orthographic recognition capabilities, particularly important for Kazakh language processing given its complex morphological structure and distinctive phonological features. The relatively low CER compared to WER suggests that the model effectively captures phonetic information while occasionally struggling with word boundary detection and morphological analysis.

Error analysis reveals several patterns in model performance across different linguistic and acoustic conditions. The model demonstrates strong performance on read speech and formal register utterances, consistent with the substantial representation of these conditions in the training data. Performance on spontaneous speech and informal register shows greater variability, indicating opportunities for future training data expansion and model refinement.

Code-switching performance presents both challenges and opportunities for Estimin3n. The model demonstrates reasonable recognition of Kazakh-Russian code-switching patterns common in the training data, though performance degrades for complex multilingual utterances with rapid language switching. This behavior reflects the natural distribution of code-switching patterns in the training corpus while indicating potential areas for targeted data augmentation and model improvement.

Acoustic robustness evaluation across different recording conditions reveals generally stable performance, though the model shows sensitivity to significant domain mismatches such as noisy environments or substantially different microphone characteristics. These findings align with expectations given the controlled recording conditions prevalent in the KSC2 training data and suggest directions for future data collection and domain adaptation efforts.


Language Understanding Evaluation

The evaluation of Estimin3n on the KazMMLU benchmark yielded a score of 47.4%, providing important insights into the model’s general language processing capabilities following speech recognition specialization. This performance demonstrates that the parameter-efficient fine-tuning approach successfully adapted the model for speech recognition while preserving substantial general language understanding abilities.

Comparative analysis with other models evaluated on KazMMLU reveals Estimin3n’s competitive positioning within the landscape of Kazakh language models. The model’s performance exceeds several larger general-purpose models that lack language-specific adaptation, while remaining below specialized text-only models such as ISSAI KazLLM variants that focus exclusively on language understanding tasks without multimodal capabilities.

Subject-area analysis within KazMMLU reveals varying performance across different domains, reflecting both the model’s training characteristics and the inherent challenges of different question types. Performance on scientific and technical subjects shows relative strength, potentially benefiting from the model’s exposure to formal language during speech recognition training. Humanities and social science performance indicates areas where additional cultural and contextual knowledge could enhance model capabilities.

The evaluation demonstrates that multimodal training and speech recognition specialization do not fundamentally compromise the model’s language understanding abilities, supporting the viability of unified architectures for diverse language processing tasks. However, the results also indicate that achieving state-of-the-art performance across all modalities simultaneously remains challenging, suggesting potential benefits from task-specific optimization strategies.


Computational Efficiency Analysis

The computational efficiency characteristics of Estimin3n demonstrate the effectiveness of the parameter-efficient training approach employed in model development. Training completion within 69 hours on an RTX 3090 GPU represents a substantial achievement for multimodal model development, enabling research and development within resource constraints typical of academic institutions and independent researchers.

Memory utilization analysis reveals that the combination of 4-bit quantization, LoRA adaptation, and gradient checkpointing successfully enabled training within the 24GB VRAM constraint of the GPU. Peak memory utilization remained below 22GB throughout training, providing sufficient margin for stable operation while maximizing resource utilization. This efficiency enables scaling to larger models or longer training periods within similar hardware constraints.

Inference efficiency characteristics support the model’s viability for real-world deployment scenarios. The quantized model requires approximately 3GB of VRAM for inference, enabling deployment on consumer-grade hardware and mobile devices. Processing speed averages approximately 50 tokens per second on RTX 3090 hardware, supporting real-time speech recognition applications with appropriate input buffering strategies.

The model’s deployment flexibility extends to multiple hardware configurations through support for various quantization levels and deployment formats. The 8-bit GGUF format enables CPU-only inference on systems without dedicated accelerators, while merged 16-bit formats provide optimal performance on GPU-equipped systems. This flexibility supports diverse deployment scenarios from mobile applications to cloud-based services.
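
A hedged inference sketch using the transformers API; the checkpoint name under the author’s Hugging Face account, the Gemma-3n model class, and the audio chat-template schema are assumptions that depend on the library version and released artifacts:

import torch
from transformers import AutoProcessor, Gemma3nForConditionalGeneration

model_id = "govnejri/estimin3n"  # assumed checkpoint name; see the author's HF profile

processor = AutoProcessor.from_pretrained(model_id)
model = Gemma3nForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user",
     "content": [{"type": "audio", "audio": "sample_kazakh.wav"},
                 {"type": "text", "text": "Transcribe this audio into Kazakh."}]},
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens (the transcription).
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))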


Discussion


Model Performance Analysis

The experimental results of Estimin3n reveal several important insights regarding multimodal model development for low-resource languages. The achieved WER of 12.2% and CER of 4.3% demonstrate that parameter-efficient fine-tuning approaches can successfully adapt large multimodal architectures for specialized speech recognition tasks while maintaining computational feasibility within research constraints.

The performance gap between Estimin3n and specialized ASR models such as fine-tuned Wav2Vec2-XLSR reflects the inherent trade-offs associated with multimodal architectures. While dedicated speech recognition models can achieve lower error rates through architectural specialization and focused optimization, multimodal models provide integrated capabilities that enable more sophisticated applications requiring joint reasoning over multiple input modalities.

The KazMMLU benchmark score of 47.4% provides crucial context for understanding the model’s broader language processing capabilities. This performance indicates that speech recognition specialization does not fundamentally compromise general language understanding, supporting the viability of unified architectures for comprehensive language processing applications. However, the results also highlight the challenges associated with achieving optimal performance across all modalities simultaneously.

Error pattern analysis reveals specific areas where Estimin3n exhibits both strengths and limitations. The model demonstrates strong performance on formal register speech and standard vocabulary, reflecting the characteristics of the KSC2 training data. Performance variability on spontaneous speech and technical terminology indicates opportunities for targeted data augmentation and specialized fine-tuning approaches.

The model’s handling of code-switching between Kazakh and Russian presents both promising results and areas for improvement. While the model successfully processes many bilingual utterances, complex code-switching patterns occasionally result in recognition failures. This behavior reflects both the complexity of multilingual processing and the distribution of code-switching patterns in the training data.

Technical Innovation Assessment

The development of Estimin3n demonstrates several technical innovations in parameter-efficient multimodal model training. The successful combination of 4-bit quantization with LoRA adaptation for multimodal architectures represents a significant advancement in efficient AI model development. The approach enables training of sophisticated models on consumer-grade hardware while maintaining competitive performance characteristics.

The targeted LoRA adaptation strategy applied to both language and audio processing components proves effective for comprehensive multimodal fine-tuning. The selective adaptation of attention mechanisms, feed-forward networks, and cross-modal projection layers enables substantial model customization with minimal parameter overhead. This approach provides a template for similar adaptations of other multimodal architectures to specialized domains and languages.

The integration of advanced memory management techniques including gradient checkpointing, mixed-precision training, and dynamic batching enables training within hardware constraints that would otherwise prohibit multimodal model development. These techniques collectively enable researchers with limited resources to pursue sophisticated AI research previously accessible only to well-funded institutions.

The model’s deployment flexibility through multiple quantization and format options demonstrates practical considerations essential for real-world applications. The availability of CPU-compatible formats ensures accessibility across diverse hardware configurations while GPU-optimized variants provide performance advantages for resource-rich deployment scenarios.


Implications for Low-Resource Language Technology

The successful development of Estimin3n has broader implications for language technology development in underrepresented linguistic communities. The demonstration that competitive multimodal models can be developed for low-resource languages using parameter-efficient techniques provides a roadmap for similar efforts across other linguistic communities facing technological underrepresentation.

The open-source nature of both the model and training methodology democratizes access to advanced AI capabilities for Kazakh language applications. Researchers, developers, and organizations can build upon this foundation to create specialized applications addressing specific needs within Kazakhstan’s multilingual environment. This accessibility supports the development of local AI ecosystems and technological capabilities.

The model’s multilingual capabilities, demonstrated through performance on both Kazakh speech recognition and Russian language understanding, address practical requirements for AI systems deployed in multilingual societies. This characteristic proves particularly valuable for Kazakhstan, where bilingual competence represents a practical necessity for comprehensive AI assistance and interaction systems.

The computational efficiency achievements of Estimin3n indicate that advanced AI capabilities need not require substantial infrastructure investments, reducing barriers to adoption for organizations and individuals with limited computational resources. This accessibility supports more equitable distribution of AI benefits across different economic and institutional contexts.


Limitations and Future Research Directions

Several limitations of the current Estimin3n implementation point to important directions for future research and development efforts. The model’s performance on spontaneous speech and informal register indicates opportunities for training data diversification and domain adaptation techniques. Expanding the training corpus to include more varied speaking styles and acoustic conditions could enhance model robustness across deployment scenarios.

The handling of complex code-switching patterns presents both technical and linguistic challenges requiring additional research attention. Future work could explore specialized training techniques for multilingual processing, potentially incorporating dedicated code-switching datasets and targeted loss functions that encourage appropriate language switching behavior.

The model’s current limitation to 30-second audio inputs constrains applicability to longer-form speech recognition tasks such as lecture transcription or extended conversation processing. Future research could investigate techniques for extending context windows or implementing sliding window approaches that enable processing of longer audio sequences while maintaining recognition accuracy.

Integration with larger language models and more recent architectural innovations presents opportunities for performance improvements. As foundation models continue to evolve, adaptation techniques developed for Estimin3n could be applied to more capable base architectures, potentially achieving superior performance across all evaluation metrics.

The development of evaluation benchmarks specifically designed for multilingual speech recognition in code-switching scenarios would support more comprehensive model assessment and comparison. Such benchmarks would facilitate research progress while providing standardized evaluation protocols for future model development efforts.


Conclusion

This paper presents Estimin3n, a novel multimodal Audio/Text-to-Text model specifically developed for Kazakh speech recognition through parameter-efficient adaptation of Google’s Gemma-3n architecture. The research demonstrates that advanced multimodal AI capabilities can be successfully developed for low-resource languages using sophisticated parameter-efficient fine-tuning techniques while maintaining computational feasibility within typical research constraints.

The achieved performance metrics, including 12.2% WER and 4.3% CER on KSC2 test data along with 47.4% on the KazMMLU benchmark, establish competitive baseline performance for multimodal Kazakh language processing. These results demonstrate that the parameter-efficient approach successfully adapts the foundation model for specialized speech recognition tasks while preserving general language understanding capabilities essential for comprehensive AI applications.

The technical contributions of this work extend beyond the specific model implementation to encompass methodological innovations in efficient multimodal AI development. The successful combination of 4-bit quantization with targeted LoRA adaptation provides a template for similar adaptations across other languages and domains. The computational efficiency achievements enable researchers with limited resources to pursue sophisticated multimodal AI research previously accessible only to well-funded institutions.

The open-source release of Estimin3n addresses a critical gap in language technology for Central Asian languages while supporting the development of local AI capabilities in Kazakhstan. The model’s multilingual competence and deployment flexibility enable diverse applications ranging from voice assistants to transcription services, supporting technological advancement across Kazakhstan’s multilingual society.

Future research directions encompass both technical improvements and expanded applications. Enhancements to spontaneous speech recognition, code-switching handling, and extended context processing could substantially improve model utility for real-world deployment scenarios. Integration with evolving foundation model architectures presents opportunities for achieving state-of-the-art performance across all modalities while maintaining the computational efficiency advantages demonstrated in this work.

The broader implications of this research support the viability of developing competitive AI technologies for underrepresented languages using efficient adaptation techniques. This approach provides a sustainable path for reducing technological inequities across linguistic communities while fostering local AI development capabilities and technological sovereignty. The success of Estimin3n demonstrates that advanced AI capabilities can be democratized through careful application of parameter-efficient techniques and open-source development practices.

The development of Estimin3n represents a significant step toward comprehensive multimodal AI support for Turkic languages and provides a foundation for future research in multilingual speech recognition and understanding. The model’s competitive performance, computational efficiency, and open accessibility establish it as a valuable resource for the Kazakh language technology community and a demonstration of the potential for parameter-efficient multimodal AI development in resource-constrained environments.


References

[1] Khassanov, Y., Mussakhojayeva, S., Mirzakhmetov, A., Adiyev, A., Nurpeiissov, M., and Varol, H. A. (2021). A crowdsourced open-source Kazakh speech corpus and initial speech recognition baseline. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 697–706.

[2] Mussakhojayeva, S., Khassanov, Y., and Varol, H. A. (2022). KSC2: An industrial-scale open-source Kazakh speech corpus. In Proceedings of the 23rd INTERSPEECH Conference, pages 1367–1371.

[3] Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2021). LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.

[4] Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. (2023). QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314.

[5] Togmanov, M., Mukhituly, N., Turmakhan, D., Mansurov, J., Goloburda, M., Sakip, A., Xie, Z., Wang, Y., Syzdykov, B., Laiyk, N., Aji, A. F., Kochmar, E., Nakov, P., and Koto, F. (2025). KazMMLU: Evaluating language models on Kazakh, Russian, and regional knowledge of Kazakhstan. arXiv preprint arXiv:2502.12829.

[6] Reid, M., Savinov, N., Teplyashin, D., Lepikhin, D., Lillicrap, T., Alayrac, J. B., Soricut, R., Lazaridou, A., Firat, O., Schrittwieser, J., et al. (2024). Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295.

[7] Gemma Team (2025). Gemma 3n: Multimodal models for on-device applications. Technical report, Google DeepMind.

[8] Babu, A., Wang, C., Tjandra, A., Lakhotia, K., Xu, Q., Goyal, N., Singh, K., von Platen, P., Saraf, Y., Pino, J., et al. (2021). Wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 34:12449–12460.

[9] Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. (2023). Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pages 28492–28518.

[10] Zhang, Y., Park, D. S., Han, W., Qin, J., Gulati, A., Shor, J., Jansen, A., Xu, Y., Huang, Y., Wang, S., et al. (2023). Universal Speech Model (USM): Scaling up speech technology for 100+ languages. arXiv preprint arXiv:2303.01037.

[11] OpenAI (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.

[12] Gemini Team (2023). Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

[13] Liu, H., Li, C., Wu, Q., and Lee, Y. J. (2023). Visual instruction tuning. arXiv preprint arXiv:2304.08485.

[14] Alayrac, J. B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. (2022). Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736.

[15] Li, J., Li, D., Xiong, C., and Hoi, S. (2023). BLIP-2: Bootstrapping vision-language pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.

[16] Manakul, P., Sun, G., Sirichotedumrong, W., Tharnpipitchai, K., and Pipatanakul, K. (2024). Enhancing low-resource language and instruction following capabilities of audio language models. arXiv preprint arXiv:2409.10999.

[17] Singh, S., Hou, F., and Wang, R. (2023). A novel self-training approach for low-resource speech recognition. In Interspeech 2023, pages 1234–1238.

[18] Bartelds, M., San, N., McDonnell, B., Jurafsky, D., and Wieling, M. (2023). Making more of little data: Improving low-resource automatic speech recognition using data augmentation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, pages 1–15.

[19] Çoban, E. B., Mandel, M. I., and Devaney, J. (2024). What do MLLMs hear? Examining reasoning with text and sound components in multimodal large language models. arXiv preprint arXiv:2406.04615.

[20] Belikova, J. and Kosenko, D. (2024). DeepPavlov at SemEval-2024 task 3: Multimodal large language models in emotion reasoning. In Proceedings of the 18th International Workshop on Semantic Evaluation, pages 1747–1757.

[21] Ma, R., Chen, T., Audhkhasi, K., and Ramabhadran, B. (2025). LegoSLM: Connecting LLM with speech encoder using CTC posteriors. arXiv preprint arXiv:2505.11352.

[22] Loshchilov, I. and Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

[23] Chen, T., Xu, B., Zhang, C., and Guestrin, C. (2016). Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174.

[24] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al. (2020). Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.

[25] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32.

[26] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.

[27] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. (2023). LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

[28] Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. (2023). Mistral 7B. arXiv preprint arXiv:2310.06825.

[29] Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., Zhou, C., Li, C., Li, C., Liu, D., Huang, F., et al. (2024). Qwen2 technical report. arXiv preprint arXiv:2407.10671.

[30] DeepSeek-AI (2024). DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437.

[31] Mussakhojayeva, S., Khassanov, Y., and Varol, H. A. (2022). KSC2: An industrial-scale open-source Kazakh speech corpus. In Interspeech 2022, pages 1367–1371. doi:10.21437/Interspeech.2022-421. ISSN 2958-1796.

[32] Gemma Team (2025). Gemma 3n. Google DeepMind. https://ai.google.dev/gemma/docs/gemma-3n