Job Openings
Research Internship: Self-Improving Data Processing Engine for State-of-the-Art Generative AI Speech Models
Full-time | Voice & Conversational AI | Global Enterprise AI Platform
Duration: 4-8 Months
Location: Switzerland (Europe), on-site at AGIGO’s Zurich Office
About AGIGO
AGIGO™ is the first enterprise-grade conversational AI platform that empowers enterprises to transform customer engagement and business performance with high-agency AI agents: agents that match well-trained human customer agents in naturalness, responsiveness, and autonomous task resolution. Built for on-premises or hybrid deployment, with no reliance on third-party services, our proprietary platform gives enterprises full control, observability, and data sovereignty. Its unified core, tunable base models, and end-to-end design toolchain deliver context-aware, adaptable agents that engage directly with customers in real time. Founded in February 2025 in Switzerland by a team of 18 experienced AI pioneers, AGIGO is driven by a bold vision to lead the next major wave in AI by transforming how businesses interact with their customers.
Your Research Mission
The single biggest bottleneck in building next-generation AI models is not the models themselves; it is the data. In this research internship, you will design and build a self-improving data processing engine for large-scale ASR and TTS model training. You will not just clean data: you will develop a next-generation data engine that directly powers AGIGO production models. Your primary focus will be on the modern challenge of detecting and purging low-quality and synthetic data from massive, untrusted web-scale corpora (for example, from archive.org).
Phase 1: Establish State-of-the-Art Baseline
You will build a highly scalable, parallelized data processing pipeline inspired by industry best practices (such as NVIDIA's NeMo Granary). This will serve as the robust foundation for more advanced research. This phase involves implementing and optimizing a multi-stage workflow:
Audio Canonicalization: Ingesting raw audio in any format and standardizing it (e.g., resampling to a target rate, downmixing to mono, and normalizing codecs).
Initial Transcription & Alignment: Employing a multi-pass approach using models like FasterWhisper for initial transcription, language identification (LID), and rough timestamp generation.
Segmentation & Grooming: Implementing robust algorithms to slice long audio segments into clean sentence-like utterances, intelligently handling speaker turns and non-speech events.
Text Restoration & Normalization: Using powerful LLMs (e.g., Llama 3) to restore punctuation and capitalization, followed by text normalization to handle numbers, acronyms, and symbols.
Heuristic Filtering: Implementing a baseline set of filters that remove data based on duration, word count, character set, words-per-second ratio, repeated n-grams, perplexity, audio-text embedding scores, and more. In total, we have identified more than 50 different steps for extracting metadata from audio files to drive these filtering stages (see the sketch after this list).
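To make the workflow concrete, here is a minimal sketch of the first and last of these stages, assuming torchaudio as the audio backend; the target sample rate and all filter thresholds are illustrative assumptions, not the pipeline's actual configuration:

```python
# Minimal sketch of two Phase 1 stages: canonicalization and heuristic
# filtering. Library choice (torchaudio) and thresholds are assumptions.
import torch
import torchaudio

TARGET_SR = 16_000  # assumed target sample rate

def canonicalize(path: str) -> torch.Tensor:
    """Load an audio file, downmix to mono, and resample to TARGET_SR."""
    waveform, sr = torchaudio.load(path)           # (channels, samples)
    waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
    return waveform

def heuristic_keep(waveform: torch.Tensor, transcript: str,
                   min_dur: float = 1.0, max_dur: float = 30.0,
                   max_wps: float = 6.0) -> bool:
    """Baseline keep/drop decision from duration and words-per-second."""
    duration = waveform.shape[1] / TARGET_SR
    words = transcript.split()
    if not words or not (min_dur <= duration <= max_dur):
        return False
    return len(words) / duration <= max_wps
```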
Phase 2: Novel Research Beyond the Baseline
This is where you will move beyond the baseline and introduce novel research to tackle the most difficult data quality challenges. The goal is to replace simple heuristics with intelligent, model-based scoring functions.
Cross-Modal Coherency Scoring: A key innovation will be to assess whether the audio and text are a good match. You will research and build a model that scores cross-modal coherency, flagging inconsistencies such as positive-sounding audio paired with text describing a negative event, a strong indicator of a mismatched or synthetic pair. This is a more advanced line of work that we will pursue as time allows.
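One possible formulation, sketched under the assumption of a joint audio-text embedding space (the encoders are hypothetical placeholders, not a chosen model):

```python
# Cosine similarity between audio and text embeddings in a shared space;
# low similarity flags a mismatched or potentially synthetic pair.
import torch
import torch.nn.functional as F

def coherency_score(audio_emb: torch.Tensor, text_emb: torch.Tensor) -> float:
    """Score in [-1, 1]; higher means audio and text agree more."""
    return F.cosine_similarity(audio_emb, text_emb, dim=-1).item()

# Hypothetical usage, with `audio_encoder` / `text_encoder` as placeholders:
#   score = coherency_score(audio_encoder(waveform), text_encoder(transcript))
#   keep = score >= 0.3  # threshold to be tuned on labeled validation data
```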
Prosody and Speaker Attribute Modeling: For TTS, flat, monotone audio can degrade both model performance and perceived synthesis quality. You will build a model that scores the prosodic richness of audio clips, allowing us to select more expressive training data or, at a minimum, to route such data to different training stages (e.g., SFT or RLHF).
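As a starting proxy, and only as an assumption rather than a settled metric, prosodic richness could be approximated by the variability of the fundamental frequency:

```python
# Standard deviation of the estimated F0 (in Hz) as a crude expressiveness
# score; voiced/unvoiced handling and normalization are omitted here.
import torch
import torchaudio

def prosody_score(waveform: torch.Tensor, sample_rate: int) -> float:
    """Higher F0 spread suggests more prosodic variation; monotone clips score low."""
    f0 = torchaudio.functional.detect_pitch_frequency(waveform, sample_rate)
    return f0.float().std().item()
```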
Key Research Challenges
Intelligent Data Selection vs. Filtering: Instead of simply filtering out "bad" data, can we frame this as a data selection problem? You will explore techniques such as coreset selection and active learning to intelligently select the most valuable subset of data for training a model, prioritizing samples that are high-quality, diverse, and informative. Such a subset can form the core of post-training strategies, such as SFT or RLHF for autoregressive TTS training.
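As one concrete entry point, a standard greedy k-center heuristic over sample embeddings might look like this sketch (the source of the embedding matrix is deliberately left open):

```python
# Greedy k-center selection: repeatedly add the sample farthest from the
# current selection so the subset covers the embedding space.
import torch

def greedy_kcenter(embeddings: torch.Tensor, k: int) -> list[int]:
    """Return indices of k diverse samples from an (n, d) embedding matrix."""
    selected = [0]  # arbitrary seed sample
    dists = torch.cdist(embeddings, embeddings[0:1]).squeeze(1)
    for _ in range(k - 1):
        idx = int(torch.argmax(dists))  # farthest from current selection
        selected.append(idx)
        new_d = torch.cdist(embeddings, embeddings[idx:idx + 1]).squeeze(1)
        dists = torch.minimum(dists, new_d)
    return selected
```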
The Self-Improving Pipeline: Your ultimate goal is to create a feedback loop. Can the models within the pipeline (e.g., the synthetic speech detector, the quality estimators) be periodically retrained on newly flagged and verified data? This would create a self-correcting system that becomes more accurate and robust over time.
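Structurally, one round of such a loop might look like the following sketch, in which every component name is a hypothetical placeholder:

```python
# One round of the intended feedback loop; `detector`, `verify_samples`,
# and `retrain` are hypothetical callables standing in for real components.
def self_improving_round(detector, corpus, verify_samples, retrain):
    flagged = [s for s in corpus if detector(s) > 0.5]  # suspected low quality
    verified = verify_samples(flagged)                  # human or model check
    return retrain(detector, verified)                  # detector improves
```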
Uncertainty-Aware Processing: The pipeline should not just make binary keep/drop decisions. You will design it to output confidence scores for each quality metric, allowing us to raise or lower acceptance thresholds automatically.
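A minimal sketch of what uncertainty-aware decisions could look like, assuming each quality model emits a score and a confidence (field names and the band width are illustrative assumptions):

```python
# Low-confidence estimates fall into a "review" band rather than a hard
# keep/drop cut; the band narrows as confidence grows.
from dataclasses import dataclass

@dataclass
class QualityEstimate:
    score: float       # predicted quality in [0, 1]
    confidence: float  # model confidence in [0, 1]

def decide(est: QualityEstimate, base_threshold: float = 0.5) -> str:
    margin = 0.2 * (1.0 - est.confidence)  # wider band when less confident
    if est.score >= base_threshold + margin:
        return "keep"
    if est.score <= base_threshold - margin:
        return "drop"
    return "review"
```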
Your Impact
Your work will directly influence the performance of AGIGO’s future ASR and TTS production models. The core engine you build will be used for pretraining and fine-tuning our production speech models and will contribute to continual learning across AGIGO’s voice technology stack. You will see your research integrated into real-world systems, with your code and models directly improving the quality of our speech recognizers and voice synthesis.
We are, furthermore, fully open to discussing enhancements, modifications, or expansions of the project scope based on your research insights, interests, and expertise.
What You Bring
Required
- Master's student (preferred) or PhD student in Computer Science, Machine Learning, or a related field
- Strong Python programming skills and proficiency with Git
- Solid understanding of ML fundamentals and MLOps
- Hands-on experience with PyTorch
- Fluent in English, highly motivated, and eager to learn
Plus Points
- Experience with Hugging Face models (for LLMs, ASR, or "speech-LLMs")
- Hands-on experience with large-scale data processing pipelines
- Hands-on experience with audio AI (ASR/TTS) model training and development
What You Will Gain
- Direct product impact: your research and code used in AGIGO’s production platform
- Mentorship: work closely with our expert team of researchers and engineers
- Top-tier AI infrastructure: access to GPU clusters with NVIDIA Hopper (H200) and Blackwell RTX GPUs
- Research visibility: we will actively support you in publishing your work at a top-tier conference or in a journal
- Disciplined and inspiring research environment: a team of sharp minds grounded in expertise, autonomy, and a shared pursuit of impactful breakthroughs
- Paid internship: market-level salary, flexible hours, and free coffee, drinks, fruit, and snacks
- Career path: this internship may lead to a full-time permanent role in AGIGO's world-class AI R&D team
How to Apply
To apply, please send your resume and a brief introduction to internships@agigo.ai with the subject line:
Research Internship – Self-Improving Data Processing Engine – [Your Full Name].
By submitting your application, you agree to allow AGIGO to store and process your data for recruitment purposes. Unless otherwise requested, we may retain your data for up to one year to consider you for this or other future opportunities.
AGIGO™ is a registered trademark of AGIGO AG, Switzerland.