Job Openings
Research Internship: Multi-Axis Preference Optimization for Controllable & Expressive Text to Speech
Full-time | Voice & Conversational AI | Global Enterprise AI Platform
Duration: 4-8 Months
Location: Switzerland (Europe), on-site at AGIGO’s Zurich Office
About AGIGO
AGIGO™ is the first enterprise-grade conversational AI platform that empowers enterprises to transform customer engagement and business performance with high-agency AI agents - agents that match well-trained human customer agents in naturalness, responsiveness, and autonomous task resolution. Built for on-premises or hybrid deployment, with no reliance on third-party services, our proprietary platform gives enterprises full control, observability, and data sovereignty. Its unified core, tunable base models, and end-to-end design toolchain deliver context-aware, adaptable agents that engage directly with customers in real time. Founded in February 2025 in Switzerland by a team of 18 experienced AI pioneers, AGIGO is driven by a bold vision to lead the next major wave in AI by transforming how businesses interact with their customers.
Your Research Mission
In this internship, your mission is to build, on top of proprietary AGIGO technology, a system that gains fine-grained control over specific axes of speech - such as prosodic naturalness, emotional appropriateness, and speaker similarity - to create truly expressive and controllable Voice-AI capabilities. Such control has been shown to enhance perceived voice quality dramatically. While traditional text-to-speech (TTS) models may sound flawless at first listen, they tend to lack the subtle nuances that make human speech so expressive. With your project, we will move beyond simple quality metrics and teach our LLM-based speech-synthesis models to understand and replicate human preference at a deeper, more refined level - one that captures and leverages the fine-grained nuances of human expression. At the forefront of Voice-AI innovation, your project will further strengthen AGIGO’s leadership in voice synthesis and human-like Voice-AI agents.
Phase 1: Preference Data Engine
The foundation of this project is a scalable pipeline for capturing nuanced human preferences, moving far beyond a simple "A is better than B". You will design a data-gathering strategy to collect preference labels across several key axes:
Prosodic Naturalness: Which sample has more natural rhythm, pitch, and intonation for a given sentence?
Emotional Congruence: Given, for example, the text "I can't believe we won!", which audio sample sounds more genuinely excited?
Robustness & Artifacting: Which sample better handles complex text (acronyms, dates, foreign phrases) with fewer audio glitches - for instance, during streaming voice synthesis at inference time?
Speaker Similarity: In a voice cloning context, or when fine-tuning an autoregressive TTS with target speaker data, which sample is a more convincing match for the target speaker's timbre and cadence?
Bootstrapping with AI Labels: To scale data collection, you will investigate using AGIGO's best models to generate candidate audio and pre-filter comparisons. You will also explore training an objective reward model to act as an AI labeler, enabling the bootstrapping of a much larger preference dataset suitable for Direct Preference Optimization (DPO) or other Reinforcement Learning from Human Feedback (RLHF) approaches.
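To make the AI-labeling idea concrete, here is a minimal sketch (function and parameter names are illustrative assumptions, not part of AGIGO's codebase) of how a learned reward model could pre-filter confident preference pairs from candidate audio samples:

```python
from itertools import combinations

def ai_label_pairs(candidates, reward_fn, margin=0.5):
    """Pre-filter confident (chosen, rejected) preference pairs.

    candidates: audio samples synthesized for the same input text.
    reward_fn:  a learned scalar reward model (an assumption here);
                higher scores mean "preferred" on a given axis.
    margin:     minimum score gap for a pair to count as a confident
                AI label suitable for DPO/RLHF-style training.
    """
    scored = [(c, reward_fn(c)) for c in candidates]
    pairs = []
    for (a, score_a), (b, score_b) in combinations(scored, 2):
        if score_a - score_b >= margin:
            pairs.append((a, b))   # a preferred over b
        elif score_b - score_a >= margin:
            pairs.append((b, a))   # b preferred over a
    return pairs
```

Pairs whose scores fall within the margin would be routed to human raters instead, keeping human effort focused on the genuinely ambiguous comparisons.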
Phase 2: Advanced Preference Modeling
In this core research phase you will implement and extend state-of-the-art preference alignment algorithms for the unique domain of speech synthesis.
DPO for Speech: You will adapt the Direct Preference Optimization (DPO) framework for our autoregressive speech synthesis architecture. A key technical challenge is to apply an algorithm designed for discrete text tokens to continuous audio outputs.
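For orientation only, the vanilla DPO objective over summed token log-probabilities of a preferred and a dispreferred audio-token sequence can be sketched as follows (a generic textbook formulation, not AGIGO's implementation; tensor names and the beta value are assumptions):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Vanilla DPO loss.

    Each argument is a tensor of summed log-probabilities (one value per
    preference pair) assigned by the trainable policy or by the frozen
    reference model to the preferred ("chosen") and dispreferred
    ("rejected") token sequences. beta controls the strength of the
    implicit KL-style regularization toward the reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

When policy and reference agree (zero margin), the loss equals log 2; as the policy learns to favor the chosen sequence, the loss decreases. Adapting this to speech means deciding what the "token sequence" is - for example, discrete codec tokens versus continuous acoustic features.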
Constitutional AI for Speech: We can define a "constitution" for our TTS voice (e.g., "The voice should always sound clear and helpful," "It should never sound aggressive unless explicitly prompted"). If time permits, you may explore methods to enforce these rules during preference learning, potentially enhancing the quality, safety, and reliability of our models.
Key Research Challenges
Latent Preference Discovery: Can we use unsupervised methods on large volumes of speech to automatically discover the latent features that define human preference? This is a high-risk, high-reward project to learn what "good" sounds like without explicit labels.
Compositional Control: If you successfully train on disentangled axes, can you then combine them at inference time? For example, generating speech that is "70% happy, 30% newscaster" to create novel, highly controllable voices - or, for instance, using multiple prompt voices for guided speech synthesis.
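As a toy illustration of what compositional control could mean at the embedding level (a hypothetical sketch; the actual style representations in AGIGO's models may differ entirely), disentangled axes could be blended as a convex combination of style vectors:

```python
def mix_styles(style_vectors, weights):
    """Blend disentangled style embeddings (e.g. 'happy', 'newscaster')
    into a single conditioning vector via a convex combination.

    style_vectors: list of equal-length style embedding vectors.
    weights:       relative weights, e.g. [0.7, 0.3]; normalized to sum to 1.
    """
    total = float(sum(weights))
    norm = [w / total for w in weights]  # normalize to a convex combination
    dim = len(style_vectors[0])
    return [sum(w * vec[i] for w, vec in zip(norm, style_vectors))
            for i in range(dim)]
```

Whether such linear mixing actually yields perceptually "70% happy, 30% newscaster" speech is precisely the open research question the axis disentanglement work would need to answer.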
Your Impact
Your final trained model and system will be integrated directly into AGIGO’s production voice-synthesis service, creating immediate, tangible impact by enhancing the expressiveness and realism of the human-like AI-Agents deployed through the AGIGO™ platform.
We value original thinking and encourage you to help shape and redefine the project’s direction as your research uncovers new insights. AGIGO fosters an open, collaborative environment where ideas can evolve freely - exceptional innovation often emerges where disciplines and perspectives intersect, and we actively support creative exploration that pushes the boundaries of what Voice-AI can achieve.
What You Bring
Required
- PhD student (preferred) or Master's student in Computer Science, Machine Learning, or a related field
- Strong Python programming skills and Git
- Solid understanding of ML fundamentals and MLOps
- Hands-on experience with PyTorch
- Fluent in English, highly motivated, and eager to learn
Plus Points
- Experience with Hugging Face models (for LLMs, ASR, or "speech-LLMs")
- Familiarity with audio benchmarks
- Hands-on experience with RLHF in the audio (ideally) and/or text domain
- Knowledge of speech, ASR, and/or TTS concepts
- Hands-on experience with large-scale data processing pipelines
- Hands-on experience with audio AI (ASR/TTS) model training and development
What You Will Gain
- Direct product impact: your research and code used in AGIGO’s production platform
- Mentorship: work closely with our expert team of researchers and engineers
- Top-tier AI infrastructure: access to GPU clusters with NVIDIA Hopper (H200) and Blackwell RTX GPUs
- Research visibility: we will actively support you in publishing your work at a top-tier conference or in a journal
- Disciplined and inspiring research environment: a team of sharp minds grounded in expertise, autonomy, and a shared pursuit of impactful breakthroughs
- Paid internship: market-level salary, flexible hours, and free coffee, drinks, fruits and snacks
- Career path: this internship may lead to a full-time permanent role in AGIGO's world-class AI R&D team
How to Apply
To apply, please send your resume and a brief introduction to internships@agigo.ai with the subject line:
Research Internship – Controllable & Expressive Text to Speech – [Your Full Name].
By submitting your application, you agree to allow AGIGO to store and process your data for recruitment purposes. Unless otherwise requested, we may retain your data for up to one year to consider you for this or other future opportunities.
AGIGO™ is a registered trademark of AGIGO AG, Switzerland.