Custom Pronunciations
Teach Resemble AI how to pronounce specific words by providing reference audio. Custom pronunciations are scoped per team and are automatically applied during speech synthesis when enabled.
How It Works
When apply_custom_pronunciations is set to true in a synthesis request:
- Word detection — The input text is scanned for words and phrases that match entries in your pronunciation dictionary.
- Pronunciation lookup — Any matches are retrieved from your team’s custom pronunciation library.
- Guided synthesis — The matched pronunciation references guide the model on how to say those words before generating the full utterance.
- Output — The final audio is generated with the correct pronunciations applied. Only the main synthesized speech appears in the output; the pronunciation reference audio itself is not included.
Custom pronunciations add negligible latency to synthesis requests.
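The feature is toggled per request. Below is a minimal sketch of building a synthesis request body; only apply_custom_pronunciations is documented above, so the other field names (body, voice_uuid) are illustrative assumptions rather than the confirmed API schema:

```python
def build_synthesis_payload(text, voice_uuid, apply_custom_pronunciations=True):
    """Assemble a JSON-serializable synthesis request body.

    Only the apply_custom_pronunciations flag is taken from the docs
    above; "body" and "voice_uuid" are assumed field names used here
    purely for illustration.
    """
    return {
        "voice_uuid": voice_uuid,
        "body": text,
        "apply_custom_pronunciations": apply_custom_pronunciations,
    }

payload = build_synthesis_payload("Take 5mg of Xeljanz daily.", "voice-1234")
```

When the flag is False (or omitted, per the FAQ below), pronunciation lookup is skipped entirely.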
Quick Start
- Upload a pronunciation with a reference audio clip of the word spoken correctly.
- Wait for processing — the system processes your audio and prepares it for use (typically a few seconds).
- Synthesize with apply_custom_pronunciations: true — Resemble will automatically detect matching words and apply your custom pronunciations.
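The wait-for-processing step is a simple status poll. A minimal sketch, with the status fetch injected as a callable (in practice it would wrap the GET /api/v2/pronunciations/:uuid endpoint described in the FAQ below) so the flow can be shown without a live API:

```python
import time

def wait_until_ready(fetch_status, timeout_s=30.0, interval_s=1.0):
    """Poll a pronunciation's status until it becomes "ready".

    fetch_status is any zero-argument callable returning the current
    status string ("pending" or "ready"). Returns True once ready,
    or False if the timeout elapses first.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if fetch_status() == "ready":
            return True
        time.sleep(interval_s)
    return False

# Stubbed status sequence instead of real HTTP calls:
statuses = iter(["pending", "pending", "ready"])
assert wait_until_ready(lambda: next(statuses), interval_s=0.01)
```

Since processing typically takes only a few seconds, a short interval and timeout are usually enough.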
Limitations
Language Support
Custom pronunciations are currently supported for English (en-us) only. Support for additional languages used by the multilingual model will be added in future releases.
Common Words
The feature works best with uncommon or domain-specific words (medical terms, brand names, proper nouns, technical jargon). If a custom pronunciation is created for a common word that the model already knows well, the model may favor its built-in pronunciation over the custom one. Custom pronunciations serve as guidance to the model, and common words have strong existing associations.
Speaker Similarity
The pronunciation reference audio should be recorded in a voice similar to the one used during synthesis. If the speaker in the pronunciation audio sounds very different from the synthesis voice, output quality may suffer. In particular, if the accent of the custom pronunciation differs from the target voice's accent, the model may switch accents during synthesis. For best results, record pronunciation references using the same speaker, or one with a similar vocal quality and accent.
Audio Requirements
Pronunciation audio should contain a single, clear utterance of the word. Avoid background noise, music, or multiple words in one clip.
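A quick local sanity check can catch clips that are too long to be a single word. This sketch assumes WAV input (the docs above do not state which formats are accepted) and only checks duration, not noise or word count:

```python
import wave

def clip_duration_seconds(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def looks_like_single_word_clip(path, min_s=0.3, max_s=3.0):
    """Rough heuristic: a single clearly spoken word usually fits in a
    short clip; anything much longer likely contains extra material
    such as multiple words or untrimmed silence."""
    return min_s <= clip_duration_seconds(path) <= max_s
```

The 0.3-3.0 second window here is an illustrative default matching the "1-3 seconds" guidance in Best Practices, not a documented API constraint.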
Pronunciation Limit
There is a limit of 2 pronunciations per team on the free tier. Contact sales for additional custom vocabulary capacity.
Uniqueness
Each word must be unique within a team + language + domain combination. Attempting to create a duplicate returns a 422 Unprocessable Entity error. To change a pronunciation, delete the existing one and create a new one.
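The delete-then-recreate flow for updating a pronunciation can be wrapped in a small helper. Both operations are injected as callables here so the retry logic can be shown without a live API; in practice they would wrap the create and delete endpoints:

```python
def create_or_replace(create, delete_existing):
    """Create a pronunciation; on a 422 duplicate error, delete the
    conflicting entry and retry once.

    create() returns an HTTP status code; delete_existing() removes
    the existing entry for the same team + language + domain.
    """
    status = create()
    if status == 422:
        delete_existing()
        status = create()
    return status
```

Retrying only once keeps the helper from looping if the 422 has some other cause.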
Best Practices
- Use short, clean audio clips. 1-3 seconds of a single word spoken clearly works best. Trim silence from the beginning and end.
- Match the synthesis voice. Record pronunciation references in a voice similar to your synthesis voice for the most natural results.
- Target uncommon words. Brand names, medical/scientific terms, foreign names, and technical jargon benefit the most. Common English words are already well-known to the model.
- Check status before relying on a pronunciation. Newly created pronunciations start as "pending" and transition to "ready" once processing is complete. Only "ready" pronunciations are applied during synthesis.
- Use the active flag to temporarily disable a pronunciation without deleting it. This is useful for A/B testing or troubleshooting.
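The status and active rules combine into a single filter: only entries that are both "ready" and active are applied during synthesis. A local sketch of that rule, assuming each entry is a dict with status and active fields (illustrative shape, not the documented API response):

```python
def applicable_entries(entries):
    """Return only the pronunciations that synthesis would apply:
    status must be "ready" and the active flag must be set."""
    return [
        e for e in entries
        if e.get("status") == "ready" and e.get("active")
    ]
```

Anything still "pending", or deactivated for troubleshooting, is skipped.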
FAQ
Q: What happens if apply_custom_pronunciations is omitted from the request?
It defaults to false. Pronunciation lookup is skipped entirely with zero added latency.
Q: How long does processing take after uploading a pronunciation?
Typically a few seconds. Poll the pronunciation’s status via GET /api/v2/pronunciations/:uuid to check when it transitions from "pending" to "ready".
Q: What happens if something goes wrong during pronunciation lookup? The system fails gracefully. If custom pronunciations cannot be retrieved, synthesis proceeds normally without them. Your requests will never fail due to a pronunciation lookup issue.
Q: Does this work with all Resemble voices and models? Custom pronunciations are compatible with all Chatterbox voices, including standard and cloned voices.
Q: What if my pronunciation isn’t being applied? Check the following:
- Is the pronunciation status "ready"?
- Is the pronunciation active?
- Is "apply_custom_pronunciations": true set in your synthesis request?
- Is the word an exact match? (Words are matched case-insensitively, and adjacent two-word phrases are also checked.)
- Is the word too common? The model may favor its built-in pronunciation for well-known words.
Q: Can I use multi-word phrases as pronunciations?
Yes. The system checks for both individual words and adjacent two-word phrases. For example, a pronunciation for "manmay nakhashi" will match when those two words appear next to each other in the input text.
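The matching behavior described above can be sketched locally. This is an illustrative model of the documented rules (case-insensitive word matching plus adjacent two-word phrases), not Resemble's internal implementation, and it ignores punctuation for simplicity:

```python
def find_matches(text, dictionary):
    """Return dictionary entries found in text.

    Single words are compared case-insensitively, and each pair of
    adjacent words is also checked as a two-word phrase.
    """
    words = text.lower().split()
    entries = {e.lower() for e in dictionary}
    matches = []
    for i, word in enumerate(words):
        if i + 1 < len(words):
            phrase = f"{word} {words[i + 1]}"
            if phrase in entries:
                matches.append(phrase)
        if word in entries:
            matches.append(word)
    return matches

# The two-word phrase from the example above is matched as a unit:
assert find_matches("say manmay nakhashi please", ["manmay nakhashi"]) == ["manmay nakhashi"]
```

Phrases longer than two words would not match under this scheme, consistent with the adjacent-two-word behavior described above.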
Q: What happens if I delete a pronunciation — is it removed immediately? Yes. The pronunciation is removed immediately and subsequent synthesis requests will no longer use it.
