Speech-to-Speech

Convert a donor recording into a target voice while preserving delivery and timing. Speech-to-speech uses the same synthesis endpoint as synchronous TTS, but you pass SSML that references a source recording.

Quick Example

curl --request POST "https://f.cluster.resemble.ai/synthesize" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{
    "voice_uuid": "55592656",
    "data": "<resemble:convert src=\"https://storage.googleapis.com/resemble-ai-docs-public-files/sts-donor-example.wav\"></resemble:convert>",
    "sample_rate": 48000,
    "output_format": "wav"
  }'
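
For readers calling the endpoint from code rather than a shell, the same request can be sketched in Python with only the standard library. The helper names build_request and synthesize are illustrative and not part of any Resemble SDK. json.dumps handles the quote escaping inside the SSML string that the curl example does by hand.

```python
import json
import urllib.request

SYNTHESIZE_URL = "https://f.cluster.resemble.ai/synthesize"


def build_request(api_token: str, payload: dict) -> urllib.request.Request:
    """Build the POST request; nothing is sent until synthesize() is called."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        SYNTHESIZE_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )


def synthesize(request: urllib.request.Request) -> dict:
    """Send the request and parse the JSON response body (hypothetical helper)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Same payload as the curl example above.
payload = {
    "voice_uuid": "55592656",
    "data": (
        '<resemble:convert src="https://storage.googleapis.com/'
        'resemble-ai-docs-public-files/sts-donor-example.wav">'
        "</resemble:convert>"
    ),
    "sample_rate": 48000,
    "output_format": "wav",
}
request = build_request("YOUR_API_TOKEN", payload)
# result = synthesize(request)  # uncomment to call the API with a real token
```

Building the request separately from sending it makes it possible to inspect the headers and body before any network call is made.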

Endpoint

POST https://f.cluster.resemble.ai/synthesize

Request Body

Field          Type    Description
voice_uuid     string  Target Resemble voice.
project_uuid   string  Optional project to store the output clip.
title          string  Optional clip title.
data           string  SSML containing <resemble:convert> with a WAV src URL (≤ 50 MB, ≤ 5 minutes).
precision      string  MULAW, PCM_16, PCM_24, or PCM_32 (default).
output_format  string  wav (default) or mp3.
sample_rate    number  8000, 16000, 22050, 32000, or 44100.
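
As a rough illustration of how these fields fit together, the sketch below assembles a request body and rejects precision and output_format values that are not listed in the table. The function build_payload is hypothetical. It assumes that leaving an optional field out of the JSON lets the service apply its default, and it passes sample_rate through without checking it.

```python
# Allowed values copied from the field table above.
PRECISIONS = {"MULAW", "PCM_16", "PCM_24", "PCM_32"}
OUTPUT_FORMATS = {"wav", "mp3"}


def build_payload(voice_uuid, data, *, project_uuid=None, title=None,
                  precision=None, output_format=None, sample_rate=None):
    """Return a dict ready for json.dumps; optional fields are omitted when None."""
    if precision is not None and precision not in PRECISIONS:
        raise ValueError(f"unsupported precision: {precision!r}")
    if output_format is not None and output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")

    payload = {"voice_uuid": voice_uuid, "data": data}
    optional = {
        "project_uuid": project_uuid,
        "title": title,
        "precision": precision,
        "output_format": output_format,
        "sample_rate": sample_rate,  # passed through unchanged (assumption)
    }
    # Only send the fields the caller actually set.
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload


example = build_payload(
    "55592656",
    '<resemble:convert src="https://example.com/donor.wav"></resemble:convert>',
    output_format="mp3",
)
```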

Response

Identical to synchronous TTS responses:

{
  "audio_content": "<base64>",
  "audio_timestamps": {
    "graph_chars": ["H", "i"],
    "graph_times": [[0.03, 0.14], ...],
    "phon_chars": ["HH", "AY"],
    "phon_times": [[0.03, 0.14], ...]
  },
  "duration": 4.02,
  "success": true,
  "output_format": "wav"
}
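
A minimal sketch of consuming this response in Python follows. It assumes audio_content is standard base64 and that each graph_times entry is a [start, end] pair in seconds aligned with graph_chars. The helper names and the sample values are made up for illustration.

```python
import base64


def decode_audio(response: dict) -> bytes:
    """Return the raw audio bytes, or raise if the call did not report success."""
    if not response.get("success"):
        raise RuntimeError("synthesis did not report success")
    return base64.b64decode(response["audio_content"])


def char_timings(response: dict) -> list:
    """Pair each character with its (start, end) time."""
    stamps = response["audio_timestamps"]
    return [
        (char, start, end)
        for char, (start, end) in zip(stamps["graph_chars"], stamps["graph_times"])
    ]


# Illustrative response: the audio bytes and the second timing pair are placeholders.
sample_response = {
    "audio_content": base64.b64encode(b"RIFF").decode("ascii"),
    "audio_timestamps": {
        "graph_chars": ["H", "i"],
        "graph_times": [[0.03, 0.14], [0.14, 0.25]],
        "phon_chars": ["HH", "AY"],
        "phon_times": [[0.03, 0.14], [0.14, 0.25]],
    },
    "duration": 4.02,
    "success": True,
    "output_format": "wav",
}

audio_bytes = decode_audio(sample_response)
timings = char_timings(sample_response)
```

With a real response, audio_bytes can be written to disk in binary mode, for example with open("out.wav", "wb").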

<resemble:convert> Attributes

Attribute  Description
src        HTTPS URL pointing to a WAV file with a single speaker. Files over 50 MB or 300 seconds are trimmed.
pitch      Optional float from -10.0 to 10.0 to transpose the donor.
prompt     Optional primer text to steer delivery (e.g. "Speak in a British accent."). Note: For STS, place prompt on the <resemble:convert> tag, not on <speak>.
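
The attributes above can be assembled programmatically. The sketch below uses a hypothetical helper, convert_tag, that checks the src scheme and the pitch range before building the tag, and relies on quoteattr from the standard library to escape attribute values. Treating the pitch bounds as inclusive is an assumption.

```python
from xml.sax.saxutils import quoteattr


def convert_tag(src, pitch=None, prompt=None):
    """Build a <resemble:convert> element from the documented attributes."""
    if not src.startswith("https://"):
        raise ValueError("src must be an HTTPS URL")
    attrs = [f"src={quoteattr(src)}"]
    if pitch is not None:
        # Bounds taken from the table; inclusivity is assumed.
        if not -10.0 <= pitch <= 10.0:
            raise ValueError("pitch must be between -10.0 and 10.0")
        attrs.append(f"pitch={quoteattr(str(pitch))}")
    if prompt is not None:
        # For STS the prompt goes on this tag, not on <speak>.
        attrs.append(f"prompt={quoteattr(prompt)}")
    return f"<resemble:convert {' '.join(attrs)}></resemble:convert>"


tag = convert_tag(
    "https://example.com/donor.wav",
    pitch=2.5,
    prompt="Speak in a British accent.",
)
```

The resulting string can be used as the data field directly or wrapped in a <speak> element.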

Tip: Store donor files in cloud storage with signed URLs and revoke them after synthesis completes.

Prompting with Speech-to-Speech

Use the prompt attribute to guide how the donor audio is converted to the target voice. In text-to-speech the prompt goes on the <speak> root element; for speech-to-speech you must place it directly on the <resemble:convert> tag.

Example with Prompt

curl --request POST "https://f.cluster.resemble.ai/synthesize" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{
    "voice_uuid": "55592656",
    "data": "<speak><resemble:convert src=\"https://storage.googleapis.com/resemble-ai-docs-public-files/sts-donor-example.wav\" prompt=\"Speak in a British accent.\"></resemble:convert></speak>",
    "sample_rate": 48000,
    "output_format": "wav"
  }'

The prompt attribute allows you to adjust:

  • Accent or dialect (e.g. “Speak in a British accent”)
  • Tone or emotion (e.g. “Speak with excitement”)
  • Speaking style (e.g. “Speak in a formal tone”)

The prompt steers how the donor's delivery is rendered in the target voice, while the original timing and prosody are preserved.