Clip Timestamps

All clip APIs return a timestamps object that maps graphemes and phonemes to start/end offsets in seconds. Use it to highlight words in a transcript or drive lip-sync animations.

Field        Description
graph_chars  Characters of the synthesized text (SSML stripped, entities unescaped).
graph_times  Array of [start, end] pairs aligned to graph_chars.
phon_chars   IPA phonemes (UTF-8) with diacritics/stress markers.
phon_times   [start, end] pairs aligned to phon_chars.
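
Because each times array is described as aligned to its chars array, a client may want to sanity-check a response before indexing into it. The following is a minimal sketch, not part of the API: the helper name check_alignment is hypothetical, and the rules it enforces (equal lengths, two-element pairs, start not after end) are assumptions drawn from the field descriptions above.

```python
def check_alignment(timestamps):
    """Return a list of problems; an empty list means the arrays line up.

    Assumes (not confirmed by the API docs) that each *_times array has one
    [start, end] pair per entry in the matching *_chars array, with start <= end.
    """
    problems = []
    for prefix in ("graph", "phon"):
        chars = timestamps.get(f"{prefix}_chars", [])
        times = timestamps.get(f"{prefix}_times", [])
        if len(chars) != len(times):
            problems.append(
                f"{prefix}: {len(chars)} chars but {len(times)} time pairs"
            )
        for i, pair in enumerate(times):
            # Each entry should be a two-element [start, end] pair.
            if len(pair) != 2 or pair[0] > pair[1]:
                problems.append(f"{prefix}_times[{i}] is not a valid pair: {pair}")
    return problems
```

Calling check_alignment(response["timestamps"]) and logging any returned messages is one way to catch a mismatch before it turns into an index error in the rendering code.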

Example

"timestamps": {
  "graph_chars": ["H", "e", "y", " ", "t", "h", "e", "r", "e", "."],
  "graph_times": [[0.0374, 0.1247], [0.0873, 0.1746], [0.1372, 0.2245], [0.1746, 0.3118], [0.2744, 0.3866], [0.2744, 0.3866], [0.3617, 0.4864], [0.4615, 0.5862], [0.4615, 0.5862], [0.5488, 0.6984]],
  "phon_chars": ["h", "ˈe", "ɪ", " ", "ð", "ˈɛ", "ɹ", "."],
  "phon_times": [[0.0374, 0.1247], [0.0873, 0.1746], [0.1372, 0.2245], [0.1746, 0.3118], [0.2744, 0.3866], [0.3617, 0.4864], [0.4615, 0.5862], [0.5488, 0.6984]]
}

In this example, the character H starts at 0.0374s and ends at 0.1247s, while the phoneme ˈe spans 0.0873s to 0.1746s.
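
For transcript highlighting, the per-character timings usually need to be rolled up into per-word spans. The sketch below shows one way to do that using the grapheme arrays from the example. The function name word_spans is hypothetical, and two choices are assumptions rather than documented behavior: words are split on whitespace characters, and punctuation stays attached to the preceding word.

```python
def word_spans(graph_chars, graph_times):
    """Group per-character timings into (word, start, end) tuples.

    A word starts at its first character's start time and ends at its
    last character's end time. Whitespace separates words (an assumption).
    """
    if len(graph_chars) != len(graph_times):
        raise ValueError("graph_chars and graph_times must have the same length")
    spans = []
    word, start, end = "", None, None
    for ch, (t0, t1) in zip(graph_chars, graph_times):
        if ch.isspace():
            # A whitespace character closes the current word, if any.
            if word:
                spans.append((word, start, end))
            word, start, end = "", None, None
            continue
        if not word:
            start = t0
        word += ch
        end = t1
    if word:
        spans.append((word, start, end))
    return spans


# Grapheme data copied from the example response above.
timestamps = {
    "graph_chars": ["H", "e", "y", " ", "t", "h", "e", "r", "e", "."],
    "graph_times": [
        [0.0374, 0.1247], [0.0873, 0.1746], [0.1372, 0.2245], [0.1746, 0.3118],
        [0.2744, 0.3866], [0.2744, 0.3866], [0.3617, 0.4864], [0.4615, 0.5862],
        [0.4615, 0.5862], [0.5488, 0.6984],
    ],
}

spans = word_spans(timestamps["graph_chars"], timestamps["graph_times"])
for word, start, end in spans:
    print(f"{word!r}: {start:.4f}s to {end:.4f}s")
# 'Hey': 0.0374s to 0.2245s
# 'there.': 0.2744s to 0.6984s
```

A player can then highlight a word whenever the playback position falls between its start and end. Adjacent character intervals overlap in the example data, so a consumer should not assume the spans are strictly disjoint.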

Streaming

Streaming synthesis encodes timestamps in the WAV headers. See Streaming over HTTP for the chunk layout.