Clip Timestamps

All clip APIs return a timestamps object that maps graphemes and phonemes to start/end offsets in seconds. Use it to highlight words in a transcript or drive lip-sync animations.

| Field | Description |
| --- | --- |
| graph_chars | Characters of the synthesized text (SSML stripped, entities unescaped). |
| graph_times | Array of [start, end] pairs aligned to graph_chars. |
| phon_chars | IPA phonemes (UTF-8) with diacritics/stress markers. |
| phon_times | Array of [start, end] pairs aligned to phon_chars. |
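
Because each `*_times` array is aligned index-by-index with its `*_chars` array, a lip-sync driver can look up the unit being spoken at a given playback time. The following is a minimal sketch, not part of the API: the helper name `unit_at` is hypothetical, the sample values are illustrative, and it assumes the start offsets are in non-decreasing order. Where neighbouring intervals overlap, it returns the unit that started most recently.

```python
from bisect import bisect_right
from typing import List, Optional, Sequence

def unit_at(chars: List[str], times: Sequence[Sequence[float]], t: float) -> Optional[str]:
    """Return the grapheme or phoneme active at time t (seconds), or None.

    Hypothetical helper. Assumes chars and times have equal length and
    that start offsets are sorted in non-decreasing order.
    """
    starts = [start for start, _ in times]
    # Index of the last unit whose start offset is <= t.
    i = bisect_right(starts, t) - 1
    if i >= 0 and t < times[i][1]:
        return chars[i]
    return None

# Illustrative values in the shape described by the table above.
phon_chars = ["h", "ˈe", "ɪ"]
phon_times = [[0.0374, 0.1247], [0.0873, 0.1746], [0.1372, 0.2245]]

active = unit_at(phon_chars, phon_times, 0.1)
```

The same helper works for `graph_chars` and `graph_times`, since both pairs of arrays share the same layout.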

Example

"timestamps": {
  "graph_chars": ["H", "e", "y", " ", "t", "h", "e", "r", "e", "."],
  "graph_times": [[0.0374, 0.1247], [0.0873, 0.1746], [0.1372, 0.2245], [0.1746, 0.3118], [0.2744, 0.3866], [0.2744, 0.3866], [0.3617, 0.4864], [0.4615, 0.5862], [0.4615, 0.5862], [0.5488, 0.6984]],
  "phon_chars": ["h", "ˈe", "ɪ", " ", "ð", "ˈɛ", "ɹ", "."],
  "phon_times": [[0.0374, 0.1247], [0.0873, 0.1746], [0.1372, 0.2245], [0.1746, 0.3118], [0.2744, 0.3866], [0.3617, 0.4864], [0.4615, 0.5862], [0.5488, 0.6984]]
}

In this example, the character H starts at 0.0374s and ends at 0.1247s, while the phoneme ˈe spans 0.0873s to 0.1746s.
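
For transcript highlighting, character-level timings usually need to be grouped into words. The sketch below shows one possible approach under stated assumptions: `word_spans` is a hypothetical helper (not an API function), words are split on whitespace characters, and each word's span runs from the earliest start to the latest end of its characters. The data is the `graph_chars` / `graph_times` pair from the example above.

```python
from typing import List, Sequence, Tuple

def word_spans(chars: List[str], times: Sequence[Sequence[float]]) -> List[Tuple[str, float, float]]:
    """Group aligned characters into (word, start, end) tuples.

    Hypothetical helper. Whitespace characters separate words and are
    not included in any span.
    """
    words: List[Tuple[str, float, float]] = []
    current = ""
    start = end = 0.0
    for ch, (s, e) in zip(chars, times):
        if ch.isspace():
            # A whitespace character closes the word in progress, if any.
            if current:
                words.append((current, start, end))
                current = ""
            continue
        if not current:
            start, end = s, e
        else:
            start, end = min(start, s), max(end, e)
        current += ch
    if current:
        words.append((current, start, end))
    return words

graph_chars = ["H", "e", "y", " ", "t", "h", "e", "r", "e", "."]
graph_times = [[0.0374, 0.1247], [0.0873, 0.1746], [0.1372, 0.2245],
               [0.1746, 0.3118], [0.2744, 0.3866], [0.2744, 0.3866],
               [0.3617, 0.4864], [0.4615, 0.5862], [0.4615, 0.5862],
               [0.5488, 0.6984]]

spans = word_spans(graph_chars, graph_times)
# spans == [("Hey", 0.0374, 0.2245), ("there.", 0.2744, 0.6984)]
```

Punctuation stays attached to the preceding word here; a UI that highlights words without punctuation would need to strip it separately.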

Streaming

Streaming synthesis encodes timestamps in the WAV headers. See Streaming over HTTP for the chunk layout.