Clip Timestamps
All clip APIs return a timestamps object that maps graphemes and phonemes to start/end offsets in seconds. Use it to highlight words in a transcript or drive lip-sync animations.
Example
For example, a timestamps object might report that the character H starts at 0.0374 s and ends at 0.1247 s, while the phoneme ˈe spans 0.0873 s to 0.1746 s. Grapheme and phoneme spans can overlap, as these two do.
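A minimal sketch of consuming such an object for transcript highlighting follows. The layout used here is an assumption, not a documented schema: parallel arrays of units and [start, end] pairs in seconds, under the hypothetical field names graph_chars, graph_times, phon_chars, and phon_times. The only sample values are the two spans quoted above, and to_spans and active_at are illustrative helpers.

```python
from typing import List, Sequence, Tuple

# (unit, start_seconds, end_seconds)
Span = Tuple[str, float, float]

# ASSUMPTION: hypothetical field names and layout; check the real response.
# The values are the two spans quoted in the text above.
sample_timestamps = {
    "graph_chars": ["H"],
    "graph_times": [[0.0374, 0.1247]],
    "phon_chars": ["ˈe"],
    "phon_times": [[0.0873, 0.1746]],
}


def to_spans(chars: Sequence[str], times: Sequence[Sequence[float]]) -> List[Span]:
    """Pair each grapheme or phoneme with its start/end offsets."""
    if len(chars) != len(times):
        raise ValueError("chars and times must have the same length")
    return [(c, float(start), float(end)) for c, (start, end) in zip(chars, times)]


def active_at(spans: Sequence[Span], t: float) -> List[str]:
    """Return the units whose span contains playback time t (start inclusive)."""
    return [c for c, start, end in spans if start <= t < end]


graphemes = to_spans(sample_timestamps["graph_chars"], sample_timestamps["graph_times"])
phonemes = to_spans(sample_timestamps["phon_chars"], sample_timestamps["phon_times"])
```

Calling active_at(graphemes, t) on every animation frame, with t taken from the audio player's current time, gives the characters to highlight. The same call on phonemes can drive a lip-sync viseme lookup.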
Streaming
Streaming synthesis encodes timestamps in the WAV headers. See Streaming over HTTP for the chunk layout.
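As a starting point for inspecting a streamed file, the sketch below walks the generic RIFF/WAVE chunk structure and yields each chunk's ID and payload. It does not assume which chunk carries the timestamps or how they are encoded, since that layout is described in Streaming over HTTP. The function name iter_riff_chunks is illustrative.

```python
import struct
from typing import BinaryIO, Iterator, Tuple


def iter_riff_chunks(stream: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (chunk_id, payload) for each top-level chunk in a RIFF/WAVE stream."""
    header = stream.read(12)
    if len(header) < 12 or header[:4] != b"RIFF" or header[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE stream")
    while True:
        chunk_header = stream.read(8)
        if len(chunk_header) < 8:
            return  # end of stream
        # Each chunk starts with a 4-byte ID and a little-endian uint32 size.
        chunk_id, size = struct.unpack("<4sI", chunk_header)
        # NOTE: streamed WAVs may carry placeholder sizes; read() then simply
        # returns whatever bytes remain.
        payload = stream.read(size)
        if size % 2:
            stream.read(1)  # chunks are padded to an even byte boundary
        yield chunk_id.decode("ascii", errors="replace"), payload
```

Listing the chunk IDs of a saved response is a quick way to confirm which chunks are present before writing a parser for the documented layout.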
