All clip APIs return a timestamps object that maps graphemes and phonemes to start/end offsets in seconds. Use it to highlight words in a transcript or drive lip-sync animations.
In this example, the character H starts at 0.0374s and ends at 0.1247s, while the phoneme ˈe spans 0.0873s to 0.1746s.
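To show how these offsets might drive transcript highlighting, here is a minimal sketch that looks up which grapheme and phoneme are active at a given playhead position. The `Span` type, its field names, and the `active_span` helper are hypothetical and only illustrate the idea; the actual layout of the timestamps object may differ, so adapt the parsing to the response you receive. Only the four offsets are taken from the example described above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Span:
    """One timed unit (grapheme or phoneme). Field names are hypothetical."""
    symbol: str
    start: float  # seconds from the start of the clip
    end: float    # seconds from the start of the clip


def active_span(spans: Sequence[Span], t: float) -> Optional[Span]:
    """Return the first span whose [start, end) interval contains t, else None."""
    for span in spans:
        if span.start <= t < span.end:
            return span
    return None


# Offsets taken from the example described above; the container layout is assumed.
graphemes = [Span("H", 0.0374, 0.1247)]
phonemes = [Span("ˈe", 0.0873, 0.1746)]

playhead = 0.1  # seconds into playback
g = active_span(graphemes, playhead)
p = active_span(phonemes, playhead)
print(g.symbol if g else None, p.symbol if p else None)  # prints: H ˈe
```

A linear scan is enough for short clips. For long transcripts polled on every animation frame, a binary search over the sorted start offsets would likely be a better fit.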
Streaming synthesis encodes timestamps in the WAV headers. See Streaming over HTTP for the chunk layout.