Transcribes audio files into text using Cartesia’s Speech-to-Text API.
Upload an audio file and receive a complete transcription response. Arbitrarily long audio files are supported; longer files are automatically split into chunks for processing.
Supported audio formats: flac, m4a, mp3, mp4, mpeg, mpga, oga, ogg, wav, webm
Response format: Returns JSON with transcribed text, duration, and language. Include timestamp_granularities: ["word"] to get word-level timestamps.
Pricing: Batch transcription is priced at 1 credit per 2 seconds of audio processed.
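At 1 credit per 2 seconds, the cost of a clip is easy to estimate. A minimal sketch of that arithmetic; note that rounding partial 2-second blocks up is an assumption, not documented behavior:

```python
import math

CREDITS_PER_BLOCK = 1     # 1 credit ...
SECONDS_PER_BLOCK = 2.0   # ... per 2 seconds of audio processed

def estimated_credits(duration_seconds: float) -> int:
    """Estimate batch transcription cost in credits.

    Rounding partial blocks up is an assumption; actual billing may differ.
    """
    return math.ceil(duration_seconds / SECONDS_PER_BLOCK) * CREDITS_PER_BLOCK

# e.g. a 90-second file -> 45 credits; a 3-second file -> 2 credits (assuming round-up)
```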
For migrating from the OpenAI SDK, see our OpenAI Whisper to Cartesia Ink Migration Guide.
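The request described above can be sketched as follows. The endpoint URL, header names (X-API-Key, Cartesia-Version), and form-field names in this example are assumptions inferred from the parameters documented here; verify them against the live API reference before use.

```python
# Hedged sketch of a batch transcription request; endpoint path, headers,
# and field names are assumptions, not confirmed values.

def build_transcription_request(api_key: str,
                                model: str = "ink-whisper",
                                language: str = "en",
                                want_word_timestamps: bool = True):
    """Assemble the headers and form fields for one upload."""
    headers = {
        "X-API-Key": api_key,              # assumed auth header
        "Cartesia-Version": "2026-03-01",  # API version header
    }
    data = {"model": model, "language": language}
    if want_word_timestamps:
        data["timestamp_granularities[]"] = "word"  # request word-level timestamps
    return headers, data

def transcribe(api_key: str, audio_path: str) -> dict:
    """Upload the file and return the parsed JSON response.

    Requires the third-party requests package; imported lazily so the
    request builder above stays dependency-free.
    """
    import requests
    headers, data = build_transcription_request(api_key)
    with open(audio_path, "rb") as f:
        resp = requests.post("https://api.cartesia.ai/stt",  # assumed endpoint
                             headers=headers, data=data,
                             files={"file": f})
    resp.raise_for_status()
    return resp.json()
```

The builder is separated from the network call so the field layout can be inspected or tested without sending a request.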
A short-lived access token to make API requests from a client.
API version header.
Allowed values: 2024-06-10, 2024-11-13, 2025-04-16, 2026-03-01. Defaults to 2026-03-01.
The encoding format to process the audio as. Required when uploading raw PCM data without a container header. If not specified, the audio file will be decoded automatically from its container (e.g. WAV, MP3, FLAC). For guidance on choosing an encoding, see Audio encodings.
The encoding format for audio data sent to the STT API. Must match the actual encoding of your audio. pcm_s16le is recommended for best performance. For detailed guidance on each format, see Audio encodings.
Allowed values: pcm_s16le, pcm_s32le, pcm_f16le, pcm_f32le, pcm_mulaw, pcm_alaw

The sample rate of the audio in Hz.
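When uploading raw PCM with no container header, the encoding and sample rate must be supplied explicitly. A small sketch of those extra form fields; the field names (encoding, sample_rate) are assumptions based on the parameter names above:

```python
# Hypothetical helper: builds the extra fields needed for headerless PCM.
# Field names are assumed from the documented parameters.

ALLOWED_ENCODINGS = {
    "pcm_s16le", "pcm_s32le", "pcm_f16le",
    "pcm_f32le", "pcm_mulaw", "pcm_alaw",
}

def raw_pcm_fields(encoding: str = "pcm_s16le", sample_rate: int = 16000) -> dict:
    """Form fields for raw PCM uploads; only needed when there is no container."""
    if encoding not in ALLOWED_ENCODINGS:
        raise ValueError(f"unsupported encoding: {encoding}")
    return {"encoding": encoding, "sample_rate": str(sample_rate)}
```

These fields would be merged into the request body alongside model and language; containerized formats (WAV, MP3, FLAC, etc.) are decoded automatically and do not need them.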
ID of the model to use for transcription. Use ink-whisper for the latest Cartesia Whisper model.
The language of the input audio in ISO-639-1 format. Defaults to en.
Allowed values: en, zh, de, es, ru, ko, fr, ja, pt, tr, pl, ca, nl, ar, sv, it, id, hi, fi, vi, he, uk, el, ms, cs, ro, da, hu, ta, no, th, ur, hr, bg, lt, la, mi, ml, cy, sk, te, fa, lv, bn, sr, az, sl, kn, et, mk, br, eu, is, hy, ne, mn, bs, kk, sq, sw, gl, mr, pa, si, km, sn, yo, so, af, oc, ka, be, tg, sd, gu, am, yi, lo, uz, fo, ht, ps, tk, nn, mt, sa, lb, my, bo, tl, mg, as, tt, haw, ln, ha, ba, jw, su, yue

The timestamp granularities to populate for this transcription. Currently only word-level timestamps are supported.
The granularity of timestamps to include in the response.
Currently only word-level timestamps are supported, providing start and end times for each word.
Allowed values: word

The transcribed text.
The specified language of the input audio.
The duration of the input audio in seconds.
Word-level timestamps showing the start and end time of each word. Only included when "word" is passed in timestamp_granularities.
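Given a response containing word-level timestamps, the words array can be iterated directly. The exact key names in this sketch (words, word, start, end) are assumptions inferred from the response fields described above:

```python
# Hedged sketch: renders word-level timestamps as simple "start-end: word"
# lines. The response key names are assumed, not confirmed.

def words_to_lines(response: dict) -> list:
    """Format each timestamped word as 'start-end: word' with 2-decimal times."""
    return [
        f"{w['start']:.2f}-{w['end']:.2f}: {w['word']}"
        for w in response.get("words", [])
    ]

# A hypothetical response shaped like the fields documented above.
sample_response = {
    "text": "hello world",
    "language": "en",
    "duration": 1.2,
    "words": [
        {"word": "hello", "start": 0.0, "end": 0.5},
        {"word": "world", "start": 0.6, "end": 1.2},
    ],
}
```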