POST /stt
Speech-to-Text (Batch)
curl --request POST \
  --url https://api.cartesia.ai/stt \
  --header 'Authorization: Bearer <token>' \
  --header 'Cartesia-Version: <cartesia-version>' \
  --header 'Content-Type: multipart/form-data' \
  --form file='@example-file' \
  --form 'model=<string>' \
  --form language=en \
  --form 'timestamp_granularities[]=word'
{
  "text": "<string>",
  "language": "<string>",
  "duration": 123,
  "words": [
    {
      "word": "<string>",
      "start": 123,
      "end": 123
    }
  ]
}

Authorizations

Authorization
string
header
required

A short-lived access token to make API requests from a client.

Headers

Cartesia-Version
enum<string>
required

API version header.

Available options:
2024-06-10,
2024-11-13,
2025-04-16,
2026-03-01
Example:

"2026-03-01"
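
As a minimal sketch of how the two required headers fit together, the Python helper below builds them for a request. The name `build_headers` and the default version are illustrative assumptions, not part of any official SDK; the version strings are copied from the list of available options.

```python
# Version strings copied from the "Available options" list for Cartesia-Version.
SUPPORTED_VERSIONS = ("2024-06-10", "2024-11-13", "2025-04-16", "2026-03-01")


def build_headers(token, version="2026-03-01"):
    """Return the Authorization and Cartesia-Version headers as a dict.

    `build_headers` is a hypothetical helper used only for illustration.
    """
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported Cartesia-Version: {version!r}")
    return {
        "Authorization": f"Bearer {token}",
        "Cartesia-Version": version,
    }
```

Rejecting an unknown version locally is a design choice of this sketch; it surfaces a typo before any request is sent.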

Query Parameters

encoding
enum<string> | null

The encoding format of the audio sent to the STT API. Required when uploading raw PCM data without a container header, in which case it must match the actual encoding of your audio. If not specified, the audio file is decoded automatically from its container (e.g. WAV, MP3, FLAC). pcm_s16le is recommended for best performance. For detailed guidance on each format, see Audio encodings.

Available options:
pcm_s16le,
pcm_s32le,
pcm_f16le,
pcm_f32le,
pcm_mulaw,
pcm_alaw
sample_rate
integer | null

The sample rate of the audio in Hz.
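
The query parameters above could be attached to the endpoint URL roughly as follows. This is a sketch only: `build_stt_url` is a hypothetical helper, and the 16000 Hz value in the usage note is an arbitrary example rather than a documented default.

```python
from urllib.parse import urlencode

STT_URL = "https://api.cartesia.ai/stt"

# Encoding names copied from the "Available options" list for `encoding`.
ENCODINGS = {
    "pcm_s16le", "pcm_s32le", "pcm_f16le",
    "pcm_f32le", "pcm_mulaw", "pcm_alaw",
}


def build_stt_url(encoding=None, sample_rate=None):
    """Return the endpoint URL, adding only the query parameters that are set.

    Hypothetical helper: omit both arguments for container formats
    (WAV, MP3, FLAC), which are decoded automatically.
    """
    params = {}
    if encoding is not None:
        if encoding not in ENCODINGS:
            raise ValueError(f"unknown encoding: {encoding!r}")
        params["encoding"] = encoding
    if sample_rate is not None:
        params["sample_rate"] = sample_rate
    return f"{STT_URL}?{urlencode(params)}" if params else STT_URL
```

For raw 16-bit PCM, `build_stt_url("pcm_s16le", 16000)` yields a URL with both parameters set; for a WAV upload, `build_stt_url()` returns the bare endpoint.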

Body

multipart/form-data
file
file
model
string

ID of the model to use for transcription. Use ink-whisper for the latest Cartesia Whisper model.

language
enum<string> | null

The language of the input audio in ISO-639-1 format. Defaults to en.

Available options:
en,
zh,
de,
es,
ru,
ko,
fr,
ja,
pt,
tr,
pl,
ca,
nl,
ar,
sv,
it,
id,
hi,
fi,
vi,
he,
uk,
el,
ms,
cs,
ro,
da,
hu,
ta,
no,
th,
ur,
hr,
bg,
lt,
la,
mi,
ml,
cy,
sk,
te,
fa,
lv,
bn,
sr,
az,
sl,
kn,
et,
mk,
br,
eu,
is,
hy,
ne,
mn,
bs,
kk,
sq,
sw,
gl,
mr,
pa,
si,
km,
sn,
yo,
so,
af,
oc,
ka,
be,
tg,
sd,
gu,
am,
yi,
lo,
uz,
fo,
ht,
ps,
tk,
nn,
mt,
sa,
lb,
my,
bo,
tl,
mg,
as,
tt,
haw,
ln,
ha,
ba,
jw,
su,
yue
timestamp_granularities[]
enum<string>[] | null

The timestamp granularities to include in the response. Currently only word-level timestamps are supported, providing start and end times for each word.

Available options:
word
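
To illustrate how the body fields map onto a multipart/form-data payload, here is a minimal standard-library sketch. The helper `encode_multipart`, the `application/octet-stream` part type, and the file name in the usage note are assumptions for illustration; in practice an HTTP client library would normally build this body for you.

```python
import uuid


def encode_multipart(fields, file_name, file_bytes, boundary=None):
    """Encode text fields plus one `file` part as multipart/form-data.

    `fields` is a list of (name, value) pairs, so a name such as
    `timestamp_granularities[]` can appear more than once.
    Returns (content_type_header_value, body_bytes).
    Hypothetical helper, not part of any official SDK.
    """
    boundary = boundary or uuid.uuid4().hex
    lines = []
    for name, value in fields:
        lines += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            str(value).encode("utf-8"),
        ]
    lines += [
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"'.encode(),
        b"Content-Type: application/octet-stream",  # assumed generic part type
        b"",
        file_bytes,
        f"--{boundary}--".encode(),  # closing boundary
        b"",
    ]
    return f"multipart/form-data; boundary={boundary}", b"\r\n".join(lines)
```

A call such as `encode_multipart([("model", "ink-whisper"), ("language", "en"), ("timestamp_granularities[]", "word")], "example.wav", audio_bytes)` returns the Content-Type value, including the boundary, and the raw body to send.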

Response

200 - application/json
text
string
required

The transcribed text.

language
string | null

The specified language of the input audio.

duration
number<double> | null

The duration of the input audio in seconds.

words
TranscriptionWord · object[] | null

Word-level timestamps showing the start and end time of each word. Only included when word is passed in timestamp_granularities[].
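
Because `words` is nullable and only present when word timestamps were requested, client code should treat it as optional. The sketch below shows one way to read the response; `format_words` is a hypothetical helper, and it assumes `start` and `end` are expressed in seconds like `duration`, which this page does not state explicitly.

```python
import json


def format_words(response_json):
    """Return one 'start-end word' string per entry in the response's `words`.

    Hypothetical helper. Returns an empty list when `words` is missing
    or null, i.e. when word timestamps were not requested.
    """
    data = json.loads(response_json)
    words = data.get("words") or []
    return [f'{w["start"]:.2f}-{w["end"]:.2f} {w["word"]}' for w in words]
```

Using `data.get("words") or []` handles both an absent key and an explicit null with the same code path.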