POST /stt
Speech-to-Text (Batch)
curl --request POST \
  --url https://api.cartesia.ai/stt \
  --header 'Cartesia-Version: <cartesia-version>' \
  --header 'Content-Type: multipart/form-data' \
  --header 'X-API-Key: <api-key>' \
  --form 'model=<string>' \
  --form language=en \
  --form 'timestamp_granularities[]=word' \
  --form file=@example-file
{
  "text": "<string>",
  "language": "<string>",
  "duration": 123,
  "words": [
    {
      "word": "<string>",
      "start": 123,
      "end": 123
    }
  ]
}

Authorizations

X-API-Key
string
header
required

Headers

Cartesia-Version
enum<string>
required

API version header. Must be set to the API version, e.g. '2024-06-10'.

Available options:
2024-06-10,
2024-11-13,
2025-04-16
Example:

"2024-06-10"

Query Parameters

encoding
enum<string>

The encoding of the input audio. If omitted, the audio file is decoded automatically.

Supported formats:

  • pcm_s16le - 16-bit signed integer PCM, little-endian (recommended for best performance)
  • pcm_s32le - 32-bit signed integer PCM, little-endian
  • pcm_f16le - 16-bit floating point PCM, little-endian
  • pcm_f32le - 32-bit floating point PCM, little-endian
  • pcm_mulaw - 8-bit μ-law encoded PCM
  • pcm_alaw - 8-bit A-law encoded PCM
Available options:
pcm_s16le,
pcm_s32le,
pcm_f16le,
pcm_f32le,
pcm_mulaw,
pcm_alaw
sample_rate
integer | null

The sample rate of the audio in Hz.
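
As an illustration of how the two query parameters attach to the endpoint, here is a minimal Python sketch that builds the request URL. The helper name build_stt_url and the validation it performs are assumptions made for this example, not part of the API; the encoding names are copied from the list above. The reference does not say whether sample_rate is required alongside encoding, so the helper does not enforce either combination.

```python
from typing import Optional
from urllib.parse import urlencode

STT_URL = "https://api.cartesia.ai/stt"

# Encoding names copied from the "Available options" list above.
PCM_ENCODINGS = {
    "pcm_s16le", "pcm_s32le", "pcm_f16le",
    "pcm_f32le", "pcm_mulaw", "pcm_alaw",
}


def build_stt_url(encoding: Optional[str] = None,
                  sample_rate: Optional[int] = None) -> str:
    """Hypothetical helper: append the optional query parameters to the URL."""
    params = {}
    if encoding is not None:
        if encoding not in PCM_ENCODINGS:
            raise ValueError(f"unsupported encoding: {encoding}")
        params["encoding"] = encoding
    if sample_rate is not None:
        if sample_rate <= 0:
            raise ValueError("sample_rate must be a positive number of Hz")
        params["sample_rate"] = sample_rate
    # With no parameters the server decodes the file automatically.
    return f"{STT_URL}?{urlencode(params)}" if params else STT_URL


print(build_stt_url("pcm_s16le", 16000))
```

Calling build_stt_url() with no arguments returns the bare endpoint, which matches the automatic-decoding case described above.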

Body

multipart/form-data
file
file
model
string

ID of the model to use for transcription. Use ink-whisper for the latest Cartesia Whisper model.

language
enum<string> | null

The language of the input audio in ISO-639-1 format. Defaults to en.

Available options:
en,
zh,
de,
es,
ru,
ko,
fr,
ja,
pt,
tr,
pl,
ca,
nl,
ar,
sv,
it,
id,
hi,
fi,
vi,
he,
uk,
el,
ms,
cs,
ro,
da,
hu,
ta,
no,
th,
ur,
hr,
bg,
lt,
la,
mi,
ml,
cy,
sk,
te,
fa,
lv,
bn,
sr,
az,
sl,
kn,
et,
mk,
br,
eu,
is,
hy,
ne,
mn,
bs,
kk,
sq,
sw,
gl,
mr,
pa,
si,
km,
sn,
yo,
so,
af,
oc,
ka,
be,
tg,
sd,
gu,
am,
yi,
lo,
uz,
fo,
ht,
ps,
tk,
nn,
mt,
sa,
lb,
my,
bo,
tl,
mg,
as,
tt,
haw,
ln,
ha,
ba,
jw,
su,
yue
timestamp_granularities[]
enum<string>[] | null

The timestamp granularities to populate for this transcription. Currently only word-level timestamps are supported.
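
The body fields above can be assembled into a multipart/form-data request with the Python standard library alone. The sketch below mirrors the curl sample: the header and form-field names come from this page, while the function names encode_multipart and build_stt_request, the application/octet-stream part type, and the default version "2025-04-16" are choices made for this example. It only constructs the request; nothing is sent.

```python
import urllib.request
import uuid
from typing import List, Tuple

STT_URL = "https://api.cartesia.ai/stt"


def encode_multipart(fields: List[Tuple[str, str]], filename: str,
                     file_bytes: bytes, boundary: str) -> bytes:
    """Serialize text fields plus one file part as multipart/form-data."""
    out = bytearray()
    for name, value in fields:
        out += f"--{boundary}\r\n".encode()
        out += f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode()
        out += value.encode("utf-8") + b"\r\n"
    # The audio goes in the form field named "file", as in the curl sample.
    out += f"--{boundary}\r\n".encode()
    out += (f'Content-Disposition: form-data; name="file"; '
            f'filename="{filename}"\r\n').encode()
    out += b"Content-Type: application/octet-stream\r\n\r\n"
    out += file_bytes + b"\r\n"
    out += f"--{boundary}--\r\n".encode()
    return bytes(out)


def build_stt_request(api_key: str, filename: str, file_bytes: bytes,
                      model: str = "ink-whisper", language: str = "en",
                      word_timestamps: bool = True,
                      version: str = "2025-04-16") -> urllib.request.Request:
    """Hypothetical helper: build (but do not send) the POST /stt request."""
    fields = [("model", model), ("language", language)]
    if word_timestamps:
        fields.append(("timestamp_granularities[]", "word"))
    boundary = uuid.uuid4().hex
    body = encode_multipart(fields, filename, file_bytes, boundary)
    headers = {
        "X-API-Key": api_key,
        "Cartesia-Version": version,
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(STT_URL, data=body, headers=headers,
                                  method="POST")
```

To actually submit the audio, the returned object could be passed to urllib.request.urlopen and the response body read as JSON; that step needs a valid API key and network access, so it is left out here.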

Response

200 - application/json
text
string
required

The transcribed text.

language
string | null

The specified language of the input audio.

duration
number | null

The duration of the input audio in seconds.

words
TranscriptionWord · object[] | null

Word-level timestamps showing the start and end time of each word. Only included when [word] is passed into timestamp_granularities[].
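
For reference, one way to read the 200 response into typed Python objects is sketched below. The class names Word and Transcription and the function parse_transcription are invented for this example, and the sample payload is made-up data shaped like the schema above. The page gives duration in seconds but does not state the unit of start and end; treating them as seconds is an assumption.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Word:
    word: str
    start: float  # assumed to be seconds, like duration
    end: float


@dataclass
class Transcription:
    text: str
    language: Optional[str] = None
    duration: Optional[float] = None
    words: List[Word] = field(default_factory=list)


def parse_transcription(payload: str) -> Transcription:
    """Hypothetical helper: map the JSON response onto dataclasses."""
    data = json.loads(payload)
    # "words" may be missing or null when word timestamps were not requested.
    words = [
        Word(w["word"], float(w["start"]), float(w["end"]))
        for w in (data.get("words") or [])
    ]
    return Transcription(
        text=data["text"],  # the only field marked required
        language=data.get("language"),
        duration=data.get("duration"),
        words=words,
    )


# Made-up payload for illustration only.
sample = ('{"text": "hello world", "language": "en", "duration": 1.2, '
          '"words": [{"word": "hello", "start": 0.0, "end": 0.5}, '
          '{"word": "world", "start": 0.6, "end": 1.1}]}')
result = parse_transcription(sample)
print(result.text, len(result.words))
```

Only text is accessed with a hard key lookup because it is the only field marked required; the nullable fields fall back to None or an empty list.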
