POST /stt
Speech-to-Text (Batch)
curl --request POST \
  --url https://api.cartesia.ai/stt \
  --header 'Cartesia-Version: <cartesia-version>' \
  --header 'Content-Type: multipart/form-data' \
  --header 'X-API-Key: <api-key>' \
  --form file='@example-file' \
  --form 'model=<string>' \
  --form language=en \
  --form 'timestamp_granularities[]=word'
{
  "text": "<string>",
  "language": "<string>",
  "duration": 123,
  "words": [
    {
      "word": "<string>",
      "start": 123,
      "end": 123
    }
  ]
}
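
The same request from Python, as a minimal sketch using the requests library. The audio.wav path, the CARTESIA_API_KEY environment variable, and the pinned version header are illustrative assumptions, not part of the API:

# Minimal batch transcription sketch (assumptions noted above).
import os
import requests

response = requests.post(
    "https://api.cartesia.ai/stt",
    headers={
        "X-API-Key": os.environ["CARTESIA_API_KEY"],  # assumed env var
        "Cartesia-Version": "2024-11-13",
    },
    files={"file": open("audio.wav", "rb")},  # assumed local file
    data={"model": "ink-whisper", "language": "en"},
)
response.raise_for_status()
print(response.json()["text"])

requests builds the multipart/form-data body and boundary itself, so no Content-Type header is set explicitly.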

Authorizations

X-API-Key
string
header
required

Headers

Cartesia-Version
enum<string>
required

API version header. Must be set to the API version, e.g. '2024-06-10'.

Available options:
2024-06-10,
2024-11-13,
2025-04-16
Example:

"2024-11-13"

Query Parameters

encoding
enum<string> | null

The encoding format to process the audio as. If not specified, the audio file will be decoded automatically.

Supported formats:

  • pcm_s16le - 16-bit signed integer PCM, little-endian (recommended for best performance)
  • pcm_s32le - 32-bit signed integer PCM, little-endian
  • pcm_f16le - 16-bit floating point PCM, little-endian
  • pcm_f32le - 32-bit floating point PCM, little-endian
  • pcm_mulaw - 8-bit μ-law encoded PCM
  • pcm_alaw - 8-bit A-law encoded PCM
Available options:
pcm_s16le,
pcm_s32le,
pcm_f16le,
pcm_f32le,
pcm_mulaw,
pcm_alaw
sample_rate
integer | null

The sample rate of the audio in Hz.
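
For raw PCM uploads, both query parameters can be combined with the multipart body. A sketch, assuming 16 kHz pcm_s16le samples in a local audio.raw file (illustrative values, not requirements):

# Sketch: encoding and sample_rate passed as query parameters for raw PCM audio.
import os
import requests

response = requests.post(
    "https://api.cartesia.ai/stt",
    params={"encoding": "pcm_s16le", "sample_rate": 16000},  # assumed sample rate
    headers={
        "X-API-Key": os.environ["CARTESIA_API_KEY"],
        "Cartesia-Version": "2024-11-13",
    },
    files={"file": open("audio.raw", "rb")},  # assumed raw PCM file
    data={"model": "ink-whisper"},
)
print(response.json()["text"])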

Body

multipart/form-data
file
file
model
string

ID of the model to use for transcription. Use ink-whisper for the latest Cartesia Whisper model.

language
enum<string> | null

The language of the input audio in ISO-639-1 format. Defaults to en.

Available options:
en,
zh,
de,
es,
ru,
ko,
fr,
ja,
pt,
tr,
pl,
ca,
nl,
ar,
sv,
it,
id,
hi,
fi,
vi,
he,
uk,
el,
ms,
cs,
ro,
da,
hu,
ta,
no,
th,
ur,
hr,
bg,
lt,
la,
mi,
ml,
cy,
sk,
te,
fa,
lv,
bn,
sr,
az,
sl,
kn,
et,
mk,
br,
eu,
is,
hy,
ne,
mn,
bs,
kk,
sq,
sw,
gl,
mr,
pa,
si,
km,
sn,
yo,
so,
af,
oc,
ka,
be,
tg,
sd,
gu,
am,
yi,
lo,
uz,
fo,
ht,
ps,
tk,
nn,
mt,
sa,
lb,
my,
bo,
tl,
mg,
as,
tt,
haw,
ln,
ha,
ba,
jw,
su,
yue
timestamp_granularities[]
enum<string>[] | null

The timestamp granularities to populate for this transcription. Currently only word-level timestamps are supported, providing start and end times for each word. (See the sketch at the end of the Response section below.)

Available options:
word

Response

200 - application/json
text
string
required

The transcribed text.

language
string | null

The specified language of the input audio.

duration
number<double> | null

The duration of the input audio in seconds.

words
TranscriptionWord · object[] | null

Word-level timestamps showing the start and end time of each word. Only included when word is passed in timestamp_granularities[].
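
A sketch that requests word-level timestamps and walks the words array; the file name, environment variable, and version header are illustrative assumptions:

# Sketch: request word timestamps, then print each word with its start/end times.
import os
import requests

response = requests.post(
    "https://api.cartesia.ai/stt",
    headers={
        "X-API-Key": os.environ["CARTESIA_API_KEY"],
        "Cartesia-Version": "2024-11-13",
    },
    files={"file": open("audio.wav", "rb")},
    data={
        "model": "ink-whisper",
        "timestamp_granularities[]": "word",  # same form field as the curl example
    },
)
response.raise_for_status()
body = response.json()
print(body["text"])
for w in body.get("words") or []:
    print(f"{w['start']:.2f}-{w['end']:.2f}  {w['word']}")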