POST /stt
Batch Speech-to-Text
curl --request POST \
  --url https://api.cartesia.ai/stt \
  --header 'Authorization: Bearer <token>' \
  --header 'Cartesia-Version: <cartesia-version>' \
  --header 'Content-Type: multipart/form-data' \
  --form file='@example-file' \
  --form model=ink-whisper \
  --form language=en
{
  "type": "transcript",
  "text": "<string>",
  "request_id": "<string>",
  "is_final": true,
  "language": "<string>",
  "duration": 123,
  "words": [
    {
      "word": "<string>",
      "start": 123,
      "end": 123
    }
  ]
}

Authorizations

Authorization
string
header
required

A short-lived access token for making API requests from a client, sent in the form Bearer <token>.

Headers

Cartesia-Version
enum<string>
required

API version header.

Available options:
2026-03-01
Example:

"2026-03-01"

Query Parameters

encoding
enum<string> | null

Required when uploading raw PCM data without a container header. If not specified, the audio file will be decoded automatically from its container (e.g. WAV, MP3, FLAC). Must match the actual encoding of your audio. For detailed guidance on each format, see Audio encodings.

Available options:
pcm_s16le,
pcm_s32le,
pcm_f16le,
pcm_f32le,
pcm_mulaw,
pcm_alaw
sample_rate
integer | null

The sample rate of the audio in Hz.
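
As a rough illustration of how the two query parameters attach to the endpoint, the sketch below builds the request URL for a raw PCM upload. The helper name `build_stt_url` is hypothetical, and sending `sample_rate` together with `encoding` is an assumption; the reference above only says that `encoding` is required for headerless PCM.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.cartesia.ai/stt"

# Values listed under "Available options" for the `encoding` parameter.
RAW_ENCODINGS = {
    "pcm_s16le", "pcm_s32le", "pcm_f16le",
    "pcm_f32le", "pcm_mulaw", "pcm_alaw",
}


def build_stt_url(encoding=None, sample_rate=None):
    """Return the endpoint URL, adding query parameters only when they are set.

    Hypothetical helper: container formats (WAV, MP3, FLAC, ...) need no
    query parameters, so both arguments default to None.
    """
    params = {}
    if encoding is not None:
        if encoding not in RAW_ENCODINGS:
            raise ValueError(f"unsupported encoding: {encoding}")
        params["encoding"] = encoding
    if sample_rate is not None:
        # Assumption: the rate is passed as a plain integer number of Hz.
        params["sample_rate"] = int(sample_rate)
    return f"{BASE_URL}?{urlencode(params)}" if params else BASE_URL


raw_url = build_stt_url("pcm_s16le", 16000)
container_url = build_stt_url()
print(raw_url)
```

For a file with a container header, calling `build_stt_url()` with no arguments leaves the URL unchanged so the server can decode the audio automatically.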

Body

multipart/form-data
file
file
required

The audio file to transcribe. You do not need to split long recordings; the server chunks long files automatically.

Supported audio formats: flac, m4a, mp3, mp4, mpeg, mpga, oga, ogg, wav, webm
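
A client may want to reject obviously unsupported files before uploading. The following is a minimal sketch that checks only the file extension against the list above; `has_supported_extension` is a hypothetical helper, and an extension match does not guarantee that the server can decode the contents. Raw PCM uploads that rely on the `encoding` query parameter would need to bypass this check.

```python
from pathlib import Path

# Extensions taken from the "Supported audio formats" list.
SUPPORTED_EXTENSIONS = {
    "flac", "m4a", "mp3", "mp4", "mpeg",
    "mpga", "oga", "ogg", "wav", "webm",
}


def has_supported_extension(filename):
    """Return True when the file name ends in one of the listed extensions."""
    suffix = Path(filename).suffix.lower().lstrip(".")
    return suffix in SUPPORTED_EXTENSIONS


print(has_supported_extension("interview.MP3"))
print(has_supported_extension("notes.txt"))
```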

model
enum<string>
required

ID of the model to use for transcription. Must be in the ink-whisper family of models.

Available options:
ink-whisper
Example:

"ink-whisper"

language
enum<string>
default: en

The language of the input audio in ISO 639-1 format.

Available options:
en,
zh,
de,
es,
ru,
ko,
fr,
ja,
pt,
tr,
pl,
ca,
nl,
ar,
sv,
it,
id,
hi,
fi,
vi,
he,
uk,
el,
ms,
cs,
ro,
da,
hu,
ta,
no,
th,
ur,
hr,
bg,
lt,
la,
mi,
ml,
cy,
sk,
te,
fa,
lv,
bn,
sr,
az,
sl,
kn,
et,
mk,
br,
eu,
is,
hy,
ne,
mn,
bs,
kk,
sq,
sw,
gl,
mr,
pa,
si,
km,
sn,
yo,
so,
af,
oc,
ka,
be,
tg,
sd,
gu,
am,
yi,
lo,
uz,
fo,
ht,
ps,
tk,
nn,
mt,
sa,
lb,
my,
bo,
tl,
mg,
as,
tt,
haw,
ln,
ha,
ba,
jw,
su,
yue
timestamp_granularities[]
enum<string>[]

The granularity of timestamps to include in the response. Currently only word-level timestamps are supported, which provide start and end times for each word.

Available options:
word
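
To show how the body fields fit together, here is a sketch that encodes `model`, `language`, `timestamp_granularities[]`, and `file` as multipart/form-data using only the Python standard library. The `build_multipart` helper is hypothetical, the file bytes are a placeholder, and the generic `application/octet-stream` part type is an assumption rather than something this reference specifies. The request is constructed but not sent.

```python
import urllib.request
import uuid


def build_multipart(fields, filename, file_bytes):
    """Encode text fields plus one `file` part as multipart/form-data.

    `fields` is a list of (name, value) pairs so that an array-style name
    such as `timestamp_granularities[]` can be repeated if needed.
    Assumes names and the file name contain no quotes or line breaks.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields:
        parts.append(f"--{boundary}\r\n".encode())
        parts.append(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        parts.append(f"{value}\r\n".encode())
    parts.append(f"--{boundary}\r\n".encode())
    parts.append(
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode()
    )
    parts.append(b"Content-Type: application/octet-stream\r\n\r\n")
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


fields = [
    ("model", "ink-whisper"),
    ("language", "en"),
    ("timestamp_granularities[]", "word"),
]
# Placeholder bytes stand in for real audio read from disk.
body, content_type = build_multipart(fields, "example.wav", b"RIFF")

request = urllib.request.Request(
    "https://api.cartesia.ai/stt",
    data=body,
    method="POST",
    headers={
        "Authorization": "Bearer <token>",  # substitute a real access token
        "Cartesia-Version": "2026-03-01",
        "Content-Type": content_type,
    },
)
# urllib.request.urlopen(request) would send the request; it is left out here.
```

Passing the fields as a list of pairs rather than a dict keeps the door open for repeated array entries, although `word` is the only value currently listed.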

Response

200 - application/json
type
enum<string>
required

The message type. Always transcript for a batch transcription response.

Available options:
transcript
text
string
required

The transcribed text.

request_id
string

Unique identifier for this transcription request.

is_final
boolean
deprecated

Not used for batch transcription.

language
string

The specified language of the input audio.

duration
number<double>

The duration of the input audio in seconds.

words
TranscriptionWord · object[]

Word-level timestamps showing the start and end time of each word. Only included when timestamp_granularities[] contains word.
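
Because `words` is optional, client code should tolerate its absence. The sketch below reads a response and lists each word with its timing. The sample payload is invented purely for illustration, `word_spans` is a hypothetical helper, and treating `start` and `end` as seconds is an assumption based on the unit given for `duration`.

```python
import json

# Made-up payload shaped like the 200 response described above.
sample = """
{
  "type": "transcript",
  "text": "hello world",
  "duration": 1.2,
  "words": [
    {"word": "hello", "start": 0.0, "end": 0.5},
    {"word": "world", "start": 0.6, "end": 1.1}
  ]
}
"""


def word_spans(response):
    """Return (word, start, end) tuples, or an empty list when `words` is absent."""
    return [(w["word"], w["start"], w["end"]) for w in response.get("words") or []]


response = json.loads(sample)
spans = word_spans(response)
for word, start, end in spans:
    # Assumption: start and end are offsets in seconds from the start of the audio.
    print(f"{start:6.2f} -> {end:6.2f}  {word}")
```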