Transcribes audio files into text using Cartesia’s Speech-to-Text API.
Upload an audio file and receive a complete transcription response. Arbitrarily long audio files are supported; longer files are automatically split into chunks for processing.
Supported audio formats: flac, m4a, mp3, mp4, mpeg, mpga, oga, ogg, wav, webm
Response format: Returns JSON with transcribed text, duration, and language. Include timestamp_granularities: ["word"] to get word-level timestamps.
Pricing: Batch transcription is priced at 1 credit per 2 seconds of audio processed.
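At 1 credit per 2 seconds, the cost of a clip is easy to estimate. A minimal sketch of that arithmetic; note that rounding partial 2-second blocks up is an assumption, not documented behavior:

```python
import math

CREDITS_PER_BLOCK = 1     # 1 credit ...
SECONDS_PER_BLOCK = 2.0   # ... per 2 seconds of audio processed

def estimated_credits(duration_seconds: float) -> int:
    """Estimate batch transcription cost in credits.

    Rounding partial blocks up is an assumption; actual billing may differ.
    """
    return math.ceil(duration_seconds / SECONDS_PER_BLOCK) * CREDITS_PER_BLOCK

# e.g. a 90-second file -> 45 credits; a 3-second file -> 2 credits (assuming round-up)
```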
For migrating from the OpenAI SDK, see our OpenAI Whisper to Cartesia Ink Migration Guide.
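The request described above can be sketched as follows. The endpoint URL, header names (X-API-Key, Cartesia-Version), and form-field names in this example are assumptions inferred from the parameters documented here; verify them against the live API reference before use.

```python
# Hedged sketch of a batch transcription request; endpoint path, headers,
# and field names are assumptions, not confirmed values.

def build_transcription_request(api_key: str,
                                model: str = "ink-whisper",
                                language: str = "en",
                                want_word_timestamps: bool = True):
    """Assemble the headers and form fields for one upload."""
    headers = {
        "X-API-Key": api_key,              # assumed auth header
        "Cartesia-Version": "2026-03-01",  # API version header
    }
    data = {"model": model, "language": language}
    if want_word_timestamps:
        data["timestamp_granularities[]"] = "word"  # request word-level timestamps
    return headers, data

def transcribe(api_key: str, audio_path: str) -> dict:
    """Upload the file and return the parsed JSON response.

    Requires the third-party requests package; imported lazily so the
    request builder above stays dependency-free.
    """
    import requests
    headers, data = build_transcription_request(api_key)
    with open(audio_path, "rb") as f:
        resp = requests.post("https://api.cartesia.ai/stt",  # assumed endpoint
                             headers=headers, data=data,
                             files={"file": f})
    resp.raise_for_status()
    return resp.json()
```

The builder is separated from the network call so the field layout can be inspected or tested without sending a request.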
A short-lived access token to make API requests from a client.
API version header.
Allowed values: 2024-06-10, 2024-11-13, 2025-04-16, 2026-03-01. Defaults to 2026-03-01.
The encoding format to process the audio as. Required when uploading raw PCM data without a container header. If not specified, the audio file will be decoded automatically from its container (e.g. WAV, MP3, FLAC). For guidance on choosing an encoding, see Audio encodings.
The encoding format for audio data sent to the STT API. Must match the actual encoding of your audio. pcm_s16le is recommended for best performance. For detailed guidance on each format, see Audio encodings.
Allowed values: pcm_s16le, pcm_s32le, pcm_f16le, pcm_f32le, pcm_mulaw, pcm_alaw

The sample rate of the audio in Hz.
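When uploading raw PCM with no container header, the encoding and sample rate must be supplied explicitly. A small sketch of those extra form fields; the field names (encoding, sample_rate) are assumptions based on the parameter names above:

```python
# Hypothetical helper: builds the extra fields needed for headerless PCM.
# Field names are assumed from the documented parameters.

ALLOWED_ENCODINGS = {
    "pcm_s16le", "pcm_s32le", "pcm_f16le",
    "pcm_f32le", "pcm_mulaw", "pcm_alaw",
}

def raw_pcm_fields(encoding: str = "pcm_s16le", sample_rate: int = 16000) -> dict:
    """Form fields for raw PCM uploads; only needed when there is no container."""
    if encoding not in ALLOWED_ENCODINGS:
        raise ValueError(f"unsupported encoding: {encoding}")
    return {"encoding": encoding, "sample_rate": str(sample_rate)}
```

These fields would be merged into the request body alongside model and language; containerized formats (WAV, MP3, FLAC, etc.) are decoded automatically and do not need them.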
ID of the model to use for transcription. Use ink-whisper for the latest Cartesia Whisper model.
The language of the input audio in ISO-639-1 format. Defaults to en.
Allowed values: en, zh, de, es, ru, ko, fr, ja, pt, tr, pl, ca, nl, ar, sv, it, id, hi, fi, vi, he, uk, el, ms, cs, ro, da, hu, ta, no, th, ur, hr, bg, lt, la, mi, ml, cy, sk, te, fa, lv, bn, sr, az, sl, kn, et, mk, br, eu, is, hy, ne, mn, bs, kk, sq, sw, gl, mr, pa, si, km, sn, yo, so, af, oc, ka, be, tg, sd, gu, am, yi, lo, uz, fo, ht, ps, tk, nn, mt, sa, lb, my, bo, tl, mg, as, tt, haw, ln, ha, ba, jw, su, yue

The timestamp granularities to populate for this transcription. Currently only word-level timestamps are supported.
The granularity of timestamps to include in the response.
Currently only word-level timestamps are supported, providing start and end times for each word.
Allowed values: word

The transcribed text.
The specified language of the input audio.
The duration of the input audio in seconds.
Word-level timestamps showing the start and end time of each word. Only included when "word" is passed in timestamp_granularities.
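Given a response containing word-level timestamps, the words array can be iterated directly. The exact key names in this sketch (words, word, start, end) are assumptions inferred from the response fields described above:

```python
# Hedged sketch: renders word-level timestamps as simple "start-end: word"
# lines. The response key names are assumed, not confirmed.

def words_to_lines(response: dict) -> list:
    """Format each timestamped word as 'start-end: word' with 2-decimal times."""
    return [
        f"{w['start']:.2f}-{w['end']:.2f}: {w['word']}"
        for w in response.get("words", [])
    ]

# A hypothetical response shaped like the fields documented above.
sample_response = {
    "text": "hello world",
    "language": "en",
    "duration": 1.2,
    "words": [
        {"word": "hello", "start": 0.0, "end": 0.5},
        {"word": "world", "start": 0.6, "end": 1.2},
    ],
}
```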