Speech-to-Text (Batch)

curl --request POST \
  --url https://api.cartesia.ai/stt \
  --header 'Cartesia-Version: <cartesia-version>' \
  --header 'Content-Type: multipart/form-data' \
  --header 'X-API-Key: <api-key>' \
  --form 'model=<string>' \
  --form language=en \
  --form 'timestamp_granularities[]=word' \
  --form file=@example-file

{
  "text": "<string>",
  "language": "<string>",
  "duration": 123,
  "words": [
    {
      "word": "<string>",
      "start": 123,
      "end": 123
    }
  ]
}

STT

Speech-to-Text (Batch)

Transcribes audio files into text using Cartesia’s Speech-to-Text API.

Upload an audio file and receive a complete transcription response. Supports arbitrarily long audio files with automatic intelligent chunking for longer audio.

Supported audio formats: flac, m4a, mp3, mp4, mpeg, mpga, oga, ogg, wav, webm

Response format: Returns JSON with transcribed text, duration, and language. Include timestamp_granularities: ["word"] to get word-level timestamps.

Pricing: Batch transcription is priced at 1 credit per 2 seconds of audio processed.

For migrating from the OpenAI SDK, see our OpenAI Whisper to Cartesia Ink Migration Guide.

POST

stt

Speech-to-Text (Batch)

curl --request POST \
  --url https://api.cartesia.ai/stt \
  --header 'Cartesia-Version: <cartesia-version>' \
  --header 'Content-Type: multipart/form-data' \
  --header 'X-API-Key: <api-key>' \
  --form 'model=<string>' \
  --form language=en \
  --form 'timestamp_granularities[]=word' \
  --form file=@example-file

{
  "text": "<string>",
  "language": "<string>",
  "duration": 123,
  "words": [
    {
      "word": "<string>",
      "start": 123,
      "end": 123
    }
  ]
}

Authorizations

X-API-Key

string

header

required

Headers

Cartesia-Version

enum<string>

required

API version header. Must be set to the API version, e.g. '2024-06-10'.

Available options:

2024-06-10,

2024-11-13,

2025-04-16

Example:

"2024-06-10"

Query Parameters

encoding

enum<string>

The encoding format to process the audio as. If not specified, the audio file will be decoded automatically.

Supported formats:

pcm_s16le - 16-bit signed integer PCM, little-endian (recommended for best performance)
pcm_s32le - 32-bit signed integer PCM, little-endian
pcm_f16le - 16-bit floating point PCM, little-endian
pcm_f32le - 32-bit floating point PCM, little-endian
pcm_mulaw - 8-bit μ-law encoded PCM
pcm_alaw - 8-bit A-law encoded PCM

Available options:

pcm_s16le,

pcm_s32le,

pcm_f16le,

pcm_f32le,

pcm_mulaw,

pcm_alaw

sample_rate

integer | null

The sample rate of the audio in Hz.

Body

multipart/form-data

file

model

string

ID of the model to use for transcription. Use ink-whisper for the latest Cartesia Whisper model.

language

enum<string> | null

The language of the input audio in ISO-639-1 format. Defaults to en.

Available options:

en,

zh,

de,

es,

ru,

ko,

fr,

ja,

pt,

tr,

pl,

ca,

nl,

ar,

sv,

it,

id,

hi,

fi,

vi,

he,

uk,

el,

ms,

cs,

ro,

da,

hu,

ta,

no,

th,

ur,

hr,

bg,

lt,

la,

mi,

ml,

cy,

sk,

te,

fa,

lv,

bn,

sr,

az,

sl,

kn,

et,

mk,

br,

eu,

is,

hy,

ne,

mn,

bs,

kk,

sq,

sw,

gl,

mr,

pa,

si,

km,

sn,

yo,

so,

af,

oc,

ka,

be,

tg,

sd,

gu,

am,

yi,

lo,

uz,

fo,

ht,

ps,

tk,

nn,

mt,

sa,

lb,

my,

bo,

tl,

mg,

as,

tt,

ha,

ln,

ha,

ba,

jw,

su,

yu

timestamp_granularities[]

enum<string>[] | null

The timestamp granularities to populate for this transcription. Currently only word level timestamps are supported.

Show child attributes

Response

200 - application/json

text

string

required

The transcribed text.

language

string | null

The specified language of the input audio.

duration

number | null

The duration of the input audio in seconds.

words

TranscriptionWord · object[] | null

Word-level timestamps showing the start and end time of each word. Only included when [word] is passed into timestamp_granularities[].

Show child attributes

Context Flushing and Flush IDs

Speech-to-Text (Streaming)

⌘I

Use the API

API Status

TTS

STT

Voices

Voice Changer

Auth

Infill

Speech-to-Text (Batch)

Authorizations

Headers

Query Parameters

Body

Response