Batch Speech-to-Text
Transcribes an audio file of any length.
Authorizations
A short-lived access token to make API requests from a client.
Headers
API version header.
Allowed value: 2026-03-01
Query Parameters
Required when uploading raw PCM data without a container header. If not specified, the audio file is decoded automatically from its container (e.g. WAV, MP3, FLAC). Must match the actual encoding of your audio. For detailed guidance on each format, see Audio encodings.
Allowed values: pcm_s16le, pcm_s32le, pcm_f16le, pcm_f32le, pcm_mulaw, pcm_alaw
The sample rate of the audio in Hz.
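Because the encoding and sample rate of raw PCM must be stated correctly, it can help to sanity-check them against the size of the buffer before uploading. The sketch below is only an illustration: the function name and the `channels` argument are assumptions (this page does not describe channel handling), and the byte widths follow from the encoding names themselves (16-bit and 32-bit samples, 8-bit mu-law and A-law).

```python
# Bytes per sample implied by each raw PCM encoding name listed above.
BYTES_PER_SAMPLE = {
    "pcm_s16le": 2,  # signed 16-bit little-endian
    "pcm_s32le": 4,  # signed 32-bit little-endian
    "pcm_f16le": 2,  # 16-bit float little-endian
    "pcm_f32le": 4,  # 32-bit float little-endian
    "pcm_mulaw": 1,  # 8-bit mu-law
    "pcm_alaw": 1,   # 8-bit A-law
}


def raw_pcm_duration_seconds(
    num_bytes: int, encoding: str, sample_rate: int, channels: int = 1
) -> float:
    """Estimate the duration of a headerless PCM buffer.

    `channels` is an assumption of this sketch, not a documented parameter.
    """
    if encoding not in BYTES_PER_SAMPLE:
        raise ValueError(f"unknown encoding: {encoding!r}")
    if sample_rate <= 0 or channels <= 0:
        raise ValueError("sample_rate and channels must be positive")

    frame_size = BYTES_PER_SAMPLE[encoding] * channels
    if num_bytes % frame_size != 0:
        # A partial frame usually means the encoding or channel count is wrong.
        raise ValueError("buffer length is not a whole number of frames")

    return num_bytes / frame_size / sample_rate
```

If the estimate is far from the length you expect for the recording, the declared encoding or sample rate probably does not match the data.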
Body
There's no need to break up your audio file. Long files are chunked automatically on the server.
Supported audio formats: flac, m4a, mp3, mp4, mpeg, mpga, oga, ogg, wav, webm
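A client may want to reject obviously unsupported files before uploading. The following is a minimal sketch that checks only the file extension against the list above; the helper name is hypothetical, and an extension match does not guarantee that the server can decode the file's contents.

```python
from pathlib import Path

# Extensions copied from the supported formats list above.
SUPPORTED_EXTENSIONS = frozenset(
    {"flac", "m4a", "mp3", "mp4", "mpeg", "mpga", "oga", "ogg", "wav", "webm"}
)


def has_supported_extension(filename: str) -> bool:
    """Return True if the filename's final extension is in the supported list."""
    # Path.suffix yields the last extension including the dot, e.g. ".MP3".
    suffix = Path(filename).suffix.lower().lstrip(".")
    return suffix in SUPPORTED_EXTENSIONS
```

For example, `has_supported_extension("talk.MP3")` is true and `has_supported_extension("notes.txt")` is false.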
ID of the model to use for transcription. Must be in the ink-whisper family of models.
Allowed value: ink-whisper
The language of the input audio in ISO-639-1 format.
Allowed values: en, zh, de, es, ru, ko, fr, ja, pt, tr, pl, ca, nl, ar, sv, it, id, hi, fi, vi, he, uk, el, ms, cs, ro, da, hu, ta, no, th, ur, hr, bg, lt, la, mi, ml, cy, sk, te, fa, lv, bn, sr, az, sl, kn, et, mk, br, eu, is, hy, ne, mn, bs, kk, sq, sw, gl, mr, pa, si, km, sn, yo, so, af, oc, ka, be, tg, sd, gu, am, yi, lo, uz, fo, ht, ps, tk, nn, mt, sa, lb, my, bo, tl, mg, as, tt, haw, ln, ha, ba, jw, su, yue
The granularity of timestamps to include in the response.
Currently only word-level timestamps are supported, providing start and end times for each word.
Allowed value: word
Response
The message type. Always transcript for a batch transcription response.
Allowed value: transcript
The transcribed text.
Unique identifier for this transcription request.
Not used for batch transcription.
The specified language of the input audio.
The duration of the input audio in seconds.
Word-level timestamps showing the start and end time of each word. Only included when [word] is passed into timestamp_granularities[].
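Word-level timestamps are often grouped into phrases for captions or search. The sketch below shows one possible approach: it starts a new segment whenever the silence between two words exceeds a threshold. The key names `word`, `start`, and `end` are assumptions made for illustration, since this page does not show the exact field names; adjust them to match the actual response.

```python
from typing import Dict, List


def _close_segment(words: List[Dict]) -> Dict:
    """Collapse a run of word entries into one segment."""
    return {
        "start": words[0]["start"],
        "end": words[-1]["end"],
        "text": " ".join(w["word"] for w in words),
    }


def group_words(words: List[Dict], max_gap: float = 0.5) -> List[Dict]:
    """Group word timestamps into segments separated by pauses.

    Each entry is assumed (hypothetically) to look like
    {"word": str, "start": float, "end": float}, with times in seconds.
    """
    segments: List[Dict] = []
    current: List[Dict] = []
    for w in words:
        # Start a new segment when the pause since the previous word is too long.
        if current and w["start"] - current[-1]["end"] > max_gap:
            segments.append(_close_segment(current))
            current = []
        current.append(w)
    if current:
        segments.append(_close_segment(current))
    return segments
```

The `max_gap` default of 0.5 seconds is arbitrary; a smaller value produces shorter segments.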