Migrating to the Stable API

Our stable API is now available. Our previous API (version 0) will be deprecated on June 17. All future deprecations will come with at least one month's notice.

If you run into migration issues, please message us on Discord (or in our shared Slack channel, if you’re partnered with us). We’re happy to help you migrate.

API-wide changes

You must now pass the Cartesia-Version header with every request. (For WebSockets, you may pass the cartesia_version query parameter instead.)

Remove the /v0 prefix and instead supply a Cartesia-Version header containing the version (a date in YYYY-MM-DD format) you developed or tested your integration against. As of this documentation, the latest version is 2024-06-10.

For WebSockets, you can alternatively specify the cartesia_version query parameter, which will take precedence.
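As a minimal sketch of the versioning rules above, the helpers below attach the Cartesia-Version header to a set of request headers and build the alternative query-string form for WebSocket URLs. The host api.example.com and the helper names are placeholders chosen for illustration; they are not part of the API.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# The version you developed or tested against (latest as of this documentation).
CARTESIA_VERSION = "2024-06-10"


def versioned_headers(extra=None):
    """Return a copy of `extra` with the required Cartesia-Version header added."""
    headers = dict(extra or {})
    headers["Cartesia-Version"] = CARTESIA_VERSION
    return headers


def versioned_websocket_url(url):
    """Add the cartesia_version query parameter to a WebSocket URL."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query["cartesia_version"] = CARTESIA_VERSION
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical host, used only for illustration.
print(versioned_headers())
print(versioned_websocket_url("wss://api.example.com/tts/websocket"))
```

The query-parameter form is mainly useful where the client cannot set headers on a WebSocket connection, such as in a browser.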

Voices

The /voices/{id}/embedding endpoint has been removed in favor of just /voices/{id}, which also returns other useful information about the voice. You should be able to just remove the /embedding suffix; any existing code that worked with the old endpoint should work with the new one.

Text-to-Speech

Text-to-speech endpoints have all moved from /audio to /tts, so if you were hitting /v0/audio/websocket, you should now hit /tts/websocket. (Correspondingly, for Server-Sent Events, hit /tts/sse.)
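The path changes described so far (dropping the /v0 prefix, dropping the /embedding suffix, and moving /audio to /tts) can be sketched as a single rewrite function. This is an illustrative helper, not something the API provides; the name migrate_path is made up for this example.

```python
def migrate_path(old_path):
    """Map a version-0 request path to its stable-API equivalent."""
    path = old_path

    # The /v0 prefix is gone; the version now travels in the Cartesia-Version header.
    if path == "/v0" or path.startswith("/v0/"):
        path = path[len("/v0"):] or "/"

    # /voices/{id}/embedding was removed in favor of /voices/{id}.
    if path.startswith("/voices/") and path.endswith("/embedding"):
        path = path[: -len("/embedding")]

    # Text-to-speech endpoints moved from /audio to /tts.
    if path == "/audio" or path.startswith("/audio/"):
        path = "/tts" + path[len("/audio"):]

    return path


print(migrate_path("/v0/audio/websocket"))
print(migrate_path("/v0/voices/a167e0f3-df7e-4d52-a9c3-f949145efdab/embedding"))
```

Paths that are already in the new form pass through unchanged, so the function is safe to apply more than once.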

Request Format

Both the WebSocket and the Server-Sent Events endpoints now accept the same request format. (For WebSockets, send it over the WebSocket. For Server-Sent Events, send it in the request body.)

Here’s what the new request format looks like, with required fields marked:

{
  // Required.
  "model_id": "sonic-english",

  // Required.
  "voice": {
    // Required. Choices: id, embedding.
    "mode": "id",
    // Specify if mode is id.
    "id": "a167e0f3-df7e-4d52-a9c3-f949145efdab",
    // Specify if mode is embedding.
    "embedding": [...]
  },

  // Required.
  "output_format": {
    // Required. Choices: raw.
    "container": "raw",
    // Required. Choices: pcm_f32le, pcm_s16le, pcm_mulaw, pcm_alaw.
    "encoding": "pcm_f32le",
    // Required. Choices: 8000, 16000, 22050, 24000, 44100.
    "sample_rate": 44100
  },

  // Required.
  "transcript": "Hello, world!",

  // Optional. If omitted, the maximum duration will be set to a very large value.
  "duration": 180,

  // Optional. If specified, the response will also include a context ID.
  "context_id": "happy-monkeys-fly"
}

Note that:

Many previously optional fields are now required, such as model_id and output_format.
Output format is now passed as an object.
Chunk time and lookahead are no longer accepted as parameters. If specified, they will be ignored.
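To make the new request format concrete, here is a minimal sketch of a builder that assembles the request body and checks the documented choices before anything is sent. The function name and its keyword arguments are hypothetical; only the field names and allowed values come from the format shown above.

```python
import json

ENCODINGS = {"pcm_f32le", "pcm_s16le", "pcm_mulaw", "pcm_alaw"}
SAMPLE_RATES = {8000, 16000, 22050, 24000, 44100}


def build_tts_request(model_id, transcript, *, voice_id=None, voice_embedding=None,
                      encoding="pcm_f32le", sample_rate=44100,
                      duration=None, context_id=None):
    """Build a request body in the new format, validating the documented choices."""
    if (voice_id is None) == (voice_embedding is None):
        raise ValueError("specify exactly one of voice_id or voice_embedding")
    if encoding not in ENCODINGS:
        raise ValueError(f"unsupported encoding: {encoding!r}")
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"unsupported sample rate: {sample_rate!r}")

    if voice_id is not None:
        voice = {"mode": "id", "id": voice_id}
    else:
        voice = {"mode": "embedding", "embedding": list(voice_embedding)}

    request = {
        "model_id": model_id,
        "voice": voice,
        # Output format is now an object rather than a flat value.
        "output_format": {
            "container": "raw",
            "encoding": encoding,
            "sample_rate": sample_rate,
        },
        "transcript": transcript,
    }
    # Optional fields are only included when set.
    if duration is not None:
        request["duration"] = duration
    if context_id is not None:
        request["context_id"] = context_id
    return request


body = build_tts_request(
    "sonic-english",
    "Hello, world!",
    voice_id="a167e0f3-df7e-4d52-a9c3-f949145efdab",
    context_id="happy-monkeys-fly",
)
print(json.dumps(body, indent=2))
```

The same dictionary can be serialized and sent over the WebSocket or placed in the request body for Server-Sent Events, since both endpoints accept the same format.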

Response Format

We no longer return sampling_rate. The sample rate passed in the request will be respected and should be considered authoritative. If the requested sample rate is unsupported, the API will return an error.
We no longer return length. You can calculate it from the decoded audio data.
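As a rough sketch of recovering the length yourself: for raw PCM, the number of samples is the byte count divided by the size of one sample, and the duration is that count divided by the sample rate you requested. The code below assumes single-channel audio and that you already hold the decoded bytes (if your chunks arrive in a text encoding, decode them first); both points are assumptions, not statements about the API.

```python
# Size of one sample, in bytes, for each documented encoding.
BYTES_PER_SAMPLE = {
    "pcm_f32le": 4,  # 32-bit float
    "pcm_s16le": 2,  # 16-bit signed integer
    "pcm_mulaw": 1,  # 8-bit mu-law
    "pcm_alaw": 1,   # 8-bit A-law
}


def sample_count(raw_audio, encoding):
    """Number of samples in a raw PCM buffer (assumes a single channel)."""
    size = BYTES_PER_SAMPLE[encoding]
    if len(raw_audio) % size != 0:
        raise ValueError("buffer does not contain a whole number of samples")
    return len(raw_audio) // size


def duration_seconds(raw_audio, encoding, sample_rate):
    """Duration of a raw PCM buffer at the sample rate passed in the request."""
    return sample_count(raw_audio, encoding) / sample_rate


# One second of 16-bit silence at 8000 Hz is 16000 bytes.
silence = bytes(16000)
print(sample_count(silence, "pcm_s16le"))            # 8000
print(duration_seconds(silence, "pcm_s16le", 8000))  # 1.0
```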

Server-Sent Events-specific changes

The deprecated API incorrectly implemented the Server-Sent Events response format, and was therefore rejected by standards-compliant parsers. This issue has been fixed.
Data responses now have "done": false, and the event stream ends with a final done event, just as in the WebSocket endpoint. The final done event has the following format:

{
  "done": true,
  "status_code": 200,

  // Omitted if request did not specify context ID.
  "context_id": "happy-monkeys-fly"
}
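A minimal sketch of consuming such a stream: the generator below reads Server-Sent Events lines, decodes each event's data field as JSON, and stops after the final done event. It assumes each event carries a JSON payload in its data field and ignores every other SSE field; the function name and the sample stream are made up for illustration.

```python
import json


def iter_sse_payloads(lines):
    """Yield JSON payloads from SSE lines, stopping after the final done event."""
    data_lines = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            # Per the SSE format, a single leading space after the colon is dropped.
            if value.startswith(" "):
                value = value[1:]
            data_lines.append(value)
        elif line == "" and data_lines:
            # A blank line terminates the event; multi-line data is joined with "\n".
            payload = json.loads("\n".join(data_lines))
            data_lines = []
            yield payload
            if payload.get("done"):
                return


# Hand-written sample stream: two data responses followed by the done event.
stream = [
    'data: {"done": false}\n', "\n",
    'data: {"done": false}\n', "\n",
    'data: {"done": true, "status_code": 200}\n', "\n",
    'data: {"done": false}\n', "\n",  # never reached
]
for payload in iter_sse_payloads(stream):
    print(payload)
```

Stopping on "done": true means the same loop shape works for WebSocket messages, which end with the same kind of event.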

WebSocket

You must specify inputs directly, without wrapping them in a data field, as shown under Request Format.
The API key can now be specified in a header. (Browsers still don’t allow setting headers on WebSocket connections, but this change makes it easy to use HTTP libraries that let you set global API key and version headers.)
