Amazon SageMaker JumpStart provides the quickest path to deploying Cartesia's self-hosted solution with managed infrastructure, automatic scaling, and integrated monitoring. This deployment method is ideal for teams new to self-hosted AI or those wanting managed infrastructure. To get started, subscribe to Sonic 3 on AWS Marketplace.
Overview
SageMaker JumpStart deployment offers:
- Managed Infrastructure: AWS handles server provisioning and maintenance
- Automatic Scaling: Built-in auto-scaling based on demand
- Integrated Monitoring: CloudWatch integration for metrics and logging
- Pay-per-use: Cost optimization through on-demand resource allocation
- Quick Setup: Deploy in minutes using pre-configured notebooks
Prerequisites
AWS Account Requirements
- AWS account with SageMaker access
- Sufficient service limits for GPU instances (ml.g6e.xlarge)
- IAM role with SageMaker full access and AWS Marketplace subscription access (ViewSubscriptions, Unsubscribe, Subscribe)
- VPC configuration (optional, for private deployment)
Getting Started
To get started with deploying an inference endpoint for Sonic 3 on SageMaker, please refer to the steps in this notebook.
Inference Setup
Sonic 3 supports only real-time inference on SageMaker. Please select ml.g6e.xlarge as your inference endpoint instance type. Each instance is capable of serving 8 concurrent requests. To get the best performance, SageMaker suggests reusing the client-to-SageMaker connection, which saves the time of re-establishing it. In boto3, you can configure max_pool_connections; multiple requests will then reuse connections, avoiding the cost of establishing a new TCP/TLS connection for each request.
Inputs and Outputs
Input Summary
The response streaming endpoint takes a JSON object as input that specifies the transcript, voice, language, and output format for the generation.
Input Parameters
| Parameter | Description | Type | Required |
|---|---|---|---|
| context_id | A unique ID provided by the client to identify the request. It can be any string value and helps with tracking or debugging. | string | Yes |
| transcript | The text that will be converted into speech. You can include additional controls (e.g., emotion, speed, volume) as supported by Sonic 3 models. Docs | string | Yes |
| language | The language code of the transcript text. Supported codes: en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr, tl, bg, ro, ar, cs, el, fi, hr, ms, sk, da, ta, uk, hu, no, vi, bn, th, he, ka, id, te, gu, kn, ml, mr, pa | string | Yes |
| output_format | Must match the raw option from the Cartesia TTS SSE API. Only raw is supported. Docs | string | Yes |
| voice | Matches the voice field from the Cartesia TTS SSE API. Only mode = id is supported. Example: { "mode": "id", "id": "voice_123" } Docs | object | Yes |
| generation_config | Optional configuration object matching the API schema. Docs | object | No |
| add_timestamps | Whether to include word-level timestamps in the output. Docs | boolean | No |
| add_phoneme_timestamps | Whether to include phoneme-level timestamps in the output. Docs | boolean | No |
| use_normalized_timestamps | Whether timestamps should be normalized (0–1 range). Docs | boolean | No |
Data Sample
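An illustrative request body assembled from the parameter table above. The voice ID is the placeholder from the table, and the output_format object shape (container/encoding/sample_rate) is an assumption based on the Cartesia TTS raw format, not a value taken from this page:

```python
import json

# Illustrative payload; output_format structure and voice ID are assumptions.
payload = {
    "context_id": "req-001",
    "transcript": "Hello from Sonic 3 on SageMaker.",
    "language": "en",
    "output_format": {
        "container": "raw",
        "encoding": "pcm_s16le",
        "sample_rate": 44100,
    },
    "voice": {"mode": "id", "id": "voice_123"},
    "add_timestamps": True,
}

body = json.dumps(payload)  # pass as the Body of the invocation
```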
Output Details
Output Events
SageMaker sends back the response events in a response stream. The payload arrives as base64-encoded blobs. Due to a SageMaker limitation, a single event may be truncated into several segments. Our API always appends a line break to the end of each complete event, so that you can reassemble events on the client side. Each event we send back is a JSON object that contains the generated audio chunk and some metadata. The event can be one of the following types, identified by event.type:
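Because a single event may arrive split across segments, the client should buffer bytes until it sees the trailing line break before parsing. A minimal reassembly sketch in pure Python (independent of the SageMaker SDK):

```python
import base64
import json

def iter_events(segments):
    """Yield complete JSON events from arbitrarily split byte segments.

    Each complete event is terminated by a line break, so bytes are
    buffered until a newline appears, then the line is parsed as JSON.
    """
    buffer = b""
    for segment in segments:
        buffer += segment
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield json.loads(line)

# Example: one chunk event delivered in two truncated segments.
audio_b64 = base64.b64encode(b"\x00\x01\x02\x03").decode()
event_bytes = json.dumps({"type": "chunk", "data": audio_b64, "done": False}).encode() + b"\n"
events = list(iter_events([event_bytes[:10], event_bytes[10:]]))
pcm = base64.b64decode(events[0]["data"])  # raw audio bytes, ready for playback
```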
Chunk Event
A chunk event always contains at most 20 ms worth of audio in the output format and sample rate you specified.
| Parameter | Description | Type | Required |
|---|---|---|---|
| type | The type of response event. For chunk events, this value is always "chunk". | string | Yes |
| context_id | Optional identifier for the response context. Useful for correlating responses with requests or sessions. | string | No |
| status_code | The HTTP-like status code representing the success or error state of the chunk event. | int | Yes |
| done | Indicates whether this is the final chunk (true) or if more chunks are expected (false). | bool | Yes |
| data | The base64-encoded chunk of audio data. Each chunk represents a portion of the full audio output. | string | Yes |
| sampling_rate | The sampling rate (in Hz) of the audio data in this chunk (e.g., 44100 or 8000). | int | Yes |
| step_time | The time (in seconds) representing the generation step for this chunk, useful for synchronization or latency tracking. | float | Yes |
Done Event
A done event signals the completion of the generation. Done events are identified by event.type == "done" and event.done == true.
Timestamp Event
A timestamp event provides timing information for recognized words or tokens.
| Parameter | Description | Type | Required |
|---|---|---|---|
| type | The response type. Always "timestamps". | string | Yes |
| context_id | Optional identifier correlating this timestamp event with its request/session. | string | No |
| status_code | Status code indicating success or failure. | int | Yes |
| done | Indicates whether this is the final timestamp event. | bool | Yes |
| word_timestamps | A dictionary describing word-level timestamps (format may vary by implementation). | dict<string, any> | Yes |
Phoneme Timestamp Event
A phoneme timestamp event provides timing data at the phoneme level, typically for detailed speech analysis.
| Parameter | Description | Type | Required |
|---|---|---|---|
| type | The response type. Always "phoneme_timestamps". | string | Yes |
| context_id | Optional identifier for correlating this event with a request/session. | string | No |
| status_code | Processing status code. | int | Yes |
| done | Indicates whether this is the final phoneme timestamp event. | bool | Yes |
| phoneme_timestamps | A dictionary containing phoneme-level timing information. | dict<string, any> | Yes |
Error Handling
If an error occurs during generation, SageMaker will send back the error as a ModelError. To handle the error, inspect the OriginalStatusCode field of the error object (see the Python error-handling examples).
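A hedged sketch of that dispatch. The OriginalStatusCode and Message field names come from this page; the assumption here is that they appear on the `response` dict of botocore's ClientError, so verify the exact layout against your SDK version:

```python
def handle_model_error(error_response):
    """Dispatch on the OriginalStatusCode of a SageMaker ModelError response dict."""
    status = error_response.get("OriginalStatusCode")
    if status == 422:
        return "invalid input: " + error_response.get("Message", "")
    if status == 429:
        return "container at capacity; retry with backoff"
    return f"unexpected model error (status {status})"

# Typical usage (requires boto3; sketch only):
# try:
#     runtime.invoke_endpoint_with_response_stream(...)
# except botocore.exceptions.ClientError as err:
#     print(handle_model_error(err.response))
```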
422 Errors
A 422 error indicates that your input is not of the correct format. You may see more details in the Message field.
429 Errors
A 429 error indicates that the model container you are hitting does not have capacity to serve requests at that point. Our models serve at most 4 concurrent generation requests at a time. If you are running multiple inference container replicas, we suggest using load-aware routing in SageMaker by configuring the RoutingConfig parameter inside the ProductionVariants configuration; set its RoutingStrategy to LEAST_OUTSTANDING_REQUESTS for optimal load distribution.
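A sketch of the corresponding ProductionVariants entry. Model, config, and variant names are placeholders, and the call itself is left commented since it requires AWS credentials:

```python
# ProductionVariants entry with load-aware routing; names are placeholders.
variant = {
    "VariantName": "AllTraffic",
    "ModelName": "sonic-3-model",
    "InstanceType": "ml.g6e.xlarge",
    "InitialInstanceCount": 2,
    "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
}

# import boto3
# sagemaker = boto3.client("sagemaker")
# sagemaker.create_endpoint_config(
#     EndpointConfigName="sonic-3-config",
#     ProductionVariants=[variant],
# )
```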
Container Logs
You should be able to see container logs in CloudWatch. Most logs are emitted with a request ID. The server-side request ID has the format {uuid}-{client-supplied context_id}.