Amazon SageMaker JumpStart provides the quickest path to deploying Cartesia's self-hosted solution with managed infrastructure, automatic scaling, and integrated monitoring. This deployment method is ideal for teams new to self-hosted AI or those wanting managed infrastructure. To get started, subscribe to Sonic 3 on AWS Marketplace.
Overview
SageMaker JumpStart deployment offers:
- Managed Infrastructure: AWS handles server provisioning and maintenance
- Automatic Scaling: Built-in auto-scaling based on demand
- Integrated Monitoring: CloudWatch integration for metrics and logging
- Pay-per-use: Cost optimization through on-demand resource allocation
- Quick Setup: Deploy in minutes using pre-configured notebooks
Prerequisites
AWS Account Requirements
- AWS account with SageMaker access
- Sufficient service limits for GPU instances (ml.g6e.xlarge)
- IAM role with SageMaker full access and AWS Marketplace subscription access (ViewSubscriptions, Unsubscribe, Subscribe)
- VPC configuration (optional, for private deployment)
Getting Started
To get started with deploying an inference endpoint for Sonic 3 on SageMaker, please refer to the steps in this notebook.
Inference Setup
Sonic 3 supports only real-time inference on SageMaker. Please select ml.g6e.xlarge as your inference endpoint instance type. Each instance is capable of serving 8 concurrent requests. To get the best performance, SageMaker suggests reusing the client-to-SageMaker connection, which saves the time of re-establishing it. In boto3, you can configure max_pool_connections; multiple requests will then reuse connections, avoiding the cost of establishing a new TCP/TLS connection for each request.
Inputs and Outputs
Input Summary
The response streaming endpoint takes a JSON object as input that specifies the transcript, voice, language, and output format for the generation.
Input Parameters
| Parameter | Description | Type | Required |
|---|---|---|---|
| context_id | A unique ID provided by the client to identify the request. It can be any string value and helps with tracking or debugging. | string | Yes |
| transcript | The text that will be converted into speech. You can include additional controls (e.g., emotion, speed, volume) as supported by Sonic 3 models. Docs | string | Yes |
| language | The language code of the transcript text. Supported codes: en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr, tl, bg, ro, ar, cs, el, fi, hr, ms, sk, da, ta, uk, hu, no, vi, bn, th, he, ka, id, te, gu, kn, ml, mr, pa | string | Yes |
| output_format | Must match the raw option from the Cartesia TTS SSE API. Only raw is supported. Docs | string | Yes |
| voice | Matches the voice field from the Cartesia TTS SSE API. Only mode = id is supported. Example: { "mode": "id", "id": "voice_123" } Docs | object | Yes |
| generation_config | Optional configuration object matching the API schema. Docs | object | No |
| add_timestamps | Whether to include word-level timestamps in the output. Docs | boolean | No |
| add_phoneme_timestamps | Whether to include phoneme-level timestamps in the output. Docs | boolean | No |
| use_normalized_timestamps | Whether timestamps should be normalized (0–1 range). Docs | boolean | No |
Data Sample
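An illustrative request body assembled from the parameter table above. The voice ID is the placeholder from the table, and the output_format object shape (container/encoding/sample_rate) is an assumption based on the Cartesia TTS raw format, not a value taken from this page:

```python
import json

# Illustrative payload; output_format structure and voice ID are assumptions.
payload = {
    "context_id": "req-001",
    "transcript": "Hello from Sonic 3 on SageMaker.",
    "language": "en",
    "output_format": {
        "container": "raw",
        "encoding": "pcm_s16le",
        "sample_rate": 44100,
    },
    "voice": {"mode": "id", "id": "voice_123"},
    "add_timestamps": True,
}

body = json.dumps(payload)  # pass as the Body of the invocation
```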
Output Details
Output Events
SageMaker sends back the response events in a response stream. The payload arrives as base64-encoded blobs. Due to a SageMaker limitation, a single event may be truncated into several segments. Our API always appends a line break to the end of each complete event, so that you can reassemble events on the client side. Each event we send back is a JSON object that contains the generated audio chunk and some metadata. The event can be one of the following types, identified by event.type:
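Because a single event may arrive split across segments, the client should buffer bytes until it sees the trailing line break before parsing. A minimal reassembly sketch in pure Python (independent of the SageMaker SDK):

```python
import base64
import json

def iter_events(segments):
    """Yield complete JSON events from arbitrarily split byte segments.

    Each complete event is terminated by a line break, so bytes are
    buffered until a newline appears, then the line is parsed as JSON.
    """
    buffer = b""
    for segment in segments:
        buffer += segment
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield json.loads(line)

# Example: one chunk event delivered in two truncated segments.
audio_b64 = base64.b64encode(b"\x00\x01\x02\x03").decode()
event_bytes = json.dumps({"type": "chunk", "data": audio_b64, "done": False}).encode() + b"\n"
events = list(iter_events([event_bytes[:10], event_bytes[10:]]))
pcm = base64.b64decode(events[0]["data"])  # raw audio bytes, ready for playback
```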
Chunk Event
A chunk event always contains at most 20 ms worth of audio in the output format and sample rate you specified.
| Parameter | Description | Type | Required |
|---|---|---|---|
| type | The type of response event. For chunk events, this value is always "chunk". | string | Yes |
| context_id | Optional identifier for the response context. Useful for correlating responses with requests or sessions. | string | No |
| status_code | The HTTP-like status code representing the success or error state of the chunk event. | int | Yes |
| done | Indicates whether this is the final chunk (true) or if more chunks are expected (false). | bool | Yes |
| data | The base64-encoded chunk of audio data. Each chunk represents a portion of the full audio output. | string | Yes |
| sampling_rate | The sampling rate (in Hz) of the audio data in this chunk (e.g., 44100 or 8000). | int | Yes |
| step_time | The time (in seconds) representing the generation step for this chunk, useful for synchronization or latency tracking. | float | Yes |
Done Event
A done event signals the completion of the generation. Done events are identified by event.type == "done" and event.done == true.
Timestamp Event
A timestamp event provides timing information for recognized words or tokens.
| Parameter | Description | Type | Required |
|---|---|---|---|
| type | The response type. Always "timestamps". | string | Yes |
| context_id | Optional identifier correlating this timestamp event with its request/session. | string | No |
| status_code | Status code indicating success or failure. | int | Yes |
| done | Indicates whether this is the final timestamp event. | bool | Yes |
| word_timestamps | A dictionary describing word-level timestamps (format may vary by implementation). | dict<string, any> | Yes |
Phoneme Timestamp Event
A phoneme timestamp event provides timing data at the phoneme level, typically for detailed speech analysis.
| Parameter | Description | Type | Required |
|---|---|---|---|
| type | The response type. Always "phoneme_timestamps". | string | Yes |
| context_id | Optional identifier for correlating this event with a request/session. | string | No |
| status_code | Processing status code. | int | Yes |
| done | Indicates whether this is the final phoneme timestamp event. | bool | Yes |
| phoneme_timestamps | A dictionary containing phoneme-level timing information. | dict<string, any> | Yes |
Error Handling
If an error occurs during generation, SageMaker will send back the error as a ModelError. To handle the error, inspect the OriginalStatusCode field of the error object (see the Python error-handling examples).
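A hedged sketch of that dispatch. The OriginalStatusCode and Message field names come from this page; the assumption here is that they appear on the `response` dict of botocore's ClientError, so verify the exact layout against your SDK version:

```python
def handle_model_error(error_response):
    """Dispatch on the OriginalStatusCode of a SageMaker ModelError response dict."""
    status = error_response.get("OriginalStatusCode")
    if status == 422:
        return "invalid input: " + error_response.get("Message", "")
    if status == 429:
        return "container at capacity; retry with backoff"
    return f"unexpected model error (status {status})"

# Typical usage (requires boto3; sketch only):
# try:
#     runtime.invoke_endpoint_with_response_stream(...)
# except botocore.exceptions.ClientError as err:
#     print(handle_model_error(err.response))
```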
422 Errors
A 422 error indicates that your input is not of the correct format. You may see more details in the Message field.
429 Errors
A 429 error indicates that the model container you are hitting does not have capacity to serve requests at that point. Our models serve at most 4 concurrent generation requests at a time. If you are running multiple inference container replicas, we suggest using load-aware routing in SageMaker by configuring the RoutingConfig parameter inside the ProductionVariants configuration; set its RoutingStrategy to LEAST_OUTSTANDING_REQUESTS for optimal load distribution.
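A sketch of the corresponding ProductionVariants entry. Model, config, and variant names are placeholders, and the call itself is left commented since it requires AWS credentials:

```python
# ProductionVariants entry with load-aware routing; names are placeholders.
variant = {
    "VariantName": "AllTraffic",
    "ModelName": "sonic-3-model",
    "InstanceType": "ml.g6e.xlarge",
    "InitialInstanceCount": 2,
    "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
}

# import boto3
# sagemaker = boto3.client("sagemaker")
# sagemaker.create_endpoint_config(
#     EndpointConfigName="sonic-3-config",
#     ProductionVariants=[variant],
# )
```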
Container Logs
You should be able to see container logs in CloudWatch. Most logs are emitted with a request ID. The server-side request ID has the format {uuid}-{client-supplied context_id}.