
# Infill (Bytes)

> Generate audio that smoothly connects two existing audio segments. This is useful for inserting new speech between existing speech segments while maintaining natural transitions.

**The cost is 1 credit per character of the infill text plus a fixed cost of 300 credits.**
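As a rough illustration of the pricing rule, the sketch below estimates the credits for one request. The helper name `estimate_infill_credits` is hypothetical, and it assumes that every character of the transcript, including spaces and punctuation, counts as one character; actual billing may count characters differently.

```python
# Pricing constants taken from the note above.
FIXED_COST_CREDITS = 300
CREDITS_PER_CHARACTER = 1


def estimate_infill_credits(transcript: str) -> int:
    """Estimate the credits for one infill request (hypothetical helper).

    Assumption: each character of the transcript, including whitespace,
    counts as one billable character.
    """
    return FIXED_COST_CREDITS + CREDITS_PER_CHARACTER * len(transcript)


# A 30-character transcript: 300 fixed + 30 per-character credits.
print(estimate_infill_credits("and then we walked to the park"))  # prints 330
```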

At least one of `left_audio` or `right_audio` must be provided.

As with all generative models, results have some inherent variability. Here are some tips we recommend for getting the best results from infill:
- Use longer infill transcripts
  - This gives the model more flexibility to adapt to the rest of the audio
- Target natural pauses in the audio when deciding where to clip
  - This means you don't need word-level timestamps to be as precise
- Clip right up to the start and end of the audio segment you want infilled, keeping as much silence in the left/right audio segments as possible
  - This helps the model generate more natural transitions
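
One way to act on the clipping tips is to snap an approximate cut time to the quietest nearby moment in the audio. The sketch below is an illustration only, not part of the Cartesia API: `quietest_cut` is a hypothetical helper that assumes mono audio already decoded into a list of float samples, and the search window and frame length are arbitrary defaults.

```python
def quietest_cut(samples, sample_rate, approx_s, search_s=0.25, frame_s=0.02):
    """Return the sample index at the middle of the quietest frame near approx_s.

    samples     -- mono audio as a sequence of floats (assumed already decoded)
    sample_rate -- samples per second
    approx_s    -- approximate cut time in seconds, e.g. from a rough timestamp
    search_s    -- how far to search on either side of approx_s, in seconds
    frame_s     -- length of each energy-measurement frame, in seconds
    """
    frame = max(1, int(frame_s * sample_rate))
    center = int(approx_s * sample_rate)
    radius = int(search_s * sample_rate)
    lo = max(0, center - radius)
    hi = min(len(samples) - frame, center + radius)
    if hi < lo:
        raise ValueError("audio is too short near the requested time")

    best_start, best_energy = lo, float("inf")
    for start in range(lo, hi + 1, frame):
        # Sum of squared samples is a simple proxy for loudness.
        energy = sum(s * s for s in samples[start:start + frame])
        if energy < best_energy:
            best_start, best_energy = start, energy
    return best_start + frame // 2
```

Samples before the returned index would go into the left audio segment and samples after the end of the removed span into the right one. Cutting inside a pause this way keeps some silence in the surrounding audio.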



## OpenAPI

````yaml /latest.yml POST /infill/bytes
openapi: 3.0.1
info:
  title: Cartesia API
  version: 0.0.1
servers:
  - url: https://api.cartesia.ai
    description: Production
security: []
paths:
  /infill/bytes:
    post:
      tags:
        - Infill
      summary: Infill (Bytes)
      description: >-
        Generate audio that smoothly connects two existing audio segments. This
        is useful for inserting new speech between existing speech segments
        while maintaining natural transitions.


        **The cost is 1 credit per character of the infill text plus a fixed
        cost of 300 credits.**


        At least one of `left_audio` or `right_audio` must be provided.


        As with all generative models, results have some inherent variability.
        Here are some tips we recommend for getting the best results from infill:

        - Use longer infill transcripts
          - This gives the model more flexibility to adapt to the rest of the audio
        - Target natural pauses in the audio when deciding where to clip
          - This means you don't need word-level timestamps to be as precise
        - Clip right up to the start and end of the audio segment you want
        infilled, keeping as much silence in the left/right audio segments as
        possible
          - This helps the model generate more natural transitions
      operationId: infill_bytes
      parameters:
        - $ref: '#/components/parameters/CartesiaVersionHeader'
      requestBody:
        required: true
        content:
          multipart/form-data:
            schema:
              type: object
              properties:
                left_audio:
                  type: string
                  format: binary
                right_audio:
                  type: string
                  format: binary
                model_id:
                  description: >-
                    The ID of the model to use for generating audio. Any model
                    other than the first `"sonic"` model is supported.
                  type: string
                language:
                  description: The language of the transcript
                  type: string
                transcript:
                  description: The infill text to generate
                  type: string
                voice_id:
                  description: The ID of the voice to use for generating audio
                  type: string
                output_format[container]:
                  $ref: '#/components/schemas/OutputFormatContainer'
                  description: The format of the output audio
                output_format[sample_rate]:
                  description: The sample rate of the output audio
                  type: integer
                  enum:
                    - 8000
                    - 16000
                    - 22050
                    - 24000
                    - 44100
                    - 48000
                output_format[encoding]:
                  $ref: '#/components/schemas/RawEncoding'
                  description: Required for `raw` and `wav` containers.
                  nullable: true
                output_format[bit_rate]:
                  description: Required for `mp3` containers.
                  type: integer
                  nullable: true
      responses:
        '200':
          description: Audio bytes
          content:
            audio/*:
              schema:
                type: string
                format: binary
      security:
        - APIKeyAuth: []
components:
  parameters:
    CartesiaVersionHeader:
      name: Cartesia-Version
      in: header
      description: API version header.
      required: true
      schema:
        type: string
        format: date
        example: '2026-03-01'
        enum:
          - '2026-03-01'
  schemas:
    OutputFormatContainer:
      title: OutputFormatContainer
      type: string
      enum:
        - raw
        - wav
        - mp3
    RawEncoding:
      title: RawEncoding
      type: string
      description: >-
        The encoding format for output audio. See [Choosing TTS
        Parameters](/build-with-cartesia/capability-guides/choosing-tts-parameters)
        if you're unsure what to use.
      enum:
        - pcm_f32le
        - pcm_s16le
        - pcm_mulaw
        - pcm_alaw
  securitySchemes:
    APIKeyAuth:
      type: http
      scheme: bearer
      bearerFormat: API Key
      description: >-
        Cartesia API key (`sk_car_...`). Get one at
        [play.cartesia.ai/keys](https://play.cartesia.ai/keys).

````
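
The following is a minimal sketch of how a request matching the schema above might be assembled in Python. The function names, the upload filenames, and the default output settings are assumptions made for illustration; the URL, header names, form field names, and enum values come from the specification. The schema does not say which audio formats are accepted for `left_audio` and `right_audio`, so the WAV filenames here are only a guess. Sending the request uses the third-party `requests` package.

```python
API_URL = "https://api.cartesia.ai/infill/bytes"
API_VERSION = "2026-03-01"  # the only value listed for Cartesia-Version


def build_infill_request(api_key, transcript, model_id, voice_id, language,
                         left_audio=None, right_audio=None,
                         container="wav", sample_rate=44100,
                         encoding="pcm_s16le", bit_rate=None):
    """Assemble URL, headers, form fields, and files (hypothetical helper)."""
    if left_audio is None and right_audio is None:
        raise ValueError("at least one of left_audio or right_audio is required")

    data = {
        "model_id": model_id,
        "language": language,
        "transcript": transcript,
        "voice_id": voice_id,
        "output_format[container]": container,
        "output_format[sample_rate]": str(sample_rate),
    }
    if container in ("raw", "wav"):
        if encoding is None:
            raise ValueError("encoding is required for raw and wav containers")
        data["output_format[encoding]"] = encoding
    elif container == "mp3":
        if bit_rate is None:
            raise ValueError("bit_rate is required for mp3 containers")
        data["output_format[bit_rate]"] = str(bit_rate)

    # Filenames are placeholders; the input audio format is an assumption.
    files = {}
    if left_audio is not None:
        files["left_audio"] = ("left.wav", left_audio)
    if right_audio is not None:
        files["right_audio"] = ("right.wav", right_audio)

    headers = {
        "Authorization": f"Bearer {api_key}",
        "Cartesia-Version": API_VERSION,
    }
    return {"url": API_URL, "headers": headers, "data": data, "files": files}


def send_infill_request(request):
    """POST the assembled request and return the generated audio bytes."""
    import requests  # third-party; imported here so building needs no extras

    response = requests.post(request["url"], headers=request["headers"],
                             data=request["data"], files=request["files"],
                             timeout=60)
    response.raise_for_status()
    return response.content
```

A call would pass the bytes of the clips on either side of the gap, for example `build_infill_request(api_key, "the infill text", model_id, voice_id, "en", left_audio=left_bytes, right_audio=right_bytes)`, and then hand the result to `send_infill_request`. The model and voice IDs are left to the caller because the specification does not list valid values.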