# WebSocket TTS Streaming

> Real-time text-to-speech streaming via WebSocket

<Note>
  The WebSocket TTS endpoint streams text in and audio out over a single bidirectional connection for low-latency text-to-speech generation. All messages are serialized with MessagePack.
</Note>
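
As a minimal sketch, the connection details defined in the AsyncAPI specification on this page (the `wss` protocol, host `api.fish.audio`, address `/v1/tts/live`, and the `Authorization` and `model` headers) can be assembled as shown here. The helper name `connection_params` is hypothetical, and how the headers are attached to the handshake depends on the WebSocket client library you choose.

```python
from typing import Dict, Tuple

# Built from the spec's server entry: protocol wss, host api.fish.audio,
# address /v1/tts/live.
TTS_LIVE_URL = "wss://api.fish.audio/v1/tts/live"


def connection_params(api_key: str, model: str = "s1") -> Tuple[str, Dict[str, str]]:
    """Return the URL and handshake headers for a TTS streaming session.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    if not api_key:
        raise ValueError("api_key must be a non-empty string")
    headers = {
        # Bearer token, as described by the bearerAuth security scheme.
        "Authorization": f"Bearer {api_key}",
        # Required header; "s1" is the only value the spec lists.
        "model": model,
    }
    return TTS_LIVE_URL, headers


url, headers = connection_params("YOUR_API_KEY")
print(url)
print(headers)
```

Keeping the URL and headers in one place makes it easier to swap WebSocket libraries without touching the authentication logic.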


## AsyncAPI

````yaml api-reference/asyncapi.yml ttsLive
id: ttsLive
title: TTS live
description: >
  Real-time TTS streaming channel. Clients send text chunks and receive audio
  chunks concurrently.


  ## Connection Headers

  - `Authorization: Bearer <api_key>` - Required for authentication (see
  security section)

  - `model: <model_name>` - Required to specify which TTS model to use (see
  bindings)
servers:
  - id: production
    protocol: wss
    host: api.fish.audio
    bindings: []
    variables: []
address: /v1/tts/live
parameters: []
bindings:
  - protocol: ws
    version: latest
    value:
      headers:
        type: object
        required:
          - model
        properties:
          model:
            type: string
            enum:
              - s1
            description: TTS model to use for this session
    schemaProperties:
      - name: headers
        type: object
        required: false
        properties:
          - name: model
            type: string
            description: TTS model to use for this session
            enumValues:
              - s1
            required: true
operations:
  - &ref_3
    id: receiveText
    title: Receive text
    description: >
      Server receives text and control events from the client.


      **Event Sequence:**

      1. Client sends StartEvent once at the beginning with TTS configuration

      2. Client sends TextEvent for each text chunk to synthesize

      3. Client optionally sends FlushEvent to force immediate synthesis of
      buffered text

      4. Client sends CloseEvent when all text has been sent
    type: receive
    messages:
      - &ref_5
        id: startEvent
        contentType: application/msgpack
        payload:
          - name: Start TTS Session
            description: >
              Initiates a TTS streaming session with configuration.


              This must be the first message sent after connecting. It contains
              all the configuration for voice, audio format, and generation
              parameters.
            type: object
            properties:
              - name: event
                type: string
                description: Event type identifier
                required: true
              - name: request
                type: object
                required: true
                properties:
                  - name: text
                    type: string
                    description: >
                      Text to synthesize. For WebSocket streaming, this is
                      typically empty in the StartEvent (text is sent via
                      TextEvent messages).
                    required: true
                  - name: temperature
                    type: number
                    description: >
                      Controls randomness in speech generation. Higher values
                      (e.g., 1.0) make output more random; lower values (e.g.,
                      0.1) make it more deterministic.
                    required: false
                  - name: top_p
                    type: number
                    description: >
                      Controls diversity via nucleus sampling. Lower values
                      (e.g., 0.1) make output more focused; higher values (e.g.,
                      1.0) allow more diversity.
                    required: false
                  - name: references
                    type: array
                    description: >
                      Reference audio samples for instant voice cloning. Provide
                      audio samples with transcriptions to clone a voice in
                      real time.
                    required: false
                    properties:
                      - name: audio
                        type: string
                        description: Audio file bytes for the reference sample
                        required: true
                      - name: text
                        type: string
                        description: >
                          Transcription of what is spoken in the reference
                          audio. Should match exactly what's spoken and include
                          punctuation for proper prosody.
                        required: true
                  - name: reference_id
                    type: &ref_0
                      - string
                      - 'null'
                    description: >
                      ID of a pre-trained reference model from fish.audio.

                      Find model IDs in voice URLs (e.g.,
                      '802e3bc2b27e49c2995d23ef70e6ac89').
                    required: false
                  - name: prosody
                    type: object
                    description: Speech speed and volume settings
                    required: false
                    properties:
                      - name: speed
                        type: number
                        description: |
                          Speech speed multiplier. Range: 0.5-2.0.
                          Examples: 1.5 = 50% faster, 0.8 = 20% slower
                        required: false
                      - name: volume
                        type: number
                        description: >
                          Volume adjustment in decibels. Range: -20 to 20.

                          Positive values increase volume, negative values
                          decrease it.
                        required: false
                  - name: chunk_length
                    type: integer
                    description: >
                      Characters per generation chunk. Lower values give a
                      faster initial response but potentially lower quality;
                      higher values give better quality but a slower response.
                    required: false
                  - name: normalize
                    type: boolean
                    description: >
                      Whether to normalize/clean input text. Reduces latency but
                      may reduce performance on numbers and dates.
                    required: false
                  - name: format
                    type: string
                    description: Audio output format
                    enumValues:
                      - wav
                      - pcm
                      - mp3
                      - opus
                    required: false
                  - name: sample_rate
                    type: &ref_1
                      - integer
                      - 'null'
                    description: >
                      Audio sample rate in Hz. If not specified, uses
                      format-specific default.
                    required: false
                  - name: mp3_bitrate
                    type: integer
                    description: MP3 bitrate in kbps (only used when format=mp3)
                    enumValues:
                      - 64
                      - 128
                      - 192
                    required: false
                  - name: opus_bitrate
                    type: integer
                    description: Opus bitrate in kbps (only used when format=opus)
                    enumValues:
                      - -1000
                      - 24
                      - 32
                      - 48
                      - 64
                    required: false
                  - name: latency
                    type: string
                    description: >
                      Generation mode:

                      - 'normal': Higher quality, slower

                      - 'balanced': Faster generation, may have slight quality
                      reduction
                    enumValues:
                      - normal
                      - balanced
                    required: false
        headers: []
        jsonPayloadSchema:
          type: object
          required:
            - event
            - request
          properties:
            event:
              type: string
              const: start
              description: Event type identifier
              x-parser-schema-id: <anonymous-schema-2>
            request:
              type: object
              required:
                - text
              properties:
                text:
                  type: string
                  description: >
                    Text to synthesize. For WebSocket streaming, this is
                    typically empty in the StartEvent (text is sent via
                    TextEvent messages).
                  x-parser-schema-id: <anonymous-schema-3>
                temperature:
                  type: number
                  minimum: 0
                  maximum: 1
                  default: 0.7
                  description: >
                    Controls randomness in speech generation. Higher values
                    (e.g., 1.0) make output more random; lower values (e.g.,
                    0.1) make it more deterministic.
                  x-parser-schema-id: <anonymous-schema-4>
                top_p:
                  type: number
                  minimum: 0
                  maximum: 1
                  default: 0.7
                  description: >
                    Controls diversity via nucleus sampling. Lower values (e.g.,
                    0.1) make output more focused; higher values (e.g., 1.0)
                    allow more diversity.
                  x-parser-schema-id: <anonymous-schema-5>
                references:
                  type: array
                  items:
                    type: object
                    required:
                      - audio
                      - text
                    properties:
                      audio:
                        type: string
                        format: binary
                        description: Audio file bytes for the reference sample
                        x-parser-schema-id: <anonymous-schema-7>
                      text:
                        type: string
                        description: >
                          Transcription of what is spoken in the reference
                          audio. Should match exactly what's spoken and include
                          punctuation for proper prosody.
                        x-parser-schema-id: <anonymous-schema-8>
                    x-parser-schema-id: ReferenceAudio
                  description: >
                    Reference audio samples for instant voice cloning. Provide
                    audio samples with transcriptions to clone a voice in
                    real time.
                  x-parser-schema-id: <anonymous-schema-6>
                reference_id:
                  type: *ref_0
                  description: >
                    ID of a pre-trained reference model from fish.audio.

                    Find model IDs in voice URLs (e.g.,
                    '802e3bc2b27e49c2995d23ef70e6ac89').
                  x-parser-schema-id: <anonymous-schema-9>
                prosody:
                  oneOf:
                    - type: object
                      properties:
                        speed:
                          type: number
                          minimum: 0.5
                          maximum: 2
                          default: 1
                          description: |
                            Speech speed multiplier. Range: 0.5-2.0.
                            Examples: 1.5 = 50% faster, 0.8 = 20% slower
                          x-parser-schema-id: <anonymous-schema-11>
                        volume:
                          type: number
                          minimum: -20
                          maximum: 20
                          default: 0
                          description: >
                            Volume adjustment in decibels. Range: -20 to 20.

                            Positive values increase volume, negative values
                            decrease it.
                          x-parser-schema-id: <anonymous-schema-12>
                      x-parser-schema-id: ProsodyControl
                    - type: 'null'
                      x-parser-schema-id: <anonymous-schema-13>
                  description: Speech speed and volume settings
                  x-parser-schema-id: <anonymous-schema-10>
                chunk_length:
                  type: integer
                  minimum: 100
                  maximum: 300
                  default: 200
                  description: >
                    Characters per generation chunk. Lower values give a
                    faster initial response but potentially lower quality;
                    higher values give better quality but a slower response.
                  x-parser-schema-id: <anonymous-schema-14>
                normalize:
                  type: boolean
                  default: true
                  description: >
                    Whether to normalize/clean input text. Reduces latency but
                    may reduce performance on numbers and dates.
                  x-parser-schema-id: <anonymous-schema-15>
                format:
                  type: string
                  enum:
                    - wav
                    - pcm
                    - mp3
                    - opus
                  default: mp3
                  description: Audio output format
                  x-parser-schema-id: <anonymous-schema-16>
                sample_rate:
                  type: *ref_1
                  description: >
                    Audio sample rate in Hz. If not specified, uses
                    format-specific default.
                  x-parser-schema-id: <anonymous-schema-17>
                mp3_bitrate:
                  type: integer
                  enum:
                    - 64
                    - 128
                    - 192
                  default: 128
                  description: MP3 bitrate in kbps (only used when format=mp3)
                  x-parser-schema-id: <anonymous-schema-18>
                opus_bitrate:
                  type: integer
                  enum:
                    - -1000
                    - 24
                    - 32
                    - 48
                    - 64
                  default: 32
                  description: Opus bitrate in kbps (only used when format=opus)
                  x-parser-schema-id: <anonymous-schema-19>
                latency:
                  type: string
                  enum:
                    - normal
                    - balanced
                  default: balanced
                  description: >
                    Generation mode:

                    - 'normal': Higher quality, slower

                    - 'balanced': Faster generation, may have slight quality
                    reduction
                  x-parser-schema-id: <anonymous-schema-20>
              x-parser-schema-id: TTSRequest
          x-parser-schema-id: <anonymous-schema-1>
        title: Start TTS Session
        description: >
          Initiates a TTS streaming session with configuration.


          This must be the first message sent after connecting. It contains all
          the configuration for voice, audio format, and generation parameters.
        example: |-
          {
            "event": "start",
            "request": {
              "text": "",
              "format": "mp3",
              "chunk_length": 200,
              "reference_id": "802e3bc2b27e49c2995d23ef70e6ac89",
              "latency": "balanced"
            }
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: startEvent
      - &ref_6
        id: textEvent
        contentType: application/msgpack
        payload:
          - name: Send Text Chunk
            description: >
              Sends a chunk of text for synthesis.


              You can send multiple TextEvent messages in sequence. The server
              will buffer and synthesize text according to the chunk_length
              parameter from StartEvent.
            type: object
            properties:
              - name: event
                type: string
                description: Event type identifier
                required: true
              - name: text
                type: string
                description: Text chunk to synthesize
                required: true
        headers: []
        jsonPayloadSchema:
          type: object
          required:
            - event
            - text
          properties:
            event:
              type: string
              const: text
              description: Event type identifier
              x-parser-schema-id: <anonymous-schema-22>
            text:
              type: string
              description: Text chunk to synthesize
              x-parser-schema-id: <anonymous-schema-23>
          x-parser-schema-id: <anonymous-schema-21>
        title: Send Text Chunk
        description: >
          Sends a chunk of text for synthesis.


          You can send multiple TextEvent messages in sequence. The server will
          buffer and synthesize text according to the chunk_length parameter
          from StartEvent.
        example: |-
          {
            "event": "text",
            "text": "Hello, this is streaming text. "
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: textEvent
      - &ref_7
        id: flushEvent
        contentType: application/msgpack
        payload:
          - name: Flush Buffered Text
            description: >
              Forces immediate synthesis of all buffered text.


              Use this when you want audio generated immediately without waiting
              for more text or for the buffer to fill up. Useful for ensuring
              low latency in interactive applications.
            type: object
            properties:
              - name: event
                type: string
                description: Event type identifier
                required: true
        headers: []
        jsonPayloadSchema:
          type: object
          required:
            - event
          properties:
            event:
              type: string
              const: flush
              description: Event type identifier
              x-parser-schema-id: <anonymous-schema-25>
          x-parser-schema-id: <anonymous-schema-24>
        title: Flush Buffered Text
        description: >
          Forces immediate synthesis of all buffered text.


          Use this when you want audio generated immediately without waiting for
          more text or for the buffer to fill up. Useful for ensuring low
          latency in interactive applications.
        example: |-
          {
            "event": "flush"
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: flushEvent
      - &ref_8
        id: closeEvent
        contentType: application/msgpack
        payload:
          - name: End TTS Session
            description: >
              Signals the end of the text stream.


              After sending this event, the server will finish synthesizing any
              remaining buffered text and send a FinishEvent before closing the
              connection.
            type: object
            properties:
              - name: event
                type: string
                description: Event type identifier (note 'stop', not 'close')
                required: true
        headers: []
        jsonPayloadSchema:
          type: object
          required:
            - event
          properties:
            event:
              type: string
              const: stop
              description: Event type identifier (note 'stop', not 'close')
              x-parser-schema-id: <anonymous-schema-27>
          x-parser-schema-id: <anonymous-schema-26>
        title: End TTS Session
        description: >
          Signals the end of the text stream.


          After sending this event, the server will finish synthesizing any
          remaining buffered text and send a FinishEvent before closing the
          connection.
        example: |-
          {
            "event": "stop"
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: closeEvent
    bindings: []
    extensions: &ref_2
      - id: x-parser-unique-object-id
        value: ttsLive
  - &ref_4
    id: sendAudio
    title: Send audio
    description: >
      Server sends audio chunks and completion events to the client.


      **Event Flow:**

      - Server sends AudioEvent messages as audio is generated (multiple times)

      - Server sends FinishEvent once when synthesis completes

      - Clients should ignore unknown events to support future protocol
      extensions
    type: send
    messages:
      - &ref_9
        id: audioEvent
        contentType: application/msgpack
        payload:
          - name: Audio Chunk
            description: >
              Contains generated audio bytes.


              You will receive multiple AudioEvent messages as audio is
              generated. Each message contains a chunk of audio in the format
              you specified. Concatenate all chunks to get the complete audio.
            type: object
            properties:
              - name: event
                type: string
                description: Event type identifier
                required: true
              - name: audio
                type: string
                description: >-
                  Audio bytes in the format specified in StartEvent (mp3, wav,
                  pcm, or opus)
                required: true
        headers: []
        jsonPayloadSchema:
          type: object
          required:
            - event
            - audio
          properties:
            event:
              type: string
              const: audio
              description: Event type identifier
              x-parser-schema-id: <anonymous-schema-29>
            audio:
              type: string
              format: binary
              description: >-
                Audio bytes in the format specified in StartEvent (mp3, wav,
                pcm, or opus)
              x-parser-schema-id: <anonymous-schema-30>
          x-parser-schema-id: <anonymous-schema-28>
        title: Audio Chunk
        description: >
          Contains generated audio bytes.


          You will receive multiple AudioEvent messages as audio is generated.
          Each message contains a chunk of audio in the format you specified.
          Concatenate all chunks to get the complete audio.
        example: |-
          {
            "event": "audio",
            "audio": "<binary audio data>"
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: audioEvent
      - &ref_10
        id: finishEvent
        contentType: application/msgpack
        payload:
          - name: Session Complete
            description: >
              Signals that the TTS session has completed.


              - If reason='stop', synthesis completed successfully

              - If reason='error', an error occurred (client should handle
              gracefully)


              The WebSocket connection will close after this event.
            type: object
            properties:
              - name: event
                type: string
                description: Event type identifier
                required: true
              - name: reason
                type: string
                description: |
                  Completion reason:
                  - 'stop': Normal completion
                  - 'error': An error occurred during synthesis
                enumValues:
                  - stop
                  - error
                required: true
        headers: []
        jsonPayloadSchema:
          type: object
          required:
            - event
            - reason
          properties:
            event:
              type: string
              const: finish
              description: Event type identifier
              x-parser-schema-id: <anonymous-schema-32>
            reason:
              type: string
              enum:
                - stop
                - error
              description: |
                Completion reason:
                - 'stop': Normal completion
                - 'error': An error occurred during synthesis
              x-parser-schema-id: <anonymous-schema-33>
          x-parser-schema-id: <anonymous-schema-31>
        title: Session Complete
        description: >
          Signals that the TTS session has completed.


          - If reason='stop', synthesis completed successfully

          - If reason='error', an error occurred (client should handle
          gracefully)


          The WebSocket connection will close after this event.
        example: |-
          {
            "event": "finish",
            "reason": "stop"
          }
        bindings: []
        extensions:
          - id: x-parser-unique-object-id
            value: finishEvent
    bindings: []
    extensions: *ref_2
sendOperations:
  - *ref_3
receiveOperations:
  - *ref_4
sendMessages:
  - *ref_5
  - *ref_6
  - *ref_7
  - *ref_8
receiveMessages:
  - *ref_9
  - *ref_10
extensions:
  - id: x-parser-unique-object-id
    value: ttsLive
securitySchemes:
  - id: bearerAuth
    name: bearerAuth
    type: http
    description: |
      API key authentication using Bearer token.

      Get your API key from https://fish.audio/app/api-keys

      Pass the token in the Authorization header:
      `Authorization: Bearer YOUR_API_KEY`
    scheme: bearer
    extensions: []

````
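
The event flow described in the specification above (start, text, optional flush, stop from the client; audio chunks followed by a finish event from the server) can be sketched as plain Python data. The function names `client_events` and `collect_audio` are hypothetical. The sketch works on already-decoded dictionaries, so it leaves out the MessagePack encoding and the WebSocket transport, both of which depend on the libraries you choose.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple

Event = Dict[str, Any]


def client_events(chunks: Iterable[str], flush_each: bool = False,
                  **request: Any) -> Iterator[Event]:
    """Yield client events in the order the spec describes."""
    # StartEvent: "text" is required but typically empty when streaming.
    yield {"event": "start", "request": {"text": "", **request}}
    for chunk in chunks:
        yield {"event": "text", "text": chunk}
        if flush_each:
            # FlushEvent is optional; it forces synthesis of buffered text.
            yield {"event": "flush"}
    # CloseEvent uses the identifier "stop", not "close".
    yield {"event": "stop"}


def collect_audio(server_events: Iterable[Event]) -> Tuple[bytes, str]:
    """Concatenate audio chunks until a finish event; return (audio, reason)."""
    audio = bytearray()
    for event in server_events:
        kind = event.get("event")
        if kind == "audio":
            audio.extend(event["audio"])
        elif kind == "finish":
            # reason is "stop" on success or "error" on failure.
            return bytes(audio), event["reason"]
        # Any other event type is ignored, as the spec recommends.
    raise ConnectionError("stream ended without a finish event")


# Simulated session: no network access, only the message shapes.
outgoing = list(client_events(["Hello, ", "world."], format="mp3", latency="balanced"))
incoming = [
    {"event": "audio", "audio": b"\x01\x02"},
    {"event": "ping"},  # made-up unknown event, to show it is skipped
    {"event": "audio", "audio": b"\x03"},
    {"event": "finish", "reason": "stop"},
]
audio, reason = collect_audio(incoming)
print([e["event"] for e in outgoing])
print(len(audio), reason)
```

In a real client, each outgoing dictionary would be serialized with a MessagePack library (for example `msgpack.packb` from the `msgpack` package) and sent as a binary WebSocket frame, and each incoming frame would be decoded (for example with `msgpack.unpackb`) before being passed to `collect_audio`.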