# Running Inference

> Generate speech using self-hosted Fish Audio models

Fish Audio supports multiple inference methods: command line, HTTP API, WebUI, and GUI. Choose the method that best fits your workflow.

<Note>
  This guide assumes you have already [installed Fish Audio locally](/developer-guide/self-hosting/local-setup) or [set up Docker deployment](/developer-guide/self-hosting/docker-deployment).
</Note>

## Download Weights

Before running inference, download the required model weights from Hugging Face:

```bash theme={null}
# Install the Hugging Face CLI (if not already installed)
# The quotes keep shells such as zsh from expanding the square brackets
pip install "huggingface_hub[cli]"
# or
uv tool install "huggingface_hub[cli]"

# Download Fish Audio S1-mini weights
hf download fishaudio/openaudio-s1-mini --local-dir checkpoints/openaudio-s1-mini
```

<Info>
  **Fish Audio S1-mini** is the open-source distilled version (0.5B parameters) optimized for local deployment. The full **S1** model (4B parameters) is available exclusively on [Fish Audio cloud](https://fish.audio).
</Info>
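
Before moving on, it can help to confirm that the download finished. The sketch below checks only for `codec.pth`, the one weight file this guide references by name. The helper name `missing_checkpoint_files` and its default file list are illustrative assumptions, not part of Fish Audio's tooling.

```python
from pathlib import Path


def missing_checkpoint_files(checkpoint_dir, required=("codec.pth",)):
    """Return the names in `required` that are absent or empty in checkpoint_dir."""
    root = Path(checkpoint_dir)
    missing = []
    for name in required:
        path = root / name
        # A zero-byte file usually points to an interrupted download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

Calling `missing_checkpoint_files("checkpoints/openaudio-s1-mini")` returns an empty list when every listed file is present and non-empty.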

## Command Line Inference

Command line inference provides maximum control and is ideal for scripting and batch processing.

### Step 1: Extract VQ Tokens from Reference Audio

First, encode your reference audio to get voice characteristics:

```bash theme={null}
python fish_speech/models/dac/inference.py \
    -i "reference_audio.wav" \
    --checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth"
```

This generates two files:

* `fake.npy` - VQ tokens representing voice characteristics
* `fake.wav` - Reconstructed audio for verification

<Tip>
  **Skip this step if you want random voice generation** - the model can generate speech without reference audio.
</Tip>

### Step 2: Generate Semantic Tokens from Text

Convert your text to semantic tokens using the language model:

```bash theme={null}
python fish_speech/models/text2semantic/inference.py \
    --text "The text you want to convert to speech" \
    --prompt-text "Transcription of your reference audio" \
    --prompt-tokens "fake.npy" \
    --compile
```

**Parameters:**

* `--text`: The text to synthesize
* `--prompt-text`: Transcription of the reference audio (for voice cloning)
* `--prompt-tokens`: Path to VQ tokens from Step 1 (for voice cloning)
* `--compile`: Enable `torch.compile` kernel fusion for faster inference (\~10x speedup on an RTX 4090)

<Note>
  For random voice generation, omit `--prompt-text` and `--prompt-tokens` parameters.
</Note>

This creates a file named `codes_N.npy` (where N starts from 0) containing semantic tokens.
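
Because the index N changes between runs, a script that chains the steps may want to pick up the newest token file automatically. The following is a minimal sketch that assumes the `codes_N.npy` files sit in a single directory. `latest_codes_file` is a hypothetical helper name.

```python
import re
from pathlib import Path


def latest_codes_file(directory="."):
    """Return the codes_N.npy path with the highest N, or None if none exist."""
    best_index, best_path = -1, None
    for path in Path(directory).iterdir():
        match = re.fullmatch(r"codes_(\d+)\.npy", path.name)
        # Compare N numerically so that codes_10.npy ranks above codes_2.npy.
        if match and int(match.group(1)) > best_index:
            best_index, best_path = int(match.group(1)), path
    return best_path
```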

<Warning>
  For GPUs that don't support bf16 (bfloat16), add the `--half` flag to use fp16 instead.
</Warning>

### Step 3: Generate Audio from Semantic Tokens

Finally, convert semantic tokens to audio:

```bash theme={null}
python fish_speech/models/dac/inference.py \
    -i "codes_0.npy"
```

This generates the final audio file.

### Full Example

Here's a complete workflow for voice cloning:

```bash theme={null}
# 1. Encode reference audio
python fish_speech/models/dac/inference.py \
    -i "my_voice.wav" \
    --checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth"

# 2. Generate semantic tokens
python fish_speech/models/text2semantic/inference.py \
    --text "Hello, this is a test of voice cloning." \
    --prompt-text "This is my reference voice recording." \
    --prompt-tokens "fake.npy" \
    --compile

# 3. Generate final audio
python fish_speech/models/dac/inference.py \
    -i "codes_0.npy"
```
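
The same three steps can be driven from Python when they need to run inside a larger script. This is a sketch under stated assumptions: it reuses the exact commands and file names from the example above (`fake.npy`, `codes_0.npy`), while the function names `voice_clone_commands` and `run_pipeline` are hypothetical.

```python
import subprocess

CHECKPOINT_DIR = "checkpoints/openaudio-s1-mini"


def voice_clone_commands(reference_wav, reference_text, text,
                         checkpoint_dir=CHECKPOINT_DIR):
    """Build the three commands of the workflow as argument lists."""
    codec = f"{checkpoint_dir}/codec.pth"
    return [
        # 1. Encode reference audio
        ["python", "fish_speech/models/dac/inference.py",
         "-i", reference_wav, "--checkpoint-path", codec],
        # 2. Generate semantic tokens
        ["python", "fish_speech/models/text2semantic/inference.py",
         "--text", text, "--prompt-text", reference_text,
         "--prompt-tokens", "fake.npy", "--compile"],
        # 3. Generate final audio (assumes step 2 wrote codes_0.npy)
        ["python", "fish_speech/models/dac/inference.py", "-i", "codes_0.npy"],
    ]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in order; check=True stops at the first failure."""
    for command in commands:
        runner(command, check=True)
```

Passing argument lists instead of a shell string avoids quoting problems when the text contains quotes or parentheses.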

## HTTP API Inference

The HTTP API provides a programmatic interface for integrations and production deployments.

### Start API Server

```bash theme={null}
# With local installation
python -m tools.api_server \
    --listen 0.0.0.0:8080 \
    --llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
    --decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
    --decoder-config-name modded_dac_vq

# With uv
uv run tools/api_server.py \
    --listen 0.0.0.0:8080 \
    --llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
    --decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
    --decoder-config-name modded_dac_vq
```

<Tip>
  Add the `--compile` flag to enable torch.compile optimization for faster inference.
</Tip>

### Access API Documentation

Once the server is running, access the interactive API documentation at:

```
http://localhost:8080/docs
```

The API provides endpoints for:

* Text-to-speech synthesis
* Voice cloning with reference audio
* Batch processing
* Model information

### Example API Request

```bash theme={null}
curl -X POST "http://localhost:8080/v1/tts" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, this is a test",
    "reference_audio": "base64_encoded_audio",
    "reference_text": "Reference transcription"
  }'
```
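
The request above can also be issued from Python using only the standard library. The sketch mirrors the field names of the curl example. The helper names are hypothetical, and treating the response body as raw audio bytes is an assumption, so check the interactive documentation at `/docs` for the exact schema of your server version.

```python
import base64
import json
import urllib.request


def build_tts_request(text, reference_audio=None, reference_text=None,
                      base_url="http://localhost:8080"):
    """Build a POST request for /v1/tts with the fields from the curl example."""
    payload = {"text": text}
    if reference_audio is not None:
        # The reference audio travels as a base64 string inside the JSON body.
        payload["reference_audio"] = base64.b64encode(reference_audio).decode("ascii")
        payload["reference_text"] = reference_text or ""
    return urllib.request.Request(
        f"{base_url}/v1/tts",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(request, output_path, timeout=120):
    """Send the request and write the response body to disk (assumed to be audio)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read()
    with open(output_path, "wb") as handle:
        handle.write(body)
    return len(body)
```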

## WebUI Inference

The WebUI provides an intuitive interface for interactive testing and development.

### Start WebUI

```bash theme={null}
# With all parameters
python -m tools.run_webui \
    --llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
    --decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
    --decoder-config-name modded_dac_vq

# Or use defaults (auto-detects models in checkpoints/)
python -m tools.run_webui
```

<Tip>
  Add the `--compile` flag for faster inference during interactive sessions.
</Tip>

### Access WebUI

The WebUI starts on port 7860 by default. Access it at:

```
http://localhost:7860
```

### Configure with Environment Variables

Customize the WebUI using Gradio environment variables:

```bash theme={null}
# Enable public sharing (Gradio documents "True"/"False" as the accepted values)
GRADIO_SHARE=True python -m tools.run_webui

# Change server port
GRADIO_SERVER_PORT=8080 python -m tools.run_webui

# Listen on all network interfaces instead of localhost only
GRADIO_SERVER_NAME=0.0.0.0 python -m tools.run_webui
```
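
When the WebUI is launched from a script, the same variables can be set on the child process instead of the shell. A minimal sketch, assuming only the port and host variables shown above. `webui_env` is a hypothetical helper name.

```python
import os


def webui_env(port=None, host=None, base_env=None):
    """Return a copy of the environment with the Gradio overrides applied."""
    env = dict(os.environ if base_env is None else base_env)
    if port is not None:
        env["GRADIO_SERVER_PORT"] = str(port)  # environment values must be strings
    if host is not None:
        env["GRADIO_SERVER_NAME"] = host
    return env
```

The result can be passed to `subprocess.run(["python", "-m", "tools.run_webui"], env=webui_env(port=8080, host="0.0.0.0"))`.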

### Using Reference Audio Library

For a faster workflow, save your reference audio ahead of time:

1. Create a `references/` directory in the project root
2. Create subdirectories named by voice ID: `references/<voice_id>/`
3. Place files in each subdirectory:
   * `sample.wav` - Reference audio file
   * `sample.lab` - Text transcription of the audio

Example structure:

```
references/
├── alice/
│   ├── sample.wav
│   └── sample.lab
└── bob/
    ├── sample.wav
    └── sample.lab
```

These references will appear as selectable options in the WebUI.
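
The layout above can be created with a few lines of Python. This sketch copies an existing recording into `references/<voice_id>/sample.wav` and writes the transcription next to it. `add_reference` is a hypothetical helper, and UTF-8 encoding for the `.lab` file is an assumption.

```python
import shutil
from pathlib import Path


def add_reference(voice_id, audio_path, transcript, root="references"):
    """Create references/<voice_id>/ containing sample.wav and sample.lab."""
    voice_dir = Path(root) / voice_id
    voice_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(audio_path, voice_dir / "sample.wav")
    # The .lab file holds the plain-text transcription of the audio.
    (voice_dir / "sample.lab").write_text(transcript.strip(), encoding="utf-8")
    return voice_dir
```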

## GUI Inference

For users who prefer a native desktop application, a PyQt6-based GUI is available.

### Download GUI Client

Download the latest release from the [Fish Speech GUI repository](https://github.com/AnyaCoder/fish-speech-gui/releases).

**Supported platforms:**

* Linux
* Windows
* macOS

### Connect to API Server

The GUI client connects to a running API server (see [HTTP API Inference](#http-api-inference) above).

1. Start the API server
2. Launch the GUI client
3. Configure the API endpoint (default: `http://localhost:8080`)
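
If the GUI cannot connect, a quick way to narrow down the cause is to test whether anything is listening on the API port at all. The sketch below only checks TCP reachability and says nothing about whether the server answers requests correctly. `api_server_reachable` is a hypothetical helper name.

```python
import socket


def api_server_reachable(host="localhost", port=8080, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```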

## Docker Inference

If you're using Docker deployment, refer to the [Docker Deployment guide](/developer-guide/self-hosting/docker-deployment) for detailed instructions on:

* Running pre-built WebUI containers
* Running pre-built API server containers
* Customizing container configuration
* Volume mounts for models and references

Quick example:

```bash theme={null}
# Start WebUI with Docker
docker run -d \
    --name fish-speech-webui \
    --gpus all \
    -p 7860:7860 \
    -v ./checkpoints:/app/checkpoints \
    -v ./references:/app/references \
    -e COMPILE=1 \
    fishaudio/fish-speech:latest-webui-cuda
```

## Performance Optimization

### Enable Compilation

Torch compilation provides \~10x speedup on compatible GPUs:

```bash theme={null}
# Add --compile flag to any inference command
python -m tools.api_server --compile ...
```

<Warning>
  Compilation requires:

  * CUDA-compatible GPU
  * Triton library (not supported on Windows/macOS)

  The first run is slower because the kernels are compiled before inference starts.
</Warning>
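
A launcher script can decide whether to pass `--compile` based on these requirements. This is a rough sketch: it checks the operating system and whether a `triton` package is importable, and it does not verify that a CUDA GPU is present. `compile_flags` is a hypothetical helper name.

```python
import importlib.util
import platform


def compile_flags(system=None, has_triton=None):
    """Return ["--compile"] when the platform looks compatible, else []."""
    if system is None:
        system = platform.system()  # "Linux", "Windows", or "Darwin" (macOS)
    if has_triton is None:
        has_triton = importlib.util.find_spec("triton") is not None
    if system in ("Windows", "Darwin") or not has_triton:
        return []
    return ["--compile"]
```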

### Use Mixed Precision

For GPUs without bf16 support, use fp16:

```bash theme={null}
python fish_speech/models/text2semantic/inference.py --half ...
```
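
Whether `--half` is needed can be detected at runtime with PyTorch. The sketch below uses `torch.cuda.is_bf16_supported()` and returns no flag when PyTorch or a CUDA device is unavailable. `precision_flags` is a hypothetical helper name, and the optional argument exists so the decision can be overridden or tested without a GPU.

```python
def precision_flags(bf16_supported=None):
    """Return ["--half"] when the GPU lacks bf16 support, else []."""
    if bf16_supported is None:
        try:
            import torch
        except ImportError:
            return []  # PyTorch is not installed, so nothing can be detected
        if not torch.cuda.is_available():
            return []  # no CUDA device to query
        bf16_supported = torch.cuda.is_bf16_supported()
    return [] if bf16_supported else ["--half"]
```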

### Batch Processing

For multiple audio generations, use batch processing to amortize model loading overhead:

```python theme={null}
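# Example batch processing script: load the model once, then reuse it for every text
import fish_speech

model = fish_speech.load_model("checkpoints/openaudio-s1-mini")

texts = ["First sentence", "Second sentence", "Third sentence"]
# enumerate() gives each output a unique index, even if the same text appears twice
for index, text in enumerate(texts):
    audio = model.synthesize(text)
    audio.save(f"output_{index}.wav")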
```

## Emotion Control

Fish Audio S1 supports emotion markers for expressive speech synthesis. Write each marker in parentheses inside the input text, directly before the words it applies to:

### Basic Emotions

```
(angry) (sad) (excited) (surprised) (satisfied) (delighted)
(scared) (worried) (upset) (nervous) (frustrated) (depressed)
(empathetic) (embarrassed) (disgusted) (moved) (proud) (relaxed)
(grateful) (confident) (interested) (curious) (confused) (joyful)
```

### Advanced Emotions

```
(disdainful) (unhappy) (anxious) (hysterical) (indifferent)
(impatient) (guilty) (scornful) (panicked) (furious) (reluctant)
(keen) (disapproving) (negative) (denying) (astonished) (serious)
(sarcastic) (conciliative) (comforting) (sincere) (sneering)
(hesitating) (yielding) (painful) (awkward) (amused)
```

### Tone Markers

```
(in a hurry tone) (shouting) (screaming) (whispering) (soft tone)
```

### Special Effects

```
(laughing) (chuckling) (sobbing) (crying loudly) (sighing) (panting)
(groaning) (crowd laughing) (background laughter) (audience laughing)
```

### Example Usage

```bash theme={null}
python fish_speech/models/text2semantic/inference.py \
    --text "(excited)This is amazing! (laughing)Ha ha ha!" \
    --compile
```
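
When text is assembled programmatically, a small helper can catch misspelled markers before they reach the model. The marker lists below are copied from the sections above. The helper name `tag` and the choice to reject unknown markers are illustrative assumptions.

```python
import re

# Marker lists copied verbatim from this page.
_MARKER_TEXT = """
(angry) (sad) (excited) (surprised) (satisfied) (delighted)
(scared) (worried) (upset) (nervous) (frustrated) (depressed)
(empathetic) (embarrassed) (disgusted) (moved) (proud) (relaxed)
(grateful) (confident) (interested) (curious) (confused) (joyful)
(disdainful) (unhappy) (anxious) (hysterical) (indifferent)
(impatient) (guilty) (scornful) (panicked) (furious) (reluctant)
(keen) (disapproving) (negative) (denying) (astonished) (serious)
(sarcastic) (conciliative) (comforting) (sincere) (sneering)
(hesitating) (yielding) (painful) (awkward) (amused)
(in a hurry tone) (shouting) (screaming) (whispering) (soft tone)
(laughing) (chuckling) (sobbing) (crying loudly) (sighing) (panting)
(groaning) (crowd laughing) (background laughter) (audience laughing)
"""
KNOWN_MARKERS = frozenset(re.findall(r"\(([^)]+)\)", _MARKER_TEXT))


def tag(marker, text):
    """Prefix text with a parenthesized marker, rejecting unknown markers."""
    if marker not in KNOWN_MARKERS:
        raise ValueError(f"unknown marker: {marker!r}")
    return f"({marker}){text}"
```

Joining `tag("excited", "This is amazing!")` and `tag("laughing", "Ha ha ha!")` with a space reproduces the `--text` value used in the command above.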

<Info>
  Emotion control is currently supported for English, Chinese, and Japanese. More languages coming soon!
</Info>

For more details, see the [Emotion Reference](/api-reference/emotion-reference).

## Troubleshooting

### Out of Memory Errors

If you encounter CUDA out of memory errors:

1. Reduce input text length
2. Use the `--half` flag for fp16 inference
3. Close other GPU applications
4. Use a smaller batch size
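
Reducing input length usually means splitting long text into smaller pieces and synthesizing them one at a time. The sketch below packs whole sentences into chunks of at most `max_chars` characters. The 200-character default is an arbitrary assumption, the splitting rule only recognizes `.`, `!`, and `?` followed by whitespace, and a single sentence longer than the limit is kept intact.

```python
import re


def split_text(text, max_chars=200):
    """Split text into sentence-aligned chunks of roughly max_chars or less."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # The current chunk is full, so start a new one with this sentence.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```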

### Slow Inference

To improve speed:

1. Enable the `--compile` flag
2. Verify GPU is being used (check with `nvidia-smi`)
3. Ensure CUDA version matches PyTorch installation
4. Use fp16 instead of bf16 on older GPUs

### Poor Audio Quality

For better quality:

1. Use high-quality reference audio (clear, no background noise)
2. Ensure reference text accurately matches reference audio
3. Use 10-30 seconds of reference audio
4. See [Voice Cloning Best Practices](/developer-guide/best-practices/voice-cloning)
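
The 10-30 second guideline can be checked automatically before a reference is used. This sketch relies on Python's standard `wave` module, so it only handles uncompressed PCM WAV files. Both function names are hypothetical.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def reference_length_ok(path, min_seconds=10.0, max_seconds=30.0):
    """Check that the reference audio falls inside the recommended range."""
    return min_seconds <= wav_duration_seconds(path) <= max_seconds
```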

### Model Loading Errors

If models fail to load:

1. Verify model weights are downloaded completely
2. Check checkpoint paths are correct
3. Ensure sufficient disk space
4. Re-download weights if corrupted

## Next Steps

* **[Emotion Control Best Practices](/developer-guide/best-practices/emotion-control)** - Master expressive speech
* **[Voice Cloning Best Practices](/developer-guide/best-practices/voice-cloning)** - Optimize voice cloning quality
* **[API Reference](/api-reference/introduction)** - Integrate with your applications
* **[Cloud API](https://fish.audio)** - Compare with managed service performance
