Set up speech services with Speaches

Speaches is an OpenAI-compatible speech server for speech-to-text (STT) and text-to-speech (TTS). With pre-loaded models, you can use it right out of the box, or easily integrate it as a drop-in backend for any app supporting the OpenAI SDK.

This guide walks you through installing and using Speaches on Olares, including speech-to-text, text-to-speech, Audio Chat, API access, and basic model management.

Learning objectives

In this guide, you will learn how to:

Install Speaches on Olares.
Transcribe or translate audio files using speech-to-text.
Generate speech from text using text-to-speech.
Have voice conversations with an AI model using Audio Chat.
Access the Speaches API from other apps.
Manage speech models.

Prerequisites

Olares is running on a device with an NVIDIA GPU.
Ollama is installed and running with at least one chat model downloaded (required for Audio Chat only).

Install Speaches

Open Market and search for "Speaches".
Click Get, then Install, and wait for installation to complete.

After installation, you will see two icons on Launchpad:

Speaches: The main interface for speech-to-text, text-to-speech, and audio chat.
Speaches Terminal: A command-line terminal for managing models.

Model setup on first launch

When you open Speaches for the first time, it downloads and initializes its built-in models. Depending on your network connection, this process may take some time.

If initialization does not finish within 30 minutes, it may time out and be canceled automatically. If this happens, wait until your network connection is stable, then open Speaches again to retry initialization.

Use Speaches

Speaches ships with two models ready to use out of the box:

Model	Type	Purpose
`Systran/faster-whisper-small`	STT	Speech recognition and translation
`speaches-ai/Kokoro-82M-v1.0-ONNX`	TTS	Speech synthesis

Transcribe audio

Open Speaches and click the Speech-to-Text tab.
Under Model, select an STT model, such as Systran/faster-whisper-small.
Under Task, select transcribe.
Upload an audio file or click the microphone icon to record audio from your microphone.
(Optional) Enable Stream if you want to receive partial results while transcription is still in progress.
Click Generate.

The transcription appears in Textbox after processing completes.
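Because Speaches is OpenAI-compatible, the same transcription can likely be requested from a script. The following is a minimal sketch, not a documented Olares procedure: the `SPEACHES_BASE_URL` default and the `transcribe` helper name are assumptions for illustration, so replace the URL with the endpoint of your own Speaches instance.

```shell
# Assumption: placeholder endpoint, not a value taken from this guide.
SPEACHES_BASE_URL="${SPEACHES_BASE_URL:-http://localhost:8000}"

# Hypothetical helper: upload an audio file to the OpenAI-style
# transcription endpoint and print the server's response.
#   $1: path to the audio file
#   $2: model ID (defaults to the pre-installed STT model)
transcribe() {
  curl -sS "${SPEACHES_BASE_URL%/}/v1/audio/transcriptions" \
    -F "file=@$1" \
    -F "model=${2:-Systran/faster-whisper-small}"
}

# Usage (needs a reachable server and a local audio file):
#   transcribe ./meeting.wav
```

Nothing is sent until you call `transcribe`, so the snippet is safe to paste into a shell profile or script first.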

Translate audio to English

Speaches can automatically detect the language of the audio and translate it into English.

Open Speaches and click the Speech-to-Text tab.
Under Model, select an STT model, such as Systran/faster-whisper-small.
Under Task, select translate.
Upload an audio file or click the microphone icon to record audio from your microphone.
(Optional) Enable Stream if you want to receive partial results while translation is still in progress.
Click Generate.

The English translation appears in Textbox after processing completes.
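For scripted use, translation should be reachable through the OpenAI-style translations endpoint. This is a hedged sketch: the `SPEACHES_BASE_URL` default and the `translate_to_english` helper name are assumptions, and `response_format=text` is requested here only so the reply is plain text rather than JSON.

```shell
# Assumption: placeholder endpoint, not a value taken from this guide.
SPEACHES_BASE_URL="${SPEACHES_BASE_URL:-http://localhost:8000}"

# Hypothetical helper: upload audio in any supported language and
# print the English translation as plain text.
#   $1: path to the audio file
#   $2: model ID (defaults to the pre-installed STT model)
translate_to_english() {
  curl -sS "${SPEACHES_BASE_URL%/}/v1/audio/translations" \
    -F "file=@$1" \
    -F "model=${2:-Systran/faster-whisper-small}" \
    -F "response_format=text"
}

# Usage (needs a reachable server and a local audio file):
#   translate_to_english ./interview-fr.wav
```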

Generate speech from text

Open Speaches and click the Text-to-Speech tab.
Enter the text you want to convert in Input Text.
Under Model, select a TTS model.
Select a voice from Voice.
Under Response Format, select an output format.
Click Generate Speech.
Play the generated audio and download it if needed.
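The same options (text, model, voice, and response format) map onto the OpenAI-style speech endpoint. The sketch below is illustrative only: the `SPEACHES_BASE_URL` default, the helper names, and the `af_heart` voice are assumptions, so pick a voice that appears in your own Voice list.

```shell
# Assumption: placeholder endpoint, not a value taken from this guide.
SPEACHES_BASE_URL="${SPEACHES_BASE_URL:-http://localhost:8000}"

# Hypothetical helper: build the JSON request body.
#   $1: input text (must not contain double quotes or backslashes,
#       because it is embedded in the JSON without escaping)
#   $2: voice (assumed default: af_heart)
#   $3: response format (default: mp3)
tts_payload() {
  printf '{"model": "%s", "voice": "%s", "input": "%s", "response_format": "%s"}\n' \
    "speaches-ai/Kokoro-82M-v1.0-ONNX" "${2:-af_heart}" "$1" "${3:-mp3}"
}

# Hypothetical helper: request speech and save it to a file.
#   $1: input text
#   $2: output file (default: speech.mp3)
speak() {
  tts_payload "$1" |
    curl -sS "${SPEACHES_BASE_URL%/}/v1/audio/speech" \
      -H "Content-Type: application/json" \
      --data @- \
      --output "${2:-speech.mp3}"
}

# Usage (needs a reachable server):
#   speak "Hello from Olares" hello.mp3
```

Keeping the payload in its own function makes it easy to inspect the request with `tts_payload "Hello"` before sending anything.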

Chat with AI using voice

Use Audio Chat to talk to an AI model with voice, text, or an audio file. Speaches first converts your voice to text, sends the text to the chat model, and can convert the reply back to speech.

INFO

Audio Chat requires Ollama to be installed, with at least one chat model downloaded.
Audio playback is currently available for English replies only. For other languages, the reply is shown as text only.

Start a voice conversation

Open Speaches and click the Audio Chat tab.
Under Chat Model, select an Ollama model, such as qwen2.5:7b.
Send a message using one of these methods:
- Audio file: Upload an audio file.
- Text: Type your message in the input field next to the microphone icon and send it.
- Voice: Click the microphone icon to record your message, then click the send icon to send it.
Wait for Speaches to generate the reply.

WARNING

The full voice pipeline (STT, LLM, TTS) takes time to complete, and the UI might flicker during processing. Do not refresh the page while a reply is being generated.
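To make the three stages concrete, the pipeline can be approximated from the outside with three HTTP calls. This is a rough sketch under stated assumptions, not how Speaches implements Audio Chat internally: both base URLs are placeholders, the `voice_chat` helper and the `af_heart` voice are hypothetical, and `jq` must be installed.

```shell
# Assumptions: placeholder endpoints, not values taken from this guide.
SPEACHES_BASE_URL="${SPEACHES_BASE_URL:-http://localhost:8000}"
CHAT_BASE_URL="${CHAT_BASE_URL:-http://localhost:11434/v1}"

# Hypothetical helper: audio in, spoken reply out.
#   $1: path to the recorded question
#   $2: output file for the spoken reply (default: reply.mp3)
voice_chat() {
  audio_in="$1"
  audio_out="${2:-reply.mp3}"

  # 1. STT: turn the recording into text.
  text=$(curl -sS "${SPEACHES_BASE_URL%/}/v1/audio/transcriptions" \
    -F "file=@${audio_in}" \
    -F "model=Systran/faster-whisper-small" | jq -r '.text')

  # 2. LLM: send the text to the chat model (jq handles JSON escaping).
  reply=$(jq -cn --arg content "$text" \
      '{model: "qwen2.5:7b", messages: [{role: "user", content: $content}]}' |
    curl -sS "${CHAT_BASE_URL%/}/chat/completions" \
      -H "Content-Type: application/json" --data @- |
    jq -r '.choices[0].message.content')

  # 3. TTS: convert the reply back to speech.
  jq -cn --arg input "$reply" \
      '{model: "speaches-ai/Kokoro-82M-v1.0-ONNX", voice: "af_heart", input: $input}' |
    curl -sS "${SPEACHES_BASE_URL%/}/v1/audio/speech" \
      -H "Content-Type: application/json" --data @- \
      --output "$audio_out"
}

# Usage (needs reachable Speaches and Ollama endpoints):
#   voice_chat ./question.wav answer.mp3
```

Each stage waits for the previous one to finish, which is why a full round trip takes noticeably longer than a single transcription.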

Optional: Improve transcription accuracy for Audio Chat

Audio Chat uses the pre-installed Systran/faster-whisper-small speech-to-text model by default. For better transcription accuracy, you can switch to a larger model such as Systran/faster-whisper-large-v3.

More GPU resources may be required

Larger models require more GPU resources. If generation tasks start failing after switching to a larger model, see Why do tasks fail after switching to a larger model.

Open Speaches Terminal and download the model:
```bash
hf download Systran/faster-whisper-large-v3
```
If you see a warning about HF_TOKEN, you can ignore it. The model download can still continue without this setting.
Go to Settings > Applications > Speaches > Manage environment variables.
Click the edit icon next to SPEACHES_WHISPER_MODEL.
Set the value to the model you downloaded, for example, Systran/faster-whisper-large-v3, then click Confirm.
Click Apply to save the changes.

Speaches restarts automatically to apply the change.

Wait for service initialization

After the app shows as running again, wait a little longer before using it, as the service may still be initializing.
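If you call Speaches from scripts, one way to avoid guessing is to poll the API until it answers. The following is a sketch under assumptions: the `SPEACHES_BASE_URL` default is a placeholder, `wait_for_speaches` is a hypothetical helper, and a successful response from the OpenAI-style `/v1/models` endpoint is treated as a sign that the service is up, which may not guarantee that every model has finished loading.

```shell
# Assumption: placeholder endpoint, not a value taken from this guide.
SPEACHES_BASE_URL="${SPEACHES_BASE_URL:-http://localhost:8000}"

# Hypothetical helper: poll the API every 10 seconds.
#   $1: maximum number of attempts (default: 30)
# Returns 0 once the server answers, 1 if it never does.
wait_for_speaches() {
  attempts="${1:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl fail on HTTP errors, so only a 2xx/3xx reply counts.
    if curl -fsS -o /dev/null "${SPEACHES_BASE_URL%/}/v1/models" 2>/dev/null; then
      echo "Speaches is responding"
      return 0
    fi
    i=$((i + 1))
    sleep 10
  done
  echo "Speaches did not respond after $attempts attempts" >&2
  return 1
}

# Usage:
#   wait_for_speaches 12 && echo "ready to send requests"
```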

Manage models

Manage models when you want to use a different model, improve quality, or free up storage space.

Check downloaded models

To see all downloaded models, open Speaches Terminal and run:

```bash
hf cache list
```

Download a new model

Open Speaches Terminal and run:

```bash
hf download <model-name>
```

For example:

```bash
# Download a larger Whisper model for higher accuracy
hf download Systran/faster-whisper-medium
# Highest accuracy Whisper model, requires more memory
hf download Systran/faster-whisper-large-v3
```

Shared model storage

Models are downloaded to Olares Files, at /Home/Huggingface/speaches/. If other apps on your Olares also use Hugging Face models, they share this directory.

Refresh the Speaches page to load the new model into the list.

Remove a model

To free up storage space, you can remove models you no longer need:

Open Speaches Terminal and run:

```bash
hf cache rm model/<model-name>
```

For example:

```bash
hf cache rm model/Systran/faster-whisper-medium
```

Refresh the Speaches page to update the model list.

Switch to CPU mode

Speaches uses GPU mode by default. If needed, you can switch it to CPU mode instead. CPU mode is slower and is mainly suitable for small tasks.

To switch to CPU mode:

Go to Settings > Applications > Speaches > Manage environment variables.
Click the edit icon next to SPEACHES_GPU, change its value to false, then click Confirm.
Click Apply to save the changes.

Speaches automatically redeploys in CPU mode.

FAQs

Why does Audio Chat show an error?

Audio Chat requires Ollama to be running with at least one chat model downloaded. If Ollama is not installed or has no models available, Audio Chat displays an error.

To fix this issue, install Ollama and download a chat model by following the Ollama guide. Speaches detects Ollama automatically, so you do not need to restart Speaches.

Why do tasks fail after switching to a larger model?

This issue usually happens when the GPU is in Memory slicing mode.

Larger models require more VRAM. If Speaches is assigned only a small amount of VRAM, generation tasks may fail after you switch to a larger model.

To fix this issue:

Increase the VRAM assigned to Speaches in Memory slicing mode.
Or switch the GPU to another mode.

For detailed instructions, see Manage GPU resources.

Can I use a different Ollama instance for Audio Chat?

Yes. Update the CHAT_COMPLETION_BASE_URL in the deployment configuration:

Open Control Hub and navigate to Browse > System > speachesserver-shared > Deployments > speaches.
Click the edit icon to edit the YAML file.
In Edit YAML, find CHAT_COMPLETION_BASE_URL, and update its value to your Ollama endpoint. Make sure the URL ends with /v1.
Go to Settings > Applications > Speaches, click Stop, then click Resume to restart Speaches.
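Before saving the new value, it can help to check the URL shape locally. The helper below is a hypothetical sketch, not part of Olares or Speaches: it only verifies that the value is an HTTP(S) URL ending in `/v1`, and the host name in the usage line is an example.

```shell
# Hypothetical helper: validate a candidate CHAT_COMPLETION_BASE_URL.
#   $1: the URL you plan to enter
# Returns 0 if it is an http(s) URL ending in /v1, otherwise 1.
check_chat_base_url() {
  case "$1" in
    http://*/v1|https://*/v1)
      echo "ok: $1"
      ;;
    *)
      echo "error: URL must start with http(s):// and end with /v1: $1" >&2
      return 1
      ;;
  esac
}

# Usage (example host, replace with your own Ollama endpoint):
#   check_chat_base_url "http://ollama.example:11434/v1"
```

If the check passes, you can also confirm that the endpoint is reachable, for example with `curl -fsS "<your-url>/models"`, since Ollama exposes an OpenAI-compatible model list under `/v1`.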

Learn more

Speaches official documentation: Full API reference and model compatibility.
Ollama: Download and run local AI models.
Open WebUI: Chat interface that can use Speaches as a speech backend.
IndexTTS2: Generate speech from text with zero-shot voice cloning.
