Fully local voice assistant with GPU acceleration. No cloud services required.
[Microphone] → Whisper STT (GPU) → Ollama LLM → Piper TTS → [Speakers]
- Whisper (faster-whisper) — Speech-to-text on RTX 2070 Super
- Ollama — Local LLM inference
- Piper — Text-to-speech synthesis
- Wyoming Protocol — Service communication
- Dell G7 7700 / Intel i7-10750H / 32GB RAM / Nvidia RTX 2070 Super (8GB VRAM) / Windows 11
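
The LLM stage of the pipeline above can be exercised on its own, without the microphone or speakers. The following is a minimal sketch, assuming Ollama's standard `/api/generate` HTTP route on the port used in this setup. The model tag `llama3` and the helper names `build_request` and `ask_ollama` are illustrative assumptions, not part of this project's scripts.

```python
import json
import urllib.request

# Base address from this README plus Ollama's generate route.
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"


def build_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Build a non-streaming generate request (model tag is an assumption)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_ollama(prompt: str, model: str = "llama3", timeout: float = 60.0) -> str:
    """Send text (e.g. a Whisper transcript) to Ollama and return the reply."""
    request = build_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    # With "stream": False the reply text arrives in a single "response" field.
    return body["response"]
```

With the services running, `print(ask_ollama("Say hello in one short sentence."))` should print a reply. Substitute whichever tag `ollama list` reports on your machine.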
- Double-click start-services.bat
- Wait for all three windows to show “Ready” (~10 seconds)
- Press and hold spacebar, speak your question
- Release spacebar when done
- Wait for AI response (plays automatically)
To stop: double-click stop-services.bat or close all three service windows.
| Service | Endpoint |
|---|---|
| Whisper STT | tcp://127.0.0.1:10300 |
| Piper TTS | tcp://127.0.0.1:10200 |
| Ollama | http://127.0.0.1:11434 |
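
To confirm that all three services are actually listening before testing the assistant, a plain TCP connect to each endpoint is enough. This is a rough sketch using only the standard library. The helper names are hypothetical, and it only checks that a port accepts connections, not that the service behind it is healthy.

```python
import socket
from typing import Dict, Tuple
from urllib.parse import urlparse

# Endpoints copied from the table above.
ENDPOINTS = {
    "Whisper STT": "tcp://127.0.0.1:10300",
    "Piper TTS": "tcp://127.0.0.1:10200",
    "Ollama": "http://127.0.0.1:11434",
}


def host_port(endpoint: str) -> Tuple[str, int]:
    """Split an endpoint URL into (host, port)."""
    parsed = urlparse(endpoint)
    return parsed.hostname, parsed.port


def is_listening(endpoint: str, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to the endpoint succeeds."""
    try:
        with socket.create_connection(host_port(endpoint), timeout=timeout):
            return True
    except OSError:
        return False


def report() -> Dict[str, bool]:
    """Map each service name to whether its port accepts connections."""
    return {name: is_listening(url) for name, url in ENDPOINTS.items()}
```

Calling `report()` returns one boolean per service. A `False` entry points at the service window to inspect first.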
| Model | Category | Recommended Quantization | Notes |
|---|---|---|---|
| Llama 3 8B Instruct | General / Chat | Q4_K_M | Best default brain, excellent EN/ES |
| Mistral 7B Instruct v0.3 | Reasoning | Q4_K_M | Coherent, strong context retention |
| Gemma 7B Instruct | Conversational | Q4_K_M | Warm tone, good multilingual |
| Qwen 2 7B Instruct | Bilingual EN+ES | Q4_K_M | Best for Pepa-style personality |
| Yi 1.5 7B Chat | Creative / Narrative | Q4_K_M | Good for informal dialogue |
| Model | VRAM | Notes |
|---|---|---|
| Phi-3 Mini (3.8B) | ~4 GB | Fast + smart, ideal for quick assistants |
| Gemma 2B | ~3 GB | Lightweight CPU+GPU mix |
| TinyLlama 1.1B | < 2 GB | Experiments only |
- Default: Llama 3 8B Instruct (Q4_K_M)
- Bilingual EN↔ES: Qwen 2 7B Instruct
- Tight VRAM: Phi-3 Mini (3.8B)
- Services won’t start: verify that D:\Pepa\venv exists, run ollama list to confirm Ollama responds, and check Device Manager for the GPU.
- No audio: run on bare metal (not over RDP) and check Windows audio settings.
- Slow responses: check GPU usage in Task Manager and verify that Whisper is using CUDA.
- “No speech detected”: speak closer to the mic and keep the spacebar held for the whole question.
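
The first checks in the list above can be scripted. The sketch below is an illustration rather than part of the project: the helper names are hypothetical, it assumes `ollama list` prints a header row followed by one model per line, and it assumes `nvidia-smi` is on the PATH when the Nvidia driver is installed.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional

VENV_DIR = Path(r"D:\Pepa\venv")  # path from the troubleshooting list


def venv_exists(venv_dir: Path = VENV_DIR) -> bool:
    """Check that the virtual environment directory is present."""
    return venv_dir.is_dir()


def run_cli(args: List[str], timeout: float = 30.0) -> Optional[str]:
    """Run a command; return its stdout, or None if it is missing or fails."""
    exe = shutil.which(args[0])
    if exe is None:
        return None
    try:
        result = subprocess.run(
            [exe, *args[1:]], capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout if result.returncode == 0 else None


def ollama_models() -> Optional[List[str]]:
    """Model names reported by `ollama list`, or None if the CLI is unavailable."""
    out = run_cli(["ollama", "list"])
    if out is None:
        return None
    rows = out.strip().splitlines()[1:]  # skip the header row (assumed layout)
    return [row.split()[0] for row in rows if row.strip()]


def gpu_summary() -> Optional[str]:
    """GPU name and memory use from nvidia-smi, or None if it cannot be run."""
    out = run_cli(
        ["nvidia-smi", "--query-gpu=name,memory.used,memory.total",
         "--format=csv,noheader"]
    )
    return out.strip() if out else None
```

A `None` from `ollama_models()` or `gpu_summary()` suggests the tool is not installed or not on the PATH. An empty model list means Ollama is running but no model has been pulled yet.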
| Component | Latency |
|---|---|
| Whisper (base, GPU) | ~0.5–1s |
| Ollama (3B model) | ~1–2s |
| Piper TTS | ~0.2–0.5s |
| Total end-to-end | ~1.7–3.5s |
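
As a sanity check on the latency budget, the per-stage ranges can be summed. This small sketch assumes the three stages run strictly one after another, with no streaming overlap between them. Values are in milliseconds to keep the arithmetic exact.

```python
from typing import Dict, Tuple

# (min_ms, max_ms) per stage, taken from the per-component rows of the table.
STAGES_MS: Dict[str, Tuple[int, int]] = {
    "Whisper (base, GPU)": (500, 1000),
    "Ollama (3B model)": (1000, 2000),
    "Piper TTS": (200, 500),
}


def sequential_total_ms(stages: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum best-case and worst-case latencies for stages run back to back."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


best_ms, worst_ms = sequential_total_ms(STAGES_MS)
print(f"{best_ms / 1000:.1f}s to {worst_ms / 1000:.1f}s")  # prints "1.7s to 3.5s"
```

Streaming the LLM output into TTS sentence by sentence could reduce the perceived delay, but that is outside what this estimate covers.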