Building a Voice AI Assistant for Windows

7 min read by Eddie Chongtham
Python, AI, Voice, Windows, GPT-4o

The first version of Hey Girl was a 200-line Python script. It listened for a wake word, sent audio to Whisper for transcription, got a response from GPT-4, and played it back through a speaker. It worked. It was also very fragile.

The Architecture

The real challenge is building a voice assistant that feels responsive. Latency is the enemy. Between wake word detection, speech-to-text, LLM inference, and text-to-speech, you can easily hit 3–5 seconds of delay. We spent more time on latency reduction than on any other feature.

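# Simplified pipeline
class VoiceAgent:
    def __init__(self, recorder, stt, llm, tts, memory):
        # Stages are passed in here for illustration; the real wiring is not shown.
        self.recorder = recorder
        self.stt = stt
        self.llm = llm
        self.tts = tts
        self.memory = memory

    def run(self):
        while True:
            audio = self.recorder.listen_for_wake_word()  # returns audio once the wake word fires
            text = self.stt.transcribe(audio)              # speech-to-text
            reply = self.llm.respond(text, self.memory.context())
            self.tts.speak(reply)                          # text-to-speech playback
            self.memory.append(text, reply)                # record the turn for later context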
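
To find out where those seconds go, it helps to time each stage on its own before optimizing anything. Here is a minimal sketch of that idea. It is not Hey Girl's actual instrumentation: `StageTimer`, `timed_turn`, and the `stt`/`llm`/`tts` callables are hypothetical stand-ins for the real components.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Collects wall-clock durations per named pipeline stage (illustrative helper)."""

    def __init__(self):
        self.durations = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Accumulate, so a stage that runs more than once is summed.
            elapsed = time.perf_counter() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def total(self):
        return sum(self.durations.values())

    def slowest(self):
        return max(self.durations, key=self.durations.get)


def timed_turn(stt, llm, tts, audio, timer):
    """Run one conversational turn, timing each stage.

    stt, llm, and tts are placeholder callables, not real library APIs.
    """
    with timer.stage("stt"):
        text = stt(audio)
    with timer.stage("llm"):
        reply = llm(text)
    with timer.stage("tts"):
        tts(reply)
    return reply
```

After a few turns, `timer.slowest()` names the stage worth attacking first, and `timer.total()` gives the end-to-end delay the user actually feels.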

Memory

For Hey Girl to be genuinely useful, it needs to remember context across sessions. We built a JSON-based memory system that stores conversation summaries, user preferences, and task context. Before each LLM call, relevant memories are injected into the system prompt.
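
A JSON-backed memory of this kind could look roughly like the sketch below. The file layout, the `JsonMemory` class, and the keyword-overlap ranking are all assumptions made for illustration; the post does not say how Hey Girl decides which memories are relevant.

```python
import json
from pathlib import Path


class JsonMemory:
    """Persists summaries and preferences to a JSON file between sessions (hypothetical)."""

    def __init__(self, path):
        self.path = Path(path)
        if self.path.exists():
            self.data = json.loads(self.path.read_text(encoding="utf-8"))
        else:
            self.data = {"summaries": [], "preferences": {}}

    def save(self):
        self.path.write_text(json.dumps(self.data, indent=2), encoding="utf-8")

    def add_summary(self, text, keywords):
        entry = {"text": text, "keywords": sorted({k.lower() for k in keywords})}
        self.data["summaries"].append(entry)
        self.save()

    def set_preference(self, key, value):
        self.data["preferences"][key] = value
        self.save()

    def relevant(self, query, limit=3):
        """Rank summaries by keyword overlap with the query (deliberately naive)."""
        words = set(query.lower().split())
        scored = [(len(words & set(s["keywords"])), s["text"])
                  for s in self.data["summaries"]]
        scored = [item for item in scored if item[0] > 0]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [text for _, text in scored[:limit]]

    def system_prompt(self, base, query):
        """Build a system prompt with preferences and matching memories appended."""
        lines = [base]
        if self.data["preferences"]:
            lines.append("User preferences: " + json.dumps(self.data["preferences"]))
        for text in self.relevant(query):
            lines.append("Memory: " + text)
        return "\n".join(lines)
```

Injecting only the matching summaries keeps the prompt short, which matters here because every extra token adds to the latency budget.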

Telegram Integration

One of the most useful features turned out to be Telegram. Hey Girl can send and receive Telegram messages, which means you can interact with it from your phone even when you're away from your PC. The bi-directional bridge uses the Telegram Bot API with long-polling.
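
For readers who have not used long-polling before, here is a minimal sketch of such a bridge using only the standard library. `getUpdates` and `sendMessage` are real Bot API methods; the function names, the `handle_text` callback, and the timeout values are assumptions, and this is not Hey Girl's actual bridge code.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.telegram.org"


def call_bot_api(token, method, params, http_timeout=40):
    """POST form-encoded params to a Bot API method and return its 'result' field."""
    url = f"{API_ROOT}/bot{token}/{method}"
    body = urllib.parse.urlencode(params).encode("utf-8")
    with urllib.request.urlopen(url, data=body, timeout=http_timeout) as resp:
        payload = json.load(resp)
    if not payload.get("ok"):
        raise RuntimeError(payload.get("description", "Bot API call failed"))
    return payload["result"]


def poll_once(call, handle_text, offset=0, poll_timeout=30):
    """Fetch one batch of updates, reply to text messages, return the next offset."""
    # With timeout > 0 the server holds the request open until an update arrives.
    updates = call("getUpdates", {"offset": offset, "timeout": poll_timeout})
    for update in updates:
        # Requesting update_id + 1 next time confirms this update as processed.
        offset = max(offset, update["update_id"] + 1)
        message = update.get("message") or {}
        text = message.get("text")
        if text is None:
            continue  # skip non-text updates such as photos or stickers
        reply = handle_text(text)
        call("sendMessage", {"chat_id": message["chat"]["id"], "text": reply})
    return offset


def run_bridge(token, handle_text):
    """Blocking loop; handle_text stands in for the assistant's reply function."""
    offset = 0

    def call(method, params):
        return call_bot_api(token, method, params)

    while True:
        offset = poll_once(call, handle_text, offset)
```

Passing `call` into `poll_once` keeps the polling logic testable without a network connection. The HTTP timeout is set slightly above the poll timeout so the socket does not give up before Telegram answers.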

Lessons Learned

The biggest lesson: voice UX is completely different from text UX. Users don't tolerate 2-second pauses. Every optimization matters: streaming responses, local TTS, pre-buffering audio. Also, error handling in voice apps needs to be graceful, because the user has no visual feedback when something goes wrong.
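
One way to combine streaming with early playback is to cut the LLM's token stream into sentences and hand each one to TTS as soon as it is complete, rather than waiting for the full reply. The sketch below shows only that splitting step, under the assumption that the LLM client yields plain text chunks; `sentences_from_stream` is a hypothetical helper, not part of Hey Girl.

```python
import re

# Split after ., !, or ? when followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(chunks):
    """Yield complete sentences as soon as they appear in a stream of text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # The last part may be an unfinished sentence, so keep it buffered.
        buffer = parts.pop()
        for sentence in parts:
            if sentence:
                yield sentence
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer.strip()
```

Usage would be along the lines of `for sentence in sentences_from_stream(llm_chunks): tts.speak(sentence)`, so the first sentence is already playing while the rest is still being generated. The regex is naive on purpose: abbreviations such as "Dr. Smith" will be split in the wrong place.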
