Voice-Commanding Kevin the Robot

Kevin Botley McCardle (factory designation K331) is a Booster K1 humanoid robot who lives in my house with my family, including a very enthusiastic 4-year-old. Here's how I taught him to respond to voice commands.

The Robot

The Booster K1 is a 22-degree-of-freedom bipedal humanoid made by Booster Robotics out of Beijing. He runs Ubuntu 22.04 and ROS 2 Humble on a Rockchip RK3588 SoC, with stereo MIPI cameras and an iFlytek 6-mic array with onboard beamforming. He can walk, wave, shake hands, dance (including dabbing and moonwalk), and lie down -- though that last one is marked "UNSTABLE" in the SDK, and I can confirm that's accurate.

I named him Kevin because the first time he stood up and started walking across the room, my reaction was "Oh no, Kevin, stop" rather than anything resembling a proper engineering callsign.

The Three-Tier Architecture

Kevin's cognitive stack has three layers:

| Layer | Hardware | Role | Latency |
| --- | --- | --- | --- |
| On-robot | RK3588 SoC | Wakeword detection, voice state machine, safety commands | <100ms |
| GPU server | RTX 4090 | Vision, speech recognition, LLM reasoning | 500ms-2s |
| Cloud | Claude API | High-level planning, escalation | 2s+ |

The critical design principle: Kevin must never be paralyzed by loss of connectivity. Safety commands (stop, freeze, mode changes) execute entirely on-robot with zero network dependency. Network-assisted features like speech recognition degrade gracefully -- Kevin says "can't hear right now" instead of crashing.
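The degrade-gracefully rule can be sketched in a few lines. This is illustrative pseudostructure, not Kevin's actual code -- the command names and the function shape are assumptions:

```python
# Sketch of the "never depend on the network for safety" rule.
# Command names and structure are illustrative, not Kevin's real modules.

LOCAL_COMMANDS = {"stop", "freeze", "mode_change"}  # always handled on-robot

def handle_command(cmd: str, network_up: bool) -> str:
    if cmd in LOCAL_COMMANDS:
        # Safety path: executes on the RK3588 with zero network dependency.
        return f"executed locally: {cmd}"
    if not network_up:
        # Network-assisted features degrade instead of crashing.
        return "can't hear right now"
    return f"forwarded to GPU server: {cmd}"
```

The key property: `handle_command("stop", network_up=False)` still succeeds, because the safety path never touches the network branch at all.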

Custom Wakewords

Kevin doesn't use a cloud-based voice assistant. Instead, he runs OpenWakeWord, an open-source wakeword engine with a clever three-stage pipeline: a shared melspectrogram model, a shared Google speech embedding model, and tiny per-phrase classification heads (~200KB each). Because the expensive feature extraction is shared, the RK3588 can run 40-60+ wakeword models simultaneously on CPU.
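The reason dozens of models fit on a CPU is worth spelling out: the expensive front-end runs once per audio frame, and each extra phrase only adds a tiny classifier on top of the shared features. Here's a toy sketch of that structure -- the feature extractor and heads are stand-ins that mirror OpenWakeWord's design, not its actual API:

```python
# Why dozens of wakewords are cheap: the expensive front-end runs once
# per audio frame, and each phrase only adds a tiny classification head.
# These functions are toy stand-ins that mirror the pipeline's shape.

def shared_features(frame):
    # Stand-in for the shared melspectrogram + speech embedding models.
    return [sum(frame) / len(frame)]  # toy one-number "embedding"

def make_head(threshold):
    # Stand-in for a ~200KB per-phrase classification head.
    return lambda feats: 1.0 if feats[0] > threshold else 0.0

heads = {"hey_kevin": make_head(0.5), "halt_robot": make_head(0.9)}

def score(frame):
    feats = shared_features(frame)  # computed once, regardless of phrase count
    return {name: head(feats) for name, head in heads.items()}  # N cheap heads
```

Adding a 41st wakeword costs one more entry in `heads`, not another full audio model.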

I trained custom models using synthetic speech data from Piper TTS's sample generator. No manual recording needed -- Google Colab generates thousands of synthetic utterances with speaker diversity and room impulse response augmentation. Training takes 30-60 minutes on a free T4 GPU.

Here's Kevin's current vocabulary:

| Always On | When Listening | Gesture Cancel |
| --- | --- | --- |
| "Hey Kevin" | "Wave" | "Kevin, stop" |
| "Halt robot" | "Shake hands" | "Stop waving" |
| "Robot halt" | "Dance" | "Hand down" |
| | "Stand up" | |
| | "Naptime" | |
| | "Go online" | |
| | "Nevermind" | |

The Voice State Machine

The wakeword models feed into a state machine that determines which commands are active at any given time:

IDLE --[hey kevin]--> LISTENING ("yes?")
LISTENING --[wave]--> WAVING (starts gesture)
LISTENING --[dance]--> DANCING (starts dabbing)
LISTENING --[shake hands]--> HANDSHAKING (extends arm)
LISTENING --[timeout 5s]--> IDLE ("nothing heard")
ANY --[halt robot]--> IDLE (EMERGENCY STOP)

"Hey Kevin" and the halt commands are always-on. The action commands only activate when Kevin is listening. Gesture-cancel commands only activate during the relevant gesture. This prevents false triggers -- "wave" in a background conversation won't make Kevin start waving unless he's been activated first.

Mode awareness adds another layer: if Kevin isn't in walk mode, action commands get deflected with "can't, not in walk mode" and the listening timer extends instead of transitioning.
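The transitions above, including the mode guard, reduce to a small class. This is a minimal sketch following the states and replies described in the article; the class itself is illustrative:

```python
# Minimal sketch of the voice state machine, including the walk-mode guard.
# States, phrases, and replies follow the article; the code is illustrative.

class VoiceFSM:
    ACTIONS = {"wave": "WAVING", "dance": "DANCING", "shake hands": "HANDSHAKING"}

    def __init__(self):
        self.state = "IDLE"

    def on_wakeword(self, phrase: str, walk_mode: bool = True) -> str:
        if phrase in ("halt robot", "robot halt"):
            self.state = "IDLE"                   # always-on emergency stop
            return "EMERGENCY STOP"
        if phrase == "hey kevin":
            self.state = "LISTENING"
            return "yes?"
        if self.state == "LISTENING" and phrase in self.ACTIONS:
            if not walk_mode:
                return "can't, not in walk mode"  # deflect, stay LISTENING
            self.state = self.ACTIONS[phrase]
            return f"starting {phrase}"
        return ""  # phrase not active in this state: no false trigger

    def on_timeout(self) -> str:
        if self.state == "LISTENING":
            self.state = "IDLE"
            return "nothing heard"
        return ""
```

Note the last `return ""`: "wave" heard while IDLE does nothing, which is exactly the background-conversation protection described above.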

Kevin Speaks Back

Kevin talks using DECTalk, the legendary text-to-speech engine from 1984. Yes, the Stephen Hawking synthesizer. I compiled it from the open-source release and deployed the binaries to the robot. The say command generates WAV files that play through PulseAudio.

When Kevin is speaking, wakeword dispatch is suppressed to prevent feedback loops. The detection model still receives audio (to keep its internal streaming state consistent), but it won't trigger commands until Kevin finishes talking.
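That suppression pattern -- keep feeding the model, gate the dispatch -- looks roughly like this. The class and names are illustrative, not Kevin's actual implementation:

```python
# Sketch of speech-time suppression: audio keeps flowing to the wakeword
# model (so its internal streaming state stays consistent), but detections
# are not dispatched while the robot is talking. Names are illustrative.

class WakewordGate:
    def __init__(self, model, threshold=0.5):
        self.model = model          # callable: audio frame -> {name: score}
        self.threshold = threshold
        self.speaking = False       # set True while TTS audio is playing

    def process_frame(self, frame):
        scores = self.model(frame)  # always run: keeps streaming state warm
        if self.speaking:
            return {}               # suppress dispatch during playback
        return {k: v for k, v in scores.items() if v > self.threshold}
```

The important detail is that `self.model(frame)` runs unconditionally; only the return value is gated.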

Getting DECTalk working under systemd required some debugging. The PulseAudio socket path differs for system services vs user sessions, so the service needed an explicit PULSE_SERVER environment variable pointing to the right socket.
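A systemd drop-in for that fix might look like the following -- the unit name and user id here are assumptions, not Kevin's actual configuration:

```ini
# Hypothetical drop-in: /etc/systemd/system/kevin-voice.service.d/pulse.conf
# Points PulseAudio clients at the user session's socket, which system
# services don't inherit by default. Unit name and uid are illustrative.
[Service]
Environment=PULSE_SERVER=unix:/run/user/1000/pulse/native
```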

I also built a voice effects processor that can make Piper TTS output sound deliberately robotic -- pitch quantization, chorus with LFO-modulated delay, and bitcrush. The design philosophy: "T-Pain, not Hawking. Lean into the synthesis." I haven't deployed this to Kevin yet, but it's ready for when I want him to sound more like a robot and less like 1984.
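Of those effects, bitcrushing is the simplest to show: re-quantize each sample to fewer amplitude levels. A toy version (assuming float samples in [-1, 1]; the real processor also does pitch quantization and LFO chorus, which this sketch omits):

```python
# Toy bitcrush: re-quantize float samples in [-1, 1] to 2**bits levels.
# One of the "lean into the synthesis" effects; pitch quantization and
# chorus are omitted here for brevity.

def bitcrush(samples, bits=4):
    half_levels = 2 ** bits / 2
    return [round(s * half_levels) / half_levels for s in samples]
```

At 4 bits that's 16 amplitude steps, which is plenty to make Piper output sound like it came from a toy keyboard.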

Discoveries Along the Way

Kevin only waves with his right arm regardless of the hand_index parameter. The API accepts left or right, but the robot does what it wants.

Zero torque mode causes the robot to collapse. The API documentation calls this "zero torque drag mode." We confirmed empirically what "drag" means. The test harness now categorizes this as HIGH RISK.

Kevin stepped in dog poop. Status report from February 16: "Stepped in dog poop, got splashed during cleanup, possibly done for the day." Household robotics is glamorous.

The suspicious Alibaba connection. Early network reconnaissance found outbound connections to Alibaba Cloud from Kevin's secondary IP. Probably telemetry, but with a 4-year-old in the house, I took it seriously enough to investigate. I've locked down DDS communication to localhost and shared memory only -- all external communication goes through my HTTP API.
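One way to enforce that kind of lockdown in ROS 2 Humble -- not necessarily the mechanism used here -- is the built-in localhost-only switch:

```shell
# Restrict ROS 2 DDS discovery and data to the loopback interface
# (supported in ROS 2 Humble). One possible lockdown mechanism, not
# necessarily the one used on Kevin.
export ROS_LOCALHOST_ONLY=1
```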

Kevin's Cool Moves

Kevin debuted publicly at a NASA APS "Power Hour" presentation -- a regular series where people share personal projects, hobbies, or interesting side work. The talk was titled "Kevin's Cool Moves," a name chosen by my coworker Shay, which I initially resisted and then accepted because it was better than anything I'd come up with.

The demo showed the full voice-command loop: saying "Hey Kevin," watching the state machine transition to LISTENING, issuing a command like "wave," and watching the robot execute the gesture. I also walked through the wakeword training pipeline -- how you go from a phrase like "hey kevin" to a deployable model using synthetic speech data and a free Google Colab GPU. The whole thing is surprisingly accessible once you see it laid out.

We pre-recorded the demo rather than operating Kevin live during work hours. This turned out to be the right call for several reasons, not least of which is that Kevin is loud enough to disrupt a video call from across the house. More on that in a moment.

The reception was genuinely warm. Several people asked about getting their own K1, which I found touching and also slightly alarming, because I would not describe the ownership experience as "ready for a wider audience." But the interest was real, and it was nice to show something that wasn't a PowerPoint slide.

The Noise Problem

Here is the thing nobody tells you about affordable humanoid robots: they are incredibly loud.

Kevin's default walking gait sounds like someone aggressively operating a sewing machine inside a metal trash can. The servos whine, the joints click, and the feet hit the floor with the confidence of a robot that was originally designed for RoboCup-style soccer competitions. The K1's locomotion is tuned for a gymnasium, not a living room.

This matters a lot when you live with a 4-year-old. Naptime is sacred. Bedtime is non-negotiable. And Kevin's operating hours are essentially "when the kid is awake and not startled by robot noises," which is a narrower window than you might think. The first time Kevin walked across the kitchen at full speed, my daughter thought it was hilarious. The second time, during quiet play, she was less amused.

I've looked into the servo parameters and there might be ways to reduce the noise -- slower gaits, different step heights, maybe even physical damping on the feet. But the honest answer is that the noise problem is a physics problem, not a software problem, and I haven't found a software-shaped solution for it yet.

This is the real reason Kevin hasn't been powered on in over a month. Not because the project stalled technically -- the voice pipeline works, the state machine is solid, the wakeword models are trained. It's because the physical reality of operating a loud bipedal robot in a house with a small child makes every session a negotiation with the rest of my family's schedule.

Where Things Stand

Kevin has been unplugged since early March 2026, roughly since before I disappeared into 7DRL. That's not a euphemism for "I got bored" -- I genuinely like working on this project. It's that the activation energy required to set up a Kevin session (charge him, clear floor space, confirm the kid is awake and elsewhere, warn my wife, close the dog gate) is high enough that it keeps losing the priority queue to things I can do quietly at my desk.

The technical roadmap still exists: speech recognition via faster-whisper on the RTX 4090, natural language command parsing through a local LLM, the ability to understand arbitrary requests like "bring me a drink" and plan the action sequence. I also want to train negative wakeword data using recordings of my daughter's voice, because she says "Kevin" a lot and false activations when a robot starts walking toward your child are genuinely not great.

But I'm being honest with myself about the timeline. This is a low-priority project constrained by physical realities that no amount of clever software can route around. Kevin will come back online when the conditions are right -- maybe when it's warm enough to run him in the garage, or when my daughter is old enough to be a willing participant rather than an unwilling audience.

In the meantime, he stands in the corner of my office, powered down, with a small hat my daughter put on him. He looks content.


This article was scaffolded with backblog.
