Non-Verbal Human-Robot Interaction with Reachy Mini: A Real-Time Multimodal System and Turing Test Evaluation

Simon Fraser University, *Equal contributions
Pipeline

System overview of the multimodal non-verbal interaction pipeline.

Abstract

Non-verbal cues such as gesture, posture, and facial expression are central to human communication, yet they remain largely unaddressed in social robotics. Existing HRI systems either rely on language or require cloud inference, limiting real-time, expressive non-verbal responses. We present a fully offline, real-time non-verbal interaction system for Reachy Mini, a compact humanoid robot, that perceives and autonomously responds to human body motion and emotion.

Our system combines continuous head and arm mirroring, hand gesture recognition, facial emotion detection, and a locally hosted Vision-Language Model (VLM) that selects from a library of 81 pre-recorded expressions, all coordinated by a priority scheduler across four concurrent perception layers.
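The scheduler's implementation is not reproduced here, but its arbitration logic can be sketched compactly. The Python sketch below is a minimal illustration under stated assumptions: each perception layer submits behaviour requests to a shared priority queue, a fixed layer ranking decides which pending behaviour plays, and stale requests are dropped. The layer names, priority ordering, and freshness window are hypothetical, not taken from the actual system.

    import queue
    import threading
    import time
    from dataclasses import dataclass, field
    from typing import Callable

    # Hypothetical ranking of the four perception layers (0 = highest priority).
    # The exact ordering used by the system is not published; this is an assumption.
    PRIORITY = {
        "vlm_expression": 0,   # VLM-selected pre-recorded expression
        "hand_gesture": 1,     # hand gesture recognition
        "facial_emotion": 2,   # facial emotion detection
        "mirroring": 3,        # continuous head/arm mirroring (default layer)
    }

    @dataclass(order=True)
    class Request:
        priority: int
        created: float = field(compare=False)
        layer: str = field(compare=False)
        play: Callable[[], None] = field(compare=False)

    class PriorityScheduler:
        """One behaviour plays at a time; among pending requests the
        highest-priority one wins, and stale requests are discarded."""

        STALE_AFTER = 1.0  # seconds; assumed freshness window

        def __init__(self) -> None:
            self.pending: "queue.PriorityQueue[Request]" = queue.PriorityQueue()

        def submit(self, layer: str, play: Callable[[], None]) -> None:
            self.pending.put(Request(PRIORITY[layer], time.time(), layer, play))

        def run_forever(self) -> None:
            while True:
                req = self.pending.get()  # blocks until a layer submits work
                if time.time() - req.created > self.STALE_AFTER:
                    continue              # perception result is no longer current
                req.play()                # plays the behaviour to completion

    # Example: a gesture layer submitting a response while the scheduler runs.
    scheduler = PriorityScheduler()
    threading.Thread(target=scheduler.run_forever, daemon=True).start()
    scheduler.submit("hand_gesture", lambda: print("play expression: wave_back"))
    time.sleep(0.1)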

Running entirely on consumer hardware with 3 s end-to-end latency, the system produces behaviour convincing enough that participants in a Turing-test-style study could not reliably distinguish it from human teleoperation (38% overall accuracy, below the 50% chance baseline). These results suggest that lightweight, offline multimodal perception is sufficient to produce socially credible non-verbal robot behaviour, opening a path toward accessible and deployable social HRI systems.

Video

Results

Sample Participant Interactions During the Pilot Study


Top row, left to right: a participant performing a two-handed gesture triggering action detection (V2); leaning to the side with head tracking active (V2); a teleoperator controlling the robot with one hand (V2). Bottom row: a participant using a finger-point gesture for hand gesture recognition (V3); raising one hand while the mirroring layer tracks arm elevation (V2); exhibiting a frown expression detected by the facial emotion recognition module (V3).
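To make the mirroring layer's input concrete, the sketch below estimates arm elevation from webcam pose landmarks. The perception backend used by the system is not named here; MediaPipe Pose is assumed purely for illustration, and the elevation proxy (wrist height relative to the shoulder) is our own simplification.

    import cv2
    import mediapipe as mp

    mp_pose = mp.solutions.pose

    cap = cv2.VideoCapture(0)
    with mp_pose.Pose(model_complexity=0) as pose:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if results.pose_landmarks:
                lm = results.pose_landmarks.landmark
                wrist = lm[mp_pose.PoseLandmark.RIGHT_WRIST]
                shoulder = lm[mp_pose.PoseLandmark.RIGHT_SHOULDER]
                # Image y grows downward, so positive values mean a raised arm.
                elevation = shoulder.y - wrist.y
                print(f"arm elevation: {elevation:+.2f}")
    cap.release()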

Turing Test Prediction Accuracy by Condition and Pipeline

Pipeline   Condition      Correct   Total   Accuracy
V2         Autonomous        4        5       80%
V2         Teleoperated      1        4       25%
V3         Autonomous        2        5       40%
V3         Teleoperated      1        7       14%
V2 + V3    Autonomous        6       10       60%
V2 + V3    Teleoperated      2       11       18%

Turing Test Prediction Accuracy by Condition


Participants in the Teleoperated condition identified the control mode correctly only 18% of the time (2/11), well below the 50% chance baseline, indicating a strong bias toward assuming autonomy. Participants in the Autonomous condition performed closer to chance, at 60% (6/10).
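Because the sample sizes are small, it is worth checking how far these accuracies sit from chance. The snippet below runs exact two-sided binomial tests against the 50% baseline; these tests are our own illustration rather than analyses reported in the study, and with n = 10 and n = 11 even the 18% result is only marginally distinguishable from chance.

    from scipy.stats import binomtest

    # Exact two-sided binomial tests against the 50% chance baseline.
    # Counts are taken from the tables above; the tests themselves are ours.
    for label, correct, total in [
        ("Teleoperated", 2, 11),
        ("Autonomous", 6, 10),
        ("Overall", 8, 21),
    ]:
        result = binomtest(correct, total, p=0.5)
        print(f"{label}: {correct}/{total} = {correct / total:.0%}, "
              f"two-sided p = {result.pvalue:.3f}")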

Mean Godspeed Subscale Scores by Condition (1–5 scale)

Condition      Anthropomorphism   Animacy   Likeability   Intelligence   Safety
Autonomous           2.72           3.22       3.96           3.08        3.00
Teleoperated         3.35           3.59       4.38           3.33        3.30

Mean Godspeed Subscale Scores by Condition


Bars show condition means; overlaid points show individual observations (N=21). Teleoperated interactions received higher scores on all five subscales, though differences should be interpreted as descriptive trends given the pilot-study sample size.
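For reference, Godspeed subscale scores are conventionally computed by averaging each subscale's 1–5 semantic-differential items per participant and then averaging within condition. The sketch below illustrates that computation; the item names and data are invented, and the study's actual questionnaire processing is not published here.

    import pandas as pd

    # Invented example data: one row per participant, items on a 1-5 scale.
    rows = [
        {"condition": "Autonomous",   "anthro_1": 3, "anthro_2": 2, "animacy_1": 4, "animacy_2": 3},
        {"condition": "Teleoperated", "anthro_1": 4, "anthro_2": 3, "animacy_1": 4, "animacy_2": 4},
    ]
    df = pd.DataFrame(rows)

    # Each subscale score is the mean of its items (two shown per subscale here).
    subscales = {
        "Anthropomorphism": ["anthro_1", "anthro_2"],
        "Animacy": ["animacy_1", "animacy_2"],
    }
    for name, items in subscales.items():
        df[name] = df[items].mean(axis=1)

    print(df.groupby("condition")[list(subscales)].mean())  # condition means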