UGotMe: An Embodied System
for Affective Human-Robot Interaction


UGotMe effectively provides appropriate emotional responses to human interactants while maintaining a positive user experience, even in the presence of distracting factors.

Distracting facial expressions confuse the model, leading to the wrong answer.

Abstract

Equipping humanoid robots with the capability to understand the emotional states of human interactants and to express emotions appropriately according to the situation is essential for affective human-robot interaction. However, enabling current vision-aware multimodal emotion recognition models for affective human-robot interaction in the real world raises embodiment challenges: addressing the environmental noise issue and meeting real-time requirements. First, in multiparty conversation scenarios, the noise inherent in the robot's visual observations, which may come from either 1) distracting objects in the scene or 2) inactive speakers appearing in the robot's field of view, hinders the models from extracting emotional cues from vision inputs. Second, real-time response, a desired feature of an interactive system, is also challenging to achieve. To tackle both challenges, we introduce UGotMe, an affective human-robot interaction system designed specifically for multiparty conversations. Two denoising strategies are proposed and incorporated into the system to solve the first issue. Specifically, to filter out distracting objects in the scene, we extract face images of the speakers from the raw images and introduce a customized active face extraction strategy to rule out inactive speakers. As for the second issue, we employ efficient data transmission from the robot to the local server to improve real-time response capability. We deploy UGotMe on a humanoid robot named Ameca to validate its real-time inference capabilities in practical scenarios.
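The real-time requirement above hinges on keeping the robot-to-edge transmission lightweight. The following minimal Python sketch illustrates one way such streaming could look: JPEG-compressing camera frames on the robot and sending length-prefixed packets to a local edge server over TCP. The addresses, port, and function names (EDGE_HOST, EDGE_PORT, stream_frames) are hypothetical placeholders, not details taken from the paper.

# Minimal sketch of robot-to-edge frame streaming (illustrative only).
# Assumptions: frames are JPEG-compressed on the robot and sent over a plain
# TCP socket to a local edge server; EDGE_HOST, EDGE_PORT, and stream_frames
# are hypothetical names, not from the paper.
import socket
import struct

import cv2

EDGE_HOST = "192.168.1.50"   # hypothetical edge-server address
EDGE_PORT = 9000             # hypothetical port

def stream_frames(camera_index: int = 0, jpeg_quality: int = 80) -> None:
    """Capture frames, JPEG-compress them, and send length-prefixed packets."""
    cap = cv2.VideoCapture(camera_index)
    with socket.create_connection((EDGE_HOST, EDGE_PORT)) as conn:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # Compressing on the robot keeps the per-frame payload small,
            # which is what makes near-real-time transmission feasible.
            ok, buf = cv2.imencode(".jpg", frame,
                                   [cv2.IMWRITE_JPEG_QUALITY, jpeg_quality])
            if not ok:
                continue
            payload = buf.tobytes()
            conn.sendall(struct.pack("!I", len(payload)) + payload)
    cap.release()

if __name__ == "__main__":
    stream_frames()

A matching receiver on the edge side would read the 4-byte length prefix, decode the JPEG, and hand the frame to the face extraction and emotion recognition modules.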

System overview

An overview of UGotMe, the proposed affective human-robot interaction system. The working pipeline includes on-robot multimodal perception (B) and on-edge vision-language-to-emotion (VL2E) modeling (C), where multimodal emotion recognition and robotic facial expression generation take place. D. Customized active face extraction (a)-(e) handles the environmental noise issue.
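The active face extraction step (D) is the part of the pipeline that rules out inactive speakers. As a rough illustration only, the sketch below approximates it with an off-the-shelf Haar-cascade face detector and a mouth-motion heuristic for choosing the active speaker; UGotMe's actual customized strategy is not reproduced here, and all function names are hypothetical.

# A minimal sketch of "active face extraction" (not the paper's exact method).
# Assumptions: faces come from an off-the-shelf Haar cascade, and the "active"
# speaker is approximated as the face whose mouth region changes the most
# between consecutive frames. All names here are hypothetical.
from typing import List, Optional, Tuple

import cv2
import numpy as np

_face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def detect_faces(gray: np.ndarray) -> List[Tuple[int, int, int, int]]:
    """Return (x, y, w, h) boxes for all faces in a grayscale frame."""
    return list(_face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5))

def extract_active_face(prev_gray: np.ndarray, curr_gray: np.ndarray) -> Optional[np.ndarray]:
    """Crop the face whose lower (mouth) region moves the most between frames."""
    best_crop, best_score = None, 0.0
    for (x, y, w, h) in detect_faces(curr_gray):
        mouth_prev = prev_gray[y + h // 2 : y + h, x : x + w]
        mouth_curr = curr_gray[y + h // 2 : y + h, x : x + w]
        if mouth_prev.shape != mouth_curr.shape or mouth_prev.size == 0:
            continue
        # Mean absolute difference as a crude proxy for lip/mouth motion.
        score = float(np.mean(cv2.absdiff(mouth_prev, mouth_curr)))
        if score > best_score:
            best_score, best_crop = score, curr_gray[y : y + h, x : x + w]
    return best_crop

The returned crop (the presumed active speaker's face) is what gets passed on to the emotion recognition module, so distracting objects and inactive speakers never reach the model.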

Emotion Recognition Module

An illustration of the VL2E model. The vision encoder of the VL2E model processes face sequences of active speakers and extracts visual features at the frame level. In this way, the model can ignore irrelevant environmental noise and better focus on the emotion-rich facial expressions. To account for conversation context, we perform context modeling when computing the textual representation of the current utterance. To better model inter-modal interactions, we leverage a crossmodal transformer to perform multimodal fusion once the unimodal features are extracted.
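To make the fusion step concrete, here is a simplified PyTorch sketch of a cross-modal attention layer in the spirit of the description above: a context-aware text embedding queries the sequence of frame-level face features, and the fused representation is classified into emotion categories. The dimensions, layer count, and seven-class emotion set are illustrative placeholders, not the released VL2E configuration.

# A simplified sketch of a VL2E-style fusion head (not the released model).
# Assumptions: frame-level face features and a context-aware utterance
# embedding are already produced by upstream encoders; sizes and the class
# set are placeholders chosen for illustration.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4, num_emotions: int = 7):
        super().__init__()
        # Text queries attend over the sequence of frame-level visual features,
        # mirroring a single crossmodal transformer layer.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_emotions)

    def forward(self, text_feat: torch.Tensor, face_feats: torch.Tensor) -> torch.Tensor:
        # text_feat:  (batch, 1, dim)      context-aware utterance embedding
        # face_feats: (batch, frames, dim) frame-level active-face features
        fused, _ = self.cross_attn(query=text_feat, key=face_feats, value=face_feats)
        fused = self.norm(fused + text_feat)      # residual connection
        return self.classifier(fused.squeeze(1))  # emotion logits

# Toy usage with random features standing in for real encoder outputs.
model = CrossModalFusion()
logits = model(torch.randn(2, 1, 256), torch.randn(2, 8, 256))
print(logits.shape)  # torch.Size([2, 7])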

Comparison Results

Left: UGotMe-TelME. Right: UGotMe-VL2E. In both cases, the inactive speaker has a sad expression, while the active speaker, who is talking with Ameca, has a joyful expression. The dialogue context for both cases is: “The movie we saw last night is really impressive. That’s awesome. What movie did you watch? You jump, I jump”. Ameca is supposed to deliver the same emotion as the active speaker through facial expression. However, in the case of UGotMe-TelME, the distracting face confuses the model, leading to the wrong answer.

Real-world Deployment

Demonstration videos covering six conversation scenarios: broke up, good news, cheat, diary, movie, and examination.

Video Presentation

BibTeX

@misc{li2024ugotmeembodiedaffectivehumanrobot,
      title={UGotMe: An Embodied System for Affective Human-Robot Interaction}, 
      author={Peizhen Li and Longbing Cao and Xiao-Ming Wu and Xiaohan Yu and Runze Yang},
      year={2024},
      eprint={2410.18373},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2410.18373}, 
}