AI-Powered Multimodal Emotion Recognition: A Zero-Shot Framework Using Large Multimodal Models for Real-Time Affective Computing

Vanitha A; Mohammed Roshan Akther K M

doi:10.17148/IJARCCE.2026.15424

Volume 15, Issue 4, April 2026

Abstract: Contemporary emotion recognition systems predominantly depend on Convolutional Neural Network (CNN) classifiers pre-trained on constrained label sets, which makes them brittle in unconstrained real-world conditions. This paper presents an AI-powered multimodal emotion recognition framework that exploits the zero-shot reasoning capability of the Google Gemini 1.5 Flash Large Multimodal Model (LMM) to perform high-fidelity facial micro-expression analysis directly in a web browser. The system captures live video frames using the HTML5 MediaDevices API, performs client-side JPEG compression via the Canvas API, and transmits Base64-encoded image payloads to the Gemini inference endpoint together with a carefully engineered multimodal prompt. Dynamic JSON schema enforcement ensures structured, type-safe responses containing the detected emotion, a contextual explanation, bounding-box coordinates (normalized to a 0–1000 scale), and an affective quote. The front end is implemented with React 19, Vite 6, TypeScript, Tailwind CSS, and Framer Motion, forming a production-grade, decoupled multimodal architecture. Experimental evaluation demonstrates a mean round-trip inference latency of 1.3 seconds on broadband connectivity, as well as the ability to detect fine-grained affective states well beyond the canonical "Big Six" categories, including complex states such as "Suppressed Anger" and "Cautiously Optimistic." The architecture adheres to a strict zero-storage privacy model compliant with GDPR and CCPA. Results confirm that cloud-based LMM prompting supersedes CNN edge models for nuanced emotion understanding while remaining accessible on commodity hardware.
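
The request and response handling described in the abstract could look roughly like the TypeScript sketch below. It is an illustration under stated assumptions, not the authors' implementation: the field names (emotion, explanation, box, quote), the [yMin, xMin, yMax, xMax] coordinate order, and all function names are hypothetical, and the request body follows the general shape of the Gemini generateContent REST API with a JSON response schema. The frame is assumed to have been captured already as a JPEG data URL (for example via canvas.toDataURL in the browser).

```typescript
// Hypothetical response shape; field names are assumptions, not taken from the paper.
interface EmotionResult {
  emotion: string;
  explanation: string;
  // Assumed order [yMin, xMin, yMax, xMax], each on the 0-1000 scale.
  box: [number, number, number, number];
  quote: string;
}

interface PixelBox { x: number; y: number; width: number; height: number; }

// Strip the "data:image/jpeg;base64," prefix that a canvas data URL carries.
function dataUrlToBase64(dataUrl: string): string {
  const comma = dataUrl.indexOf(",");
  if (!dataUrl.startsWith("data:") || comma === -1) {
    throw new Error("expected a data URL");
  }
  return dataUrl.slice(comma + 1);
}

// Build a request body with an inline JPEG and an enforced JSON schema.
// The schema below is an assumed example of what the paper calls schema enforcement.
function buildRequestBody(prompt: string, base64Jpeg: string) {
  return {
    contents: [{
      parts: [
        { text: prompt },
        { inlineData: { mimeType: "image/jpeg", data: base64Jpeg } },
      ],
    }],
    generationConfig: {
      responseMimeType: "application/json",
      responseSchema: {
        type: "OBJECT",
        properties: {
          emotion: { type: "STRING" },
          explanation: { type: "STRING" },
          box: { type: "ARRAY", items: { type: "INTEGER" } },
          quote: { type: "STRING" },
        },
        required: ["emotion", "explanation", "box", "quote"],
      },
    },
  };
}

// Validate the model's JSON text on the client before it is rendered.
function parseEmotionResult(raw: string): EmotionResult {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("response is not an object");
  }
  const r = data as Record<string, unknown>;
  for (const key of ["emotion", "explanation", "quote"]) {
    if (typeof r[key] !== "string") throw new Error(`missing string field: ${key}`);
  }
  const box = r.box;
  if (!Array.isArray(box) || box.length !== 4 ||
      !box.every((v) => typeof v === "number" && v >= 0 && v <= 1000)) {
    throw new Error("box must be four numbers in [0, 1000]");
  }
  return {
    emotion: r.emotion as string,
    explanation: r.explanation as string,
    box: box as [number, number, number, number],
    quote: r.quote as string,
  };
}

// Convert a 0-1000 normalized box into pixel coordinates for a given frame size.
function toPixelBox(
  box: [number, number, number, number],
  frameWidth: number,
  frameHeight: number,
): PixelBox {
  const [yMin, xMin, yMax, xMax] = box;
  return {
    x: (xMin * frameWidth) / 1000,
    y: (yMin * frameHeight) / 1000,
    width: ((xMax - xMin) * frameWidth) / 1000,
    height: ((yMax - yMin) * frameHeight) / 1000,
  };
}
```

Under these assumptions, a box of [100, 250, 600, 750] on a 640×480 frame maps to a rectangle at x = 160, y = 48 with width 320 and height 240. Validating the parsed JSON on the client gives a second check in addition to the server-side schema, so a malformed response fails loudly instead of drawing a misplaced overlay.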

Keywords: Affective Computing, Emotion Recognition, Large Multimodal Models, Gemini 1.5 Flash, Zero-Shot Learning, Facial Action Coding System, Micro-Expression Analysis, Prompt Engineering, Human-Computer Interaction, Real-Time Vision.

How to Cite:

[1] Vanitha A, Mohammed Roshan Akther K M, “AI-Powered Multimodal Emotion Recognition: A Zero-Shot Framework Using Large Multimodal Models for Real-Time Affective Computing,” International Journal of Advanced Research in Computer and Communication Engineering (IJARCCE), DOI: 10.17148/IJARCCE.2026.15424

This work is licensed under a Creative Commons Attribution 4.0 International License.