Integrating Emotion Recognition with Speech Recognition and Speaker Diarisation for Conversations

Two metrics are proposed to evaluate automatic emotion recognition (AER) performance with automatic segmentation, based on time-weighted emotion and speaker classification errors.

GitHub Link

The GitHub link is https://github.com/w-wu/steer

Introduction

The repository "W-Wu/sTEER" contains the code accompanying the paper "Integrating Emotion Recognition with Speech Recognition and Speaker Diarisation for Conversations". The paper introduces a system that combines emotion recognition, speech recognition, and speaker diarisation in a single jointly trained model. It also proposes two evaluation metrics: the Time-weighted Emotion Error Rate (TEER) and the speaker-attributed Time-weighted Emotion Error Rate (sTEER). The repository provides instructions and tools for data preparation, training, testing, and evaluation using Python, PyTorch, and SpeechBrain, and includes references for proper citation. Note that results may differ slightly between runs due to the non-deterministic behaviour of PyTorch's CTC loss function.

Content

Two metrics are proposed to evaluate emotion classification performance with automatic segmentation: the Time-weighted Emotion Error Rate (TEER) and the speaker-attributed Time-weighted Emotion Error Rate (sTEER).
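To make the idea of a time-weighted error concrete, here is a minimal sketch of how such a metric could be scored. This is an illustrative assumption, not the official TEER/sTEER implementation from the repository: it simply measures the fraction of reference time for which the hypothesised emotion label disagrees with the reference, sampled on a fixed frame grid. The function name and the segment format `(start, end, label)` are hypothetical.

```python
def time_weighted_error(ref_segments, hyp_segments, frame=0.01):
    """Fraction of reference time with a mismatched emotion label.

    ref_segments / hyp_segments: lists of (start_sec, end_sec, label)
    tuples. Time is sampled at the centre of fixed-length frames.
    """
    def label_at(segments, t):
        # Return the label active at time t, or None if unlabelled.
        for start, end, label in segments:
            if start <= t < end:
                return label
        return None

    total = max(end for _, end, _ in ref_segments)
    n_frames = int(total / frame)
    errors = scored = 0
    for i in range(n_frames):
        t = (i + 0.5) * frame  # frame centre
        ref = label_at(ref_segments, t)
        if ref is None:
            continue  # only score time covered by the reference
        scored += 1
        if label_at(hyp_segments, t) != ref:
            errors += 1
    return errors / scored if scored else 0.0

# 1 s mislabelled out of 4 s of reference speech -> 0.25
ref = [(0.0, 2.0, "happy"), (2.0, 4.0, "sad")]
hyp = [(0.0, 3.0, "happy"), (3.0, 4.0, "sad")]
print(time_weighted_error(ref, hyp))  # 0.25
```

A speaker-attributed variant would additionally require the hypothesised speaker to match before an emotion frame counts as correct, which is the intuition behind sTEER.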

Alternatives & Similar Tools

LongLLaMA: handles very long text contexts, up to 256,000 tokens

LongLLaMA is a large language model designed to handle very long text contexts, up to 256,000 tokens. It is based on OpenLLaMA and uses a training technique called Focused Transformer (FoT). The repository provides a smaller 3B version of LongLLaMA for free use, which can also serve as a drop-in replacement for LLaMA models with shorter contexts.