Integrating Emotion Recognition with Speech Recognition and Speaker Diarisation for Conversations

Two metrics are proposed to evaluate automatic emotion recognition (AER) performance with automatic segmentation, based on time-weighted emotion and speaker classification errors.

GitHub Link

The GitHub link is https://github.com/w-wu/steer

Introduction

The repository "W-Wu/sTEER" contains the code accompanying the paper "Integrating Emotion Recognition with Speech Recognition and Speaker Diarisation for Conversations". The paper introduces a system that combines emotion recognition, speech recognition, and speaker diarisation in a single jointly trained model. It also proposes two evaluation metrics: the Time-weighted Emotion Error Rate (TEER) and the speaker-attributed Time-weighted Emotion Error Rate (sTEER). The repository provides instructions and tools for data preparation, training, testing, and evaluation using Python, PyTorch, and SpeechBrain, and includes references for proper citation. Note that results may differ slightly between runs due to the non-deterministic behaviour of PyTorch's CTC loss function.

Content

Two metrics are proposed to evaluate emotion classification performance with automatic segmentation: the Time-weighted Emotion Error Rate (TEER) and the speaker-attributed Time-weighted Emotion Error Rate (sTEER).
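To make the idea of a time-weighted error concrete, here is a minimal sketch of how such a metric could be scored. This is an illustrative assumption, not the official TEER/sTEER implementation from the repository: it simply measures the fraction of reference time for which the hypothesised emotion label disagrees with the reference, sampled on a fixed frame grid. The function name and the segment format `(start, end, label)` are hypothetical.

```python
def time_weighted_error(ref_segments, hyp_segments, frame=0.01):
    """Fraction of reference time with a mismatched emotion label.

    ref_segments / hyp_segments: lists of (start_sec, end_sec, label)
    tuples. Time is sampled at the centre of fixed-length frames.
    """
    def label_at(segments, t):
        # Return the label active at time t, or None if unlabelled.
        for start, end, label in segments:
            if start <= t < end:
                return label
        return None

    total = max(end for _, end, _ in ref_segments)
    n_frames = int(total / frame)
    errors = scored = 0
    for i in range(n_frames):
        t = (i + 0.5) * frame  # frame centre
        ref = label_at(ref_segments, t)
        if ref is None:
            continue  # only score time covered by the reference
        scored += 1
        if label_at(hyp_segments, t) != ref:
            errors += 1
    return errors / scored if scored else 0.0

# 1 s mislabelled out of 4 s of reference speech -> 0.25
ref = [(0.0, 2.0, "happy"), (2.0, 4.0, "sad")]
hyp = [(0.0, 3.0, "happy"), (3.0, 4.0, "sad")]
print(time_weighted_error(ref, hyp))  # 0.25
```

A speaker-attributed variant would additionally require the hypothesised speaker to match before an emotion frame counts as correct, which is the intuition behind sTEER.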

Alternatives & Similar Tools

LongLLaMA: handles very long text contexts, up to 256,000 tokens

LongLLaMA is a large language model designed to handle very long text contexts, up to 256,000 tokens. It is based on OpenLLaMA and uses a training technique called Focused Transformer (FoT). The repository provides a smaller 3B version of LongLLaMA for free use, which can also serve as a drop-in replacement for LLaMA models with shorter contexts.