Speaker Detection, Tracking & AI Summarization System
This project is a multimodal AI system for speaker detection, tracking, speech segmentation, transcription, and summarization. It uses MediaPipe Face Mesh to locate facial landmarks and decides whether a detected person is speaking from lip movement ratios. The system identifies active speakers in video, segments each speaker's audio, transcribes the speech with Whisper, and generates concise summaries with LangChain. It also accepts YouTube videos and PDF documents, so users can extract and summarize content from multiple sources.
The Challenge
Long videos, meetings, lectures, and documents are slow to review manually: a reviewer has to identify who is speaking, extract each speaker's content, transcribe the speech, and summarize the key points. Summarization tools typically work from text or audio alone and do not combine visual speaker detection with NLP and LLM-based summarization.
The Solution
The solution combines computer vision and NLP into a single pipeline. MediaPipe Face Mesh detects facial landmarks and estimates speaking activity by analyzing lip movement ratios. The system tracks active speakers, segments the related audio, transcribes it using Whisper, and passes the extracted text to LangChain for summarization. It also extends the same summarization workflow to YouTube videos and PDF documents.
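The lip movement ratio can be illustrated with a minimal sketch. It assumes the landmarks are already available as a mapping from Face Mesh index to normalized (x, y) coordinates. The landmark indices (13, 14, 61, 291), the function names, and the 0.05 threshold are illustrative assumptions, not values taken from this project.

```python
import math

# Assumed Face Mesh landmark indices (not specified by the project):
# inner upper lip, inner lower lip, and the two mouth corners.
UPPER_LIP, LOWER_LIP = 13, 14
LEFT_CORNER, RIGHT_CORNER = 61, 291


def lip_ratio(landmarks):
    """Return mouth opening divided by mouth width for one frame.

    `landmarks` maps a landmark index to an (x, y) pair. Dividing by the
    mouth width makes the value roughly independent of face size.
    """
    vertical = math.dist(landmarks[UPPER_LIP], landmarks[LOWER_LIP])
    horizontal = math.dist(landmarks[LEFT_CORNER], landmarks[RIGHT_CORNER])
    return vertical / horizontal if horizontal > 0 else 0.0


def is_speaking(ratios, min_range=0.05):
    """Treat a window of frames as speech if the ratio varies enough.

    `min_range` is a hypothetical threshold; a real system would tune it.
    """
    if len(ratios) < 2:
        return False
    return max(ratios) - min(ratios) >= min_range
```

Looking at the variation of the ratio over a short window, rather than a single frame, is one way to avoid flagging a mouth that is simply held open.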
Architecture
The system extracts frames from the input video and runs MediaPipe Face Mesh on them to detect faces and their landmarks. It computes distances between lip landmarks to decide whether each face is speaking, maps the resulting active-speaker segments to the corresponding portions of the audio track, and transcribes those portions with Whisper. The transcribed text is passed to LangChain to generate summaries. For YouTube links, the system extracts the video content first; for PDF uploads, it extracts the document text before summarization.
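Mapping active-speaker frames to audio portions could look like the following sketch. It assumes one boolean speaking flag per video frame and a known frame rate; the function name `frames_to_segments` and the `min_duration` parameter are hypothetical and not part of the project.

```python
def frames_to_segments(flags, fps, min_duration=0.0):
    """Group consecutive speaking frames into (start, end) times in seconds.

    `flags` holds one boolean per frame, `fps` is the video frame rate, and
    segments shorter than `min_duration` seconds are dropped.
    """
    segments = []
    start = None  # index of the first frame in the current speaking run
    for i, speaking in enumerate(flags):
        if speaking and start is None:
            start = i
        elif not speaking and start is not None:
            segments.append((start / fps, i / fps))
            start = None
    if start is not None:
        # The video ended while the speaker was still active.
        segments.append((start / fps, len(flags) / fps))
    return [(s, e) for s, e in segments if e - s >= min_duration]
```

The resulting time ranges can then be used to cut the audio track before transcription. Dropping very short segments is one simple way to suppress isolated false detections.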