Abstract
Videos are harder to analyze than images because they carry far richer content. Long videos in particular are costly to process as a whole, so efficient processing requires segmenting them into scenes; analyzing individual scenes is then an efficient alternative to analyzing the entire video. This scene-level approach applies to a range of Video Intelligence tasks, from surveillance to comprehensive video analytics. By building on open-source foundation models and leveraging audio and text features, our framework offers a versatile solution to video analysis for a variety of real-world applications.