Video Summarization Using OpenAI Whisper & Hugging Chat API



“Less is more,” as architect Ludwig Mies van der Rohe famously said, and that is what summarization means. Summarization is a critical tool for reducing voluminous textual content into succinct, relevant morsels, suited to today’s fast-paced information consumption. In text applications, summarization aids information retrieval and supports decision-making. The integration of Generative AI, like OpenAI GPT-3-based models, has revolutionized this process by not only extracting key elements from text but also generating coherent summaries that retain the source’s essence. Interestingly, Generative AI’s capabilities extend beyond text to video summarization. This involves extracting pivotal scenes, dialogues, and concepts from videos, creating abridged representations of the content. You can achieve video summarization in many different ways, including generating a short summary video, performing video content analysis, highlighting key sections of the video, or creating a textual summary of the video using video transcription.

The OpenAI Whisper API leverages automatic speech recognition technology to convert spoken language into written text, which improves the accuracy and efficiency of text summarization. The Hugging Face Chat API, on the other hand, provides access to state-of-the-art language models comparable to GPT-3.

Learning Objectives

In this article we will:

  • Learn about video summarization techniques
  • Understand the applications of video summarization
  • Explore the OpenAI Whisper model architecture
  • Learn to implement video textual summarization using OpenAI Whisper and the Hugging Chat API

This article was published as a part of the Data Science Blogathon.

Video Summarization Techniques

Video Analytics

It involves the process of extracting meaningful information from a video. Deep learning is used to track and identify objects and actions in a video and to identify the scenes. Some of the popular techniques for video summarization are:

Keyframe Extraction and Shot Boundary Detection

This process consists of converting the video to a limited number of still pictures. Video skim is another term for this shorter video of keyshots.

Video shots are uninterrupted, continuous sequences of frames. Shot boundary recognition detects transitions between shots, like cuts, fades, or dissolves, and chooses frames from each shot to build a summary. Below are the major steps to extract a continuous short video summary from a longer video:

  • Frame Extraction – Snapshots are extracted from the video; for a 30 fps video we can take 1 frame per second.
  • Face and Emotion Detection – We can then extract faces from the video and score the emotions of those faces. Faces are detected using SSD (Single Shot MultiBox Detector).
  • Frame Ranking & Selection – Select the frames that have a high emotion score and then rank them.
  • Final Extraction – We extract subtitles from the video along with timestamps. We then extract the sentences corresponding to the frames selected above, along with their starting and ending times in the video. Finally, we merge the video parts corresponding to these intervals to generate the final summary video.
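The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the emotion scores and subtitle intervals are made-up toy values, and the helper names (`sample_frame_indices`, `top_frames`, `intervals_for_times`, `merge_intervals`) are hypothetical, not part of any library. A real pipeline would read frames with a video library and score faces with a detector.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def sample_frame_indices(total_frames: int, video_fps: int, sample_fps: int = 1) -> List[int]:
    """Frame extraction: keep `sample_fps` frames per second of video."""
    step = max(1, video_fps // sample_fps)
    return list(range(0, total_frames, step))


def top_frames(scores: List[float], k: int) -> List[int]:
    """Frame ranking and selection: positions of the k highest scores, in time order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def intervals_for_times(times: List[float], subtitles: List[Interval]) -> List[Interval]:
    """Final extraction: subtitle intervals that contain a selected frame time."""
    return [s for s in subtitles if any(s[0] <= t < s[1] for t in times)]


def merge_intervals(intervals: List[Interval], gap: float = 0.0) -> List[Interval]:
    """Merge intervals that overlap or lie within `gap` seconds of each other."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Toy data (assumed): a 6-second clip at 30 fps, sampled at 1 fps.
indices = sample_frame_indices(total_frames=180, video_fps=30)
emotion_scores = [0.1, 0.9, 0.8, 0.2, 0.1, 0.7]  # one made-up score per sampled frame
selected = top_frames(emotion_scores, k=3)
times = [indices[i] / 30 for i in selected]       # frame index -> seconds
subtitles = [(0.0, 1.5), (1.5, 3.0), (3.0, 4.5), (4.5, 6.0)]
summary_intervals = merge_intervals(intervals_for_times(times, subtitles))
print(summary_intervals)
```

The merged intervals are the time ranges that would be cut from the source video and concatenated into the summary clip.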

Action Recognition and Temporal Subsampling

Here we try to identify the human actions performed in the video, which is a widely used application of video analytics. We break the video down into small subsequences instead of frames and try to estimate the action performed in each segment using classification and pattern recognition techniques like HMC (Hidden Markov Chain analysis).
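As a rough sketch of this idea, the example below splits a video into fixed-length segments and then decodes the most likely action per segment with the Viterbi algorithm, a standard decoder for hidden Markov models. Everything concrete here is an assumption for illustration: the two actions ("walk", "run"), the per-segment motion labels ("slow", "fast"), and all probabilities are invented toy values, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def split_into_segments(total_frames: int, segment_len: int) -> List[Tuple[int, int]]:
    """Temporal subsampling: half-open [start, end) frame ranges of fixed length."""
    return [(s, min(s + segment_len, total_frames))
            for s in range(0, total_frames, segment_len)]


def viterbi(observations: List[str],
            states: List[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]]) -> List[str]:
    """Most likely hidden state sequence for the observations under an HMM."""
    # best[t][s] = (probability of the best path ending in state s at step t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Toy model (all values assumed for illustration).
states = ["walk", "run"]
start_p = {"walk": 0.5, "run": 0.5}
trans_p = {"walk": {"walk": 0.8, "run": 0.2}, "run": {"walk": 0.2, "run": 0.8}}
emit_p = {"walk": {"slow": 0.9, "fast": 0.1}, "run": {"slow": 0.1, "fast": 0.9}}

segments = split_into_segments(total_frames=100, segment_len=30)
motion_per_segment = ["slow", "slow", "fast", "fast"]  # one made-up label per segment
actions = viterbi(motion_per_segment, states, start_p, trans_p, emit_p)
print(list(zip(segments, actions)))
```

In practice the per-segment observations would come from a feature extractor or classifier, and the probabilities would be estimated from labelled data.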

Single and Multi-modal Approaches

In this article we use a single-modal approach, in which the audio of the video is used to create a textual summary of the video. Here we take a single aspect of the video, its audio, convert it to text, and then generate a summary from that text.

In a multi-modal approach, we combine information from many modalities like audio, visual, and text to give a holistic understanding of the video content for more accurate summarization.

Applications of Video Summarization

Before diving into the implementation of our video summarization, we should first know the applications of video summarization. Below are some examples of video summarization in a variety of fields and domains:

  • Security and Surveillance: Video summarization allows us to analyze large amounts of surveillance video and highlight important events without manually reviewing the footage.
  • Education and Training: One can deliver key notes along with training videos, so students can revise the video contents without going through the whole video.
  • Content Browsing: YouTube uses this to highlight the part of a video most relevant to a user's search, so that users can decide whether or not to watch that particular video based on their search requirements.
  • Disaster Management: In emergencies and crises, video summarization allows responders to take action based on the situations highlighted in the video summary.

OpenAI Whisper Model Overview

The Whisper model from OpenAI is an automatic speech recognition (ASR) model. It is used for transcribing speech audio into text.

 Architecture of Open AI Whisper Model

It is based on the transformer architecture, which stacks encoder and decoder blocks with an attention mechanism that propagates information between them. It takes the audio recording, divides it into 30-second chunks, and processes each one individually. For each 30-second chunk, the encoder encodes the audio and preserves the position of each word spoken, and the decoder uses this encoded information to determine what was said.

The decoder predicts tokens from all of this information, where the tokens are basically the words spoken. It then repeats this process for the following word, using all of the same information to help it identify the next one that makes the most sense.
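The two ideas described here, 30-second chunking and token-by-token decoding, can be illustrated with a small sketch. It is not Whisper's actual implementation: `chunk_bounds`, `greedy_decode`, and `toy_next_token` are hypothetical names, and the "decoder" is a stand-in that simply replays a fixed sentence. The 16 kHz sample rate matches what Whisper uses for its input audio.

```python
from typing import Callable, List, Tuple

SAMPLE_RATE = 16_000   # Whisper works on audio resampled to 16 kHz
CHUNK_SECONDS = 30     # window length described in the text


def chunk_bounds(num_samples: int) -> List[Tuple[int, int]]:
    """Half-open [start, end) sample ranges covering the audio in 30-second windows."""
    size = SAMPLE_RATE * CHUNK_SECONDS
    return [(s, min(s + size, num_samples)) for s in range(0, num_samples, size)]


def greedy_decode(next_token: Callable[[List[str]], str], max_len: int = 20) -> List[str]:
    """Autoregressive loop: feed the tokens so far, append the predicted next one."""
    tokens: List[str] = []
    while len(tokens) < max_len:
        tok = next_token(tokens)
        if tok == "<eot>":   # end-of-transcript marker (name assumed for this sketch)
            break
        tokens.append(tok)
    return tokens


def toy_next_token(prefix: List[str]) -> str:
    """Hypothetical stand-in for the decoder: replays a fixed sentence."""
    sentence = ["less", "is", "more"]
    return sentence[len(prefix)] if len(prefix) < len(sentence) else "<eot>"


bounds = chunk_bounds(70 * SAMPLE_RATE)   # 70 seconds of audio
decoded = greedy_decode(toy_next_token)
print(bounds, decoded)
```

In the real model the next-token function is the transformer decoder attending to the encoder output for the current chunk, and the final, shorter chunk is padded to the full window length.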

 Whisper model task flowchart

Coding Example for Video Textual Summarization

 Flowchart of Textual Video Summarization

1 – Install and Load Libraries

!pip install yt-dlp openai-whisper hugchat
import yt_dlp
import whisper
from hugchat import hugchat

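2 – Download Audio from the YouTube Video

# Function for saving the audio track of a YouTube video, given its video id
def download(video_id: str) -> str:
    # Build the watch URL from the id (standard YouTube URL pattern assumed)
    video_url = f'https://www.youtube.com/watch?v={video_id}'
    ydl_opts = {
        'format': 'm4a/bestaudio/best',
        'paths': {'home': 'audio/'},
        'outtmpl': {'default': '%(id)s.%(ext)s'},
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'm4a',
        }]
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        error_code = ydl.download([video_url])
        if error_code != 0:
            raise Exception('Failed to download video')

    # The output template above names the file after the video id
    return f'audio/{video_id}.m4a'

# Call the function with the bare video id (no '&t=...' suffix, so the returned path matches the saved file)
file_path = download('A_JQK_k4Kyc')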

3 – Transcribe Audio to Text Using Whisper

# Load the Whisper model
whisper_model = whisper.load_model("tiny")

# Transcribe audio function
def transcribe(file_path: str) -> str:
  # `fp16` defaults to `True`, which tells the model to attempt to run on GPU.
  # Passing `fp16=False` uses FP32, which is what CPU execution needs.
  transcription = whisper_model.transcribe(file_path, fp16=False)
  return transcription['text']

# Call the transcriber function with the file path of the audio
transcript = transcribe('/content/audio/A_JQK_k4Kyc.m4a')

4 – Summarize Transcribed Text Using Hugging Chat

Note: to use the Hugging Chat API we need to log in or sign up on the Hugging Face platform. After that, in place of “username” and “password” we need to pass in our Hugging Face credentials.

from hugchat.login import Login

# Log in
sign = Login("username", "password")
cookies = sign.login()
sign.saveCookiesToDir("/content")

# Load cookies from the saved directory
cookies = sign.loadCookiesFromDir("/content") # This will detect whether the JSON file exists, return cookies if it does, and raise an Exception if it does not.

# Create a ChatBot
chatbot = hugchat.ChatBot(cookies=cookies.get_dict())  # or cookie_path="usercookies/<email>.json"

# Summarise the transcript
print(chatbot.chat('''Summarize the following :-''' + transcript))
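A long video can produce a transcript that is too large to send to a chat model in one prompt. One possible workaround, sketched below under assumptions, is to split the transcript into word-limited chunks and build one summarization prompt per chunk. The helper names `split_transcript` and `build_prompts` and the 500-word limit are hypothetical choices for illustration, not part of hugchat or a documented limit.

```python
from typing import List


def split_transcript(text: str, max_words: int = 500) -> List[str]:
    """Split text into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def build_prompts(text: str, max_words: int = 500) -> List[str]:
    """One summarization prompt per chunk, reusing the article's prompt wording."""
    return [f"Summarize the following :- {chunk}"
            for chunk in split_transcript(text, max_words)]


# Toy transcript (assumed): 1200 repeated words.
sample_transcript = "word " * 1200
prompts = build_prompts(sample_transcript)
print(len(prompts))
```

Each prompt could then be sent to the chatbot in turn, and the partial summaries joined and summarized once more to get a single overall summary.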


In conclusion, the concept of summarization is a transformative force in information management. It is a powerful tool that distills voluminous content into concise, meaningful forms, tailored to the fast-paced consumption of today’s world.

Through the integration of Generative AI models like OpenAI’s GPT-3, summarization has transcended its traditional boundaries, evolving into a process that not only extracts but also generates coherent and contextually accurate summaries.

The journey into video summarization unveils its relevance across diverse sectors. The implementation shows how audio extraction, transcription using Whisper, and summarization through Hugging Face Chat can be seamlessly integrated to create textual summaries of videos.

Key Takeaways

1. Generative AI: Video summarization can be achieved using generative AI technologies such as LLMs and ASR.

2. Applications in Fields: Video summarization is beneficial in many important fields where one has to analyze large amounts of video to mine crucial information.

3. Basic Implementation: In this article we explored a basic code implementation of video summarization based on the audio dimension.

4. Model Architecture: We also learnt about the basic architecture of the OpenAI Whisper model and its process flow.

Frequently Asked Questions

Q1. What are the limits of the Whisper API?

A. The Whisper API call limit is 50 requests per minute. There is no audio length limit, but only files up to 25 MB can be uploaded. One can reduce the file size of the audio by lowering its bitrate.
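As a back-of-the-envelope check, the 25 MB figure can be turned into a maximum audio bitrate for a given duration. The sketch below assumes 1 MB = 1024 × 1024 bytes and ignores container overhead, so treat the result as an upper bound rather than an exact target; the function names are hypothetical.

```python
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # 25 MB limit, taking 1 MB = 1024 * 1024 bytes (assumption)


def fits_upload_limit(size_bytes: int) -> bool:
    """True if a file of this size is within the upload limit."""
    return size_bytes <= MAX_UPLOAD_BYTES


def max_bitrate_kbps(duration_seconds: float) -> float:
    """Highest constant audio bitrate (kbit/s) that keeps the file within the limit."""
    return MAX_UPLOAD_BYTES * 8 / duration_seconds / 1000


one_hour_kbps = max_bitrate_kbps(3600)
print(round(one_hour_kbps, 1))
```

For a one-hour recording this works out to a little over 58 kbit/s, so re-encoding at a lower bitrate such as 48 kbit/s would keep the file under the limit.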

Q2. Which file formats does the Whisper API support?

A. The following file formats: m4a, mp3, webm, mp4, mpga, wav, mpeg.

Q3. What are the alternatives to the Whisper API?

A. Some of the major alternatives for Automatic Speech Recognition are Twilio Voice, Deepgram, Azure Speech-to-Text, and Google Cloud Speech-to-Text.

Q4. What are the limitations of an Automatic Speech Recognition (ASR) system?

A. One limitation is the difficulty of comprehending diverse accents of the same language; another is the need for specialized training for applications in specialized fields.

Q5. What are the alternatives to Automatic Speech Recognition (ASR)?

A. Advanced research is going on in the field of speech recognition, such as decoding imagined speech from EEG signals using neural architectures. This allows people with speech disabilities to communicate their thoughts to the outside world with the help of devices. One such interesting paper here.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

