Description
This Gradio application converts video files into SubRip subtitle (SRT) files: it extracts the audio track from an uploaded video and transcribes the speech with an automatic speech recognition model. It is intended for users who want captions or transcripts for their video content.
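As a rough illustration of the upload-and-download interface, here is a minimal sketch assuming a video input component and a file output component. The names make_handler, build_interface, and video_to_srt_text are hypothetical and not taken from the application's source; the conversion function is passed in as a parameter so that the sketch does not presume how the real app produces its subtitles.

```python
import os
import tempfile
from typing import Callable


def make_handler(video_to_srt_text: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a 'video path -> SRT text' function so it returns a file path.

    video_to_srt_text is a hypothetical callable supplied by the caller.
    """

    def handler(video_path: str) -> str:
        srt_text = video_to_srt_text(video_path)
        # Name the output after the uploaded video, e.g. clip.mp4 -> clip.srt.
        stem = os.path.splitext(os.path.basename(video_path))[0]
        out_path = os.path.join(tempfile.mkdtemp(), stem + ".srt")
        with open(out_path, "w", encoding="utf-8") as f:
            f.write(srt_text)
        return out_path

    return handler


def build_interface(video_to_srt_text: Callable[[str], str]):
    """Build a Gradio interface: video upload in, SRT file download out."""
    import gradio as gr  # imported lazily so make_handler works without Gradio

    return gr.Interface(
        fn=make_handler(video_to_srt_text),
        inputs=gr.Video(label="Video"),
        outputs=gr.File(label="Subtitles (SRT)"),
        title="Video to SRT",
    )
```

Calling build_interface(some_function).launch() would start the web UI. Gradio passes the uploaded video to the handler as a file path and offers the returned path as a download.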
Model Inference
This application uses the Whisper large model for automatic speech recognition (ASR). The model is accessed through Hugging Face’s Transformers library using the pipeline function, which simplifies the task of running the model. It transcribes the speech in the audio extracted from the video into text.
Whisper is hosted and distributed via Hugging Face, but the model itself was originally developed by OpenAI.
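The pipeline call described above might look roughly like the following sketch. The checkpoint name openai/whisper-large is an assumption (the text only says "Whisper large"), and the helper names load_whisper and transcribe are hypothetical. The chunk_length_s and return_timestamps settings are one plausible configuration, not necessarily the one the app uses.

```python
from typing import Any, Callable, Dict, List, Tuple


def load_whisper(model_id: str = "openai/whisper-large") -> Callable[..., Dict[str, Any]]:
    """Build an ASR pipeline; model weights are downloaded on first use.

    The default model_id is an assumed checkpoint name.
    """
    from transformers import pipeline  # lazy import: heavy dependency

    # chunk_length_s makes the pipeline process long audio in 30-second windows.
    return pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,
    )


def transcribe(asr: Callable[..., Dict[str, Any]], audio_path: str) -> Tuple[str, List[Dict[str, Any]]]:
    """Run the pipeline on an audio file and return (full text, timed chunks)."""
    result = asr(audio_path, return_timestamps=True)
    # With return_timestamps=True the result carries a "chunks" list whose
    # items look like {"timestamp": (start, end), "text": "..."}.
    return result["text"].strip(), result.get("chunks", [])
```

Typical use would be asr = load_whisper() followed by text, chunks = transcribe(asr, "audio.wav"). Keeping the loader separate means the model is built once and reused across requests.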
Tools Used
- Gradio: Used to create an interactive interface for uploading video files and downloading the generated SRT subtitle file.
- MoviePy: Utilized to extract the audio track from the video file and convert it into a WAV file. It handles video file reading and audio processing to ensure the audio can be passed to the speech recognition model.
- Hugging Face Transformers (Whisper Model): The model is accessed through Hugging Face’s pipeline for automatic speech recognition tasks.
- OS: The os module is used for file management, including removing temporary files like the extracted WAV audio file after the transcription is complete.
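The audio extraction and cleanup roles of MoviePy and os listed above could be sketched as follows. The function names extract_audio_to_wav and remove_if_exists are hypothetical, and the two import paths are there because MoviePy changed its import layout between major versions. The app's actual code may differ.

```python
import os
import tempfile


def extract_audio_to_wav(video_path: str, wav_path: str) -> str:
    """Write the video's audio track to wav_path using MoviePy."""
    try:
        from moviepy import VideoFileClip  # MoviePy 2.x layout
    except ImportError:
        from moviepy.editor import VideoFileClip  # MoviePy 1.x layout

    clip = VideoFileClip(video_path)
    try:
        if clip.audio is None:
            raise ValueError("video has no audio track")
        # The .wav extension lets MoviePy pick a suitable audio codec.
        clip.audio.write_audiofile(wav_path)
    finally:
        clip.close()  # release the underlying ffmpeg reader
    return wav_path


def remove_if_exists(path: str) -> bool:
    """Delete a temporary file; return True if something was removed."""
    if os.path.exists(path):
        os.remove(path)
        return True
    return False
```

A caller would typically run the extraction and transcription inside try and call remove_if_exists(wav_path) in finally, so the temporary WAV file is deleted even if transcription fails.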
Process Overview
- User Uploads Video: The user uploads a video file (e.g., MP4, AVI, MKV) via the Gradio interface.
- Extract Audio: The video file is processed with MoviePy to extract the audio track and convert it to a WAV file.
- Speech Recognition: The Whisper model is used to transcribe the extracted audio into text.
- Generate SRT File: The transcribed text is formatted into an SRT file with basic timestamps for subtitles.
- Download SRT: The user can download the generated subtitle file in SRT format via the Gradio interface.
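The SRT generation step can be illustrated with a small formatter. This is a sketch under assumptions: it expects timed chunks shaped like {"timestamp": (start, end), "text": ...}, which is what the Transformers ASR pipeline returns when timestamps are requested, and it does not reproduce the app's own "basic timestamps" logic, which is not shown here. The names format_timestamp, chunks_to_srt, and fallback_duration are hypothetical.

```python
from typing import Any, Dict, List


def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT time form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks: List[Dict[str, Any]], fallback_duration: float = 2.0) -> str:
    """Render timed chunks as SRT: index, time range, text, blank line."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # Assumption: if a chunk has no end time, show it for a fixed duration.
            end = start + fallback_duration
        time_range = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{index}\n{time_range}\n{chunk['text'].strip()}\n")
    return "\n".join(blocks)
```

For example, format_timestamp(3661.5) returns "01:01:01,500". The returned string can be written to a .srt file with UTF-8 encoding and handed to the download component.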