What Whisper is and how to get started with it.
What is Whisper?
OpenAI recently released Whisper, a 1.6-billion-parameter AI model that can transcribe and translate speech in 97 different languages. An essential feature of Whisper is the variety of data used to train it: 680,000 hours of multilingual and multitask supervised data collected from the web. Non-English audio accounts for a third of the training data.
Whisper uses an encoder-decoder Transformer architecture and processes audio in 30-second chunks. Unlike other state-of-the-art ASR models, Whisper is not fine-tuned on a benchmark dataset; instead, it is trained with weak supervision on a large-scale, noisy dataset of speech audio and paired transcripts collected from the internet. In an evaluation across a set of speech recognition datasets, Whisper made 55% fewer errors than wav2vec 2.0, a baseline model.
A Transformer sequence-to-sequence model is trained on several speech-processing tasks, including multilingual speech recognition, speech translation, spoken-language identification, and voice-activity detection. Because all of these tasks are jointly represented as a sequence of tokens for the decoder to predict, a single model can replace several stages of a conventional speech-processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or classification targets.
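To make this concrete, the official repository includes a lower-level usage example that runs these stages explicitly: it pads or trims the audio to Whisper's 30-second window, builds a log-Mel spectrogram, detects the spoken language, and then decodes the text, all with the same model. A lightly commented version, adapted from the README, is shown below.

import whisper

# load one of the pre-trained checkpoints
model = whisper.load_model("base")

# load the audio and pad/trim it to fit Whisper's 30-second window
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# compute the log-Mel spectrogram and move it to the model's device
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# decode the audio into text
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
print(result.text)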
Image credit: OpenAI
How to get started with Whisper?
OpenAI has come under fire for not open-sourcing its models. GPT-3 and DALL-E, two of OpenAI's most impressive deep learning models, are available only through subscription API services and cannot be downloaded or inspected. Whisper, by contrast, has been released as a pre-trained, open-source model that anyone can download and run on the computing platform of their choice. So let's get started.
Whisper comes in five model sizes, four of which also have English-only versions, offering trade-offs between speed and accuracy. The names of the available models are listed below, along with their approximate memory requirements and relative speeds.
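Per the official repository, the available models are:

Size     Parameters   English-only model   Multilingual model   Required VRAM   Relative speed
tiny     39 M         tiny.en              tiny                 ~1 GB           ~32x
base     74 M         base.en              base                 ~1 GB           ~16x
small    244 M        small.en             small                ~2 GB           ~6x
medium   769 M        medium.en            medium               ~5 GB           ~2x
large    1550 M       N/A                  large                ~10 GB          1x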
Whisper's models were trained and tested with Python 3.9.9 and PyTorch 1.10.1, but the codebase is expected to be compatible with Python 3.7 or later and recent PyTorch versions. A few Python libraries are also required, most notably HuggingFace Transformers for its fast tokenizer implementation and ffmpeg-python for reading audio files. So make sure you have Python 3.7 or a later version installed.
After installing Python, open a terminal and enter the following command. It will download and install the most recent commit from the official Whisper GitHub repository, along with any required Python dependencies.
pip install git+https://github.com/openai/whisper.git
It is also necessary to have the command-line utility ffmpeg installed on your system.
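The installation command depends on your platform; for example, per the project's README:

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on macOS using Homebrew
brew install ffmpeg

# on Windows using Chocolatey
choco install ffmpeg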
Now open a terminal and type the following command to transcribe speech from an audio file (replace "audio.mp3" with the path to your audio file). You can also use a model other than medium to suit your requirements, as shown after the command below.
whisper audio.mp3 --model medium
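For example, if accuracy matters more than speed, you can pick the large multilingual model, or a faster English-only checkpoint such as base.en:

whisper audio.mp3 --model large
whisper audio.mp3 --model base.en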
The default configuration (which selects the small model) works well for English transcription. To transcribe an audio file containing non-English speech, use the --language option to specify the language.
whisper malayalam.wav --language Malayalam
You can also use Whisper to translate audio into English with the command below.
whisper malayalam.wav --language Malayalam --task translate
You can also use Whisper from your Python script, as shown below.
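Here is a minimal example, following the usage shown in the official README. Internally, transcribe() reads the whole file and processes the audio with a sliding 30-second window, so it is not limited to short clips. The CLI's language and task options can also be passed as keyword arguments; the last two lines are a sketch of that, assuming the same Malayalam file as above.

import whisper

# load a checkpoint; any of the model names from the table above works
model = whisper.load_model("base")

# transcribe an audio file; transcribe() handles chunking internally
result = model.transcribe("audio.mp3")
print(result["text"])

# the CLI options map to keyword arguments, e.g. translating to English
result = model.transcribe("malayalam.wav", language="ml", task="translate")
print(result["text"])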
It's that easy. So what are you waiting for? Go ahead and try it for yourself; it's not every day that OpenAI releases an open-source model. :-)
References
https://github.com/openai/whisper
https://openai.com/blog/whisper/