Update README

Changed files: README.md (+8, -2), docs/options.md (+20, -12)

README.md:
````diff
@@ -76,6 +76,12 @@ cores (up to 8):
 python app.py --input_audio_max_duration -1 --auto_parallel True
 ```
 
+### Multiple Files
+
+You can upload multiple files either through the "Upload files" option, or as a playlist on YouTube.
+Each audio file will then be processed in turn, and the resulting SRT/VTT/Transcript will be made available in the "Download" section.
+When more than one file is processed, the UI will also generate an "All_Output" zip file containing all the text output files.
+
 # Docker
 
 To run it in Docker, first install Docker and optionally the NVIDIA Container Toolkit in order to use the GPU.
@@ -109,7 +115,7 @@ You can also pass custom arguments to `app.py` in the Docker container, for inst
 sudo docker run -d --gpus all -p 7860:7860 \
 --mount type=bind,source=/home/administrator/.cache/whisper,target=/root/.cache/whisper \
 --restart=on-failure:15 registry.gitlab.com/aadnk/whisper-webui:latest \
-app.py --input_audio_max_duration -1 --server_name 0.0.0.0 \
+app.py --input_audio_max_duration -1 --server_name 0.0.0.0 --auto_parallel True \
 --default_vad silero-vad --default_model_name large
 ```
 
@@ -119,7 +125,7 @@ sudo docker run --gpus all \
 --mount type=bind,source=/home/administrator/.cache/whisper,target=/root/.cache/whisper \
 --mount type=bind,source=${PWD},target=/app/data \
 registry.gitlab.com/aadnk/whisper-webui:latest \
-cli.py --model large \
+cli.py --model large --auto_parallel True --vad silero-vad \
 --output_dir /app/data /app/data/YOUR-FILE-HERE.mp4
 ```
````
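For reference, the same arguments apply outside Docker as well; a quick sketch based only on the flags shown in this commit (run `python app.py --help` or `python cli.py --help` to verify them on your checkout):

```bash
# Web UI: expose on all interfaces with parallel execution and the new defaults
python app.py --input_audio_max_duration -1 --server_name 0.0.0.0 \
  --auto_parallel True --default_vad silero-vad --default_model_name large

# Batch CLI: same model/VAD settings against a local file
# (./output is an arbitrary example path)
python cli.py --model large --auto_parallel True --vad silero-vad \
  --output_dir ./output YOUR-FILE-HERE.mp4
```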
docs/options.md:
````diff
@@ -3,18 +3,19 @@ To transcribe or translate an audio file, you can either copy an URL from a webs
 supported by YT-DLP will work, including YouTube). Otherwise, upload an audio file (choose "All Files (*.*)"
 in the file selector to select any file type, including video files) or use the microphone.
 
-For longer audio files (>10 minutes), it is recommended that you select Silero VAD (Voice Activity Detector) in the VAD option.
+For longer audio files (>10 minutes), it is recommended that you select Silero VAD (Voice Activity Detector) in the VAD option, especially if you are using the `large-v1` model. Note that `large-v2` is a lot more forgiving, but you may still want to use a VAD with a slightly higher "VAD - Max Merge Size (s)" (60 seconds or more).
 
 ## Model
 Select the model that Whisper will use to transcribe the audio:
 
-| Size […]
-|-----[…]
-| tiny […]
-| base […]
-| small […]
-| medium […]
-| large […]
+| Size     | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
+|----------|------------|--------------------|--------------------|---------------|----------------|
+| tiny     | 39 M       | tiny.en            | tiny               | ~1 GB         | ~32x           |
+| base     | 74 M       | base.en            | base               | ~1 GB         | ~16x           |
+| small    | 244 M      | small.en           | small               | ~2 GB         | ~6x            |
+| medium   | 769 M      | medium.en          | medium              | ~5 GB         | ~2x            |
+| large    | 1550 M     | N/A                | large               | ~10 GB        | 1x             |
+| large-v2 | 1550 M     | N/A                | large               | ~10 GB        | 1x             |
 
 ## Language
 
````
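The table gives the practical trade-off: moving down it costs relative speed (~32x down to 1x) and VRAM (~1 GB up to ~10 GB), and the `large`/`large-v2` checkpoints have no English-only variant. A sketch of applying this from the command line, using flags that appear elsewhere in this commit; `--vad_max_merge_size` is an assumed flag mirroring the "VAD - Max Merge Size (s)" UI option, so confirm it with `python cli.py --help`:

```bash
# ~2 GB of VRAM: "small" is the largest comfortable fit per the table
python cli.py --model small --vad silero-vad YOUR-FILE-HERE.mp4

# large-v2 with a higher max merge size (60s+), per the VAD note above.
# NOTE: --vad_max_merge_size is an assumption, not confirmed by this commit.
python cli.py --model large-v2 --vad silero-vad --vad_max_merge_size 60 YOUR-FILE-HERE.mp4
```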
````diff
@@ -24,10 +25,12 @@ Note that if the selected language and the language in the audio differs, Whispe
 language. For instance, if the audio is in English but you select Japanese, the model may translate the audio to Japanese.
 
 ## Inputs
-The options "URL (YouTube, etc.)", "Upload […]
+The options "URL (YouTube, etc.)", "Upload Files" or "Microphone Input" allow you to send an audio input to the model.
 
-[…]
-[…] the URL.
+### Multiple Files
+Note that the UI will only process either the given URL or the uploaded files (including microphone) - not both.
+
+But you can upload multiple files either through the "Upload files" option, or as a playlist on YouTube. Each audio file will then be processed in turn, and the resulting SRT/VTT/Transcript will be made available in the "Download" section. When more than one file is processed, the UI will also generate an "All_Output" zip file containing all the text output files.
 
 ## Task
 Select the task - either "transcribe" to transcribe the audio to text, or "translate" to translate it to English.
````
````diff
@@ -75,4 +78,9 @@ number of seconds after the line has finished. For instance, if a line ends at 1
 10:04, the line's text will be included if the prompt window is 4 seconds or more (10:04 - 10:00 = 4 seconds).
 
 Note that detected lines in gaps between speech sections will not be included in the prompt
-(if silero-vad or silero-vad-expand-into-gaps) is used.
+(if silero-vad or silero-vad-expand-into-gaps is used).
+
+# Command Line Options
+
+Both `app.py` and `cli.py` also accept command line options, such as the ability to enable parallel execution on multiple
+CPU/GPU cores, the default model name/VAD and so on. Consult the README in the root folder for more information.
````
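The gap-handling note above depends on which VAD mode is selected; both mode names in that sentence look like values of the `--vad` flag used earlier in this commit (a sketch with value names taken from the doc text; verify against `python cli.py --help`):

```bash
# Default Silero VAD mode (gap lines are excluded from the prompt, per the note)
python cli.py --model large --vad silero-vad YOUR-FILE-HERE.mp4

# The other mode named in the note: expand speech sections into the gaps
python cli.py --model large --vad silero-vad-expand-into-gaps YOUR-FILE-HERE.mp4
```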