israelisraeli committed · Commit 7a64aed · 1 Parent(s): 8f383ba

Create README.md

---
language:
- he
base_model:
- ivrit-ai/whisper-large-v3-turbo-d4-p1-take2
pipeline_tag: automatic-speech-recognition
tags:
- faster-whisper
---

# ivrit-faster-whisper-turbo-d4

This model is a conversion of the **ivrit-ai/whisper-large-v3-turbo-d4-p1-take2** model to the [**Faster-Whisper**](https://github.com/guillaumekln/faster-whisper) format, offering significantly faster inference times.

### Model Overview

- **Base Model**: [ivrit-ai/whisper-large-v3-turbo-d4-p1-take2](https://huggingface.co/ivrit-ai/whisper-large-v3-turbo-d4-p1-take2)
- **Converted to**: Faster-Whisper (for faster ASR with minimal performance loss)
- **Language**: Hebrew (`he`)
- **Quantization**: Float32

All credit goes to **ivrit-ai** for developing the original Whisper model.

## How to Use the Model

To use the model in your projects, follow the steps below to load and transcribe audio:

```python
# Import the required modules
import json

import faster_whisper

# Load the model from Hugging Face
model = faster_whisper.WhisperModel("israelisraeli/ivrit-faster-whisper-turbo-d4", device="cuda")

# Transcribe the audio file; transcribe() returns a lazy generator of segments
segs, _ = model.transcribe("AUDIOFILE_efiTheTigger.mp3", language="he")

# Collect the segments into a list of dictionaries with timestamps and text
transcribed_segments_with_timestamps = [
    {"start": s.start, "end": s.end, "text": s.text} for s in segs
]

# Save the result to a JSON file
with open("transcribed_segments_with_timestamps.json", "w", encoding="utf-8") as json_file:
    json.dump(
        transcribed_segments_with_timestamps, json_file, ensure_ascii=False, indent=4
    )

print("Transcription saved to transcribed_segments_with_timestamps.json")
```
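Since each segment already carries its start and end times, the list of dictionaries produced above can be rendered into other subtitle formats as well. As an illustration, here is a small helper for turning such a list into SRT subtitles; it is not part of the model card, and the function names are my own:

```python
def to_srt_timestamp(seconds: float) -> str:
    # Convert float seconds to the SRT HH:MM:SS,mmm format
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{ms:03}"


def segments_to_srt(segments: list[dict]) -> str:
    # Render a list of {"start", "end", "text"} dicts as an SRT document
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{to_srt_timestamp(seg['start'])} --> {to_srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```

Feeding `transcribed_segments_with_timestamps` from the snippet above into `segments_to_srt` yields text that most media players accept as a subtitle track.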

## Conversion process

### Tokenizer Conversion

```python
from transformers import AutoTokenizer

# Load the tokenizer from the original Whisper model files
tokenizer_directory = "path_to_whisper_model_files"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_directory)

# Save the tokenizer into a single tokenizer.json file
tokenizer.save_pretrained("path_to_save_directory", legacy_format=False)
```

### Model Conversion to Faster-Whisper

To convert the original [ivrit-ai/whisper-large-v3-turbo-d4-p1-take2](https://huggingface.co/ivrit-ai/whisper-large-v3-turbo-d4-p1-take2) model to the Faster-Whisper format, I used the CTranslate2 library. The following command was used for the conversion:

```bash
ct2-transformers-converter \
    --model ./whisper-large-v3-turbo-d4-p1-take2 \
    --output_dir ./ivrit-faster-whisper-turbo-d4 \
    --copy_files tokenizer.json preprocessor_config.json
```
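After the conversion finishes, it is worth checking that the output directory contains everything faster-whisper needs before loading it. The sketch below is my own sanity check, and the file list is an assumption based on typical CTranslate2 converter output (`model.bin` and `config.json` are written by the converter; the other two are the files copied via `--copy_files`):

```python
from pathlib import Path

# Files a converted Faster-Whisper directory is expected to contain
# (assumed typical CTranslate2 output plus the files copied by --copy_files)
EXPECTED_FILES = ["model.bin", "config.json", "tokenizer.json", "preprocessor_config.json"]


def missing_files(model_dir: str) -> list[str]:
    # Return the expected files that are absent from the converted model directory
    directory = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (directory / name).is_file()]
```

If `missing_files("./ivrit-faster-whisper-turbo-d4")` returns an empty list, the directory can be passed directly to `faster_whisper.WhisperModel`.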