ESPnet
multilingual
audio
codec
ftshijt committed
Commit 665aa6a (1 parent: 80fffef)

Update model

Files changed (29)
  1. README.md +343 -3
  2. exp/codec_train_soundstream4_fs44100_raw_fs44100/120epoch.pth +3 -0
  3. exp/codec_train_soundstream4_fs44100_raw_fs44100/config.yaml +268 -0
  4. exp/codec_train_soundstream4_fs44100_raw_fs44100/images/adv_loss.png +0 -0
  5. exp/codec_train_soundstream4_fs44100_raw_fs44100/images/codec_commit_loss.png +0 -0
  6. exp/codec_train_soundstream4_fs44100_raw_fs44100/images/codec_loss.png +0 -0
  7. exp/codec_train_soundstream4_fs44100_raw_fs44100/images/codec_quantization_loss.png +0 -0
  8. exp/codec_train_soundstream4_fs44100_raw_fs44100/images/discriminator_backward_time.png +0 -0
  9. exp/codec_train_soundstream4_fs44100_raw_fs44100/images/discriminator_forward_time.png +0 -0
  10. exp/codec_train_soundstream4_fs44100_raw_fs44100/images/discriminator_loss.png +0 -0
  11. exp/codec_train_soundstream4_fs44100_raw_fs44100/images/discriminator_optim_step_time.png +0 -0
  12. exp/codec_train_soundstream4_fs44100_raw_fs44100/images/discriminator_train_time.png +0 -0
  13. exp/codec_train_soundstream4_fs44100_raw_fs44100/images/fake_loss.png +0 -0
  14. exp/codec_train_soundstream4_fs44100_raw_fs44100/images/feat_match_loss.png +0 -0
  15. exp/codec_train_soundstream4_fs44100_raw_fs44100/images/generator_backward_time.png +0 -0
  16. exp/codec_train_soundstream4_fs44100_raw_fs44100/images/generator_forward_time.png +0 -0
  17. exp/codec_train_soundstream4_fs44100_raw_fs44100/images/generator_optim_step_time.png +0 -0
  18. exp/codec_train_soundstream4_fs44100_raw_fs44100/images/generator_train_time.png +0 -0
  19. exp/codec_train_soundstream4_fs44100_raw_fs44100/images/gpu_max_cached_mem_GB.png +0 -0
  20. exp/codec_train_soundstream4_fs44100_raw_fs44100/images/iter_time.png +0 -0
  21. exp/codec_train_soundstream4_fs44100_raw_fs44100/images/loss.png +0 -0
  22. exp/codec_train_soundstream4_fs44100_raw_fs44100/images/mel_loss.png +0 -0
  23. exp/codec_train_soundstream4_fs44100_raw_fs44100/images/mel_loss_real.png +0 -0
  24. exp/codec_train_soundstream4_fs44100_raw_fs44100/images/optim0_lr0.png +0 -0
  25. exp/codec_train_soundstream4_fs44100_raw_fs44100/images/optim1_lr0.png +0 -0
  26. exp/codec_train_soundstream4_fs44100_raw_fs44100/images/real_loss.png +0 -0
  27. exp/codec_train_soundstream4_fs44100_raw_fs44100/images/reconstruct_loss.png +0 -0
  28. exp/codec_train_soundstream4_fs44100_raw_fs44100/images/train_time.png +0 -0
  29. meta.yaml +8 -0
README.md CHANGED
@@ -1,3 +1,343 @@
- ---
- license: apache-2.0
- ---
+ ---
+ tags:
+ - espnet
+ - audio
+ - codec
+ language: multilingual
+ datasets:
+ - amuse
+ license: cc-by-4.0
+ ---
+
+ ## ESPnet2 Codec model
+
+ ### `espnet/amuse_soundstream_44.1k`
+
+ This model was trained by ftshijt using amuse recipe in [espnet](https://github.com/espnet/espnet/).
+
+ ### Demo: How to use in ESPnet2
+
+ Follow the [ESPnet installation instructions](https://espnet.github.io/espnet/installation.html)
+ if you haven't done that already.
+
+ ```bash
+ cd espnet
+ git checkout 5201685018b0e8fb9826bc51a710623140a06627
+ pip install -e .
+ cd egs2/amuse/codec1
+ ./run.sh --skip_data_prep false --skip_train true --download_model espnet/amuse_soundstream_44.1k
+ ```
+
+
+
+ ## Codec config
+
+ <details><summary>expand</summary>
+
+ ```
+ config: conf/train_soundstream4_fs44100.yaml
+ print_config: false
+ log_level: INFO
+ drop_last_iter: false
+ dry_run: false
+ iterator_type: chunk
+ valid_iterator_type: null
+ output_dir: exp/codec_train_soundstream4_fs44100_raw_fs44100
+ ngpu: 1
+ seed: 777
+ num_workers: 0
+ num_att_plot: 0
+ dist_backend: nccl
+ dist_init_method: env://
+ dist_world_size: 4
+ dist_rank: 0
+ local_rank: 0
+ dist_master_addr: localhost
+ dist_master_port: 53543
+ dist_launcher: null
+ multiprocessing_distributed: true
+ unused_parameters: true
+ sharded_ddp: false
+ cudnn_enabled: true
+ cudnn_benchmark: false
+ cudnn_deterministic: false
+ use_tf32: false
+ collect_stats: false
+ write_collected_feats: false
+ max_epoch: 120
+ patience: null
+ val_scheduler_criterion:
+ - valid
+ - loss
+ early_stopping_criterion:
+ - valid
+ - loss
+ - min
+ best_model_criterion:
+ -   - valid
+     - mel_loss
+     - min
+ -   - train
+     - total_count
+     - max
+ keep_nbest_models: 5
+ nbest_averaging_interval: 0
+ grad_clip: -1
+ grad_clip_type: 2.0
+ grad_noise: false
+ accum_grad: 1
+ no_forward_run: false
+ resume: true
+ train_dtype: float32
+ use_amp: false
+ log_interval: 1000
+ use_matplotlib: true
+ use_tensorboard: true
+ create_graph_in_tensorboard: false
+ use_wandb: false
+ wandb_project: null
+ wandb_id: null
+ wandb_entity: null
+ wandb_name: null
+ wandb_model_log_interval: -1
+ detect_anomaly: false
+ use_adapter: false
+ adapter: lora
+ save_strategy: all
+ adapter_conf: {}
+ pretrain_path: null
+ init_param: []
+ ignore_init_mismatch: false
+ freeze_param: []
+ num_iters_per_epoch: 5000
+ batch_size: 64
+ valid_batch_size: null
+ batch_bins: 1000000
+ valid_batch_bins: null
+ train_shape_file:
+ - exp/codec_stats_raw/train/audio_shape
+ valid_shape_file:
+ - exp/codec_stats_raw/valid/audio_shape
+ batch_type: unsorted
+ valid_batch_type: null
+ fold_length:
+ - 256000
+ sort_in_batch: descending
+ shuffle_within_batch: false
+ sort_batch: descending
+ multiple_iterator: false
+ chunk_length: 44544
+ chunk_shift_ratio: 0.5
+ num_cache_chunks: 128
+ chunk_excluded_key_prefixes: []
+ chunk_default_fs: null
+ train_data_path_and_name_and_type:
+ -   - dump/raw/train/wav.scp
+     - audio
+     - kaldi_ark
+ valid_data_path_and_name_and_type:
+ -   - dump/raw/dev-small/wav.scp
+     - audio
+     - kaldi_ark
+ multi_task_dataset: false
+ allow_variable_data_keys: false
+ max_cache_size: 0.0
+ max_cache_fd: 32
+ allow_multi_rates: false
+ valid_max_cache_size: null
+ exclude_weight_decay: false
+ exclude_weight_decay_conf: {}
+ optim: adam
+ optim_conf:
+     lr: 0.0001
+     betas:
+     - 0.5
+     - 0.9
+     eps: 1.0e-09
+     weight_decay: 0.0
+ scheduler: exponentiallr
+ scheduler_conf:
+     gamma: 0.999875
+ optim2: adam
+ optim2_conf:
+     lr: 0.0001
+     betas:
+     - 0.5
+     - 0.9
+     eps: 1.0e-09
+     weight_decay: 0.0
+ scheduler2: exponentiallr
+ scheduler2_conf:
+     gamma: 0.999875
+ generator_first: true
+ skip_discriminator_prob: 0.0
+ model_conf: {}
+ use_preprocessor: true
+ codec: soundstream
+ codec_conf:
+     sampling_rate: 44100
+     generator_params:
+         hidden_dim: 512
+         encdec_channels: 1
+         encdec_n_filters: 32
+         encdec_n_residual_layers: 3
+         encdec_ratios:
+         - 2
+         - 4
+         - 8
+         - 8
+         encdec_activation: ELU
+         encdec_activation_params:
+             alpha: 1.0
+         encdec_norm: weight_norm
+         encdec_kernel_size: 7
+         encdec_residual_kernel_size: 7
+         encdec_last_kernel_size: 7
+         encdec_dilation_base: 2
+         encdec_causal: false
+         encdec_pad_mode: reflect
+         encdec_true_skip: false
+         encdec_compress: 2
+         encdec_lstm: 2
+         decoder_trim_right_ratio: 1.0
+         decoder_final_activation: null
+         decoder_final_activation_params: null
+         quantizer_n_q: 32
+         quantizer_bins: 1024
+         quantizer_decay: 0.99
+         quantizer_kmeans_init: true
+         quantizer_kmeans_iters: 50
+         quantizer_threshold_ema_dead_code: 2
+         quantizer_target_bandwidth:
+         - 2
+         - 4
+         - 8
+         - 16
+         - 32
+         sample_rate: 44100
+     discriminator_params:
+         scales: 3
+         scale_downsample_pooling: AvgPool1d
+         scale_downsample_pooling_params:
+             kernel_size: 4
+             stride: 2
+             padding: 2
+         scale_discriminator_params:
+             in_channels: 1
+             out_channels: 1
+             kernel_sizes:
+             - 15
+             - 41
+             - 5
+             - 3
+             channels: 128
+             max_downsample_channels: 1024
+             max_groups: 16
+             bias: true
+             downsample_scales:
+             - 2
+             - 2
+             - 4
+             - 4
+             - 1
+             nonlinear_activation: LeakyReLU
+             nonlinear_activation_params:
+                 negative_slope: 0.1
+         scale_follow_official_norm: false
+         complexstft_discriminator_params:
+             in_channels: 1
+             channels: 32
+             strides:
+             -   - 1
+                 - 2
+             -   - 2
+                 - 2
+             -   - 1
+                 - 2
+             -   - 2
+                 - 2
+             -   - 1
+                 - 2
+             -   - 2
+                 - 2
+             chan_mults:
+             - 1
+             - 2
+             - 4
+             - 4
+             - 8
+             - 8
+             n_fft: 1024
+             hop_length: 256
+             win_length: 1024
+             stft_normalized: false
+     generator_adv_loss_params:
+         average_by_discriminators: false
+         loss_type: mse
+     discriminator_adv_loss_params:
+         average_by_discriminators: false
+         loss_type: mse
+     use_feat_match_loss: true
+     feat_match_loss_params:
+         average_by_discriminators: false
+         average_by_layers: false
+         include_final_outputs: true
+     use_mel_loss: true
+     mel_loss_params:
+         range_start: 5
+         range_end: 11
+         window: hann
+         n_mels: 40
+         fmin: 0
+         fmax: null
+         log_base: null
+         fs: 44100
+     lambda_quantization: 0.0
+     lambda_commit: 1.0
+     lambda_reconstruct: 1.0
+     lambda_adv: 1.0
+     lambda_mel: 45.0
+     lambda_feat_match: 2.0
+     cache_generator_outputs: true
+ required:
+ - output_dir
+ version: '202402'
+ distributed: true
+ ```
+
+ </details>
+
+
+
+ ### Citing ESPnet
+
+ ```BibTex
+ @inproceedings{watanabe2018espnet,
+   author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
+   title={{ESPnet}: End-to-End Speech Processing Toolkit},
+   year={2018},
+   booktitle={Proceedings of Interspeech},
+   pages={2207--2211},
+   doi={10.21437/Interspeech.2018-1456},
+   url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
+ }
+
+
+
+
+
+
+ ```
+
+ or arXiv:
+
+ ```bibtex
+ @misc{watanabe2018espnet,
+   title={ESPnet: End-to-End Speech Processing Toolkit},
+   author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
+   year={2018},
+   eprint={1804.00015},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
+ ```
exp/codec_train_soundstream4_fs44100_raw_fs44100/120epoch.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8155641a45fdf28aea6296272b3cd16d18db9fa4587104449b3e80e03786c5ba
+ size 342005066
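
The three added lines are a Git LFS pointer, not the checkpoint itself: the actual `120epoch.pth` is stored separately and identified by a SHA-256 digest and a byte size. The sketch below shows one way a downloaded checkpoint might be checked against those values. `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names introduced here for illustration; they are not part of Git LFS or ESPnet.

```python
import hashlib

# Pointer contents copied from the diff above.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:8155641a45fdf28aea6296272b3cd16d18db9fa4587104449b3e80e03786c5ba
size 342005066
"""


def parse_lfs_pointer(text):
    """Split each 'key value' line of a pointer file into a small dict."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the pointer's size and SHA-256 digest.

    Assumes the pointer uses sha256, as the one above does.
    """
    hasher = hashlib.sha256()
    n_bytes = 0
    with open(path, "rb") as f:
        # Read in chunks so a ~342 MB checkpoint is never held in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
            n_bytes += len(chunk)
    return n_bytes == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
```

Calling `matches_pointer("exp/codec_train_soundstream4_fs44100_raw_fs44100/120epoch.pth", pointer)` after a download would then indicate whether the file on disk is the one this commit refers to.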
exp/codec_train_soundstream4_fs44100_raw_fs44100/config.yaml ADDED
@@ -0,0 +1,268 @@
+ config: conf/train_soundstream4_fs44100.yaml
+ print_config: false
+ log_level: INFO
+ drop_last_iter: false
+ dry_run: false
+ iterator_type: chunk
+ valid_iterator_type: null
+ output_dir: exp/codec_train_soundstream4_fs44100_raw_fs44100
+ ngpu: 1
+ seed: 777
+ num_workers: 0
+ num_att_plot: 0
+ dist_backend: nccl
+ dist_init_method: env://
+ dist_world_size: 4
+ dist_rank: 0
+ local_rank: 0
+ dist_master_addr: localhost
+ dist_master_port: 53543
+ dist_launcher: null
+ multiprocessing_distributed: true
+ unused_parameters: true
+ sharded_ddp: false
+ cudnn_enabled: true
+ cudnn_benchmark: false
+ cudnn_deterministic: false
+ use_tf32: false
+ collect_stats: false
+ write_collected_feats: false
+ max_epoch: 120
+ patience: null
+ val_scheduler_criterion:
+ - valid
+ - loss
+ early_stopping_criterion:
+ - valid
+ - loss
+ - min
+ best_model_criterion:
+ -   - valid
+     - mel_loss
+     - min
+ -   - train
+     - total_count
+     - max
+ keep_nbest_models: 5
+ nbest_averaging_interval: 0
+ grad_clip: -1
+ grad_clip_type: 2.0
+ grad_noise: false
+ accum_grad: 1
+ no_forward_run: false
+ resume: true
+ train_dtype: float32
+ use_amp: false
+ log_interval: 1000
+ use_matplotlib: true
+ use_tensorboard: true
+ create_graph_in_tensorboard: false
+ use_wandb: false
+ wandb_project: null
+ wandb_id: null
+ wandb_entity: null
+ wandb_name: null
+ wandb_model_log_interval: -1
+ detect_anomaly: false
+ use_adapter: false
+ adapter: lora
+ save_strategy: all
+ adapter_conf: {}
+ pretrain_path: null
+ init_param: []
+ ignore_init_mismatch: false
+ freeze_param: []
+ num_iters_per_epoch: 5000
+ batch_size: 64
+ valid_batch_size: null
+ batch_bins: 1000000
+ valid_batch_bins: null
+ train_shape_file:
+ - exp/codec_stats_raw/train/audio_shape
+ valid_shape_file:
+ - exp/codec_stats_raw/valid/audio_shape
+ batch_type: unsorted
+ valid_batch_type: null
+ fold_length:
+ - 256000
+ sort_in_batch: descending
+ shuffle_within_batch: false
+ sort_batch: descending
+ multiple_iterator: false
+ chunk_length: 44544
+ chunk_shift_ratio: 0.5
+ num_cache_chunks: 128
+ chunk_excluded_key_prefixes: []
+ chunk_default_fs: null
+ train_data_path_and_name_and_type:
+ -   - dump/raw/train/wav.scp
+     - audio
+     - kaldi_ark
+ valid_data_path_and_name_and_type:
+ -   - dump/raw/dev-small/wav.scp
+     - audio
+     - kaldi_ark
+ multi_task_dataset: false
+ allow_variable_data_keys: false
+ max_cache_size: 0.0
+ max_cache_fd: 32
+ allow_multi_rates: false
+ valid_max_cache_size: null
+ exclude_weight_decay: false
+ exclude_weight_decay_conf: {}
+ optim: adam
+ optim_conf:
+     lr: 0.0001
+     betas:
+     - 0.5
+     - 0.9
+     eps: 1.0e-09
+     weight_decay: 0.0
+ scheduler: exponentiallr
+ scheduler_conf:
+     gamma: 0.999875
+ optim2: adam
+ optim2_conf:
+     lr: 0.0001
+     betas:
+     - 0.5
+     - 0.9
+     eps: 1.0e-09
+     weight_decay: 0.0
+ scheduler2: exponentiallr
+ scheduler2_conf:
+     gamma: 0.999875
+ generator_first: true
+ skip_discriminator_prob: 0.0
+ model_conf: {}
+ use_preprocessor: true
+ codec: soundstream
+ codec_conf:
+     sampling_rate: 44100
+     generator_params:
+         hidden_dim: 512
+         encdec_channels: 1
+         encdec_n_filters: 32
+         encdec_n_residual_layers: 3
+         encdec_ratios:
+         - 2
+         - 4
+         - 8
+         - 8
+         encdec_activation: ELU
+         encdec_activation_params:
+             alpha: 1.0
+         encdec_norm: weight_norm
+         encdec_kernel_size: 7
+         encdec_residual_kernel_size: 7
+         encdec_last_kernel_size: 7
+         encdec_dilation_base: 2
+         encdec_causal: false
+         encdec_pad_mode: reflect
+         encdec_true_skip: false
+         encdec_compress: 2
+         encdec_lstm: 2
+         decoder_trim_right_ratio: 1.0
+         decoder_final_activation: null
+         decoder_final_activation_params: null
+         quantizer_n_q: 32
+         quantizer_bins: 1024
+         quantizer_decay: 0.99
+         quantizer_kmeans_init: true
+         quantizer_kmeans_iters: 50
+         quantizer_threshold_ema_dead_code: 2
+         quantizer_target_bandwidth:
+         - 2
+         - 4
+         - 8
+         - 16
+         - 32
+         sample_rate: 44100
+     discriminator_params:
+         scales: 3
+         scale_downsample_pooling: AvgPool1d
+         scale_downsample_pooling_params:
+             kernel_size: 4
+             stride: 2
+             padding: 2
+         scale_discriminator_params:
+             in_channels: 1
+             out_channels: 1
+             kernel_sizes:
+             - 15
+             - 41
+             - 5
+             - 3
+             channels: 128
+             max_downsample_channels: 1024
+             max_groups: 16
+             bias: true
+             downsample_scales:
+             - 2
+             - 2
+             - 4
+             - 4
+             - 1
+             nonlinear_activation: LeakyReLU
+             nonlinear_activation_params:
+                 negative_slope: 0.1
+         scale_follow_official_norm: false
+         complexstft_discriminator_params:
+             in_channels: 1
+             channels: 32
+             strides:
+             -   - 1
+                 - 2
+             -   - 2
+                 - 2
+             -   - 1
+                 - 2
+             -   - 2
+                 - 2
+             -   - 1
+                 - 2
+             -   - 2
+                 - 2
+             chan_mults:
+             - 1
+             - 2
+             - 4
+             - 4
+             - 8
+             - 8
+             n_fft: 1024
+             hop_length: 256
+             win_length: 1024
+             stft_normalized: false
+     generator_adv_loss_params:
+         average_by_discriminators: false
+         loss_type: mse
+     discriminator_adv_loss_params:
+         average_by_discriminators: false
+         loss_type: mse
+     use_feat_match_loss: true
+     feat_match_loss_params:
+         average_by_discriminators: false
+         average_by_layers: false
+         include_final_outputs: true
+     use_mel_loss: true
+     mel_loss_params:
+         range_start: 5
+         range_end: 11
+         window: hann
+         n_mels: 40
+         fmin: 0
+         fmax: null
+         log_base: null
+         fs: 44100
+     lambda_quantization: 0.0
+     lambda_commit: 1.0
+     lambda_reconstruct: 1.0
+     lambda_adv: 1.0
+     lambda_mel: 45.0
+     lambda_feat_match: 2.0
+     cache_generator_outputs: true
+ required:
+ - output_dir
+ version: '202402'
+ distributed: true
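
Several values in this config are related by simple arithmetic, which the sketch below works through. It rests on two assumptions that the commit does not state: that `encdec_ratios` are per-stage strides whose product is the encoder hop size, and that each residual quantizer emits `log2(quantizer_bins)` bits per frame. The variable names are illustrative only.

```python
import math

# Values copied from config.yaml above.
sampling_rate = 44100
encdec_ratios = [2, 4, 8, 8]
chunk_length = 44544
quantizer_bins = 1024
quantizer_n_q = 32

# Assumption: the product of the stride ratios is the hop size in samples.
hop_size = math.prod(encdec_ratios)

# Latent frames per second of audio, and per training chunk.
frame_rate = sampling_rate / hop_size
frames_per_chunk = chunk_length / hop_size

# Assumption: one codebook index costs log2(bins) bits per frame.
bits_per_code = math.log2(quantizer_bins)
max_kbps = frame_rate * bits_per_code * quantizer_n_q / 1000

print(f"hop size:          {hop_size} samples")
print(f"frame rate:        {frame_rate:.4f} frames/s")
print(f"frames per chunk:  {frames_per_chunk:g}")
print(f"bits per code:     {bits_per_code:g}")
print(f"all {quantizer_n_q} codebooks:  {max_kbps:.4f} kbps")
```

Under those assumptions the hop is 512 samples, the frame rate is about 86.13 frames per second, `chunk_length` covers exactly 87 frames, and using all 32 codebooks corresponds to roughly 27.56 kbps. How ESPnet maps the `quantizer_target_bandwidth` values onto a number of active codebooks is not shown in this commit and is not modeled here.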
exp/codec_train_soundstream4_fs44100_raw_fs44100/images/adv_loss.png ADDED
exp/codec_train_soundstream4_fs44100_raw_fs44100/images/codec_commit_loss.png ADDED
exp/codec_train_soundstream4_fs44100_raw_fs44100/images/codec_loss.png ADDED
exp/codec_train_soundstream4_fs44100_raw_fs44100/images/codec_quantization_loss.png ADDED
exp/codec_train_soundstream4_fs44100_raw_fs44100/images/discriminator_backward_time.png ADDED
exp/codec_train_soundstream4_fs44100_raw_fs44100/images/discriminator_forward_time.png ADDED
exp/codec_train_soundstream4_fs44100_raw_fs44100/images/discriminator_loss.png ADDED
exp/codec_train_soundstream4_fs44100_raw_fs44100/images/discriminator_optim_step_time.png ADDED
exp/codec_train_soundstream4_fs44100_raw_fs44100/images/discriminator_train_time.png ADDED
exp/codec_train_soundstream4_fs44100_raw_fs44100/images/fake_loss.png ADDED
exp/codec_train_soundstream4_fs44100_raw_fs44100/images/feat_match_loss.png ADDED
exp/codec_train_soundstream4_fs44100_raw_fs44100/images/generator_backward_time.png ADDED
exp/codec_train_soundstream4_fs44100_raw_fs44100/images/generator_forward_time.png ADDED
exp/codec_train_soundstream4_fs44100_raw_fs44100/images/generator_optim_step_time.png ADDED
exp/codec_train_soundstream4_fs44100_raw_fs44100/images/generator_train_time.png ADDED
exp/codec_train_soundstream4_fs44100_raw_fs44100/images/gpu_max_cached_mem_GB.png ADDED
exp/codec_train_soundstream4_fs44100_raw_fs44100/images/iter_time.png ADDED
exp/codec_train_soundstream4_fs44100_raw_fs44100/images/loss.png ADDED
exp/codec_train_soundstream4_fs44100_raw_fs44100/images/mel_loss.png ADDED
exp/codec_train_soundstream4_fs44100_raw_fs44100/images/mel_loss_real.png ADDED
exp/codec_train_soundstream4_fs44100_raw_fs44100/images/optim0_lr0.png ADDED
exp/codec_train_soundstream4_fs44100_raw_fs44100/images/optim1_lr0.png ADDED
exp/codec_train_soundstream4_fs44100_raw_fs44100/images/real_loss.png ADDED
exp/codec_train_soundstream4_fs44100_raw_fs44100/images/reconstruct_loss.png ADDED
exp/codec_train_soundstream4_fs44100_raw_fs44100/images/train_time.png ADDED
meta.yaml ADDED
@@ -0,0 +1,8 @@
+ espnet: '202402'
+ files:
+   model_file: exp/codec_train_soundstream4_fs44100_raw_fs44100/120epoch.pth
+ python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
+ timestamp: 1719515012.822816
+ torch: 2.3.0+cu118
+ yaml_files:
+   train_config: exp/codec_train_soundstream4_fs44100_raw_fs44100/config.yaml
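
The `timestamp` field in `meta.yaml` looks like a Unix epoch value in seconds; that reading is an assumption, since the file does not document its units. A minimal sketch of converting it to a readable UTC time:

```python
from datetime import datetime, timezone

# Value of `timestamp` in meta.yaml; assumed to be seconds since the Unix epoch.
timestamp = 1719515012.822816

# Convert with an explicit UTC timezone so the result does not depend on the local machine.
packed_at = datetime.fromtimestamp(timestamp, tz=timezone.utc)
print(packed_at.isoformat())
```

Read this way, the value falls on 27 June 2024 (UTC), which is consistent with the Python build date and the ESPnet version `202402` recorded in the same file.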