leezhuuu/GLM-4-Voice

leezhuuu committed on Oct 25

Commit e663533 • 1 Parent(s): 22cf3e2

Upload 7 files

Files changed (7)

LICENSE +201 -0
README.md +105 -10
README_en.md +97 -0
flow_inference.py +142 -0
model_server.py +116 -0
requirements.txt +36 -0
web_demo.py +257 -0

LICENSE ADDED

	@@ -0,0 +1,201 @@

+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+   1. Definitions.
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+   END OF TERMS AND CONDITIONS
+   APPENDIX: How to apply the Apache License to your work.
+      To apply the Apache License to your work, attach the following
+      boilerplate notice, with the fields enclosed by brackets "[]"
+      replaced with your own identifying information. (Don't include
+      the brackets!)  The text should be enclosed in the appropriate
+      comment syntax for the file format. We also recommend that a
+      file or class name and description of purpose be included on the
+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+   Copyright 2024 GLM-4-Voice Model Team @ Zhipu AI
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+       http://www.apache.org/licenses/LICENSE-2.0
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.

README.md CHANGED

@@ -1,10 +1,105 @@
----
-title: GLM 4 Voice
-emoji: 🐢
-colorFrom: indigo
-colorTo: blue
-sdk: docker
-pinned: false
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# GLM-4-Voice
+Read this in [English](./README_en.md)
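+GLM-4-Voice is an end-to-end speech model from Zhipu AI. It can directly understand and generate Chinese and English speech, hold real-time voice conversations, and change attributes of its voice such as emotion, intonation, speaking rate, and dialect in response to user instructions.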
+## Model Architecture
+![Model Architecture](./resources/architecture.jpeg)
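+GLM-4-Voice consists of three components:
+* GLM-4-Voice-Tokenizer: Adds Vector Quantization to the encoder of [Whisper](https://github.com/openai/whisper) and is trained with supervision on ASR data, converting continuous speech input into discrete tokens. On average, one second of audio is represented by only 12.5 discrete tokens.
+* GLM-4-Voice-Decoder: A speech decoder that supports streaming inference, trained on the Flow Matching model architecture of [CosyVoice](https://github.com/FunAudioLLM/CosyVoice), converting discrete speech tokens into continuous speech output. Generation can start with as few as 10 speech tokens, which reduces end-to-end conversation latency.
+* GLM-4-Voice-9B: Pre-trained and aligned on the speech modality on top of [GLM-4-9B](https://github.com/THUDM/GLM-4), so that it can understand and generate discretized speech tokens.
+For pre-training, to overcome two difficulties in the speech modality (the model's intelligence and the expressiveness of its synthesis), we decouple the Speech2Speech task into two tasks: "produce a text reply from the user's audio" and "synthesize the reply speech from the text reply and the user's speech". We design two pre-training objectives that synthesize interleaved speech-text data from text pre-training data and from unsupervised audio data, respectively, to fit these two task forms. Starting from the GLM-4-9B base model, GLM-4-Voice-9B was pre-trained on millions of hours of audio and hundreds of billions of tokens of interleaved audio-text data, giving it strong audio understanding and modeling capabilities.
+For alignment, to support high-quality voice conversation, we designed a streaming thinking architecture: given the user's speech, GLM-4-Voice outputs text and speech content in a streaming, alternating fashion. The speech modality uses the text as a reference to keep the reply content high quality, and changes its voice according to the user's spoken instructions. This retains end-to-end modeling capability while preserving as much of the language model's intelligence as possible, and it keeps latency low: speech synthesis can begin after as few as 20 output tokens.
+A more detailed technical report will be published later.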
+## Model List
+|         Model         | Type |                                                                     Download                                                                     |
+|:---------------------:| :---: |:------------------------------------------------------------------------------------------------------------------------------------------------:|
+| GLM-4-Voice-Tokenizer | Speech Tokenizer | [🤗 Huggingface](https://huggingface.co/THUDM/glm-4-voice-tokenizer) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/glm-4-voice-tokenizer) |
+|    GLM-4-Voice-9B     | Chat Model | [🤗 Huggingface](https://huggingface.co/THUDM/glm-4-voice-9b) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/glm-4-voice-9b) |
+| GLM-4-Voice-Decoder   | Speech Decoder | [🤗 Huggingface](https://huggingface.co/THUDM/glm-4-voice-decoder) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/glm-4-voice-decoder) |
+## Usage
+We provide a Web Demo that can be launched directly. Users can input speech or text, and the model responds with both speech and text.
+![](resources/web_demo.png)
+### Preparation
+First, download the repository.
+```shell
+git clone --recurse-submodules https://github.com/THUDM/GLM-4-Voice
+cd GLM-4-Voice
+```
+Then install the dependencies.
+```shell
+pip install -r requirements.txt
+```
+Because the Decoder model cannot be initialized through `transformers`, its checkpoint must be downloaded separately.
+```shell
+# Download the model with git; make sure git-lfs is installed
+git clone https://huggingface.co/THUDM/glm-4-voice-decoder
+```
+### Launch Web Demo
+First, start the model service.
+```shell
+python model_server.py --model-path glm-4-voice-9b
+```
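+This command downloads `glm-4-voice-9b` automatically. If your network connection is poor, you can also download it manually and pass the local path with `--model-path`.
+Then start the web service.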
+```shell
+python web_demo.py
+```
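+The web demo is then available at http://127.0.0.1:8888. This command downloads `glm-4-voice-tokenizer` and `glm-4-voice-9b` automatically. If your network connection is poor, you can also download them manually and pass the local paths with `--tokenizer-path` and `--model-path`.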
+### Known Issues
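+* Gradio's streaming audio playback is unstable. Audio quality is higher when you click the audio in the dialogue box after generation has finished.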
+## Cases
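+We provide some example conversations with GLM-4-Voice, including emotion control, speaking-rate changes, and dialect generation. (The examples are in Chinese.)
+* Use a gentle voice to guide me to relax
+https://github.com/user-attachments/assets/4e3d9200-076d-4c28-a641-99df3af38eb0
+* Use an excited voice to commentate a football match
+https://github.com/user-attachments/assets/0163de2d-e876-4999-b1bc-bbfa364b799b
+* Tell a ghost story in a mournful voice
+https://github.com/user-attachments/assets/a75b2087-d7bc-49fa-a0c5-e8c99935b39a
+* Describe how cold winter is in Northeastern dialect
+https://github.com/user-attachments/assets/91ba54a1-8f5c-4cfe-8e87-16ed1ecf4037
+* Say "Eat grapes without spitting out the skins" in Chongqing dialect
+https://github.com/user-attachments/assets/7eb72461-9e84-4d8e-9c58-1809cf6a8a9b
+* Recite a tongue twister in Beijing dialect
+https://github.com/user-attachments/assets/a9bb223e-9c0a-440d-8537-0a7f16e31651
+  * Speak faster
+https://github.com/user-attachments/assets/c98a4604-366b-4304-917f-3c850a82fe9f
+  * Even faster
+https://github.com/user-attachments/assets/d5ff0815-74f8-4738-b0f1-477cfc8dcc2d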
+https://github.com/user-attachments/assets/d5ff0815-74f8-4738-b0f1-477cfc8dcc2d
+## Acknowledgements
+Some of the code in this project comes from:
+* [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)
+* [transformers](https://github.com/huggingface/transformers)
+* [GLM-4](https://github.com/THUDM/GLM-4)

README_en.md ADDED

	@@ -0,0 +1,97 @@

+# GLM-4-Voice
+GLM-4-Voice is an end-to-end voice model launched by Zhipu AI. GLM-4-Voice can directly understand and generate Chinese and English speech, engage in real-time voice conversations, and change attributes such as emotion, intonation, speech rate, and dialect based on user instructions.
+## Model Architecture
+![Model Architecture](./resources/architecture.jpeg)
+We provide the three components of GLM-4-Voice:
+* GLM-4-Voice-Tokenizer: Trained by adding vector quantization to the encoder part of [Whisper](https://github.com/openai/whisper), converting continuous speech input into discrete tokens. Each second of audio is converted into 12.5 discrete tokens.
+* GLM-4-Voice-9B: Pre-trained and aligned on speech modality based on [GLM-4-9B](https://github.com/THUDM/GLM-4), enabling understanding and generation of discretized speech.
+* GLM-4-Voice-Decoder: A speech decoder supporting streaming inference, retrained based on [CosyVoice](https://github.com/FunAudioLLM/CosyVoice), converting discrete speech tokens into continuous speech output. Generation can start with as few as 10 audio tokens, reducing conversation latency.
+A more detailed technical report will be published later.
+## Model List
+|         Model         | Type |      Download      |
+|:---------------------:| :---: |:------------------:|
+| GLM-4-Voice-Tokenizer | Speech Tokenizer | [🤗 Huggingface](https://huggingface.co/THUDM/glm-4-voice-tokenizer) |
+|    GLM-4-Voice-9B     | Chat Model | [🤗 Huggingface](https://huggingface.co/THUDM/glm-4-voice-9b) |
+| GLM-4-Voice-Decoder   | Speech Decoder | [🤗 Huggingface](https://huggingface.co/THUDM/glm-4-voice-decoder) |
+## Usage
+We provide a Web Demo that can be launched directly. Users can input speech or text, and the model will respond with both speech and text.
+![](resources/web_demo.png)
+### Preparation
+First, download the repository
+```shell
+git clone --recurse-submodules https://github.com/THUDM/GLM-4-Voice
+cd GLM-4-Voice
+```
+Then, install the dependencies.
+```shell
+pip install -r requirements.txt
+```
+Since the Decoder model does not support initialization via `transformers`, the checkpoint needs to be downloaded separately.
+```shell
+# Git model download, please ensure git-lfs is installed
+git clone https://huggingface.co/THUDM/glm-4-voice-decoder
+```
+### Launch Web Demo
+First, start the model service
+```shell
+python model_server.py --model-path glm-4-voice-9b
+```
+Then, start the web service
+```shell
+python web_demo.py
+```
+You can then access the web demo at http://127.0.0.1:8888.
+### Known Issues
+* Gradio’s streaming audio playback can be unstable. The audio quality will be higher when clicking on the audio in the dialogue box after generation is complete.
+## Examples
+We provide some dialogue cases for GLM-4-Voice, including emotion control, speech rate alteration, dialect generation, etc. (The examples are in Chinese.)
+* Use a gentle voice to guide me to relax
+https://github.com/user-attachments/assets/4e3d9200-076d-4c28-a641-99df3af38eb0
+* Use an excited voice to commentate a football match
+https://github.com/user-attachments/assets/0163de2d-e876-4999-b1bc-bbfa364b799b
+* Tell a ghost story with a mournful voice
+https://github.com/user-attachments/assets/a75b2087-d7bc-49fa-a0c5-e8c99935b39a
+* Introduce how cold winter is with a Northeastern dialect
+https://github.com/user-attachments/assets/91ba54a1-8f5c-4cfe-8e87-16ed1ecf4037
+* Say "Eat grapes without spitting out the skins" in Chongqing dialect
+https://github.com/user-attachments/assets/7eb72461-9e84-4d8e-9c58-1809cf6a8a9b
+* Recite a tongue twister with a Beijing accent
+https://github.com/user-attachments/assets/a9bb223e-9c0a-440d-8537-0a7f16e31651
+  * Increase the speech rate
+https://github.com/user-attachments/assets/c98a4604-366b-4304-917f-3c850a82fe9f
+  * Even faster
+https://github.com/user-attachments/assets/d5ff0815-74f8-4738-b0f1-477cfc8dcc2d
+## Acknowledgements
+Some code in this project is from:
+* [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)
+* [transformers](https://github.com/huggingface/transformers)
+* [GLM-4](https://github.com/THUDM/GLM-4)
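
The token rates quoted in the README translate directly into audio durations. The following is a minimal sketch of that arithmetic, using only the two figures stated above (12.5 tokens per second of audio, and a decoder that can start after 10 tokens). The function names are hypothetical and are not part of the repository.

```python
import math

TOKENS_PER_SECOND = 12.5   # tokenizer rate stated in the README
MIN_DECODER_TOKENS = 10    # decoder can start after this many tokens (README)


def tokens_for_duration(seconds: float) -> int:
    """Approximate number of discrete speech tokens for `seconds` of audio."""
    return math.ceil(seconds * TOKENS_PER_SECOND)


def audio_seconds(num_tokens: int) -> float:
    """Approximate audio duration represented by `num_tokens` speech tokens."""
    return num_tokens / TOKENS_PER_SECOND


if __name__ == "__main__":
    print(tokens_for_duration(4.0))           # tokens needed for 4 s of audio
    print(audio_seconds(MIN_DECODER_TOKENS))  # audio covered by the first decodable chunk
```

Under these figures, the first chunk the decoder can synthesize corresponds to a little under one second of speech, which is where the latency claim comes from.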

flow_inference.py ADDED

	@@ -0,0 +1,142 @@

+import torch
+import torchaudio
+import numpy as np
+import re
+from hyperpyyaml import load_hyperpyyaml
+import uuid
+from collections import defaultdict
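+def fade_in_out(fade_in_mel, fade_out_mel, window):
+    # Cross-fade the first frames of the new chunk (fade_in_mel) with the last
+    # frames of the previous chunk (fade_out_mel). `window` has length
+    # 2 * overlap: its first half weights the new chunk, its second half the old one.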
+    device = fade_in_mel.device
+    fade_in_mel, fade_out_mel = fade_in_mel.cpu(), fade_out_mel.cpu()
+    mel_overlap_len = int(window.shape[0] / 2)
+    fade_in_mel[..., :mel_overlap_len] = fade_in_mel[..., :mel_overlap_len] * window[:mel_overlap_len] + \
+                                         fade_out_mel[..., -mel_overlap_len:] * window[mel_overlap_len:]
+    return fade_in_mel.to(device)
+class AudioDecoder:
+    def __init__(self, config_path, flow_ckpt_path, hift_ckpt_path, device="cuda"):
+        self.device = device
+        with open(config_path, 'r') as f:
+            self.scratch_configs = load_hyperpyyaml(f)
+        # Load models
+        self.flow = self.scratch_configs['flow']
+        self.flow.load_state_dict(torch.load(flow_ckpt_path, map_location=self.device))
+        self.hift = self.scratch_configs['hift']
+        self.hift.load_state_dict(torch.load(hift_ckpt_path, map_location=self.device))
+        # Move models to the appropriate device
+        self.flow.to(self.device)
+        self.hift.to(self.device)
+        self.mel_overlap_dict = defaultdict(lambda: None)
+        self.hift_cache_dict = defaultdict(lambda: None)
+        self.token_min_hop_len = 2 * self.flow.input_frame_rate
+        self.token_max_hop_len = 4 * self.flow.input_frame_rate
+        self.token_overlap_len = 5
+        self.mel_overlap_len = int(self.token_overlap_len / self.flow.input_frame_rate * 22050 / 256)
+        self.mel_window = np.hamming(2 * self.mel_overlap_len)
+        # hift cache
+        self.mel_cache_len = 1
+        self.source_cache_len = int(self.mel_cache_len * 256)
+        # speech fade in out
+        self.speech_window = np.hamming(2 * self.source_cache_len)
+    def token2wav(self, token, uuid, prompt_token=torch.zeros(1, 0, dtype=torch.int32),
+                  prompt_feat=torch.zeros(1, 0, 80), embedding=torch.zeros(1, 192), finalize=False):
+        tts_mel = self.flow.inference(token=token.to(self.device),
+                                      token_len=torch.tensor([token.shape[1]], dtype=torch.int32).to(self.device),
+                                      prompt_token=prompt_token.to(self.device),
+                                      prompt_token_len=torch.tensor([prompt_token.shape[1]], dtype=torch.int32).to(
+                                          self.device),
+                                      prompt_feat=prompt_feat.to(self.device),
+                                      prompt_feat_len=torch.tensor([prompt_feat.shape[1]], dtype=torch.int32).to(
+                                          self.device),
+                                      embedding=embedding.to(self.device))
+        # mel overlap fade in out
+        if self.mel_overlap_dict[uuid] is not None:
+            tts_mel = fade_in_out(tts_mel, self.mel_overlap_dict[uuid], self.mel_window)
+        # append hift cache
+        if self.hift_cache_dict[uuid] is not None:
+            hift_cache_mel, hift_cache_source = self.hift_cache_dict[uuid]['mel'], self.hift_cache_dict[uuid]['source']
+            tts_mel = torch.concat([hift_cache_mel, tts_mel], dim=2)
+        else:
+            hift_cache_source = torch.zeros(1, 1, 0)
+        # _tts_mel=tts_mel.contiguous()
+        # keep overlap mel and hift cache
+        if finalize is False:
+            self.mel_overlap_dict[uuid] = tts_mel[:, :, -self.mel_overlap_len:]
+            tts_mel = tts_mel[:, :, :-self.mel_overlap_len]
+            tts_speech, tts_source = self.hift.inference(mel=tts_mel, cache_source=hift_cache_source)
+            self.hift_cache_dict[uuid] = {'mel': tts_mel[:, :, -self.mel_cache_len:],
+                                          'source': tts_source[:, :, -self.source_cache_len:],
+                                          'speech': tts_speech[:, -self.source_cache_len:]}
+            # if self.hift_cache_dict[uuid] is not None:
+            #     tts_speech = fade_in_out(tts_speech, self.hift_cache_dict[uuid]['speech'], self.speech_window)
+            tts_speech = tts_speech[:, :-self.source_cache_len]
+        else:
+            tts_speech, tts_source = self.hift.inference(mel=tts_mel, cache_source=hift_cache_source)
+            del self.hift_cache_dict[uuid]
+            del self.mel_overlap_dict[uuid]
+            # if uuid in self.hift_cache_dict.keys() and self.hift_cache_dict[uuid] is not None:
+            #     tts_speech = fade_in_out(tts_speech, self.hift_cache_dict[uuid]['speech'], self.speech_window)
+        return tts_speech, tts_mel
+    def offline_inference(self, token):
+        this_uuid = str(uuid.uuid1())
+        tts_speech, tts_mel = self.token2wav(token, uuid=this_uuid, finalize=True)
+        return tts_speech.cpu()
+    def stream_inference(self, token):
+        token = token.to(self.device)  # Tensor.to() is not in-place; keep the result
+        this_uuid = str(uuid.uuid1())
+        # Prepare other necessary input tensors
+        llm_embedding = torch.zeros(1, 192).to(self.device)
+        prompt_speech_feat = torch.zeros(1, 0, 80).to(self.device)
+        flow_prompt_speech_token = torch.zeros(1, 0, dtype=torch.int32).to(self.device)
+        tts_speechs = []
+        tts_mels = []
+        block_size = self.flow.encoder.block_size
+        prev_mel = None
+        for idx in range(0, token.size(1), block_size):
+            # if idx>block_size: break
+            tts_token = token[:, idx:idx + block_size]
+            print(tts_token.size())
+            if prev_mel is not None:
+                prompt_speech_feat = torch.cat(tts_mels, dim=-1).transpose(1, 2)
+                flow_prompt_speech_token = token[:, :idx]
+            # The last block finalizes the stream and clears the per-request caches
+            is_finalize = idx + block_size >= token.size(-1)
+            tts_speech, tts_mel = self.token2wav(tts_token, uuid=this_uuid,
+                                                 prompt_token=flow_prompt_speech_token.to(self.device),
+                                                 prompt_feat=prompt_speech_feat.to(self.device), finalize=is_finalize)
+            prev_mel = tts_mel
+            prev_speech = tts_speech
+            print(tts_mel.size())
+            tts_speechs.append(tts_speech)
+            tts_mels.append(tts_mel)
+        # Concatenate the per-block waveforms into a single utterance
+        tts_speech = torch.cat(tts_speechs, dim=-1).cpu()
+        return tts_speech
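
The overlap handling in `AudioDecoder` can be easier to follow outside of PyTorch. The sketch below is a pure-Python illustration, not the repository's code: `hamming` reimplements the `numpy.hamming` formula, `mel_overlap_len` mirrors the expression in `AudioDecoder.__init__`, and `cross_fade` mirrors `fade_in_out` on plain lists. The frame rate of 12.5 used in the usage lines is an assumption taken from the README's tokens-per-second figure; the real value comes from `config.yaml`, which this commit does not show.

```python
import math


def hamming(m: int) -> list:
    """Hamming window of length m, using the same formula as numpy.hamming."""
    if m == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (m - 1)) for n in range(m)]


def mel_overlap_len(token_overlap: int, frame_rate: float,
                    sample_rate: int = 22050, hop: int = 256) -> int:
    """Mel frames spanned by `token_overlap` tokens (same expression as the class)."""
    return int(token_overlap / frame_rate * sample_rate / hop)


def cross_fade(new_chunk: list, old_tail: list, window: list) -> list:
    """Blend the head of `new_chunk` with the end of `old_tail`, keep the rest as is."""
    half = len(window) // 2
    head = [
        new_chunk[i] * window[i]                              # first half weights the new chunk
        + old_tail[len(old_tail) - half + i] * window[half + i]  # second half weights the old tail
        for i in range(half)
    ]
    return head + new_chunk[half:]


if __name__ == "__main__":
    overlap = mel_overlap_len(token_overlap=5, frame_rate=12.5)  # frame rate is assumed
    window = hamming(2 * overlap)
    print(overlap, len(window))
    print(cross_fade([1.0, 1.0, 5.0], [1.0, 1.0], hamming(4)))
```

Only the first `overlap` frames of each new chunk are blended; everything after them passes through unchanged, which matches how `fade_in_out` writes into a slice of `fade_in_mel`.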

model_server.py ADDED

	@@ -0,0 +1,116 @@

+"""
+A model worker executes the model.
+"""
+import argparse
+import json
+import uuid
+from fastapi import FastAPI, Request
+from fastapi.responses import StreamingResponse
+from transformers import AutoModel, AutoTokenizer
+import torch
+import uvicorn
+from transformers.generation.streamers import BaseStreamer
+from threading import Thread
+from queue import Queue
+class TokenStreamer(BaseStreamer):
+    def __init__(self, skip_prompt: bool = False, timeout=None):
+        self.skip_prompt = skip_prompt
+        # variables used in the streaming process
+        self.token_queue = Queue()
+        self.stop_signal = None
+        self.next_tokens_are_prompt = True
+        self.timeout = timeout
+    def put(self, value):
+        if len(value.shape) > 1 and value.shape[0] > 1:
+            raise ValueError("TokenStreamer only supports batch size 1")
+        elif len(value.shape) > 1:
+            value = value[0]
+        if self.skip_prompt and self.next_tokens_are_prompt:
+            self.next_tokens_are_prompt = False
+            return
+        for token in value.tolist():
+            self.token_queue.put(token)
+    def end(self):
+        self.token_queue.put(self.stop_signal)
+    def __iter__(self):
+        return self
+    def __next__(self):
+        value = self.token_queue.get(timeout=self.timeout)
+        if value == self.stop_signal:
+            raise StopIteration()
+        else:
+            return value
+class ModelWorker:
+    def __init__(self, model_path, device='cuda'):
+        self.device = device
+        self.glm_model = AutoModel.from_pretrained(model_path, trust_remote_code=True,
+                                                   device=device).to(device).eval()
+        self.glm_tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+    @torch.inference_mode()
+    def generate_stream(self, params):
+        tokenizer, model = self.glm_tokenizer, self.glm_model
+        prompt = params["prompt"]
+        temperature = float(params.get("temperature", 1.0))
+        top_p = float(params.get("top_p", 1.0))
+        max_new_tokens = int(params.get("max_new_tokens", 256))
+        inputs = tokenizer([prompt], return_tensors="pt")
+        inputs = inputs.to(self.device)
+        streamer = TokenStreamer(skip_prompt=True)
+        thread = Thread(target=model.generate,
+                        kwargs=dict(**inputs, max_new_tokens=int(max_new_tokens),
+                                    temperature=float(temperature), top_p=float(top_p),
+                                    streamer=streamer))
+        thread.start()
+        for token_id in streamer:
+            yield (json.dumps({"token_id": token_id, "error_code": 0}) + "\n").encode()
+    def generate_stream_gate(self, params):
+        try:
+            for x in self.generate_stream(params):
+                yield x
+        except Exception as e:
+            print("Caught Unknown Error", e)
+            ret = {
+                "text": "Server Error",
+                "error_code": 1,
+            }
+            yield (json.dumps(ret)+ "\n").encode()
+app = FastAPI()
+@app.post("/generate_stream")
+async def generate_stream(request: Request):
+    params = await request.json()
+    generator = worker.generate_stream_gate(params)
+    return StreamingResponse(generator)
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--host", type=str, default="localhost")
+    parser.add_argument("--port", type=int, default=10000)
+    parser.add_argument("--model-path", type=str, default="THUDM/glm-4-voice-9b")
+    args = parser.parse_args()
+    worker = ModelWorker(args.model_path)
+    uvicorn.run(app, host=args.host, port=args.port, log_level="info")
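
The server answers `POST /generate_stream` with newline-delimited JSON: one `{"token_id": ..., "error_code": 0}` object per generated token, or a single object with `"error_code": 1` on failure. A client could consume it roughly as sketched below. The function names are hypothetical, the default URL assumes the server's default `--host` and `--port`, and the prompt format expected by the model is not shown in this part of the commit, so the caller has to supply it.

```python
import json
from typing import Iterable, Iterator


def parse_token_stream(lines: Iterable[bytes]) -> Iterator[int]:
    """Yield token IDs from the newline-delimited JSON sent by /generate_stream."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines between messages
        message = json.loads(raw)
        if message.get("error_code", 0) != 0:
            raise RuntimeError(message.get("text", "server error"))
        yield message["token_id"]


def stream_tokens(prompt: str,
                  url: str = "http://localhost:10000/generate_stream",
                  temperature: float = 1.0, top_p: float = 1.0,
                  max_new_tokens: int = 256) -> Iterator[int]:
    """POST a prompt to the model server and yield token IDs as they arrive."""
    import requests  # third-party; pinned in requirements.txt

    payload = {"prompt": prompt, "temperature": temperature,
               "top_p": top_p, "max_new_tokens": max_new_tokens}
    with requests.post(url, json=payload, stream=True) as response:
        response.raise_for_status()
        yield from parse_token_stream(response.iter_lines())
```

Keeping the parsing separate from the HTTP call makes the protocol handling testable without a running server.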

requirements.txt ADDED

	@@ -0,0 +1,36 @@

+conformer==0.3.2
+deepspeed==0.14.2; sys_platform == 'linux'
+diffusers==0.27.2
+fastapi==0.115.3
+fastapi-cli==0.0.4
+gdown==5.1.0
+gradio==5.3.0
+grpcio==1.57.0
+grpcio-tools==1.57.0
+huggingface_hub==0.25.2
+hydra-core==1.3.2
+HyperPyYAML==1.2.2
+inflect==7.3.1
+librosa==0.10.2
+lightning==2.2.4
+matplotlib==3.7.5
+modelscope==1.15.0
+networkx==3.1
+numpy==1.24.4
+omegaconf==2.3.0
+onnxruntime-gpu==1.16.0; sys_platform == 'linux'
+onnxruntime==1.16.0; sys_platform == 'darwin' or sys_platform == 'win32'
+openai-whisper==20231117
+protobuf==4.25
+pydantic==2.7.0
+rich==13.7.1
+Requests==2.32.3
+safetensors==0.4.5
+soundfile==0.12.1
+tensorboard==2.14.0
+transformers==4.44.1
+uvicorn==0.32.0
+wget==3.2
+WeTextProcessing==1.0.3
+torch==2.3.0
+torchaudio==2.3.0
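
The `sys_platform` environment marker used for the `deepspeed` and `onnxruntime` pins is compared against Python's `sys.platform`, whose value is `linux` on Linux, `darwin` on macOS, and `win32` on Windows. The sketch below expresses the apparent intent of the two ONNX Runtime lines as a function; the function name is hypothetical and the mapping is an interpretation of the requirements file, not something pip exposes.

```python
import sys


def onnxruntime_package(platform: str = sys.platform) -> str:
    """Return the ONNX Runtime distribution the pins appear to intend for a platform."""
    if platform == "linux":
        return "onnxruntime-gpu"   # GPU build on Linux
    if platform in ("darwin", "win32"):
        return "onnxruntime"       # CPU build on macOS and Windows
    raise ValueError(f"no onnxruntime pin for platform {platform!r}")


if __name__ == "__main__":
    print(sys.platform, "->", onnxruntime_package())
```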

web_demo.py ADDED

	@@ -0,0 +1,257 @@

+import json
+import os.path
+import tempfile
+import sys
+import re
+import uuid
+import requests
+from argparse import ArgumentParser
+import torchaudio
+from transformers import WhisperFeatureExtractor, AutoTokenizer, AutoModel
+from speech_tokenizer.modeling_whisper import WhisperVQEncoder
+sys.path.insert(0, "./cosyvoice")
+sys.path.insert(0, "./third_party/Matcha-TTS")
+from speech_tokenizer.utils import extract_speech_token
+import gradio as gr
+import torch
+audio_token_pattern = re.compile(r"<\|audio_(\d+)\|>")
+from flow_inference import AudioDecoder
+if __name__ == "__main__":
+    parser = ArgumentParser()
+    parser.add_argument("--host", type=str, default="0.0.0.0")
+    parser.add_argument("--port", type=int, default=8888)
+    parser.add_argument("--flow-path", type=str, default="./glm-4-voice-decoder")
+    parser.add_argument("--model-path", type=str, default="THUDM/glm-4-voice-9b")
+    parser.add_argument("--tokenizer-path", type=str, default="THUDM/glm-4-voice-tokenizer")
+    args = parser.parse_args()
+    flow_config = os.path.join(args.flow_path, "config.yaml")
+    flow_checkpoint = os.path.join(args.flow_path, 'flow.pt')
+    hift_checkpoint = os.path.join(args.flow_path, 'hift.pt')
+    glm_tokenizer = None
+    device = "cuda"
+    audio_decoder: AudioDecoder = None
+    whisper_model, feature_extractor = None, None
+    def initialize_fn():
+        global audio_decoder, feature_extractor, whisper_model, glm_model, glm_tokenizer
+        if audio_decoder is not None:
+            return
+        # GLM
+        glm_tokenizer = AutoTokenizer.from_pretrained(args.model_path, trust_remote_code=True)
+        # Flow & Hift
+        audio_decoder = AudioDecoder(config_path=flow_config, flow_ckpt_path=flow_checkpoint,
+                                     hift_ckpt_path=hift_checkpoint,
+                                     device=device)
+        # Speech tokenizer
+        whisper_model = WhisperVQEncoder.from_pretrained(args.tokenizer_path).eval().to(device)
+        feature_extractor = WhisperFeatureExtractor.from_pretrained(args.tokenizer_path)
+    def clear_fn():
+        return [], [], '', '', '', None
+    def inference_fn(
+            temperature: float,
+            top_p: float,
+            max_new_token: int,
+            input_mode,
+            audio_path: str | None,
+            input_text: str | None,
+            history: list[dict],
+            previous_input_tokens: str,
+            previous_completion_tokens: str,
+    ):
+        if input_mode == "audio":
+            assert audio_path is not None
+            history.append({"role": "user", "content": {"path": audio_path}})
+            audio_tokens = extract_speech_token(
+                whisper_model, feature_extractor, [audio_path]
+            )[0]
+            if len(audio_tokens) == 0:
+                raise gr.Error("No audio tokens extracted")
+            audio_tokens = "".join([f"<|audio_{x}|>" for x in audio_tokens])
+            audio_tokens = "<|begin_of_audio|>" + audio_tokens + "<|end_of_audio|>"
+            user_input = audio_tokens
+            system_prompt = "User will provide you with a speech instruction. Do it step by step. First, think about the instruction and respond in a interleaved manner, with 13 text token followed by 26 audio tokens. "
+        else:
+            assert input_text is not None
+            history.append({"role": "user", "content": input_text})
+            user_input = input_text
+            system_prompt = "User will provide you with a text instruction. Do it step by step. First, think about the instruction and respond in a interleaved manner, with 13 text token followed by 26 audio tokens."
+        # Gather history
+        inputs = previous_input_tokens + previous_completion_tokens
+        inputs = inputs.strip()
+        if "<|system|>" not in inputs:
+            inputs += f"<|system|>\n{system_prompt}"
+        inputs += f"<|user|>\n{user_input}<|assistant|>streaming_transcription\n"
+        with torch.no_grad():
+            response = requests.post(
+                "http://localhost:10000/generate_stream",
+                data=json.dumps({
+                    "prompt": inputs,
+                    "temperature": temperature,
+                    "top_p": top_p,
+                    "max_new_tokens": max_new_token,
+                }),
+                stream=True
+            )
+            text_tokens, audio_tokens = [], []
+            audio_offset = glm_tokenizer.convert_tokens_to_ids('<|audio_0|>')
+            end_token_id = glm_tokenizer.convert_tokens_to_ids('<|user|>')
+            complete_tokens = []
+            prompt_speech_feat = torch.zeros(1, 0, 80).to(device)
+            flow_prompt_speech_token = torch.zeros(1, 0, dtype=torch.int64).to(device)
+            this_uuid = str(uuid.uuid4())
+            tts_speechs = []
+            tts_mels = []
+            prev_mel = None
+            is_finalize = False
+            block_size = 10
+            for chunk in response.iter_lines():
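+                if not chunk:
+                    continue  # skip blank keep-alive lines
+                payload = json.loads(chunk)
+                if "token_id" not in payload:
+                    # The server reports failures as {"text": ..., "error_code": ...}
+                    raise gr.Error(payload.get("text", "Server Error"))
+                token_id = payload["token_id"]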
+                if token_id == end_token_id:
+                    is_finalize = True
+                if len(audio_tokens) >= block_size or (is_finalize and audio_tokens):
+                    block_size = 20
+                    tts_token = torch.tensor(audio_tokens, device=device).unsqueeze(0)
+                    if prev_mel is not None:
+                        prompt_speech_feat = torch.cat(tts_mels, dim=-1).transpose(1, 2)
+                    tts_speech, tts_mel = audio_decoder.token2wav(tts_token, uuid=this_uuid,
+                                                                  prompt_token=flow_prompt_speech_token.to(device),
+                                                                  prompt_feat=prompt_speech_feat.to(device),
+                                                                  finalize=is_finalize)
+                    prev_mel = tts_mel
+                    tts_speechs.append(tts_speech.squeeze())
+                    tts_mels.append(tts_mel)
+                    yield history, inputs, '', '', (22050, tts_speech.squeeze().cpu().numpy())
+                    flow_prompt_speech_token = torch.cat((flow_prompt_speech_token, tts_token), dim=-1)
+                    audio_tokens = []
+                if not is_finalize:
+                    complete_tokens.append(token_id)
+                    if token_id >= audio_offset:
+                        audio_tokens.append(token_id - audio_offset)
+                    else:
+                        text_tokens.append(token_id)
+        complete_text = glm_tokenizer.decode(complete_tokens, spaces_between_special_tokens=False)
+        if tts_speechs:
+            # Save and attach audio only if at least one block was synthesized;
+            # torch.cat raises on an empty list.
+            tts_speech = torch.cat(tts_speechs, dim=-1).cpu()
+            with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
+                torchaudio.save(f, tts_speech.unsqueeze(0), 22050, format="wav")
+            history.append({"role": "assistant", "content": {"path": f.name, "type": "audio/wav"}})
+        history.append({"role": "assistant", "content": glm_tokenizer.decode(text_tokens, skip_special_tokens=False)})
+        yield history, inputs, complete_text, '', None
+    def update_input_interface(input_mode):
+        if input_mode == "audio":
+            return [gr.update(visible=True), gr.update(visible=False)]
+        else:
+            return [gr.update(visible=False), gr.update(visible=True)]
+    # Create the Gradio interface
+    with gr.Blocks(title="GLM-4-Voice Demo", fill_height=True) as demo:
+        with gr.Row():
+            temperature = gr.Number(
+                label="Temperature",
+                value=0.2
+            )
+            top_p = gr.Number(
+                label="Top p",
+                value=0.8
+            )
+            max_new_token = gr.Number(
+                label="Max new tokens",
+                value=2000,
+            )
+        chatbot = gr.Chatbot(
+            elem_id="chatbot",
+            bubble_full_width=False,
+            type="messages",
+            scale=1,
+        )
+        with gr.Row():
+            with gr.Column():
+                input_mode = gr.Radio(["audio", "text"], label="Input Mode", value="audio")
+                audio = gr.Audio(label="Input audio", type='filepath', show_download_button=True, visible=True)
+                text_input = gr.Textbox(label="Input text", placeholder="Enter your text here...", lines=2, visible=False)
+            with gr.Column():
+                submit_btn = gr.Button("Submit")
+                reset_btn = gr.Button("Clear")
+                output_audio = gr.Audio(label="Last Output Audio (If Any)", show_download_button=True, streaming=True,
+                                        autoplay=True)
+        gr.Markdown("""## Debug Info""")
+        with gr.Row():
+            input_tokens = gr.Textbox(
+                label="Input Tokens",
+                interactive=False,
+            )
+            completion_tokens = gr.Textbox(
+                label="Completion Tokens",
+                interactive=False,
+            )
+        detailed_error = gr.Textbox(
+            label="Detailed Error",
+            interactive=False,
+        )
+        history_state = gr.State([])
+        respond = submit_btn.click(
+            inference_fn,
+            inputs=[
+                temperature,
+                top_p,
+                max_new_token,
+                input_mode,
+                audio,
+                text_input,
+                history_state,
+                input_tokens,
+                completion_tokens,
+            ],
+            outputs=[history_state, input_tokens, completion_tokens, detailed_error, output_audio]
+        )
+        respond.then(lambda s: s, [history_state], chatbot)
+        reset_btn.click(clear_fn, outputs=[chatbot, history_state, input_tokens, completion_tokens, detailed_error, output_audio])
+        input_mode.input(clear_fn, outputs=[chatbot, history_state, input_tokens, completion_tokens, detailed_error, output_audio]).then(update_input_interface, inputs=[input_mode], outputs=[audio, text_input])
+    initialize_fn()
+    # Launch the interface
+    demo.launch(
+        server_port=args.port,
+        server_name=args.host
+    )
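
The block-wise flushing inside `inference_fn` can be isolated from the model and decoder. The following is a minimal sketch under stated assumptions, not the author's implementation: it works on plain integer token ids, the function name `iter_audio_blocks` and its parameters are hypothetical, and it stops at the end token instead of continuing to read the stream. As in the demo, the first block is flushed after 10 audio tokens and later blocks after 20, and audio ids are shifted by the id of `<|audio_0|>`.

```python
from typing import Iterable, Iterator


def iter_audio_blocks(
    token_ids: Iterable[int],
    audio_offset: int,
    end_token_id: int,
    first_block: int = 10,
    later_block: int = 20,
) -> Iterator[tuple[list[int], bool]]:
    """Group streamed audio tokens into blocks for a vocoder.

    Yields (audio_tokens, is_final), where audio_tokens are ids relative
    to audio_offset. Text tokens (ids below audio_offset) are skipped.
    """
    audio_tokens: list[int] = []
    block_size = first_block
    for token_id in token_ids:
        is_final = token_id == end_token_id
        # Flush when the block is full, or when the stream ends with leftovers.
        if len(audio_tokens) >= block_size or (is_final and audio_tokens):
            block_size = later_block
            yield audio_tokens, is_final
            audio_tokens = []
        if is_final:
            break
        if token_id >= audio_offset:
            audio_tokens.append(token_id - audio_offset)
```

A smaller first block keeps the latency to the first audio chunk low, while larger later blocks reduce the number of decoder calls. Note that, like the loop in the demo, this sketch drops any buffered audio tokens if the stream ends without the end token.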