---
title: MultiModel LLM ERAV2
emoji: ๐
colorFrom: red
colorTo: pink
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: mit
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# Multi-Modal LLM Gradio App

## Project Overview

This project is a **multi-modal language model** Gradio app that accepts **text**, **image**, and **audio inputs**, and outputs **text responses**. The app mimics a **ChatGPT-style interface**, allowing users to interact using multiple input modes.

The app leverages:

- **CLIP** for image processing
- **Whisper** for audio transcription (ASR)
- A **text-based model** (such as GPT or Phi) for generating text responses

## Features

- **Text Input**: Users can input text directly for response generation.
- **Image Input**: Users can upload images, which are processed by the CLIP model.
- **Audio Input**: Users can upload or record audio files, which are transcribed by the Whisper model and then processed for a response.
- **ChatGPT-Like Interface**: A simple, intuitive interface that handles multi-modal inputs and returns text output (see the sketch below).
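The interface can be assembled with Gradio's `Blocks` API. The following is a minimal sketch, not the app's actual code: the component layout and the `generate_response` handler are hypothetical stand-ins for whatever `app.py` defines.

```python
import gradio as gr

def generate_response(text, image, audio):
    # Hypothetical handler: the real app routes each modality through
    # CLIP, Whisper, and the text model (see "How It Works" below).
    return "model response goes here"

with gr.Blocks(title="Multi-Modal LLM") as demo:
    with gr.Row():
        text_in = gr.Textbox(label="Text Input")
        image_in = gr.Image(label="Image Input", type="pil")
        audio_in = gr.Audio(label="Audio Input", type="filepath")
    submit = gr.Button("Submit")
    output = gr.Textbox(label="Response")
    submit.click(generate_response, inputs=[text_in, image_in, audio_in], outputs=output)

demo.launch()
```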
## Installation

1. Clone the repository:
```bash
git clone https://huggingface.co/spaces/Vasudevakrishna/MultiModel_LLM_ERAV2
cd MultiModel_LLM_ERAV2
```
2. Install the dependencies (an illustrative dependency list follows these steps):
```bash
pip install -r requirements.txt
```
3. Run the app:
```bash
python app.py
```
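The repository's own `requirements.txt` is authoritative; a plausible minimal dependency set for this stack, inferred from the components listed above and the `sdk_version` pinned in the Space config, would look roughly like:

```text
gradio==4.44.0
torch
transformers
```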
## How It Works

1. **Text Processing**: Input text is passed to a language model (such as GPT or Phi) to generate a response.
2. **Image Processing**: Images are processed by CLIP, which extracts embeddings; these embeddings are then projected into a format the text model can consume.
3. **Audio Processing**: Audio files are transcribed into text by Whisper, and the transcript is passed to the language model for response generation.
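A minimal sketch of these three steps, assuming standard Hugging Face checkpoints (`openai/clip-vit-base-patch32`, `openai/whisper-base`, `microsoft/phi-2`) as stand-ins for the app's actual models; the learned projection that maps CLIP embeddings into the text model's input space is omitted, since its exact form depends on how the app was trained:

```python
import torch
from PIL import Image
from transformers import (
    CLIPModel, CLIPProcessor,
    WhisperForConditionalGeneration, WhisperProcessor,
    AutoModelForCausalLM, AutoTokenizer,
)

# Stand-in checkpoints -- assumptions, not necessarily what app.py loads.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
whisper = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
whisper_proc = WhisperProcessor.from_pretrained("openai/whisper-base")
lm = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
tok = AutoTokenizer.from_pretrained("microsoft/phi-2")

def embed_image(image: Image.Image) -> torch.Tensor:
    # Step 2: CLIP turns the image into a single embedding vector (1, 512).
    # A learned projection into the text model's input space would follow.
    inputs = clip_proc(images=image, return_tensors="pt")
    with torch.no_grad():
        return clip.get_image_features(**inputs)

def transcribe(audio, sampling_rate=16000) -> str:
    # Step 3: Whisper transcribes 16 kHz mono audio into text.
    inputs = whisper_proc(audio, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        ids = whisper.generate(inputs.input_features)
    return whisper_proc.batch_decode(ids, skip_special_tokens=True)[0]

def respond(prompt: str) -> str:
    # Step 1: the language model generates the final text response.
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = lm.generate(**inputs, max_new_tokens=128)
    return tok.decode(out[0], skip_special_tokens=True)
```

In the app itself, functions like these would be called from the Gradio submit handler, with the image and audio branches feeding into the prompt passed to the text model.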
## Usage

- **Text Input**: Enter text in the provided textbox and click "Submit" to generate a response.
- **Image Input**: Upload an image and click "Submit" to generate a response based on the image.
- **Audio Input**: Upload or record an audio file, then click "Submit" to transcribe it and generate a response.
## Future Improvements

- Add advanced features like drag-and-drop file upload or live audio recording for a better user experience.
- Speed up image processing by computing CLIP embeddings in real time with more GPU resources.
- Implement end-to-end training of all components for better response quality.

## License

This project is licensed under the MIT License.