Mohamed Hisham Abdelzaher

MH0386

AI & ML interests

None yet

MH0386's activity

Reacted to Symbol-LLM's post with 🚀 21 days ago
🚀 Excited to introduce a new member of the OS-Copilot family: OS-Atlas, an open-source foundation action model for GUI agents

📘 Paper: OS-ATLAS: A Foundation Action Model for Generalist GUI Agents (2410.23218)
🔗 Website: https://osatlas.github.io

😇 TL;DR: OS-Atlas offers:
1. State-of-the-Art GUI Grounding: helps GUI agents accurately locate GUI elements (see the sketch after this list).
2. Strong OOD Performance and Cross-platform Compatibility: Excels in out-of-domain agentic tasks across MacOS, Windows, Linux, Android, and Web.
3. Complete Infrastructure for GUI Data Synthesis: you can easily build your own OS agent on top of it!
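As a rough illustration of the grounding use case, here is a minimal sketch that queries an OS-Atlas checkpoint through Hugging Face transformers. The checkpoint ID, the prompt wording, and the output format are my assumptions (OS-Atlas is reportedly built on Qwen2-VL, so a Qwen2-VL-style chat interface is assumed); check the website above for the project's actual usage.

```python
# Minimal GUI-grounding sketch. Assumptions: the checkpoint ID
# "OS-Copilot/OS-Atlas-Base-7B" and a Qwen2-VL-style chat interface;
# see https://osatlas.github.io for the exact, supported usage.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "OS-Copilot/OS-Atlas-Base-7B"  # assumed model ID
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

screenshot = Image.open("screenshot.png")  # any GUI screenshot
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": 'Locate the "Save" button in this screenshot.'},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[screenshot], return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)
new_tokens = out[:, inputs["input_ids"].shape[1]:]  # drop the echoed prompt
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```

For a grounding model, the reply would typically contain element coordinates that an agent loop can translate into a click action.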

liked a Space about 1 month ago
upvoted an article 2 months ago
Introducing Community Tools on HuggingChat

updated a Space 3 months ago
upvoted an article 7 months ago
Welcome Llama 3 - Meta's new open LLM

Reacted to Jaward's post with 👍 7 months ago
Let's break down the technical details of VASA, Microsoft's mind-blowing framework for lifelike audio-driven talking faces, and its model VASA-1:

Summary of Summaries
- The paper introduces VASA, a framework for generating lifelike talking faces with appealing visual affective skills (VAS) from a single image and speech audio.
- Core innovations include a diffusion-based model for holistic generation of facial dynamics and head movements in an expressive, disentangled face latent space learned from video data.
- VASA-1 generates high-quality 512x512 videos at up to 40 FPS with low latency.
- It supports real-time generation of lifelike, emotive talking faces.
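The low-latency claim suggests motion latents are produced in small windows rather than for a whole clip at once. Below is a hedged sketch of such a chunked generation loop; the window and overlap sizes and the `generate_window`/`decode_frames` callables are illustrative assumptions, not the paper's implementation.

```python
# Chunked (sliding-window) generation sketch for low-latency streaming.
# WIN, OVERLAP, and the two callables are illustrative assumptions.
WIN = 32      # motion latents generated per window (assumed)
OVERLAP = 8   # tail latents reused as conditioning for continuity (assumed)

def stream_frames(audio_features, generate_window, decode_frames):
    """audio_features: sequence of per-frame audio features.
    generate_window(chunk, prev_tail) -> motion latents for the chunk.
    decode_frames(motion) -> iterable of rendered frames.
    Yields frames as soon as each window is generated."""
    prev_tail = None
    for start in range(0, len(audio_features), WIN):
        chunk = audio_features[start:start + WIN]
        if len(chunk) == 0:
            break
        # Condition on the tail of the previous window so consecutive
        # windows stay temporally coherent.
        motion = generate_window(chunk, prev_tail)
        yield from decode_frames(motion)
        prev_tail = motion[-OVERLAP:]
```

Because frames are emitted per window instead of after the full clip, the first frames appear after a single window's latency, which is what makes an online 40 FPS stream feasible.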

Summary of Overall Framework:
- Instead of directly generating video frames, VASA generates holistic facial dynamics and head motion in a latent space, conditioned on audio and optional control signals.
- To achieve this, the framework uses a face encoder-decoder to extract appearance and identity features and trains a Diffusion Transformer model to generate motion latent codes.
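To make the division of labor concrete, here is a structural sketch in PyTorch: appearance and identity features are extracted once from the source image, a transformer produces a motion-latent sequence from audio, and a decoder renders frames. Every module below is a toy stand-in with assumed shapes, not the released architecture; in particular, the real motion generator is a diffusion model (its training objective is sketched at the end of this post).

```python
import torch
import torch.nn as nn

class FaceEncoder(nn.Module):
    """Toy stand-in: image -> (appearance features, identity code)."""
    def __init__(self, d_app=512, d_id=128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(d_app + d_id))
        self.d_app = d_app

    def forward(self, img):
        h = self.backbone(img)
        return h[:, : self.d_app], h[:, self.d_app :]

class MotionGenerator(nn.Module):
    """Toy stand-in: audio features -> motion latents (pose + dynamics).
    The paper's generator is a Diffusion Transformer; this stub skips sampling."""
    def __init__(self, d_audio=768, d_motion=256):
        super().__init__()
        self.proj = nn.Linear(d_audio, d_motion)
        layer = nn.TransformerEncoderLayer(d_model=d_motion, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, audio):                 # (B, T, d_audio)
        return self.blocks(self.proj(audio))  # (B, T, d_motion)

class FrameDecoder(nn.Module):
    """Toy stand-in: (appearance, identity, motion) -> frames."""
    def __init__(self, d_app=512, d_id=128, d_motion=256):
        super().__init__()
        self.to_frame = nn.Linear(d_app + d_id + d_motion, 3 * 8 * 8)  # tiny "frame"

    def forward(self, app, ident, motion):    # motion: (B, T, d_motion)
        B, T, _ = motion.shape
        cond = torch.cat([app, ident], dim=-1).unsqueeze(1).expand(B, T, -1)
        return self.to_frame(torch.cat([cond, motion], dim=-1)).view(B, T, 3, 8, 8)

enc, gen, dec = FaceEncoder(), MotionGenerator(), FrameDecoder()
img = torch.randn(1, 3, 64, 64)               # source image (toy resolution)
audio = torch.randn(1, 40, 768)               # ~1 s of audio features at 40 FPS
app, ident = enc(img)                         # extracted once per identity
frames = dec(app, ident, gen(audio))          # one motion latent -> one frame
print(frames.shape)                           # torch.Size([1, 40, 3, 8, 8])
```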

Technical Method Details:
Expressive and Disentangled Face Latent Space Construction:
- Based on a 3D-aided face reenactment framework
- Decomposes the face into a 3D appearance volume, an identity code, head pose, and facial dynamics latents
- Uses encoders to extract these latent factors from face images
- Applies additional losses to improve disentanglement:
  - Pairwise head pose and facial dynamics transfer loss
  - Face identity similarity loss for cross-identity pose/dynamics transfer
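The two losses can be sketched as follows. The encoder/decoder interfaces, the identity-feature network `id_net`, and the specific distances (L1, cosine) are illustrative assumptions, not necessarily the paper's exact choices.

```python
import torch.nn.functional as F

def pairwise_transfer_loss(enc, dec, img_a, img_b):
    """img_a, img_b: two frames of the SAME subject. If pose/dynamics are truly
    disentangled from appearance/identity, rebuilding frame b from a's
    appearance/identity plus b's pose/dynamics should reproduce frame b."""
    app_a, id_a, _, _ = enc(img_a)
    _, _, pose_b, dyn_b = enc(img_b)
    recon_b = dec(app_a, id_a, pose_b, dyn_b)
    return F.l1_loss(recon_b, img_b)

def identity_similarity_loss(enc, dec, id_net, img_a, img_b):
    """img_a, img_b: frames of DIFFERENT subjects. Driving subject a with
    b's pose/dynamics must preserve a's identity features."""
    app_a, id_a, _, _ = enc(img_a)
    _, _, pose_b, dyn_b = enc(img_b)
    driven = dec(app_a, id_a, pose_b, dyn_b)
    sim = F.cosine_similarity(id_net(driven), id_net(img_a), dim=-1)
    return (1 - sim).mean()
```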

Holistic Facial Dynamics Generation with Diffusion Transformer:
- Represents all facial movements (lip motion, expression, gaze, etc.) as a single latent sequence
- Applies a Diffusion Transformer model to generate the facial dynamics sequence.
- Diffusion Transformer trained with a simplified denoising score matching objective.
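A "simplified denoising" objective usually means the DDPM-style epsilon-prediction MSE. Here is a minimal sketch under that assumption, with an epsilon-predicting model conditioned on audio; the noise schedule values and the model interface are illustrative.

```python
import torch
import torch.nn.functional as F

T_STEPS = 1000
betas = torch.linspace(1e-4, 0.02, T_STEPS)        # assumed linear schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(model, x0, audio_cond):
    """x0: clean motion-latent sequence (B, T, d).
    model(x_t, t, audio_cond) -> predicted noise, same shape as x0."""
    B = x0.shape[0]
    t = torch.randint(0, T_STEPS, (B,), device=x0.device)
    a_bar = alphas_cumprod.to(x0.device)[t].view(B, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise  # forward noising
    return F.mse_loss(model(x_t, t, audio_cond), noise)   # predict the added noise
```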