This model can handle tasks ranging from OCR to semantic segmentation.
What sets it apart from previous models is the data: the authors compiled a dataset of 126M images with 5.4B annotations, labelled by their own data engine, which pseudolabels images using smaller specialized models and APIs.
The architecture is similar to previous models: an image encoder, plus a multimodality encoder with a text decoder. The authors built the multitask dataset around prompts, one per task, so the same model can be steered between tasks at inference time, as in the sketch below.
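Here's a minimal sketch of that prompt-per-task interface. The checkpoint ID is hypothetical and the `<OCR>` task prompt is an assumption; the real task tokens are defined on the model card:

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Hypothetical checkpoint ID -- substitute the real one from the collection.
model_id = "org/multitask-vision-model"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg")
# The task is selected purely by the prompt, e.g. an OCR prompt vs. a segmentation prompt.
inputs = processor(text="<OCR>", images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=256,
)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```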
You can also fine-tune this model on any task of your choice. The authors report results on downstream tasks both with the vision encoder frozen and unfrozen. They have released fine-tuned models too; you can find them in the collection above 🤗
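If you want to try the frozen-encoder setup yourself, here's a minimal sketch, continuing from the loading snippet above. The `vision_tower` parameter prefix is an assumption; inspect `model.named_parameters()` for your checkpoint:

```python
# Freeze the vision encoder so only the multimodality encoder and text
# decoder are updated during fine-tuning.
# "vision_tower" is an assumed parameter prefix -- confirm it for your model.
for name, param in model.named_parameters():
    if name.startswith("vision_tower"):
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,}")
```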
Just landed on the Hugging Face Hub: a community-led computer vision course 🤗 Learn everything from the fundamentals to the details of bleeding-edge vision transformers!
Released a new version of vikhyatk/moondream2 today! This release primarily focuses on improving OCR and captioning (e.g. "Describe this image", "Describe this image in one sentence"), but it also shows general improvement across all benchmarks.
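For a quick caption, here's a minimal usage sketch following the pattern from the model card; the `encode_image`/`answer_question` API may differ in newer revisions, so consider pinning one:

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code is required because moondream2 ships custom model code;
# consider pinning a specific revision for reproducibility.
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("vikhyatk/moondream2")

image = Image.open("example.jpg")
enc_image = model.encode_image(image)  # precompute the image embedding
print(model.answer_question(enc_image, "Describe this image in one sentence.", tokenizer))
```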
On my device, I was able to achieve a 64.04x speedup over WASM! 🤯 How much does WebGPU speed up ML models running locally in your browser? Try it out and share your results!
Today we're releasing The Stack v2 & StarCoder2: a series of 3B, 7B & 15B code generation models trained on 3.3 to 4.5 trillion tokens of code:
- StarCoder2-15B matches or outperforms CodeLlama 34B, and approaches DeepSeek-33B on multiple benchmarks.
- StarCoder2-3B outperforms StarCoderBase-15B and similarly sized models.
- The Stack v2 is a 4x larger dataset than The Stack v1, resulting in 900B unique code tokens.

As always, we released everything from models and datasets to curation code. Enjoy!
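If you want to kick the tires, here's a minimal generation sketch with the 3B checkpoint, assuming a transformers version recent enough to include StarCoder2 support:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"  # -7b and -15b follow the same pattern
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)

# Complete a function signature and print the generated code.
inputs = tokenizer("def fibonacci(n: int) -> int:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```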