A team from Tsinghua University just released AndroidLab, the first systematic framework to evaluate and train Android mobile agents that works with both text-only and multimodal models.
They show that fine-tuning small open-source models can significantly boost performance, matching that of much bigger closed models like GPT-4o.
The team built:
- A reproducible benchmark with 138 tasks across 9 apps to evaluate mobile agents systematically
- A framework supporting both text-only (via XML) and visual (via marked screenshots) interfaces
- An instruction dataset of 10.5k operation traces for training mobile agents
Key insights:
- Fine-tuning improves performance BY A LOT: the open-source Llama-3.1-8B improves from a 2% to a 24% success rate after training, nearly reaching GPT-4o performance although it's much smaller.
- Text-only agents match multimodal ones: XML-based agents achieve performance similar to screenshot-based multimodal agents.
I am delighted to announce the publication of LegalKit, a French labeled dataset built for legal ML training.
This dataset comprises over 50k query-document pairs curated for training sentence embedding models in the domain of French law.
The labeling process follows a systematic approach to ensure consistency and relevance:
- Initial Query Generation: Three instances of the LLaMA-3-70B model independently generate three different queries based on the same document.
- Selection of Optimal Query: A fourth instance of the LLaMA-3-70B model, using a dedicated selection prompt, evaluates the generated queries and selects the most suitable one.
- Final Label Assignment: The chosen query is used to label the document, ensuring that the label accurately reflects the content and context of the original text.
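The generate-then-select loop above can be sketched in a few lines of Python. This is a minimal illustration, not the actual LegalKit code: the `llm` callable, the prompt wording, and the numeric-answer selection format are all assumptions made for the example.

```python
def label_document(document: str, llm, n_queries: int = 3) -> str:
    """Sketch of the LegalKit labeling pipeline: generate candidate
    queries with independent LLM calls, then have a selector call
    pick the best one, which becomes the document's label."""
    # Step 1: independent query generations (three LLaMA-3-70B
    # instances in the described setup; here just repeated calls).
    candidates = [
        llm(f"Write a search query a lawyer might use to find this text:\n{document}")
        for _ in range(n_queries)
    ]
    # Step 2: a fourth instance evaluates the candidates via a
    # dedicated selection prompt and answers with a number.
    numbered = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(candidates))
    choice = llm(
        f"Document:\n{document}\n\nCandidate queries:\n{numbered}\n"
        "Reply with only the number of the query that best matches the document."
    )
    # Step 3: the chosen query is assigned as the label.
    return candidates[int(choice.strip()) - 1]

# Toy stand-in for the model, with canned replies so the flow is visible.
_replies = iter([
    "What is the limitation period for this claim?",
    "Conditions for terminating a residential lease",
    "Tenant rights when the landlord gives notice",
    "2",  # the selector picks candidate 2
])
def toy_llm(prompt: str) -> str:
    return next(_replies)

label = label_document("Article 15 of the law of 6 July 1989 ...", toy_llm)
```

In production each call would hit a separate model instance; the pipeline itself is just this three-step function applied to every document in the corpus.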