LightAgent is a mobile agentic framework designed for efficient smartphone task execution. It features lightweight 3B-scale Vision-Language Models that can run directly on devices. The system combines these compact models with a dynamic device-cloud collaboration approach to optimize both performance and resource usage.
The framework uses a two-stage training methodology that combines supervised fine-tuning (SFT) and GRPO (Group Relative Policy Optimization) reinforcement learning with synthetic data generation. This approach enables the 3B models to achieve performance comparable to much larger 7B-9B models. Through intelligent task orchestration and structured memory mechanisms, LightAgent reduces cloud model invocations by approximately 10% while maintaining robust performance across over 25 mobile applications in real-world scenarios.
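For intuition on the GRPO stage, such training typically optimizes the policy against a verifiable, rule-based reward computed from the model's output (the R1-V project acknowledged at the end of this README provides the implementation details we build on). The snippet below is only a hypothetical sketch of what such a reward might look like for GUI action prediction; the actual reward design used for LightAgent may differ.

```python
# Hypothetical GRPO-style reward for GUI action prediction.
# This only illustrates the idea of a verifiable, rule-based reward used during
# RL fine-tuning; it is not the exact reward used in LightAgent's training.
import json
import re

def action_reward(completion: str, reference: dict) -> float:
    """Score a model completion against a reference GUI action."""
    reward = 0.0
    # Format reward: the completion should contain a JSON action inside <action> tags.
    match = re.search(r"<action>(.*?)</action>", completion, re.DOTALL)
    if match is None:
        return 0.0
    reward += 0.2  # well-formed output
    try:
        predicted = json.loads(match.group(1))
    except json.JSONDecodeError:
        return reward
    # Accuracy reward: same action type (tap, swipe, type, ...) and same target element.
    if predicted.get("type") == reference.get("type"):
        reward += 0.4
        if predicted.get("target") == reference.get("target"):
            reward += 0.4
    return reward
```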
- LightAgent: Mobile Agentic Foundation Models
- Compact Architecture: Specialized 3B-scale Vision-Language Models optimized for mobile GUI tasks with minimal computational footprint.
- On-Device Deployment: True smartphone-compatible models that maintain competitive performance while running locally without cloud dependency.
- Dynamic Orchestration: Real-time task complexity assessment that intelligently switches between device and cloud models based on execution requirements.
- Cost-Performance Optimization: Strategic resource allocation that leverages cost-efficient on-device models while compensating for their limitations through selective cloud model usage.
- Extended Benchmark Suite: Beyond AndroidLab, incorporates 25+ additional tasks across popular mobile applications for real-world validation.
- Multi-Dimensional Assessment: Comprehensive evaluation covering performance metrics, computational efficiency, and practical deployment scenarios.
- Synthetic Data Generation: Leverages advanced MLLMs to create high-quality reasoning-chain training data, addressing the scarcity of manual annotations.
- Two-Stage Training: SFT injects GUI foundational knowledge, while GRPO reinforcement learning optimizes task completion accuracy.
- Small Model Enhancement: Enables 3B models to achieve performance comparable to 7B-9B models on GUI tasks through structured training.
- Dynamic Task Assessment: Real-time complexity evaluation determines when and how frequently to monitor device model performance.
- Intelligent Orchestration: Seamlessly switches between device and cloud models based on execution progress and failure patterns (see the sketch following this list).
- Cost-Performance Optimization: Reduces cloud invocations by ~10% while maintaining high task success rates through strategic resource allocation.
- Long-Horizon Reasoning: Multi-step chain-of-thought reasoning with reflective error correction to enhance decision-making capabilities.
- Text-Based Summarization: Compresses high-resolution screenshots into compact textual representations for efficient memory management.
- Structured Context Retention: Maintains 10-20 steps of historical context in resource-constrained environments through optimized token usage.
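The following is a minimal sketch of how the orchestration and memory mechanisms listed above could fit together in a single control loop. Every name in it (`assess_complexity`, `summarize_screen`, `device_model`, `cloud_model`, and the environment interface) is an illustrative placeholder rather than LightAgent's actual API; it only demonstrates complexity-based routing, failure-triggered escalation to the cloud model, and a compact text-based memory of recent steps.

```python
# Minimal, illustrative device-cloud orchestration loop.
# All names below (assess_complexity, summarize_screen, device_model, cloud_model,
# and the env interface) are hypothetical placeholders, not LightAgent's real APIs.
from collections import deque

MAX_MEMORY_STEPS = 20       # keep 10-20 steps of compact textual history
MAX_DEVICE_FAILURES = 2     # escalate to the cloud model after repeated on-device failures

def run_task(task, env, device_model, cloud_model,
             assess_complexity, summarize_screen, max_steps=30):
    memory = deque(maxlen=MAX_MEMORY_STEPS)   # text summaries instead of raw screenshots
    device_failures = 0
    for step in range(max_steps):
        screenshot = env.screenshot()
        screen_text = summarize_screen(screenshot)   # compress the screen to text
        # Route the step: hard steps or repeated device failures go to the cloud model.
        use_cloud = (assess_complexity(task, screen_text, memory) == "hard"
                     or device_failures >= MAX_DEVICE_FAILURES)
        model = cloud_model if use_cloud else device_model
        action = model.predict(task, screenshot, list(memory))
        ok = env.execute(action)
        if not ok and not use_cloud:
            device_failures += 1   # remember on-device failures for later escalation
        memory.append({"step": step, "screen": screen_text,
                       "action": action, "success": ok})
        if env.task_finished():
            return True
    return False
```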
This project comprises three core components designed for comprehensive mobile agent development and evaluation:
- For model training, please refer to the training guide README for comprehensive setup and execution instructions.
- For the data generation pipeline, please refer to the data preparation guide README for detailed implementation steps.
Below, we focus on evaluation using the AndroidLab benchmark framework.
Installation: Follow the official AndroidLab documentation for complete setup instructions.
Environment Configuration:
- Recommended Mode: AVD on Mac (arm64) - validated in our experiments.
- App Setup: Manual installation and task-specific configuration required.
- Compatibility Note: Original Docker images are not compatible with AVD environments.
vLLM Integration:
- Inference scripts available in ./vllm_script/ directory
- Optimized for efficient small model serving
Model Access:
- LightAgent Weights: 3B parameter model hosted on HuggingFace
- Deployment Process: Download weights → Deploy via vLLM → Configure inference service (an example request is shown below)
- Service Ready: Seamless integration with evaluation pipeline
- API Setup Required: Configure cloud model credentials in ./evaluation/evaluation.py: Line 63, Line 75, Line 81
- Coming Soon: Streamlined configuration interface in development
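Once the weights are served with vLLM, the service exposes an OpenAI-compatible API that the evaluation pipeline can call. The snippet below is an illustrative sanity check; the base URL, port, and served model name are assumptions and must match the options you passed when launching vLLM.

```python
# Illustrative client-side check that the vLLM-served model is reachable.
# The base_url, port, and model name are assumptions: adjust them to match
# how you launched vLLM (it exposes an OpenAI-compatible API by default).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="LightAgent-3B",  # must match the model name/path passed to vLLM
    messages=[{"role": "user", "content": "Describe the current screen."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```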
Test individual tasks using the following command structure:
```bash
python eval.py -n test_name -c <your path to config.yaml> --task_id task_id
```

Example Usage:

```bash
python eval.py -n all_cloud_v1_hyper -c ./configs/example_xml_cloud_hyper.yaml --task_id zoom_1
```

Convenient batch testing scripts are available in ./test_script:
- all_test_cloud_v1_hyper.sh: Evaluates all 138 AndroidLab benchmark tasks
- all_test_cloud_v1_hyper_add.sh: Evaluates tasks for four additional mobile apps
For comprehensive details about the four additional app tasks, refer to the documentation: Additional Apps Documentation
Required Configuration: Set up LLM service credentials in ./evaluation/tasks/llm_evaluator.py:
- Line 10: API configuration
- Line 12: Service URL
Enhancement: Our implementation replaces AndroidLab's rule-based evaluation with LLM-powered assessment, providing more nuanced and accurate task completion evaluation.
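For illustration only, the sketch below shows the general shape of such an LLM-powered judge; the actual prompt, model, and client configuration live in ./evaluation/tasks/llm_evaluator.py and may differ.

```python
# Hypothetical sketch of an LLM-based task-completion judge.
# The real prompt and client setup live in ./evaluation/tasks/llm_evaluator.py.
from openai import OpenAI

client = OpenAI(base_url="YOUR_SERVICE_URL", api_key="YOUR_API_KEY")

def judge_completion(task_description: str, final_state: str) -> bool:
    """Ask an LLM whether the agent completed the task, given the final device state."""
    prompt = (
        "You are evaluating a mobile GUI agent.\n"
        f"Task: {task_description}\n"
        f"Final device state (textual summary): {final_state}\n"
        "Answer with a single word: SUCCESS or FAILURE."
    )
    reply = client.chat.completions.create(
        model="YOUR_JUDGE_MODEL",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=8,
    )
    return "SUCCESS" in reply.choices[0].message.content.upper()
```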
Execute result generation with the following command:
```bash
python generate_result.py --input_folder ./logs/evaluation/ --output_folder ./logs/evaluation/ --output_excel ./logs/evaluation/test_name.xlsx
```
- Manual Transfer Required: Move the generated evaluation files from the script directory to ./logs/
- Then Execute: Run the result generation command above
- Error Prevention: This step prevents file path conflicts and ensures proper result compilation
The key findings from our online evaluation on AndroidLab are summarized as follows:
- LightAgent, when deployed in a device-cloud collaborative setting, incurs only a relatively small performance drop while effectively reducing the number of cloud model invocations.
- Notably, prompting large models for extended reasoning does not always yield better results; this benefit depends on the capability of the cloud model, and only sufficiently strong models can take advantage of such strategies.
- We also report a comparison between LightAgent-3B and both similar-sized and larger models (such as 9B models), showing that LightAgent-3B achieves performance close to that of 9B models, making it a true "small powerhouse."
- Furthermore, when compared with closed-source models, LightAgent-3B's performance is comparable to previous or lightweight versions of these proprietary models.
For each MLLM, we measure the average total steps required to complete tasks, the proportion of steps handled by the on-device model versus the cloud model, and the average steps when using only the cloud model to quantify the reduction in cloud calls. The main results are as follows:
- The cloud model is still responsible for about 65% of the steps, mainly due to the limited capacity of the smaller on-device model.
- Introducing the on-device model leads to approximately a 10% reduction in cloud calls.
- Stronger cloud models (such as GLM-4.5V) experience a smaller reduction in cloud calls, as they are capable of solving more tasks independently without relying on the on-device model.
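For reference, the step-routing statistics described above can be derived from per-task execution logs along the following lines. The log structure and the exact definition of the cloud-call reduction are assumptions for illustration, not necessarily the bookkeeping used in our evaluation.

```python
# Illustrative computation of the step-routing statistics.
# `episodes` is a hypothetical log: one list of steps per task, where each step
# records which model ("device" or "cloud") produced the action.
def routing_stats(episodes, cloud_only_avg_steps):
    num_tasks = len(episodes)
    total_steps = sum(len(ep) for ep in episodes)
    cloud_steps = sum(1 for ep in episodes for step in ep if step["model"] == "cloud")

    avg_total_steps = total_steps / num_tasks      # average steps per task
    cloud_share = cloud_steps / total_steps        # fraction of steps handled by the cloud model
    avg_cloud_calls = cloud_steps / num_tasks      # cloud invocations per task
    # Reduction relative to a cloud-only baseline needing `cloud_only_avg_steps` calls per task.
    cloud_call_reduction = 1 - avg_cloud_calls / cloud_only_avg_steps

    return {
        "avg_total_steps": avg_total_steps,
        "cloud_share": cloud_share,
        "cloud_call_reduction": cloud_call_reduction,
    }
```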
We evaluate the average inference time per step using vLLM under different GPU setups. GLM-4.1V-9B-Thinking could not run on a single 3090 GPU due to context length limits, so only two-GPU results are shown.
LightAgent, thanks to its lightweight architecture, demonstrates a clear advantage in inference speed, making it more suitable for real-world on-device scenarios. This advantage becomes even more pronounced as computational resources become constrained. In contrast, although GLM-4.1V-9B-Thinking achieves higher performance, its inference time on two 3090s is 3.5 times that of LightAgent on a single 3090, and 4 times that of LightAgent on two 3090s. Its inability to run on a single 3090 further limits its feasibility for on-device deployment.
| Model | GPUs | Size | SR (%) | Time Cost / Step |
|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | Single 3090 | 7B | 10.1 | 6289.15 ms |
| LightAgent | Single 3090 | 3B | 15.2 | 4170.63 ms |
| GLM-4.1V-9B-Thinking | Two 3090s | 9B | 24.6 | 14584.89 ms |
| Qwen2.5-VL-7B-Instruct | Two 3090s | 7B | 10.1 | 4587.79 ms |
| LightAgent | Two 3090s | 3B | 15.2 | 3524.25 ms |
LightAgent builds upon excellent open-source projects. We sincerely thank their authors and contributors:
- AndroidLab - The benchmark framework.
- R1-V - Implementation details for the GRPO training methodology.
- LLaMA Factory - The unified training framework enabling efficient model fine-tuning.
This project is released under the MIT License.




