Examples of the SOP data for the AIA and AitW scenarios in our paper can be found in the <data> folder.

Mobile

Android in the Wild (AitW) is a large-scale dataset for mobile device control. It contains human-collected demonstrations of natural-language instructions, user interface (UI) screens, and actions for a variety of human tasks.

Web

Mind2Web is a dataset for developing and evaluating generalist web agents that can follow language instructions to complete complex tasks on any website. Mind2Web contains 2,350 tasks from 137 websites spanning 31 domains that reflect diverse and practical use cases on the web, provide challenging yet realistic environments with real-world websites, and test generalization ability across tasks and environments.
### Data Preparation

To add an AitW dataset, the following steps need to be performed.

- Create a dataset configuration following the schema described above. Examples can be found in ***code/llama-recipes-main/src/llama_recipes/configs/datasets.py***.
- Create a preprocessing routine that loads the data and returns a PyTorch-style dataset. The signature of the preprocessing function needs to be (dataset_config, tokenizer, split_name), where split_name is the string for the train/validation split as defined in the dataclass. Examples can be found in ***code/llama-recipes-main/src/llama_recipes/datasets/aitw_dataset.py*** and ***\_\_init\_\_.py***:
    class aitw(Dataset):
        def __init__(self, tokenizer, json_name=None):
            ...

    def get_dataset(dataset_config, tokenizer, json_name=None):
        """Cover function for handling loading the working dataset."""
        ...
- Register the dataset name and preprocessing function by inserting them as key and value into the DATASET_PREPROC dictionary in ***code/llama-recipes-main/src/llama_recipes/utils/dataset_utils.py***.
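The registration step above can be sketched as follows; the names here (`aitw_dataset`, `get_aitw_dataset`) are illustrative, not the repo's exact code:

```python
# Illustrative sketch of the DATASET_PREPROC registration pattern.

def get_aitw_dataset(dataset_config, tokenizer, split_name):
    # In the real routine this would load the AitW JSON and return a
    # PyTorch-style dataset; stubbed here to keep the sketch self-contained.
    return {"config": dataset_config, "split": split_name}

# dataset name -> preprocessing function, as in utils/dataset_utils.py
DATASET_PREPROC = {
    "aitw_dataset": get_aitw_dataset,
}

def get_preprocessed_dataset(tokenizer, dataset_config, split_name="train"):
    # Training code dispatches on the registered dataset name.
    name = dataset_config["dataset"]
    if name not in DATASET_PREPROC:
        raise NotImplementedError(f"{name} is not registered")
    return DATASET_PREPROC[name](dataset_config, tokenizer, split_name)
```

Once registered, the dataset can be selected by name from the training configuration.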
MobileAgent
1. Introduction
Agents centered around Large Language Models (LLMs) are now capable of automating mobile device operations for users. After fine-tuning on a user's mobile operations, these agents can follow high-level user instructions online, performing goal decomposition, sub-goal sequencing, and interactive environmental exploration until the final objective is achieved. However, privacy concerns over personalized user data arise during mobile operations, requiring user confirmation. Moreover, users' real-world operations are exploratory, and the resulting action data is complex and redundant, which poses challenges for agent learning. To address these issues, in our practical application we designed interactive tasks between agents and humans to identify sensitive information and align with personalized user needs. Additionally, we integrated Standard Operating Procedure (SOP) information into the model's in-context learning to enhance the agent's comprehension of complex task execution.
paper : https://arxiv.org/abs/2401.04124
Model : https://huggingface.co/tineding/ACT-SOP
2. Data
Our research investigated well-known datasets and their statistical characteristics in the domain of LLM-driven device control. In the field of web control, the Mind2Web dataset is particularly notable; similarly, in the realm of mobile control, the AitW dataset stands out as a prominent resource.
3. DashBoard for AitW
We utilized the AitW dataset as a standard test set to evaluate the performance of various team models, including our model developed based on the SOP mechanism.
| Type | Screen Input | SOP | Model | Overall | General | Install | GoogleApps | Single | WebShopping | Link |
| ---- | ------------ | --- | ----- | ------- | ------- | ------- | ---------- | ------ | ----------- | ---- |
| LLM | Point-Coordinates | No | ChatGPT-CoT (5-shot) | 7.72 | 5.93 | 4.38 | 10.47 | 9.39 | 8.42 | Paper |
| Multi-Modal | Enlarged-Frame | Yes | | | | | | | | |
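As a quick sanity check on the ChatGPT-CoT (5-shot) numbers above, the overall figure matches the unweighted mean of the five subset scores (our assumption about how the dashboard aggregates; the subset ordering follows the usual AitW reporting):

```python
# ChatGPT-CoT (5-shot) subset scores: General, Install, GoogleApps,
# Single, WebShopping (ordering assumed from standard AitW reporting).
subsets = [5.93, 4.38, 10.47, 9.39, 8.42]
overall = round(sum(subsets) / len(subsets), 2)
print(overall)  # 7.72, matching the overall score
```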
4. Our Code
We provide the basic processing code for AitW so that everyone can train their own large models. The code covers data processing, model training, and model evaluation.
Mobile-AITW
data process
For more information, please refer to AitW and Auto-UI.
    {
        'episode_id': '16150593985172894737',
        'data': [{
            'goal': "What's the top post on reddit today?",
            'step_id': 1,
            'image': tensor([-4.1089e-01,  3.0527e+00, -6.8378e-04, ...,
                              5.6299e-01,  1.2676e+00, -9.7266e-01],
                            dtype=torch.float16),
            'ui_positions': array([[0.05131579, 0.09722222, 0.02171053, 0.27777779],
                                   [0.58684212, 0.31111112, 0.06447368, 0.14166667],
                                   [0.60855263, 0.83333331, 0.025     , 0.02916667],
                                   [0.67171055, 0.77916664, 0.01447368, 0.14166667],
                                   [0.67302632, 0.33888888, 0.01315789, 0.09027778],
                                   [0.67302632, 0.5625    , 0.01315789, 0.10972222],
                                   [0.77302629, 0.12083333, 0.04605263, 0.05555556],
                                   [0.77565789, 0.82499999, 0.04342105, 0.04305556],
                                   [0.77894735, 0.34999999, 0.03618421, 0.05833333],
                                   [0.81776315, 0.40277779, 0.01513158, 0.19583334],
                                   [0.87828946, 0.84027779, 0.0381579 , 0.03472222],
                                   [0.87894738, 0.12083333, 0.03618421, 0.04027778],
                                   [0.95657897, 0.19166666, 0.02894737, 0.02916667],
                                   [0.95657897, 0.48055556, 0.02828947, 0.03194445],
                                   [0.95723683, 0.77083331, 0.02697368, 0.02916667]]),
            'ui_text': ['Mon, Oct 10', 'M', '', 'YouTube', 'Gmail', 'Photos', '', '', '',
                        'Preferences', '', 'G', '', '', ''],
            'ui_type': ['TEXT', 'TEXT', 'ICON_PLAY', 'TEXT', 'TEXT', 'TEXT', 'ICON_CALL',
                        'ICON_LOCATION', 'ICON_CHAT', 'TEXT', 'ICON_MIC', 'ICON_GOOGLE',
                        'ICON_V_BACKWARD', 'ICON_NAV_BAR_CIRCLE', 'ICON_NAV_BAR_RECT'],
            'result_touch_yx': [0.8917460441589355, 0.4927879273891449],
            'result_lift_yx': [0.8917460441589355, 0.4927879273891449],
            'result_action': ['DUAL_POINT', '']
        }]
    }
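The coordinates in the example above (`ui_positions`, `result_touch_yx`) are normalized to [0, 1] in (y, x) order. A small sketch of mapping them back to pixel coordinates; the helper name and the 1920x1080 resolution are assumptions for illustration:

```python
# Map normalized (y, x) coordinates to pixel coordinates.
# The 1920x1080 resolution is an illustrative assumption.
def to_pixels(yx, height=1920, width=1080):
    y, x = yx
    return round(y * height), round(x * width)

to_pixels([0.8917460441589355, 0.4927879273891449])  # tap point of the step above
```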
    {
        "id": "2##What's the news in Chile?",
        "instruction": "Given a mobile screen and a question, provide the action based on the screen information.
    Previous Actions: step_id:0 action_type:PRESS_HOME step_id:1 action_type:DUAL_POINT ui_text: ui_type:ICON_MIC step_id:2 action_type:DUAL_POINT ui_text:abcnews.go.Com ui_type:TEXT step_id:3 action_type:TYPE typed_text:What's the news in Chile? step_id:4 action_type:TYPE typed_text: step_id:5 action_type:PRESS_ENTER
    Screen: id:0 ui_text: ui_type:ICON_HOME id:1 ui_text: ui_type:ICON_THREE_DOTS id:2 ui_text:google.com/search?q ui_type:TEXT id:3 ui_text:Google ui_type:TEXT id:4 ui_text:= ui_type:ICON_THREE_BARS id:5 ui_text: ui_type:ICON_MIC id:6 ui_text:Q ui_type:ICON_MAGNIFYING_GLASS id:7 ui_text:What's the news in Chile? ui_type:TEXT id:8 ui_text:Al ui_type:TEXT id:9 ui_text:News ui_type:TEXT id:10 ui_text:Images ui_type:TEXT id:11 ui_text:Videos ui_type:TEXT id:12 ui_text:Maps ui_type:TEXT id:13 ui_text: ui_type:ICON_THREE_DOTS id:14 ui_text:4 ui_type:TEXT id:15 ui_text:https://www.aljazeera.com> where ui_type:TEXT id:16 ui_text:Chile | Today's latest from Al ui_type:TEXT id:17 ui_text:Jazeera ui_type:TEXT id:18 ui_text:Stay on top of Chile latest developments on ui_type:TEXT id:19 ui_text:the ground with Al ui_type:TEXT id:20 ui_text:Jazeera's fact-based ui_type:TEXT id:21 ui_text:news, ui_type:TEXT id:22 ui_text:exclusive video footage, ui_type:TEXT id:23 ui_text:photos ui_type:TEXT id:24 ui_text:and ui_type:TEXT id:25 ui_text:updated… ui_type:TEXT id:26 ui_text: ui_type:ICON_THREE_DOTS id:27 ui_text:https://www.reuters.com» archive ui_type:TEXT id:28 ui_text:Chile News Headlines |Reuters ui_type:TEXT id:29 ui_text:Chile permanently closes ui_type:TEXT id:30 ui_text:mining areas ui_type:TEXT id:31 ui_text:Chile files ui_type:TEXT id:32 ui_text:connected to giant sinkhole ui_type:TEXT id:33 ui_text:charges against mining company for giant… ui_type:TEXT id:34 ui_text: ui_type:ICON_THREE_DOTS id:35 ui_text:https://www.independent.co.uk> topic ui_type:TEXT id:36 ui_text:Chile ui_type:TEXT id:37 ui_text:latest news, ui_type:TEXT id:38 ui_text:breaking ui_type:TEXT id:39 ui_text:- ui_type:TEXT id:40 ui_text:stories and comMent - The ui_type:TEXT id:41 ui_text: ui_type:ICON_NAV_BAR_CIRCLE id:42 ui_text: ui_type:ICON_NAV_BAR_RECT id:43 ui_text: ui_type:ICON_V_BACKWARD
    Instruction:What's the news in Chile? Answer:",
        "input": "",
        "output": "action_type:DUAL_POINT ui_text:Jazeera ui_type:TEXT id:17"
    }
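The Screen field in the sample above is a flat serialization of the per-element screen annotations. A minimal sketch of producing it from the parallel `ui_text` / `ui_type` lists (the helper name is ours, not the repo's):

```python
# Serialize parallel ui_text / ui_type lists into the
# "id:N ui_text:X ui_type:Y" prompt format shown above.
def serialize_screen(ui_text, ui_type):
    parts = [f"id:{i} ui_text:{text} ui_type:{kind}"
             for i, (text, kind) in enumerate(zip(ui_text, ui_type))]
    return "Screen: " + " ".join(parts)

serialize_screen(["Google", ""], ["TEXT", "ICON_MIC"])
# -> 'Screen: id:0 ui_text:Google ui_type:TEXT id:1 ui_text: ui_type:ICON_MIC'
```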
Here we make use of Parameter-Efficient Fine-Tuning (PEFT) methods, as described in the next section.
To run the command above, make sure to pass the `peft_method` arg, which can be set to `lora`, `llama_adapter`, or `prefix`.
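For intuition, LoRA keeps the base weight frozen and learns only a low-rank update scaled by alpha/r. A minimal NumPy sketch of the idea, not the llama-recipes implementation (dimensions and hyperparameters are illustrative):

```python
import numpy as np

# LoRA idea: effective weight W_eff = W + (alpha / r) * B @ A,
# where W is frozen and only the low-rank factors A, B are trained.
rng = np.random.default_rng(0)
d, r, alpha = 4, 2, 16
W = rng.normal(size=(d, d))         # frozen base weight
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, zero-initialized

def forward(x):
    return x @ (W + (alpha / r) * B @ A).T

x = rng.normal(size=(d,))
# With B initialized to zero, the adapted model reproduces the base model.
assert np.allclose(forward(x), x @ W.T)
```

Only A and B need to be stored and shipped, which is why the training log below reports a dedicated LoRA storage path.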
Once model training is complete, the log will output information such as the loss, the tokenizer, the LoRA weight storage path, and comprehensive details of the training process.
Inference and Evaluation
This code loads the LoRA weights into the base model and performs prediction and evaluation on the test set. Finally, it returns the model's prediction results, including the number of prompts in the test samples, the corresponding count of user task instructions, and the task completion score.
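At a high level, a task completion score of this kind checks predicted actions against the ground-truth episode. A simplified sketch of such scoring (the official AitW matcher is more lenient, e.g. it also accepts taps near the gold location; function names here are ours):

```python
# Simplified per-episode scoring: an episode counts as complete
# only if every predicted step matches the ground-truth step.
def episode_complete(pred_actions, gold_actions):
    return (len(pred_actions) == len(gold_actions)
            and all(p == g for p, g in zip(pred_actions, gold_actions)))

def task_complete_score(episodes):
    # episodes: list of (predicted, gold) action-sequence pairs
    done = sum(episode_complete(p, g) for p, g in episodes)
    return done / len(episodes)
```

For example, `task_complete_score([(["PRESS_HOME"], ["PRESS_HOME"]), (["TYPE:a"], ["TYPE:b"])])` returns 0.5, since only the first episode matches its ground truth exactly.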