Using llama3.2-vision:11b for the app agent #140

Open

ms-cleblanc opened this issue Dec 4, 2024 · 2 comments
@ms-cleblanc

I'm using GPT for my host agent, and its response has all of the components I would expect:

DEBUG: Json string before loading: {
    "Observation": "I observe that the Google Chrome application is available from the control item list, with the title of 'New Tab - Google Chrome'.",
    "Thought": "The user request can be solely completed on the Google Chrome application. I need to open the Google Chrome application and click on the + icon in the top bar to open a new tab.",
    "CurrentSubtask": "Open a new tab in Google Chrome by clicking on the + icon in the top bar.",
    "Message": ["(1) Locate the + icon in the top bar of Google Chrome.", "(2) Click on the + icon to open a new tab."],
    "ControlLabel": "4",
    "ControlText": "New Tab - Google Chrome",
    "Status": "CONTINUE",
    "Plan": [],
    "Questions": [],
    "Comment": "I plan to open a new tab in Google Chrome by clicking on the + icon in the top bar."
}

However, when I use Ollama as my app agent, the responses are not in the format UFO expects: there are no Observations, Thoughts, or even Plans. I do get a decent response from the llama model, though:

DEBUG: Json string before loading: {
    "id": 3,
    "title": "Open a new tab in Google Chrome by clicking on the + icon in the top bar.",
    "steps": [
        { "stepNumber": 1, "description": "Locate the + icon in the top bar of Google Chrome." },
        { "stepNumber": 2, "description": "Click on the + icon to open a new tab." }
    ],
    "image": "annotated screenshot"
}

What could I be doing wrong? How does the AppAgent know to provide Thoughts and Observations?
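
For reference, a quick check like this (just a throwaway snippet, not UFO code) confirms that the llama reply is missing every key the GPT reply contains:

```python
import json

# Keys present in the GPT host agent reply above; the llama reply has none of them.
EXPECTED_KEYS = {
    "Observation", "Thought", "CurrentSubtask", "Message", "ControlLabel",
    "ControlText", "Status", "Plan", "Questions", "Comment",
}

def missing_keys(json_string: str) -> set:
    """Return the expected keys absent from a model's raw JSON reply."""
    return EXPECTED_KEYS - json.loads(json_string).keys()

llama_reply = '{"id": 3, "title": "Open a new tab...", "steps": [], "image": "annotated screenshot"}'
print(sorted(missing_keys(llama_reply)))  # every expected key is missing
```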

@vyokky
Contributor

vyokky commented Dec 5, 2024

This is probably because the model you are using is not strong enough. We feed the same prompts to all models; if a model fails to follow the instructions, it may generate output in a format we do not expect.
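
If you want to keep experimenting with weaker models, one thing you could try (just a sketch, not something UFO ships) is to validate the reply against the keys you expect and re-ask the model with a reminder whenever it drifts. `call_model` below is a placeholder for whatever sends the prompt to your LLM:

```python
import json

# Keys expected back from the agent, taken from the GPT example in this thread.
EXPECTED_KEYS = {
    "Observation", "Thought", "ControlLabel", "ControlText",
    "Status", "Plan", "Comment",
}

def ask_until_valid(call_model, prompt, max_retries=3):
    """Re-prompt until the reply parses as JSON with the expected keys."""
    for _ in range(max_retries):
        raw = call_model(prompt)
        try:
            reply = json.loads(raw)
            missing = EXPECTED_KEYS - reply.keys()
        except json.JSONDecodeError:
            missing = EXPECTED_KEYS
        if not missing:
            return reply
        # Remind the model of the required schema and try again.
        prompt += (
            "\n\nYour previous reply was missing the keys "
            f"{sorted(missing)}. Respond with a single JSON object that "
            "contains exactly those keys and nothing else."
        )
    raise ValueError("model never produced the expected JSON schema")
```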

@ms-cleblanc
Author

Thanks for your help! I upgraded my VM and ran the 90b model, but I hit the same issue. The context window is 128K, just like GPT's, so I wonder why it's ignoring the prompt. I think Ollama wants the image as a filename rather than as bytes in the context window. Do you think that change might help?

DEBUG: Json string before loading: {"control_text": "Customer Service workspace", "control_type": "TabItem", "label": "13"}
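
For context, this is roughly what I was planning to try. As far as I can tell, Ollama's REST chat endpoint takes images as base64-encoded strings in the message's `images` field (file paths in the prompt are a CLI convenience), and it also seems to accept `"format": "json"` to nudge the model toward JSON-only output. Just a sketch; the model name and screenshot path are placeholders:

```python
import base64
import requests

# Rough sketch of sending a screenshot to llama3.2-vision through Ollama's
# REST API; model name and screenshot path are placeholders.
with open("annotated_screenshot.png", "rb") as f:
    screenshot_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.2-vision:11b",
        "messages": [{
            "role": "user",
            "content": "List the control items visible in this screenshot.",
            "images": [screenshot_b64],  # base64 strings, not file paths
        }],
        "format": "json",  # ask Ollama to constrain the reply to JSON
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["message"]["content"])
```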
