Fifty/Fifty — Neutral News

Ai2 released MolmoWeb, an open-weight visual web agent that can navigate websites using screenshots, along with 30,000 human task demonstrations.

The Allen Institute for AI (Ai2) has released MolmoWeb, an open-weight visual web agent designed to navigate and interact with websites using only browser screenshots. The Seattle-based nonprofit announced two versions of the model, one with 4 billion parameters and one with 8 billion.

MolmoWeb receives a task instruction, the current screenshot, a log of prior actions, and page information, then produces natural language reasoning before executing a browser action such as clicking, typing, or scrolling. The model does not parse HTML or rely on accessibility tree representations, which makes it browser-agnostic and compatible with Chrome, Safari, and hosted browser services.
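The observe, reason, act cycle described above can be sketched in a few lines of Python. This is only an illustration of the general loop, not Ai2's code or the MolmoWeb API: every name here (Observation, Step, run_agent, scripted_policy), the "done" stop action, and the string format of action arguments are assumptions made for the example, and a scripted stand-in takes the place of the model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Observation:
    """Inputs the article lists for each step (field names are hypothetical)."""
    task: str               # the task instruction
    screenshot: bytes       # image bytes of the current page
    action_log: List[str]   # text record of earlier actions
    page_info: str          # page information, e.g. a URL (assumed form)


@dataclass
class Step:
    """Output of one step: reasoning text first, then a browser action."""
    reasoning: str
    action: str             # "click", "type", "scroll", or "done" (assumed stop signal)
    argument: str = ""


Policy = Callable[[Observation], Step]


def run_agent(
    task: str,
    policy: Policy,
    take_screenshot: Callable[[], bytes],
    get_page_info: Callable[[], str],
    execute: Callable[[Step], None],
    max_steps: int = 10,
) -> List[str]:
    """Repeat observe -> reason -> act until the policy stops or steps run out."""
    log: List[str] = []
    for _ in range(max_steps):
        obs = Observation(task, take_screenshot(), list(log), get_page_info())
        step = policy(obs)
        if step.action == "done":
            break
        execute(step)  # the browser side only needs to accept clicks, keys, scrolls
        log.append(f"{step.action}({step.argument})")
    return log


def scripted_policy(obs: Observation) -> Step:
    """Stand-in for the model: replays a fixed script based on the log length."""
    script = [
        Step("The search box is visible; click it.", "click", "320,48"),
        Step("The box is focused; type the query.", "type", "open-weight web agent"),
    ]
    index = len(obs.action_log)
    if index < len(script):
        return script[index]
    return Step("Nothing left to do.", "done")


if __name__ == "__main__":
    executed: List[Step] = []
    history = run_agent(
        "Search for open-weight web agents",
        scripted_policy,
        take_screenshot=lambda: b"",           # fake browser: empty image
        get_page_info=lambda: "https://example.com",
        execute=executed.append,
    )
    print(history)
```

Because the loop touches the browser only through a screenshot function and an action executor, any browser that can supply those two things could be plugged in, which is the sense in which a screenshot-driven agent is browser-agnostic.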

The release includes MolmoWebMix, a training dataset containing 30,000 human task trajectories across more than 1,100 websites, 590,000 individual subtask demonstrations, and 2.2 million screenshot question-answer pairs. Ai2 describes this as the largest publicly released collection of human web-task execution data. The dataset combines human demonstrations recorded through a custom Chrome extension, synthetic trajectories generated using text-based systems, and graphical user interface perception data.

According to Ai2's testing, MolmoWeb outperforms other open-weight models on four live-website benchmarks: WebVoyager, Online-Mind2Web, DeepShop, and WebTailBench. The institute also reports that MolmoWeb surpasses some older API-based agents built on GPT-4o that use accessibility tree and screenshot inputs.

The current browser agent market consists primarily of closed API systems, such as OpenAI's Operator and Anthropic's computer use API, and open-weight frameworks that require developers to supply their own language models. Ai2 positions MolmoWeb as a third option: a fully trained open-weight model released with its complete training data and methodology.

Ai2 acknowledges several limitations in the current version, including occasional text reading errors from screenshots, unreliable drag-and-drop interactions, and reduced performance on ambiguous instructions. The model was not trained on tasks requiring user logins or financial transactions.

AI Research Institute Releases Open-Weight Web Browser Agent with Training Data
