
How our Semantic Agent works

You may have seen in an earlier post why we developed a Semantic Agent to transform the way agentic browsing works. But how does it work?

Consider that today’s browser agents act as if every website and every task were completely new.

To complete a task, they take a screenshot of the target page and pass that along to an LLM, along with a lengthy representation of pretty much all the HTML on the web page. They ask the LLM, “What should I do next?”

The model’s answer is usually a directive to click on a specific position or element, or to type a string of text. But at each step, the agent has to guess, act, verify, and guess again, making costly LLM calls for every single interaction.
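The loop described above can be sketched roughly as follows. This is a minimal illustration, not any particular agent's implementation: `Action`, `run_step_by_step`, and the `ask_llm`, `observe`, and `act` callbacks are hypothetical names, and the LLM is replaced by a scripted stub so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Action:
    """One directive from the model (hypothetical shape)."""
    kind: str          # "click", "type", or "done"
    target: str = ""   # element or position to act on
    text: str = ""     # text to type, if any


def run_step_by_step(
    ask_llm: Callable[[str, str], Action],
    observe: Callable[[], Tuple[str, str]],
    act: Callable[[Action], None],
    max_steps: int = 20,
) -> int:
    """Run the guess/act/verify loop and return how many LLM calls it made."""
    llm_calls = 0
    for _ in range(max_steps):
        # Re-capture the page every step: screenshot plus HTML representation.
        screenshot, html = observe()
        # "What should I do next?" costs one LLM call per interaction.
        action = ask_llm(screenshot, html)
        llm_calls += 1
        if action.kind == "done":
            break
        act(action)
    return llm_calls


def demo() -> int:
    """Drive the loop with a scripted stand-in for the LLM."""
    scripted = iter([
        Action("click", "search box"),
        Action("type", "search box", "Paris"),
        Action("click", "Search"),
        Action("done"),
    ])
    performed: List[Action] = []
    return run_step_by_step(
        ask_llm=lambda screenshot, html: next(scripted),
        observe=lambda: ("<screenshot>", "<html>...</html>"),
        act=performed.append,
    )
```

In this sketch, a task with three interactions takes four model calls: one per interaction, plus one to confirm the task is finished.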

Some browser agents have tried to optimize by asking the LLM to predict a sequence of multiple next steps. But web pages and web apps are highly dynamic.

Imagine an agent instructed to type into a combobox. The moment it starts typing, a dropdown menu might appear. Because the standard agent doesn’t structurally understand that it’s interacting with a combobox, this unexpected change in the user interface ruins its plan. So, like a driverless car confused by a traffic situation, it has to stop and ask for advice. In the agent’s case, that’s another expensive roundtrip with the LLM.

Getting an agent to understand

But what if the agent didn’t have to guess about what or where an element was on a web page? What if the web page could explicitly tell the agent, “I am a search button, and here is exactly how you interact with me”?

This is the core of our Semantic Agent architecture. Instead of relying on pixel-based vision and probabilistically guessing UI boundaries, our agent connects directly to the underlying structural meaning of the page: its native semantic HTML and web accessibility traits (WAI-ARIA).

On a web page, this information is available, to varying extents, in the accessibility tree. So our Semantic Agent doesn’t need to “see” the page, it doesn’t need to guess, and it doesn’t have to constantly run to the LLM to help it get over snags. Instead, it works off precise, highly structured information about states, roles, and labels.
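To make "states, roles, and labels" concrete, here is a minimal sketch of looking up elements in a simplified accessibility tree. The `AXNode` class and the `walk` and `find` helpers are assumptions made for illustration; a real accessibility tree carries far more detail, and this is not how the Semantic Agent itself is implemented.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class AXNode:
    """Simplified accessibility-tree node (hypothetical shape)."""
    role: str                                   # e.g. "button", "combobox"
    name: str = ""                              # accessible name (label)
    states: Dict[str, bool] = field(default_factory=dict)
    children: List["AXNode"] = field(default_factory=list)


def walk(node: AXNode) -> Iterator[AXNode]:
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def find(root: AXNode, role: str, name: str) -> Optional[AXNode]:
    """Return the first node matching both role and accessible name."""
    for node in walk(root):
        if node.role == role and node.name == name:
            return node
    return None


# A tiny example tree: a search form with a combobox and a button.
tree = AXNode("main", children=[
    AXNode("form", "Flight search", children=[
        AXNode("combobox", "Destination", {"expanded": False}),
        AXNode("button", "Search"),
    ]),
])

search_button = find(tree, "button", "Search")
destination = find(tree, "combobox", "Destination")
```

Here the lookup needs no pixels and no model call. The combobox also reports its own `expanded` state, so a dropdown that opens while typing shows up as a change in that state rather than as an unexplained change on screen.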

Two kinds of knowledge

To make this architecture scalable, we gave the Semantic Agent, in effect, two kinds of knowledge: a set of skills, and a way to learn (and re-use) new ones.

  • Predefined skills: Using the Pattern Intelligence we’ve developed elsewhere at Evinced, we taught our Semantic Agent about standard web components. If they are properly coded, our agent will know on a given page how any comboboxes, modal dialogs, tables, lists, or other established WAI-ARIA patterns are supposed to work.
  • Learned skills: When the agent successfully completes a new task, like booking a first-class ticket to Paris on Air France, it stores that entire workflow for future reference. The stored flow relies on semantic identifiers pulled from the accessibility tree. Unlike fragile CSS selectors that fail with minor DOM updates, or visual scripts that break the moment a button moves two pixels, semantic identifiers remain predictable and resilient even through significant cosmetic redesigns of a web page. This allows the agent to execute subsequent, similar tasks with minimal LLM interaction.
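The idea behind learned skills can be sketched like this. It is an illustration under stated assumptions, not the Semantic Agent's actual storage format: `Step`, `SkillStore`, and `replay` are hypothetical names, a semantic identifier is assumed to be a (role, accessible name) pair, and each page is modeled as a plain dictionary from semantic identifier to whatever internal handle the page happens to use.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

SemanticId = Tuple[str, str]  # assumed form: (role, accessible name)


@dataclass(frozen=True)
class Step:
    """One recorded interaction, keyed by meaning rather than by selector."""
    target: SemanticId
    action: str        # "click" or "type"
    text: str = ""


class SkillStore:
    """Keeps completed workflows so they can be reused later."""

    def __init__(self) -> None:
        self._skills: Dict[str, List[Step]] = {}

    def learn(self, task: str, steps: List[Step]) -> None:
        self._skills[task] = list(steps)

    def recall(self, task: str) -> Optional[List[Step]]:
        return self._skills.get(task)


def replay(steps: List[Step], page: Dict[SemanticId, str]) -> List[Tuple[str, str, str]]:
    """Resolve each step against the current page; no LLM call is involved."""
    performed = []
    for step in steps:
        handle = page.get(step.target)
        if handle is None:
            # Assumption: this is where an agent would fall back to the LLM.
            raise LookupError(f"no element with role/name {step.target}")
        performed.append((step.action, handle, step.text))
    return performed


store = SkillStore()
store.learn("book flight", [
    Step(("combobox", "Destination"), "type", "Paris"),
    Step(("button", "Search flights"), "click"),
])

# The same page before and after a cosmetic redesign: the internal handles
# change, but the roles and accessible names do not.
page_before = {("combobox", "Destination"): "#dest-input",
               ("button", "Search flights"): "#btn-17"}
page_after = {("combobox", "Destination"): "div.hero > input",
              ("button", "Search flights"): "#cta-primary"}

steps = store.recall("book flight")
result_before = replay(steps, page_before)
result_after = replay(steps, page_after)
```

A workflow recorded against `#btn-17` would break once the redesign renames that element. The version keyed on role and name replays against both pages, and it fails loudly only when an element with that meaning is actually missing.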

Measuring the size of the opportunity

One of the nice things about the LLM and agent world is the presence of lots of benchmarks. So we could easily test the efficiency of our Semantic Agent against leading standard agents using the tough Online Mind2Web benchmark. 

We’re pleased to say that our Semantic Agent delivers 32X better price-performance than standard browser agents on this benchmark. That figure reflects our agent being both radically faster and cheaper than alternatives.

But there’s more. The 32X figure is based on the average website in the benchmark, and some of those websites are more accessible than others. When we measured the accessibility of all the websites in the benchmark, we noticed that our agent performed even better, relative to alternatives, on the more accessible ones. There, the price-performance advantage was more like 50X. We also analyzed the sites in the benchmark and concluded that, if they made a small number of changes, we could get to a 100X price-performance improvement.

By now, it should be clear that agentic browsing and accessibility are closely intertwined, and that the more accessible a site is, the better an agent will perform. In our next post on this topic, we’ll turn to how to support all this change on the website end.