Agentic AI can feel very new to mathematicians: there is no formula. To me it is closer to linguistics or epistemology. It is different, but effective.

Agents wrap a frozen (or lightly finetuned) LLM and build larger systems on top of the simple but fundamental next-token prediction.

ReAct (Yao et al. 2022) is a foundational paradigm and a good place to start.

In the general framework there is an observation space $\mathcal{O}$ (the environment), a language space $\mathcal{L}$, an action space $\mathcal{A}$ (tool use), and so on. The notation below differs slightly from the paper's; I follow the examples therein, and of course this doesn't matter.

Idea

In the action-only paradigm, given a prompt, the agent takes an action $a_1\in\mathcal{A}$ (e.g. a search keyword), receives an observation $o_1\in\mathcal{O}$ (e.g. search results), and repeats. Let

\[c_t=(a_1,o_1,\cdots,a_{t-1},o_{t-1})\]

be the context at step $t$. The key step is to learn a policy $\pi(a_t\mid c_t)$ that decides on the next action based on the context, which is very hard.
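The action-only loop above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `policy` and `environment` are hypothetical stand-ins for $\pi(a_t\mid c_t)$ and the observation source.

```python
# Hypothetical sketch of an action-only agent loop.
# `policy`: maps the context c_t to the next action a_t.
# `environment`: maps an action a_t to an observation o_t.

def run_action_only(policy, environment, prompt, max_steps=5):
    """Roll out a_1, o_1, a_2, o_2, ... until a finish[...] action."""
    context = [prompt]
    for _ in range(max_steps):
        action = policy(context)           # a_t ~ pi(a_t | c_t)
        if action.startswith("finish["):
            return action[len("finish["):-1], context
        observation = environment(action)  # o_t from the environment
        context += [action, observation]   # c_{t+1} = (..., a_t, o_t)
    return None, context

# Toy policy/environment, for illustration only.
def toy_policy(context):
    return "finish[42]" if any("result" in c for c in context) else "search[q]"

def toy_env(action):
    return "result for " + action

answer, ctx = run_action_only(toy_policy, toy_env, "What is the answer?")
```

In practice the policy is the LLM itself, conditioned on the serialized context; the toy version above only shows the control flow.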

ReAct extends this idea by augmenting the action space:

\[\hat{\mathcal{A}}=\mathcal{A}\sqcup\mathcal{L}.\]

Given a prompt, the agent makes a thought $\hat a_1\in \mathcal{L}\subset \hat{\mathcal{A}}$, takes an action $a_1\in\mathcal{A}\subset \hat{\mathcal{A}}$, receives an observation $o_1\in\mathcal{O}$, and repeats. Now the context is

\[c_t = (\hat a_1, a_1,o_1,\cdots,\hat a_{t-1}, a_{t-1},o_{t-1}, \hat a_t).\]

It should be easier to learn the policy $\pi(a_t\mid c_t)$ given the thoughts.
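The ReAct loop differs from the action-only loop only in the extra thought step. A minimal sketch, again with a hypothetical `llm` stand-in that returns the next thought or action given the context:

```python
# Sketch of the ReAct loop: each step emits a thought (an action in the
# language space L that changes no external state) before the real action.
# `llm` is a stand-in that continues the context with a thought or an action.

def run_react(llm, environment, prompt, max_steps=5):
    context = [prompt]
    for _ in range(max_steps):
        thought = llm(context + ["Thought:"])         # hat a_t in L
        action = llm(context + [thought, "Action:"])  # a_t in A
        if action.startswith("finish["):
            return action[len("finish["):-1], context + [thought, action]
        observation = environment(action)
        context += [thought, action, observation]
    return None, context

# Toy llm/environment, for illustration only.
def toy_llm(context):
    if context[-1] == "Thought:":
        return "I should search, then finish."
    return ("finish[Paris]" if any("Obs" in c for c in context)
            else "search[capital of France]")

def toy_env(action):
    return "Obs: " + action

answer, ctx = run_react(toy_llm, toy_env, "What is the capital of France?")
```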

There are various types of useful thoughts, e.g. decomposing task goals and creating action plans, injecting commonsense knowledge relevant to task solving, extracting important parts from observations, tracking progress and transitioning between action plans, and handling exceptions by adjusting action plans.

As the name suggests, ReAct combines Reasoning (thoughts) and Acting.

Details

The action space consists of calls to a Wikipedia API:

  • search[entity] returns the first 5 sentences of the corresponding wiki page; if the entity does not exist, it returns the top-5 similar entities.
  • lookup[string] returns the next sentence on the current page containing the string, similar to Ctrl+F.
  • finish[answer] finishes the task and reports the answer.
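A dispatcher for these three actions might look like the sketch below. It is an assumption-laden toy: a small dict stands in for the Wikipedia corpus, and the similarity ranking is faked, but the parsing and per-tool behavior mirror the action formats above.

```python
import re

# Toy corpus standing in for Wikipedia; a real agent would call the wiki API.
WIKI = {"ReAct": "ReAct interleaves reasoning and acting. It was proposed in 2022."}

def execute(action, page_state):
    """Parse an action string like search[ReAct] and run the matching tool."""
    m = re.fullmatch(r"(search|lookup|finish)\[(.*)\]", action)
    kind, arg = m.group(1), m.group(2)
    if kind == "search":
        if arg in WIKI:
            page_state["page"] = WIKI[arg]   # remember page for later lookups
            return WIKI[arg]                 # first sentences of the page
        return "Similar: " + ", ".join(sorted(WIKI))  # stand-in for top-5 similar
    if kind == "lookup":                     # like Ctrl+F on the current page
        sents = [s for s in page_state.get("page", "").split(". ") if arg in s]
        return sents[0] if sents else "No more results."
    return ("FINISH", arg)                   # finish[answer]

state = {}
first = execute("search[ReAct]", state)
found = execute("lookup[2022]", state)
```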

Prompting

Manually compose ReAct-format trajectories as few-shot exemplars (3-6 in the paper).
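Assembling such a few-shot prompt can be sketched as below. The Thought/Action/Observation field names follow the paper's trajectory format; the demo content is made up for illustration.

```python
# Sketch of building a few-shot ReAct prompt from hand-written trajectories.
# Each trajectory is a list of (thought, action, observation) steps; the final
# finish[...] step has no observation.

def format_trajectory(question, steps):
    lines = ["Question: " + question]
    for i, (thought, action, obs) in enumerate(steps, 1):
        lines.append(f"Thought {i}: {thought}")
        lines.append(f"Action {i}: {action}")
        if obs is not None:
            lines.append(f"Observation {i}: {obs}")
    return "\n".join(lines)

demo = format_trajectory(
    "Where is the Eiffel Tower?",
    [("I should search for the Eiffel Tower.", "search[Eiffel Tower]",
      "The Eiffel Tower is in Paris."),
     ("The observation gives the answer.", "finish[Paris]", None)],
)
# Prepend the demo(s) to the new question; the LLM continues from "Thought 1:".
prompt = demo + "\n\nQuestion: <new question>\nThought 1:"
```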

Combining internal and external knowledge

If ReAct fails to return an answer within the given number of steps, switch to CoT (chain of thought: reasoning only, without actions, so relying on internal knowledge).
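The fallback is simple control flow; a sketch with hypothetical `react` and `cot` solvers that return an answer string or None:

```python
# Sketch of the ReAct -> CoT fallback: try the grounded, tool-using solver
# first; if it exhausts its step budget without finish[...], fall back to
# reasoning-only CoT, which relies on the model's internal knowledge.

def solve(question, react, cot, max_steps=7):
    answer = react(question, max_steps)  # external-knowledge attempt
    if answer is not None:
        return answer, "react"
    return cot(question), "cot"          # internal-knowledge fallback

# Toy solvers: ReAct always fails here, so CoT answers.
ans, mode = solve("Q", lambda q, n: None, lambda q: "guess")
```

The paper also uses the symmetric direction (fall back from CoT to ReAct when CoT self-consistency is unconfident); the one-way version above is the simpler case.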

Finetuning

Finetune the base LLM on 3,000 correct trajectories.
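Building that finetuning set amounts to keeping only trajectories whose final answer is correct; a sketch, with a hypothetical tuple layout:

```python
# Sketch of filtering trajectories for finetuning: keep only those whose
# final answer matches the gold label. Each item here is assumed to be
# (trajectory_text, predicted_answer, gold_answer).

def build_finetune_set(trajectories):
    return [(text, ans) for text, ans, gold in trajectories if ans == gold]

data = build_finetune_set([("t1", "Paris", "Paris"), ("t2", "Rome", "Paris")])
```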