Behavior is rarely static. Time-varying factors, both internal and external, can influence the way in which humans and animals make decisions. Different ways of choosing an action can be attributed to using different strategies. One prominent perspective on such strategy heterogeneity is that the brain contains relatively independent, separable circuits that are conceptualized as supporting distinct strategies, each potentially competing for control. For instance, instrumental behavior appears to be supported by separable circuits for automatic versus deliberative control (** Killcross and Coutureau, 2003**;

**;**

*Daw et al., 2005***), and perhaps also different systems mediate the balance between such instrumental exploitation vs. different exploration strategies (**

*Gremel and Costa, 2013***;**

*Daw et al., 2006***;**

*Blanchard and Gershman, 2018***;**

*Hogeveen et al., 2022***;**

*Cinotti et al., 2019***). The relative contribution of different strategies to behavior is often characterized via a “mixture-of-agents” (MoA) model that describes behavior as a weighted contribution of multiple “agents” that implement different strategies (**

*co*ckburn et al., 2022***;**

*Camerer and Ho, 1999***;**

*Daw et al., 2011***;**

*Miller et al., 2017***;**

*Krueger et al., 2017***;**

*Gershman, 2018***;**

*Wilson et al., 2021***). It is hypothesized that the brain acts in the same way, weighing advice from separable circuits on how best to act (**

*co*ckburn et al., 2022***). The agent with the biggest weight can be thought of as the dominant strategy. Henceforth, we will broaden our use of the word “strategy” to now refer to a given weighting over the underlying individual agents.**

*O’Doherty et al., 2021*Although one key goal of this type of modeling is to characterize how humans and animals switch between different strategies (e.g., the formation of habits with excessive training), in prac-tice MoA models are usually fit to behavior assuming a single, fixed weighting over agents. Such models will therefore fail to capture changes in strategy over time, limiting our ability to associate behavioral and neural factors underlying variable strategies, and potentially missing contributions of a lesser used strategy all-together. For example, competing systems of automation and deliberation are often formalized and measured through a reinforcement learning (RL) framework as a mixture of model-free (MF) and model-based (MB) control (** Daw et al., 2005**;

**). A multi-step reward-guided task known as the “two-step task” was developed to differentiate these strategies in humans (**

*Sutton and Barto, 2020***) and has since been used in rodents (**

*Daw et al., 2011***;**

*Miller et al., 2017***;**

*Hasz and Redish, 2018***;**

*Groman et al., 2019b***). Many versions of the task show choices are well fit by a mixture of both MB and MF behavior (**

*Akam et al., 2021***;**

*Daw et al., 2011***;**

*Hasz and Redish, 2018***;**

*Groman et al., 2019b***), whereas other studies show behavior dominated by MB planning (**

*Akam et al., 2021***;**

*Miller et al., 2017***;**

*Feher da Silva and Hare, 2020***). The lack of evidence for MF influences in these studies is somewhat puzzling, for instance because MF learning is predicted by classic models of the dopamine system (**

*Miller et al., 2022***).**

*Schultz et al., 1997*Apart from a few studies which build in some specific hypothesized change rule for strategy weighting (but do not, accordingly, measure such change in an unbiased way) (** Lee et al., 2014**;

**;**

*Kool et al., 2016***;**

*Ebitz et al., 2018***), these studies neglect the dynamic representation of strategy. This leaves open the possibility that characterizing the dynamics of the agent mixture could reveal time-dependent trade-offs between strategies. For example, it might help clarify the behavioral discrepancies between different two-step tasks by revealing bouts of where MF control is the dominant agent, in animals whose behavior is otherwise dominated by a MB agent. It might also identify dynamic shifts in the contributions from other less well understood agents, such as the novelty preference agent. This in turn might help to clarify its function, e.g. (as we show here) for exploration.**

*Costa et al., 2019*To answer these questions, we present an elaborated model that not only measures the contribution of separable agents, but also characterizes how this contribution changes over time. A recent class of models known as GLM-HMMs (** Calhoun et al., 2019**;

**;**

*Ashwood et al., 2022***;**

*Bolkan et al., 2022***) seeks to measure time-dependent shifts in strategy by nesting a generalized linear model (GLM), which maps perceptual variables to choices, under a hidden Markov model (HMM), which models an internal “hidden” state that switches between different GLM instantiations, producing a dynamic weighting over perceptual variables. This model has been successful at capturing how the dependence of actions on measurable variables – such as choice history, reward, or a task stimulus – can vary dynamically throughout the task. This can infer, for example, periods of engagement and disengagement (**

*Oostland et al., 2022***), behavioral differences from neural inhibition (**

*Ashwood et al., 2020***), or speed of task learning with respect to an experimental stimulus (**

*Bolkan et al., 2022***). However, GLM-HMMs, as originally specified, model dynamics over directly observable variables in the GLM, and do not incorporate learning rules that themselves generate internal variables (such as action values inferred from models of reinforcement learning) that intervene to explain observed data.**

*Oostland et al., 2022*Here we replace the GLM stage of the GLM-HMM with a MoA to produce the mixture-of-agents hidden Markov model (MoA-HMM). Extending both MoA models and GLM-HMMs, the MoA-HMM characterizes the weighted influence of both inferred learning rules and (optionally) observed variables on subsequent actions, and how this weighting over agents can dynamically shift over time. We apply this approach to reanalyze data from rats performing the two-step task (** Miller et al., 2017**) to examine whether there is evidence for strategy shifting, and if so — making no assumptions other than the HMM dynamics — how it proceeds. Using MB and MF learning rules to capture MB versus MF control alongside differentiating between exploratory and exploitative behavior, we show that the MoA-HMM reveals discernible shifts in the dominant agent throughout behavioral sessions. These shifts reveal a time-dependent trade-off between exploration and exploitation as well as drifting engagement in the task. We further show that these shifts in strategy provide a new lens to analyze behavioral features and neural data that were not used to train the model: the model-inferred strategy captures, out of sample, significant changes in both reaction times and neural encoding.

## Reinforcement learning in the rat two-step task

We study rat behavior and neural recordings in a multi-step, reward-guided decision-making task known as the rat two-step task. Originally described by ** Miller et al. (2017**), this task is an adaptation of a similar task in humans (

**) that was designed to study the relative contribution of model-based and model-free strategies to choices. In the current version of the task, the rat is placed in an environment with two choice states (step one) that are separated in space and time from two outcome states (step two), in addition to initiation states for each step (**

*Daw et al., 2011***Figure 1A**). Each state is characterized by its own nose port, into which the rat has been trained to only poke when lit (yellow sun). The rat initiates the trial by entering the top center port (

**i**), after which it is presented with a free decision between the two choice states (

**ii**). Once a choice is made, the animal initiates the second step (

**iv**) and is presented with a single outcome port (

**v**) that it can enter for a chance at receiving liquid reward (

**vi**). The available outcome port is dependent on a set, probabilistic link between choices in step one and outcomes in step two, which can be represented as a transition matrix – the environment’s action-outcome model. Here one choice has an 80% probability of leading to one outcome (its

*common*transition) and a 20% chance of leading to the opposite outcome (its

*rare*transition), with flipped probabilities for the opposite choice. Either outcome state has a chance of giving reward; however, one outcome will always have a greater chance than the other (80% vs 20%), where the better outcome state flips unpredictably throughout the session (

**vii**). The goal of the rat, who’s been incentivized for liquid reward, is to track the better outcome port to maximize reward, which they do successfully (

**Figure 1B**).

The strength of the two-step task is in its separation of choice and outcome states and their transition structure, which allows model-free (MF) and model-based (MB) strategies to be distinguished by the way rewards and outcomes influence subsequent choices (**Figure 1C**). The key insight is that the stochastic state transitions (common vs. rare) decouple each action’s reward history (which should drive MF learning) from the reward histories associated with its predominant outcome state (which could, alternatively, be used for MB evaluation of the action’s value in terms of where it is likely to lead).

We can summarize MB and MF reward-learning strategies via two learning rules (or agents) that differ in whether they take account of transition in learning from reward. We also acccompany them by two more agents that capture reward-independent choice autocorrelation (perseveration or alternation), either MF (i.e., over actions) or MB (i.e., over predominant destination states). These learning rules (based on the formulation from ** Ito and Doya, 2009**, extended to the two-step task, see

**);**

*Park et al. (2017***)) have some algebraic differences from variants used in some previous studies (**

*Groman et al. (2019a***;**

*Daw et al., 2011***; see Methods), but capture the same logic. In addition to the four RL agents, an intercept or “bias” agent is included that captures the tendency towards left-ward or right-ward choices.**

*Miller et al., 2017*To demonstrate how reward and transition differently affect each agent, we adopt a trial-history regression that maps the influence of each trial type (common-reward, common-omission, rarereward, and rare-omission) at varying trial lags (** Miller et al., 2016**). A positive weight indicates that choices tend to repeat following that trial type, and a negative weight indicate that choices tend to switch following that trial type. Specifically, the first agent, model-based reward learning (MBr), captures the effect of both reward and transition on future choices (

**Figure 1Ai**). The key feature is that a rare transition, relative to a common one, reverses the effect of a reward or omission on MB evaluation. This is because of the symmetry of the transition matrix: the reward received after a rare transition is at an outcome port which is most often reached by the other choice. Thus, rewards following common transitions and omissions following rare transitions will positively weight the chosen action, and omissions following a common transition and a reward following a rare transition will negatively influence the chosen action. The second agent, model-based choice learning (MBc), captures a history-dependent effect of the sampled outcome port on choices (

**Figure 1Aii**. Put simply, it captures a “common-stay, rare-switch” pattern of choices during the two-step task, leading to

*outcome perseveration*or (if negative)

*alternation*. That is, like MFc, this agent tends to repeat or switch choices regardless of reward; however, it does so by choosing or avoiding the action commonly associated with the obtained outcome port rather than the action itself. This means that (if negative) it promotes switching choices so as to sample previously unsampled outcomes, like a MB (transition-sensitive) form of directed exploration. This comes from repeating choices after “rare” transitions in attempt to sample the “common” outcome, and switching choices after “common” transitions in attempt to sample the opposite “common” outcome. These agents, then, can capture

*MB exploitation*from a positive influence of MBr and

*MB exploration*from a negative influence of MBc.

The third agent, Model-free reward learning (MFr) integrates choice and reward information by positively influencing choices that lead to reward and negatively influencing choices that lead to omissions (**Figure 1Aiii**), ignoring the transition (outcome state) that was experienced. The forth agent, Model-free choice learning (MFc), uses only choice information and captures a history-dependent tendency to repeat or avoid choices on subsequent trials, i.e., *choice perseveration* or *alternation* (**Figure 1Aiv**). In other words, depending on the sign of the agent’s mixture coefficient, each choice will favor or disfavor the same choice on future trials regardless of the reward received. Notably, a negative coefficient will drive a tendency to choose the action not previously sampled, akin to directed exploration in action space. Put together, these agents can capture *MF exploitation* from a positive influence of MFr and *MF exploration* from a negative influence of MFc.

The RL agents used here differ from those previously used to model the rat two-step task (** Miller et al. (2017**, 2022)). The main elaboration is the inclusion of MBc to capture history-dependent outcome perseveration/MB exploration. This generalizes and clarifies the role of the “novelty preference” (NP) agent used in the earlier studies, which similarly captured “common-stay, rare-switch” behavior (albeit in the opposite direction, i.e. “rare-stay/common-switch”), but lacked dependence on more than the preceding trial. Although it significantly contributed to behavior previously, its role was not thoroughly explored. The updated learning rules used here not only provide a better fit to behavior than previous versions of the MoA, (Figure 5 – figure supplement 1), but also the role of the MBc agent is further clarified especially, as we’ll see later on, when its dynamics become apparent. For further detail on specific value update rules for the new and original agents used for comparison, refer to the methods section.

## MoA-HMM framework

Before assessing how an MoA-HMM can describe behavior, it’s important to first understand how an static MoA model can generate decisions. Imagine a set of agents, *A* ∈ *{agents}*, and a set of choices, *y* ∈ *{choices}*, where each agent has its own value for each choice *Q*_{A}(*y*) that is updated at the end of every trial. Using the term “actor” to refer to overall action selection, a mixture-of-agents model dictates that the probability of a selecting choice *y*_{t} on some trial *t* arises from a softmax over net valuations for each action, computed as the weighted sum of each agent’s values

In the context of the two-step task (and similar bandit-style reward learning tasks more generally), an actor takes some choice *y* that leads to some outcome state *o* and reward *r*. For demonstration, we assume an actor’s behavior is described by an MoA with two agents, a model-free choice learning rule (MFc) and a model-based reward learning rule (MBr), *A* ∈ *{MBr, MF c}* (**Figure 2A**). This might reflect competition between goal-directed and habitual strategies (** Miller et al., 2019**). On the first trial, each agent’s values are initialized (e.g., uniform across choices),

**Q**

_{1}= [

*Q*

_{MBr},

*Q*

_{MF c}] (i)A net value is computed as a weighted sum of

**Q**

_{1}given static agent weights

**= [**

*β**β*

_{MBr},

*β*

_{MF c}] to produce a probability distribution over choices

*p*(

*y*

_{1}) (

**eq. 1**) from which the actor’s choice is drawn (ii)After a choice is made

**(iii)**, some outcome

*o*

_{1}and reward

*r*

_{1}is observed by the actor

**(iv)**, and all observed variables may be used to update each agent’s choice values

**(v)**. These values are fed into the next trial, and the process repeats until the final trial.

Next consider an HMM in isolation. An HMM produces a stochastic trajectory of hidden states that are each associated with different probabilistic patterns over observable data. Thus, it produces discrete changes in observation statistics, reflecting the underlying, temporal transition dynamics of hidden states. When paired with an MoA, these hidden states determine the set of agent weights being used to combine agent strategies at each time step (the *β*s in Equation 1), and thereby imply dynamics in how these sets of weights trade off over time. We assume that the MoA-HMM uses the same choice values and learning rules as before, but the weighting over agent values is now dependent on hidden state *z, β*_{A}(*z*). Thus, at time point *t* given hidden state *z*_{t}, equation 1 updates to

**Figure 2B** demonstrates how an MoA-HMM with three hidden states affects how choices are generated from two RL agents in an MoA model. Notably, in the case of three hidden states, we now have three sets of weights *β*^{z} = [*β*^{1}, *β*^{2}, *β*^{3}], where the set is selected according to the active hidden state. The active hidden state at any given time is, in turn, influenced by two distributions governing the HMM dynamics: 1) the initial state distribution, *p*(*z*_{1}), determines the state at the beginning of a sequence **(i)**; and 2) the conditional probability of a subsequent state given the current state *p*(*z*_{t}|*z*_{t−1}), i.e. a state transition matrix **(ii)**, iteratively determines each subsequent state in the sequence. For more details on the fitting procedure, see Methods.

How might an MoA-HMM improve our understanding of RL behavior during a reward-learning task? As stated previously, RL behavior is often modeled using a weighted mixture of model-free and model-based RL rules. This simple mixture model would fail fully to capture a situation in which one learning rule dominates during some blocks of trials and the other dominates during other blocks of trials. For example, consider an actor during the two-step task whose behavior can be described by three states, one that is more goal-directed in which MBr dominates with a slower learning rate, a second that is more habitual in which MFc dominates with a faster learning rate, and a third producing a heavy side bias (**Figure 3i,ii**). Assume this actor also has a non-uniform initial-state probability (**iii**) and asymmetric transition dynamics depicted in (**iv**). Accordingly, over the course of a session, the net values used to select choices would be primarily dominated by one of the agents dependent on which hidden state is active (**Figure 3B**), allowing for variability, clustered in time, in how these values contribute to choice selection. Fitting a single-state MoA model to data generated from this actor would describe behavior as a mix of MBr, MFc, and bias strategies (**Figure 3Ci**) and might even fail to recover the MBr and MFc learning rates (**ii**). This single-state MoA would be unable to capture the dynamic influence of each of the agents with the underlying hidden state (**Figure 3D**. A 3-state MoA-HMM, however, is able to recover the original agent weights and learning rates (**Figure 3E**) as well as approximate the underlying hidden state and dynamic influence of agents given hidden state (**Figure 3F**). Beyond this didactic example, 3-state MoA-HMMs can be well-recovered over a wide range of parameters even when including all four RL agents (Figure 3 – figure supplements 1 and 2). To test if it can reveal switching RL strategies in real data, we apply the MoA-HMM to rats behaving in two-step task.

## Dynamic reinforcement learning behavior is better captured by the MoA-HMM

We return to using a MoA model with all four RL agents alongside a “bias” agent. As stated previously, the rat two-step task specifically isolates model-based planning in rats. Alongside the choice plot in Figure 1C, this can be further demonstrated by fitting a single-state MoA (to each rat’s data separately) and looking at the difference in normalized likelihood (cross validated across sessions) when removing one of the agents (**Figure 4B**, gray). The normalized likelihood, computed as *exp*(*L*/*N*) where *L* is the cross-validated log likelihood of the model and *N* is the total number of trials, represents the proportion of the model likelihood attributed to the selected choice, where 50% is the model performing at chance level. We see the largest reduction in normalized likelihood when removing MBr (median of -3.10% across rats, p<1E-5 from a non-parametric signed rank test (SR)) and a non-significant change when removing MFr (median of -0.03%, p=0.08 SR), consistent with results from ** Miller et al. (2017**). We also see significant decrease in likelihood when removing MBc (median of -0.34%, p<1E-4 SR), MFc (median of -1.23%, p<1E-5 SR), and Bias (median of -0.30%, p<1E-5 SR) agents.

What this model does not reveal, however, is any underlying dynamics to the dominant strategy. By adding additional hidden states, we observe a significant increase in normalized, cross-validated likelihood, suggesting that additional states capture dynamics in the behavior unaccounted for by a single-state MoA. (**Figure 4C**). At two hidden states (orange), we see an average jump of 0.77% (median, p<1E-5 SR). The fit is still better with three hidden states (green), seeing an median of 1.15% jump in quality vs the one-state model, which is also significantly greater than with two states (median increase of 0.33%, p<1E-4 SR). Significant improvements can be seen until four states (purple, median of 1.37% vs one state; median increase of 0.18% vs three states, p<1E-4 SR), after which we see non-significant gains from five or more states relative to the predecessor model. This suggests that four hidden states optimally describe behavior. However, our chief goal in what follows is to explore a descriptive model. Examination revealed that four states often highlight idiosyncrasies between animals (Figure 5 – figure supplement 3), making it more difficult to draw general conclusions. Therefore, in the remaining analyses we restrict our model to three states to facilitate interpretability and comparison across animals.

Using a three-state model, we find that the state dynamics absorb some of the variance when either the MBr or MFc are removed (**Figure 4B**, green, MBr: -2.03%, p<1E-4 SR; MFc: -0.35%, p<0.001 SR). Intriguingly, the removal of MBc and Bias both show on average a greater decrease in model fit likelihood (MBc: -0.62%, p<1E-4 SR; Bias: -0.52%, p<1E-4 SR) as well as MFr showing a slightly greater, now significant decrease in likelihood (-0.06%, p=0.015 SR), possibly suggesting a greater role of these agents when dynamics are included.

## Time-varying hidden states reveal new dynamics of behavior

Fitting a three-state MoA-HMM to each rat reveals interesting properties not seen in a single-state MoA. **Figure 5A** shows an example fit on one of the animals, where we summarize the fit by displaying the agent weights split by hidden state (**i**), the initial state probability (**ii**), and the state transition probability matrix (**iii**). Additionally, we show the average likelihood of each state across sessions (**iv**), which displays the dynamics of when each state is inferred to be active within a session, averaged over sessions. From this example, we can identify some prominent features that are common to many of the rats. Since the ordering of states in the model is arbitrary, we used some of these features to identify a canonical ordering for the states across rats to allow group-level comparison.

First, the initial state probability **(ii)** reveals that a single state dominates at the beginning of the session (blue diamond), which we’ve used to label the first state (blue). Comparing the agent weights (**i**) during this initial state to the remaining states, we notice that one weight in particular, MBc, is more negative than in the remaining two states. Recall that a negative weight for MBc captures a preference to visit outcome states not recently visited, which we identified with MB exploration. That this arises most strongly at the beginning of the session is consistent with this interpretation, since the reward location is initially unknown. After the initial reward location is identified, exploratory behavior is less beneficial; with only two outcome states, there is only one alternate outcome once the other becomes less rewarding. This is corroborated by the decreasing magnitude/increasing value of MBc in the later two states.

Since we want to identify changes in planning behavior, we selected the second state (orange) based on having the highest MBr weight (orange diamond) among the remaining states. In our example rat, this state also has a higher MBr weight than the initial state, and it is most likely to follow the initial state in the transition matrix (**iii,iv**). Together with the reduced weight on MBc, this suggests a shift toward increased MB exploitation after an initial exploratory phase.

The remaining state is labeled green, and by definition, has a lower MBr weight than state two (green diamond). Additionally, this shows the most positive weight on MBc as well as the most positive weight on MFc, suggesting a stronger influence of perseveration (in both actions and outcomes) during this state. This state tends to occur towards the end of sessions. It is possible that later in the session, the rat is sated on water and is less motivated by reward-seeking. This is accompanied by increase in less cognitively-demanding strategies such as perseveration, a pattern captured by the reduced magnitude on MBr and more positive weights on MBc and MFc.

Many of these properties hold across our population of rats (**Figure 5B**), coloring the states according to the criteria described above. One state (**ii**) tends to occur with high probability initially with, on average, the most negative weight on MBc (**i**, median of -0.74 in state 1; vs 0.04 in state 2, p<0.001 SR; vs 0.32 in state 3, p<0.001 SR across rats). The second state, again selected by having the larger MBr weight of the remaining two, follows the initial state on average (**iii**, state 1 to 2 transition median: 0.037; vs state 1 to 3: 0.001, p=1E-4 SR). Alongside the increase in MBc, the second state shows a significant increase in MBr from the first state (state 1 median: 0.78; vs state 2: 1.47, p<1E-5 SR). The remaining state, defined by a decrease in MBr, is also on average most active at the end of the session (**iv**). It sees the most-positive MBc (median of 0.32; vs 0.04 in state 2, p=0.03 SR) and MFc (median of 0.35; vs 0.15 in state 1, p<1E-4 SR), although MFc is not significantly greater than state 2 (median of 0.25, p=0.35 SR). Together with the lower MBr and higher MBc, however, the relative increase in MFc from the beginning of the session is consistent with the idea of reduced engagement and motivation from reward.

Contrary to our initial expectation, we did not see any significant changes in MFr between states across our population of rats (state 1 median: 0.10; state 2 median: 0.06; state 3 median: 0.07). This further supports that rats are not substantially employing a MFr-like strategy during this version of the task, even during smaller subsets of trials. Thus, these data reveal no evidence of a tradeoff between MBr and MFr strategies in this setting, though they do suggest one between alternation/perseveration strategies in action vs outcome space (i.e., MBc vs. MFc).

To verify that these state dynamics are not an artifact of the particular agents as we have selected, we compared these results to states fit by a GLM-HMM using the same more generic, model-agnostic trial-history regressors as in Figure 1 (Figure 5 – figure supplement 2). Strikingly, the hidden state dynamics (i.e. initial state and transition matrix) are extraordinarily similar, and the expected state likelihoods from both models are highly correlated. The hidden states themselves reveal the same effects by reward and transition across both models; however, the MoA-HMM provides a more streamlined description of this behavior with far fewer parameters and more interpretable learning rules.

## Strategy shifts significantly predict changes in behavioral and neural metrics

### Changes in strategy correlate to changes in response time

So far we have suggested an interpretation that might explain the progression of states we see across rats: an initial exploration state, a middle exploitation state, and an ending state with reduced engagement. To investigate this interpretation, we next looked to how the states relate, out of sample, to other measurements. If the relative influence of learning rules is changing between states, signifying a shift in strategy, we reasoned that the response times may also change as different strategies may require different amounts of computation. For example, it is often thought that more model-based strategies elicit slower response times than model-free strategies, as model-based strategies are more computationally intensive.

Our response time of interest is when the animal is most likely to engage in the learning rule updates we’re modeling; during the rat two-step task, this likely occurs during the inter-trial interval (ITI) period following outcome receipt and prior to initiating the next trial (** Miller et al., 2022**) (

**Figure6i**). To understand whether latent state is related to ITI duration, we must also account for other variables that may also affect this response time. Thus, we account for the presence of reward due to time spent drinking. Additionally, since the progression of states is correlated with time within the session, we wish to ensure that the latent state per se is informative beyond any effects on response time that can simply be explained by the passage of time. To investigate this, we first transformed ITI duration to ITI rate (

**Figure 6ii**) to constrain long ITIs and reveal a more linear relationship between ITI duration and time. We then fit a linear multilevel model to each rat with state identity, reward, and linear and nonlinear effects of time via a third degree polynomial (

*time*+

*time*

^{2}+

*time*

^{3}, together referred to as “time”) as predictors, as well as interactions between reward and both state and time, while capturing variance in all predictors due to session-level random effects.

**Figure 6B** shows an example regression predicted on one of the sessions. Strikingly discrete shifts in response times are visible, well aligned to the inferred state (**Figure 6Bi** top panel) and captured by the predictive component due to state identity (**Figure 6Bi** bottom panel). After subtracting out drifts in ITI rate due to time, we find significant differences in ITI rate split by inferred state (**Figure 6Bii**) in this example session, where state 2 sees a significantly higher rate (or shorter ITI duration) than state 3 following both rewarded and omission trials. This effect holds across rats, where fitting a multilevel model to each rat reveals a significant increase in ITI rate following omission (0.026 ITIs/s, p<0.001 SR) and reward (0.006 ITIs/s, p=0.008 SR), where the effect after omission is significantly greater (p<0.001 SR). Since state 2 has the highest MBr weight, this may seem to contradict the expectation that MB strategies have slower response times than MF strategies. However, these states to not reflect a trade-off between MB and MF strategies; the fastest response time here is associated with the exploitative state, while slower response times are associated with the exploration and reduced engagement states. To formally disentangle the separable contributions of time and state, we use the coefficient of partial determination (CPD, or partial R-squared) to measure the explanatory power of each predictor. We find that while time contributes more to the model (median CPD=12%, p<1E-5 SR, perhaps not surprisingly because it is, conservatively, modeled with a very flexible third-order form), the state identity still significantly explains an additional 2% of variance on average (median, p<1E-5 SR) (**Figure 6C**).

Hidden states capture additional variance of neural encoding in orbitofronal cortex As a second test that the states inferred from the fit of the model to choice behavior reflect meaningful computations, we investigated whether their dynamics correlate with changes in neural encoding. As a high-level test of this possibility, we apply the MoA-HMM to four animals that performed the two-step task while simultaneously having activity recorded in orbitofrontal cortex (OFC), data first published by ** Miller et al. (2022**). Using a single-state MoA model to estimate action and outcome values,

**. found that OFC neurons primarily encoded the value of the most recently experienced outcome (or expected outcome value), used for learning about action values on subsequent trials, during the ITI. Additionally, significant modulation to reward was observed in OFC neurons. Both reward and expected outcome value are relevant for learning between trials; therefore, our next analyses will focus on whether state modulates the encoding of these variables during the ITI.**

*Miller et al***Figure 7A** looks first at OFC modulation to expected outcome value. For more direct comparison to results from ** Miller et al**., and since MBr is the largest contributor to behavior, expected outcome value is defined here as the value of the experienced outcome port predicted solely by the MBr agent prior to value updating following reward receipt. To look at a single unit’s response to expected outcome value, we binned trials by outcome value into terciles of outcome value.

**Figure 7A**shows an example unit especially responsive to outcome value. Splitting the average firing rate by the inferred latent state reveals the highest modulation to outcome value during the second state (middle panel). To test the prevalence of this effect across all units, we measured the population CPDs of relevant behavioral variables in a Poisson regression model predicting the spike counts of each neuron computed separately for trials associated with each state. Specifically, our model included main effects of choice, outcome, transition (or interaction between choice and outcome), reward, next choice, expected outcome value, and next chosen value (see Methods for additional detail). Looking at the CPD of expected outcome value split by state (

**Figure 7B**) reveals that the trend from the example neuron is consistent across the population of OFC units, where state 2 shows the greatest CPD. This is consistent with the idea that the second state generally has the highest weight on, and therefore the strongest influence by, the MBr agent.

**Figure 7C** looks next at OFC modulation to reward or omission. A second example unit (**i**) shows differentiable firing response to these events. Splitting response by state reveals that the contrast in firing rate between reward and omissions is highest during the first state (blue) and nearly absent during the third state (green). Similarly across the population using the same regression model as above, we see the greatest reward CPD during state 1 (**Figure 7D**). This progression may reflect a change in reward modulation coincident with motivation and thirst/satiety. The animal will be thirstiest at the beginning of a session, and therefore most motivated by reward; the modulation of neural encoding by reward, then, follows accordingly.

Suggestively, the modulation of the reward effect can also be seen between states 2 and 3 – state 2, on average, sees a higher modulation to reward that lasts significantly longer than modulation in state 3. Although state 2 generally occurred in the middle of the session for the behavioral rats, states 2 and 3 trade off more competitively in the model fits of the rats with physiological recordings (Figure 7 – figure supplement 2), making it less likely that the difference between these two states is simply due to further passage of time. Indeed, the example unit in Figure 7Bii is a session that ends in state 2, coinciding with an uptick in response to reward. This suggests that reward response magnitude may be further modulated by how model-based or how engaged the animal is in the task, as captured by the choice model and dissociable from the simple passage of time.

Nevertheless, we wished to formally investigate the that state effects captured in Figure 7A-B are not simply due to the passage of time. To dissociate significant effects of these correlated factors, time and state, we similarly constructed a Poisson regression model containing main effects of behavioral variables – choice, outcome, transition, reward, next choice, expected outcome value, and next chosen value – alongside state and time. Notably, the model also included interaction terms between each behavioral variable with state and time, as well as additional interactions with reward, as a more thorough representation of possible effects being encoded by the population. Each variable’s CPD, then, was computed by removing the main effect and all interaction terms containing that variable (see Methods for a full specification of factors).

**Figure 7 – figure supplement 1** highlights the average CPD of state (red) and time (blue) across all units. Similar to the CPDs computed for ITI rate, time explains a greater proportion of variance than state. However, we still observe a significant CPD for state, suggesting that state identity explains additional variance not captured by (even rather elaborate nonlinear efects of) time alone. Hidden states detected by the MoA-HMM, therefore, seem to provide useful boundaries to epoch electrophysiological data, opening a door for further analysis and contrast between identified behavioral states.

Different regions and subcircuits of the brain are thought to subserve different decision strategies — ways to learn about or evaluate candidate actions. Thus, deciding what to do may involve weighting information from multiple such systems, where the relative contribution of certain learning rules can vary based on internal motivations and/or external environmental factors. Dynamics of this sort are often thought to explain phenomena such as the formation of habits with overtraining and a shift between exploratory and exploitative choice modes. However, MoA models traditionally used descriptively to measure the influence of separable learning rules, especially in RL, are unable actually to capture such dynamics, but instead characterize all choices as arising from an IID mixture of agents. In addition to failing to capture key phenomena like habit formation endogenously in the models, this limitation may mask interesting, lesser-used strategies and limit interpretation of underlying neural correlates. Conversely, another family of models (the GLM-HMM) that attempts to account for time-dependent changes in behavior model only observable environmental factors, and therefore cannot capture changes in terms of more abstract learning rules thought to be used by the brain. Here we introduce the MoA-HMM, combining mixture modeling of abstract learning rules with temporal dynamics that can capture discrete changes in these mixtures throughout behavior. We apply this model to rats performing a multi-step, reward-learning task and examine the contribution of various reinforcement learning rules. We successfully capture shifts in strategy corresponding to differing contributions of the included RL rules, reflecting a within-session progression between exploration, exploitation, and reduced engagement in the task. Additionally, these shifts in strategies significantly predict variables not used to fit the model; e.g., response time is fastest during the exploitation state, even after accounting for drifts due to time. These shifts also capture changes in neural encoding, notably showing the highest response to model-based value during exploitation.

The two-step task has been historically used to measure the trade-off between MB and MF reward learning strategies as a proxy for the trade-off between goal-directed and habitual behavior. The rat version of the task used here, however, has not shown any significant involvement of MF reward learning in choice behavior. Accordingly, studies using this version of the task have primarily focused on its application to studying goal-directed behavior via MB planning (** Miller et al., 2017, 2022**). The dynamic shifts in strategy uncovered here, however, suggest that behavior still may reflect a trade-off between goal-directed (outcome-state focused) and habitual (choice-focused) strategies, just not via the traditional MB/MF distinction. Instead, if we identify habitual behavior by choice-level perseveration as proposed by

**), then our results show emergence of habitual control toward the end of sessions with a rise in MFc. Interestingly, the rise of MFc isn’t coupled with an opposing decay of MBr (i.e., goal-directed exploitation). Instead, we see increasing MFc accompanied by increasing (i.e., less negative) MBc, which likely reflects a decline in goal-directed exploration. This pattern further suggests that the goals vs. habits trade-off is not simply mapped to MBr/MFr as traditionally modeled. Instead, goal-directed contributions to behavior further subdivide into distinct MB exploratory and exploitative behaviors.**

*Miller et al. (2019*The nature of the MB/MF trade-off could be further explored by applying the MoA-HMM to other versions of the two-step task that do demonstrate significant contributions of of MBr and MFr agents in the single-state MoA model (** Daw et al., 2011**;

**;**

*Gillan et al., 2016***;**

*Hasz and Redish, 2018***;**

*Groman et al., 2019b***). This may uncover new properties of strategy switching when both MBr and MFr learning rules explain significant variance, especially in humans. One practical issue, and opportunity for future research, is that human datasets contain many fewer choices per subject than the rat dataset studied here. Thus to address these data, it would likely be necessary to develop a fully hierarchical MoA-HMM to pool data effectively over subjects.**

*Akam et al., 2021*Patterns of switching, in these and other cases, may also shed light on controversy about the interpretation of behaviors explained by mixtures of MBr and MFr agents in this task (** Akam et al., 2015**;

**;**

*Russek et al., 2017***), e.g., whether they really reflect two separable contributions vs. some third hybrid, or an altogether different switching strategy. Indeed, it has been suggested that seemingly MB rat behavior during various rodent versions of the two-step task could instead reflect MF belief updates that switch between two latent states, one corresponding to each rewarded side. Such switching could be behaviorally indistinguishable from MBr in single-state MoAs for reasons similar to the example we show in Figure 3C (**

*Feher da Silva and Hare, 2020***). The MoA-HMM might instead be expected to capture this pattern via two strategies with opposing biases that track reward block flips. However, this is not what we observe in the animals. Even when aligning states 2 and 3 by signs in the bias, state dynamics do not follow reward block shifts (Figure 5 – figure supplement 1). While this does not necessarily rule out the interpretation of MF behavior put forward by**

*Akam et al., 2015***), the emergent role of MBc describing changes in MB exploration throughout sessions provides additional evidence that the animals are MB in the sense of using transition information to inform decisions.**

*Akam et al. (2015*The emergence of putative exploratory and exploitative states presents a new opportunity for using this task to study the explore-exploit trade-off. First, the predominance of the MBc agent at the start of each session strongly suggests an interpretation for the function of the previously described “novelty preference” agent in directed exploration. In this sense, MBc (with a negative mixture coefficient) is analogous to including an “information bonus” for directed exploration, similar to models used to describe exploratory behavior previously (** Wilson et al., 2014, 2021**). Such a bonus, in turn, captures the qualitative features of more elaborate, exact decision-theoretic analyses of of the value of exploration (

**;**

*Gittins, 1979***;**

*Dayan and Daw, 2008***;**

*Averbeck, 2015***;**

*Costa et al., 2019***). In a different way, our results echo another recent study (**

*Hogeveen et al., 2022***) which argued that primate behavior and neural responses were well-described by a (more purpose-built) HMM that switched between discrete exploratory and exploitative choice regimes.**

*Ebitz et al., 2018*Importantly, unlike single-step “bandit” tasks used in these and other studies of exploration, the two-step task is sequential. This means that effective exploration requires not simply choosing actions about which one is directly uncertain (as with MFc in the present task) but instead choosing actions that *lead to future states* (here, outcome ports) whose rewards are uncertain; this is precisely what distinguishes MBc as model-based. In this sense, the two-step task allows for — and our result of negative MBc but no negative MFc shows to our knowledge the first evidence in animal or human behavior for — what is known in RL as “deep exploration” (** Wilson et al., 2020**), i.e., behavior targeted at reaching future states for the purposes of exploring them.

Following the GLM-HMM, HMMs have been utilized recently in reinforcement learning tasks to estimate strategy switching (** Le et al., 2023**;

**). However, these approaches are notably different than the MoA-HMM. Specifically, the blockHMM designed by**

*Li et al., 2024***. captures differences in choice-switching behavior during a reward reversal task, and is therefore limited in application against tasks that do not contain reward reversals. Furthermore, the MoA-HMM uses different cognitive models to directly infer strategy switching, where**

*Le et al***. decodes underlying algorithms post hoc to states identified by the blockHMM, where decoding is limited to mixtures of only two algorithms (model-free and inference-based learning) relevant to reward reversal tasks. Conversely, dynamic noise estimation by**

*Le et al***. extends more broadly to various tasks and cognitive models. However, this method only assumes two behavioral states, engaged or disengaged, in order to improve modeling of the “engaged” state by separating out noise from the “disengaged” state. On the other hand, the MoA-HMM can extend to an arbitrary number of states and does not assume the existence of a “disengaged” state. Furthermore, instead of attributing strategy switching to “noisy” deviations from a single cognitive model, the MoA-HMM can instead identify competing cognitive models that may underlie strategy switches.**

*Li et al*As a descriptive model, the MoA-HMM assumes discrete rather than continuous strategy shifts. It is possible that our results reflect a discrete approximation to a more continuous process. Our goal is to provide a quantitative description of significant changes to behavior, which may still offer a useful window into continuously changing strategies. A way to more explicitly assess discrete vs. continuous dynamics is to compare the fit of the MoA-HMM to that of a behavioral model with continuously drifting latent states, such as Psytrack (** Roy et al., 2021**;

**), which could be adapted (much like GLM-HMM was extended to MoA-HMM) to infer continuous shifts in learning rule mixture weights.**

*Ashwood et al., 2020*The appearance of switching dynamics might also arise if there is a single underlying agent whose continuous dynamics (e.g. algebraic value learning rules) do not match the MoA agents. In this case, HMM states might capture residual variance due to this mismatch. In this case also, the current model would still be useful descriptively (e.g., to help design better underlying agents that endogenize the dynamics described by the HMM). Anecdotally, we were not able to capture switching in the present dataset by exploring agent variants, and at least some of the dynamics we observe (e.g. motivational shifts) seem unlikely to reflect learning per se. Nevertheless, an approach to address this possibility in future work would be to pair the MoA-HMM with more flexible agent models that seek to infer the continuous learning rules being used (** Eckstein and Collins, 2021**;

**;**

*Miller et al., 2023***;**

*Ji-An et al., 2023***).**

*Correa et al., 2024*Finally, there are many promising applications of the MoA-HMM that have yet to be explored. For example, to test whether separable circuits support different strategies, temporally-specific perturbations targeting these circuits could test for isolated effects on the agent or agents describing said strategy. Additionally, finding a significant relationship between MoA-HMM states with behavioral response times and neural encoding alone cannot causally link the observed effects with inferred states. However, it does suggest that the inclusion of these additional metrics may provide a richer description of changing strategy that modeling choice alone cannot capture. Indeed, models combining both behavioral and neural data have found success in capturing behavioral and neural patterns that could not be captures by one modality alone (** Shahar et al., 2019**;

**;**

*De- Pasquale et al., 2022***). Furthermore, while the MoA-HMM has been applied here in the context of RL, the framework can be applied more generally to arbitrary learning rules and tasks, and would apply particular naturally to rule-switching or task-switching settings.**

*Luo et al., 2023*## Subjects and Behavior

The subjects used in this study were drawn from two published data sets. For behavioral analyses, we used 20 rats first published by ** Miller et al. (2017**); for analyses paired with electrophysiological recordings, we used 4 rats first published by

**). All subjects from both sources underwent the same training procedure. Briefly, rats were water deprived and trained daily on the two-step task to gain their daily allotment of water (Figure 1A). Behavioral training took place in custom behavioral chambers (Figure 1A, right inset), and each behavioral session lasted 2-3 hours during which the animal completed hundreds of trials. Before a rat would train specifically in the two-step task, they would undergo multiple training phases to 1) acclimate them to receiving water reward from successfully poking into the back-lit outcome port, 2) train them on the transition structure of the task via fully-instructed trials (which is fixed within rat and counterbalanced across rats), and 3) introduce free choice trials and reward reversals, where reversals only occurred after a performance threshold was reached. Within this stage, the reward reversal probabilities were walked down from 100%/0% to the final 80%/20% probability set. Once the rat was consistently triggering multiple reversals per session, rats were advanced to the full version of the task where reward reversals could trigger regardless of the performance of the animal. Notably, some rats still have a small percentage of forced-choice trials during the full task (up to 20%) to reduce strong side biases. For further details on the training pipeline, see**

*Miller et al. (2022***).**

*Miller et al. (2017*## Mixture-of-Agents Model

The mixture-of-agents model used here was taken and extended from ** Miller et al. (2017**). This model describes behavior as arising from a mixture of multiple agents,

*A*∈

*{*agents

*}*, each with their own method to calculate an action value

*Q*

_{A}(

*y*) for each choice

*y*. The MoA uses a weighted sum of each agents’ values to calculate choice likelihood for each

*y*according to the softmax function,

where *β*_{A} is the weighting parameter (or inverse temperature) on agent *A*. Importantly, forced-choice trials are *included* in updated action values *Q*_{A}(*y*), since the animal can still observe rewards during these trials; however, forced-choice trials are *excluded* when calculating total choice likelihood to infer the weighting parameters (fitting procedure outlined in Mixture-of-Agents Hidden Markov Model section).

## Model Agents

We use five agents in the present manuscript. Four of these agents are separable reinforcement learning rules to capture model-based and model-free influences of of reward and choice, derived from learning rules introduced by ** Ito and Doya** (

**).**

*2009*## Model-Based Reward Learning

The model-based reward (MBr) agent captures planning as the interaction of reward and transition on choice; however, the task transition probabilities, which are assumed to be known by the animals, have been simplified to definite transitions, such that common transition choices are updated directly from reward experienced at the second step. After each trial *t*, the action values for each *y, Q*_{MBr}(*y*), are updated according to

where *α*_{MBr} is a decay rate/learning rate on the influence of previous action value, constrained between [0, 1], and *r*_{t} is an integer representing the reward experienced on trial *t*, taking a value of [1, −1] for rewards and omissions, respectively. With the transition requirement on *y*, this learning rule will apply the reward value to chosen action (*y* = *y*_{t}) and decay the non-chosen action (*y* ≠ *y*_{t}) following common transitions, or apply the reward value to the non-chosen action (*y* ≠ *y*_{t}) and decay the chosen action (*y* = *y*_{t}) following rare transitions. Notably, *α*_{MBr} is not applied as a learning rate on reward *r*_{t} to reduce interaction with the weighting parameter *β*_{MBr} during model fitting, which is similarly applied in the following RL agents.

## Model-Based Choice Learning

Model-based choice (MBc, also model-based perseveration or outcome perseveration) captures the weighted influence of past transitions on choice, describing a “common-stay, rare-switch” strategy. After each trial *t*, action values *Q*_{MBc} are updated as

where *α*_{MBc} is a decay rate, similarly constrained between [0, 1], which is unique from the decay rate on the previous agent, MBr. With a similar transition requirement on choice, this agent will always add 1 to chosen actions (*y* = *y*_{t}) following common transitions or non-chosen actions (*y* ≠ *y*_{t}) following rare transitions, regardless of reward value. This amounts to reinforcing experienced outcomes by reinforcing the common transition to that outcome.

## Model-Free Reward Learning

Model-free reward learning (MFr) is equivalent to a temporal difference learning rule, specifically TD(1), where the choice values are updated directly from experienced reward at outcome time. Namely, action values *Q*_{MBr} are updated according to

where *α*_{MFr} is a decay rate constrained between [0, 1], distinct from both model-based decay rates. With transition no longer a contributing factor, this agent will always apply the experienced reward value *r*_{t} to the chosen action (*y* = *y*_{t}) and always decay the non-chosen action (*y* ≠ *y*_{t}) .

## Model-Free Choice Learning

The model-free choice agent (MFc, also model-free perseveration or choice perseveration) captures the animal’s tendency to repeat choices on subsequent trials. Specifically, action values *Q*_{MFc} are updated as

where *α*_{MFr} is a decay rate, also constrained between [0, 1], unique from the decay rate on MFr. This agent will always reinforce chosen actions (*y* = *y*_{t}) regardless of outcome or reward.

## Bias

The bias (or intercept) agent captures the animal’s overall tendency towards left or right choices, which has a set value *Q*_{bias} of

that stays constant across trials.

Miller et al. 2017 Agents (move to supplement?)

The agents used in the original MoA model used for the rat two step task are described in ** Miller et al. (2017**). In figure 4–figure supplement 1, we show that our extended learning rules show an improvement in model likelihood over these original agents. We include a brief description of these four agents below, and additionally a fifth model-free agent used in the original manuscript (but not in the reduced model) that we included for fairness in model comparison.

## Model-Based Temporal Difference Learning

Computationally similar to MBr, this agent is equated to a planning strategy, which uses knowledge of the transition structure between actions and outcomes to update action values. Specifically, the value of action *y, Q*_{MB}(*y*), is updated as the weighted sum:

where *R*_{MB}(*o*) is the value of the outcome *o*, and *T* (*y, o*) is the transition probability from action *y* to outcome *o*, which is assumed to be known. The value of outcome *o* is updated using a TD learning rule as follows:

*r*_{t} is a constant indicating reward observed on trial *t* which takes the value [−1, 1] for omission and rewarded trials, respectively, and *α*_{MB} is the learning rate which takes a value between 0 and 1. Reward value is used to learn about the experienced outcome *o* = *o*_{t}, and the non-experienced outcome *o* ≠ *o*_{t} is decayed.

## Novelty Preference

Using the same terminology and definition as ** Miller et al. (2017**), this agent captures the tendency to stay following a rare transition or switch following an common transition. The chosen action value

*Q*

_{NP}(

*y*

_{t}) and non-chosen action value

*Q*

_{NP}(

*y*≠

*y*

_{t}) are updated as:

This agent is similar to MBc, except that it is defined in the opposite direction (i.e. “rare-stay/common-switch”) and has no learning rate to capture trial history effects.

## Model-free Temporal Difference Learning

This agent is nearly identical to MFr, where action values are updates as:

Unlike MFr, however, the learning rate *α*_{MF} is applied to reward value *r*_{t} for consistency in definition with the model-based TD learning agent.

## Perseveration

The perseveration agent captures the animal’s tendency to repeat choices on subsequent trials. its value is update according to

This agent is similar to MFc, except that is has no learning rate to capture trial-history effects.

## Mixture-of-Agents Hidden Markov Model

The introduction of a hidden Markov model allows for the weighted influence of each agent to change discretely over time. Specifically, it assumes that behavior is described by multiple hidden states, each with their own set of agent weights *β*^{z}, of which one set will win out on any given trial to drive choices. Our MoA choice likelihood is updated, then, to be dependent on latent state such that:

where the weighted influence of each agent is additionally dependent on latent state. We assume that all latent states use a shared value *Q*_{A}(*y*) and learning rates *α*_{MBr}, *α*_{MBc}, *α*_{MFr}, and *α*_{MFc}.

Two additional parameters are introduced from the HMM to describe the temporal dynamics of the hidden states: the initial state probability ** π** and the hidden state transition matrix

**P**, making the full set of model parameters

*θ*=

*{*

**,**

*β**α*

_{MBr},

*α*

_{MBc},

*α*

_{MFr},

*α*

_{MFc},

*π*, P*}*where

**is the full (nAgents x nStates) weight matrix. Details on the inference procedure for each of these parameters follow.**

*β*Initial state probability ** π** is typically defined as the hidden state probability distribution on the first trial,

*p*(

*z*

_{1}). However, with multiple sessions with only a single session a day, we treat the first trial within a session as an “initial” trial. Initial state probability

**is estimated, then, from all trials that initiate a session. As stated previously, forced trials are excluded from choice likelihood estimation. For sessions that start with a forced trial, the first free choice trial is considered the initial trial of that session. We can break up the set of all trials**

*π**T*into subsets of trials belonging to each session

*s*∈

*S*, for all sessions

*S*, such that , where

*i*= 1 is the first trial and

*i*=

*Ts*is the last trial in session

*s*; therefore, the subset of all initial trials is denoted . Elements of

**are defined as the normalized average probability**

*π*The transition matrix is estimated from all trials excluding those that initiate a session, denoted as . Elements of **P** are

We fit the MoA-HMM using expectation maximization (EM), which finds the posterior distribution of the hidden state variables, *p*(*z* | *D, θ*^{old}), in the expectation (E) step to evaluate and optimize the expected data log-likelihood, *𝓁 𝓁* (*θ, θ*^{old}) = ^{∑}*z p*(*z* | *D, θ*^{old}) ln *p*(*D, z* | *θ*), in the maximization (M) step. Implementation closely follows ** Bishop** (

**), which we briefly describe to highlight differences.**

*2006*We introduce the following notation: *γ*_{tj} = *p*(*z*_{t} = *j* | *D, θ*^{old}), the marginal posterior probability of hidden state *z*_{t} = *j* at trial *t*; and *ξ*_{tij} = *p*(*z*_{t−1} = *i, z*_{t} = *j* | *D, θ*^{old}), the joint posterior distribution of two successive latent variables, *z*_{t−1} = *i* and *z*_{t} = *j* at trial *t*. With these definitions, our expected data log-likelihood becomes:

where *T* is the total number of trials and *Z* the total number of hidden states. The E step will evaluate *γ* and *ξ* using existing parameter values *θ*; Optimization of *θ* during the M step is then done by treating *γ* and *ξ* as constants.

## E Step

*γ* and *ξ* are estimated using the forward-backward algorithm as described by ** Bishop** (

**). Briefly, the forward pass is calculated recursively within sessions from**

*2006**t*= 2 to

*t*=

*T*

_{s}as:

Where *c*_{t} is a normalization factor such that , which also gives the marginal choice likelihood . Notably in our case, a is initialized at the beginning of every session as:

The backward pass is calculated backward-recursively within sessions from *t* = *T*_{s} − 1 to *t* = 1 as:

Where *c*_{t} is the same normalization factor calculated from the forward pass. Similarly, *b* is “initialized” at the end of each session as *γ* is calculated as:

Importantly, the value of *γ*(*z*_{t}) gives us the expected likelihood of state *z* on trial *t*. Finally, *ξ* is calculated as:

## M Step

Since ** π** and

**P**do not depend on hidden state

*z*, they can be maximized using Lagrange multipliers:

A notable difference is the calculation of ** π** includes all initial session trials as defined before, while

**P**excludes these trials.

Dependence on agent parameters and agent weights *θ*_{A} = *{*** β**,

*α*

_{MBr},

*α*

_{MBc},

*α*

_{MFr},

*α*

_{MFc}

*}*are limited to a single term in our data log-likelihood:

the negative of which can be optimized using gradient descent. Here we use the L-BFGS algorithm implemented by julia’s Optim package, making use of automatic differentiation to supply the gradient.

### Fitting Procedure

Parameter values were randomly initialized using the following distributions: the agent weights ** β** from a normal distribution

*N*(0, 0.1); transition matrix rows from a Dirichlet distribution, where shaping parameters were taken as the corresponding row from a

*Z*x

*Z*matrix of 1s with 5s along the diagonal; the initial state probability from a Dirichlet distribution with all shaping parameters = 1; the learning rates from Beta distribution with

*α*=

*β*= 5. In addition, a loose prior on

**, denoted**

*β**p*(

**), was included to penalize large magnitudes; explicitly, the prior was a normal distribution with mean 0 and standard deviation 10. Including a parameter prior updates the data log-likelihood used in the M Step to**

*β*where *p*(*θ*) = *p*(** β**) = Normal(0, 10) with only a prior on

**. EM was terminated if the difference in the maximum a posteriori (MAP) likelihood,**

*β**p*(

*θ*|

*Y*) ∝

*p*(

*Y*)

*p*(

*θ*), fell below 1 × 10

^{−5}, where

*p*(

*Y*) is the marginal choice likelihood computed from the forward pass during the E step, and

*p*(

*θ*) is the parameter prior, which at this point

*p*(

*θ*) =

*p*(

**) = Normal(0, 10). Alternatively, if the difference in all parameter estimates on subsequent iterations fell below 1 × 10**

*β*^{−5}, or if the number of iterations exceeded 300, EM would be terminated.

### Estimation of Population Prior

The best model was selected from 20 fits with random initializations for each rat using only a loose prior on the agent weights. The best fits were sorted according to their initial state probability for the first state, and by *β*_{MBr} magnitude for the remaining two states, and used to estimate prior distributions on each of the model parameters. The model was refit using the updated prior:

The prior on *β* was a Normal distribution for each state, with mean and standard deviation for each agent *A* and state *z* approximated from the sorted population parameters.

The prior on ** π** and each row of

**P**are Dirichlet distributions, where the vector elements of

*α*

_{π}and are estimated from population parameters using the method of moments:

where *j* can be any element of row *P*_{z}. To ensure all shaping parameters were at least 1, 1 was added to each index of *α* if any index was less than 1, effectively smoothing the prior. To account for the prior, closed form updates to ** π** and

**P**were updated as:

The prior on the learning rates was taken as a Beta distribution, with shaping parameters and estimated for each RL agent also using the method of moments:

Similarly as before, if either or were estimated as a value less than 1, 1 was added to both and . To reduce the number of iterations and for further bias towards the prior, making up for the fact that the hierarchical estimates of the prior aren’t being inferred alongside each model, the model was refit for each animal with a single initialization at the population mean.

## Data simulation and parameter recovery

To test the model’s ability to recover inferred parameter sets, we generated two-step behavioral data from a given MoA-HMM. Given a number of trials evenly split between a number of sessions, reward probabilities were first generated for left and right outcomes using task statistics: on the first trial, a either left or right was randomly chosen and assigned an 80% reward probability, and the opposite side assigned a 20% reward probability. A trial flip counter, initialized to 0 at the start of every session, was used to track the number of trials following a reward probability flip. After the trial flip counter reached 10 trials, a random number was drawn on each trial to determine whether the reward probabilities would flip sides, with flips occurring at a 5% probability. The trial flip counter was reset to 0 following a successful flip, and reward probabilities were generated in this manner throughout the set of simulated trials and sessions.

After generating reward probabilities, choices were generated from probability distribution given by equation 4 using a given MoA-HMM parameter set. Similarly to model inference, choice values were initialized uniformly at the start of every session. After a choice was drawn, a common or rare transition was randomly chosen based on an 80%/20% transition probability to determine the outcome side, and reward was drawn from the predetermined reward probabilities for the given outcome. The selected choice, outcome, and reward are used to update choice values, and this process is repeated for the specified number of trials and sessions.

The MoA-HMM was then fit to the simulated data set using the model fitting procedure outline above, finding the best model fit out of at least 5 random initializations. In order to match recovered HMM states to the simulated HMM states, we computed the expected state likelihood *γ* (equation 5) for each state across the simulated data set for both the simulated model, *γ*_{sim}, and the recovered model, *γ*_{fit}, and we paired states based on the maximum correlation between *γ*_{sim} and *γ*_{fit}. If multiple states in *γ*_{fit} were maximally correlated to the same state in *γ*_{sim}, the state with the highest correlation won out, and the losing state was paired with its next highest correlated state. This process was repeated until all states have unique pairs.

Parameter recovery was performed separately on two sets of models: 1) models with randomly drawn parameter sets, and 2) models inferred from the rats’ behavior. For the first set, model parameters were randomly drawn from user-defined distributions. Specifically, agent weights *β* for all agents were drawn from a Normal distribution with *μ*_{β} = 0 and *σ*_{β} = 2. Learning rates *α* for relevant agents were drawn from a Beta distribution with shaping parameters *α*_{α} = 5 and *β*_{α} = 5. Initial state probabilities π were drawn from a Dirichlet distribution with shaping parameter *α*_{π} = [1, 1, 1], and each row of the state transition matrix *P* was also drawn from a Dirichlet distribution using the corresponding row of a shaping parameter matrix with 1 on all off-diagonals and 10 along the diagonal. 200 models were initialized under this framework, and data for each model was generated for 5000 trials evenly split between 20 sessions. Each model was recovered with no priors on any of the parameters except for a loose prior on the agent weights, Normal(0, 10).

Due to the interaction between different model parameters (e.g. a small *β* weight will affect the recoverability of the agent’s learning rate *α*), a number of “failures” can be seen in recovery of the first model group (Figure 3 – figure supplement 1). Instead of finding restrictions on the randomization of model parameters to eliminate failures, we performed parameter recovery on a second group of models defined by those fit to real data (post-prior) to ensure confidence in our ability to recover these models specifically (Figure 3 – figure supplement 2). Data generated by these models were set to match the trial and session statistics of the rat’s behavioral data from which the model was inferred as a further check on the recoverability from the size of the data set. 5 independent data sets were generated from each of the 20 inferred models, giving a total of 100 recovered sets. Similar to the fitting procedure on each rat, the prior distributions computed from the pre-prior model fits were used when fitting the model to each simulated data set.

## Cross-validation and model comparisons

Cross-validation was used for all procedures comparing models with different parameter sets. Either 3- or 6-fold cross validation was used, where entire sessions were held out together to preserve within-session temporal structure. Sessions for each fold were selected in stride, such that sessions were evenly spaced throughout the data set (i.e. every third session belonged to the same set in 3-fold CV). The cross-validated likelihood, then, was computed by fitting the model to all data not in the held-out set, and then running the forward pass of the E-step to compute the marginal choice log-likelihood on the held-out set, ln *p*(*Y*_{set}). After being computed for each fold, each log-likelihood was added and normalized by the number of trials, and exponentiated back into likelihood space, giving the model-fit likelihood:

## GLM-HMM parameter description

The GLM used in the GLM-HMM version of the model used here was taken from ** Miller et al. (2017**). It consists of trial-history predictors that indicate what trial type was experienced at multiple delays. Specifically, there is a common-reward predictor that takes a value of 1 if the trial was a left-choice rewarded trial with a common transition, −1 for a right-choice rewarded trial with a common transition, and 0 otherwise. This pattern follows for the remaining three trial types: common-omission, rare-reward, and rare-omission. Each of these four predictors are measured at variable delays, i.e. one trial back up to five trials back; including a bias term, this gives a total of 21 predictors. This model was fit similarly to the MoA-HMM, with each predictor treated as an “agent” with no agent hyperparameters to infer.

## Multilevel linear regression of response times

Multilevel linear regression was used to understand the influence of reward, state, and time on ITI duration. ITI duration was first pseudo-linearized by transforming it into a rate, or iti_rate = . The relationship between these variables can be summarized by the following formula:

“Time” consists of three terms, time = *t* + *t*^{2} + *t*^{3}, where *t* is the initiation time of the trial, minmaxed normalized across the session such that 0 corresponds to the beginning of the session and 1 corresponds to the end of the session. Time is described by a third-degree polynomial in this way to capture the possibility of asymmetric, non-linear effects of time throughout the session.

“State” consists of two terms, state = *z*_{1} + *z*_{2}, where *z*_{1} and *z*_{2} are non-overlapping boolean vectors. *z*_{1} = 1 indicates state 1 is the most-likely inferred state, and *z*_{2} = 1 indicates state 2 is the most-likely inferred state. A third boolean vector for state 3 is implicit with the intercept and excluded. In other words, “state” is a one-hot matrix corresponding to hidden state, where the column corresponding to the third state is dropped.

“Rew” is a boolean vector, where 1 corresponds to trials following reward and 0 corresponds to trials following omissions. The interaction between each time and state term with reward was also included in the model

The regression was fit individually for each rat, with random effects of session on each of the included terms to account for variability across sessions. Regressions were fit using the MixedModels package in julia.

## Coefficient of partial determination

The coefficient of partial determination (CPD) measures the additional variance explained by a heldout regressor by accounting for variance absorbed by correlated regressors. The CPD is defined as the proportional increase in the residual sum of squares (RSS) between the difference between the held-out model (RSS_{−x}) and the full model (RSS_{full}).

The CPD can take a negative value if the RSS of the held-out model is less than the RSS of the full model.

## Non-parametric statistics

Since the distribution of behavioral and model metrics were non-normal and calculated at the level of rats, we performed non-parametric statistics to determine significant differences between groups. Specifically, we used a Wilcoxon signed-rank test on matched samples. These tests were performed using the SignedRankTest function in the HypothesisTests package in julia.

## Electrophysiological recordings and PSTH computation

Electrophysiological data was recorded from OFC in four rats, described by ** Miller et al. (2022**). Briefly, 32-channel microwire arrays (Tucker-David Technologies) were surgically implanted to unilaterally target OFC (coordinates: 3.1-4.2 mm anterior, 2.4-4.2 mm lateral relative to bregma; lowered 5.2 mm from bregma/ 4.2 mm below the brain surface). After recovery, recording sessions were acquired in a behavioral chamber outfitted with a Neuralynx recording system. Spiking data were detected from 600-600Hz bandpass filtered voltage traces using a threshold of 30

*μ*V and manually clustered.

Peristimulus time histograms (PSTHs) were computed for examples units to show changes in firing rate at outcome time due to differing task variables relative to each inferred state. Firing rates on each trial were computed in 250*ms* bins spanning 1 second before and 3 seconds after the time of outcome port entry, then smoothed using a Gaussian filter with kernel standard deviation of 1. In both cases, trials were initially split into three groups corresponding to the inferred state, measured as the state with the highest likelihood given by equation 5. For the first example, trials were further split by outcome value inferred by the MBr agent. The outcome value inferred by the MBr agent corresponds to the value described by equation 3 relative to the experienced outcome *before* the reward update has occurred. Specifically, on common trials, the outcome value of trial *t* is equivalent to the chosen value of trial *t, Q*_{MBr}(*y* = *y*_{t}), before the reward update; on rare trials, the outcome value of trial *t* is equivalent to the non-chosen value on trial *t, Q*_{MBr}(*y* ≠ *y*_{t}), before the reward update. Trials were split into three groups defined by low (−2 to −0.5), medium (−0.5 to 0.5), and high (0.5 to 2) outcome values, giving 9 total trial groups in the first example. For the second example, trials were further split into two groups corresponding to rewarded and omission trials, giving a total of 6 trial groups. PSTHs were computed as the mean firing rate within each trial group, and error bars computed as 95% confidence intervals around the mean.

## Poisson regression of neural activity and population CPD

Two Poisson regression models were used to predict spiking activity and calculate the CPD of constituent predictors. A simpler model was fit when calculating the CPD separately for each state only including main effects of each predictor and a single interaction between choice and outcome for transition, which can by summarized by the formula:

Spike count was measured in 250*ms* time bins centered around task events: −2*s* before to 2*s* after second step initiation (16 bins), −2*s* before to 4*s* after the outcome port entry (24 bins), −4*s* before to 2*s* after the next trial initiation (24 bins), and −2*s* before to 2*s* after the next trial’s choice port entry. This regression was fit separately for each unit in each time bin, and the population CPD for each time bin *i* was computed as the aggregated CPD across all units *u* within that bin:

For this first model, to ensure that effects from the included predictors are not due to trial-by-trial co-fluctuations in neural activity, trial identities were circularly permuted and the CPD recomputed for each permuted session. A predictor’s CPD was considered significant in a particular time bin if the CPD of true session was greater than 95% of permuted session CPDs. The average CPD of all permuted sessions was subtracted from the true CPD to give a meaningful comparison to 0.

The second model, used to measure the separable effects of state and time, adds complexity by including not only main effects of time and state, but also interactios of time and state with every other predictor. Additional interactions of reward with outcome and next choice were also included.

In the same way as the above model used to predict behavioral response times, the “time” predictor consists of three terms, time = *t* + *t*^{3} + *t*^{3}, and the “state” predictor consists of two terms, state = *z*_{1} + *z*_{2}. Therefore, all interactions listed above are repeated for each term in the “time” and “state” predictors. In order to calculate the CPD of each predictor, the main effect and all interaction terms were held out when fitting the reduced model. Explicitly for each predictor,

Since the effects of time and state *are* dependent on trial-by-trial co-fluctuations, permuting the trial identities does not provide a meaningful control; the inclusion of these interactions, therefore, is intended to bolster the description of possible time and state effects. Instead of reporting the aggregated population CPD of state and time, the median and 95% confidence intervals were computed across all units to measure significant variance explained by time and state. All Poisson regression models were fit using Scikit-Learn in python.

The MoA-HMM and figure code is available at https://github.com/Brody-Lab/MixtureAgentsModels

We would like to thank Iris Stone and Jonathan Pillow for helpful discussions about GLM-HMMs, and Kim Stachenfeld for feedback on the manuscript. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-1656466, a Research Innovation Award from Princeton Neuroscience Institute, and NIMH R01MH121093.