Mesolimbic dopamine adapts the rate of learning from action

Animals

All protocols and animal handling procedures were carried out in strict accordance with a protocol (no. 19–190) approved by the Janelia Institutional Animal Care and Use Committee and in compliance with the standards set forth by the Association for Assessment and Accreditation of Laboratory Animal Care.

For behaviour and juxtacellular recordings, we used 24 adult male DAT-Cre::ai32 mice (3–9 months old) resulting from the cross of DAT-IRES-Cre (The Jackson Laboratory stock 006660) and Ai32 (The Jackson Laboratory stock 012569) lines of mice, such that a ChR2–eYFP fusion protein was expressed under control of the endogenous dopamine transporter Slc6a3 locus to specifically label dopaminergic neurons. Mice were maintained under specific-pathogen-free conditions. Mice were housed on a free-standing, individually ventilated (about 60 air changes hourly) rack (Allentown Inc.). The holding room was ventilated with 100% outside filtered air with >15 air changes hourly. Each ventilated cage (Allentown) was provided with corncob bedding (Shepard Specialty Papers), at least 8 g of nesting material (Bed-r’Nest, The Andersons) and a red mouse tunnel (Bio-Serv). Mice were maintained on a 12:12-h (8 am–8 pm) light/dark cycle and recordings were made between 9 am and 3 pm. The holding room temperature was maintained at 21 ± 1 °C with a relative humidity of 30% to 70%. Irradiated rodent laboratory chow (LabDiet 5053) was provided ad libitum. Following at least 4 days recovery from headcap implantation surgery, animals’ water consumption was restricted to 1.2 ml per day for at least 3 days before training. Mice underwent daily health checks, and water restriction was eased if mice fell below 75% of their original body weight.

Behavioural training

Mice were habituated to head fixation in a separate area from the recording rig in multiple sessions of increasing length over ≥3 days. During this time they received some manual water administration through a syringe. Mice were then habituated to head fixation while resting in a spring-suspended basket in the recording rig for at least two sessions of 30+ min each before training commenced. No liquid rewards were administered during this recording rig acclimation; thus, trial 1 in the data represents the first time naive mice received the liquid water reward in the training environment. The reward consisted of 3 μl of water sweetened with the non-caloric sweetener saccharin delivered through a lick port under control of a solenoid. A 0.5-s, 10-kHz tone preceded reward delivery by 1.5 s on ‘cued’ trials, and 10% of randomly selected rewards were ‘uncued’. Matching our previous training schedule9, after three sessions, mice also experienced ‘omission’ probe trials, in which the cue was delivered but not followed by reward, on 10% of randomly selected trials. Intertrial intervals were chosen from a randomly permuted exponential distribution with a mean of about 25 s. Ambient room noise was 50–55 dB, and an audible click of about 53 dB accompanied solenoid opening on water delivery and the predictive tone was about 65 dB loud. Mice experienced 100 trials per session and one session per day for 8–10 days. In previous pilot experiments, it was observed that at similar intertrial intervals, behavioural responses to cues and rewards began to decrease in some mice at 150–200 trials. Thus, the 100 trials per session limit was chosen to ensure homogeneity in motivated engagement across the dataset.

Some animals received optogenetic stimulation of VTA–DA neurons concurrent with reward delivery, contingent on their behaviour during the delay period (see technical details below). Mice were randomly assigned to a stimulation group (control, stimLick−, stimLick+) before training. The experimenter was not blinded to group identity during data collection. Following trace conditioning with or without exogenous dopamine stimulation, five mice experienced an extra session during which VTA–DA neurons were optogenetically stimulated concurrently with cue presentation (Extended Data Fig. 4). Mice were then randomly assigned to groups for a new experiment in which a light cue predicted VTA–DA stimulation with no concurrent liquid water reward (5–7 days, 150–200 trials per day). The light cue consisted of a 500-ms flash of a blue light-emitting diode (LED) directed at the wall in front of head fixation. Intertrial intervals were chosen from randomly permuted exponential distributions with a mean of about 13 s. Supplementary Table 1 lists the experimental groups each mouse was assigned to in the order in which experiments were experienced.

Video and behavioural measurement

Face video was captured at 100 Hz continuously across each session with a single camera (Flea 3, FLIR) positioned level with the point of head fixation, at an approximately 30º angle from horizontal, and compressed and streamed to disk with custom code written by J. Keller (available at https://github.com/neurojak/pySpinCapture). Dim visible light was maintained in the rig so that pupils were not overly dilated, and an infrared LED (model#) trained at the face provided illumination for video capture. Video was post-processed with custom MATLAB code available on request.

Briefly, for each session, a rectangular region of interest (ROI) for each measurement was defined from the mean of 500 randomly drawn frames. Pupil diameter was estimated as the mean of the major and minor axis of the object detected with the MATLAB regionprops function, following noise removal by thresholding the image to separate light and dark pixels, then applying a circular averaging filter and then dilating and eroding the image. This noise removal process accounted for frames distorted by passage of whiskers in front of the eye, and slight differences in face illumination between mice. For each session, appropriateness of fit was verified by overlaying the estimated pupil on the actual image for about 20–50 randomly drawn frames. A single variable, the dark/light pixel thresholding value, could be changed to ensure optimal fitting for each session. Nose motion was extracted as the mean of pixel displacement in the ROI y axis estimated using an image registration algorithm (MATLAB imregdemons). Whisker pad motion was estimated as the absolute difference in the whisker pad ROI between frames (MATLAB imabsdiff; this was sufficiently accurate to define whisking periods, and required much less computing time than imregdemons). Whisking was determined as the crossing of pad motion above a threshold, and whisking bouts were made continuous by convolving pad motion with a smoothing kernel. Licks were timestamped as the moment pixel intensity in the ROI in between the face and the lick port crossed a threshold.
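
A minimal MATLAB sketch of the pupil-fitting step described above; eyeROI, the threshold value and the filter/structuring-element sizes are illustrative assumptions rather than the exact values used.

```matlab
% Sketch of the pupil-diameter estimate: threshold, circular averaging,
% dilate/erode, then fit the largest remaining dark object.
darkThresh = 60;                                  % assumed dark/light threshold (a.u.)
bw = eyeROI < darkThresh;                         % keep dark (pupil) pixels
bw = imfilter(double(bw), fspecial('disk', 3));   % circular averaging filter
bw = imerode(imdilate(bw > 0.5, strel('disk', 2)), strel('disk', 2));  % dilate then erode
stats = regionprops(bw, 'MajorAxisLength', 'MinorAxisLength', 'Area');
[~, idx] = max([stats.Area]);                     % largest dark object taken as the pupil
pupilDiam = mean([stats(idx).MajorAxisLength, stats(idx).MinorAxisLength]);
```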

Body movement was summarized as basket movements recorded by a triple-axis accelerometer (Adafruit, ADXL335) attached to the underside of a custom-designed three-dimensionally printed basket suspended from springs (Century Spring Corp, ZZ3-36). Relative basket position was tracked by low-pass filtering accelerometer data at 2.5 Hz. Stimulations and cue deliveries were coordinated with custom-written software using Arduino Mega hardware (https://www.arduino.cc). All measurement and control signals were synchronously recorded and digitized (at 1 kHz for behavioural data, 10 kHz for fibre photometry data) with a Cerebus Signal Processor (Blackrock Microsystems). Data were analysed using MATLAB software (Mathworks).

Preparatory and reactive measures and abstract learning trajectories

To describe the relationship between behavioural adaptations and reward collection performance, for each mouse in the control group a GLM was created to predict reward collection latency from preparatory and reactive predictor variables on each trial. Preparatory changes in licking, whisking, body movement and pupil diameter were quantified by measuring the average of each of those signals during the 1-s delay period preceding cued rewards. The nose motion signal was not included as it did not display consistent preparatory changes. Reactive responses in the whisking, nose motion and body movement were measured as the latency to the first response following reward delivery. For whisking, this was simply the first moment of whisking following reward delivery. For nose motion, the raw signal was convolved with a smoothing kernel and then the first response was detected as a threshold crossing of the cumulative sum of the signal. For body movement, the response was detected as the first peak in the data following reward delivery. On occasional trials no event was detected within the analysis window. Additionally, discrete blocks of trials were lost owing to data collection error for mouse 3, session 7; mouse 4, session 5; and mouse 9, session 4. To fit learning curves through these absent data points, missing trials were filled in using nearest-neighbour interpolation.

Trial-by-trial reward collection latencies and predictor variables (preparatory licking, whisking, body movement and pupil diameter; and reactive nose motions, whisking and body movement) were median filtered (MATLAB medfilt1(signal,10)) to minimize trial-to-trial variance in favour of variance due to learning across training. Collection latency was predicted from z-scored predictor variables using MATLAB glmfit to fit β-values for each predictor. The unique explained variance of each predictor was calculated as the difference in explained variance between the full model and a partial model in which β-values were fitted without using that predictor.
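
A minimal MATLAB sketch of the GLM fit and the unique-explained-variance computation described above; X_raw, latency and the 'normal' distribution choice are assumed names/settings.

```matlab
% X_raw: trials x predictors (preparatory and reactive measures), latency: trials x 1.
X = zscore(medfilt1(X_raw, 10));                 % median filter then z-score predictors
y = medfilt1(latency, 10);                       % median-filtered collection latency
b = glmfit(X, y, 'normal');                      % full model
yhat = glmval(b, X, 'identity');
R2full = 1 - sum((y - yhat).^2) / sum((y - mean(y)).^2);

uniqueEV = zeros(1, size(X, 2));
for p = 1:size(X, 2)
    Xp = X(:, setdiff(1:size(X, 2), p));         % partial model without predictor p
    bp = glmfit(Xp, y, 'normal');
    yp = glmval(bp, Xp, 'identity');
    R2p = 1 - sum((y - yp).^2) / sum((y - mean(y)).^2);
    uniqueEV(p) = R2full - R2p;                  % unique explained variance of predictor p
end
```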

Preparatory and reactive predictor variables were used to define abstract learning trajectories that were plots of collection latency against the inferred reactive and preparatory variables for each of the first 800 cue–reward trials of training. Reactive and preparatory variables were calculated as the first principal component of the individual reactive and preparatory variables used in the GLM fits. For visualization, we fitted a parametric model to all three variables (single exponential for preparatory, double exponentials for reactive and latency using the MATLAB fit function). Quality of fits and choice of model were verified by visual inspection of all data for all mice. An individual mouse’s trajectory was then visualized by plotting downsampled versions of the fit functions for latency, reactive and preparatory. Arrowheads were placed at logarithmically spaced trials.

To quantify the total amount of preparatory behaviour in each mouse at a given point in training (final prep. behav., Extended Data Fig. 3f), each preparatory measure (pupil, licking, whisking and body movement) was z-scored and combined across mice into a single data matrix. The first principal component of this matrix was calculated and loading onto PC1 was defined as a measure of an inferred underlying ‘preparatory’ component of the behavioural policy. This created an equally weighted, variance-normalized combination of all preparatory measures to allow comparisons between individual mice. An analogous method was used to reduce the dimensionality of reactive variables down to a single ‘reactive’ dimension that captures most variance in reactive behavioural variables across animals (final reactive behav., Extended Data Fig. 3g). Initial NAc–DA signals were predicted from trained behaviour at trials 700–800 by multiple regression (specifically, pseudoinverse of the data matrix of reactive and preparatory variables at the end of training multiplied by data matrix of physiological signals for all animals).
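
A minimal MATLAB sketch of the dimensionality reduction and the pseudoinverse regression described above; prepMat, B and D are assumed data matrices (mice × measures, mice × [reactive, preparatory] and mice × dopamine signals, respectively).

```matlab
% Collapse the preparatory measures onto a single dimension (PC1 loading).
[~, score] = pca(zscore(prepMat));
prepPC1 = score(:, 1);               % inferred 'preparatory' component per mouse

% Multiple regression via the pseudoinverse, as described above.
W = pinv(B) * D;                     % B: end-of-training behaviour, D: initial DA signals
Dpred = B * W;                       % predicted initial NAc-DA from trained behaviour
```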

Combined fibre photometry and optogenetic stimulation

In the course of a single surgery session, DAT-Cre::ai32 mice received: bilateral injections of AAV2/1-CAG-FLEX-jRCaMP1b in the VTA (150 nl at the coordinates −3.1 mm anterior–posterior (A–P), 1.3 mm medial–lateral (M–L) from bregma, at depths of 4.6 and 4.3 mm) and in the substantia nigra pars compacta (100 nl at the coordinates −3.2 mm A–P, 0.5 mm M–L, depth of 4.1 mm); custom 0.39-NA, 200-μm fibre cannulas implanted bilaterally above the VTA (−3.2 mm A–P, 0.5 mm M–L, depth of −4.1 mm); and fibre cannulas implanted unilaterally in the DS (0.9 mm A–P, 1.5 mm M–L, depth of 2.5 mm) and NAc (1.2 mm A–P, 0.85 mm M–L, depth of 4.3 mm). Hemisphere choice was counterbalanced across individuals. A detailed description of the methods has been published previously56.

Imaging began >20 days post-injections using custom-built fibre photometry systems (Fig. 2a)56. Two parallel excitation–emission channels through a five-port filter cube (FMC5, Doric Lenses) allowed for simultaneous measurement of RCaMP1b and eYFP fluorescence, the latter channel having the purpose of controlling for the presence of movement artefacts. Fibre-coupled LEDs of 470 nm and 565 nm (M470F3, M565F3, Thorlabs) were connected to excitation ports with acceptance bandwidths of 465–490 nm and 555–570 nm, respectively, with 200-μm, 0.22-NA fibres (Doric Lenses). Light was conveyed between the sample port of the cube and the animal by a 200-μm-core, 0.39-NA fibre (Doric Lenses) terminating in a ceramic ferrule that was connected to the implanted fibre cannula by a ceramic mating sleeve (ADAL1, Thorlabs) using index matching gel to improve coupling efficiency (G608N3, Thorlabs). Light collected from the sample fibre was measured at separate output ports (emission bandwidths 500–540 nm and 600–680 nm) by 600-μm-core, 0.48-NA fibres (Doric Lenses) connected to silicon photoreceivers (2151, Newport).

A time-division multiplexing strategy was used in which LEDs were controlled at a frequency of 100 Hz (1 ms on, 10 ms off), offset from each other to avoid crosstalk between channels. A Y-cable split each LED output between the filter cube and a photodetector to measure output power. LED output power was 50–80 μW. This low power combined with the 10% duty cycle used for multiplexing prevented local ChR2 excitation56 by 473 nm eYFP excitation. Excitation-specific signals were recovered in post-processing by only keeping data from each channel when its LED output power was high. Data were downsampled to 100 Hz, and then band-pass filtered between 0.01 and 40 Hz with a second-order Butterworth filter. Although movement artefacts were negligible when mice were head-fixed in the rig (the movable basket was designed to minimize brain movement with respect to the skull9), according to standard procedure the least-squares fit of the eYFP movement artefact signal was subtracted from the jRCaMP1b signal. dF/F was calculated by dividing the raw signal by a baseline defined as the polynomial trend (MATLAB detrend) across the entire session. This preserved local slow signal changes while correcting for photobleaching. Comparisons between mice were carried out using the z-scored dF/F.
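
A minimal MATLAB sketch of the post-processing steps described above; rcamp and eyfp are assumed to be the demultiplexed, 100-Hz excitation-specific signals (column vectors), and the exact ordering of the filtering, artefact-subtraction and dF/F steps is an assumption.

```matlab
fs = 100;
[b, a] = butter(2, [0.01 40] / (fs / 2));        % 2nd-order band-pass, 0.01-40 Hz
rcampF = filtfilt(b, a, rcamp);
eyfpF  = filtfilt(b, a, eyfp);

% Subtract the least-squares fit of the eYFP (movement artefact) channel.
beta      = [eyfpF, ones(numel(eyfpF), 1)] \ rcampF;
corrected = rcampF - (beta(1) * eyfpF + beta(2));

% dF/F: divide by the slow session-wide trend (photobleaching baseline);
% the trend is the component that MATLAB detrend removes.
baseline = rcamp - detrend(rcamp);
dFF      = corrected ./ baseline;
zdFF     = zscore(dFF);                          % used for across-mouse comparisons
```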

Experimenters were blind to group identity during the initial stages of analysis when analysis windows were determined and custom code was established to quantify fibre photometry signals and behavioural measurements. Analysis windows were chosen to capture the extent of mean phasic activations following each kind of stimulus. For NAc–DA and VTA–DA, reward responses were quantified from 0 to 2 s after reward delivery and cue responses were quantified from 0 to 1 s after cue delivery. DS–DA exhibited much faster kinetics, and thus reward and cue responses were quantified from 0 to 0.75 s after delivery.

Somatic ChR2 excitation was carried out with a 473-nm laser (50 mW, OEM Laser Systems) coupled by a branching fibre patch cord (200 μm, Doric Lenses) to the VTA-implanted fibres using ceramic mating sleeves. Burst activations of 30 Hz (10 ms on, 23 ms off) were delivered with durations of either 150 ms for calibrated stimulation or 500 ms for large stimulations. For calibrated stimulation, laser power was set between 1 and 3 mW (steady-state output) to produce a NAc–DA transient of similar amplitude to the largest transients observed during the first several trials of the session. This was confirmed post hoc to have roughly doubled the size of reward-related NAc–DA transients (Figs. 3a and 5b). For large stimulations, steady-state laser output was set to 10 mW.

ACTR computational learning model

Behavioural plant

An important aspect of this modelling work was to create a generative agent model that would produce core aspects of reward-seeking behaviour in mice. To this end, we focused on licking, which in the context of this task is the unique aspect of behaviour critical for reward collection. A reader may look at the function dlRNN_Pcheck_transfer.m within the software repository to appreciate the structure of the plant model. We describe the function of the plant briefly here. It is well known that during consumptive, repetitive licking mice exhibit preparatory periods of about 7 Hz licking. We modelled a simple fixed rate plant with an active, ‘lick’ state that emitted observed licks at a fixed time interval of 150 ms. The onset of this lick pattern relative to entry into the lick state was started at a variable phase of the interval (average latency to lick initialization from transition into lick state about 100 ms). Stochastic transitions between ‘rest’ and ‘lick’ states were governed by forward and backward transition rates. The reverse transition rate was a constant that depended on the presence of reward (5 × 10⁻³ ms without reward, 5 × 10⁻¹ ms with reward). This change in the backwards rate captured the average duration of consumptive licking bouts. The forward rate was governed by the scaled policy network output and a background tendency to transition to licking as a function of trial time (analogous to an exponential rising hazard function; 𝜏 = 100 ms). The output unit of the policy network was the sum of the RNN output unit (constrained {−1,1} by the tanh activation function) and a large reactive transient proportional to the sensory weight ({0,max_scale}), in which max_scale was a free parameter generally bounded from 5 to 10 during initialization. This net output was scaled by S = 0.02 ms⁻¹ to convert to a scaled transition rate in the policy output. Behaviour of the plant for a range of policies is illustrated in Extended Data Fig. 2. A large range of parameterizations were explored with qualitatively similar results. Chosen parameters were arrived at by scanning many different simulations and matching average initial and final latencies for cue–reward pairings across the population of animals. More complicated versions (high-pass filtered, nonlinear scaling) of the transition from RNN output to transition rate can be explored in the provided code. However, all transformations were found to produce qualitatively similar results, and thus the simplest (scalar) transformation was chosen for reported simulations for clarity of presentation.
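
A minimal MATLAB sketch of the two-state plant described above (not the exact dlRNN_Pcheck_transfer.m implementation); the transition constants follow the text, while the hazard scaling, lick-phase handling and per-millisecond probability interpretation of the rates are simplifying assumptions.

```matlab
function licks = lickPlantSketch(policy, rewardOn)
% policy: 1 x T (ms) net output already scaled to a rate; rewardOn: logical 1 x T.
T = numel(policy);
kBackNoRew = 5e-3; kBackRew = 5e-1;    % backward (lick -> rest) constants from the text
inLick = false; nextLick = Inf;
licks = false(1, T);
for t = 2:T
    hazard = 1e-4 * (1 - exp(-t / 100));              % rising background tendency (assumed scale)
    if ~inLick
        if rand < max(policy(t), 0) + hazard          % forward transition to the lick state
            inLick = true;
            nextLick = t + randi([50 150]);           % variable phase, ~100 ms mean latency
        end
    else
        kBack = kBackNoRew + rewardOn(t) * (kBackRew - kBackNoRew);
        if rand < kBack                               % backward transition to rest
            inLick = false;
        elseif t >= nextLick
            licks(t) = true;                          % emit licks at a fixed 150-ms interval
            nextLick = t + 150;
        end
    end
end
end
```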

RNN

As noted in the main text, the RNN component of the model and the learning rules used for training drew inspiration from ref. 36, which itself drew on variants of node perturbation methods61 and the classic policy optimization methods known as REINFORCE rules3,21. Briefly, ref. 36 demonstrated that a relatively simple learning rule, which computed a nonlinear function of the correlation between a change in input and a change in output multiplied by the change in performance on the objective, was sufficiently correlated with the analytic gradient to allow efficient training of the RNN. We implemented a few changes relative to this prior work. Below we describe the learning rule as implemented here; a reader may also examine the commented open-source code for further clarification. First, we describe the structure of the RNN and some core aspects of its function in the context of the model. The RNN was constructed largely as described in ref. 36 and was very comparable to the structure of a re-implementation of that model in ref. 62.

Although we explored a range of parameters governing RNN construction, many examples of which are shown in Extended Data Fig. 2, the simulations shown in the main results come from a network with 50 units (Nu = 50; chosen for simulation efficiency; larger networks were explored extensively as well), densely connected (Pc = 0.9), spectral scaling to produce preparatory dynamics (g = 1.3), a characteristic time constant (𝝉 = 25 ms) and a standard tanh activation function for individual units. Initial internal weights of the network (Wij) were assigned according to the equation (in RNN-dudlab-master-LearnDA.m)

$$W_{ij}=g\times \mathcal{N}(0,1)\times {(P_{\mathrm{c}}\times N_{\mathrm{u}})}^{-1/2}$$

(1)
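
A minimal MATLAB sketch of the initialization in equation (1) with the parameter values quoted above; the explicit sparsity mask and the eigenvalue check are illustrative additions.

```matlab
Nu = 50; Pc = 0.9; g = 1.3;                        % units, connection probability, coupling
mask = rand(Nu) < Pc;                              % dense random connectivity
Wij  = g .* randn(Nu) .* mask ./ sqrt(Pc * Nu);    % equation (1)
lambdaMax = max(abs(eig(Wij)));                    % leading eigenvalue, which scales with g
```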

The RNN had a single primary output unit with activity that constituted the continuous time policy (that is, π(t)) input to the behaviour plant (see above), and a ‘feedback’ unit that did not project back into the network as would be standard, but rather was used to produce adaptive changes in the learning rate (described in more detail in the section below entitled Learning rules).

Objective function

Model performance was evaluated according to an objective function that defines the cost in terms of a performance cost (equation (2), costP) and an optional network stability cost (equation (3), costN) (for example, lines 269 and 387 in dlRNN-train_learnDA.m for equations (4) and (5), respectively)

$${\mathrm{cost}}_{\mathrm{P}}=1-{\mathrm{e}}^{-\Delta t/500}$$

(2)

$${\mathrm{cost}}_{\mathrm{N}}={\mathrm{sum}}(|\delta \pi (t)/\delta t|)$$

(3)

$$R_{\mathrm{obj}}=(1-{\mathrm{cost}}_{\mathrm{P}})-\pi (t_{\mathrm{reward}})$$

(4)

$$\langle R(T)\rangle =\alpha_{\mathrm{R}}\times R_{\mathrm{obj}}(T)+(1-\alpha_{\mathrm{R}})\times R_{\mathrm{obj}}(T-1)$$

(5)

in which T is the trial index. In all presented simulations, the weight on the network stability cost was WN = 0.25. A filtered average cost, ⟨R⟩, was computed as before36 with αR = 0.75 and used in the update equation for changing network weights through the learning rule described below. For all constants, a range of values was tried with qualitatively similar results. The performance objective was defined by costP, for which ∆t is the latency to collect the reward after it becomes available. The network stability cost (costN) penalizes the high-frequency oscillatory dynamics that can emerge in some (but not all) simulations; such oscillations are inconsistent with observed dynamics of neural activity so far.
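
For concreteness, a minimal MATLAB sketch of equations (2)–(5); dt (the collection latency in ms), piTrace (the within-trial policy output), piReward (the policy at the time of reward) and RobjPrev (the previous trial's objective) are assumed variable names.

```matlab
alphaR = 0.75;
costP = 1 - exp(-dt / 500);                        % equation (2), performance cost
costN = sum(abs(diff(piTrace)));                   % equation (3), network stability cost
Robj  = (1 - costP) - piReward;                    % equation (4), trial objective
Rbar  = alphaR * Robj + (1 - alphaR) * RobjPrev;   % equation (5), filtered average cost
```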

Identifying properties of RNN required for optimal performance

To examine what properties of the RNN were required for optimal performance, we scanned through thousands of simulated network configurations (random initializations of Wij) and ranked those networks according to their mean cost (Robj) when run through the behaviour plant for 50 trials (an illustrative group of such simulations is shown in Extended Data Fig. 2). This analysis revealed a few key aspects of the RNN required for optimality. First, a preparatory policy that spans time from the detection of the cue through the delivery of water reward minimizes latency cost. Second, although optimal RNNs are relatively indifferent to some parameters (for example, Pc), they tend to require a coupling coefficient (g) greater than about 1.2. This range of values for the coupling coefficient is known to determine the capacity of an RNN to develop preparatory dynamics63. Consistent with this interpretation, our findings showed that optimal policies were observed uniquely in RNNs with large leading eigenvalues (Extended Data Fig. 2; that is, long-time-constant dynamics64). These analyses define the optimal policy as one that requires preparatory dynamics of output unit activity that span the interval between the cue offset and reward delivery and further reveal that an RNN with long-timescale dynamics is required to realize such a policy. Intuitively, preparatory (anticipatory) behaviour, or ‘conditioned responding’, optimizes reward collection latency: if an agent is already licking when reward is delivered, the latency to collect that reward is minimized.

RNN initialization for simulations

All mice tested in our experiments began training with no preparatory licking to cues and a long latency (about 1 s or more) to collect water rewards. This indicates that animal behaviour is consistent with an RNN initialization that has a policy π(t) ≈ 0 for the entire trial. As noted above, there are many random initializations of the RNN that can produce clear preparatory behaviour and even optimal performance. Thus, we carried out large searches of RNN initializations (random matrices Wij) and used only those that had approximately 0 average activity in the output unit. We used a variety of different initializations across the simulations reported (Fig. 1 and Extended Data Fig. 2) and indeed there can be substantial differences in the observed rate of convergence depending on initial conditions (as there are across mice as well). For simulations of individual differences (Fig. 1j and Extended Data Fig. 2), distinct network initializations were chosen (as described above), and paired comparisons were made for the control initialization and an initialization in which the weights of the inputs from the reward to the internal RNN units were tripled.

Learning rules

Below we articulate how each aspect of the model acronym, ACTR (adaptive rate cost of performance to REINFORCE), is reflected in the learning rule that governs updates to the RNN. The connections between the variant of node perturbation used here and REINFORCE21 have been discussed in detail previously36. There are two key classes of weight changes governed by distinct learning rules within the ACTR model. First, we will discuss the learning that governs changes in the ‘internal’ weights of the RNN (Wij). The idea of the rule is to use perturbations (1–10 Hz rate of perturbations in each unit; simulations reported used 3 Hz) to drive fluctuations in activity and corresponding changes in the output unit that could improve or degrade performance. To solve the temporal credit assignment problem, we used eligibility traces similar to those described previously36. One difference here was that the eligibility trace decayed exponentially with a time constant of 500 ms; it is unclear whether such decay was a feature of the prior work. The eligibility trace (e) for a given connection i,j could be changed at any time point by computing a nonlinear function (ϕ) of the product of the change in input from the ith unit (xi(t) − ⟨xi⟩) and the output rate of the jth unit (rj) in the RNN, according to the equation (in dlRNN_engine.m)

$$e_{i,j}(t)=e_{i,j}(t-1)+\varphi [r_{j}(t-1)\times (x_{i}(t)-\langle x_{i}\rangle )]$$

(6)

As noted in ref. 36, the function ϕ need only be a signed, nonlinear function, and in our simulations we likewise found that a range of functions could be used. Typically, we used either ϕ(y) = y³ or ϕ(y) = |y| × y; presented simulations generally used the latter, which runs more rapidly.
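
A minimal MATLAB sketch of the eligibility-trace update in equation (6), including the exponential decay described above; x, xbar and r are assumed column vectors of unit inputs, their running means and unit rates, and e is the trace matrix.

```matlab
tauE = 500;                                        % ms, trace decay constant
phi  = @(y) abs(y) .* y;                           % signed nonlinearity used here
e    = exp(-1 / tauE) * e + phi((x - xbar) * r');  % e(i,j): (x_i - <x_i>) paired with r_j
```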

In the original formulation36, the change in a connection weight (Wij) in the RNN is computed as the product of the eligibility trace and the change in PE, scaled by a learning rate parameter. Our implementation kept this core aspect of the computation, but several critical updates were made, as described below. First, as the eligibility trace is believed to be ‘read out’ into a plastic change in the synapse by a phasic burst of dopamine firing58, we chose to evaluate the eligibility at the time of the computed burst of dopamine activity estimated from the activity of the parallel feedback unit (see below for further details). Models that do not use this convention can also converge, but in general they converge less well and resemble the observed mice less closely. The update equation is thus (for example, line 330 in dlRNN-train_learnDA.m)

$$W_{i,j}(T)=W_{i,j}(T-1)+\beta_{\mathrm{DA}}\times \eta_{\mathcal{S}}\times e_{i,j}(t_{\mathrm{DA}})\times (R_{\mathrm{obj}}(T)-\langle R(T)\rangle )$$

(7)

in which η𝒮 is the baseline learning rate parameter (generally in the range 5 × 10⁻⁴ to 1 × 10⁻³) and βDA is the ‘adaptive rate’ parameter: a nonlinear (sigmoid) function of the sum of the derivative of the policy at the time of reward, the magnitude of the reactive response component and a tonic activity component, T (T = 1 except where noted in Extended Data Fig. 2; ϕ is a sigmoid function mapping inputs from {0,10} to {0,3} with parameters σ = 1.25, μ = 7) (for example, line 259 in dlRNN-train_learnDA.m):

$$\beta_{\mathrm{DA}}=T+\varphi (\Delta \pi (t_{\mathrm{reward}})+\mathcal{S}_{i,\mathrm{reward}})$$

(8)
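
A minimal MATLAB sketch of equations (7) and (8); the explicit sigmoid form is an assumption based only on the stated parameters (σ = 1.25, μ = 7, output range {0,3}), and dPiReward, S_reward, e_tDA, eta and Rbar are assumed variable names.

```matlab
tonic  = 1;                                         % tonic component T
phiSig = @(y) 3 ./ (1 + exp(-(y - 7) / 1.25));      % sigmoid mapping ~{0,10} to {0,3}
betaDA = tonic + phiSig(dPiReward + S_reward);      % equation (8), adaptive rate
W      = W + betaDA * eta * e_tDA * (Robj - Rbar);  % equation (7); e_tDA: trace at DA burst
```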

As noted in the description of the behavioural data in Fig. 1, animal behaviour exhibits both learning of preparatory behavioural responses to the cue and reactive learning that reduces reaction times between sensory input (either cues or rewards) and motor outputs. The latter is particularly prominent in early training, during which a marked decrease in reward collection latency occurs even in the absence of particularly large changes in the preparatory component of behaviour. We interpreted this reactive component as a ‘direct’ sensorimotor transformation, consistent with the treatment of reaction times in the literature65; thus, reactive learning updates the weights between sensory inputs and the output unit (one specific element of the RNN, indexed as ‘o’ below). Reactive learning was also updated according to PEs, in particular the difference between Robj(T) and the activity of the output unit at the time of reward delivery. For the cue, updates were proportional to the difference between the derivative of the output unit activity at the cue and the PE at reward delivery. These updates were also scaled by the same βDA adaptive learning rate parameter (for example, line 346 in dlRNN-train_learnDA.m):

$$W_{\mathrm{trans},\mathrm{o}}(T)=W_{\mathrm{trans},\mathrm{o}}(T-1)+\beta_{\mathrm{DA}}\times \eta_{\mathcal{S}}\times (R_{\mathrm{obj}}(T)-\pi (t_{\mathrm{reward}}))$$

(9)

in which ηI is the baseline reactive learning rate and typical values were about 0.02 in presented simulations (again a range of different initializations were tested).

We compared acquisition learning in the complete ACTR model to observed mouse behaviour using a variety of approaches. We scanned about two orders of magnitude for the two critical parameters ηI and ηW. We also aimed to sample the model across a range of initializations that approximately covered the range of learning curves exhibited by control mice. To scan this space, we used the following procedure. We initialized 500–1,000 networks with random internal weights and initial sensory input weights (as described above). As no mice that we observed initially exhibited sustained licking, we selected six network initializations with preparatory policies that were approximately constant and near 0. For these six network initializations, we ran 24 simulations (four conditions for each initialization). Specifically, we simulated input vectors with initial weights 𝒮 = [0.1, 0.125, 0.15, 0.175] and baseline learning rates ηI = [2, 2.25, 2.5, 2.75] × 8 × 10⁻³. Representative curves of these simulations are shown in Fig. 1j.

Visualizing the objective surface

To visualize the objective surface that governs learning, we scanned a range of policies (combinations of reactive and preparatory components) passed through the behaviour plant. The range of reactive components covered was [0:1.1] and preparatory was [−0.25:1]. This range corresponded to the space of all possible policy outputs realizable by the ACTR network. For each pair of values, a policy was computed and passed through the behaviour plant 50 times to get an estimate of the mean performance cost. These simulations were then fitted using a third-order, two-dimensional polynomial (analogous to the procedure used for experimental data) and visualized as a three-dimensional surface.

In the case of experimental data, the full distribution of individual trial data points across all mice (N = 7,200 observations) was used to fit a third-order, two-dimensional polynomial (MATLAB; fit). Observed trajectories of preparatory versus reactive were superimposed on this surface by finding the nearest corresponding point on the fitted two-dimensional surface for the parametric preparatory and reactive trajectories. These data are presented in Fig. 1j.
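
A minimal MATLAB sketch of the surface fit; reactive, preparatory and cost are assumed column vectors of the simulated (or observed) points.

```matlab
surfFit = fit([reactive, preparatory], cost, 'poly33');   % third-order 2D polynomial
plot(surfFit, [reactive, preparatory], cost);             % visualize as a 3D surface
```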

Simulating closed-loop stimulation of mDA experiments

We sought to develop an experimental test of the model that was tractable (as opposed to inferring the unobserved policy for example). The experimenter in principle has access to real-time detection of licking during the cue–reward interval. In simulations, this also can easily be observed by monitoring the output of the behavioural plant. Thus, in the model we kept track of individual trials and the number of licks produced in the cue–reward interval. For analysis experiments (Fig. 5e), we tracked these trials and separately calculated the predicted dopamine responses depending on trial type classification (lick– vs lick+). For simulations in Fig. 5e, we ran simulations from the same initialization in nine replicates (matched to the number of control mice) and error bars reflect the standard error.

To simulate calibrated stimulation of mDA neurons, we multiplied the adaptive rate parameter, βDA, by 2 on the appropriate trials. For simulations reported in Fig. 5e, we used three conditions: control, stimLick– and stimLick+. For each of these three conditions, we ran nine simulations (three different initializations, three replicates) for 27 total learning simulations (800 trials each). This choice was an attempt to estimate the expected experimental variance, as the trial classification scheme is an imperfect estimate of the underlying policy.

Pseudocode summary of model

Here we provide a description of how the model functions in pseudocode to complement the graphical diagrams in the main figures and the discursive descriptions of individual elements that are used below.

Initialize trial to T = 0
Initialize ACTR with W(0), 𝒮rew(T), 𝒮cue(T)
repeat
  Run RNN simulation engine for trial T
  Compute plant input π(T) = O(T) + 𝒮(T)
  Compute lick output L(t) = Plant(π(T))
  Compute latency to collect reward tcollect ← find L(t) > treward
  Compute cost(T) = 1 − exp(−∆t/500)
  Evaluate eligibility trace at collection e ← ei,j(tcollect)
  Compute βDA = 1 + ϕ(∆π(treward) + 𝒮rew)
  Compute Robj(T) = 1 − (1 − exp(−∆t/500)) − O(T, treward − 1)
  Estimate objective gradient PE = Robj(T) − R(T)
  Compute update ∆W = −ηJ × e × PE × βDA
  Update W(T + 1) ← W(T) + ∆W
  Update 𝒮rew(T + 1) ← 𝒮rew(T) + η𝒮 × Robj(T) × βDA
  Update 𝒮cue(T + 1) ← 𝒮cue(T) + η𝒮 × Robj(T) × βDA
until T == 800

in which T is the current trial and t is time within a trial, W is the RNN connection weight matrix, 𝒮 is the sensory input strength, O is the RNN output, π is the behavioural policy, ∆t = tcollect − treward, ϕ is the nonlinear (sigmoid) transform, R(T) is the running mean PE, ηJ is the baseline learning rate for W and η𝒮 is the baseline learning rate for input 𝒮.

ACTR model variants

In Fig. 1k, we consider three model variants equivalent to dopamine signalling PEs, dopamine depletion and loss of phasic dopamine activity, all manipulations that have been published in the literature. To accomplish these simulations, we, respectively: changed βDA to equal PE; changed the βDA offset from 1 to 0.1; and set βDA to 1 and removed the adaptive term.

In Figs. 3 and 5, calibrated stimulation was modelled as setting βDA to double the maximal possible magnitude of βDA under normal learning. In Figs. 3c–e and 5i, we modelled uncalibrated dopamine stimulation as setting PE = +1 in addition to the calibrated stimulation effect.

TD learning model

To model standard TD value learning, we reimplemented a previously published model from ref. 66 that spanned a range of model parameterizations.

Policy learning model equivalent to the low-parameter TD learning model

The ACTR model that we articulate seeks to provide a plausible mechanistic account of naive trace conditioning learning using: RNNs; a biologically plausible synaptic plasticity rule; conceptually accurate circuit organization of mDA neurons; a ‘plant’ to control realistic behaviour; and multiple components of processing of sensory cues and rewards. However, to facilitate formal comparison between value learning and direct policy learning models, we sought to develop a simplified model that captures a key aspect of ACTR (the specific gradient it uses) and allows for explicit comparison against existing value learning models with the same number of free parameters. To model a low-parameter (as compared to ACTR) policy learning equivalent of the TD value learning model from ref. 67, we used the same core structure, basis function representation and free parameters. However, rather than using an RPE (value gradient) for updating, we follow previous work32 and consider a direct policy learning version in which a policy gradient is used for updates, as originally described in ref. 21 and equivalent in terms of the effective gradient to the ACTR implementation. First, we consider the latency to collect reward rather than the reward value per se as used in TD models. The latency to collect reward is a monotonic function of the underlying policy, such that an increased policy leads to increased anticipatory licking and a reduction in the collection latency (Fig. 1). Typically, one uses a nonlinearity that saturates towards the limits {0,1}. For simplicity, we chose a soft nonlinearity (half-Gaussian) for convenience of the simple policy gradient that results. Regardless of the scaling parameter of the Gaussian (sigma), the derivative of the log of the policy is then proportional to 1 − pt, in which pt is the policy on trial t (subject to scaling by a constant proportional to sigma that is subsumed into a learning rate term in the update equation). According to the REINFORCE algorithm family21, we have an update function proportional to (rcurr − b) × (1 − pt), in which rcurr is the current trial reward collection latency and b is a local average of the latency calculated by b = υ × rcurr + (1 − υ) × b. Typical values for υ were 0.25 (although a range of different calculations for b, including b = 0, yield consistent results as noted previously21).
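
A minimal MATLAB sketch of the per-trial update described above; eta (and the sign convention folded into it), rCurr, p and b are assumed variable names.

```matlab
upsilon = 0.25;
delta = (rCurr - b) * (1 - p);                 % REINFORCE-style update term
p     = p + eta * delta;                       % policy update (scale/sign folded into eta)
b     = upsilon * rCurr + (1 - upsilon) * b;   % local running average of collection latency
```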

Formal model comparison

As in previous work32, we sought to compare the relative likelihood of the observed data under the optimal parameterization of either the value learning (TD) model or the direct policy learning model. The data we aimed to evaluate were the frequency of anticipatory licking during the delay period over the first approximately 1,000 trials of naive learning for each mouse. We used a recent model formalization proposed to describe naive learning67 and used grid search to find optimal values of the parameters λ, α and γ. To compute the probability of observing a given amount of anticipatory licking as a function of the value function or policy, respectively, we used a normal probability density (sigma = 1) centred on the predicted lick frequency (7 Hz × value or policy). Initial examination revealed that sigma = 1 minimized the −LL for all models, but the trends were the same across a range of sigma. The −LL of a given parameterization of the model was computed as the negative sum of log probabilities over trials for all combinations of free parameters. We also computed the Akaike information criterion68 (the sum of ln(sum(residuals²))), as preferred in some previous work69. The results were consistent and the number of free parameters was equivalent; thus, we primarily report −LL in the manuscript. For direct comparison, we took the minimum of the −LL for each model (that is, its optimal parameterization) and compared these minima across all animals. To examine the ‘brittleness’ of the model fit, we compared the median −LL across the entire grid search parameter space for each model.
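
A minimal MATLAB sketch of the per-trial likelihood used for model comparison; obsLick (observed anticipatory lick rate) and predPolicy (the model's value or policy) are assumed variable names.

```matlab
predLick = 7 * predPolicy;                        % 7 Hz x value or policy
nLL = -sum(log(normpdf(obsLick, predLick, 1)));   % negative log-likelihood, sigma = 1
% Repeat over the parameter grid; the model's score is the minimum nLL.
```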

Estimating PEs from behavioural data

First, we assume that on average the number of anticipatory licks is an unbiased estimate of the underlying policy (the core assumption of the low-parameter models described above). The latency to collect reward can be converted into a performance cost using the same equation (2) described for ACTR. The PE was then computed as in equation (4). A smoothed baseline estimate was calculated by smoothing the PE with a third-order, 41-trial-wide Savitzky–Golay filter, and the baseline-subtracted PE was calculated analogously to equations (4) and (5).
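
A minimal MATLAB sketch of this behavioural PE estimate; latency and policyFromLicks are assumed per-trial vectors.

```matlab
costP  = 1 - exp(-latency / 500);             % performance cost, as in equation (2)
PE     = (1 - costP) - policyFromLicks;       % PE computed as in equation (4)
PEbase = sgolayfilt(PE, 3, 41);               % third-order, 41-trial Savitzky-Golay baseline
PEcorr = PE - PEbase;                         % baseline-subtracted PE
```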

Histology

Mice were killed by anaesthetic overdose (isoflurane, >3%) and perfused with ice-cold phosphate-buffered saline, followed by paraformaldehyde (4% wt/vol in phosphate-buffered saline). Brains were post-fixed for 2 h at 4 °C and then rinsed in saline. Whole brains were then sectioned (100 μm thickness) using a vibrating microtome (VT-1200, Leica Microsystems). Fibre tip positions were estimated by referencing standard mouse brain coordinates70.

Statistical analysis

Two-sample, unpaired comparisons were made using Wilcoxon’s rank-sum test (MATLAB ranksum); paired comparisons using Wilcoxon’s signed-rank test (MATLAB signrank). Multiple comparisons with repeated measures were made using Friedman’s test (MATLAB friedman). Comparisons between groups across training were made using two-way ANOVA (MATLAB anova2). Correlations were quantified using Pearson’s correlation coefficient (MATLAB corr). Linear regression to estimate the contribution of fibre position to variance in mDA reward signals was fitted using MATLAB fitlm. Polynomial regressions used to fit objective surfaces were third order (MATLAB fit). Errors are reported as s.e.m. Sample sizes (n) refer to biological, not technical, replicates. No statistical methods were used to predetermine sample size. Data visualizations were created in MATLAB or GraphPad Prism.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
