Spontaneous behaviour is structured by reinforcement without explicit reward - Nature - Alert Breaking News

A list of reagents and resources is provided in Extended Data Table 1.

Ethical compliance

All experimental procedures were approved by the Harvard Medical School Institutional Animal Care and Use Committee (Protocol Number 04930) and were performed in compliance with the ethical regulations of Harvard University as well as the Guide for Animal Care and Use of Laboratory Animals.

MoSeq

Overview

MoSeq (described previously in refs. ^4,27,60) is an unsupervised machine learning method that identifies brief, re-used behavioural motifs that mice perform spontaneously. MoSeq takes as its input 3D imaging data of mice and returns a set of behavioural ‘syllables’ that characterizes the expressed behaviour of those mice, and the statistics that govern the order in which those syllables were expressed in the experiment. MoSeq was used as it is originally described to explore relationships between endogenous DLS dopamine release and behaviour. This technology was further adapted to accommodate real-time syllable identification for closed-loop manipulations of neural activity as described below. Importantly, the underlying fitted autoregressive hidden Markov model (AR-HMM) for both the ‘offline’ and ‘online’ variants of MoSeq used in this study are the same, enabling comparisons of neural activity associated with syllables that were recognized and performed across multiple experiments.

Pre-processing

MoSeq consists of two essential workflows: one for pre-processing depth data and converting it into a low-dimensional time series that describes pose dynamics, and another for modelling the low-dimensional time-series data. As previously described, in order to focus on pose dynamics, raw depth frames were first background-subtracted to convert depth units from distance to height from the floor (in millimeters). Next, the location of the mouse was identified by finding the centroid of the contour with the largest area using the OpenCV findcontours function. An 80 × 80 pixel bounding box was drawn around the identified centroid, and the orientation was estimated using an ellipse fit (with a previously described correction for ±180-degree ambiguities^4,27). The mouse was rotated in the bounding box to face the right side. The 80 × 80 pixel depth video of the centred, oriented mouse was then used to estimate pose dynamics.

Size-normalizing deep network

To accommodate noise in online syllable estimation and other sources of variation in the depth images not due to changes in pose dynamics (for example, occluding objects such as fibre-optic cables), we designed a denoising convolution autoencoder. The network was designed using TensorFlow to process images in <33 ms, the time between frame captures on the Microsoft Kinect V2⁶¹. On the encoder side, 4 layers of 2D convolutions (ReLu activation) followed by max pooling were used to downsample the 80 × 80 images to 5 × 5. Another 4 layers of 2D convolutions with successive upsampling layers were used on the decoding side to reconstruct the 80 × 80 images (10,310,041 total parameters). Batch normalization was used during training with a batch size of 128. In order to train the network, we used a size- and age-matched dataset (7–8 weeks of age). Mouse images were corrupted through rotation, position jitter, zooming in and out (that is, changing size), and superimposing depth images of fibre-optic cables. The network was fed corrupted mouse images as input and was trained to minimize the reconstruction loss of the original, corresponding uncorrupted mouse images (Extended Data Fig. 8a–c). The model was trained for 100 epochs using stochastic gradient descent with early stopping. Both online and offline variants of MoSeq included the size-normalizing network to ensure results were comparable.

Dimensionality reduction and AR-HMM training

In order to represent pose dynamics in a common space for all experiments, principal components and an AR-HMM time-series model were trained offline on a sample dataset of genotype- and age-matched mice. The parameters describing the principal components and AR-HMM model were saved. All depth videos acquired for this paper were then projected onto these same principal components for all experiments, whether they used the online or offline variant. As previously described, principal components were estimated from cropped, oriented depth videos, and the AR-HMM was trained on the top 10 principal components. Since the denoising autoencoder was used for all experiments, mouse videos from the size-and-age-matched dataset were fed through the denoising autoencoder prior to principal component estimation.

Offline variant

In the offline variant, the Viterbi algorithm was used to estimate the most probable discrete latent state sequence according to the trained AR-HMM for each experiment post hoc. This variant was used to analyse all data except for the Opto-DA experiments shown in Figs. 3 and 4.

Online variant

In the online variant, syllable likelihoods were computed and updated by computing the forward probabilities of the discrete latent states for each frame as they arrived from the depth sensor. To avoid spurious syllable detections, the targeted syllable probability had to cross a user-defined threshold for three consecutive frames.

Histological verification

Mice were euthanized following completion of behavioural tests. Mice were first perfused with cold 1× PBS and subsequently with 4% paraformaldehyde. Fifty-micrometre sections of extracted brain tissue were sliced on a Leica VT1000 vibratome. All slices were mounted on glass slides using Vectashield with DAPI (Vector Laboratories) and imaged with an Olympus VS120 Virtual Slide Microscope.

dLight validation and variant selection

dLight1.1 was selected to visualize dopamine release dynamics in the DLS owing to its rapid rise and decay times, comparatively lower dopamine affinity (so as to not saturate binding), as well as its responsiveness over much of the physiological range of known DA concentrations in freely moving rodents^31,62,63,64.

Since dopamine-free and dopamine-bound excitation spectra have yet to be reported for the dLight1.1 sensor, a series of in vitro experiments was performed to identify an excitation wavelength whose fluorescence was stable and independent of dopamine levels, and which therefore could be used for post hoc motion artefact correction. Like GCaMP, dLight1.1 uses cpGFP as a chromophore, and various generations of GCaMP have been shown to: (1) have an increase in ligand-free fluorescence when excited with 400 nm wavelengths and (2) have an isosbestic wavelength in the UV to blue region^65,66,67. To test whether UV excitation could be a suitable reference wavelength for dLight1.1, HEK 293 cells (ATCC, cells were validated by ATCC via short tandem repeat analysis and were not tested for mycoplasma) were transfected with the dLight1.1 plasmid (Addgene 111067-AAV5) using Mirus TransIT-LT1 (MIR 2304). Cells were imaged using an Olympus BX51W I upright microscope and a LUMPlanFl/IR 60×/0.90W objective. Excitation light was delivered by an AURA light engine (Lumencor) at 400 and 480 nm with 50 ms exposure time. Emission light was split with an FF395/495/610-Di01 dichroic mirror and bandpass filtered with an FF01-425/527/685 filter (all filter optics from Semrock). Images were collected with a CCD camera (IMAGO-QE, Thermo Fisher Scientific), at a rate of one frame every two seconds, alternating the excitation wavelengths in each frame. Image acquisition and analysis were performed using custom-built software written in MATLAB⁶⁸ (Mathworks). Cells were segmented from maximum-projection fluorescence images using Cellpose⁶⁹. Cells with a diameter of less than 30 pixels were excluded from downstream analysis. Fluorescence traces were denoised using a hampel filter (window size 10 and threshold set to 2 median absolute deviations from the median) and normalized to ΔF/F₀. Cells were included if their maximum ΔF/F₀ exceeded 5%. F₀ was computed by fitting a bi-exponential function to the time series.

Stereotaxic surgery for open field photometric recordings

Eight- to ten-week-old C57BL/6J (n = 6 mice, The Jackson Laboratory stock no. 000664) mice of either sex were anaesthetized using 1–2% isofluorane in oxygen, at a flow rate of 1 l min⁻¹ for the duration of the procedure. AAV5.CAG.dLight1.1 (Addgene #111067, titre: 4.85 × 10¹²) was injected at a 1:2 dilution (either sterile PBS or sterile Ringer’s solution) into the DLS (AP 0.260; ML 2.550; DV −2.40), in a total volume of 400 nl per injection. For all stereotaxic implants, AP and ML were zeroed relative to bregma, DV was zeroed relative to the pial surface, and coordinates are in units of mm. Injections were performed by a Nanoject II or a Nanoject III (Drummond) at a rate of 10 nl per 10 s, unilaterally in each mouse. A single 200-µm diameter, 0.37–0.57 NA fibre cannula was implanted 200 µm above the injection site at the DLS (DV −2.20) for photometry data collection. Finally, medical-grade titanium headbars (South Shore Manufacturing) were secured to the skull with cyanoacrylate glue (Loctite 454).

Mice were group-housed prior to stereotaxic surgery procedures, and following surgery were individually housed on a 12-hour dark–light cycle (09:00–21:00). All behavioural recordings were done between 010:00 and 17:00.

Stereotaxic surgery for simultaneous photometric recordings and optogenetic stimulation

Six- to 12-week old DAT-IRES-cre mice (n = 10 mice, The Jackson Laboratory stock no. 006660) of either sex were injected with the same dLight1.1 virus described above into the right hemisphere DLS. Additionally, using the same previously described surgical procedure, 350 nl of AAV1.Syn.Flex.ChrimsonR.tdTomato (UNC Vector Core, titre: 4.1 × 10¹²) was injected into the right hemisphere SNc (AP −3.160; ML 1.400; DV −4.200 from pia), in a 1:2 dilution for calibration and stimulation experiments (see below). Mice were implanted unilaterally with a 200 µm core 0.37–0.57 NA fibre over the DLS for simultaneous stimulation and photometric data collection.

Two of the ten mice were used to calibrate optogenetic stimulation (see ‘dLight calibration experiments’). The other 8 mice injected with dLight and ChrimsonR were also run through the 3 complete closed-loop experiments described in ‘Closed-loop DLS dopamine stimulation experiments’ (one experiment with 250 ms continuous wave (CW) stimulation, one with 2 s CW stimulation, and another with 3 pulsed stimulation, 25 Hz frequency with 5 ms pulse width). Baseline data from these experiments were combined with mice described in ‘Fibre Photometry for dLight recordings’, thus yielding a total of n = 14 mice. Two of the 12 dLight only mice did not pass our quality control criteria for dLight recordings and were thus excluded from all dLight analysis (note that they were included in Extended Data Fig. 2a–b,d only, which strictly used behavioural data). Baseline data were considered data from the day prior to a stimulation day, or the day after with the targeted syllable excluded (yielding n = 378 experiments total). If the targeted syllable could not be reasonably excluded then data from the day after a stimulation day was excluded entirely.

dLight behaviour procedures

OFA experiments

Depth videos of mouse behaviour were acquired at 30 Hz using a Kinect 2 for Windows (Microsoft) using a custom user interface written in Python (similar to ref. ⁶⁰) on a Linux computer. For all OFA experiments, except where noted, mice were placed in a circular open field (US Plastics 14317) in the dark for 30 min per experiment, for 2 experiments per day. As described previously, the open field was sanded and painted black with spray paint (Acryli-Quik Ultra Flat Black; 132496) to eliminate reflective artefacts in the depth video.

Food reward experiments

To assess whether spontaneous dLight transients in the DLS were of appreciable magnitude compared to reward consumption-related transients, a series of separate dLight photometry experiments were run to measure reward consumption-related transient magnitudes (n = 6 mice). For two days prior to the experiment, mice were habituated to the open field arena for two 30-min experiments on each day. On the morning of the experiment, to increase the salience of food reward, mice were habituated to the experimental room and food and water restricted for 3–5 h prior to beginning the experiment. Mice were placed in the arena, and behaviour and photometry data were simultaneously acquired. Chocolate chips (Nestle Toll House Milk Chocolate) were divided into quarters and introduced into the arena at random intervals and locations decided by the experimenter (with an average of 1 chocolate chip piece every 4 min) for mice to freely consume for a total of 30 min. To identify reward consumption-related responses, a human observer indicated each moment in time during the experiment where mice began to consume the chocolate via post hoc inspection of the infrared video captured by the Kinect. Photometry signal peaks for Fig. 2a were identified at the onset of consumption. Mean spontaneous transient peak had observed magnitudes of 2.12 ± 0.80 ΔF/F₀ (z) (n = 5,247 transients). By comparison, mean reward consumption-associated transients had an approximate magnitude of 2.36 ± 0.92 ΔF/F₀ (z) (n = 10 transients).

Fibre photometry for dLight recordings

Photometry and behavioural data were collected simultaneously. A digital lock-in amplifier was implemented using a TDT RX8 digital signal processor as previously described²⁷. A 470 nm (blue) LED and a 405nM (UV) LED (Mightex) were sinusoidally modulated at 161 Hz and 381 Hz, respectively (these frequencies were chosen to avoid harmonic cross-talk). Modulated excitation light was passed through a three-colour fluorescence mini-cube (Doric Lenses FMC7_E1(400-410)_F1(420-450)_E2(460-490)_F2(500-540)_E3(550-575)_F3(600-680)_S), then through a pigtailed rotary joint (Doric Lenses B300-0089, FRJ_1x1_PT_200/220/LWMJ-0.37_1.0m_FCM_0.08m_FCM) and finally into a low-autofluorescence fibre-optic patch cord (Doric Lenses MFP_200/230/900-0.37_0.75m_FCM-MF1.25_LAF or MFP_200/230/900-0.57_0.75m_FCM-MF1.25_LAF) connected to the optical implant in the freely moving mouse. Emission light was collected through the same patch cord, then passed back through the mini-cube. Light on the F2 port was bandpass filtered for green emission (500–540 nm) and sent to a silicon photomultiplier with an integrated transimpedance amplifier (SensL MiniSM-30035-X08). Voltages from the SensL unit were collected through the TDT Active X interface using 24-bit analogue-to-digital convertors at >6 kHz, and voltage signals driving the UV and blue LEDs were also stored for offline analysis.

The output of the PMT was then demodulated into the components generated by the blue and UV LEDs. The voltage signal was multiplied by the two driving signals—corresponding to the green emission due separately to blue and UV LED excitation—and low-passed using a third order elliptic filter (max ripple: 0.1; stop attenuation: 40 dB; corner frequency: 8 Hz). The UV component was used a reference signal.

Synchronizing depth video and photometry

To align photometry and behavioural data, a custom IR led-based synchronization system was implemented. Two sets of 3 IR (850 nm) LEDs (Mouser part # 720-SFH4550) were attached to the walls of the recording bucket and directed towards the Kinect depth sensor. The signal used to power the LEDs was digitally copied to the TDT. An Arduino was used to generate a sequence of pulses for each LED set. One LED set transitioned between on and off states every 2 s while the other transitioned into an on state randomly every 2–5 s and remained in the on state for 1 s. The sequences of on and off states of each LED set were detected in the photometry data acquired with the TDT and IR videos captured by the Kinect. The timestamps of the sequences were aligned across each recording modality and photometry recordings were down sampled to 30 Hz to match the depth video sampling rate. This same mechanism was used to align photometry data to keypoints in Extended Data Fig. 3.

Photometry pre-processing

Demodulated photometry traces were normalized by first computing ΔF/F₀. F₀ was estimated by calculating the 10th percentile of the photometry amplitude using a 5-s sliding window to account for slow, correlated fluorescence changes between dLight and the UV reference channels. Both the dLight and reference channels were normalized using this procedure. Since the UV reference signal captures non-ligand-associated fluctuations in fluorescence (deriving from hemodynamics, pH changes, autofluorescence, motion artefact, mechanical shifts, and so on), a fit reference signal was subtracted from the dLight channel (see ‘Photometry active referencing’). Finally, referenced dLight traces were z-scored using a 20-s sliding window with a single sample step size slid over the entire experiment to remove slow trends in ΔF/F₀ amplitudes due to long timescale effects—for example, photobleaching. Only experiments where the maximum percentage ΔF/F₀ exceeded 1.5 and the dLight to reference correlation was below 0.6 were included for further analysis.

Photometry active referencing

In order to remove the effects of motion and mechanical artefacts from downstream analysis, a fit reference signal was subtracted from the demodulated dLight photometry trace as initially mentioned in ‘Photometry pre-processing’^31,54 (Extended Data Fig. 1g). First, the reference signal was low-pass filtered with a second-order Butterworth filter with a 3 Hz corner frequency. Next, to account for differences in gain or DC offset, RANSAC ordinary least squares regression was used to find the slope and bias with which to transform the reference signal to minimize the difference between the reference and the dLight photometry traces. Finally, the transformed reference trace was subtracted from the dLight trace.

Capturing 3D keypoints

To capture 3D keypoints, mice were recorded in a multi-camera open field arena with transparent floor and walls. Near-infrared video recordings at 30 Hz were obtained from six cameras (Microsoft Azure Kinect; cameras were placed above, below and at four cardinal directions). Separate deep neural networks with an HRNet architecture were trained to detect keypoints in each view (top, bottom and side) using ~1,000 hand-labelled frames⁷⁰. Frame-labelling was crowdsourced through a commercial service (Scale AI), and included the tail tip, tail base, three points along the spine, the ankle and toe of each hind limb, the forepaws, ears, nose and implant. After detection of 2D keypoints from each camera, 3D keypoint coordinates were triangulated and then refined using GIMBAL—a model-based approach that leverages anatomical constraints and motion continuity⁷¹. GIMBAL requires learning an anatomical model and then applying the model to multi-camera behaviour recordings. For model fitting, we followed the approach described in ref. ⁷¹, using 50 pose states and excluding outlier poses using the EllipticEnvelope method from sklearn. For applying GIMBAL to behaviour recordings, we again followed⁷¹, setting the parameters obs_outlier_variance, obs_inlier_variance, and pos_dt_variance to 1e6, 10 and 10, respectively for all keypoints.

Computing 2D and 3D velocity

To compute 2D translational velocity, the centroid of the keypoints associated with the spine (approximating whole-body movement) was computed for the x and y planes (the z plane was disregarded). Then, the velocity was computed from the difference in position between every 2 frames and divided by 2 (to provide a smoother estimate of velocity). 3D translational velocity was computed the same way, except the z plane was included in the calculation. The average velocity of the keypoints associated with the forepaws were used to compute 3D forelimb velocity.

Partialing kinematic parameters from dLight

To compute the relationship between dLight and forelimb velocity, other kinematic parameters known to be correlated with dLight were partialed out of the dLight fluorescence signal. Specifically, 2D velocity, 3D velocity and height were partialed out of dLight using linear regression. Then, the correlation between the partialed dLight signal and 3D forelimb velocity were computed and compared to 1,000 bootstrapped shuffles.

Movement initiation analyses

A changepoint detection algorithm was used to find moments where mice transitioned from periods of relative stillness to movement. To capture long bouts of movement, the velocity of the 2D centroid of the mouse was z-scored across each experiment and then smoothed with a 50-point (1.67s) boxcar window. To find sharp changes in velocity, the derivative of smoothed velocity trace was computed, and the result was raised to the third power. Peaks in this velocity changepoint score were discovered using SciPy’s findpeaks function with the following parameters: height 1, width 1, prominence 1 so that consecutive data points around each peak were disregarded.

dLight time warping

To account for variability in syllable duration, dLight traces were time warped for Extended Data Fig. 4a. Here, all dLight traces were linearly interpolated using the numpy.interp function to a duration of 0.83 s, or 25 samples. Thus, syllables longer than 0.83 s were linearly compressed, and syllables shorter than 0.83 s were linearly expanded. We obtained similar results time warping dLight traces to 0.4 s; thus, the duration of time warped instances did not affect interpretation of subsequent analyses.

dLight average waveform z-scoring

For dLight waveforms shown in Fig. 1f, top and bottom, h,i,k and Extended Data Figs. 4c–g, 5c,f and 7c, first onset-aligned waveforms were z-scored using the mean and s.d. of fluorescence values from 10 s prior to 10 s after onset. Next, to account for differences in the number of syllable instances (trials) in each average, waveforms were additionally normalized by z-scoring relative to the mean and s.d. of 1,000 shuffle averages, where individual trials were circularly permuted prior to averaging.

Decoding syllable identity from dLight waveforms

To decode syllable identity from dLight waveforms or dLight peaks, a random forest classifier⁷² (cuRF = 1,000 trees, max depth = 1,000, number of bins = 128, with cross-validation on 5 folds of data) was trained to predict syllable and syllable group identity on held-out data (similar to ref. ²⁷). Syllable groups were created by hierarchically clustering syllables based on their pairwise MoSeq distance (see below) and thresholds were increased with a distance cut-off in steps of 0.2. The inputs to the random forest classifier were either: (1) the maximum z-scored dLight value from syllable onset to 300 ms after syllable onset for each syllable instance or (2) dLight waveforms and their derivatives starting at syllable onset up to 300 ms into the future for individual syllable instances. Held-out accuracy was compared to 100 shuffles of syllable identity.

Decoding turning orientation from dLight waveforms

To decode turning orientation from dLight waveforms (Extended Data Fig. 5c), a linear support vector machine was trained to classify whether a particular syllable instance is a left- or rightward turning syllable using cross-validation on five folds of data. To sample the behaviour space of turning syllables, eight syllables with the largest angular velocities were chosen, four for each turning orientation. The model was fit to dLight waveforms and their derivatives starting at syllable onset up to 300 ms after onset for individual syllable instances and was tested on held-out data.

MoSeq distance

The MoSeq distance between two syllables was computed as previously described²⁷. In brief, the estimated autoregressive matrices for each syllable were used to generate synthetic trajectories through principal component space (that is, in the space defined by the first ten principal components of the depth video). Then, the correlation distance between trajectories for all pairs of syllables were computed. Since the online and offline variants of MoSeq used the same autoregressive matrices, these distances are equivalent in the online and offline variants.

Analysing the relationship between dLight and syllable statistics within an experiment across syllables

The dLight fluorescence associated with syllable transitions was computed as the maximum z-scored dLight value from syllable onset to 300 ms after syllable onset for each syllable transition, to account for jitter in dopamine release or technical jitter in defining syllable changepoints. Throughout the text, we refer to syllable-associated waveform peak amplitudes in z-scored ΔF/F₀ units as ‘syllable-associated dLight’. These dLight values were then averaged for each syllable and for each experiment. To assess the correlation between syllable-associated dLight and syllable counts, the dLight averages were z-scored across syllables in each experiment. These normalized dLight peaks represented whether a syllable had relatively higher or lower dLight during a given experiment. Finally, experiment-normalized dLight values along with syllable counts were then averaged across experiments for each mouse, thus leaving a value for each mouse and each syllable.

In order to measure the linear relationship between dLight peak values and syllable counts, a robust linear regression using the Huber regressor⁷³ predicted average syllable counts from average dLight peaks. The regression model was evaluated using a fivefold cross-validation repeated 100 times. Reported correlation values in Figs. 1j and 2 were estimated over the held-out data. P-values were estimated by comparing held-out correlation values to those estimated from a linear model computed on shuffled data. To remove syllables that varied due to finite size effects, only syllables that occurred at least 100 times total across all experiments per mouse were included.

To compute syllable entropy (estimating the randomness of outgoing transitions associated with each syllable), the outgoing transition probabilities associated with each syllable for each mouse were computed by counting the number of occurrences a syllable transitions to all others within an experiment and expressing this as a probability distribution. Next, the Shannon entropy was estimated over the outgoing transition probabilities for each syllable. Finally, the linear regression was estimated using the exact same procedure used for syllable counts.

Analysing the relationship between dLight and syllable statistics across experiments for each syllable

This series of analyses queried a total of 379 experiments. To capture the correlation between syllable-associated dLight peaks and syllable-associated behavioural features (syllable frequency, syllable entropy) within syllables but across experiments, first, the maximum z-scored dLight amplitude from onset to 300 ms after syllable onset at each syllable transition was computed. These syllable-associated dLight peaks were averaged for each experiment and syllable. Then, the dLight peak averages for each syllable and mouse were z-scored separately across experiments. Additionally, to put variation of each syllable across experiments on the same scale, syllable frequency, and syllable entropy were also z-scored for each syllable and mouse across experiments (Fig. 2b,i, bottom). Next, to remove variability in the calculation, values were pooled across syllables for each experiment, thus leaving a value for each experiment and mouse. To remove syllables that varied due to finite size effects, first only syllables that occurred at least 50 times per session on average were considered for downstream analysis. Linear models (Huber regressors) were fit to the resulting average dLight peaks, syllable frequency, and syllable entropy and evaluated as described in the previous section.

Analysing the moment-to-moment relationship between dLight and syllable statistics within an experiment

This series of analyses queried a total of 760 syllable–experiment pairs. dLight peak values were estimated by taking the maximum dLight value from onset to 300 ms after onset at each syllable transition. Velocity, syllable counts, and dLight peak values were averaged per syllable and per mouse over an expanding bin size; that is, velocity, syllable counts, and dLight peak values were estimated over the subsequent n syllables after the transition were dLight value was calculated, where n varied from 5 syllables up to 400 (Fig. 2e). For sequence randomness, to avoid finite size effects, dLight values were binned into 20 equally spaced bins per syllable (Fig. 2k). Then, transition matrices were combined within each bin across all syllables per mouse and per time bin. Finally, Pearson correlation values were then calculated between dLight values and the behavioural features estimated at each bin size. Pearson coefficients were z-scored using the mean and s.d. from Pearson coefficients estimated after shuffling dLight peak values.

Note that, in order to prevent the measurement from being influence by consistent non-stationarities in behaviour, these correlations were computed within each of the five time segments shown by dashed lines in Extended Data Fig. 2e. Then, per-segment correlations were averaged.

Time-constants associated with the correlation between dLight values and behavioural features over increasing bin sizes were estimated by fitting an exponential decay curve to the correlation values at each bin size using the SciPy’s curvefit function⁷⁴. Decay functions were fit over 1,000 bootstrap resamples of the data; the depicted distributions are taus fit over each resample.

Analysing the cross-correlation between syllable-associated dLight and syllable usage

The dLight fluorescence associated with all instances of a given syllable was binned across a three-minute window (chosen based upon the decay in Fig. 2f) and correlated with the use of that same syllable across a 3-min window, with the windows shifted the indicated amounts (x-axis). Correlation values (in Fig. 2g,h) were z-scored using the mean and s.d. from shuffles. P-values were estimated via shuffle test.

Analysing the relationship between syllable-associated dLight and syllable classes

Syllables were manually classified into 6 classes by hand-labelling crowd videos summarizing model output^4,27,60. Then, syllable-associated dLight was averaged for all syllables within each class.

Encoding model predicting average dLight from behaviour

As with the linear regression analysis (previous section), dLight peaks were estimated by taking the maximum z-scored dLight amplitude from syllable onset to 300 ms after onset. Behavioural features (entropy, velocity, and syllable counts) after each transition were computed across various bin sizes as described in ‘Analysing the relationship between dLight and syllable statistics within an experiment’. The bin sizes used were 5, 10, 25, 50, 100, 200 300, 400, 800 and 1,600 syllables. Syllable frequency, syllable entropy and velocity were averaged for each experiment and syllable in each bin size. These syllable and experiment-wide average values were then z-scored separately for each mouse and then averaged for each mouse and each syllable. In order to remove correlations between behavioural features they were whitened using zero-phase component analysis (ZCA) whitening. Whitened behavioural features were then fed to a Bayesian linear regression model to predict average dLight peak amplitudes per syllable and per mouse according to the following equation:

$$p(,y|X,beta ,{sigma }^{2})=N({beta }^{T}X,{sigma }^{2})$$

where X is defined as features, β is regression coefficients, y is dLight peak values, σ is the s.d., and N is the normal distribution. A normal prior was placed on the regression coefficients, and an exponential prior was placed on the s.d. Samples from the posterior were drawn via the no u-turn sampler (NUTS) using NumPyro (n = 1,000 warmup samples, then n = 3,000 samples)⁷⁵. To assess the temporal relationship between behavioural features and dLight, a separate model was fit at each lag (here, features were whitened separately within each lag, Extended Data Fig. 6c). Overall model performance was quantified by feeding in features at their approximate best bin size to the model. For kinematic parameters and for entropy, this bin size (lag) was 10 timesteps; for syllable counts, this bin size (lag) was 100 timesteps (in syllable time). Then, each feature was fed in separately to quantify the performance of feature subsets.

Encoding model predicting instantaneous dLight from behaviour

In order to predict instantaneous dLight amplitudes from syllable counts, syllable entropy, velocity (2D, angular and height velocity), and acceleration, a series of convolution kernels were estimated, each of which map from each behavioural feature to dLight amplitude. Mathematically, the model can be written as follows:

$${rm{dLight}}left(tright)=sum _{fin F}mathop{sum }limits_{t{prime} =-2s}^{2s}{{rm{beta }}}_{f}left(t-{t}^{{prime} }right)fleft(tright)$$

where dLight (t) corresponds to the dLight trace at time step t, f(t) is the behavioural feature at time step t, and β is the weight of the convolution kernel. Kernel weights were optimized using a Huber loss via the Jax library⁷⁶. That is to say, the dLight amplitude at each time sample is predicted by convolving each behavioural feature (frequency, entropy, velocity, and acceleration) with a convolution kernel and then summing the result across features. The model was trained and evaluated using twofold cross-validation by recording experiment, and the Pearson correlation between predicted dLight amplitudes and actual amplitudes was assessed on held-out experiments. In order to remove the effects of high frequency noise on training and evaluation, the dLight traces were smoothed using a 60-sample (2-s) boxcar filter prior to training and evaluation.

Decoding model predicting behaviour from dLight

The decoding model was designed to capture the two main effects of dopamine on behavioural statistics—usage and sequencing. The goal of the decoding model is to predict the likelihood of a sequence of syllables given past dopamine. The model comprises two key features: (1) a component that scales syllable usage by past syllable-associated dopamine, and (2) a component that scales randomness of the next syllable choice by past global dopamine. This can be summed up with the following equation:

$$P({s}_{t}=i)propto exp left(frac{{alpha }_{a}mathop{sum }limits_{n=1}^{250}left({rm{d}}{a}_{t-n}exp left(frac{-n}{{tau }_{a}}right)delta left({s}_{t-n}=iright)right)}{{alpha }_{b}mathop{sum }limits_{n=1}^{250}left({rm{d}}{a}_{t-n}exp left(frac{-n}{{tau }_{b}}right)right)}right)$$

where s_t is the syllable a mouse performs at time t during a behaviour experiment, da_t is the peak dLight recorded for syllable s_t, τ_a and τ_b describe the timescale of the usage and choice randomness component respectively, α_a and α_b scale the usage and choice randomness components respectively, and δ is the Dirac delta function (that is, one-hot encoding) that returns 1 when s_t − 1 = i and 0 otherwise.

The parameters α_b, τ_a and τ_b were fixed using approximations of analysis of the behavioural data (Fig. 2), and only α_a was learned by maximizing the likelihood of the function above given the sequence of syllables mice perform across a group of experiments and peak dLight measurements associated with the syllable sequence z-scored across each experiment. This was done via evaluating the likelihood of the function over multiple values of α_a. τ_a (describing the effect of dopamine on future syllable usage/counts) was fixed at 100 syllable timesteps, and τ_b (describing the effect of dopamine on syllable sequence entropy) was fixed at 10 syllable timesteps. These values were approximated from the median τ values reported in Fig. 2.

To test model performance, data were split into 5 folds of training and test experiments and repeated 100 times using repeated K-fold cross-validation. We then computed the Pearson correlation between syllable counts from model simulations and actual syllable counts after smoothing with a 50-point rolling average. The one free parameter was fit using the training dataset and assessed on the test dataset. To avoid degradation in performance due to syllable sparsity, the top 10 syllables were used. The model was compared to a suite of control models, each evaluated over the same folds. The dopamine phase shift model was evaluated on the same data, but with all dopamine traces circularly shifted by a random integer between 1 and 1,000, and the noise model was evaluated with dopamine traces replaced by numbers drawn from a unit variance random normal distribution (since the traces were z-scored). In order to determine the maximum possible performance, the per experiment number of counts per syllable was correlated with the across-experiment average. Here, the model performed significantly better than controls. Median Pearson correlation between held-out predictions and observed data: actual model r = 0.20, phase shift control r = 0.04, noise model r = 0.04. Comparison between actual model and controls, P = 7 × 10⁻¹⁸, U = 2,500, f = 1, Mann–Whitney U test, n = 50 model restarts.

To test the hypothesis that endogenous and exogenous dopamine linearly combine to alter the future usage of single syllables of behaviour, the present decoding model was modified. Maximal correlations were identified between predicted and observed syllable usages when adding (or subtracting) extra dopamine (termed ‘extra DA’) to the syllable-associated dopamine amplitudes observed on catch trials (Fig. 4g–i). Model-based log likelihoods of held-out syllable choices from Opto-DA stimulation day experiments were then computed. Other versions of this model (shown in Fig. 4h) included: (1) a control model in which no ‘extra DA’ is added to the model (‘no offset’), (2) a control that uses a phase-shifted version of the dLight trace (‘random shift’), and (3) a model that uses random numbers from a normal distribution with mean and variance matched to the dLight signal (‘noise’).

dLight calibration experiments

In order to characterize the speed and magnitude of evoked dopamine transients in the open field, dLight transients were elicited using brief optogenetic stimulation of SNc axons in the DLS expressing ChrimsonR while mice freely explored an open field arena⁷⁷. A number of stimulation parameters were tested, using varying light intensity, stimulation length, and whether the stimulus was delivered in as a single continuous-wave pulse or delivered as multiple rapid short pulses. A single, short (250 ms; roughly the timescale of syllables), continuous stimulation pulse of red light at 10 mW (Opto Engine MRL-III-635; SKU: RD-635-00500-CWM-SD-03-LED-0) most effectively matched the amplitude and dynamics of endogenous dLight transients observed in the open field. The mean Opto-DA peak was measured at 2.18 ± 0.85 ΔF/F₀ (z), mean spontaneous peak = 2.23 ± 0.62 ΔF/F₀ (z) and 99th percentile spontaneous peak = 3.40 ΔF/F₀ (z) Pulsed stimulation was also disfavoured as numerous studies have shown that pulsed stimulation can cause synchrony in neural and axonal networks that can evoke prolonged release^78,79,80. Note that when excited with 635 nm light, the efficiency with which light evokes spiking in neurons expressing ChrimsonR is similar to efficiency with which blue light evokes spiking in neurons expressing ChR2⁷⁷.

Once a single 250 ms continuous pulse of 10 mW light was preliminarily chosen as the desired optogenetic stimulus to evoke dopamine release from DLS dopamine axons, another round of open-loop stimulation with these stimulation parameters was performed in the open field in two of the 10 total mice injected with dLight and ChrimsonR. In these two mice, the intervals between stimulation times were drawn by randomly choosing an integer delay between 6 and 17 s for each stimulation. This range was chosen to guarantee each animal received at least 100 stimulations during an experiment. This enabled analysis of more stimulation trials with intended parameters to verify that the amplitude of evoked transients were within the same order of magnitude as spontaneously evoked transients (Fig. 3c).

DMS dLight recordings

As a series of control experiments to establish the specificity of DLS dopamine encodings, dLight recordings were performed in the DMS using the same techniques described above. dLight stereotactic injections in wild-type mice of either sex (C57BL/6J, n = 8) were performed at AP: 0.26, ML: 1.5, and DV: −2.2. Fibres for photometry (in C57BL/6J mice, n = 8, n = 64 recording experiments) were implanted in the manner described above at coordinates: AP: 0.26, ML: 1.5, DV: −2.0. Open field behavioural recordings and encoding models were performed for these data exactly as described above.

Stereotaxic surgery for optogenetics

Eight- to fifteen-week-old DAT-IRES-cre::Ai32 mice resulting from the cross of DAT-IRES-cre mice (The Jackson Laboratory, 006660) and Ai32 mice (The Jackson Lab, 012569) of either sex were used. The double transgenic DAT-IRES-cre::Ai32 mouse line has previously been used to conduct specific dopaminergic neuron activation^10,81,82. Similar surgical procedures were used as described above, except two 200 µm 0.37 NA multimode optical fibres were implanted bilaterally over DLS (AP 0.260; ML 2.550; DV −2.300), in DAT-IRES-cre::Ai32 mice (n = 20). Control animals (DAT-IRES-cre mice, n = 12) of either sex were implanted bilaterally at the same coordinates, with 6 of these animals implanted in the nucleus accumbens (AP 1.300; ML 1.000; DV −4.000). These animals are collectively termed ‘no-opsin controls’ throughout the manuscript. Medical-grade titanium headbars were secured to the skull using cyanoacrylate. Optical stimulation experiments were then performed 2–3 weeks post-surgery.

Closed-loop stimulation behavioural paradigm

For two days prior to the closed-loop stimulation schedule (Fig. 3d), mice were habituated to the bucket for two 30-min experiments on each day. To test the change in statistics of specific syllables via syllable-triggered optogenetic stimulation, experiments were performed in a three-day schedule for each of six chosen target syllables. On the first day, two 30-min experiments were run for each mouse to characterize baseline target syllable usage. On the second day, two 30-min ‘stimulation’ experiments were performed for each mouse. During these experiments blue light (470 nm, 10 mW, a single 250-ms continuous-wave pulse) was delivered on 75% of target syllable detections. Stimulation was not conditioned on syllables occurring before the target. Finally, on the third day, baseline experiment recordings were repeated to assess syllable usage memory and usage decay after reinforcement. For half of the targeted syllables for each mouse (randomized across mice), the pre-stimulation baseline experiment is the same experiment as the post-stimulation baseline experiment for a different syllable (see Fig. 3d). A three-day cadence with multiple, short behavioural recording experiments per day was chosen to both minimize non-stationarities in syllable usage within an experiment, as well as to not expose the mice to the behavioural arena for more than one total hour per day. To control for order effects on changes in target syllable usages over time, animals were randomly split into two groups, each of which had a unique ordering of target syllables across the six stimulation days of the three-week cadence. The time interval between the first experiment and the second experiment for the same mouse on each day (either recording or stimulation) was 195 min on average ±58 min (s.d.). Mice were euthanized following completion of behavioural tests, and histology was performed using procedures described above.

To assess the effect of increased dopamine release these experiments were repeated with 3-s pulsed stimulation (25 Hz, 5 ms pulse width) in n = 3 DAT-cre::Ai32 and n = 2 (DAT-IRES-cre) control animals.

Closed-loop velocity modulation experiments

DAT-IRES-cre::Ai32 mice (n = 5) of either sex underwent 90-min recording and manipulation experiments. For the first 30 min, we estimated the distribution of velocities for a specific target syllable. Then, for the next 30 min, optogenetic stimulation was triggered both when the syllable was expressed according to our closed-loop system and when the animal’s syllable-specific velocity exceeded the 75th percentile or went below the 25th percentile. Experiments were analysed only if the mouse received at least 50 stimulations and they increased the usage of the target syllable on average relative to their average baseline (established via separate recording experiments with no stimulation).

Quantifying changes in target syllable counts

First, the number of times the targeted syllable was performed during a 30-s sliding window (non-overlapping) for each 30-min stimulation experiment was computed. Then, a cumulative sum was taken. To turn the result into an estimate of excess target counts, a cumulative sum was also computed from the morning and evening experiments from the most recent previous baseline day. Finally, the average of the morning and evening baseline estimates was averaged and subtracted off.

‘Learner’ mice were defined as mice whose average change in target counts above baseline across all syllables exceed the maximum average change in target counts exhibited by no-opsin control animals. These n = 9 animals were used for subsequent analyses of target kinematics and learning specificity (Extended Data Fig. 10).

Quantifying effects on syllables near the target in time

To assess whether syllables temporally adjacent to the target were reinforced as a result of optogenetic stimulation, syllables were identified that—on average—were near to the target in time. Specifically, the average time between all non-targeted syllables and the target was computed, along with their change in counts above baseline. Then, syllables were binned when they occurred on average relative to the target in syllable units in equally spaced bins from ten syllables before the target to ten syllables after. Finally, for each experiment, a weighted average of the change in counts above baseline for all syllables in each bin was computed, where a syllable’s weight was defined by its relative frequency in an experiment.

Quantifying effects on syllables whose velocity was similar to the target

To understand whether syllables with similar velocity profiles to the target were also reinforced, the average velocity from onset to offset for each syllable was computed and z-scored across instances within an experiment. Then, the average velocity of the target was subtracted from each syllable’s average velocity. Finally, the change in count above baseline for each syllable was binned by its target-velocity-difference.

Quantifying Opto-DA effects on movement parameters and sequence randomness

To quantify the effects of Opto-DA on movement parameters and sequence randomness over short timescales, sequence entropy, velocity (2D, angular and height velocity) and acceleration were estimated in five-syllable-long non-overlapping bins starting from stimulation onset. This window was chosen to minimize noise in downstream calculations while retaining reasonable time-resolution. To compensate for non-stationarities in behaviour across the experiments, mice, and targeted syllables, entropy, velocity and acceleration pre-stimulation-onset were subtracted from their values post-stimulation. Finally, these baseline-subtracted values were z-scored using the mean and s.d. estimated from catch trials.

Analysing the influence of dopamine on optogenetic reinforcement

Mice used to assess the influence of endogenous dopamine fluctuations on optogenetic reinforcement

As described above, eight mice injected with dLight and ChrimsonR were also run through closed-loop reinforcement experiments. The reinforcement experiment run with 250 ms 10mW CW stimulation enabled decoding analysis of how exogenous dopamine release altered usage of syllables during experiments in which ‘extra DA’ was added (Fig. 4g–i).

Predicting the amount of exogenously added dopamine during Opto-DA experiments using the decoding model

To predict the magnitude of exogenously evoked dLight fluorescence using the decoding model, dLight fluorescence on each instance in which the mouse expressed the target syllable and received stimulation was replaced with the average dLight fluorescence observed for the target syllable on catch trials during which there was no optogenetic stimulation. Then an offset (denoted as ‘extra DA’) was added to each syllable instance in which the mouse received stimulation. The likelihood of the syllable sequences expressed during Opto-DA experiments was computed for a range of extra DA offsets (and hence a range of exogenously added dopamine). The model was evaluated using the exact same procedure described in ‘Decoding model predicting behaviour from dLight’, except the repeated K-fold splits (5-fold split repeated 100 times) was performed over stimulation experiments. The ‘extra DA’ outputs of the model were compared to empirical photometric data collected from animals expressing dLight that underwent ChrimsonR-mediated closed-loop reinforcement (Fig. 4i).

Using the influence of endogenous dopamine to predict Opto-DA reinforcement

In order to assess whether the impact of dopamine at baseline could predict Opto-DA reinforcement, we used the correlation between dopamine fluctuations and syllable statistics (usage and entropy) within an experiment. Specifically, we computed the correlation between dLight levels and usage as outlined in ‘Analysing the moment-to-moment relationship between dLight and syllable statistics within an experiment’ (Fig. 2e,k), except correlations were assessed per mouse and per syllable. Values at each bin size were z-scored using the mean and s.d. correlations computed over shuffled data. Here, n = 100 shuffles were used for the correlation with entropy for computational efficiency. To determine the modulation depth of these correlation curves for each mouse and syllable, we used the s.d. of the correlation values across bin sizes. This resulted in a value that reflected the short-term influence of dopamine on usage (Endo-DA count) and entropy (Endo-DA entropy) for all syllable–mouse pairs. Finally, these estimates were averaged per mouse for Fig. 4b,c, and per syllable for Fig. 4d. Then, the log₂ fold change in target counts on stimulation days relative to baseline days was used as an estimate of Opto-DA learning. To mitigate mouse-to-mouse variability, the log₂ fold change in target counts was normalized by computing the log₂ fold change in target counts against all pairs of non-stimulation days per mouse. The mean and s.d. of this distribution was used to z-score Opto-DA learning per mouse.

Bayesian linear regression models were used in Fig. 4b,c. A normal prior was placed on the regression coefficients, and an exponential prior on the variance. Samples from the posterior were drawn via the no u-turn sampler (NUTS) using NumPyro (n = 1,000 warmup samples, n = 2,000 samples)⁷⁵. Performance was assessed using leave-two-out cross-validation. The linear regression model presented in Fig. 4f utilized a Huber regressor⁷³. Performance of the Huber regressors was assessed using fivefold cross-validation repeated five times.

Applying RL models to open field behaviour

Reinforcement-only RL model

RL models have four key components: a reward signal, a state, a state-dependent set of available actions, and a policy (which governs how actions are chosen). Here, a simple Q-learning agent with a softmax policy was designed to model mouse behaviour in the open field as an RL process over endogenous dopamine levels⁴⁴. Our model was recast (specifically a Q-learning agent with a softmax policy) to use endogenous dopamine (that is, syllable-associated dLight) as a reward signal, behavioural syllables as states, and transitions between behavioural syllables as actions. Given a syllable at time t + 1, the dLight peak occurring during the syllable at time t is considered the ‘reward’. The Q-table for the model was initialized with a uniform matrix with the diagonal set to 0, since by definition there are no self-transitions in our data. For every step of each simulation, given the currently expressed syllable (that is, the state), the model samples possible future syllables (actions) based on the behavioural policy and the expected dLight transient magnitude (expected reward, specified by the Q-table) associated with each syllable transition. Then, the model selected actions according to the softmax equation

$$p(a|s)=frac{{e}^{{Q}_{s}(a)/tau }}{mathop{sum }limits_{b=1}^{n}{e}^{{Q}_{s}(b)/tau }}$$

where τ is the temperature. The model is fed 30-min experiments of actual data. Data was formatted as a sequence of states and syllable-associated dopamine. Given the current state, the model selects an action according to the softmax equation. To update the Q-table and simulate the effect of endogenous dopamine as reward, the syllable-associated dopamine is presented to the model as reward in a standard Q-learning equation. Specifically, the Q-table was then updated according to

$$Q({s}_{t},{a}_{t})leftarrow Q({s}_{t},{a}_{t})+alpha [{r}_{t+1}+gamma {max}_{a}Q({s}_{t+1},a)-Q({s}_{t},{a}_{t})]$$

where Q is the Q-table that defines the probability of action a while in state s, α is the learning rate, r is the reward associated with action a and state s (the dLight peak value at the transition between syllable a and syllable s), and γ is the discount factor. Performance was assessed by taking the Pearson correlation between the model’s resulting Q-table at the end of the simulation and the empirical transition matrix observed in the experimental data. Here, each row of the empirical transition matrix and the Q-table were separately z-scored prior to computing the Pearson correlation. Note that the learned Q-table is functionally equivalent to a transition matrix in this formulation. To avoid degradation in performance due to syllable sparsity, the top 10 syllables were used.

Dynamic RL model

To account for the short-term effect of dopamine on sequence randomness, a dopamine-dependent term was added to the baseline model’s policy

$$p(a|s)=frac{{e}^{{Q}_{s}(a)/tau (t)}}{mathop{sum }limits_{b=1}^{n}{e}^{{Q}_{s}(b)/tau ({rm{t}})}}$$

where temperature is now time-dependent and evolves according to,

$$tau left(tright)=Ileft(tright)exp left(frac{t-n}{{{tau }}_{{rm{decay}}}}right)+,{tau }_{{rm{baseline}}}$$

and,

$$Ileft(tright)=v,,text{if},,rleft(tright)ge lambda $$

Here, τ_decay corresponds to the time constant with which dopamine’s effect on temperature decays, τ_baseline is the baseline temperature, ν is the amount by which temperature is increased if the r(t) goes above the threshold λ, and n is the number of timesteps after the threshold has been crossed. Experiments were split into training and test datasets via twofold cross-validation, and the training set was used to fit all free parameters. To compare the dynamic to the reinforcement-only model, v was set to 0—this turns off the temperature varying component of the dynamic model. Note that we observe qualitatively similar results under an alternative formulation. Rather than feeding the model 30-min sessions of actual data, we allow the model to freely select actions, and reward was randomly drawn from dLight peaks associated with that action in actual data.

Reward-prediction error model variant

Models were fit using observed dopamine magnitude as either the (1) reward term (see above) or (2) reward-prediction error term ([{r}_{t+1}+gamma {max}_{a}Q({s}_{t+1},a)-Q({s}_{t},{a}_{t})]). For each model type, a grid search was performed across values of α (learning rate), γ (discount factor, used in the reward model only), and temperature (randomness of the next action). Held-out log likelihood was computed for each fit and z-scored using the mean and variance of the held-out log likelihood from models fit to data shuffled between experiments (n = 10 shuffles). This comparison is only valid for our particular model formulation. There are alternative formulations for which dopamine acting as a reward-prediction error are consistent with our data.

Statistics

All hypothesis tests were non-parametric. Effect sizes for Mann–Whitney U tests are presented as the common language effect size f. Correlations were established as significant by comparing to n = 1,000 shuffled correlations (referred to as the shuffle test throughout the manuscript). For shuffle test if all correlations exceeded the 1,000 shuffles, the P-value is listed as P < 0.001 rather than P = 0. P-values were adjusted to account for multiple comparisons where appropriate using the Holm–Bonferonni stepdown procedure. Sample sizes were not pre-determined but are consistent with sample sizes typically used in the field. For examples using similar techniques see^10,14. Blinding was not performed, but MoSeq-based analysis of behaviour was automated.

Plotting

Box plots (here and throughout) obey standard conventions: edges represent the first and third quartiles, whereas whiskers extend to include the furthest data point within 1.5 interquartile ranges of either the first or third quartile.

Software packages

In addition to analysis-specific packages cited in the relevant sections above, the following packages were used for analysis: NumPy⁸³, Python⁸⁴, Seaborn⁸⁵, Matplotlib⁸⁶ and Python 3 (ref. ⁸⁷).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Source link