May 19, 2024

An atlas of gene regulatory elements in adult mouse cerebrum – Nature

Tissue preparation and nuclei isolation

All experimental procedures using live animals were approved by the SALK Institute Animal Care and Use Committee under protocol number 18-00006. Adult C57BL/6J male mice were purchased from Jackson Laboratories. Brains were extracted from 56–63-day-old mice and sectioned into 600-μm coronal sections along the anterior-posterior axis in ice-cold dissection media62,63, by using a brain slicer from CNS muSlice LLC. Specific brain regions were dissected according to the Allen Brain Reference Atlas28 (Extended Data Fig. 1) and nuclei isolated as previously described63. Regions were pooled from 3–31 of the same sex to obtain enough nuclei for snATAC-seq for each biological replica, and two biological replicas were processed for each dissection region.

snATAC-seq using combinatorial indexing

snATAC-seq was performed as described with steps optimized for automation25,26. A step-by-step-protocol for library preparation is available at: https://www.protocols.io/view/sequencing-open-chromatin-of-single-cell-nuclei-sn-pjudknw/abstract.

Brain nuclei were pelleted with a swinging bucket centrifuge (500g, 5 min, 4 °C; 5920R, Eppendorf). Nuclei pellets were resuspended in 1 ml nuclei permeabilization buffer (5% BSA, 0.2% IGEPAL-CA630, 1 mM DTT and cOmpleteTM, EDTA-free protease inhibitor cocktail (Roche) in PBS) and pelleted again (500g, 5 min, 4 °C; 5920R, Eppendorf). Nuclei were resuspended in 500 μl high-salt tagmentation buffer (36.3 mM Tris-acetate (pH 7.8), 72.6 mM potassium-acetate, 11 mM Mg-acetate, 17.6% dimethylformamide) and counted using a haemocytometer. Concentration was adjusted to 1,000–4,500 nuclei per 9 μl, and 1,000–4,500 nuclei were dispensed into each well of a 96-well plate. For tagmentation, 1 μl barcoded Tn5 transposomes26 were added using a BenchSmart 96 (Mettler Toledo, RRID:SCR_018093) (Supplementary Table 26), mixed five times, and incubated for 60 min at 37 °C with shaking (500 rpm). To inhibit the Tn5 reaction, 10 μl of 40 mM EDTA was added to each well with a BenchSmart 96 (Mettler Toledo, RRID:SCR_018093) and the plate was incubated at 37 °C for 15 min with shaking (500 rpm). Next, 20 μl 2× sort buffer (2% BSA, 2 mM EDTA in PBS) were added using a BenchSmart 96 (Mettler Toledo, RRID:SCR_018093). All wells were combined into a FACS tube and stained with 3 μM Draq7 (Cell Signaling) (Extended Data Fig. 23). Using a SH800 (Sony), 20 nuclei were sorted per well into eight 96-well plates (total of 768 wells, 30,720 nuclei total, 15,360 nuclei per sample) containing 10.5 μl EB (25 pmol primer i7, 25 pmol primer i5, 200 ng BSA (Sigma). If processing two samples per day, tagmentation was performed with different sets of barcodes in separate 96 well plates. After tagmentation nuclei from individual plates were pooled together. Preparation of sort plates and all downstream pipetting steps were performed on a Biomek i7 Automated Workstation (Beckman Coulter, RRID:SCR_018094). After the addition of 1 μl 0.2% SDS, samples were incubated at 55 °C for 7 min with shaking (500 rpm). Then, 1 μl 12.5% Triton X-100 was added to each well to quench the SDS. Next, 12.5 μl NEBNext High-Fidelity 2 × PCR Master Mix (NEB) were added and samples were amplified by PCR (72 °C 5 min, 98 °C 30 s, (98 °C 10 s, 63 °C 30 s, 72 °C 60 s) × 12 cycles, held at 12 °C). After PCR, all wells were combined. Libraries were purified according to the MinElute PCR Purification Kit manual (Qiagen) using a vacuum manifold (QIAvac 24 plus, Qiagen) and size selection was performed with SPRI Beads (Beckmann Coulter, 0.55x and 1.5x). Libraries were purified one more time with SPRI Beads (Beckmann Coulter, 1.5x). Libraries were quantified using a Qubit fluorimeter (Life Technologies, RRID:SCR_018095) and the nucleosomal pattern was verified using a Tapestation (High Sensitivity D1000, Agilent). Libraries generated with indexing version 1 (Supplementary Table 1) were sequenced on a HiSeq2500 sequencer (RRID:SCR_016383, Illumina) using custom sequencing primers, 25% spike-in library and following read lengths: 50 + 43 + 37 + 50 (Read1 + Index1 + Index2 + Read2). Libraries generated with indexing version 2 (Supplementary Table 1) were sequenced on a HiSeq4000 (RRID:SCR_016386, Illumina) using custom sequencing primers with following read lengths: 50 + 10 + 12 + 50 (Read1 + Index1 + Index2 + Read2). Indexing primers and sequencing primers are in Supplementary Table 26. The nuclei indexing version (v1 or v2) used for each library is indicated in Supplementary Table 26.

Nuclei indexing scheme

To generate snATAC-seq libraries we used initially an indexing scheme as previously described (version 1)22,25. Here, 16 p5 and 24 p7 indexes were combined to generate an array of 384 indexes for tagmentation and 16 i5 as well as 48 i7 indexes were combined for an array of 768 PCR indexes. Owing to this library design, it is required to sequence all four indexes to assign a read to a specific nucleus with long reads and a constant base sequence for both indices reads between i and p barcodes. Therefore, the resulting libraries were sequenced with 25% spike-in library on a HiSeq2500 (RRID:SCR_016383) and these read lengths: 50 + 43 + 37 + 50 (ref. 25).

To generate libraries compatible with other sequencers and not requiring spike-in libraries or custom sequencing recipes, we modified the library scheme (Version 2). For this, we used 384 individual indices for T7 and combined with one T5 with a universal index sequence for tagmentation (for a total of 384 tagmentation indexes). For PCR, we used 768 different i5 indexes and combined with a universal i7 primer index sequence. Tagmentation indexes were 10 bp and PCR indexes 12-bp long. We made sure, that the hamming distance between every two barcodes was ≥ 4, the GC content between 37.5–62.5%, and the number of repeats ≤ 3. The resulting libraries were sequenced on a HiSeq4000 with custom primers and these read lengths: 50 + 10 + 12 + 50 (Supplementary Table 26).

snATAC-seq data using the 10x Chromium platform

Brain nuclei were pelleted with a swinging bucket centrifuge (500g, 5 min, 4 °C; 5920R, Eppendorf). Nuclei pellets were resuspended in 1 ml nuclei permeabilization buffer (5% BSA, 0.2% IGEPAL-CA630, 1 mM DTT and cOmpleteTM, EDTA-free protease inhibitor cocktail (Roche) in PBS). Nuclei were pelleted again (500g, 5 min, 4 °C; 5920R, Eppendorf) and washed with wash buffer (10 mM Tris-HCl (pH 7.5), 10mM NaCl, 3 mM MgCl2, 0.1% Tween-20, and 1% BSA (Proliant 7500804) in molecular biology-grade water). Nuclei were pelleted (500g, 5 min, 4 °C; 5920R, Eppendorf) and resuspended in 30 μl of 1× nuclei Buffer (10x Genomics). Nuclei were counted using a haemocytometer, and 15,360 nuclei were used for tagmentation. Single-cell ATAC-seq libraries were generated using the Chromium Single Cell ATAC Library & Gel Bead Kit (10x Genomics, 1000110), Chromium Chip E Single Cell ATAC kit (10x Genomics, 1000155) and Chromium i7 Multiplex Kit N, Set A (10x Genomics, 1000084) following manufacturer instructions. Final libraries were quantified using a Qubit fluorimeter (Life Technologies) and the nucleosomal pattern was verified using a Tapestation (High Sensitivity D1000, Agilent). Libraries were sequenced on NextSeq 500 and NovaSeq 6000 sequencers (Illumina) with following read lengths: 50 + 8 + 16 + 50 (Read1 + Index1 + Index2 + Read2). After demultiplexing, the Index2 (cell index) was transferred to the read name, in order to keep the same fastq format for downstream processing.

Processing and alignment of sequencing reads

Paired-end sequencing reads were demultiplexed and the cell index transferred to the read name. Sequencing reads were aligned to mm10 reference genome using bwa64. After alignment, we used the R package ATACseqQC (1.10.2)65 to check for fragment length contribution which is characteristic for ATAC-seq libraries. Next, we combined the sequencing reads to fragments, and for each fragment we performed the following quality control: (1) keep only fragments quality score MAPQ > 30; (2) keep only the properly paired fragments with length < 1,000 bp; (3) PCR duplicates were further removed with SnapTools (https://github.com/r3fang/SnapTools, RRID:SCR_018097)26. Reads were sorted on the basis of the cell barcode in the read name.

TSS enrichment calculation

Enrichment of ATAC-seq accessibility at TSSs was used to quantify data quality without the need for a defined peak set. The method for calculating enrichment at TSS was adapted from previously described. TSS positions were obtained from the GENCODE database v16 (RRID:SCR_014966)38. In brief, Tn5 corrected insertions (reads aligned to the positive strand were shifted +4 bp and reads aligned to the negative strand were shifted –5 bp) were aggregated ± 2,000 bp relative (TSS strand-corrected) to each unique TSS genome-wide. Then this profile was normalized to the mean accessibility ± 1,900–2,000 bp from the TSS and smoothed every 11 bp. The max of the smoothed profile was taken as the TSS enrichment.

Doublet removal

We used a modified version of Scrublet (RRID:SCR_018098)66 to remove potential doublets for every dataset independently. Peaks were called using MACS2 scores for aggregate accessibility profiles on each sample. Next, cell-by-peak count matrices were calculated and used as input, with default parameters. Doublet scores were calculated for both observed nuclei {xi} and simulated doublets {yi} using Scrublet (RRID:SCR_018098)66. Next, a threshold θ is selected based on the distribution of {yi}, and observed nuclei with doublet score larger than θ are predicted as doublets. To determine θ, we fit a two-component mixture distribution by using function normalmixEM from R package mixtools67. The lower component contained most embedded doublet types, and the other component contained majority of neo-typic doublets (collision between nuclei from different clusters. We selected the threshold θ where the ({p}_{1}cdot {rm{pdf}}(x,,{mu }_{1},{sigma }_{1})={p}_{2}cdot {rm{pdf}}(x,,{mu }_{2},{sigma }_{2})) in which p denotes probability; pdf denotes probability density function. This value suggested that the nuclei have same chance of belonging to both classes.

Cell clustering

We used an iterative clustering strategy using the snapATAC26 package (RRID:SCR_018097) with slight modifications as detailed below. For round 1 clustering, we clustered and finally merged single nuclei to three main cell classes: non-neurons, GABAergic neurons, and glutamatergic neurons. For each main cell class, we performed another round of clustering to identify major cell subclasses. Last, for each subclass, we performed a third round of clustering to find cell types.

Detailed description for every step is as follows. (1) Nuclei filtering. Nuclei with ≥1,000 uniquely mapped fragments and TSS enrichment > 10 were filtered for individual dataset. Second, potential barcode collisions were also removed for individual datasets. (2) Feature bin selection. First, we calculated a cell-by-bin matrix at 5-kb resolution for every dataset independently and subsequently merged the matrices. Second, we converted the cell-by-bin count matrix to a binary matrix. Third, we filtered out any bins overlapping with the ENCODE blacklist (mm10, http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/mm10-mouse/mm10.blacklist.bed.gz). Fourth, we focused on bins on chromosomes 1–19, X and Y. Last, we removed the top 5% bins with the highest read coverage from the count matrix. (3) Dimensionality reduction. SnapATAC applies a nonlinear dimensionality reduction method called diffusion maps, which is highly robust to noise and perturbation26. However, the computational time of the diffusion maps algorithm scales exponentially with the increase of number of cells. To overcome this limitation, we combined the Nyström method (a sampling technique)68 and diffusion maps to present Nyström landmark diffusion map to generate the low-dimensional embedding for large-scale dataset.

A Nyström landmark diffusion maps algorithm includes three major steps:

(1) Sampling. We sampled a subset of K (K N) cells from N total cells as ‘landmarks’.

(2) Embedding. We computed a diffusion map embedding for K landmarks.

(3) Extension. We projected the remaining (– K) cells onto the low-dimensional embedding as learned from the landmarks to create a joint embedding space for all cells. Having more than 800,000 single nuclei at the beginning, we decided to apply this strategy on round 1 and 2 clustering. A total of 10,000 cells were sampled as landmarks and the remaining query cells were projected onto the diffusion maps embedding of landmarks. Later for the round 3 clustering, diffusion map embeddings were directly calculated from all nuclei.

(4) Eigenvector selection. To determine the number of eigenvectors of diffusion operator to include for downstream analysis, we generated an ‘elbow plot’, to rank all eigenvectors on the basis of the percentage of variance explained by each one. For each round of clustering, we selected the top 10–20 eigenvectors that captured most of the variance.

(5) Graph-based clustering. Using the selected significant eigenvectors, we next construct a k-nearest neighbour graph. Each cell is a node and the k-nearest neighbours of each cell were identified according to the Euclidian distance and edges were drawn between neighbours in the graph. Next, we applied the Leiden algorithm on the k-nearest neighbour graph using Python package leidenalg (https://github.com/vtraag/leidenalg)69.

(6) Optimization on cluster resolution. We tested different ‘resolution_parameter’ parameters (step between 0 and 1 by 0.1) to determine the optimal resolution for different cell populations. For each resolution value, we tested whether there was clear separation between nuclei. To do so, we generated a cell-by-cell consensus matrix in which each element represents the fraction of observations two nuclei are part of the same cluster. A perfectly stable matrix would consist entirely of zeros and ones, meaning that two nuclei either cluster together or not in every iteration. The relative stability of the consensus matrices can be used to infer the optimal resolution. To this end, we generated a consensus matrix based on 300 rounds of Leiden clustering with randomized starting seed s. Let Ms denote the N × N connectivity matrix resulting from applying Leiden algorithm to the dataset Ds with different seeds. The entries of Ms are defined as follows:

$${M}^{s}(i,j)=f(x)=left{begin{array}{c}1,,{rm{if}},{rm{single}},{rm{nucleus}},i,{rm{and}},j,{rm{belong}},{rm{to}},{rm{the}},{rm{same}},{rm{cluster}}\ 0,,{rm{otherwise}}end{array}right.$$

Let Is be the N × N identicator matrix in which the (i, j)-th entry is equal to 1 if nucleus i and j are in the same perturbed dataset Ds, and 0 otherwise. Then, the consensus matrix C is defined as the normalized sum of all connectivity matrices of all the perturbed Ds.

$$C(i,j)=,left(frac{{sum }_{s=1}^{S}{M}^{s}(i,j)}{{sum }_{s=1}^{S}{I}^{s}(i,j)}right)$$

The entry (i,j) in the consensus matrix is the number of times single nucleus i and j were clustered together divided by the total number of times they were selected together. The matrix is symmetric, and each element is defined within the range ([0,1]). We examined the cumulative distribution function (CDF) curve and calculated proportion of ambiguous clustering (PAC) score to quantify stability at each resolution. The resolution with a local minimum of the PAC scores denotes the parameters for the optimal clusters. In the case these were multiple local minimal PACs, we picked the one with higher resolution. Another measurement is dispersion coefficient, which reflects the dispersion (ranges from 0 to 1) of the consensus matrix M from the value 0.5. The closer to 1 is the dispersion coefficient, the more perfect is consensus matrix, and thus the more stable is the clustering. In a perfect consensus matrix, all entries are 0 or 1, meaning that all connectivity matrices are identical. The dispersion coefficient is defined as:

$$rho =frac{1}{{n}^{2}}sum _{i=1}sum _{j=1}4{left(M(i,j)-frac{1}{2}right)}^{2}$$

Finally, for every cluster, we tested whether we could identify differential features compared to all other nuclei (background) and the nearest nuclei (local background) using the function ‘findDAR’.

(7) Visualization. For visualization, we applied UMAP58. Using the cell embedding, we applied both k-nearest neighbor batch effect test (kBET) and local inverse Simpson’s index (LISI) analysis to test the robustness of the clutering results to variation of sequencing depth, signal-to-noise ratios, and batches.

Clustering for adult neurogenesis lineages

We performed separated cell clustering following above strategy for two lineages:

1. Adult neurogenesis in the SGZ: we extracted 83,583 nuclei from 8 brain dissections at or surrounding the SGZ, including CA-1, CA-2, CA-3, CA-4, DG-1, DG-2, DG-3, DG-4 (Supplementary Table 1). Then, we performed cell clustering on 83,583 nuclei for 6 cell types: astrocytes (ASC); dentate gyrus radial glia-like cells; NIPCs; granule neuroblasts (DGNBL1/2); and granule neurons.

2. Adult neurogenesis in the SVZ: we extracted 25,923 nuclei from 8 brain dissections at or surrounding the SVZ, including MOB, AON, ACB-1, ACB-2 CP-1, CP-2, LSX-1 and LSX-2 (Supplementary Table 1). Then, we performed cell clustering on 25,923 nuclei for 5 cell types: astrocytes; subventricular zone radial glia-like cells; neuronal intermediate progenitor cells; neuroblasts (OBNBL); and inhibitory neurons in olfactory (OBGA1).

Dendrogram construction for mouse brain cell types

First, we calculated for cCRE the median accessibility per cluster and used this value as cluster centroid. Next, we calculated the coefficient of variant for the cluster centroid of each element across major cell types. Finally, we only kept variable elements with coefficient of variants that were larger than 1.5 for dendrogram construction.

We used the set of variable features defined above to calculate a correlation-based distance matrix. Next, we performed linkage hierarchical clustering using the R package pvclust (v.2.0)70 with parameters method.dist = “cor” and method.hclust = “ward.D2”. The confidence for each branch of the tree was estimated by the bootstrap resampling approach with 1,000 rounds.

Regional specificity of cell types

The specificity score is defined as Jensen–Shannon divergence, which measures the similarity between two probability distributions. For each cell type, the contribution of different brain regions was first calculated. Then, we compared this distribution with the contribution of brain regions calculated from all sampled cells. We used the function ‘JSD’ from R package philentropy for this analysis71.

Identification of reproducible peak sets in each cell cluster

We performed peak calling according to the ENCODE ATAC-seq pipeline (https://www.encodeproject.org/atac-seq/). For every cell cluster, we combined all properly paired reads to generate a pseudo-bulk ATAC-seq dataset for individual biological replicates. In addition, we generated two pseudo-replicates that comprise half of the reads from each biological replicate. We called peak for each of the four datasets and a pool of both replicates independently. Peak calling was performed on the Tn5-corrected single-base insertions using the MACS2 score30 with these parameters:–shift -75–extsize 150–nomodel–call-summits–SPMR -q 0.01. Finally, we extended peak summits by 250 bp on either side to a final width of 501 bp for merging and downstream analysis. To generate a list of reproducible peaks, we kept peaks that (1) were detected in the pooled dataset and overlapped ≥50% of peak length with a peak in both individual replicates; or (2) were detected in the pooled dataset and overlapped ≥50% of peak length with a peak in both pseudo-replicates.

We found that when cell population varied in read depth or the number of nuclei, the MACS2 score varied proportionally owing to the nature of the Poisson distribution test in MACS2 scores30. Ideally, we would perform a reads-in-peaks normalization, but in practice, this type of normalization was not possible because we did not know how many peaks we would get. To account for differences in performance of MACS2 scores30 on the basis of read depth and/or number of nuclei in individual clusters, we converted MACS2 peak scores (−log10(q-value)) to ‘score per million’72. We filtered reproducible peaks by choosing a score-per-million cut-off of 2 to filter reproducible peaks.

We only kept reproducible peaks on chromosome 1–19 and both sex chromosomes, and filtered ENCODE mm10 blacklist regions (mm10, http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/mm10-mouse/mm10.blacklist.bed.gz). A union peak list for the whole dataset was obtained by merging peak sets from all cell clusters using BEDtools (RRID:SCR_006646)73.

Lastly, because snATAC-seq data are very sparse, we selected only elements that were identified as open chromatin in a significant fraction of the cells in each cluster. To this end, we first randomly selected the same number of non-DHS regions (approximately 670,000 elements) from the genome as background and calculated the fraction of nuclei for each cell type that showed a signal at these sites. Next, we fitted a zero-inflated beta model, and empirically identified a significance threshold of FDR < 0.01 to filter potential false positive peaks. Peak regions with FDR < 0.01 in at least one of the clusters were included in downstream analysis.

Computing chromatin accessibility scores

Accessibility of cCREs in individual clusters was quantified by counting the fragments in individual clusters normalized by read depth (CPM). For each gene, we summed up the counts within the gene body +2 kb upstream to calculate the gene activity score. The gene activity score were used for integrative analysis with scRNA-seq. For better visualization, we smoothed the gene activity score to 50 nearest neighbour nuclei using Markov affinity-based graph imputation of cells (MAGIC)74.

Integrative analysis of snATAC-seq and scRNA-seq datasets

For integrative analysis, we downloaded level 5 clustering data from the Mouse Brain Atlas website (http://mousebrain.org)2. First, we filtered brain regions that matched samples profiled in this study using these attributes for ‘region’: ‘CNS’, ‘cortex’, ‘hippocampus’, ‘hippocampus,cortex’, ‘olfactory bulb’, ‘striatum dorsal’, ‘striatum ventral’, ‘dentate gyrus’, ‘striatum dorsal, striatum ventral’, ‘striatum dorsal, striatum ventral, dentate gyrus’, ‘pallidum’, ‘striatum dorsal, striatum ventral, amygdala’, ‘striatum dorsal, striatum ventral’, ‘telencephalon’, ‘brain’ and ‘sub ventricular zone, dentate gyrus’.

Second, we manually subset cell types into three groups by checking the attribute in ‘taxonomy_group’: non-neurons: ‘vascular and leptomeningeal cells’, ‘astrocytes’, ‘oligodendrocytes’, ‘ependymal cells’, ‘microglia’, ‘oligodendrocyte precursor cells’, ‘olfactory ensheathing cells’, ‘pericytes’, ‘vascular smooth muscle cells’, ‘perivascular macrophages’, ‘dentate gyrus radial glia-like cells’, ‘subventricular zone radial glia-like cells’, ‘vascular smooth muscle cells’, ‘vascular endothelial cells’, ‘vascular and leptomeningeal cells’; gabaergic neurons: ‘non-glutamatergic neuroblasts’, ‘telencephalon projecting inhibitory neurons’, ‘olfactory inhibitory neurons’, ‘glutamatergic neuroblasts’, ‘cholinergic and monoaminergic neurons’, ‘di- and mesencephalon inhibitory neurons’, ‘telencephalon inhibitory interneurons’, ‘peptidergic neurons’; glutamatergic neurons: ‘dentate gyrus granule neurons’, ‘di- and mesencephalon excitatory neurons’, ‘telencephalon projecting excitatory neurons’.

To directly compare our single-nucleus chromatin accessibility-derived cell clusters with the single-cell transcriptomics defined taxonomy of the mouse brain2, we first used the snATAC-seq data to impute RNA expression levels (gene activity scores) according to the chromatin accessibility of gene promoter and gene body as previously described75. We performed integrative analysis with scRNA-seq using Seurat 3.0 (RRID:SCR_016341) to compare cell annotation between different modalities75. We randomly selected 200 nuclei (and used all nuclei for cell cluster with fewer than 200 nuclei) from each cell cluster for integrative analysis. We first generated a Seurat object in R by using previously calculated gene activity scores, diffusion map embeddings and cell cluster labels from snATAC-seq. Then, variable genes were identified from scRNA-seq and used for identifying anchors between these two modalities. Finally, to visualize all the cells together, we co-embedded the scRNA-seq and snATAC-seq profiles in the same low dimensional space.

To quantify the similarity between cell clusters from two modalities, we calculated an overlapping score as the sum of the minimum proportion of cells or nuclei in each cluster that overlapped within each co-embedding cluster5. Cluster overlaps varied from 0 to 1 and were visualized as a heat map with snATAC-seq clusters in rows and scRNA-seq clusters in columns. We found strong correspondence between the two modalities, which was evidenced by co-embedding of both transcriptomic (T-type) and chromatin accessibility (A-type) cells in the same RNA–ATAC joint clusters (Extended Data Fig. 10a, Supplementary Table 5). For this analysis, we examined GABAergic neurons, glutamatergic neurons and non-neuronal cell classes separately (Extended Data Fig. 10, Supplementary Table 5).

Identification of cis-regulatory modules

We used non-negative matrix factorization76 to group cCREs into cis-regulatory modules on the basis of their relative accessibility across major clusters. We adapted non-negative matrix factorization (Python package: sklearn77) to decompose the cell-by-cCRE matrix V (N × M, N rows: cCRE, M columns: cell clusters) into a coefficient matrix H (R × M, R rows: number of modules) and a basis matrix W (N × R), with a given rank R:

The basis matrix defines module related accessible cCREs, and the coefficient matrix defines the cell cluster components and their weights in each module. The key issue to decompose the occupancy profile matrix was to find a reasonable value for the rank R (that is, the number of modules). Several criteria have been proposed to decide whether a given rank R decomposes the occupancy profile matrix into meaningful clusters. Here we applied two measurements ‘sparseness’78 and ‘entropy’79 to evaluate the clustering result. Average values were calculated from 100 times for non-negative matrix factorization runs at each given rank with random seed, which will ensure the measurements are stable.

Next, we used the coefficient matrix to associate modules with distinct cell clusters. In the coefficient matrix, each row represents a module and each column represents a cell cluster. The values in the matrix indicate the weights of clusters in their corresponding module. The coefficient matrix was then scaled by column (cluster) from 0 to 1. Subsequently, we used a coefficient > 0.1 (approximately the 95th percentile of the whole matrix) as threshold to associate a cluster with a module.

In addition, we associated each module with accessible elements using the basis matrix. For each element and each module, we derived a basis coefficient score, which represents the accessible signal contributed by all cluster in the defined module. In addition, we also implemented and calculated a basis-specificity score called ‘feature score’ for each accessible element using the ‘kim’ method79. The feature score ranges from 0 to 1. A high feature score means that a distinct element is specifically associated with a specific module. Only features that fulfil both of the following criteria were retained as module specific elements: (1) feature score greater than median + 3 standard deviations; (2) the maximum contribution to a basis component is greater than the median of all contributions (that is, of all elements of W).

Identification of differentially accessible regions and definition of specificity score

To identify cCREs that were differentially accessible either in subtypes or in brain regions, we constructed a logistic regression model predicting cluster/region membership based on each cCRE individually and compares this to a null model with a likelihood ratio test. We used two functions ‘fit_models’ and ‘compare_models’ in R package Monocle3 (v.0.2.2)80 to perform the differential test. We designed the full model as

$${rm{logit}}({P}_{ij})=,{a}_{j}+,{m}_{j}+{r}_{j}+{varepsilon }_{j}$$

and a reduced mode as

$${rm{logit}}({P}_{ij})=,{a}_{j}+{r}_{j}+{varepsilon }_{j}$$

in which Pij represents the probability of ith site is accessible in the jth cell, a is the log10-transformed total number of sites observed as accessible for the jth cell, m is membership of the jth cell in either cluster or region being tested, r is the replicate label for jth cell and ε is an error term.

For each set of testing, between subtypes or between regions in cell type, we kept only cCREs that overlapped with peaks identified in corresponding cell types. A likelihood ratio test was then used to determine whether the full model (including cell cluster or region membership) provided a significantly better fit of the data than the reduced model. After correcting P values using Benjamini–Hochberg method, we set an FDR cut-off as 0.001 to filter out significant differential cCREs.

The log2-transfomed fold change is used for two-group comparison, for multiple groups, we calculated a Jensen–Shannon divergence-based specificity score previously described22 to better assign differential cCREs to cell type or brain region. The fraction of accessibility of each cluster f was first calculated for each ith site. We normalized these scores by multiplying by corresponding scaling factors, which are considering different overall complexity across groups. To do so, median number of sites accessible c in individual cells for each cluster was calculated and followed with log10-transforming. Then, we took the ratio of the average of c (across all clusters) over the median accessibility in each cluster as scaling factor. These corrected fraction of accessibility for each cCRE was then converted to probability by scaling by groups. Then, we calculated Jensen–Shannon divergence between two probability distributions. For example, the probability distribution for the first cCRE as d1, to test whether this cCRE is specific in group 1, we assumed another probability distribution:

$$d2=left{begin{array}{cc}1, & {rm{g}}{rm{r}}{rm{o}}{rm{u}}{rm{p}},1\ 0, & {rm{o}}{rm{t}}{rm{h}}{rm{e}}{rm{r}}{rm{w}}{rm{i}}{rm{s}}{rm{e}}end{array}right.$$

Function ‘JSD’ in R package ‘philentropy’ was used to calculate Jensen–Shannon divergence between these two probability distributions, and Jensen–Shannon-based specificity (JSS) scores was defined as:

$${rm{JSS}}=1-,sqrt{{rm{Jensen}},{rm{Shannon}},{rm{divergence}}}$$

For each group, we calculated the JSS score for every cCRE. To find a reasonable cut-off to determine restricted or general cCREs, we consider JSS scores from all cCREs that are not identified as differential accessible (from likelihood ratio test) as a background distribution, and JSS scores from cCREs that passed our likelihood ratio test threshold and had positive values to be true positives. We set an empirical FDR cut-off where the type I error was no more than 5%.

Finally, the differential cCREs could be aligned to several cell types or brain regions based on the JSS score, we named the one can be assigned to only one type or region as region-specific or cell-type-specific cCREs.

Comparison between the regional specificity of cell types defined by snATAC-seq data and the spatial ISH signals of cell-type-specific genes

To validate the regional specificity of cell types, we took advantage of the spatially mapped quantified ISH expression from ABA44 in five matched major brain structures, isocortex, olfactory areas, hippocampal formation (HPF), striatum (STR), pallidum. We used the ‘differential search’ function to identify 10,269 genes with increased expression in these five brain regions compared to all brain regions ‘grey matter’ (expression level >1 and fold change > 1). We also identified cell-type-specific genes using Seurat75 with default parameters for each joint cluster from integrative analysis (fold change > 1 and FDR < 0.05) (Extended Data Fig. 10). 505 cell-type-specific genes (range from 1 to 53, 15 on average) overlapped with the list of genes with increased expression in the five brain regions. For each cell type, we calculated the regional specificity score (see previous section: regional specificity of cell types) on the basis of the relative contribution from five brain regions estimated from snATAC-seq datasets, and also a coefficient of variation based on averaged normalized ISH signals of cell-type-specific marker genes. For each cell-type-specific gene, we calculated the PCC between cell composition in five brain structures and spatial expression levels across the five brain structures derived from ISH.

Because the astrocyte subtypes identified in our study were not resolved in scRNA-seq studies, we identified subtype-specific genes for astrocyte subtypes using chromatin accessibility from snATAC-seq using a likelihood ratio test. The cell-type-specific genes were filtered by FDR less than 0.001 from the likelihood ratio test and empirical FDR cut-off of no more than 5% for JSS scores. Then, we calculated the fraction of overlap between spatially mapped ISH genes from different brain structures and genes with astrocyte subtype-specific accessibility.

Predicting enhancer–promoter interactions

First, co-accessible regions were identified for all open regions in each cell cluster (randomly selected 200 nuclei, and using all nuclei for cell cluster with <200 nuclei) separately, using Cicero37 with following parameters: aggregation k = 10, window size = 500 kb, distance constraint = 250 kb. To find an optimal co-accessibility threshold for each cluster, we generated a random shuffled cCRE-by-cell matrix as background and identified co-accessible regions from this shuffled matrix. We fitted the distribution of co-accessibility scores from random shuffled background into a normal distribution model by using the R package fitdistrplus81. Next, we tested every co-accessibility pairs and set the cut-off at co-accessibility score with an empirically defined significance threshold of FDR < 0.01.

CCRE outside of ±1 kb of the TSS in GENCODE mm10 (v.16, RRID:SCR_014966)38. were considered distal. Next, we assigned co-accessibility pairs to three groups: proximal-to-proximal, distal-to-distal, and distal-to-proximal. In this study, we focus only on distal-to-proximal pairs. We further used RNA expression from matched T-types to filter out pairs that were linked to non-expressed genes (normalized UMI < 5).

We calculated PCC between gene expression and cCRE accessibility across joint RNA-ATAC clusters to examine the relationship between co-accessibility pairs. To do so, we first aggregated all nuclei or cells from scRNA-seq and snATAC-seq for every joint cluster to calculate accessibility scores (log2(CPM)) and relative expression levels (log2(normalized UMI)). Then, PCC was calculated for every cCRE–gene pair within a 1-Mb window centred on the TSS for every gene. We also generated a set of background pairs by randomly selecting regions from different chromosomes and shuffling of cluster labels. Finally, we fit a normal distribution model and defined a cut-off at PCC score with empirically defined significance threshold of FDR < 0.01, to select significant positively correlated cCRE–gene pairs.

Identification of candidate driver transcription factors

We used the Taiji pipeline45 to identify candidate driver transcription factors in cell clusters. In brief, for each cell type cluster, we constructed the transcription factor regulatory network by scanning transcription factor motifs at the accessible chromatin regions and linking them to the nearest genes. The network is directed with edges from transcription factors to target genes. The weights of the genes in the network were determined on the basis of the RNA expression level (gene activity score for SGZ NIPCs only, because there is no corresponding T-type) of corresponding T-types. The weights of the edges were calculated by the relative accessibility of the promoters of the source transcription factors. We then used the personalized PageRank algorithm to rank the transcription factors in the network. The output of Taiji pipeline is transcription-factor-by-cell type matrix with PageRank scores. From the output matrix, we calculated coefficient of variation across cell types. To identify driver transcription factors, we used following criteria: (1) transcription factors have FDR less than 0.001; (2) transcription factors have coefficient of variant larger than the mean of coefficient of variant; (3) PageRank score should be ranked in the top 25% of all transcription factors for each cell type; (4) RNA expression level (CPM) is larger than 20, which we consider as an expressed transcription factor in corresponding cell type.

Motif enrichment

We performed both de novo and known motif enrichment analysis using Homer (v.4.11, RRID:SCR_010881)61. For cCREs in the consensus list, we scanned a region of ±250 bp around the centre of the element. And for proximal or promoter regions, we scanned a region of ±1,000 bp around the TSS. Randomly selected background regions are used for motif discovery. To identify motif enriched in different cell types or brain regions, we use variable cCREs as input and invariable cCREs as background.

GREAT analysis

Gene Ontology annotation of cCREs was performed using GREAT (version 4.0.4, RRID:SCR_005807)82 with default parameters. Gene Ontology biological process was used for annotations.

Gene Ontology enrichment

We perform Gene Ontology enrichment analysis using the R package Enrichr (RRID:SCR_001575)83. The gene set library ‘GO_Biological_Process_2018’ was used with default parameters. The combined score is defined as the P value computed using the Fisher’s exact test multiplied with the z-score of the deviation from the expected rank.

GWAS enrichment

To enable comparison to GWASs of human phenotypes, we used liftOver with settings ‘-minMatch=0.5’ to convert accessible elements from mm10 to hg19 genomic coordinates51. Next, we reciprocal lifted the elements back to mm10 and only kept the regions that mapped to original loci. We further removed converted regions with lengths greater than 1 kb.

We obtained GWAS summary statistics for quantitative traits related to neurological disease and control traits (Supplementary Table 25): age first birth and number of children born84, tiredness85, Crohns disease86, attention deficit hyperactivity disorder87, allergy88, birth weight89, bipolar disorder90, insomnia91, sleep duration92, neuroticism93, coronary artery disease94, rheumatoid arthritis95, educational attainment96, schizophrenia97, age at menarche98, tobacco use disorder (ftp://share.sph.umich.edu/UKBB_SAIGE_HRC/, Phenotype code: 318)99, intelligence100, amyotrophic lateral sclerosis101, anorexia nervosa102 and height103.

We prepared summary statistics to the standard format for linkage disequilibrium score regression. We used homologous sequences for each major cell types as a binary annotation, and the superset of all candidate regulatory peaks as the background control.

For each trait, we used cell-type-specific linkage disequilibrium score regression (https://github.com/bulik/ldsc) to estimate the enrichment coefficient of each annotation jointly with the background control52.

Fine mapping

We obtained 99% credible sets for schizophrenia from the Psychiatric Genomics Consortium website (https://www.med.unc.edu/pgc/). Potential causal variants with a posterior probabilities of association score larger than 1% were used for overlapping with cCREs.

External datasets

The datasets used for intersection analysis area as follows: representative DNase hypersensitivity site regions for both hg19 and mm10 were obtained from SCREEN database (https://screen.encodeproject.org)104,105. ChromHMM31,33 states for mouse brain were downloaded from GitHub (https://github.com/gireeshkbogu/chromatin_states_chromHMM_mm9), and coordinates of ChromHMM states were mapped using LiftOver (https://genome.ucsc.edu/cgi-bin/hgLiftOver) to mm10 with default parameters51. PhastCons59 conserved elements were download from the UCSC genome browser (http://hgdownload.cse.ucsc.edu/goldenpath/mm10/phastCons60way/). CTCF-binding sites were downloaded from the Mouse Encode Project31 (http://chromosome.sdsc.edu/mouse/). CTCF-binding sites from the cortex and olfactory bulb were used in this study. Peaks were extended ±500 bp from the loci of peak summits and mapped using LiftOver to mm1051.

Statistics

No statistical methods were used to predetermine sample sizes. There was no randomization of the samples, and investigators were not blinded to the specimens being investigated. However, the clustering of single nuclei on the basis of chromatin accessibility was performed in an unbiased manner, and cell types were assigned after clustering. Low-quality nuclei and potential barcode collisions were excluded from downstream analysis as outlined above.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.

Source link