Broad transcriptomic dysregulation occurs across the cerebral cortex in ASD - Nature - Alert Breaking News

Sample acquisition and preparation for RNA-seq

Post-mortem cortical brain samples were acquired from the Harvard Brain Bank as part of the Autism BrainNet project, formerly the Autism Tissue Project (ATP), and the University of Maryland Brain Banks (UMDB). Sample acquisition protocols were followed for each brain bank, and samples were de-identified before acquisition and thus exempt from IRB review. A total of 842 samples from individuals with ASD or dup15q syndrome, and non-psychiatric controls (112 unique subjects) across 11 cortical regions encompassing all major cortical lobes—frontal: BA4/6, BA9, BA44/45, BA24; temporal: BA38, BA41/42/22, BA20/37; parietal: BA3/1/2/5, BA7, BA39/40; and occipital, BA17—were acquired. These included 253 samples previously published collected by Parikshak et al. (2016)⁵ from BA9 and BA41/42/22 and/or Gandal et al. (2018)^1,5 from BA9, BA4/6 and BA41/42/22. An ASD diagnosis was confirmed by the Autism Diagnostic Interview–Revised (ADIR) in 30 of the subjects. In the remaining 19 subjects, diagnosis was supported by clinical history. Additional samples with ‘NCTL’ diagnoses (samples with Angelman’s syndrome, certain CNVs, or epilepsy—but not ASD) were obtained and sequenced, but these samples were ultimately not utilized for the analyses presented here (but they are included in the raw data files accompanying this work). Frozen brain samples were stored at −80 °C. To extract RNA from these samples, approximately 50–100 mg of tissue was dissected from the cortical regions of interest on dry ice in a dehydrated dissection chamber to reduce degradation effects from sample thawing and/or humidity. RNA was then isolated using the miRNeasy kit with no modifications (Qiagen). For each RNA sample, RNA quality was quantified using the RNA integrity number (RIN) on an Agilent Bioanalyzer.

RNA-seq and RNA data processing

Initial sequencing of BA9 and BA41/42/22 samples was performed in three batches as described⁵. For those batches, starting with total RNA, ribosomal RNA (rRNA) was depleted (RiboZero Gold, Illumina) and libraries were prepared using the TruSeq v2 kit (Illumina) to construct unstranded libraries. The remaining samples (additional brain regions and technical replicates from BA9 and BA41/42/22) were sequenced across three new batches, with strand-specific libraries prepared using the TruSeq Stranded Total RNA sample prep kit with rRNA depletion (RiboZero Gold, Illumina). All libraries were randomly pooled to multiplex 24 samples per lane using Illumina TruSeq barcodes. Each lane was sequenced 5 times on an Illumina HiSeq 2500 or 4000 instrument using high-output mode with standard chemistry and protocols for 50, 69 or 100 bp paired-end reads (read length varied by batch) to achieve a target depth of 70 million reads (see Extended Data Fig. 2a).

After sequencing, the resulting sample FASTQ files from all batches (including the Parikshak et al.⁵ samples) were subjected to the same processing pipeline. First, FASTQ files were assessed with FastQC⁴⁶ (v0.11.2) to verify that quality was sufficient for further processing. FASTQ files were then aligned to the human reference genome (GRCh37⁴⁷ Ensembl v75) with STAR⁴⁸ (v2.5.2b). Picard tools (v2.5.0) was used with the resulting BAM files to collect various read quality measures, in addition to the quality measures collected by STAR. verifyBAMID⁴⁹ was also used with these BAM files along with known sample genotypes from Parikshak et al.⁵ to validate that sample identity was correct for all BAM files. Additionally, the expression of XIST (a female-specific gene) was assessed to contribute to sample identity verification. Finally, RSEM⁵⁰ (v1.3.0) was used for quantification (Gencode⁵¹ release 25lift37) to obtain expected read counts at the gene and transcript levels.

Expected gene and transcript read counts were then subjected to several processing steps in preparation for downstream analysis, mainly using R⁵². First, counts per million (CPM) were obtained from counts for gene and transcript filtering purposes. Genes and transcripts were filtered such that genes and transcripts with a CPM > 0.1 in at least 30% of samples were retained. Genes and transcripts were also removed which had an effective length (measured by RSEM) of less than 15 bp. Transcripts were additionally filtered such that all transcripts corresponded with genes in the gene-level analysis. The counts for the remaining genes (24,836) and transcripts (99,819) passing these filters were normalized using the limma-trend approach in the limma⁵³ R package. Briefly, the limma-trend approach obtains normalized expression data through taking the log₂(CPM) of read counts with an adjustment for sample read depth variance. An offset value calculated with CQN⁵⁴ accounting for GC content bias and gene/transcript effective length bias in read quantification was also incorporated during the normalization process. With this normalized expression data, sample outliers were identified in each sequencing batch by cortical lobe (frontal, parietal, temporal, and occipital) group that had both (1) an absolute z-score greater than 3 for any of the top 10 expression principal components and (2) a sample connectivity score less than −2. Sample connectivity was calculated using the fundamentalNetworkConcepts function in the WGCNA¹⁷ R package, with the signed adjacency matrix (soft power of 2) of the sample biweight midcorrelation as input. This process identified 34 outliers, resulting in a final total of 808 samples (341 control, 384 ASD and 83 dup15q), which were carried forward for analysis.

Evaluating ASD differentially expressed genes and transcripts cortex-wide

Linear models for all subsequent analyses are described in the Supplementary Methods. The limma⁵³ R package was used to identify differentially expressed genes and transcripts in ASD both within specific regions and cortex-wide, controlling for known biological and sequencing related technical covariates. Biological covariates included: diagnosis, region, sequencing batch, sex, ancestry, age and age squared. Technical covariates are listed in the Supplementary Methods. The limma::duplicateCorrelation function was used to account for the non-independence of samples derived from the same subject across multiple brain regions. For both region-specific and whole-cortex effects, genes or transcripts with an FDR-corrected P– value of <0.05 were considered significantly dysregulated. dup15q region-specific and whole-cortex dysregulation was also established in this manner. The fixed effects of sex, age and age squared were also acquired using the full gene and transcript models (Supplementary Data 3).

To compare the magnitude of the ASD transcriptomic signature across regions, we sought to compute the slope of the linear regression of ASD effect-size (log₂FC) changes between whole-cortex and regional comparisons (Fig. 1e). However, the linear regression slope is dependent on the (arbitrary) ordering of the response (Y) and predictor (X) variables, both of which are estimated with error, and we found that in practice the slope can change considerably according to this order. To circumvent this issue, we used total least square regression (also known as orthogonal regression), which provides an estimate of slope that is invariant to the choice of predictor and response variables, as we have previously published⁶ (see Supplementary Methods).

Transcriptomic regional-identity analysis

To identify differentially expressed genes and transcripts between all 55 pairs of cortical regions with our permutation-based approach, a regressed gene-expression dataset containing only the random effect of subject and the fixed effects of diagnosis and region (along with the model residual) was used. This regressed dataset was created with the ‘lmerTest’⁵⁵ package in R through subtracting the effects of technical covariates and all biological covariates other than subject, diagnosis, and region from each gene, leaving only the random intercept, these three remaining biological covariate effects, and the residual. Significant attenuation of differentially expressed genes between each pair of regions (a reduction in transcriptomic regional-identity differences) in ASD was established through the following process. (1) ASD and control subjects containing each region in the regional pair were extracted for use in the analysis. (2) Separately in ASD and control subjects, the number of differentially expressed genes between regions was calculated using the paired Wilcoxon signed-rank test. Genes with an FDR-corrected P-value < 0.05 were considered differentially expressed. (3) The difference in the number of differentially expressed genes between regions for ASD vs control subjects was calculated (the ‘true’ difference). (4) A permuted distribution of the difference in differentially expressed genes between regions for ASD vs control subjects was generated to test the ‘true’ difference. Each permutation (10,000 in total) randomly assigned ‘ASD’ and ‘control’ status to subjects but kept the number of ASD and control subjects consistent with the true number of ASD and control subjects. (5) A two-tailed P-value was obtained from testing the ‘true’ difference against the permuted distribution. If the regional comparison P-value < 0.05, with the number of differentially expressed genes between regions in ASD less than that in controls, then the regional comparison was considered significantly attenuated in ASD. Otherwise, the regional comparison was considered over-patterned in ASD. This procedure was repeated with transcript-level regressed gene-expression data (similarly, only containing the random effect of subject and the fixed effects of diagnosis and region, along with the model residual) to identify altered transcriptomic identities in ASD at the transcript level.

The previously described permutation approach was designed to identify differences in transcriptomic regional identity in ASD. Importantly, this method is not appropriate for assessing variance in expected numbers of differentially expressed genes between regions across regional pairs and diagnoses, since the number of ASD and control subjects varied across regional pairs. To examine this, we also implemented a bootstrap-based approach. For each regional comparison we subset to ten pairs of ASD and control subjects (ten was selected since every regional comparison had at least this many subjects). When subsetting, subjects were removed such that the remaining subjects were closest in age to the median age of the available samples for that regional comparison. A bootstrap approach was then used to calculate the number of differentially expressed genes (Wilcoxon test for all genes, FDR < 0.05) between regions separately in control and ASD subjects through sampling subjects with replacement. After 10,000 bootstraps, control and ASD distributions were compared with Wilcoxon tests to determine if there was significant attenuation of regional identities (an FDR < 0.05 for the 55 regional pair Wilcoxon tests comparing the differentially expressed gene distributions from the bootstraps). The same regressed expression dataset used for the permutation approach was utilized for this bootstrap analysis. Any regional comparison in which the number of differentially expressed genes between regions was less in ASD than in control subjects, and the FDR-corrected P-value was less than 0.05, was considered attenuated in ASD.

To validate our bootstrapped estimates for the number of differentially expressed genes between pairs of regions in controls, we compared these estimates to those of the Allen Brain Atlas¹³, which is the best publicly available work for comparison. Allen Brain Atlas regions were matched to Brodmann regions (Supplementary Data 4) and matching regional pairs were extracted for comparison with this work. When the Allen Brain Atlas had two or more regional pairs matching one regional pair in this work, the mean was taken across the Allen Brain Atlas regional pairs. A P-value for the association of the number of differentially expressed genes between regions in Controls obtained in this work compared to the Allen Brain Atlas was calculated from a linear model (cortex-wide bootstrap mean ~ Allen Brain Atlas mean).

We applied a stringent filtering process to identify high-confidence ARI genes from each significantly attenuated regional comparison identified with the permutation procedure described above. First, for each of the attenuated regional comparisons, we extracted the genes which were identified as differentially expressed between regions in subjects labelled as controls in each of the 10,000 permutations. Then, we calculated how many times each of the genes truly differentially expressed between pairs of regions in the control subjects were present in their respective permuted groups (ranging from a possible 0 to 10,000 occurrences). Those ‘true’ differentially expressed genes which were present in less than 95% of their respective permutations were retained as ARI genes for each attenuated regional comparison. For each set of ARI genes (ten total), each gene was matched to the region in which it had higher expression in control subjects. The paired Wilcoxon signed-rank P-values identified for these genes in controls (those subjects used for the permutation analysis) were also extracted and are shared in Supplementary Data 4.

ARI gene groups (ARI downregulated genes, those highly expressed in BA17 and BA39/40 relative to other regions in controls; ARI upregulated genes, those expressed at low level in BA17 and BA39/40 relative to other regions in controls) were created through taking the union (without duplicates) across all ten identified ASD-attenuated regional comparisons, and sorting genes into the two groups based on gene-expression profiles across regions. The details of this process are described in the Supplementary Methods, along with functional annotation procedures.

Network-based functional characterization

Standard workflows using WGCNA¹⁷ were followed as previously described in Parikshak et al.⁵ and Gandal et al.¹ (with minor modifications) to identify gene and transcript co-expression modules. Details regarding network formation, module identification, and module functional characterization are described in the Supplementary Methods.

snRNA-seq

Frozen brain samples were placed on dry ice in a dehydrated dissection chamber to reduce degradation effects from sample thawing and/or humidity. Approximately 50 mg of cortex was sectioned, ensuring specific grey matter–white matter boundary. The tissue section was homogenized in RNase-free conditions with a light detergent briefly on ice using a dounce homogenizer, filtered through a 40-μM filter and centrifuged at 1,000g for 8 min at 4 °C. The pelleted nuclei were then filtered through a two-part micro gradient (30%/50%) for 20 min at 4 °C. Clean nuclei were pelleted away from debris. The nuclei were washed two more times with PBS/1%BSA/RNase and spun down at 500g for 5 min. Cells were inspected for quality (shape, colour and membrane integrity) and counted on a Countess II instrument. They were then loaded onto the 10X Genomics platform to isolate single nuclei and generate libraries for RNA sequencing on the NovaS4 or NovaS2 Illumina machines.

After sequencing, Cell Ranger software (10X Genomics) was used to prepare fastq file and reads were aligned to the human GRCh38 pre-mRNA genome to generate gene by cell matrices for each library. Pegasus (v.1.4.0) was used to stringently filter cells, remove doublets, integrate and batch-correct all libraries together. Cells were removed if they expressed more than 6,000 or less than 750 genes, or had more than 10% mitochondrially mapped reads. Utilizing 65 PCs, Harmony (as part of the Pegasus suite) was used to integrate and batch-correct libraries, Louvain clustering was performed to cluster the cells and visualize resulting clusters with UMAP⁵⁶. Cell types were annotated based on expression of known marker genes visualized via UMAP and by performing unbiased gene marker analysis (Supplementary Data 8). Canonical genes were selected based on mouse and human studies as well as published reference atlas enriched genes^23,57.

For cell composition analyses of snRNA-seq data, cell fractions were calculated for each sample and then underwent centred-log ratio (clr) transformation, which accounts for compositional data, and values are interpreted relative to the geometric mean⁵⁸. A repeated-measures ANOVA was used to assess group-level significance, controlling for fixed effects of region, sex, age, library size, nGene and percent_mito, with a random effect for subject. Differential expression was assessed using a negative binomial mixed model as implemented in the NEBULA R package (v1.2.0)⁵⁹. We used a model matrix ~ Diagnosis + Age + Sex + nGene + percent_mito, where quantitative data are scaled. Filtered raw counts are provided as input and NEBULA-LN method is used (default values are used for the other options). Final differentially regulated genes are determined by Benjamini–Hochberg corrected P-values.

Methylation-based CTD analyses

CTD was calculated using bulk methylation array data with matched samples⁶⁰ in BA9 and BA41/42/22 (Supplementary Data 7), we performed reference-based deconvolution for seven brain cell types (excitatory neurons, inhibitory neurons, astrocytes, microglia, endothelial cells, oligodendrocytes and oligodendrocyte precursor cells) using single-cell methylome reference data²⁹. We chose to emphasize these results in this manuscript as the zero-to-one scaling and the constant number of potentially methylated sites within each cell makes it more amenable to deconvolution than transcriptomic data. We used our methylation CTD cell-type proportions to test if cell-type proportion could explain our observed ASD differential expression effects in the bulk RNA-seq data. We addressed this through adding cell-type proportions as covariates to our linear model for differentially expressed gene analysis. We tried two different models: one including our broad cell-type proportions, and the other including the top two principal components of the cell-type proportions (compositional PCs; Supplementary Data 7). Neither of these models had any substantial effect on the ASD log₂FC derived from the differentially expressed gene analysis (Extended Data Fig. 9 and Supplementary Data 7). Please see the Supplementary Note for a detailed explanation of these analysis.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Source link