May 5, 2024
An atlas of healthy and injured cell states and niches in the human kidney – Nature

An atlas of healthy and injured cell states and niches in the human kidney – Nature

Statistics and reproducibility

For 3D imaging and immunofluorescence staining experiments, each staining was repeated on at least two separate individuals or separate regions. For immunofluorescence validation studies, commercially available antibodies were used; 13 out of the 15 tissue samples were also analysed using snCv3 or scCv3. For ISH, 6 tissue samples (4 biopsies and 2 nephrectomies) were analysed. For Slide-seq, 67 tissue pucks (6 individuals) were analysed, with 2 individuals also analysed using snCv3 or Visium. For Visium, 23 kidney tissue sections (22 individuals) were imaged, including 6 that were also analysed using snCv3 or scCv3 and one examined using Slide-seq. Orthogonal validation of spatial transcriptomic annotations revealed similar marker gene expression in snCv3/scCv3 and these technologies, as well as spatial localization that corresponded with histologically validated Visium spot mapping. Although multiomic data from the same samples would be the most informative, this remains technically challenging. However, wherever possible, several technologies were performed on a subset of samples from the same patient and, in some cases, the same tissue block was used to generate multimodal data (Extended Data Fig. 1a and Supplementary Table 3). This heterogeneous sampling approach ensured cell type discovery while minimizing assay-dependent biases or artifacts encountered when using different sources of kidney tissue. We recognize that the heterogeneity of sample sources for several technologies is a potential limitation due logistics and limited patient biopsy material.

Ethical compliance

We have complied with all ethical regulations related to this study. All experiments on human samples followed all relevant guidelines and regulations. Human samples (Supplementary Table 1) collected as part of the KPMP consortium (https://KPMP.org) were obtained with informed consent and approved under a protocol by the KPMP single IRB of the University of Washington Institutional Review Board (IRB 20190213). Samples as part of the HuBMAP consortium were collected by the Kidney Translational Research Center (KTRC) under a protocol approved by the Washington University Institutional Review Board (IRB 201102312). Informed consent was obtained for the use of data and samples for all participants at Washington University, including living patients undergoing partial or total nephrectomy or rejected kidneys from deceased donors. Cortical and papillary biopsy samples from patients with stone disease were obtained with informed consent from Indiana University and approved by the Indiana University Institutional Review Board (IRB 1010002261). For Visium spatial gene expression, reference nephrectomies and kidney biopsy samples were obtained from the KPMP under informed consent or the Biopsy Biobank Cohort of Indiana (BBCI)50 under waived consent as approved by the Indiana University Institutional Review Board (IRB 1906572234). Living donor biopsies as part of the HCA were obtained with informed consent under the Human Kidney Transplant Transcriptomic Atlas (HKTTA) under the University of Michigan IRB HUM00150968. Deidentified leftover frozen COVID-19 AKI kidney biopsies were obtained from the Johns Hopkins University pathology archive under waived consent approved by the Johns Hopkins Institutional Review Board (IRB 00090103).

Single-cell and single-nucleus human tissue samples

For single-nucleus omic assays, tissues were processed according to a protocol available online (https://doi.org/10.17504/protocols.io.568g9hw). For nucleus preparation, around 7 sections of 40 µm thickness were collected and stored in RNAlater solution (RNA assays) or kept on dry ice (accessible chromatin assays) until processing or used fresh. To confirm tissue composition, 5 µm sections flanking these thick sections were obtained for histology and the relative amount of cortex or medulla composition including glomeruli was determined. For single-cell omic assays, tissues used (15 CKD,12 AKI and 18 living donor biopsy cores) were preserved using CryoStor (StemCell Technologies).

Single-cell, single-nucleus and SNARE2 RNA-seq, quality control and clustering

Isolation of single nuclei

Nuclei were isolated from cryosectioned tissues according to a protocol available online (https://doi.org/10.17504/protocols.io.ufketkw) with the exception that 4′,6-diamidino-2-phenylindole (DAPI) was excluded from the nuclear extraction buffer and used only to stain a subset of nuclei used for counting. Nuclei were used directly for omic assays.

Isolation of single cells

Single cells were isolated from frozen tissues according to a protocol available online (https://doi.org/10.17504/protocols.io.7dthi6n). The single-cell suspension was immediately transferred to the University of Michigan Advanced Genomics Core facility for further processing.

10x Chromium v3 RNA-seq analysis

10x single-nucleus RNA-seq and 10x single-cell RNA-seq were performed according to protocols available online (https://doi.org/10.17504/protocols.io.86khzcw and https://doi.org/10.17504/protocols.io.7dthi6n, respectively), both using the 10x Chromium Single-Cell 3′ Reagent Kit v3. Sample demultiplexing, barcode processing and gene expression quantifications were performed using the 10x Cell Ranger (v.3) pipeline using the GRCh38 (hg38) reference genome with the exception of a subset of scCv3 experiments that used hg19 (indicated in Supplementary Table 1). For single-nucleus data, introns were included in the expression estimates.

SNARE2 dual RNA and ATAC-seq analysis

SNARE-seq217, as outlined previously18, was performed according to a protocol available online (https://doi.org/10.17504/protocols.io.be5gjg3w). Accessible chromatin and RNA libraries were sequenced separately on the NovaSeq 6000 (Illumina) system (NovaSeq Control Software v.1.6.0 and v.1.7.0) using the 300 cycle and 200 cycle reagent kits, respectively.

SNARE2 data processing

Detailed step-by-step processing for SNARE2 data has been outlined previously18. This has now been developed as an automated data processing pipeline that is available at GitHub (https://github.com/huqiwen0313/snarePip). snarePip (v.1.0.1) was used to process all the SNARE2 datasets. The pipeline provides an automated framework for complex single-cell analysis, including quality assessment, doublet removal, cell clustering and identification, robust peak generation and differential accessible region identification, with flexible analysis modules and generation of summary reports for both quality assessment and downstream analysis. The directed acyclic graph was used to incorporate the entirety of the data-processing steps for better error control and reproducibility. For RNA processing, this involved removal of accessible chromatin contaminating reads using cutadapt (v.3.1)51, dropEst (v.0.8.6)52 to extract cell barcodes and STAR (version 2.5.2b)53 to align tagged reads to the genome (GRCh38). For accessible chromatin data, this involved snaptools (v.1.2.3)54 and minimap (v.2-2.20)55 for alignment to the genome (GRCh38).

Quality control of sequencing data

10x snRNA-seq (snCv3)

Cell barcodes passing 10x Cell Ranger filters were used for downstream analyses. Mitochondrial transcripts (MT-*) were removed, doublets were identified using the DoubletDetection software (v.2.4.0)56 and removed. All of the samples were combined across experiments and cell barcodes with greater than 400 and less than 7,500 genes detected were retained for downstream analyses. To further remove low-quality datasets, a gene UMI ratio filter (gene.vs.molecule.cell.filter) was applied using Pagoda2 (https://github.com/hms-dbmi/pagoda2).

10x scRNA-seq (scCv3)

As a quality-control step, a cut-off of <50% mitochondrial reads per cell was applied. The ambient mRNA contamination was corrected using SoupX (v.1.5.0)57. The mRNA content and the number of genes for doublets are comparatively higher than for single cells. To reduce doublets or multiplets from the analysis, we used a cut-off of >500 and <5,000 genes per cell.

SNARE2 RNA

Cell barcodes for each sample were retained with the following criteria: having an DropEst cell score of greater than 0.9; having greater than 200 UMI detected; having greater than 200 and less than 7,500 genes detected. Doublets identified by both DoubletDetection (v.3.0) and Scrublet (https://github.com/swolock/scrublet; v.0.2.2) were removed. To further remove low-quality datasets, a gene UMI ratio filter (gene.vs.molecule.cell.filter) was applied using Pagoda2.

SNARE2 ATAC

Cell barcodes for each sample that had already passed quality filtering from RNA data were further retained with the following criteria: having transcriptional start site (TSS) enrichment greater than 0.15; having at least 1,000 read fragments and at least 500 UMI; having fragments overlapping the promoter region ratio of greater than 0.15. Samples were retained only if they exhibited greater than 500 dual omic cells after quality filtering.

Clustering snCv3

Clustering analysis was performed using Pagoda2, whereby counts were normalized to the total number per nucleus, batch variations were corrected by scaling expression of each gene to the dataset-wide average. After variance normalization, all 5,526 significantly variant genes were used for principal component analysis (PCA). Clustering was performed at different k values (50, 100, 200, 500) on the basis of the top 50 principal components, with cluster identities determined using the infomap community detection algorithm. The primary cluster resolution (k = 100) was chosen on the basis of the extent of clustering observed. Principal components and cluster annotations were then imported into Seurat (v.4.0.0) and uniform manifold approximation and projection (UMAP) dimensionality reduction was performed using the top 50 principal components identified using Pagoda2. Subsequent analyses were then performed in Seurat. A cluster decision tree was implemented to determine whether a cluster should be merged, split further or labelled as an altered state. For this, differentially expressed genes between clusters were identified for each resolution using the FindAllMarkers function in Seurat (only.pos = TRUE, max.cells.per.ident = 1000, logfc.threshold = 0.25, min.pct = 0.25). Possible altered states were initially defined for clusters with one or more of the following features: low genes detected, a high number of mitochondrial transcripts, a high number of endoplasmic-reticulum-associated transcripts, upregulation of injury markers (CST3, IGFBP7, CLU, FABP1, HAVCR1, TIMP2, LCN2) or enrichment in AKI or CKD samples. Clusters (k = 100) that showed no distinct markers were assessed for altered-state features; if present, then these clusters were tagged as possible altered states, if absent then clusters were merged on the basis of their cluster resolution at k = 200 or 500. If this merging occurred across major classes (epithelial, endothelial, immune, stromal) at higher k values, then these clusters were instead labelled as ambiguous or low quality (including possible multiplets). For k = 100 clusters (non-epithelial only) that did show distinct markers, their k = 50 subclusters were assessed for distinct marker genes; if present, then these clusters were split further. The remaining split and unsplit clusters were then assessed for altered-state features. If present, they were tagged as possible altered states, if absent they were assessed as the final cluster. Annotations of clusters were based on known positive and negative cell type markers11,12,58,59,60 (Supplementary Table 5), the regional distribution of the clusters across the corticomedullary axis and altered state (including cell cycle) features. For separation of EC-DVR from EC-AEA, the combined population was independently clustered using Pagoda2 and clusters associated with medullary sampling were annotated as EC-DVR. For separation of the REN cluster, stromal cells expressing REN were selected on the basis of normalized expression values of greater than 3. Final overall assessment of clustering accuracy was performed using the Single Cell Clustering Assessment Framework (SCCAF v.0.0.10) using the default settings, and compared against that associated with broad cell type classifications (subclass level 1).

Annotating snCv3 clusters

To overcome the challenge of disparate nomenclature for kidney cell annotations, we leveraged a cross-consortium effort to use the extensive knowledge base from human and rodent single-cell gene expression datasets, as well as the domain expertise from pathologists, biologists, nephrologists and ontologists11,12,22,58,59,60,61 (see also Supplementary Tables 4 and 5 and the HuBMAP ASCT+B Reporter at GitHub (https://hubmapconsortium.github.io/ccf-asct-reporter)). This enabled the adoption of a standardized anatomical and cell type nomenclature for major and minor cell types and their subclasses (Supplementary Table 4), showing distinct and consistent expression profiles of known markers and absence of specific segment markers for some of the cell types (Extended Data Fig. 2a and Supplementary Table 5). The knowledge of the regions dissected and histological composition of snCv3 data further enabled stratification of distinct cortical and outer and inner medullary cell populations (Fig. 2b and Extended Data Fig. 1). The cell type identities and regional locations were confirmed through orthogonal validation using spatial technologies presented here and correlations with existing human or rodent stromal, immune, endothelial and epithelial datasets4,25,58,59,61,62 (Extended Data Fig. 2b–l).

Atlas cell type resolution

Our atlas now includes a higher granularity for the loop of Henle, distal convoluted tubule and collecting duct segments, now resolving three descending thin limb cell types (DTL1, 2, 3); different subpopulations of medullary or cortical thick ascending limb cells (M-TAL/C-TAL); two types of distal convoluted tubule cells (DCT1, 2); intercalated and principal cells of the connecting tubules (CNT-IC and CNT-PC); cortical, outer medullary and inner medullary collecting duct subpopulations (CCD, OMCD, IMCD); and papillary tip epithelial cells abutting the calyx (PapE). Molecular profiles for rare cell types important in homeostasis were annotated, including juxtaglomerular renin-producing granular cells (REN); macula densa cells (MD); and a cell population with enriched Schwann/neuronal (SCI/NEU) genes NRXN1, PLP1 and S100B. Major endothelial cell types were stratified, including endothelial cells of the lymphatics (EC-LYM) and vasa recta (EC-AVR, EC-DVR). Specific stromal and immune cell types were distinguished, including distinct fibroblast populations across the cortico-medullary axis and 12 immune cell types from lymphoid and myeloid lineages.

Integrating snCv3 and SNARE2 datasets

Integration of snCv3 and SNARE RNA data was performed using Seurat (v.4.0.0) using snCv3 as reference. All counts were normalized using sctransform, anchors were identified between datasets based on the snCv3 Pagoda2 principal components. SNARE2 data were then projected onto the snCv3 UMAP structure and snCv3 cell type labels were transferred to SNARE2 using the MapQuery function. Both datasets were then merged and UMAP embeddings were recomputed using the snCv3 projected principal components. Integrated clusters were identified using Pagoda2, with the k-nearest neighbour graph (k = 100) based on the integrated principal components and using the infomap community detection algorithm. The SNARE2 component of the integrated clusters was then annotated to the most overlapping, correlated and/or predicted snCv3 cluster label, with manual inspection of cell type markers used to confirm identities. Integrated clusters that overlapped different classes of cell types were labelled as ambiguous or low-quality clusters. Segregation of EC-AEA, EC-DVR and REN subpopulations was performed as described for snCv3 above.

Integrating snCv3 and scCv3 datasets

Integration of snCv3 and scCv3 data was performed using Seurat v.4.0.0 with snCv3 as a reference. All counts were normalized using sctransform, anchors were identified between datasets based on the snCv3 Pagoda2 principal components. scCv3 data were then projected onto the snCv3 UMAP structure and snCv3 cell type labels were transferred to scCv3 using the MapQuery function. Both datasets were then merged and UMAP embeddings recomputed using the snCv3 projected principal components. Integrated clusters were identified using Pagoda2, with the k-nearest neighbour graph (k = 100) based on the integrated principal components and using the infomap community detection algorithm. The scCv3 component of the integrated clusters was then annotated to the most overlapping or correlated snCv3 subclass, with manual inspection of cell type markers used to confirm identities. Cell types that could not be accurately resolved (PT-S1/PT-S2) were kept merged. Integrated clusters that overlapped different classes of cell types or that were too ambiguous to annotate were considered to be low quality and were removed from the analysis. Segregation of EC-AEA, EC-DVR and REN subpopulations was performed as described above.

Assessment of snCv3, scCv3 and SNARE2 data integration

As described above, we used the demonstrated Seurat v.4.0.0 integration strategy63 to project query datasets (scCv3, SNARE2 RNA) into the same PCA space as our snCv3 reference. These imputed principal components were used to generate an integrated embedding and integrated clustering through Pagoda2. Query datasets within these integrated clusters were manually annotated on the basis of co-clustering with the reference data, predicted subclass levels and the manual inspection of marker genes. This process was necessary to account for misalignments that occurred for altered states showing more ambiguous marker gene expression profiles, especially for mapping between single-nucleus and single-cell technologies. To assess the accuracy in our alignments, we performed correlation of average expression signatures between the assigned query cell populations and the original reference cell populations (Extended Data Fig. 3e). Although several samples were examined using more than one platform (Supplementary Table 3 and Extended Data Fig. 1a), not all conditions could be covered by all technologies, with AKI/CKD biopsies too limited in size to process with SNARE2 and deeper medullary region capture being less likely for needle biopsies. Despite the differences in patient conditions and regions sampled, we were able to confirm cross-platform sampling with minimal batch contributions for a majority of our subclass (level 3) assignments (77 total). This was demonstrated through integrated bar plots for assay, patient, sex and condition contributions (Extended Data Fig. 3e). The degree to which cells/nuclei between assays were mixed within these subclasses was confirmed using normalized relative entropy weighted by subclass size64, with an average assay entropy across subclasses (covered by more than one technology) of 0.71 and an average patient entropy of 0.71 (out of 1). Mixing within the subclasses was also assessed on the cell embeddings (principal components) using the average silhouette width or ASW (scib.metrics.silhouette_batch function of the scIB package v.1.0.365), with an average score of 0.86 for assays and 0.82 for patients (out of 1). Finally, the average of k-nearest neighbour batch effect test (kBET) score per subclass, computed for all patients using the scib.metrics.kBET function of the scIB package, was 0.49 (out of 1), which is consistent with other integration efforts65.

Integrating snCv3 with published datasets

Integration with published data was performed using Seurat v.4.0.0 with snCv3 as a reference. All counts were normalized using sctransform, anchors were identified between datasets on the basis of the snCv3 Pagoda2 principal components. Published data were then projected onto the snCv3 UMAP structure and snCv3 cell type labels were transferred to the published dataset using the MapQuery function. Ref. 12 snDrop-seq data are available at the Gene Expression Omnibus (GEO: GSE121862). Ref. 15 single-nucleus RNA-seq and ref. 14 single-cell RNA-seq count matrices and metadata tables were downloaded from the UCSC Cell Browser (Cell Browser dataset IDs human-kidney-atac and kidney-atlas, respectively).

NSForest marker genes

To identify a minimal set of markers that can identify snCv3 clusters and subclasses (subclass.l3), or scCv3 integrated subclasses (subclass.l3), we used the Necessary and Sufficient Forest66 (NSForest v.2; https://github.com/JCVenterInstitute/NSForest/releases/tag/v2.0) software using the default settings.

Correlation analyses

For correlation of RNA expression values between snCv3 and scCv3, or SNARE2, average scaled expression values were generated, and pairwise correlations were performed using variable genes identified from Pagoda2 analysis of snCv3 (top 5,526 genes). For comparison with mouse single-cell RNA-seq data of healthy reference tissue59, raw counts were downloaded from the GEO (GSE129798). For comparison with mouse single-cell RNA-seq from IRI tissue4, raw counts were downloaded from the GEO (GSE139107). For human fibroblast and myofibroblast data25, raw counts were downloaded from Zenodo (https://doi.org/10.5281/zenodo.4059315). For each dataset, raw counts were processed using Seurat: counts for all cell barcodes were scaled by total UMI counts, multiplied by 10,000 and transformed to log space. For comparison with mouse single-cell types of the distal nephron61, the precomputed Seurat object was downloaded from the GEO (GSE150338). For mouse bulk distal segment data61, normalized counts were downloaded from the GEO (GSE150338) and added to the ‘data’ slot in a Seurat object. Bulk-sorted immune cell reference data were obtained using the celldex package67 using the MonacoImmuneData()62 and ImmGenData()67,68 functions and log counts imported into the ‘data’ slot of Seurat. For correlation against these reference datasets, averaged scaled gene expression values for each cluster were calculated (Seurat) using an intersected set of variable genes identified for each dataset (identified using Padoda2 for snCv3 and Seurat for reference datasets). For immune reference correlations, a list of immune-related genes downloaded from ImmPort (https://immport.org) was used instead of the variable genes. Correlations were plotted using the corrplot package (https://github.com/taiyun/corrplot). Immune annotations within our atlas were further confirmed by manual comparison with recently reported data14.

Cross-species alignment of cell types/states

For mouse single-nucleus RNA-seq data from IRI tissue4, raw counts were downloaded from the GEO (GSE139107). Integration was performed using Seurat v.4.0.0 with snCv3 as a reference. All counts were normalized using sctransform, anchors were identified between datasets on the basis of the snCv3 Pagoda2 principal components. Mouse data were then projected onto the snCv3 UMAP structure and snCv3 cell type labels were transferred using the MapQuery function.

Computing single-nucleus/cell-level expression scores

To identify markers associated with altered states (degenerative; adaptive—epithelial or aEpi; adaptive—stromal or aStr; cycling), snCv3 and scCv3 data were independently used to identify differentially expressed genes between reference and corresponding altered states for each subclass level 1 (subclass.l1). To ensure general state-level markers, differentially expressed genes were identified using the FindConservedMarkers function (grouping.var = “condition.l1”, min.pct = 0.25, max.cells.per.ident = 300) in Seurat. A minimal set of general degenerative conserved genes was identified as enriched (P < 0.05) in the degenerative state of each condition.l1 (reference, AKI and CKD) and in at least 4 out of the 11 (snCv3) or 9 (scCv3) subclass.l1 cell groupings. A minimal set of conserved aEpi genes was identified as enriched (P < 0.05) in the adaptive state of each condition.l1 (reference, AKI and CKD) in both the aPT and aTAL cell populations. This aEpi gene set was then further trimmed to include only those genes that were enriched within the adaptive epithelial population (aPT/aTAL) versus all others using the FindMarkers function and with a minimum P value of 0.05 and average log2-transformed fold change of >0.6. A minimal set of conserved aStr genes was identified as enriched (P < 0.05) in the adaptive state of each condition.l1 (reference, AKI and CKD for snCv2; reference and AKI for scCv3) for stromal cells. To increase representation from MyoF in scCv3 showing a small number of these cells, MyoF-alone enriched genes (average log2-transformed fold change ≥ 0.6; adjusted P < 0.05) were included for the scCv3 gene set. The aStr gene sets were then further trimmed to include only those genes that were enriched within the adaptive stromal population (aFIB and MYOF) compared with all others using the FindMarkers function and with a minimum P value of 0.05 and average log2-transformed fold change of >0.6. A minimal set of cycling-associated genes was identified as those enriched (adjusted P < 0.05 and average log2-transformed fold change > 0.6) in the cycling state across all associated subclass.l1 cell groupings.

Scores for altered state, extracellular matrix and for gene sets associated with ageing or SASP were computed for each cell from averaged normalized counts using only the genes showing a minimum correlation to the averaged whole gene set of 0.1 (ref. 25) (https://github.com/mahmoudibrahim/KidneyMap). Ageing and SASP genes were obtained from ref. 48 (top 20 genes upregulated in ageing kidney)48, ref. 69 (genes from table S3, group.age A), ref. 70 (SASP genes from figure 2c) or ref. 71 (from table S1 (sheet IR Epithelial SASP), having a positive AVE log2 ratio)71.

Gene set enrichment analyses (GSEA)

To compute gene set enrichments for aPT and aTAL, conserved genes differentially expressed in the adaptive over reference states were identified as indicated above. Gene set ontologies from the Molecular Signatures Database (MSigDB) were downloaded from https://gsea-msigdb.org and pathway enrichments were computed using fgsea72 and gage73, retaining only Gene Ontology terms that were significant (P < 0.05) for both. Redundant pathways were collapsed using the fgsea function collapsePathways and visualized using ggplot.

SNARE2 accessible chromatin analyses

SNARE2 chromatin data were analysed using Signac74 (v.1.1.1). Peak calling was performed using the CallPeaks function and MACS (v.3.0.0a6; https://github.com/macs3-project/MACS) separately for clusters, subclass.l1 and subclass.l3 annotations. Peak regions were then combined and used to generate a peak count matrix using the FeatureMatrix function, then used to create a new assay within the SNARE2 Seurat object using the CreateChromatinAssay function. Gene annotation of the peaks was performed using GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86). TSS enrichment, nucleosome signal and blacklist fractions were all computed using Signac. Jaspar motifs (JASPAR2020, all vertebrate) were used to generate a motif matrix and motif object that was added to the Seurat object using the AddMotifs function. For motif activity scores, chromVAR75 (v.1.12.0; https://greenleaflab.github.io/chromVAR) was performed using the RunChromVAR function. The chromVAR deviation score matrix was then added to a separate assay slot of the Seurat object. To assess the chromatin data, UMAP embeddings were computed from cis-regulatory topics that were identified through latent Dirichlet allocation using CisTopic76 (v.0.3.0; https://github.com/aertslab/cisTopic) and the runCGSModels function. Only regions accessible in 50 nuclei and nuclei with 200 of these accessible regions were used for cisTopic and downstream analyses. The UMAP coordinates for the remaining nuclei were added to the Seurat object. To ensure high-quality accessible chromatin profiles, only clusters with more than 50 nuclei were retained for downstream analyses (Supplementary Table 7). For joint embedding of SNARE2 accessible chromatin and gene expression, a weighted nearest-neighbour graph was computed using the FindMultiModalNeighbors function (Seurat) based on PCA (RNA) and latent semantic indexing or LSI (accessible chromatin) dimensionality reductions. UMAP dimensionality reduction was performed to visualize the joint embedding.

DAR analyses

Sites that were differentially accessible for a given cell grouping (subclass) were identified against a selection of background cells with the best matched total peak counts, to best account for technical differences in the total accessibility for each cell. For this, the total peaks in each cell were used for estimation of the distribution of total peaks (depth distribution) for the cells belonging to the test cluster, and 10,000 background cells with a similar depth distribution to the test cluster were randomly selected. Differentially accessible sites (DARs) were then identified as significantly enriched in the positive cells over selected background cells using the CalcDiffAccess function (https://github.com/yanwu2014/chromfunks), where P values were calculated using Fisher’s exact tests on a hypergeometric distribution and adjusted P values (or q values) were calculated using the Benjamini–Hochberg method. For subclass level 2 DARs, VSM/P clusters were merged and the MD was combined with C-TAL before to DAR calling. Subclasses with >100 DARs with q < 0.01 were used for further analysis. Co-accessibility between all peak regions was computed using Cicero77 (v.1.8.1). Sites were then linked to genes on the basis of co-accessibility with promoter regions, occurring within 3,000 bp of a gene’s TSS, using the RegionGeneLinks function (https://github.com/yanwu2014/chromfunks) and the ChIPSeeker package78. DARs associated with markers for each subclass (identified at the subclass.l2 level using snCv3, P < 0.05) and showing q < 0.01 and a log-transformed fold change of >1 were selected for visualization. For this, DAR accessibility (peak counts) was averaged, scaled (trimming values to a minimum of 0 and a maximum of 5) and visualized using the ggHeat plotting function of the SWNE package79. Motif enrichment within cell type DARs was computed using the hypergeometric test (FindMotifs function) in Signac.

Transcription factor analyses

To identify active transcription factors from SNARE2 accessible chromatin data, differential activities (or deviation scores) of TFBSs between different populations were assessed using the Find[All]Markers function through logistic regression and using the number of peak counts as a latent variable. Only transcription factors with expression detected within the corresponding cluster, subclass or state grouping were included. For PT and TAL clusters, TFBSs that were differentially active (P < 0.05, average log2-transformed fold change of >0.35) and associated with transcription factors with expression detected in at least 2.5% of nuclei (SNARE2) were identified between clusters. Common aPT/aTAL TFBS activities were identified from an intersection of those differentially active and expressed within adaptive PT and TAL clusters. For aPT and aTAL trajectory modules, TFBSs showing differential activity between modules (adjusted P < 0.05, average log2-transformed fold change of >0.35) and expression detected within at least 2.5% of nuclei/cells (snCv3/scCv3) were identified. For common degenerative state TFBS activities, differentially active TFBSs were identified between reference and degenerative states for each level 1 subclass. Significant degenerative state TFBS activities (P < 0.05, average log2-transformed fold change of >0.35) in three or more subclass.l1 were trimmed to those showing expression detected in more than 20% of the degenerative state nuclei/cells for snCv3/scCv3.

Ligand–receptor interaction analyses

Ligand–receptor analyses were performed on the basis of the CellChat package (v.1.0.0; https://github.com/sqjin/CellChat). Only cells in TAL, immune and stroma of subclass level 2 (immune: cDC, cycMNP, MAC-M2, MAST, MDC, N, ncMON, NKT, pDC, PL, T and B; stroma: MyoF, FIB, dFIB, cycMyoF and aFIB) and interactions for secreted ligands were used to infer the cell–cell communication. For cells in the TAL trajectory, we computed the intercellular cell communication probability between each module and other cell populations using the CellChat function computeCommunProb (see ref. 80 for a detailed description of the method). The overall scaled communication probability was then visualized based on a circle plot using a customized plot_communication function (Code availability). To further understand which signals contribute most to the ligand–receptor (LR) interaction pathways, we generated the pathway enrichment heat map of each interaction for incoming, outgoing and overall signals using the plotSigHeatmap function (Code availability). The contribution of significant LR pairs of each interaction was also identified using netAnalysis_contribution in the CellChat package.

GWAS analyses

To link SNARE2 cell types to kidney genome-wide association study (GWAS) traits and diseases, we first summed the binary peak accessibility profiles for all cells belonging to the same cell type to create a pseudobulk peak-by-subclass accessibility matrix. Pseudobulk analyses give more stable results, especially as SNARE2 accessibility data can be sparse. To ensure sufficient coverage, we used subclass level 2 groupings with the following modifications: VSM/P clusters were merged; MD was combined with C-TAL; subclasses with <100 DARs with q < 0.01 were excluded. We used g-chromVAR81 (v.0.3.2), an extension of chromVAR for GWAS data, to identify cell types with higher than expected accessibility of genomic regions overlapping GWAS-linked single-nucleotide polymorphisms (SNPs). Running g-chromVAR requires first identifying GWAS-linked SNPs that are more likely to be causal, a process known as fine-mapping. For the chronic kidney failure GWAS traits, we used existing fine-mapped SNPs from the CausalDB database, using the posterior probabilities generated by CAVIARBF82,83. The original GWAS summary statistics files were obtained from an atlas of genetic associations from the UK BioBank84. We manually fine-mapped the CKD, eGFR, blood urea nitrogen and gout traits using the same code that was used to generate the CausalDB database (https://github.com/mulinlab/CAUSALdb-finemapping-pip). The summary statistics for all of these traits are available at the CKDGen Consortium site (https://ckdgen.imbi.uni-freiburg.de/)85,86. We also manually fine-mapped the hypertension trait and the original summary statistics can be found on the EBI GWAS Catalog87. We looked only at causal SNPs with a posterior causal probability of at least 0.05 to ensure that SNPs with low causal probabilities did not cause false-positive signals. Moreover, as g-chromVAR selects a semi-random set of peaks with similar average accessibility and GC content as background peaks, the method has an element of randomness. To ensure stable results, we ran g-chromVAR 20 times and averaged the results. Cluster/trait z-scores were plotted using ggheat (https://github.com/yanwu2014/swne).

To link causal SNPs to genes, we used functions outlined in the chromfunks repository (https://github.com/yanwu2014/chromfunks; /R/link_genes.R). This involved the identification of causal peaks for each cell type and trait (minimum peak Z score of 1, minimum peak posterior probability score of 0.025). Sites were then linked to genes on the basis of co-accessibility (Cicero) with promoter regions, occurring within 3,000 bp of a gene’s TSS. Only sites associated with genes detected as expressed in 10% of TAL nuclei/cells (snCv3/scCv3) were included. Motif enrichment within the causal SNP and TAL-associated peaks was performed using the FindMotifs function in Seurat and only motifs for transcription factors expressed in 10% of TAL nuclei/cells (snCv3/scCv3) were included (Supplementary Table 31). For a TAL-associated ESRRB transcription factor subnetwork, peaks were linked to genes using Cicero, then subset to those associated with TAL (C-TAL, M-TAL) marker genes that were identified using the Find[All]Markers function in Seurat for subclass.l3 (P < 0.05). Transcription factors were then linked to gene-associated peaks on the basis of the presence of the motif and correlation of peak and TFBS co-accessibility (chromVAR), using a correlation cut-off of 0.3. Only transcription factors with expression detected within 20% of TAL cells or nuclei (snCv3/scCv3) were included. Eigenvector centralities were then computed using igraph and the transcription-factor-to-gene network was visualized using PlotNetwork in chromfunks.

Disease-associated gene set enrichment analyses

Genes linked with CKDGen consortium GWAS loci for the kidney functional traits eGFR and urinary albumin-creatinine ratio (UACR) were obtained from table S14 of ref. 88. These included the top 500 genes per trait or only those genes also implicated in monogenic glomerular diseases. eQTLs associated with eGFR, systolic blood pressure and general kidney function were obtained from tables S20, S21 and S22 of ref. 89, respectively. Genes associated with the transition from acute to chronic organ injury after ischaemia–reperfusion were obtained from ref. 90 from the following supplementary tables: Acute_Human_Specific (table S3, Human specific column); Acute_Mouse_Overlap (table S3, Shared column); Mid_Acute (table S8, cluster 2 genes); Late_Human_Specific (table S9, Human specific column); Late_Mouse_Overlap (table S9, Shared column); Late_Fibrosis (table S6, positive logFC); Late_Recovery (table S6, negative logFC). Each gene set was assessed for its enrichment within combined snCv3 and scCv3 subclass (level 3) differentially expressed genes (adjusted P < 0.05, log-transformed fold change of >0.25). Enrichments were performed using Fisher’s exact tests and the resultant −log10[P] values were scaled and visualized using ggplot2.

Patient cohorts used for clinical association analyses

NEPTUNE91 (193 adult patients) and ERCB45 (131 patients) expression data were used as validation cohorts to determine the significance between patients with different levels of cell state gene expression. NEPTUNE (NCT01209000) is a multicentre (21 sites) prospective study of children and adults with proteinuria recruited at the time of first clinically indicated kidney biopsy (Supplementary Table 34). The study participants were followed prospectively, every 4 months for the first year, and then biannually thereafter for up to 5 years. At each study visit, medical history, medication use and standard local laboratory test results were recorded, while blood and urine samples were collected for central measurement of serum creatinine and urine protein/creatinine ratio (UPCR) and eGFR (ml per min per 1.73 m2). End-stage kidney disease (ESKD) was defined as initiation of dialysis, receipt of kidney transplant or eGFR <15 ml per min per 1.73 m2 measured at two sequential clinical visits; and the composite end point of kidney functional loss by a combination of ESKD or 40% reduction in eGFR92. Genome-wide transcriptome analysis was performed on the research core obtained at the time of a clinically indicated biopsy using RNA-seq by the University of Michigan Advanced Genomics Core using the Illumina HiSeq2000 system. Read counts were extracted from the fastq files using HTSeq (v.0.11). NEPTUNE mRNA-seq data and clinical data are controlled access data and will be available to researchers on request to [email protected].

ERCB is the European multicentre study that collects biopsy tissue for gene expression profiling across 28 sites. Transcriptional profiles of biopsies from patients in the ERCB were obtained from the GEO (GSE104954).

Clinical association of cell state scores

The gene expression data from the tubulointerstitial compartment of the kidney biopsies from NEPTUNE patients was used to calculate the composite scores for the genes enriched in degenerative, aPT, aTAL and aStr states. The expression of the genes that were uniquely enriched in the cell state (described above) and that were found in both snCv3 and scCv3 were used to calculate the composite cell state score (Supplementary Table 29). As scCv3 did not efficiently identify all stromal cell types, the union of the enriched genes from scCv3 and snCv3 data were used to calculate the aStr cell state score. We also generated a cell state score for the genes that were commonly enriched in aPT and aTAL cells (common).

For outcome analyses (40% loss of eGFR or ESKD) in the NEPTUNE cohort, patient profiles were binned according to the degree of cell state score by tertile. Clinical outcomes were available on 193 participants with a total of 30 events. Kaplan–Meier analyses were performed using log-rank tests to determine significance between patients in different tertiles of cell state gene expression. Moreover, for the different cell state scores, multivariable adjusted Cox proportional hazard analyses were performed using five statistical models adjusting for different sets of potential confounding effects given the overall few number of events: (1) age, sex and race; (2) baseline eGFR and UPCR; (3) immunosuppressive treatment and FSGS status; (4) eGFR, UPCR and race self-reported as Black (factors that were associated with outcome in this dataset); and (5) immunosuppressive treatment, eGFR and UPCR (Supplementary Table 30). Note that the adjusted models simply assess whether the association with outcome persists after adjusting for common clinical features (that is, confounding effects), but do not assess for prediction accuracy.

In the ERCB cohort, differential expression analyses using multivariable regression modelling were performed between the cell state scores in the disease groups and living donors. Age and sex were used as covariates. The cell state scores for both NEPTUNE and ERCB bulk mRNA transcriptomics data were generated93. In brief, the cell state scores were generated by creating Z scores for each of the cell state gene sets and then using the average Z score as the cell state composite score. These analyses found scores for all adaptive epithelial, but not degenerative, states were significantly higher in the patients with diabetic nephropathy patients compared to that of living donors (Supplementary Table 30). After adjusting for sex and age, both aPT and aTAL were significant when scores from patients with diabetic nephropathy were compared with those of living donors and aPT scores were significant even after correcting for the different disease groups.

Sample-level analysis and clustering on clinical association gene sets

To find association of patients based on altered-state gene signatures that were used in clinical association analyses (Supplementary Table 30), we performed sample-level clustering. All of the cells from the same patient in snCv3 and scCv3 were aggregated to get pseudo-bulk count matrices on the basis of the associated clinical gene set. The matrices were further normalized by RPKM followed by t-distributed stochastic neighbour embedding (t-SNE) dimensionality reduction. Groups of patients were then identified based on k-means clustering and density-based spatial clustering (DBSCAN) methods in the reduced space. To associate the patient clusters with clinical features, we calculated the distribution of eGFR in each identified group (Code availability).

To identify gene sets that best differentiate between AKI and CKD patients in the PT and TAL cell populations, we trained a gene-specific logistic regression model based on the sample-level gene expression. The model was used to assess the predictive power that differentiate patients with AKI and CKD in both snCv3 and scCv3 measured by area under the curve (AUC). The genes with AUC > 0.65 on both snCV3 and scCv3 were selected for downstream analysis (Supplementary Table 32).

To identify genes that were differentially expressed between AKI and CKD across all cell types, we aggregated the cells associated with each subclass (level 1) to generate cell-type-specific pseudocounts for each sample and performed differential gene expression analysis based on the DEseq2 method using the estimatePerCellTypeDE function in the Cacoa package (v.0.2.0; https://github.com/kharchenkolab/cacoa).

Pseudotime analysis of PT and TAL cells

To find cells associated with disease progression, we performed trajectory analysis for PT and TAL cells. To get accurate pseudotime and trajectory estimation, we removed degenerative cell populations in both PT and TAL and inferred the trajectory for single nuclei and single cells separately using the Slingshot package94 (v.2.0.0). We specified normal cell populations as the end points for trajectory inference (S1–S3 in PT and M-TAL in TAL) using the Slingshot parameter end.clus. The correspondent trajectory embedding was visualized using the plotEmbedding function in the Pagoda2 package.

To identify whether the gene expression was statistically significantly associated with the inferred trajectory, we modelled the expression of a gene as a function of the estimated pseudotime by fitting a gam model with cubic spline regression using formula expi = f(t) + ϵ, where t is the pseudotime and f is the function of cubic spline. The model is then compared to a reduced model expi = f(1) + ϵ to get P-value estimates using the F-test. The Benjamin–Hochberg method was used to calculate the adjusted P values. To further identify candidate genes showing potential differences between patients with AKI and CKD, we extended the base gam model by fitting a conditional-smooth interaction using CKD as a reference.

Gene module detection and cell assignment

To identify expression modules for significant gene sets along the estimated trajectories, we applied the module detection algorithm implemented in the WGCNA package95 (v.1.70-3) based on the smoothed gene expression matrix with parameters softPower = 10 and minModuleSize = 20. The similar modules detected by the original parameters were further merged. In total, we identified five different modules in PT and six modules in TAL cells. For the gene sets in each module, we further performed pathway analysis using the Reactome online tool96 (https://reactome.org/PathwayBrowser/). The enrichment of clinical associated gene sets for each module (Fig. 6e) was assessed by performing log ratio enrichment tests. To predict the transcription factor activities of PT and TAL subclass genes, we used the DoRothEA package (v.1.7.2) as targets. DoRothEA transcription factors and transcriptional targets were curated from both human and mouse evidence. The transcription factor activity scores for each cell type were calculated based on the run_viper function of the viper package (v.3.15; https://bioconductor.org/packages/release/bioc/html/viper.html).

To identify cells that are associated with each module, we developed a systematic approach. In brief, for the cells in the smoothed expression matrix, we performed dimension reduction using PCA followed by Louvain clustering. This enabled the identification of cell clusters along the trajectory. For the identified cell clusters, we then performed hierarchical clustering to calculate the correlation of each module on the basis of mean gene expression values and further linked the clusters with associated modules by cutting the hierarchical tree. Finally, module labels for each cell were assigned on the basis of its associated clusters. To link single-cell datasets with single-nucleus modules, we performed k-means clustering based on the joint embedding of PT or TAL cells and assigned the cells in scCv3 to modules on the basis of the majority voting from its k’s nearest neighbours (Code availability).

To further investigate cluster-free compositional change between disease conditions, we also performed cell density analysis, in which we compared the normalized cell density between AKI and CKD conditions through 2D kernel estimates using Cacoa Package. Z scores were calculated to identify the regions that showed significant differences in cell density.

To validate the direction of modules inferred from human data, we performed joint alignment of the human and mouse trajectories. The individual trajectories inferred separately from these two species (Slingshot, described above) were aligned to generate a joint trajectory using CellAlign (https://github.com/shenorrLab/cellAlign) with parameters winSz = 0.1 and NumPts = 1000. The collection groups (timepoints from injury) derived from mouse data were then projected to human cells based on the joint trajectory. The genes that were conserved or divergent between the two species were specified as overlapping/distinct gene sets that were tested for significance based on a gam model inferred from the trajectory (see above).

RNA velocity analyses

Spliced and unspliced reads were counted from Cell Ranger BAM files for each snCv3 run using velocyto97 (v.0.17.17) and using the GRCh38 gene annotations prepackaged with the Cell Ranger pipeline. Repetitive elements were downloaded from the UCSC genome browser and masked from these counts. Corresponding loom files were loaded into R using the SeuratWrappers function ReadVelocity and converted to Seurat objects using the as.Seurat function. aPT or aTAL trajectory populations were then subset and RNA velocity estimates were calculated using scVelo98 (v.0.2.4) through a likelihood-based dynamical model. Velocity embeddings on the trajectory UMAPs were visualized using the pl.velocity_embedding_stream function. Latent times based on transcriptional dynamics predicted from splicing kinetics were computed and the top 300 dynamical genes were plotted using the pl.heatmap function. Top likelihood genes were computed for each TAL module to identify potential drivers for these states. Spliced versus unspliced or latent time scatter plots were generated using the pl.scatter function.

GRN analyses

GRNs associated with TAL trajectory modules were constructed using Celloracle (v.0.9.1) with the default parameters outlined in the provided tutorials (https://morris-lab.github.io/CellOracle.documentation). The base GRN was first constructed from SNARE2 accessible chromatin data. Co-accessible peaks across cell types identified using Cicero (v.1.8.1) were linked to genes through their TSS peaks to identify accessible promoter/enhancer DNA regions. Peaks were then scanned for transcription-factor-binding motifs (gimme.vertebrate.v5.0) to generate a base GRN. snCv3 data were then used to identify TAL state-specific GRNs. To ensure that relevant genes were used, we included genes that varied across the aTAL trajectory (Supplementary Table 17), showed dynamic module-specific transcription from scVelo analyses (Supplementary Table 21), were variably expressed across TAL cells (Pagoda2) or that were associated with differential transcription factor activities (Supplementary Table 20). GRN inference through regularized machine learning regression models was performed to prune inactive (insignificant or weak) connections and to select active edges associated with regulatory connections within each module or state, retaining the top 2,000 edges ranged by edge strength. Network scores for different centrality metrics were then calculated and visualized using Celloracle plotting functions. For in silico transcription factor perturbation analyses, target gene expression was set to 0 and resultant gene expression values were extrapolated or interpolated using the default parameters of Celloracle and according to the provided tutorial. Stromal GRN construction was performed as indicated above, except using a gene subset that included variable STR genes identified using Pagoda2; subclass level 3 markers for FIB, aFIB, MyoF (adjusted P < 0.05); or transcription factors with expression detected in at least 2.5% of nuclei (SNARE2) and having binding sites that were differentially active between STR subclasses (P < 0.05). To ensure BMP target SMADs were represented, SMAD1/5/8 were also included.

SLIDE-seq2

Puck preparation and sequencing

Tissue pucks were prepared from fresh frozen kidney tissue either embedded in Optimal Cutting Temperature (OCT) compound or frozen in liquid nitrogen and sequenced20,99 according to a step-by-step protocol (https://doi.org/10.17504/protocols.io.bvv6n69e). Libraries were sequenced on the NovaSeq S2 flowcell (NovaSeq 6000) with a standard loading concentration of 2 nM (read structure: read 1, 42 bp; index 1, 8 bp; read 2, 60 bp; index 2, 0 bp). Demultiplexing, genome alignment and spatial matching was performed using Slide-seq tools (https://github.com/MacoskoLab/slideseq-tools/releases/tag/0.1).

Deconvolution

We used Giotto100 (v.1.0.3) for handling the slide-seq data and RCTD101 (v.1.2.0) for the cell type deconvolution. As only reference tissue was used for slide-seq, all degenerative states as well as PapE, NEU, B and N were removed from the snCv3 Seurat object prior to deconvolution. The Seurat object was randomly subsampled to have at most 3,000 cells from each level 2 (l2) subtype and the level 1 (l1) subclasses of ATL and DTL were merged. For each data source, that is, HuBMAP or KPMP (Supplementary Table 2), the counts from all beads across all pucks were pooled and deconvolved hierarchically: first, beads with less than 100 UMIs and genes detected in less than 150 beads were removed. Then, the broad l1 subclass annotations in the Seurat object were used to deconvolve all beads (gene_cutoff = 0.0003, gene_cutoff_reg = 0.00035, fc_cutoff = 0.45, fc_cutoff_reg = 0.6, manually adding REN in the RCTD gene list and merging ATL and DTL subtypes as TL). The prediction weights were normalized to sum to 100 per bead. Beads for which one cell type had a relative weight of 40% or higher were classified as that l1 subclass. Then, for each l1 subclass, all classified beads were further deconvolved using the l2 annotation of that subclass, as well as the remaining subclass l1 annotations (same parameters as l1). Note that, for each l2 deconvolution, the bulk parameters in RCTD were fitted using all beads and then the RCTD object was subsetted to only contain the selected beads for the l2 deconvolution. Classification at subclass l2 was done similar to l1 with the maximum relative weight cut-off of 30% or 50% depending on the stringency needed for an analysis (50% for Figs. 2c,f and Extended Data Fig. 4b and 30% in other analyses). For plotting gene counts, the scaling was performed with the command normalizeGiotto(gObj, scalefactor = 10000, log_norm = T, scale_genes = T, scale_cells = F). The marker gene dot plots were plotted using the DotPlot function in Seurat (v.4.0.0).

Cell type interaction

Coarse cell–cell interactions can be revealed by looking for cell types that tend to be in close proximity. For each puck, we created a neighbourhood graph based on Delaunay triangulation in which each bead is connected by an edge to at least one other neighbouring bead, provided that their distance is smaller than 50 µm. For each pair of cell types, we count the number of times they are connected by edges. Then, the cell type labels are randomly permuted 2,500 times to form the null distribution of the number of connections. The expected number of connections between pairs of cell types is calculated from this simulation and the proximity enrichment is defined as the ratio of the observed over the expected frequency of connections. The network construction and enrichment analysis were performed using Giotto’s createSpatialNetwork and cellProximityEnrichment commands, respectively. Those beads with maximum level 2 weight less than 30% were removed. We further excluded spurious beads that were outside of the visual boundary of the tissue (only for the pucks of which the names start with ‘Puck_210113’) by manually specifying straight lines that follow the tissue boundary. For cortical pucks (Supplementary Table 2), M-PC, C-PC and IMCD labels were relabelled as PC; M-TAL and C-TAL as TAL; and EC-DVR was merged into EC-AEA. Other medullary and cycling subtypes were removed. For medullary pucks, M-PC and C-PC were relabelled as PC; M-TAL and C-TAL as TAL; all DTL subtypes as DTL; and EC-AEA was merged into EC-DVR. Other cortical and cycling subtypes were removed.

To generate the proximity plots in Extended Data Fig. 4, the enrichment values for each cell type pair were averaged across all pucks from the same region and plots were generated using the R package ggGally (v.2.1.2). For the cortex and medulla, respectively, only the connections with mean enrichment values higher than 0.7 and 0.8 were plotted.

10x Visium spatial transcriptomics

Preparation, imaging and sequencing

Human kidney tissue was prepared and imaged according to the Visium Spatial Gene Expression (10x Genomics) manufacturer’s protocol (CG000240, Visium Tissue Preparation Guide) and as previously described102. Nephrectomy (n = 6), AKI (n = 6) and CKD (n = 11) samples were sectioned at 10 µm thickness from OCT-compound-embedded blocks. These 23 samples represent 22 participants because 2 samples (1 cortex and 1 medulla) were obtained from the same participant with CKD. A Keyence BZ-X810 microscope equipped with a Nikon ×10 CFI Plan Fluor objective was used to acquire H&E-stained bright-field mosaics, which were subsequently stitched. mRNA was isolated from stained tissue sections after permeabilization for 12 min. Released mRNA was bound to oligonucleotides in the fiducial capture areas. mRNA was then reverse-transcribed and underwent second strand synthesis, denaturation, cDNA amplification and SPRIselect cDNA cleanup (Visium CG000239 protocol) as part of library preparation. Sequencing was performed on the Illumina NovaSeq 6000 system103.

Gene expression analysis

Space Ranger (v.1.0 or higher) with the reference genome GRCh38 was used to perform expression analysis, mapping, counting and clustering. Summary statistics and quality-control metrics are included in Extended Data Fig. 5 and Supplementary Table 2. Normalization was performed using SCTransform104. Final data processing was performed in Seurat (v.3.2.3). Expression feature plots depict the intensity of transcript expression in each spot. In each Visium sample, the outermost layer of spots was eliminated from comparative analyses if the edge was manually cut by a razor.

Deconvolution

Using Seurat (v.3.2.0), a transfer score system was used to assess and map the proportion of signatures arising from each 55 µm spot. The transfer score reflects a probability between each spot’s signature and its association with a given snCv3 subclass (level 2). The highest probability transfer scores have the highest proportion mapped within each spatial transcriptomics spot pie graph. For cell type feature plots (Figs. 2g and 3f and Extended Data Fig. 7i), subclass level 2 cell type transfer scores were mapped to convey the proportion of signature underlying each spot. For cell state feature plots (Fig. 3b), instead of mapping subclass level 2 cell types, the aEpi cell state annotated in snCv3 was mapped across all spots in the samples. We summed the proportion of signatures arising from all cell types corresponding to each of the 6 cell states in all spots of all samples (Fig. 3a). The proportions of cell state were compared across nephrectomy, AKI and CKD samples using Fisher’s exact tests.

Colocalization of epithelial, immune and stromal cells

In all spots across all samples, we categorized spots into healthy, adaptive or degenerative epithelial cell states on the basis of the highest proportion of epithelial cell state signature as calculated in Fig. 3a. For stromal or immune cell type colocalization, we first selected spots with non-zero transfer scores of each cell type in all 23 samples. The presence of stromal or immune cell signature was considered colocalized with an epithelial cell if its stromal or immune transfer score exceeded its mean transfer score across all selected spots. An odds ratio was calculated for colocalization between the healthy, adaptive and degenerative epithelial cell state with stromal or immune cell signature.

Cell state marker expression

To compare marker gene expression associated with the healthy, adaptive and degenerative cell states (Fig. 3d), we first categorized a subset of spots from AKI and CKD samples into 1 of 5 predominant cell types: POD, PT, TAL, CD or FIB. For the PT, TAL and fibroblasts, a spot was selected if the highest proportion of its signature (level 1 mapping) corresponded to one of these cell types. For the CD subset, a spot was selected if the sum of level 1 mapping proportions for the PC and IC contributed most to its signature. POD spots were defined by the presence of a minimum of 20% signature arising from the level 1 POD label. Once the subsets of PT, TAL, fibroblast, CD and POD spots were selected, each spot was further divided into healthy, adaptive or degenerative cell state groups based on the highest proportion of cell state signature as calculated in Fig. 3a. For PODs, the presence of EC-GC signature was considered to be a degenerative equivalent given that a loss of POD markers was associated with an observed gain in EC-GC signatures within DKD samples.

Niche analysis

To examine the diversity of cell types colocalizing with TAL epithelial cells, we selected spots with more than 20% TAL signature and in which the highest proportion of signature arose from level 1 TAL mapping. Using Seurat clustering methodology, selected spots were reclustered after Seurat label transfer scores were substituted in lieu of gene expression. Spots with similar proportions of signature arising from TAL cell types and states, stromal cells and immune cells were clustered into 13 niches. Niches were mapped over the 23 kidney samples and the marker gene expression in each niche was determined. To depict the relative proportion of each cell type, the transfer score average was first computed in each niche. Next, a z score for each cell type was calculated across the niches.

Histological validation

To determine whether the 74 snCv3 subclasses (level 2) were appropriately mapped to histological structures, the proportion of signature in each spot was compared to a histologically validated set of six unsupervised clusters defined by Space Ranger102 (Extended Data Fig. 5a). These six unsupervised clusters (glomerulus, PT, loop of Henle, distal convoluted tubule, connecting tubule and collecting duct, and the interstitium) had an overall alignment of 97.6% with the underlying histopathologic structures in the H&E image. In each sample, regions of cortex and medulla were defined by histological evaluation, including the presence of glomeruli. Of the 23 samples, 18 samples were composed of only cortex, 4 samples were a combination of cortex and medulla and 1 sample was completely medulla.

Label-free and multifluorescence large-scale 3D imaging

Kidney biopsy cores frozen in OCT from patients with AKI or CKD enrolled in KPMP were used for label-free imaging followed by multiplexed-fluorescence large-scale 3D imaging as outlined in the protocol (https://doi.org/10.17504/protocols.io.9avh2e6) and described in a recent publication27. Frozen biopsies were sectioned to a thickness of 50 µm using a cryostat and then immediately fixed in 4% fresh paraformaldehyde (PFA) for 24 h and subsequently stored at 4 °C in 0.25% PFA.

The first step in imaging consists of label-free imaging with multiphoton microscopy to collect autofluorescence and second harmonic images of the unlabelled tissue mounted in non-hardening mounting medium. Imaging was conducted using a Leica SP8 confocal scan-head mounted to an upright DM6000 microscope. For large-scale imaging of tissues at the sub-micrometer resolution, the Leica Tile Scan function was used to collect a mosaic of smaller image volumes using a high-power, high-numerical aperture objective. Leica LASX software (v.3.5) was then used to stitch these component volumes into a single image volume of the entire sample. The scanner zoom and focus motor control were set to provide voxel dimensions of 0.5 × 0.5 μm laterally and 1 μm axially.

Labelling of tissue for fluorescence microscopy was preceded by washing in phosphate-buffered saline (PBS) and blocking with PBS with 0.1% Triton X-100 (MP Biomedical) and 10% normal donkey serum (Jackson Immuno Research). Antibodies for indirect immunofluorescence were applied first for 8–16 h at room temperature, followed by washing cycles of PBS with 0.1% Triton X-100. An incubation cycle with secondary antibodies occurred next, followed by washing and finally application of directly labelled antibodies. Antibodies targeting markers for tubular cells and structures (aquaporin-1, uromodulin, F-actin) and immune cells (myeloperoxidase, CD68, CD3, siglec 8) were used, in addition to nuclei labelling using DAPI (Supplementary Table 35). After the final washing cycles, the tissue was mounted in Prolong Glass (Thermo Fisher Scientific).

Confocal microscopy was conducted using a Leica ×20/0.75 NA multi-immersion objective (adjusted for oil immersion), with excitation sequentially provided by a solid-state laser launch with laser lines at 405 nm, 488 nm, 552 nm and 635 nm. Images in 16 channels (emission spectra collected by PMT detectors adjusted for the following ranges: 410–430 nm, 430–450 nm, 450–470 nm, 470–490 nm, 500–509 nm, 510–519 nm, 520–530 nm, 530–540 nm, 570–590 nm, 590–610 nm, 610–630 nm, 631–651 nm, 643–664 nm, 664–685 nm, 685–706 nm and 706–726 nm) were collected for each focal plane of each panel of the 3D mosaic. The resulting 16-channel image was then spectrally deconvolved (by linear unmixing using the Leica LASX linear unmixing software) to discriminate the eight fluorescent probes in the sample. Validation of the linear unmixing was described previously27.

Confocal immunofluorescence microscopy

Human kidney tissue samples from the cortex or medulla were fixed in 4% PFA, cryopreserved in 30% sucrose and frozen in OCT cryomolds, and were cut into 5 μm sections. The sections were post-fixed with 4% PFA for 15 min at room temperature, blocked in blocking buffer (1% BSA, 0.2% skimmed milk, 0.3% Triton X-100 in 1× PBS) for 30 min at room temperature and then immunofluorescence microscopy was performed, first by overnight incubation at 4 °C with primary antibodies, followed by labelling with secondary antibodies. The primary antibodies included NRXN-1β, TUJ1, collagen I and III, synapsin-1, NPSH-1, SLC14A2, UMOD, CD31, CD34, CD11b, PROM1, KIM1, VCAM1, AQP1, AQP2, CD45 and S100 (Supplementary Table 36). After washing, labelling with the secondary antibodies was performed using Alexa-488-conjugated goat anti-mouse IgG, Cy3-conjugated goat anti-rabbit IgG or Cy5-conjugated donkey anti-goat IgG at room temperature for 1 h. After washing, the sections were counterstained with DAPI for nuclear staining. Images were acquired with a Nikon 80i C1 confocal microscope.

In situ hybridization

Human kidney tissues were sectioned with 3 μm from formalin-fixed, paraffin-embedded (FFPE) blocks. In situ detections of PROM1, CST3 and EGF mRNA transcripts were performed with the use of RNAscope Probes Hs-PROM1 (311261, Advanced Cell Diagnostics), Hs-CST3 (528181, Advanced Cell Diagnostics), and Hs-EGF (605771, Advanced Cell Diagnostics) and RNAscope kit (322330, Advanced Cell Diagnostics) according to the manufacturer’s protocol. RNAscope Positive Control Probe Hs-UBC (310041, Advanced Cell Diagnostics) was used as a positive control. A horseradish-peroxidase-based signal amplification system (322310, RNAscope 2.0 HD Detection Kit-Brown, Advanced Cell Diagnostics) was used to hybridize with target probes followed by DAB staining. The sections were then counterstained with haematoxylin (3535-16, RICCA Chemical Company). Positive staining was determined by brown dots. After rehydrating, the sections were immersed in periodic acid solution (0.5%, P7875, Sigma-Aldrich) for 5 min, rinsed in three changes of distilled water, incubated with Schiff’s reagent (3952016, Sigma-Aldrich) for 15 min and then rinsed in running tap water for 5 min. Nuclei were counterstained with haematoxylin 2 (220-102, Thermo Fisher Scientific) for 2 min. The sections were then rinsed in water, dehydrated in alcohol, cleared in xylene and finally mounted with Cytoseal XYL (8312-4, Thermo Fisher Scientific).

Tissue cytometry and in situ cell classification

Tissue cytometry and analysis were conducted using the Volumetric Tissue Exploration and Analysis (VTEA) software (v.1.0a-r9). VTEA is a 3D image processing workspace that was developed as a plug-in for ImageJ105. The version of VTEA, which includes the supervised and unsupervised labelling of cells and combining spatial and features based gating strategies, used here is available at GitHub (https://github.com/icbm-iupui/volumetric-tissue-exploration-analysis) and through the FIJI updater. In this analytical pipeline, each individual nucleus was segmented using intensity thresholding and connected component segmentation built into VTEA and ImageJ. Each surveyed nucleus became a surrogate for a cell, to which the location and marker staining around or within the nucleus could be registered. This captured information could be used to classify cells on the basis of marker intensity or spatial features using scatterplot displays that enable various gating strategies and statistical analysis, including export as .csv files of all segmented cells and the associated features106. Cells classified on the basis of marker intensity are summarized in Supplementary Table 37. Gated cells were mapped back directly into the image volumes, which enabled immediate validation of the gates. Moreover, direct gating on the image could be performed, which could trace all of the cells within the chosen region-of-interest back to the data display on the scatter plot. Thus, cell classification could also be performed based on direct annotation of regions-of-interest (ROIs) within the image volumes. Annotated ROIs were determined by the pixel-wise agreement between 3 of 4 experts who performed annotation on each biopsy specimen separately.

Using tissue cytometry, 14 cell classes were defined based on the following features: (1) PT cells: AQP1+ cells in cortex ± brush border staining. (2) C-TAL cells: UMOD+ cells in cortex. (3) Glomerular cells (which encompass PODs, glomerular endothelium and mesangial cells) annotated ROIs based on morphology and F-actin staining. (4) Cortical large and medium vessel cells: annotated ROIs based on morphology and F-actin staining. (5) Cortical distal nephron cells (distal tubules (CD), connecting tubules (CNT) and collecting ducts (C-CD) or cortical distal nephrons): AQP1UMOD and annotated ROIs based on unique morphology in cortex. (6) M-TAL cells: UMOD+ cells in the medulla. (7) DTL: AQP1+ cells in the medulla. (8) Medullary collecting ducts: AQP1UMOD and annotated ROIs based on unique morphology in the medulla. (9) Vascular bundles in the medulla: annotated ROIs based onunique morphology in the medulla and F-actin staining. (10) Neutrophils: MPO+ cells. (11) Activated macrophages: MPOCD68+ cells. (12) T cells: CD3+ cells. (13) Cells in altered regions: annotated ROIs based on loss of (unrecognizable) tubular morphology, expanded interstitium, increased fibrosis (by second harmonic generation imaging) and cell infiltrates. (14) Not determined: unable to be classified on the basis of the above criteria.

Using such an approach,1,540,563 cells were labelled from all the biopsies used in this analysis.

3D neighbourhood building and representation

3D neighbourhoods were calculated for every cell in each biopsy using VTEA and a radius of 25 μm (50 voxels in x and y and 25 voxels in z). We reasoned the largest measurable neighbourhood/niche in our 3D approach is limited by the 50 μm thickness of the sections imaged (z dimension). Thus, per Nyquist sampling, the radius used was about 25 μm, which is consistent with previous approaches107,108,109. For each 3D neighbourhood, VTEA was used to calculate the features: fraction-of-total and sum of each labelled cell per neighbourhood. A list of neighbourhoods, positions in 3D and their features was exported by biopsy sample as .csv files.

Neighbourhood visualization and statistical analysis

CSV files of neighbourhoods by biopsy sample were generated in VTEA and imported into R (v.4.0.4), parsed for the sum of each labelled cell and monotypic neighbourhoods removed. These features were scaled by Z-standardization and used for Louvain community detection (R packages FNN (v.1.1.3) and igraph (v.1.2.6)) and t-SNE manifold projection (R package Rtsne (v.0.15)). To understand the interactions within neighbourhoods, pairwise interactions by neighbourhood were tallied and plotted on a chord plot (R package: circlize (v.0.4.12)) and Pearson’s correlation coefficients were calculated and plotted (R packages Hmisc (v.4.5.0) and corrplot (v.0.84)). Subclasses of neighbourhoods, those with at least one cell with a specific label were selected and plotted as network plots (R package igraph (v.1.2.6)) with edges in CD3 and Altered neighbourhoods scaled at 40% of all other subclasses to facilitate visualization. All scripts are provided as an annotated RStudio notebook file (.rmd).

Plots and figures

UMAP, feature, dot and violin plots for snCv3, scCv3, SNARE2 and Visium data were generated using Seurat. Correlation plots were generated using the corrplot package. Genome coverage plots were performed using Signac. Plots for 3D cytometry and neighbourhood analysis were generated in R with circlize, ggplot2 and igraph.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Source link