Organization of the human intestine at single-cell resolution - Nature - Alert Breaking News

Tissue collection and processing

This study complies with all relevant ethical regulations and was approved by the Washington University Institutional Review Board and the Stanford University Institutional Review Board. Human bowel tissues were procured from deceased organ donors. Written informed consent was obtained from the next-of-kin for all donor participants.

Organs were preserved according to surgical protocols to prepare the bowel for transplantation. In brief, a large-bore cannula was placed into the infrarenal aorta before circulatory arrest. Immediately after circulatory cessation, the abdominal viscera was rapidly cooled by flushing ice-cold HTK preservation solution through the aortic cannula and by packing the abdominal cavity with ice. The bowel was kept cold throughout the transport and dissection process until samples could be frozen for long-term storage. Tissue samples collected from the designated bowel sites were preserved by snap freezing in liquid nitrogen for snRNA-seq/scATAC, and embedded and frozen in optimal cutting temperature compound (OCT) for CODEX.

Array creation

Imaging data were collected from four human donors, each of whom constitutes a dataset. Each dataset includes two arrays of tissues that were imaged together on the same coverslip with four tissues per array: colon (sigmoid, descending, transverse and ascending), and small intestine (ileum, mid-jejunum, proximal jejunum and duodenum). Tissues were individually frozen in OCT moulds and then cut and assembled into arrays of four tissues with known directionality such that a cross-section of each tissue would be achieved cutting the block at once. Arrays were constructed on the cryostat and sectioned at a width of 7 μm.

Registration of samples with HuBMAP common coordinate framework

We have also registered these blocks within HuBMAP’s tissue registration in a common coordinate framework⁶⁶. In brief, male and female 3D reference objects for 11 organs including the small bowel and colon were created using Visible Human Project datasets. Using standard surgical anatomical landmarks used to collect the eight bowel sites, the tissue blocks were registered to the reference objects. The anatomical landmarks used for small bowel segments were as follows: (1) descending duodenum to the right of the pancreas head; (2) 5 cm beyond the ligament of Treitz in the jejunum; (3) 200 cm of the jejunum beyond the ligament of Treitz in the mid-bowel; and (4) 5 cm proximal to the ileocecal valve for terminal ileum. Five centimetres of bowel at each site was collected, representing approximately 20 g of tissue at each site. For the colon, the following landmarks used: (1) the right colon midway between the ileocecal valve and the hepatic flexure; (2) the transverse colon midway between the hepatic and splenic flexures; (3) the left colon midway between the splenic flexure to the appearance of the sigmoid mesentery; and (4) the sigmoid colon midway to the rectosigmoid junction where the taenia coli ceased.

CODEX antibody conjugation and panel creation

CODEX multiplexed imaging was performed according to the CODEX staining and imaging protocol previously described⁸. Antibody panels were chosen to include targets that identify subtypes of intestinal epithelium and stromal cells, and cells of the innate and adaptive immune system. Detailed panel information is provided in Supplementary Table 7. Each antibody was conjugated to a unique oligonucleotide barcode, after which the tissues were stained with the antibody–oligonucleotide conjugates and we validated that the staining patterns matched the expected patterns already established for immunohistochemistry within positive control tissues of the intestine or tonsil. Similarly, haematoxylin and eosin morphology staining was used to confirm the location of marker staining. First, antibody–oligonucleotide conjugates were tested in low-plex fluorescence assays and the signal-to-noise ratio was also evaluated at this step, then they were tested all together in a single CODEX multicycle.

CODEX multiplexed imaging

The tissue arrays were then stained with the complete validated panel of CODEX antibodies and imaged⁸. In brief, this entails cyclic stripping, annealing and imaging of fluorescently labelled oligonucleotides complementary to the oligonucleotide on the conjugate. After validation of the antibody–oligonucleotide conjugate panel, a test CODEX multiplexed assay was run, during which the signal-to-noise ratio was again evaluated, and the optimal dilution, exposure time and appropriate imaging cycle was evaluated for each conjugate (Supplementary Table 7). Finally, each array underwent CODEX multiplexed imaging. Metadata from each CODEX run are provided in Supplementary Table 8.

CODEX data processing

Raw imaging data were then processed using the CODEX Uploader for image stitching, drift compensation, deconvolution and cycle concatenation. Processed data were then segmented using the CODEX Segmenter or CellVisionSegmenter, a watershed-based single-cell segmentation algorithm and a neural network R-CNN-based single-cell segmentation algorithm, respectively. The donor sample from individual B001 was segmented using the CODEX Segmenter (with parameters tuned as described previously⁸), whereas all of the other donor samples were segmented using CellVisionSegmenter. CellVisionSegmenter has been shown to work well with segmenting both dense and diffuse cellular tissues with CODEX data⁶⁷. CellVisionSegmenter is an open-source, pretrained nucleus segmentation and signal quantification software based on the Mask region-convolutional neural network (R-CNN) architecture. Indeed, it was designed and trained on manually annotated images from CODEX multiplexed imaging data within our own group. Consequently, the only parameter that was altered was the growth pixels of the nuclear mask, which we found experimentally to work best at a value of 3. Despite this, no segmentation algorithm does a perfect job of segmentation in cases in which the boundaries identified may capture portions of neighbouring cells, and nuclear segmentation can limit quantified signal that whole-cell segmentation might be able to capture (although also imperfect from lack of consistent cell membrane stains), which has been discussed in more detail in reviews and primary sources of segmentation^{67,68,69,70,71,72,73,74}. For this reason, we performed an in-depth analysis of the different data normalization techniques and unsupervised clustering methods for robust identification of cell types in CODEX intestine data. This analysis revealed that there is some segmentation noise that could affect cell type identification if using manual gating, but using z-normalization, Vortex or Leiden-based unsupervised clustering, over-clustering the data and manually overlaying resultant cell type clusters to the image results in cell type identification at a much higher fidelity⁷⁵.

Both the CODEX Uploader and Segmenter software can be downloaded from GitHub (https://github.com/nolanlab/CODEX), and the CellVisionSegmenter software is available at GitHub (https://github.com/bmyury/CellVisionSegmenter or https://github.com/michaellee1/CellSeg). After the upload, the images were again evaluated for specific signal: any markers that produced an untenable pattern or a low signal-to-noise ratio were excluded from the ensuing analysis. Uploaded images were visualized in ImageJ (https://imagej.nih.gov/ij/).

Cell type analysis

B001 and B004 cell type identification was performed according to methods developed previously⁷⁵. In brief, nucleated cells were selected by gating DRAQ5, Hoechst double-positive cells, followed by z-normalization of protein markers used for clustering (some phenotypic markers were not used in the unsupervised clustering). Cells positive (z > 1) for greater than 35 fluorescent markers were removed from the data. Then the data were overclustered with X-shift (https://github.com/nolanlab/vortex) or Leiden-based clustering with the scanpy Python package (v.1.9.1). These processing steps were performed based on an in-depth analysis of normalization techniques and unsupervised clustering algorithms used for CODEX multiplexed imaging data of the intestine⁷⁵. These are not new approaches and many packages have emerged for integrating these clustering algorithms into libraries such as Squidpy⁷⁵. Clusters were assigned a cell type on the basis of average cluster protein expression and the location within image. Impure clusters were split or reclustered after mapping back to the original fluorescence images.

Cell type annotation using STELLAR

CODEX cell type labels were transferred to other donors using the STELLAR framework for annotating spatially resolved single-cell data, as we described previously¹². In brief, STELLAR is a geometric deep learning method that uses the spatial and molecular cell information to transfer cell type labels across different datasets. While SpaGCN⁷⁶ and Spage2vec⁷⁷ can leverage spatial and molecular data to annotate cells, these methods are unsupervised clustering methods. As such, they require manual effort to assign cell clusters to corresponding cell types, and can also require additional manual effort for multiple iterative cluster refinements. On the other hand, STELLAR automatically identifies existing cell types and discovers cell types without requiring manual effort, so we used STELLAR as the preferred method. To apply STELLAR, we used B004 donor data as the labelled training dataset as defined in STELLAR. This dataset was curated and annotated by clustering, merging, reclustering, subclustering and assigning cells to cell types based on average marker expression. Each cluster’s purity and accuracy were confirmed by location of the cell within CODEX images with corresponding fluorescence images and also H&E staining. We used B004, B005 and B006 donor dataset to train STELLAR and transfer annotations to all other donor datasets that were treated as unannotated test datasets in the STELLAR framework. We did not expect to find any new cell types across different donors so we used STELLAR to identify existing cell types across donors. All datasets were z-normalized as suggested previously⁷⁵. We then manually confirmed the quality of STELLAR’s cell type assignments in all donor datasets by looking at average marker expression profiles of predicted cell types. We found that protein marker distributions match expert hand-annotated profiles¹².

CODEX cell type percentage normalization

We normalized the cell type percentage to the overall cell category for three reasons. First, we captured a cross-section of each individual intestinal piece; however, a few of the intestinal pieces (4 out of 64) were not representative cross-sections. Second, with a fixed CODEX imaging window, we attempted to capture the full mucosal area, as this contains the majority of cell types identified by antibodies in our panel, which led to variable amounts of the muscularis externa captured. Third, it is useful to normalize by overall cell type to understand how cell type compositions change across the three compartments.

CODEX same-cell density analysis

CODEX same-cell density was analysed by taking the average distance of the five nearest neighbours of the same cell type for each individual cell in each imaged region. This average distance was divided by the most diffuse distance for same cell types. The most diffuse same cell distance was calculated by taking the total number of cells of a given cell type divided by the total area of the region. Thus, numbers that are closer to 1 are least dense and numbers closer to 0 are more dense with cells of the same cell type.

Neighbourhood identification analysis

Neighbourhood analysis was performed as described previously^10,27. In brief, this analysis involved (1) taking windows of cells across the entire cell type map of a tissue with each cell as the centre of a window; (2) calculating the number of each cell type within this window; (3) clustering these vectors; and (4) assigning overall structure on the basis of the average composition of the cluster (Extended Data Fig. 2a). In brief, determining window size cut-offs for cellular neighbourhood analysis is an important metric to be chosen. In general, smaller window sizes will identify more local or microstructures, whereas larger window sizes will lead to the identification of similarly composed structures that require a larger window size. For our neighbourhood analysis here, we chose to have window size cut-offs by selecting the ten nearest neighbours around a given cell. This number has worked well to identify conserved compositions representing a cell’s immediate microenvironment or local neighbours in other tissues^{10,12,27,75,78,79}. We chose this strategically to look at the microstructures at the neighbourhood level because we also were curious to understand how these microstructures work together and come together to form macrostructures of the intestine at a multihierarchical scale. In general, the size of the structure does not directly relate to the window size choice, but instead relates to how compartmentalized the conserved cell types are within a given structure. Neighbourhoods were overclustered to 30 clusters. These clusters were mapped back to the tissue and evaluated for cell type enrichments to determine overall structure and merged down into 20 unique structures.

Neighbourhood conservation analysis

To determine neighbourhood compositional (cell type) conservation across the small bowel and colon, neighbourhood enrichment scores were found separately for both the small bowel and colon samples across all donors. This enrichment score is the average cell type percentage within the average of the neighbourhood cluster divided by the average cell type percentage for all cells in the tissue. The colon enrichment scores for each cell type were subtracted from the small bowel scores to provide the heat map that was ordered both in terms of the greatest absolute sum of differences for both neighbourhood and cell type in conservation. Thus, differences in cell type enrichment for a given neighbourhood indicate that this cell type is not compositionally conserved across the small and large intestines for this neighbourhood structure.

Community and tissue unit identification analysis

Communities were determined similarly to how multicellular neighbourhoods were determined with some small differences. In brief, the cells in the neighbourhood tissue maps were taken with a larger window size of 100 nearest neighbours. These windows were taken across the entirety of the tissue and the vectors then clustered with k-means clustering and overclustering with 20 total clusters. These clusters were mapped back to the tissue and evaluated for neighbourhood composition and enrichment to determine overall community type. This same approach was applied to communities with a window size of 300 nearest neighbours.

Hierarchical intestine structural graphs

Each hierarchical level was connected to the next by either what it contributed to the largest in the next level, or what made up at least 15% of this next hierarchy. The percentage of each feature in the level was represented by size of the shape. The shape and colour combination correspond to the level and feature respectively. The size of the connecting line represents the amount that a feature contributes to the next feature.

Spatial context maps

Spatial context maps were created as previously described²⁷. In brief, the spatial context analysis of neighbourhood–neighbourhood or community–community associations has some similarities with our method to identify multicellular neighbourhoods but also contains a few key differences. First, windows of neighbourhoods or communities were calculated with either 100 or 300 nearest neighbours respectively. For the neighbourhood spatial context maps, only the cells classified in the mucosa tissue unit were included in the analysis, whereas the community spatial context maps included all cells from all tissues. Once windows were calculated (number of each cell type within the window), then the combinations representing more than 85% of the neighbourhoods within that window were selected as a combination. This combination informs about prominent associations of neighbourhoods or communities in the window, a feature that we termed spatial context. Combinations were then counted and represented in size by the size of the black circle underneath the square neighbourhood combination and only combinations that had a greater frequency than 0.1% of all combinations were plotted for visualization purposes. Each combination was then connected to each combination containing it with another combination in sublayers of the graph. For example, if the Secretory Epithelial community was found to represent 85% of a window, then this would be its own combination (purple triangle). If it is found as a combination with the Adaptive-Immune-Enriched community, then it is connected with this in a combination (purple triangle and orange triangle). Similarly, if it is found as a combination with the Plasma-Cell-Enriched community, then it is connected with this in a combination (purple triangle and yellow triangle). The single combination is therefore connected (through black edges) to these combinations (nodes) in the spatial context graph. Similarly, combinations of the Secretory Epithelial and Adaptive-Immune-Enriched communities (purple triangle and orange triangle) derive from this combination (for example, Secretory Epithelial, Adaptive-Immune-Enriched and Follicle (purple, orange and blue triangles)).

Tissue motif identification

We used a previously developed method to identify significantly associated cellular neighbourhood-neighbourhood motifs²⁷. In brief, motif identification uses segmented areas of the tissue where multiple cells of the same community are co-located, instead of individual cells (Extended Data Fig. 5a). The tissue network graphs therefore represent shared edges between instances of communities. To create a tissue graph for each treatment group, we took the union of the tissue graphs of each unique imaging region. We then created a null-set as the graph of the set of cell neighbourhood or community assignments by a sequence of valid transpositions of cell neighbourhood or community assignments. Permuting neighbourhood or community assignments and fixing the number of vertices created the maximum entropy null distribution. Only two chains with at least five instances were considered. To identify significant chains, P values were Bonferroni corrected by multiplying by twice the number of tests conducted in each comparison group (small bowel and CL).

Tissue dissociation and nucleus isolation for single-nucleus experiments

Nuclei were isolated using the OmniATAC protocol⁸⁰. Isolation of nuclei was carried out on wet ice. A total of 40–60 mg of flash-frozen tissue was gently triturated and thawed in 1 ml HB (lysis) buffer (1.0341× HB stable solution, 1 M DTT, 500 mM spermidine, 150 mM spermine, 10% NP40, cOmplete Protease Inhibitor, Ribolock) for 5 min. Tissue was then dounced 10 times with pestle A and 20 times with pestle B, or until there was no resistance from either pestle. The samples were then filtered through a 40 μm cell strainer (Falcon, 352340) and the resulting homogenate was transferred to a prechilled 2 ml LoBind tube. The samples were centrifuged in a 4 °C fixed-angle centrifuge for 5 min at 350 rcf to pellet the nuclei. After centrifugation, all but 50 μl of supernatant was removed. Then, 350 μl HB was added to the nucleus pellet for a total volume of 400 μl and the nuclei were gently resuspended using a wide bore pipet. One volume of 50% iodixanol (60% OptiPrep (Sigma Aldrich; D1556), diluent buffer (2 M KCl, 1 M MgCl₂, 0.75 M Tricine-KOH pH 7.8), water) was added and the resulting solution was gently triturated. Next, 600 μl of 30% iodixanol was carefully layered under the 25% mixture. Finally, 600 μl of 40% iodixanol was layered under the 30% mixture. The sample was then centrifuged for 20 min at 3,000 rcf in a 4 °C swinging-bucket centrifuge, resulting in a visible band of nuclei. The supernatant was aspirated down to within 200–300 μl of the nucleus band. The nucleus band was then collected at 200 μl and transferred to a fresh 1.5 ml tube. The sample was diluted with one volume (200 μl) of resuspension buffer (1× PBS, 1% BSA, 0.2 U μl⁻¹ Ribolock). The nucleus concentration was determined using the Countess II FL Automated Cell Counter (Thermo Fisher Scientific; AMQAF1000).

snATAC–seq

snATAC-seq targeting 9,000 nuclei per sample was performed using the Chromium Next GEM Single Cell ATAC Library & Gel Bead Kit v1.1 (10x Genomics, 1000175) and the Chromium Next GEM Chip H (10x Genomics, 1000161) or Chromium Single Cell ATAC Library & Gel Bead Kit (10x Genomics, 1000110). Libraries were sequenced on the Illumina NovaSeq 6000 system (1.4 pM loading concentration, 50 × 8 × 16 × 49 bp read configuration) targeting an average of 25,000 reads per nucleus.

Single-nucleus transcriptome sequencing using snRNA-seq

snRNA-seq targeting 9,000 nuclei per sample was performed using Chromium Next GEM Single Cell 3’ Reagent Kits v3.1 (10x Genomics, 1000121) and the Chromium Next GEM Chip G Single Cell Kit (10x Genomics, 1000120). Libraries were pooled and sequenced on the Illumina NovaSeq 6000 system (read 1 = 28 bp, i7 index = 8 bp, i5 index = 0 bp, read 2 = 91 bp read configuration) targeting an average of 20,000 reads per nucleus.

Single-nucleus multiome experiments

snMultiome experiments targeting 9,000 nuclei per sample were performed using Chromium Chromium Next GEM Single Cell Multiome ATAC + Gene Expression (10x Genomics, 1000283). ATAC (read 1 = 50 bp, i7 index = 8 bp, i5 index = 24 bp, read 2 = 49 bp read configuration) and RNA (read 1 = 28 bp, i7 index = 10 bp, i5 index = 10 bp, read 2 = 90 bp read configuration) libraries were sequenced separately on the Illumina NovaSeq 6000 system.

Initial processing of single-nucleus data

Initial processing of scATAC-seq data was performed using the Cell Ranger ATAC Pipeline (https://support.10xgenomics.com/single-cell-atac/software/pipelines/latest/what-is-cell-ranger-atac) by first running cellranger-atac mkfastq to demultiplex the bcl files and then running cellranger-atac count to generate scATAC fragments files. Initial processing of snRNA-seq data was performed using the Cell Ranger Pipeline (https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger) by first running cellranger mkfastq to demultiplex the bcl files and then running cellranger count. As nuclear RNA was sequenced, data were aligned to a pre-mRNA reference. Initial processing of the mutiome data, including alignment and generation of fragments files and expression matrices, was performed using the Cell Ranger ARC Pipeline.

Colocalization analyses

The CODEX data were used to compute and compare the CLQ between all cell-type pairs in the small bowel versus the colon. The colocalization quotient between cell type A and cell type B was calculated using the expression ({{rm{CLQ}}}_{{rm{Ato B}}}=frac{{C}_{{rm{Ato B}}},/,{N}_{{rm{A}}}}{{N}_{{rm{B}}},/,(N-1)}) (ref. ⁸¹), where C_A→B is the number of cells of cell type A among the defined nearest neighbours of cell type B. N is the total number of cells and N_A and N_B are the numbers of cells for cell type A and cell type B. Student’s t-tests, adjusted for multiple-hypothesis testing, were used identity statistically significant different CLQs between the small bowel and colon.

Ligand and receptor analyses

The FANTOM5 database⁸² and 12 more literature-supported experimentally validated ligand and receptor pairs were used to obtain the final list of validated ligand receptor pairs⁸³. Non-parametric Wilcoxon rank-sum tests were used to identify differentially expressed ligands and receptors in the small bowel versus the colon (adjusted P-value cut-off = 0.05).

Quality control, dimensionality reduction and clustering of snATAC–seq data

The snATAC fragments files were loaded into R (v.4.1.2) using the createArrowFiles function in ArchR⁵². Quality-control metrics were computed for each cell and any cells with TSS enrichments less than 5 were removed. Cells were also filtered on the basis of the number of unique fragments sequenced using a unique fragment cut-off that was defined for each individual sample. The sample-specific cut-offs enabled us to account for differences in sequencing depth between samples and ranged from 1,000 to 10,000, with the most common cut-off being 3,000 fragments per cell. After quality control and filtering, doublet scores for all multiome cells and all non-multiome snATAC cells were computed using the ArchR function addDoubletScores with k = 10, knnMethod = “UMAP” and LSIMethod = 1. An ArchR project was then created and doublets were filtered with filterDoublets with a filter ratio of 1.2. A small number of snATAC samples did not separate into distinct clusters of expected cell types and were removed from downstream analysis. For the non-multiome scATAC cells, an IterativeLSI dimensionality reduction was generated using addIterativeLSI, with iterations = 3, sampleCellsPre = 25000, dimsToUse = 1:25 and varFeatures = 15000. Next, clusters were added with addClusters with resolution = 1.5, nOutlier = 20, seed = 1, sampleCells = 40000 and maxClusters = 40, and the resulting clusters were divided into groups on the basis of whether the cells exhibited high gene activity scores⁵², a measure of accessibility within and around a gene body, for known immune, stromal or epithelial marker genes.

Quality control, dimensionality reduction and clustering of snRNA-seq data

After running Cell Ranger, the raw_feature_bc_matrix produced by Cell Ranger was read into R with the Seurat⁸⁴ function Read10X. The data were filtered to remove nuclei with fewer than 400 unique genes per nucleus, greater than or equal to 10,000 genes per nucleus, greater than or equal to 20,000 counts per nucleus, or greater than or equal to 5% mitochondrial RNA per nuclei. To limit the contributions of ambient RNA, we also filtered out nuclei that did not have at least three times as many counts as the median number of counts in all droplets, which should reflect the median number of counts in an empty droplet, as the large majority of droplets are empty. This limits the fraction of RNA that can come from ambient RNA in droplets that are included in the dataset. DoubletFinder⁸⁵ was run for each non-multiome snRNA sample using principal components 1–20. nExp was set to 0.076 × nCells²/10,000, pN to 0.25 and pK was determined using paramSweep_v3, and cells that were classified as doublets were removed before downstream analysis. snRNA data for both multiome and non-multiome cells was corrected for possible ambient RNA correction using DecontX⁸⁶.

The remaining cells from all samples were merged into a single seurat object. The data was then processed using Seurat’s standard pipeline⁸⁴. First, NormalizeData was run using the method LogNormalize and scale.factor of 10,000. Variable features were identified with Seurat’s findVariableFeatures using the vst method and 2,000 features. ScaleData was then run on all genes and principal components were computed with RunPCA. To account for batch effects between different donors in clustering, the RunHarmony⁸⁷ function was run with the data being grouped by donor. RunUMAP was then used to generate a UMAP dimensionality reduction from the harmony reduction and the cells were clustered by first using FindNeighbors with reduction = “harmony” and dims = 1:20 and then FindClusters with a resolution of 1. Expression of marker genes in the resulting clusters was then used to label clusters as epithelial, stromal or immune for downstream analysis.

Annotation of single-nucleus data

The snATAC and snRNA data were annotated in the following groups: epithelial duodenum, epithelial jejunum, epithelial ileum, epithelial colon, stromal and immune. The snATAC data were further divided into separate projects for multiome and nonmultiome cells and the RNA annotations were used for the multiome cells while the snATAC-only cells were annotated separately. For the ATAC data, the cells in each of these compartments were subset into a new ArchR project. addIterativeLSI was then run for each compartment. addHarmony was then run using the LSI dimensions as input. After computation of the harmony dimensions, the cells were clustered using addClusters and a UMAP was computed on the basis of the harmony coordinates using addUMAP. Clusters were annotated by examining gene activity scores of known marker genes. Marker genes were used for initial annotation of cell types including BEST4⁺ enterocytes (BEST4, OTOP2), goblet cells (MUC2, TFF1, SYTL2), immature goblet cells (KLK1, RETNLB, CLCA1), stem cells (RGMB, SMOC2, LGR5, ASCL2), tuft cells (SH2D6, TRPM5, BMX, LRMP, HCK), enteroendocrine cells (SCGN, FEV, CHGA, PYY, GCG), cycling transit-amplifying cells (TICRR, CDC25C) and Paneth cells (LYZ, DEFA5).

Within the immune compartment, we identified clusters of CD4⁺ and CD8⁺ T cells (CD2, CD3E, IL7R, CD4, CD8), B cells (PAX5, MS4A1, CD19), plasma cells (IGLL5, AMPD1), natural killer cells (SH2D1B), macrophages/monocytes (CD14) and mast cells (HDC, GATA2, TPSAB1) (Fig. 4c and Extended Data Fig. 6d). Natural killer cells and T cells from one donor clustered separately from the other T cells in the snRNA data (Extended Data Fig. 6e), possibly because this donor was much younger (24 years) than the others in this study (Supplementary Table 1). Smooth muscle/myofibroblast clusters exhibited high expression of ACTA2, MTH11 and TAGLN. Villus fibroblasts exhibited high expression of WNT5B and some crypt fibroblasts exhibited high expression of RSPO3.

For the snRNA cells, cells were divided into six Seurat objects from the compartments listed above. Cells from the immune and stromal compartments were run through a pipeline analogous to the pipeline listed above for initial clustering (NormalizeData, ScaleData, RunHarmony, FindNeighbors and FindClusters). For cells in the four epithelial groups, we integrated the data using a different approach, first running SCTransform on the epithelial cells from each sample with assay = “decontXcounts”, method = “glmGamPoi”, and vars.to.regress = c(“percent.mt”, “percent.ribo”). We then ran the Seurat functions SelectIntegrationFeatures with features = 3000, PrepSCTIntegration, FindIntegrationAnchors using reference-based integration and dims 1–30, and finally IntegrateData. We then ran RunPCA on the integrated data followed by FindNeighbors and FindClusters to cluster the resulting integrated data.

In addition to removing probable doublets based on simulating doublets as described above, we also identified clusters with expression of markers from multiple lineages (for example, stromal and immune) during downstream clustering and annotation. For example, some cells that initially clustered with immune cells expressed higher levels of stromal genes than would be expected. For these cases, we took the following approach: we first clustered all of the cells initially classified as immune cells and identified marker genes for each cluster. We next compared the marker genes to a previously published list of colon marker genes²⁸ to nominate clusters that may not contain singlet immune cells. Next, we moved these cells to the stromal or epithelial compartments, in which we clustered them with all of the cells that were initially classified as epithelial or stromal cells. In this case, if the cells had high expression of immune marker genes when compared to stromal cells, we reasoned that they were most likely immune/stromal doublets and removed the cluster of cells before downstream analysis. After initial annotation of epithelial cells, enteroendocrine cells in the snRNA data from all samples were integrated according to the SCTransform-based integration approach by running SCTranform on all epithelial cells from all samples and integrating all epithelial cells using reference-based integration as described above. After integration of all cells, we subset only the enteroendocrine cells and then computed the principal components using RunPCA and identifyied clusters using FindNeighbors and FindClusters. Known subtypes of enteroendocrine cells were then annotated based on expression of marker genes (Fig. 4 and Extended Data Fig. 6h). To annotate the MUC6⁺, MUC5B⁺ and exocrine cells, we subset the integrated duodenum data and computed the principal components using RunPCA and identified clusters using FindNeighbors and FindClusters. Exocrine cells were annotated on the basis of expression of CELA3B and CPB1, and MUC5B⁺ and MUC6⁺ cells were annotated on the basis of expression of MUC5B and MUC6 (Fig. 4h). Comparisons in cell-type abundance between regions of the intestine were done using the packages scCoda²⁹ and Milo³⁰.

Peak calling for single-nucleus data

Peak calling was performed with MACS2 using ArchR⁵². Peak sets were defined independently for epithelial cells, stromal cells and immune cells. For epithelial cells, we wanted to generate a union peak set that captured both cell-type-specific and location-specific peaks in the epithelial compartment. To accomplish this, we divided the cells into groups, generated pseudobulk replicates for each group, called peaks on the pseudobulk replicates, generated a reproducible peak set for each group using the peaks called for the pseduobulk replicates, and then iteratively merged the peak sets for each group into a union peak set using the approach implemented in ArchR. Groups used for epithelial peak calling were the cells from each epithelial cell type—with all enteroendocrine subtypes combined, and MUC5B⁺, MUC6⁺, exocrine and unknown cells combined—in the four main regions of the intestine. Tuft cells from the small intestine were merged into a single group owing to the low number of tuft cells in the dataset. For defining the immune and stromal peak sets, cells were divided by cell type, but not location, as there were fewer cells in these compartments. Finally, an additional peak set for all cell types in the immune, stromal and epithelial cell types was defined for determining marker peaks for linkage-disequilibrium score regression.

Integration of snRNA and snATAC data

To assign snRNA profiles to the non-multiome snATAC samples, the snRNA and snATAC datasets from the four primary regions of the intestine (duodenum, jejunum, ileum and colon) were integrated separately using the ArchR function addGeneIntegrationMatrix with reducedDims = “Harmony” and useMatrix = “GeneScoreMatrix”.

Nomination of regulatory TFs in single-nucleus data

We next identified TF regulators according to the ArchR manual for identifying TF regulators for each region, with a correlation cut-off of 0.5. TFs that met the criteria for regulators in any of the four primary regions of the intestine are plotted in Fig. 5b. This process was performed separately for the multiome datasets and the integrated singleome datasets. Cell types with few cells in each region of the intestine (for example, L cells) were combined into a single group regardless of location of origin, leading to the final cell type groupings on the x-axis of Fig. 5b. Regulators were identified separately for cells in the immune and stromal compartment with the final cell type groupings indicated on the x-axis of Extended Data Fig. 10g,h. TF footprints were computed using the ArchR functions getFootprints and plotFootprints.

Analysis of absorptive differentiation trajectories in single-nucleus data

Absorptive differentiation trajectories for each main section of the colon (duodenum, jejunum, ileum and colon) were inferred by running the ArchR function addTrajectory with the trajectory set as harmony clusters moving from Stem to TA2 to TA1 to immature enterocytes to enterocytes and reducedDims set to the harmony dimensions. To identify variable peaks along the trajectory, a matrix of accessibility in all peaks along the trajectory was first generated with getTrajectory with useMatrix = “PeakMatrix” and log2Norm = TRUE. Peaks with variance > 0.9 in any of the four regions were then identified with the function plotTrajectoryHeatmap with varCutOff = 0.9, returnMatrix = TRUE, scaleRows = FALSE and maxFeatures = 100000. The four matrices returned by getTrajectory were then concatenated into a single matrix and the matrix was subset to include only peaks that met the variance criteria of 0.9 in at least one of the four regions and had an absolute difference in magnitude of at least 0.2. Row z-scores for the resulting matrix were computed using the ArchR function .rowZscore. The resulting row z-scores were k-means clustered using the function kmeans with the number of clusters set to 7 and iter.max = 500. Two clusters of peaks did not show a characteristic pattern and were not included in Fig. 6. Hypergeometric enrichment of motifs in marker peaks was computed with peakAnnoEnrichment and the resulting P values are plotted in Extended Data Fig. 10k. Variable genes along the trajectory were identified with an analogous method, using GeneExpressionMatrix for the multiome data or GeneIntegrationMatrix (for the separate snATAC and snRNA data) instead of PeakMatrix when running getTrajectory and plotTrajectoryHeatmap. For gene expression, log2Norm was set to TRUE when running getTrajectory and genes were filtered to include only those with an absolute difference in magnitude of at least 0.5. Row z-scores of the resulting matrices were again k-means clustered using the function kmeans with the number of clusters set to 7 and iter.max = 500. Enrichment of KEGG pathways in these clusters of genes was determined using the limma function kegga⁸⁸, and the resulting unadjusted P values are plotted in Extended Data Fig. 10l. Plots of gene expression versus pseudotime were generated using the ArchR function plotTrajectory with the default parameters, including using imputeWeights added by addImputeWeights. TFs with correlated motif activity and RNA expression were identified with correlateTrajectories as outlined in the ArchR manual. The row z-scores of the smoothed expression of TFs along the pseudotime trajectories is plotted in Fig. 6d. TFs that were correlated with expression in any of the four trajectories were included in the heat map in Fig. 6d. Peaks correlated with gene expression were identified with addPeak2GeneLinks. In Fig. 6e, the set of peaks that were correlated with ETV6 expression with a correlation of at least 0.4 in one of the four main intestinal regions was determined. For TMPRSS15, a correlation cut-off of 0.55 was used to show the most correlated peaks in the figure. The smoothed trajectory peak accessibility for each of these peaks was then plotted along the differentiation trajectory. This process was performed separately for the multiome datasets and the integrated singleome datasets.

Cell-type-specific linkage-disequilibrium score regression

Linkage-disequilibrium score regression is a method that aims to distinguish heritability from confounding factors such as population stratification and cryptic relatedness. To run cell-type-specific linkage-disequilibrium score regression, we first computed marker peaks for coarse cell types in our dataset. To do this, we added cell type annotations to the full ArchR project with all cells and then defined a peak set for this object by running addGroupCoverages with groupBy = “CellType” followed by addReproduciblePeakSet and addPeakMatrix. We next defined less granular cell types by merging all myofibroblast clusters and pericytes into a single group, all non-villus fibroblast clusters into a single group, all non-stem absorptive epithelial cells into a single group, cycling TA cells into a single group, all enteroendocrine cells except for EnteroendocrineUn into a single group, and lymphatic endothelial and endothelial cells into a single group. We determined marker peaks for the resulting groups of cells with getMarkerFeatures and ten selected peaks with getMarkers with cutOff = “FDR <= 0.1 & Log2FC >= 0.5”. Marker peaks from cell types with very few cells were not included (T cells, unknown, exocrine, secretory specialized MUC5B⁺, DC, ILC, adipocytes, neurons, EnteroendocrineUn) The resulting peaks were then lifted over to hg19 from hg38. We then followed the linkage-disequilibrium score regression tutorial (https://github.com/bulik/ldsc/wiki) for cell-type-specific analysis^89,90,91. We used summary statistics from a number of UKBB traits (https://nealelab.github.io/UKBB_ldsc/downloads.html) related to the intestine, including non-cancer illness code;self-reported: crohns disease, non-cancer illness code;self-reported: ulcerative colitis, non-cancer illness code;self-reported: malabsorption/coeliac disease, BMI as well as traits less clearly related to the intestine including non-cancer illness code;self-reported: hypertension and diagnoses—main ICD10: K02 dental caries. Coefficient P values from ldsc are plotted in the heat map in Fig. 6. Significance was determined by correcting the coefficient P values for the number of cell types tested with Bonferroni correction with the R function p.adjust.

Processing of single-nucleus data for ligand–receptor analysis and snRNA-CODEX integration

The ligand–receptor analysis and scRNA-seq CODEX integration were performed with an initial dataset consisting of samples from donors B001, B004, B005 and B006 before collection of the remaining data. This dataset was annotated similarly to the full dataset with the following differences. First, all analysis was carried out in R v.4.0.2. Second, the quality-control cut-offs used were slightly different, with the requirement to have three times as many RNA counts as an empty droplet not implemented in the initial dataset. Third, when running doubletFinder, pK was set as 0.09 instead of running paramSweep_v3. Fourth, scTransform was not used in the subclustering analysis for the epithelial compartments. Instead, Seurat’s standard normalize and scale pipeline followed by Harmony was run to compute an integrated dimensionality reduction that was then clustered using Seurat’s findNeighbors and FindClusters for the immune and stromal cells as well as the epithelial cells from the four main regions of the intestine. As the ligand–receptor analysis involved making predictions that we later attempted to validate using Molecular Cartography, this analysis could not be redone with the remaining dataset.

Molecular Cartography validation of ligand–receptor pairs

Small intestine (duodenum) and colon (sigmoid/descending) samples from donors B004, B008 and B009 were analysed for spatial transcriptomics. The cryosections (thickness, 10 µm) from the same OCT-embedded tissue arrays were placed onto the glass slides provided by Resolve Biosciences for Molecular Cartography assay. The Molecular Cartography assay was performed by the Resolve Biosciences team with their optimized protocol ‘human colon v1.3’ and targeting a panel of total 100 transcripts of interest, including 63 genes from our ligand–receptor predictions (Supplementary Table 3) and 37 genes for cell type annotation (Supplementary Table 9). Segmentation was performed using DAPI signal with cellPose (https://github.com/MouseLand/cellpose; v.2.0.5) and followed by Baysor (https://github.com/kharchenkolab/Baysor; v.0.5.1). Gene counts were quantified per cell and cells were removed by area (<50 or >8,000 pixels) and total gene counts (<2). Manual cell type annotation was performed on the basis of the marker gene expression after Leiden clustering (Scanpy). As only 37 genes were used as cell type markers and the sample number (n = 6) is limited, only 20 cell types were identified in this dataset. To validate the ligand–receptor repair (Supplementary Table 3), we first match the cell type with snRNA-seq annotation. Only 58 out of 152 predictions have matching cell types. We compared ligand and receptor expression (log transformed) between the colon and small intestine in their predicated cell types (one sided Wilcoxon rank-sum test). P values were corrected for multiple testing using the Benjamin–Hochberg procedure. In total, 15 pairs have consistent higher expression in the colon compared with in the small intestine (both adjusted P < 0.05; Supplementary Table 6). Permutation tests were used to assess whether the success rate (15.5%, 9 out of 58) of our validation is higher than random. Gene labels were swapped and the same DEG procedure was repeated 10,000 times.

snRNA-seq/CODEX single-cell matching and integrational analysis using MaxFuse

snRNA-seq cells and CODEX cells were matched and downstream integrative analysis was performed using MaxFuse, of which the methodology details were described previously⁴¹. In brief, MaxFuse is an algorithm that matches cells across different single-cell modalities by linear assignment, using both shared (when available) and unshared features, and implements signal boosting steps (for example, graph smoothing and meta cell construction) to enable matching cells across weakly correlated modalities (for example, RNA to protein). Although various methods are available for integration tasks on modalities with robust sharing information (such as scRNA/scATAC)^84,92,93, when such tasks involve integration between protein and sequencing modalities, with much weaker shared features available (<60 versus thousands), a specialized method is needed⁹⁴. We applied MaxFuse to match snRNA-seq cells to CODEX cells. Cells that were previously annotated as B, T, monocyte, macrophage, plasma, goblet, endothelial, enteroendocrine, smooth muscle and stromal cells were used during this integration process, whereas other cell types were not used owing to limited sharing information across modalities. Subsequently, a shared co-space was calculated to embed both modalities, with visualization of the embedding (first 20 MaxFuse-components) using UMAP. To evaluate single-cell matching performance across RNA to protein modality, we used cell type annotation accuracy (for example, a single CODEX plasma cell matched to a snRNA-seq plasma cell) as a proxy, and both CL and small bowel matching achieved >90% accuracy. Using the single-cell-level pairing information, we transferred the transcriptome expression profile to each individual CODEX cell, and subsequently performed analysis of the DEGs across various CODEX cellular neighbourhoods. The DEGs were selected with the function FindAllMarkers in the R package Seurat, with the parameters only.pos = TRUE, min.pct = 0.3, logfc.threshold = 0.25. The genes with adjusted P < 0.05 and shared across the CL and small bowel datasets were shown on the heat map. Gene Ontology enrichment analysis was performed for individual CNs with the DEGs, by using the PANTHER database. We have uploaded the code that we used to perform the MaxFuse matching for the snRNA-seq and CODEX datasets within the paper here (https://github.com/shuxiaoc/maxfuse/tree/main/Archive/hubmap_nature). This includes both the dataset preparation and analysis features of the code.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Source link