Single-cell genomic variation induced by mutational processes in cancer

Generation and culture of human mammary epithelial cell lines

The wild-type human mammary epithelial cell line 184-hTERT L9 (SA039) and isogenic 184-hTERT TP53 knockout (SA906) cell line, generated from 184-hTERT L9, were cultured as previously described18,19,40 in Mammary epithelial cell growth basal medium (MEBM) (Lonza) supplemented with the SingleQuots kit (Lonza), 5 μg ml⁻¹ transferrin (Sigma-Aldrich) and 10 μM isoproterenol (Sigma-Aldrich). Additional truncation mutations (Supplementary Table 4) of BRCA1 (SA1054: c.[427_441+36delGAAAATCCTTCCTTGGTAAAACCATTTGTTTTCTTC];[437_441+8delCCTTGGTAAAACC]; SA1292: c.[71_75delGTCCC];[=]) and BRCA2 (SA1056: c.[6997delG];[6997_6998delGT]; SA1188: c.[6997_6999delGT];[=]; SA1055: c.[3507_3522delinsGA];[3509_3520delinT]) (hg19) were introduced by CRISPR–Cas9 nuclease (pX330 hSpCas9) with an RFP reporter gene using Mirus TransIT LT1 transfection (Mirus Bio). Clonal populations were generated by flow sorting and propagating single RFP-positive cells. Genotypes were verified by TOPO cloning and Sanger sequencing of both alleles, protein expression by western blotting, and absence of off-target effects by sequencing of the top hits. SNV positions from Sanger sequencing data were annotated with information from GENCODE v.19 (ref. 41). Variant sequence and position were used to annotate variant calls with records from ClinVar 20200206_data_release42 and COSMIC v.91 (ref. 43). Although multiple BRCA2 homozygous loss-of-function alleles could be derived from 184-hTERTp53−/−;BRCA2+/− intermediates, only a single homozygous BRCA1 allele was retrieved from the 119 clones of 184-hTERTp53−/−;BRCA1+/− that were screened, emphasizing that even with a p53 deletion, full loss of BRCA1 is initially negatively selected. OV2295 cells44 were maintained in a 1:1 mix of Media 199 (Sigma-Aldrich) and MCDB 105 (Sigma-Aldrich) supplemented with 10% fetal bovine serum (FBS) under normoxic conditions. Cell lines were authenticated by short tandem repeat (STR) profiling and tested for mycoplasma by LabCorp.

Immunoblotting

184-hTERT cells were lysed directly in 1× Laemmli buffer supplemented with 7.5% β-mercaptoethanol and proteins were denatured at 95 °C for 15 min. Protein from 250,000 cells was resolved on a 4–15% acrylamide gel (Bio-Rad) or 3–8% Tris-acetate acrylamide gel (Novex) and transferred to a nitrocellulose membrane with Towbin transfer buffer overnight at 30 V at 4 °C. Blots were blocked with 5% milk in TBST for 1 h and incubated overnight at 4 °C with mouse anti-p53 (Santa Cruz SC-126, 1:500 in 5% bovine serum albumin (BSA)), mouse anti-BRCA1 (Santa Cruz SC-6954, 1:200 in 5% BSA), mouse anti-BRCA2 (Millipore OP95, 1:200 in 5% BSA) or goat anti-GAPDH (SC-48166, 1:500 in 5% BSA). Blots were washed five times for 5 min in TBST, incubated with anti-goat HRP-conjugated secondary antibody (Abcam ab6721, 1:5,000 in 5% BSA) for 1 h at room temperature and washed five times for 5 min, and the signal was imaged using Immobilon Western Chemiluminescent HRP Substrate (MilliporeSigma, WBKL20500) on an ImageQuant LAS 4000 (GE Healthcare) with ImageQuant TL software.

Verification of mutations in the 184-hTERT cell line

Genomic DNA was extracted from 184-hTERT cell lines and regions of interest of BRCA1 or BRCA2 were amplified by PCR. Amplicons were inserted into a pCR-TOPO vector and transformed into Escherichia coli using the TOPO TA cloning kit (Thermo Fisher Scientific). Colonies were selected, plasmid DNA was purified using the PureLink Quick Plasmid Miniprep kit (Thermo Fisher Scientific) and inserts were Sanger sequenced to assess CRISPR-induced mutations.

Acquisition of samples from patients and patient consent

Samples were acquired with informed consent, according to procedures approved by the Ethics Committees at the University of British Columbia. Patients with breast cancer undergoing diagnostic biopsy or surgery were recruited and samples were collected under protocols H06-00289 (BCCA-TTR-BREAST), H11-01887 (Neoadjuvant Xenograft Study), H18-01113 (Large-scale genomic analysis of human tumours) or H20-00170 (Linking clonal genomes to tumour evolution and therapeutics). HGSC samples were obtained from women undergoing debulking surgery under protocols H18-01652 and H18-01090. Banked HGSC and TNBC specimens were obtained at the Memorial Sloan Kettering Cancer Center following Institutional Review Board (IRB) approval and patient informed consent (protocols 15–200 (HGSC) and 18–376 (TNBC)). HGSC and TNBC clinical assignments were performed according to American Society of Clinical Oncology guidelines for ER, PR and HER2 positivity.

Xenografting

Fragments of tumours from patients were chopped finely with scalpels and mechanically disaggregated for one minute using a Stomacher 80 Biomaster (Seward Limited) in 1 ml cold DMEM/F-12 with glucose, l-glutamine and HEPES (Lonza 12-719F). Two hundred microlitres of medium containing cells or organoids from the resulting suspension was divided equally for transplantation into four mice. The remaining tissue fragments were viably cryopreserved in DMEM/F-12 supplemented with 47% FBS and 6% dimethyl sulfoxide (DMSO). Tumours were transplanted into mice as previously described (Eirew et al.) in accordance with SOP BCCRC 009. Female NOD/SCID/IL2rγ−/− (NSG) and NOD/Rag1−/−Il2rγ−/− (NRG) mice were bred and housed at the Animal Resource Centre (ARC) at the British Columbia Cancer Research Centre. For subcutaneous transplants, mechanically disaggregated cells and clumps of cells were resuspended in 150–200 µl of a 1:1 v/v mixture of cold DMEM/F-12:Matrigel (BD Biosciences). Female mice (8–12 weeks old) were anaesthetized with isoflurane and the cell clump suspension was transplanted under the skin on the left flank using a 1-ml syringe and 21-gauge needle. Mice were housed at 18–25 °C and 20–70% humidity, with a 12-h light cycle (on at 06:00; off at 18:00). All animal experimental work was approved by the animal care committee (ACC) and animal welfare and ethical review committee at the University of British Columbia (UBC) under protocol A19-0298.

Tissue processing

Xenograft-bearing mice were euthanized when the size of the tumours approached 1,000 mm³ in volume, according to the limits of the experimental protocol. The tumour material was excised aseptically and processed as described for the primary tumours. A section of tumour was fixed in 10% buffered formalin for 24 h, dehydrated in 70% ethanol and paraffin-embedded, and duplicate 1-mm cores were used to generate tissue microarrays for staining and pathological review. The remaining tumour was finely chopped and gently paddle-blended, and the released single cells and fragments were viably frozen in DMEM supplemented with 47% FBS and 6% DMSO.

Histopathology of PDX tumours

Deparaffinized 4-µm sections of tissue microarrays were stained with haematoxylin and eosin or with an anti-KRAS antibody (Lifespan Bioscience, LS-B4683, 1:50) using the Ventana Discovery XT platform and the UltraMap DAB detection kit. HGSC pathology was confirmed by an anatomical pathology resident at the University of British Columbia, under the supervision of a certified staff pathologist.

WGS

Genomic DNA was extracted from frozen tissue fragments using the DNeasy Blood and Tissue kit (Qiagen), and whole-genome libraries for 309 tumour–normal pairs were constructed and sequenced on the Illumina HiSeqX according to Illumina protocols, generating 100-bp paired-end reads for an estimated coverage of 40× (normal) and 80× (tumour). Sequenced reads were aligned to the human reference GRCh37 (hg19) using BWA-MEM.

Long-read sequencing

High-molecular weight (HMW) DNA was extracted from fresh and/or frozen tissue fragments using the MagAttract HMW DNA Kit (Qiagen) and size-selected using Blue Pippin for single long-molecule sequencing on the PromethION (Oxford Nanopore Technologies).

Generation of single-cell suspensions and nuclei for scDNA-seq

Viably frozen aliquots of patient tissues and PDX tumours were thawed and either homogenized and lysed using Nuclei EZ Buffer (Sigma) or enzymatically dissociated using a collagenase/hyaluronidase 1:10 (10×) enzyme mix (STEMCELL Technologies), as described previously3,18. Cells and nuclei were stained with CellTrace CFSE (Life Technologies) and LIVE/DEAD Fixable Red Dead Cell Stain (Thermo Fisher Scientific) in 0.04% BSA/PBS (Miltenyi Biotec 130-091-376) and incubated at 37 °C for 20 min. Cells were pelleted and resuspended in 0.04% BSA/PBS. This single-cell suspension was loaded into a contactless piezoelectric dispenser (Cellenone or sciFLEXARRAYER S3, Scienion) and spotted into open nanowell arrays (SmartChip, TakaraBio) preprinted with unique dual-index sequencing primer pairs. Occupancy and cell state were confirmed by fluorescent imaging and wells were selected for single-cell copy number profiling using the DLP+ method3. In brief, cell dispensing was followed by enzymatic and heat lysis. After cell lysis, tagmentation mix (14.335 nl TD buffer, 3.5 nl TDE1 and 0.165 nl 10% Tween-20) in PCR water was dispensed into each well, followed by incubation and neutralization. For BRCA1+/− cells, the tagmentation mix consisted of 10 nl TB1 buffer and 10 nl BT1 enzyme without Tween-20 in PCR water. Final recovery and purification of single-cell libraries were performed after eight cycles of PCR. Pooled single-cell libraries were analysed using the Agilent Bioanalyzer 2100 HS kit. Libraries were sequenced at the UBC Biomedical Research Centre on the Illumina NextSeq 550 (mid- or high-output, paired-end 150-bp reads), or at the Genome Sciences Centre on the Illumina HiSeq2500 (paired-end 125-bp reads) and Illumina HiSeqX (paired-end 150-bp reads). The data were then processed through a quantification and statistical analysis pipeline3.

Generation of 10X scRNA-seq data

The 184-hTERT cells were pelleted and gently resuspended in 200 µl PBS followed by 800 µl 100% methanol and incubated at −20 °C for 30 min to fix, dehydrate and shrink cells. PDX tumour fragments were dissociated into single cells using collagenase/hyaluronidase at 37 °C for 2 h for TNBC tumours, or with cold active Bacillus licheniformis protease (Creative Enzymes NATE0633) in PBS supplemented with 5 mM CaCl2 and 125 U ml⁻¹ DNase for HGSC tumours, as described previously45, with additional mechanical dissociation using a gentleMACS dissociator (Miltenyi Biotec). Cells were then pelleted, resuspended in 0.04% BSA/PBS and immediately loaded onto a 10X Genomics Chromium single-cell controller targeting 3,000 cells for recovery. Libraries were prepared according to the 10X Genomics Single Cell 3′ Reagent kit standard protocol. Libraries were then sequenced on an Illumina NextSeq 500/550 with 42-bp paired-end reads, or a HiSeq2500 v4 with 125-bp paired-end reads. 10X Genomics Cell Ranger 3.0.2 was used to perform demultiplexing, counting and alignment to GRCh38 and mm10.

Processing of bulk whole-genome data

SNV and SV calls for 121 HGSC samples were acquired from a previous study12. For new samples, reads were aligned to the hg19 reference genome using BWA-MEM. Processing proceeded as per the aforementioned study12 to maintain consistency.

SNVs were called with MutationSeq46 (probability threshold = 0.9) and Strelka47, and the intersection of calls from the two methods was retained; SNVs falling in blacklist regions were removed. The blacklist regions include the UCSC Genome Browser Duke and DAC blacklists, and regions in the CRG Alignability 36mer track with more than two mismatched nucleotides. SNVs were then annotated with OncoKB48 for variant impact.

SVs were called using deStruct49 and LUMPY50, and breakpoints called by both methods were retained. We then removed events meeting any of the following criteria: breakpoints falling in the blacklist regions described above; an inter-breakpoint distance of ≤30 bp; deletions of <1,000 bp; fewer than 5 supporting reads in the tumour sample; or any read support in the matched normal sample.
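
This filtering step amounts to a set of boolean predicates over consensus breakpoint calls. A minimal sketch in Python, assuming a pandas DataFrame with illustrative column names (the actual pipeline operates on deStruct/LUMPY outputs):

```python
import pandas as pd

def filter_sv_calls(svs: pd.DataFrame) -> pd.DataFrame:
    """Remove consensus SV events failing any of the criteria above.
    Column names here are illustrative, not from the original pipeline."""
    keep = (
        ~svs["brk1_in_blacklist"] & ~svs["brk2_in_blacklist"]      # blacklist regions
        & (svs["inter_breakpoint_distance"] > 30)                   # >30-bp separation
        & ~((svs["type"] == "deletion") & (svs["length"] < 1000))   # small deletions
        & (svs["tumour_read_support"] >= 5)                         # tumour support
        & (svs["normal_read_support"] == 0)                         # no normal support
    )
    return svs.loc[keep]
```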

Gene mutation enrichment analysis was performed using the hypergeometric test for SNVs, amplifications and deep deletions separately, comparing each signature stratum to all other samples.
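
For a single gene, the hypergeometric test compares the number of mutated samples in a stratum to the cohort-wide count; a minimal sketch using SciPy (the counts shown are illustrative):

```python
from scipy.stats import hypergeom

def enrichment_pvalue(n_samples, n_stratum, n_mutated, n_mutated_in_stratum):
    """P(X >= n_mutated_in_stratum) under a hypergeometric null."""
    # sf(k - 1) gives P(X >= k)
    return hypergeom.sf(n_mutated_in_stratum - 1, n_samples, n_mutated, n_stratum)

# Example: 40 of 200 samples fall in a stratum; a gene is mutated in 30
# samples overall, 12 of them within the stratum.
p = enrichment_pvalue(200, 40, 30, 12)
```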

Nanopore data analysis

For Nanopore sequence data, base calling and read alignment were performed using Guppy v.3 and Minimap2, respectively51,52. Reads likely to be derived from mouse were filtered by first aligning to a concatenated hg19 and mm10 reference, removing reads with alignments to mm10 and re-aligning the remaining reads to hg19. Signal artefact regions as well as alignments with a mapping quality of less than 60 were excluded from the final alignments. Alignments were then phased using the PEPPER-Margin-DeepVariant pipeline, after which WhatsHap was used to tag reads in the filtered alignments using the phasing information53,54. SV calling was performed using Sniffles (v.1.0.12) and cuteSV (v.1.0.11), each requiring at least 5 supporting reads, and the calls were merged using SURVIVOR to produce a union set of predicted variants55,56. Alignments and variants were visualized using IGV, Ribbon and the karyoploteR R package57,58,59.

HGSC and TNBC meta-cohort signature analysis

Signature analysis was performed according to a previous study10. The MMCTM model was run on the sample SNV and SV count matrices. The number of signatures to estimate in the integrated HGSC and TNBC cohort was chosen by running the fitting procedure described below for k = 2–16 for both SNV and SV signatures with the number of restarts set to 500, in which k is the number of signatures. We performed this step on approximately half the mutations in each sample, then computed the average per-mutation log likelihood on the other, held-out half of the mutations. The elbow curve method on log-likelihood values was used to select the final number of signatures to fit to the entire dataset.

To estimate MMCTM parameters on the full dataset, α hyper-parameters were set to 0.1. The model was initially fit to the data 1,500 times. Each restart was run for a maximum of 1,000 iterations or until the relative difference in predictive log likelihood on the training data was less than 10⁻⁴ between iterations. The restarts with the best predictive log likelihoods for SNVs and SVs were selected as seeds for the final fitting step. The model was again fit to the data 1,500 times. The model parameters for each restart were set to the parameters of the optimal models from the previous step described above, then run for a maximum of 1,000 iterations or until the relative difference in predictive log likelihood on the training data was less than 10⁻⁵ between iterations. The restart with the best mean rank of the SNV and SV predictive log likelihoods from this round was selected as the final model.
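
The two-stage restart scheme can be summarized schematically as follows; `fit` and `loglik` stand in for the MMCTM fitting and predictive log-likelihood routines (which are not reproduced here), and the sketch is simplified to a single count matrix:

```python
def best_restart(counts, fit, loglik, n_restarts, tol, init=None):
    """Return the restart with the highest predictive log likelihood."""
    best_ll, best_params = float("-inf"), None
    for _ in range(n_restarts):
        params = fit(counts, init=init, max_iter=1000, tol=tol)  # one restart
        ll = loglik(counts, params)
        if ll > best_ll:
            best_ll, best_params = ll, params
    return best_ll, best_params

# Stage 1: 1,500 random restarts, convergence at 1e-4 relative change.
# Stage 2: 1,500 restarts seeded at the stage-1 optimum, tolerance 1e-5.
```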

MMCTM estimated SNV signatures were matched to COSMIC signatures by solving the linear sum assignment problem for cosine distances between the MMCTM and COSMIC signatures v.3 (minus tobacco smoking-associated COSMIC SBS4) using the clue R package60.
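
In Python, the same one-to-one matching can be computed with SciPy's assignment solver in place of the clue R package; `estimated` and `cosmic` are assumed to be signature-by-channel matrices:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def match_to_cosmic(estimated: np.ndarray, cosmic: np.ndarray):
    """Match each estimated signature to a distinct COSMIC signature by
    minimizing the total cosine distance of the assignment."""
    cost = cdist(estimated, cosmic, metric="cosine")
    est_idx, cosmic_idx = linear_sum_assignment(cost)
    return list(zip(est_idx, cosmic_idx, cost[est_idx, cosmic_idx]))
```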

Samples were clustered by first applying UMAP61 to the normalized signature probabilities for the HRD SNV signature and all SV signatures with n_neighbors = 20 and min_dist = 0 to produce two-dimensional sample embeddings. Next, HDBSCAN62 was run on the sample embeddings with min_samples = 5, min_cluster_size = 5 and cluster_selection_epsilon = 0.75 to produce the sample clusters (strata).
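
A minimal sketch of this clustering step with the stated parameters, assuming `probs` is a samples × signatures matrix of normalized signature probabilities (umap-learn and hdbscan packages):

```python
import umap
import hdbscan

embedding = umap.UMAP(n_neighbors=20, min_dist=0.0).fit_transform(probs)
clusterer = hdbscan.HDBSCAN(min_samples=5, min_cluster_size=5,
                            cluster_selection_epsilon=0.75)
strata = clusterer.fit_predict(embedding)  # -1 labels unclustered samples
```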

Survival analysis

For each patient, the number of days between diagnosis and death or last follow-up was collected. Patients were segregated into groups, and a Kaplan–Meier curve was fitted for each group. Each cancer type was analysed separately and in two distinct grouping schemes. First, patients were split into HRD and ‘Other’ groups, in which the HRD group included patients whose cancers were identified as being in either the HRD-Dup or HRD-Del groups, and the ‘Other’ group included all other patients. Next, patients were grouped according to their assigned signature types: HRD-Dup, HRD-Del, TD or FBI.

DLP+ WGS quantification and analysis

Single-cell copy number, SNV and SV calls were generated using a previously described approach3, except that BWA-MEM63 was used to align DLP+ reads to the hg19 reference genome. The genome was divided into 500-kb bins, and GC-corrected read counts were calculated for each bin. These read counts were then input into HMMCopy64 to produce integer copy number states for each bin.

DLP+ data filtering

Cells were retained for further analysis if the cell quality was at least 0.75 (ref. 3), and they passed both the S-phase and the contamination filters. The contamination filter uses FastQ Screen65 to tag reads as matching human, mouse or salmon genomes. If more than 5% of reads in a cell are tagged as matching the mouse or salmon genomes, then the cell is flagged as contaminated. The S-phase filter uses the cell-cycle state Random Forest classifier from ref. 3 and removes cells for which S-phase is the most probable state. The HGSC and TNBC cells were also filtered to remove small numbers of contaminating diploid cells.

Finally, cell filtering was performed to remove putative early and late S-phase cells that passed the initial S-phase filter. This involved two steps. First, a cell phylogeny was built with sitka37 and the minimal phylogeny branches in which cycling cells clustered were manually identified; cells in these branches were removed. Second, cells were clustered according to their copy number profiles, and manually identified clusters of cycling cells were removed.

We removed potentially problematic genomic bins, defined as those with a mappability score of 0.99 or below or contained in the ENCODE hg19 blacklist66, from our copy number results.

To detect SNVs and SVs in each dataset, reads from all cells in a DLP+ library were merged to form ‘pseudobulk’ libraries. SNV calling was performed on these libraries individually using MutationSeq46 (probability threshold = 0.9) and Strelka (score > 20) (ref. 47). Only SNVs that were detected by both methods were retained. For each dataset, the union of SNVs was aggregated, then for each cell and each SNV, the sequencing reads of that cell were searched for evidence of that SNV. SV calling was performed in a similar manner, by forming pseudobulk libraries, then running LUMPY50 and deStruct49 on each pseudobulk library, and retaining events that were detected by both methods. LUMPY and deStruct predictions were considered matched if the breakpoints matched in orientation and the positions involved were each no more than 200 nucleotides apart on the genome. Only deStruct predictions with a matching LUMPY prediction were retained. Sparse per-cell breakpoint read counts were extracted from deStruct using the cell identity of read evidence for each predicted breakpoint. SNV and SV calls were further post-processed according to a previous study12. When performing pseudobulk analysis on groups of cells, a breakpoint is considered present in a clone if at least one cell that constitutes the clone contains evidence of the breakpoint. A subsampling experiment determined that this approach has 80% power to recover breakpoints at a cumulative coverage of 5× (100–150 cells) (see Supplementary Note).

Analysis of mutation signatures in DLP+ data

Mutation signature probabilities were fit to DLP+ pseudobulk-derived SNV and SV counts for each patient using the MMCTM method and pre-computed mutation signatures from the HGSC and TNBC meta-cohort. Inference was performed as for the bulk sequencing data, until the relative difference in predictive log likelihood was less than 10⁻⁶ between iterations.

Identifying clones in DLP+ WGS by clustering copy number profiles

For most datasets, clones were detected by first using UMAP on per-cell GC-corrected read count profiles, producing a two-dimensional embedding of the cell profiles. We then ran HDBSCAN on the two-dimensional embedding from UMAP to detect clusters of cells with similar copy number profiles.

UMAP was run with min_dist = 0.0 and metric = “correlation”, whereas HDBSCAN was run with approx_min_span_tree = False, cluster_selection_epsilon = 0.2 and gen_min_span_tree = True. Dataset-specific UMAP and HDBSCAN parameter settings are listed in Supplementary Table 3.

Calculating cell ploidy

Cell ploidy was calculated as the most common copy number state across genomic bins, using the copy number states determined by HMMCopy.
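
A minimal sketch of this calculation, where `states` holds the integer HMMCopy state for each 500-kb bin of one cell:

```python
import numpy as np

def cell_ploidy(states: np.ndarray) -> int:
    """Ploidy as the modal copy number state across bins."""
    values, counts = np.unique(states, return_counts=True)
    return int(values[np.argmax(counts)])
```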

Identifying missegregated chromosomes

The approach taken to identify putative chromosome missegregation events is similar to that described previously3. Cells were split into groups corresponding to their clones, and clone copy number profiles were generated for each clone. Cells with ploidy not equal to the clone consensus ploidy were normalized to match the clone ploidy. Each cell copy number profile was then compared to the consensus profile of the clone to which the cell belongs, assigning an offset value to each genomic bin in each cell that represented the copy number difference between the cell and the clone-level consensus profile. For each chromosome in each cell, if a particular copy number difference (for example, −1 or +1) represented at least 75% of the chromosome, that chromosome was labelled as having a missegregation event.
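
A sketch of the per-chromosome decision rule, assuming the cell and clone profiles are pandas Series sharing a (chromosome, bin) MultiIndex and the cell profile has already been normalized to the clone ploidy:

```python
import pandas as pd

def missegregated_chromosomes(cell_cn: pd.Series, clone_cn: pd.Series) -> dict:
    """Return {chromosome: offset} for chromosomes where a single nonzero
    copy number offset covers at least 75% of bins."""
    offset = cell_cn - clone_cn
    events = {}
    for chrom, offs in offset.groupby(level="chromosome"):
        frac = offs.value_counts(normalize=True)
        nonzero = frac[frac.index != 0]
        if not nonzero.empty and nonzero.max() >= 0.75:
            events[chrom] = int(nonzero.idxmax())  # e.g. +1 gain, -1 loss
    return events
```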

Identifying CNA segments

Gain and loss segments in each cell were found by comparing the copy number state in each 500-kb bin to that cell’s ploidy. A copy number higher than ploidy was labelled as a gain, and a copy number lower than ploidy was labelled as a loss. A gain or loss segment is a set of consecutive bins with the same gain–loss label. Segments ≤1.5 × 10⁶ bp were excluded to reduce segments potentially resulting from noise in the HMMCopy copy number states.
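
The segment calling is a run-length encoding of per-bin gain/loss labels; a minimal sketch, with the ≤1.5 × 10⁶ bp exclusion expressed as a minimum of four 500-kb bins:

```python
import numpy as np

def cna_segments(states: np.ndarray, ploidy: int, min_bins: int = 4):
    """Return (start_bin, end_bin, label) tuples; label +1 = gain, -1 = loss."""
    labels = np.sign(states - ploidy)
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            if labels[start] != 0 and i - start >= min_bins:
                segments.append((start, i, int(labels[start])))
            start = i
    return segments
```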

Computing serriform variability scores in CNA breakpoints

For each dataset, consensus copy number profiles were generated for each clone. Copy number segments were identified as above for each consensus profile. Copy number segments were then identified for single-cell copy number profiles. The copy number profiles of each cell were normalized so that the adjusted cell ploidy matched the ploidy of the clone to which the cell belonged, using the following formula: cell_state = (cell_state / cell_ploidy) × clone_ploidy.

Cell copy number segments were matched to segments in the clone copy number profile as follows: for each segment in the clone copy number profile, the copy number states of the adjacent segments were inspected. If the clone segment state was less than both adjacent states, only cell segments whose state was less than both of the two adjacent clone segment states could be matched to that segment. If the clone segment state was higher than both adjacent states, only cell segments whose state was higher than both of the adjacent clone segment states could be matched to that segment. If the clone segment state was in between the two adjacent states, only cell segments whose state was in between the two adjacent clone segment states could be matched to that segment. Finally, each cell segment was matched to the compatible clone segment that it overlapped the most, in which compatibility means that the cell segment state met the criteria described above and the cell belonged to the relevant clone.

Next, clone segment breakpoints were aggregated across all clones. For each breakpoint, matched cell breakpoints were identified. Stable cell breakpoints (that is, those cell breakpoints that matched the clone-level breakpoint position) and unstable cell breakpoints (all other cell breakpoints) were queried for their raw GC-corrected read count values up to five bins to the left and right of the breakpoint position. Breakpoint noise values were computed as the mean absolute value of the difference between these values and the integer copy number state inferred by HMMCopy. For each clone-level breakpoint event, cells were removed if their breakpoint noise values were higher than a threshold value, which was computed as the mean noise value of the stable cell breakpoints.

Serration scores for each event were calculated by first computing the frequency of cell-specific breakpoint positions. Each cell breakpoint position was considered ‘rare’ if it occurred in less than 5% of cells with the considered event. The final serration score was computed as the fraction of event cells whose breakpoint position was considered ‘rare’.
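
A minimal sketch of the serration score for one clone-level breakpoint event, taking the per-cell breakpoint positions (bin indices) of the noise-filtered cells carrying the event:

```python
import numpy as np

def serration_score(cell_breakpoints: np.ndarray, rare_threshold: float = 0.05) -> float:
    """Fraction of event cells whose breakpoint position is 'rare'."""
    values, counts = np.unique(cell_breakpoints, return_counts=True)
    freq = dict(zip(values, counts / counts.sum()))
    return float(np.mean([freq[b] < rare_threshold for b in cell_breakpoints]))
```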

For comparing serration rates between cases, breakpoints with at least 100 cells and whose adjacent copy number segments totalled at least 20 Mbp (40 genomic bins) were retained. This was done to retain only those breakpoints for which serration could be reliably computed. As a result, SA605 was excluded from comparisons because it had fewer than 100 cells. A zero-inflated generalized linear mixed model with a beta response that accounted for case-specific and signature-type effects was fit to determine the effect of mutation signature type on serration scores.

Comparison of HLAMP copy number variance

HLAMPs were identified by first selecting 500-kb genomic bins in which at least 10 cells have a raw copy number (adjusted per-bin read counts) of at least 10. Copy number variance for each bin was calculated using the raw copy number adjusted for cell ploidy and clone: the raw copy number was first divided by the cell ploidy, and the mean clone raw copy number was then subtracted. The cell ploidy is the most common HMMCopy copy number state as described above, and the mean clone copy number was computed for each bin in each clone across all cells in that clone. Mean HLAMP copy number variance was calculated for each dataset across all HLAMP bins, and these values were compared between signature-type dataset groups.
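
A sketch of the ploidy- and clone-adjusted variance, assuming a long-format DataFrame with illustrative column names; taking the clone mean on the ploidy-scaled values is an assumption made here:

```python
import pandas as pd

def hlamp_bin_variance(df: pd.DataFrame) -> pd.Series:
    """df columns: cell_id, clone, bin, raw_cn, ploidy (names illustrative)."""
    df = df.copy()
    df["scaled"] = df["raw_cn"] / df["ploidy"]          # adjust for cell ploidy
    df["adjusted"] = df["scaled"] - df.groupby(["clone", "bin"])["scaled"].transform("mean")
    return df.groupby("bin")["adjusted"].var()          # per-bin variance
```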

Clustering HLAMP genomic features

To explore plausible mechanistic origins of oncogenic HLAMPs, we extracted genomic features proximal to the locus of interest. We took a region extending 15 Mbp on either side of the locus of interest and extracted the following copy number and SV features: entropy of haplotype-specific states; total number of SVs identified; proportion of SVs of each type (fold-back inversions, duplications, deletions and translocations); number of chromosomes involved in translocations; ratio of copy number between the bin containing the oncogene and the average copy number across the chromosome; average copy number state; average size of segments; average number of segments; and average minor allele copy number. All averages are across cells. We then performed hierarchical clustering on a scaled matrix of all features, using the silhouette width to determine the appropriate number of clusters.
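
A minimal sketch of the final clustering step, using Ward linkage and silhouette width to pick the number of clusters (the feature extraction itself is omitted, and Ward linkage is an assumption):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import scale

def cluster_hlamp_features(features: np.ndarray, k_max: int = 10):
    X = scale(features)                                  # column-wise z-scores
    Z = linkage(X, method="ward")
    scores = {k: silhouette_score(X, fcluster(Z, k, criterion="maxclust"))
              for k in range(2, k_max + 1)}
    best_k = max(scores, key=scores.get)                 # maximal silhouette width
    return fcluster(Z, best_k, criterion="maxclust"), scores
```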

HSCN analysis

See the Supplementary Note for a detailed discussion of our method, SIGNALS, for HSCN analysis, including validation of the method and benchmarking against other methods. In brief, SIGNALS uses haplotype blocks genotyped in single cells and implements a hidden Markov model (HMM) based on a Beta-Binomial likelihood to infer the most probable haplotype-specific state. We used default parameters for all datasets apart from SA1292, in which we increased the self-transition probability from 0.95 to 0.999 to mitigate the noisier copy number data in this sample.

Pseudobulk HSCN profiles

In numerous places in this study we construct ‘pseudobulk’ haplotype-specific or total copy number profiles either across all cells in a sample or subsets of cells that share some features of interest. To do this, we group the cells of interest and then compute an average profile by taking the median values of copy number and BAF and the mode of the haplotype-specific state. The function ‘consensuscopynumber’ provided in SIGNALS was used for this.

Comparing segmentation profiles across cells

To facilitate comparisons of genomic profiles across cells, we inferred a set of disjoint segments from the consensus copy number profiles of clusters. For each clone or cluster we generated a consensus segmentation profile, and then used the ‘disjoin_ranges’ function from plyranges67 to generate a non-overlapping disjoint segmentation profile. Each segment was then genotyped in each cell by taking a consensus across the bins within each segment, producing a consistent set of genomic segments and states that could be compared across cells.

Identification of parallel copy number events

The set of genotyped disjoint segmentation profiles was used to calculate the number of parallel CNAs. Parallel CNAs were defined as genomic regions greater than 4 Mbp in which gain or loss of both the maternal and paternal haplotype was observed in more than 1% of cells. Copy number breakpoints of segments did not need to match for events to be included.
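
A sketch of the parallel-CNA test for one disjoint segment, assuming per-cell boolean event calls with illustrative column names:

```python
import pandas as pd

def is_parallel_cna(cells: pd.DataFrame, segment_size_bp: int,
                    min_size_bp: int = 4_000_000, min_frac: float = 0.01) -> bool:
    """cells: one row per cell, with boolean columns for gain/loss of each haplotype."""
    if segment_size_bp <= min_size_bp:
        return False
    f = cells[["maternal_gain", "paternal_gain",
               "maternal_loss", "paternal_loss"]].mean()  # fraction of cells
    gain = f["maternal_gain"] > min_frac and f["paternal_gain"] > min_frac
    loss = f["maternal_loss"] > min_frac and f["paternal_loss"] > min_frac
    return bool(gain or loss)
```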

Phylogenetic analysis

We computed phylogenetic trees using sitka as previously described37, using the consensus tree from the posterior distribution for downstream analysis. For visualization, clades with a high fraction of singletons (nodes with a single descendant) were removed: nodes were ordered by the fraction of their descendants that were singletons and removed iteratively until a maximum of 3% of cells in the tree were removed. Trees were visualized using ggtree68 and functionality in SIGNALS. Phylogenetic distances were computed as the mean pairwise distances between phylogenetic tips (cells) using the cophenetic function in APE69. Distances represent the number of copy number change points between two cells on the phylogeny.

Event rates inferred from single-cell phylogenies

To compute the rates of gains and losses of whole chromosomes, chromosome arms and segmental aneuploidies we enumerated the number of events from our single-cell phylogenies using parsimony-based ancestral state reconstruction. We used the genotyped disjoint segmentation profiles for this.

We first defined states for each segment in each cell relative to the most common state across all cells. For each segment, cells can have one of two possible states for each class of interest: (gain, not gained) or (loss, not lost). By casting the problem as reconstructing the ancestral states within the phylogeny, we can then compute the number of transitions between these states that most parsimoniously explains the phylogenetic tree. We used a simple transition matrix in which transitions between states incur a cost of 1. Ancestral state reconstruction then amounts to finding the reconstruction that minimizes this cost. The event frequency per sample is then calculated by dividing the parsimony score (number of events) by the number of cells. We used castor v.1.6.6 in R to perform the ancestral state reconstruction70. The unit of this quantity is the number of events per cell division, assuming no cell death. It is possible (perhaps likely) that many cells acquire segmental gains or losses but then die; we never sample such cells, and our phylogenetic tree reconstructs ancestral relationships only between the cells that survive and are sampled. It is challenging to decouple the death rate of cells from the true event rate per cell division71; thus, our event rate is an effective event rate, that is, the event rate scaled by the (unknown) death rate of cells. To contrast the rates across different types of events, we classified segments as whole chromosomes, chromosome arms or segmental aneuploidies.
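
The analysis itself used castor's ancestral state reconstruction in R; as an illustration, the parsimony score for one binary segment character (for example, gain versus not gained) on a binary tree can be computed with Fitch's algorithm:

```python
def fitch_event_count(children: dict, leaf_states: dict, root) -> int:
    """children: node -> list of child nodes (empty list for leaves);
    leaf_states: leaf -> state. Returns the minimum number of transitions.
    Exact for binary trees, with a cost of 1 per transition as in the text."""
    events = 0
    def postorder(node):
        nonlocal events
        kids = children.get(node, [])
        if not kids:
            return {leaf_states[node]}
        sets = [postorder(k) for k in kids]
        common = set.intersection(*sets)
        if common:
            return common
        events += 1                  # a state change is required below this node
        return set.union(*sets)
    postorder(root)
    return events

# Effective event rate per sample = parsimony score / number of cells.
```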

Calculation of copy number distance

The copy number distance calculates the number of segments that need to be modified to transform one copy number profile into another20. We use this measure to compute cell-to-cell variation in our dataset. To compute this measure, we modified the code provided in a previous study25 to take into account whole-genome doubling of cells (https://github.com/raphael-group/WCND). We did this as follows: given two copy number profiles (integer copy number states of individual haplotypes in bins across the genome) CNPA and CNPB, we computed the following distances:

$$d_{1}=f(\mathrm{CNP}_{\mathrm{A}},\mathrm{CNP}_{\mathrm{B}})$$

$$d_{2}=f(2\times \mathrm{CNP}_{\mathrm{A}},\mathrm{CNP}_{\mathrm{B}})$$

$$d_{3}=f(\mathrm{CNP}_{\mathrm{A}},2\times \mathrm{CNP}_{\mathrm{B}})$$

in which 2× refers to doubling the copy number state across the whole genome. We then took the copy number distance to be

$$d=\min(d_{1},d_{2},d_{3}).$$

If the minimum was $d_{2}$ or $d_{3}$, we increased $d$ by 1 (that is, counting WGD as an additional event). Calculating all pairwise comparisons is computationally expensive, so for each dataset we subsampled 250 cells and calculated all pairwise distances for these 250 cells.
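
The WGD-aware distance reduces to a small wrapper around the pairwise distance function; a sketch with `f` standing in for the modified WCND segment distance:

```python
def wgd_aware_distance(cnp_a, cnp_b, f):
    """Copy number distance allowing one whole-genome doubling of either profile."""
    d1 = f(cnp_a, cnp_b)
    d2 = f([2 * s for s in cnp_a], cnp_b)   # double profile A
    d3 = f(cnp_a, [2 * s for s in cnp_b])   # double profile B
    d = min(d1, d2, d3)
    if d < d1:
        d += 1                              # count the WGD as one extra event
    return d
```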

10X scRNA-seq processing

CellRanger software (v.3.1.0) was used to perform read alignment, barcode filtering and quantification of unique molecular identifiers (UMIs) using the 10X GRCh38 transcriptome (v.3.0.0) for FASTQ inputs. CellRanger filtered matrices were loaded into individual Seurat objects using the Seurat R package (v.4.1.0) (refs. 72,73). The resulting gene-by-cell matrix was normalized and scaled for each sample. Cells retained for analysis had at least 500 expressed genes, at least 1,000 UMI counts and less than 25% mitochondrial gene expression. Genes expressed in fewer than three cells were removed. Cell-cycle phase was assigned using the Seurat73 CellCycleScoring function. Scrublet74 (v.0.2.3) was used to identify and filter cells predicted to be doublets. We then applied the standard Seurat processing pipeline with default parameters, except that the first 20 principal component analysis (PCA) dimensions were used for nearest-neighbour and UMAP calculations.

Allelic imbalance in scRNA-seq

We called heterozygous SNPs in the scRNA-seq data using cellSNP v.1.2.2 (ref. 75). As input, we used the same set of heterozygous SNPs identified in the scDNA-seq and the corresponding normal sample for each sample. The liftover script provided in cellSNP was used to lift over SNPs from hg19 to hg38. After genotyping, we phased the SNPs using the phasing information computed from the haplotype-specific inference in the scDNA-seq. As SNP counts are much sparser in scRNA-seq than in scDNA-seq (around two orders of magnitude lower), we aggregated counts across segments (minimum size = 10 Mbp), computing the BAF for each segment. We then generated a cell-by-segment BAF matrix and incorporated this into our gene expression Seurat objects. We applied an additional filtering criterion here, removing cells with fewer than 200 SNP counts. Functionality to map scDNA-seq to scRNA-seq and call allelic imbalance is provided in SIGNALS.
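
The per-segment aggregation can be sketched as follows, assuming a long-format table of phased SNP counts with illustrative column names:

```python
import pandas as pd

def segment_baf(snps: pd.DataFrame, min_cell_counts: int = 200) -> pd.DataFrame:
    """snps columns: cell_id, segment, b_count, total_count (with the B allele
    phased consistently across SNPs via the scDNA-seq haplotypes)."""
    agg = snps.groupby(["cell_id", "segment"])[["b_count", "total_count"]].sum()
    agg["baf"] = agg["b_count"] / agg["total_count"]
    depth = agg.groupby(level="cell_id")["total_count"].sum()  # SNP counts per cell
    keep = depth.index[depth >= min_cell_counts]
    return agg[agg.index.get_level_values("cell_id").isin(keep)]
```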

Differential expression analysis

Differential expression analysis between gene expression clusters was computed using the Wilcoxon rank sum test with the presto R package. Gene expression clusters were computed using the FindClusters function in Seurat. Only cells in G1 phase were included. To compare gene expression variability for oncogenes, we took the absolute maximum log-transformed fold change for each sample for each oncogene and contrasted this value in cases in which oncogene copy number was determined to be fixed or variable from DLP+ single-cell sequencing of the same samples. ‘Variable’ oncogenes were defined as those that had a minimum ratio of 2 between the maximum to minimum clone-level copy number, and ‘non-variable’ oncogenes as those that had a ratio of less than 2.

Nearest neighbour gene expression analysis

To assess transcriptional convergence of allelic losses, we made use of the shared nearest-neighbour graph computed using Seurat. This was done for chromosome 2q in sample SA906b. For a given cell, an enrichment score was defined as the observed fraction of nearest neighbours divided by the expected fraction of nearest neighbours. Here, the expected fraction of neighbours with the same allelic state was defined as the global fraction of cells in each state. Hence, a positive enrichment score indicates an overrepresentation of cells in the allelic state of interest among a cell's nearest neighbours, a negative score indicates an underrepresentation and a score of 0 would reflect a perfectly mixed neighbourhood of cells with different allelic states. To mitigate the influence of other technical or biological variability, this analysis included only cells in G1 phase, and cells with greater than 7.5% mitochondrial gene expression were removed because this measure varied between gene expression clusters.
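
A minimal sketch of the enrichment score; taking log2 of the observed/expected ratio is an assumption made here so that a perfectly mixed neighbourhood scores 0, as described above:

```python
import numpy as np

def neighbour_enrichment(neighbour_states, state, global_fraction: float) -> float:
    """neighbour_states: allelic states of a cell's nearest neighbours;
    global_fraction: overall fraction of cells in `state`."""
    observed = float(np.mean(np.asarray(neighbour_states) == state))
    return float(np.log2(observed / global_fraction))  # 0 = perfectly mixed
```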

Statistical tests

The statistical tests used were two-tailed unequal-variance t-tests unless otherwise specified: log-rank tests were used for comparing survival curves; Wilcoxon rank sum two-tailed tests were used for comparing segment lengths, segment counts, missegregation and ploidy percentages, copy number variances, bin counts, gene copy number distributions, gene expression log-transformed fold changes, parallel copy number counts and breakpoint counts; and hypergeometric tests were used to identify enrichment of gene mutations. P values from multiple comparisons were corrected using the Benjamini–Hochberg method76.
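
The Benjamini–Hochberg correction is available in statsmodels; a minimal usage sketch with illustrative P values:

```python
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.01, 0.02, 0.04, 0.20]                 # illustrative values
reject, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
```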

Box plot statistics

All box plots indicate the median, first and third quartiles (hinges), and the most extreme data points no farther than 1.5× the IQR from the hinge (whiskers).

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.
