Diverse mutational landscapes in human lymphocytes - Nature

Samples

Human blood mononuclear cells (MNCs) were obtained from four sources: (1) bone marrow, spleen and peripheral blood taken with written informed consent (provided by next-of-kin) from three deceased transplant organ donors (KX001, KX002, KX003) recruited from Cambridge University Hospitals NHS Trust, Addenbrooke’s Hospital (by Cambridge Biorepository for Translational Medicine, Research Ethics Committee approval 15/EE/0152), (2) peripheral blood taken with written informed consent from one patient (AX001) recruited from Addenbrooke’s Hospital (approval 07-MRE05-44), (3) tonsil taken with written informed consent from guardians of two patients (TX001, TX002) recruited from Addenbrooke’s Hospital (approval 07-MRE05-44), and (4) one cord blood (CB001) collected with written informed consent from the guardian by StemCell Technologies (catalogue (cat.) no. 70007) (Supplementary Table 2). All donors were haematopoietically normal and healthy. Donor KX002 had a history of Crohn’s disease and treatment with Azathioprine. Patients TX001 and TX002 had a history of tonsillitis. MNCs from (1), (2) and (3) were extracted using Lymphoprep (Axis-Shield), depleted of red blood cells using RBC lysis buffer (BioLegend) and frozen viable in 10% DMSO. Cord blood MNCs (4) were received frozen and then selected on the basis of CD34 expression using the EasySep human whole-blood CD34 positive-selection kit (Stemcell Technologies) as per the manufacturer’s instructions, with the CD34⁺ fraction used for HSPC cultures and the CD34⁻ fraction used for lymphocyte cultures. Additional peripheral blood MNCs from (1) also underwent CD34 positive selection and were used for HSPC cultures.

Flow cytometry

MNC samples were sorted by flow cytometry at the NIHR Cambridge BRC Cell Phenotyping Hub on AriaIII or Aria-Fusion cell sorters into naive B cells (CD3⁻CD19⁺CD20⁺CD27⁻CD38⁻IgD⁺), memory B cells (CD3⁻CD19⁺CD20⁺CD27⁺CD38⁻IgD⁻), naive T cells (CD3⁺CD4/CD8⁺CCR7⁺CD45RA^high), memory T cells (CD3⁺CD4/CD8⁺CD45RA⁻), regulatory T cells (Tregs: CD3⁺CD4⁺CD25^highCD127⁻) and HSPCs (CD3⁻CD19⁻CD34⁺CD38⁻CD45RA⁻) (Supplementary Fig. 1). HSPCs from AX001 included HSCs (CD34⁺CD38⁻) and progenitors (CD34⁺CD38⁺CD10^-/dim). The antibody panels used are as follows: lymphocytes (excluding Tregs): CD3-APC, CD4-BV785, CD8-BV650, CD14-BV605, CD19-AF700, CD20-PEDazzle, CD27-BV421, CD34-APC-Cy7, CD38-FITC, CD45RA-PerCP-Cy5.5, CD56-PE, CCR7-BV711, IgD-PECy7, Zombie-Aqua; Tregs: CD3-APC, CD4-BV785, CD8-BV650, CD19-APC-Cy7, CD45RA-PerCP-Cy5.5, CD56-PE, CCR7-FITC, CD25-PECy5, CD127-PECy7, CD69-AF700, CD103-BV421, CCR9-PE, Zombie-Aqua; HSPCs (excluding AX001): CD3-FITC, CD90-PE, CD49f-PECy5, CD38-PECy7, CD33-APC, CD19-A700, CD34-APC-Cy7, CD45RA-BV421, Zombie-Aqua; HSPCs (AX001): CD38-FITC, CD135-PE, CD34-PE-Cy7, CD90-APC, CD10-APC-Cy7, CD45RA-V450, Zombie-Aqua. Details of the antibody panels used are in Supplementary Table 11. Cells were either single-cell sorted for liquid culture into 96-well plates containing 50 μl cell-type-specific expansion medium, or (for AX001 HSPCs) bulk-sorted for MethoCult plate-based expansion. Plotting of the fluorescence-activated cell sorting data was performed with FlowJo and FCS Express.

In vitro liquid culture expansion

We designed novel protocols to expand B and T cells from single cells into colonies of at least 30 cells. Detailed step-by-step descriptions of the protocols are provided in Supplementary Information. The B cell expansion medium was composed of 5 μg ml⁻¹ Anti-IgM (Stratech Scientific), 100 ng ml⁻¹ IL-2, 20 ng ml⁻¹ IL-4, and 50 ng ml⁻¹ IL-21 (PeproTech EC), 2.5 ng ml⁻¹ CD40L-HA (Bio-Techne) and 1.25 μg ml⁻¹ HA Tag (Bio-Techne), in Advanced RPMI 1640 Medium (ThermoFisher Scientific) with 10% fetal bovine serum (ThermoFisher Scientific), 1% penicillin/streptomycin (Sigma-Aldrich), and 1% L-glutamine (Sigma-Aldrich). The T cell expansion medium was composed of 12.5 μl ml⁻¹ ImmunoCult CD3/CD28 (STEMCELL Technologies) and 100 ng ml⁻¹ IL-2 and 5 ng ml⁻¹ IL-15 (PeproTech EC), in ImmunoCult-XF T Cell Expansion Medium (STEMCELL Technologies) with 5% fetal bovine serum (ThermoFisher Scientific) and 0.5% penicillin/streptomycin (Sigma-Aldrich). Twenty-five microlitres of fresh expansion medium was added to each culture every 3–4 days. Colonies (30–2,000 cells per colony) were collected either manually or robotically using a CellCelector (Automated Lab Solutions) approximately 12 days after sorting (depending on growth).

Sorted HSPCs from donors KX001, KX002, KX003 and CB001 were expanded from single cells into colonies of 200–100,000 cells in Nunc 96-well flat-bottomed TC plates (ThermoFisher Scientific) containing 100 μl supplemented StemPro medium (Stem Cell Technologies) (MEM medium). MEM medium contained StemPro Nutrients (0.035%) (Stem Cell Technologies), L-Glutamine (1%) (ThermoFisher Scientific), Penicillin-Streptomycin (1%) (ThermoFisher Scientific) and cytokines (SCF: 100 ng ml⁻¹; FLT3: 20 ng ml⁻¹; TPO: 100 ng ml⁻¹; EPO: 3 ng ml⁻¹; IL-6: 50 ng ml⁻¹; IL-3: 10 ng ml⁻¹; IL-11: 50 ng ml⁻¹; GM-CSF: 20 ng ml⁻¹; IL-2: 10 ng ml⁻¹; IL-7: 20 ng ml⁻¹; lipids: 50 ng ml⁻¹) to promote differentiation towards myeloid–erythroid–megakaryocyte (MEM) and natural killer cell lineages. Manual assessment of colony growth was made at 14 days. Colonies were topped up with an additional 50 μl MEM medium on day 15 if the colony was ≥1/4 the size of the well. Following 21 ± 2 days in culture, colonies were selected by size criteria. Colonies ≥3,000 cells in size were collected into a U-bottomed 96-well plate (ThermoFisher Scientific). Plates were then centrifuged (500g for 5 min), medium was discarded, and the cells were resuspended in 50 μl PBS prior to freezing at −80 °C. Colonies less than 3,000 cells but greater than 200 cells in size were collected into 96-well skirted Lo Bind plates (Eppendorf) and centrifuged (800g for 5 min). Supernatant was removed to 5–10 μl using an aspirator prior to DNA extraction on the fresh cell pellet. Sorted HSPCs from donor AX001 were plated onto CFC medium MethoCult H4435 (STEMCELL Technologies) and colonies were picked following 24 days in culture.

Whole-genome sequencing of colonies

DNA was extracted from 717 colonies with Arcturus PicoPure DNA Extraction Kit (ThermoFisher Scientific), with the exception of larger HSPC colonies which were extracted using the DNeasy 96 blood and tissue plate kit (Qiagen) and then diluted to 1–5 ng. DNA was used to make Illumina sequencing libraries using a custom low-input protocol⁴⁵. We performed whole-genome sequencing using 150 bp paired-end sequencing reads on an Illumina XTen platform, to an average depth of 20× per colony. Sequence data were mapped to the human genome reference GRCh37d5 using the BWA-MEM algorithm.

Variant calling

We called all classes of variants using validated pipelines at the Wellcome Sanger Institute. SNVs were called using the program CaVEMan⁴⁶, insertion/deletions (indels) using Pindel⁴⁷, structural variants using BRASS⁴⁸ and copy number variants (CNVs) using ASCAT⁴⁹. In order to recover all mutations, including high frequency ones, we used an in silico sample produced from the reference genome rather than use a matched normal for the CaVEMan, Pindel, and BRASS analyses. Germline mutations were removed after variant calling (see below). For the ASCAT analysis we elected one colony (arbitrarily chosen) to serve as the matched normal.

Variants were filtered to remove false positives and germline variants. First, variants with a mean VAF greater than 40% across colonies of an individual were probably germline variants and were removed. To remove remaining germline variants and false positives, we exploited the fact that we have several, highly clonal samples per individual. We performed a beta-binomial test per variant per individual, retaining only SNVs and indels that were highly over-dispersed within an individual. For SNVs we also required that the variants be identified as significantly subclonal within an individual using the program Shearwater, and applied filters to remove artefacts resulting from the low-input library preparation. Detailed descriptions of the artefact filters were provided previously⁴⁵ and the complete filtering pipeline is made available on GitHub (https://github.com/MathijsSanders/SangerLCMFiltering). For both the beta-binomial filter and the Shearwater filter we observed bimodal distributions separating the data into low and high confidence variants. We made use of this feature, using a valley-finding algorithm (R package quantmod) to determine the p-value cut-offs, per individual. We genotyped each colony for the set of filtered somatic SNVs and indels (per respective individual), calling a variant present if it had a minimum VAF of 20% and a minimum of two alternate reads in that colony.
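
The two simple thresholds in the paragraph above (the mean-VAF germline screen and the per-colony genotyping rule) can be sketched as follows. This is a minimal illustration, not the published pipeline: the function names and the input layout are hypothetical, and the beta-binomial, Shearwater and artefact filters are not reproduced here.

```python
def is_probably_germline(vafs, max_mean_vaf=0.40):
    """Flag a variant whose mean VAF across an individual's colonies exceeds 40%."""
    if not vafs:
        return False
    return sum(vafs) / len(vafs) > max_mean_vaf


def genotype_colony(alt_reads, total_reads, min_vaf=0.20, min_alt=2):
    """Call a variant present in one colony: VAF >= 20% and >= 2 alternate reads."""
    if total_reads == 0:
        return False  # no coverage, so no call (an assumption of this sketch)
    vaf = alt_reads / total_reads
    return vaf >= min_vaf and alt_reads >= min_alt


# Example: 6 alternate reads out of 20 gives a VAF of 30%, so the variant is called.
print(genotype_colony(6, 20))
```

Both thresholds are taken from the text; whether the comparisons are strict or inclusive at the boundary is an assumption of this sketch.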

We estimated our sensitivity to detect SNVs using germline mutations as a truth set of heterozygous mutations. We called germline mutations by performing a one-sided exact binomial test of the sum of the alternate and sum of the total reads across colonies of an individual for each CaVEMan unfiltered variant (alternative hypothesis of proportion of successes less than 0.5 for autosomes and female X chromosomes, 0.95 for male sex chromosomes). A variant was called as germline on failure to reject the null at a false-discovery rate q-value of 10⁻⁶. We calculated sensitivity as the proportion of germline variants detected per colony.
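
The germline test described above can be illustrated with a short standard-library sketch. The text does not name the false-discovery-rate procedure, so the Benjamini-Hochberg adjustment used here is an assumption, as are the function names and the (alternate reads, total reads) input layout.

```python
import math


def binom_lower_tail(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log(1.0 - p))
        total += math.exp(log_term)
    return min(total, 1.0)


def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues


def call_germline(counts, p_null=0.5, q_cutoff=1e-6):
    """counts: list of (summed alt reads, summed total reads), one pair per variant.

    A variant is called germline when the one-sided test (proportion < p_null)
    fails to reject the null at the q-value cutoff.
    """
    pvals = [binom_lower_tail(alt, total, p_null) for alt, total in counts]
    return [q >= q_cutoff for q in bh_qvalues(pvals)]


# A variant at ~50% of reads is kept as germline; one at ~5% is not.
print(call_germline([(500, 1000), (50, 1000)]))
```

For male sex chromosomes the same function would be called with `p_null=0.95`, following the text.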

We removed artefacts from the structural variant calls using AnnotateBRASS with default settings. The full list of statistics calculated and post-hoc filtering strategy was described in detail previously³⁶. Somatic structural variants were identified as those shared by less than 25% of the colonies within an individual. Structural variants and CNVs were both subsequently manually curated by visual inspection.

Mutation burden analysis

We found that sequencing depth was a strong predictor of mutation burden in our samples. Therefore, in order to more accurately estimate the mutation burden for each colony, we corrected the number of SNVs or indels (corrected separately) by fitting an asymptotic regression (function NLSstAsymptotic, R package stats) to mutation burden as a function of sequencing depth per colony. For this correction we used HSPC genomes (excepting the tonsil samples, for which naive B and T cells were used), as lymphocyte genomes are more variable in mutation burden, and included additional unpublished HSPC genomes to increase the reliability of the model¹². Genomes with a mean sequencing depth of greater than 50× were omitted. The model parameters b₀, b₁ and lrc for each dataset for the model y = b₀ + b₁ × (1 − exp(−exp(lrc) × x)) are in Supplementary Table 7. Mutation burden per colony was adjusted to a sequencing depth of 30.
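
As a rough illustration of the depth correction, the sketch below evaluates the asymptotic model y = b₀ + b₁ × (1 − exp(−exp(lrc) × x)) and rescales an observed burden to a depth of 30. The text does not state the exact adjustment formula, so the ratio-based rescaling here is an assumption, and the parameter values in the example are invented for illustration (the fitted values are in Supplementary Table 7).

```python
import math


def asymptotic(depth, b0, b1, lrc):
    """Expected burden at a given depth under y = b0 + b1 * (1 - exp(-exp(lrc) * x))."""
    return b0 + b1 * (1.0 - math.exp(-math.exp(lrc) * depth))


def adjust_burden(observed, depth, b0, b1, lrc, target_depth=30.0):
    """Hypothetical correction: scale the observed burden by the ratio of the
    model's expected burden at the target depth to that at the observed depth."""
    expected_here = asymptotic(depth, b0, b1, lrc)
    expected_target = asymptotic(target_depth, b0, b1, lrc)
    return observed * expected_target / expected_here


# Made-up parameters: a colony sequenced at 15x is scaled up towards its 30x value.
print(adjust_burden(observed=800, depth=15.0, b0=0.0, b1=1000.0, lrc=-2.0))
```

Under this form, colonies sequenced below the target depth are adjusted upwards and colonies above it downwards, with the size of the correction shrinking as the curve approaches its asymptote.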

We used a linear mixed-effects model (function lme, R package nlme) to test for a significant linear relationship between mutation burden and age, and for an effect of cell subset on this relationship (separately for SNVs and indels). Number of mutations per colony was regressed on age of donor and cell type as fixed effects, with interaction between age and cell type, donor by cell type as a random effect, weighted by cell type, and with maximum likelihood estimation.

Detecting positive selection

In order to estimate an exome-wide rate of selection and to detect selection acting on specific genes we used the dndscv function of the dNdScv R package¹³. This program leverages mutation rate information across genes. As the elevated mutation rate seen with SHM may break the assumptions of the test, we excluded the immunoglobulin loci from these analyses (excluded GRCh37 regions: chr14:106304735–107283226, chr2:89160078–90274237, chr22:22385390–23263607). We performed the test for the following subsets of the data: all lymphocytes, naive B, memory B, naive T, memory T, all lymphocytes testing only cancer genes and all lymphocytes excluding cancer genes. Cancer genes were defined as the 566 tier 1 genes from the COSMIC Cancer Gene Census (https://cancer.sanger.ac.uk, downloaded 6 June 2018).

Mutational signature analysis

We characterized per-colony mutational profiles by estimating the proportion of known and novel mutational signatures present in each colony. For comparison, we included in the analysis 223 genomes from 7 blood cancer types: Burkitt lymphoma, follicular lymphoma, diffuse large B cell lymphoma, chronic lymphocytic leukaemia (mutated), chronic lymphocytic leukaemia (unmutated), and acute myeloid leukaemia³⁸ and multiple myeloma¹⁵. We identified mutational signatures present in the data by performing signature extraction with two programs, SigProfiler⁵⁰ and hdp (https://github.com/nicolaroberts/hdp). We used the SigProfiler de novo results for the suggested number of extracted signatures. hdp was run without any signatures as prior, with no specified grouping of the data. These programs identified the presence of 9 mutational signatures with strong similarity (cosine similarity ≥ 0.85) to Cosmic signatures¹⁶ SBS1, SBS5, SBS7a, SBS8, SBS9, SBS13, SBS17b, SBS18 and SBS19 (version 3).
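
The cosine similarity used above to match extracted signatures to reference signatures can be computed as in the following sketch. The function name is illustrative, and the short vectors in the example are placeholders for full mutational spectra (single-base-substitution signatures are typically defined over 96 trinucleotide channels).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero spectra."""
    if len(a) != len(b):
        raise ValueError("spectra must have the same number of channels")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("spectra must not be all zeros")
    return dot / (norm_a * norm_b)


extracted = [0.10, 0.30, 0.40, 0.20]   # placeholder spectrum
reference = [0.12, 0.28, 0.38, 0.22]   # placeholder spectrum
matches = cosine_similarity(extracted, reference) >= 0.85  # threshold from the text
print(matches)
```

Because the measure depends only on the direction of the vectors, it is unaffected by whether spectra are given as counts or as proportions.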

Both SigProfiler and hdp also identified the same novel signature (cosine similarity = 0.93), which we term the blood signature or SBSblood. This signature is very similar to the mutational profile seen previously in HSPCs^10,11. As the signature SBSblood co-occurs with SBS1 in HSPCs, leading to the potential for these signatures being merged into one signature, we further purified SBSblood by using the program sigfit⁵¹ to call two signatures across our HSPC genomes, SBS1 and a novel signature, with the novel signature being the final SBSblood (Extended Data Fig. 4a and Supplementary Table 8). SBSblood was highly similar to both the hdp and SigProfiler de novo extracted signatures (cosine similarity of 0.95 and 0.94, respectively) and had similarity to the Cosmic v3 SBS5 signature (cosine similarity = 0.87). One hypothesis is that SBSblood is the manifestation of SBS5 mutational processes in the blood cell environment.

We estimated the proportion of each of the 10 identified mutational signatures using the program sigfit. From these results we identified three signatures (SBS5, SBS13 and SBS19) that were at nominal frequencies in the HSPC and lymphocyte genomes (less than 10% in each genome); these were excluded from the analysis and the signature proportions were re-estimated in sigfit using the remaining 7 signatures: SBSblood, SBS1, SBS7a, SBS8, SBS9, SBS17b, SBS18 (Supplementary Table 8).
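
The exclusion rule above (drop a signature when it contributes less than 10% in every genome) can be sketched as below. The nested-dictionary layout and the function name are hypothetical, and the re-estimation step in sigfit is not shown.

```python
def signatures_to_keep(exposures, min_fraction=0.10):
    """Return signatures reaching min_fraction in at least one genome.

    exposures: {genome_id: {signature_name: proportion}} (hypothetical layout).
    """
    all_signatures = set()
    for per_genome in exposures.values():
        all_signatures.update(per_genome)
    return sorted(
        sig for sig in all_signatures
        if any(per_genome.get(sig, 0.0) >= min_fraction
               for per_genome in exposures.values())
    )


# Invented proportions: SBS5 stays below 10% in both genomes and is dropped.
example = {
    "colony_a": {"SBS1": 0.30, "SBS5": 0.05, "SBSblood": 0.65},
    "colony_b": {"SBS1": 0.20, "SBS5": 0.08, "SBSblood": 0.72},
}
print(signatures_to_keep(example))
```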

Immunoglobulin receptor sequence analysis

In order to identify the immunoglobulin rearrangements, productive CDR3 sequences and per cent SHM for each memory B cell, we ran IgCaller⁵², using a genome from the same donor (HSPC or T cell) as a matched normal for germline variant removal. We considered the SHM rate to be the number of variants identified by IgCaller in the productive IGHV gene divided by the gene length. For CSR calling, see Supplementary Information.

We estimated the number of mutations resulting from on-target (IGHV gene) SHM compared with those associated with SBS9. We first counted all IGHV variants identified by CaVEMan pre-filtering, as we found that standard filtering removes many SHM variants. We then estimated SBS9 burden as the proportion of SBS9 mutations per genome multiplied by the SNV burden. The SBS9 mutation rate per genome was the SBS9 burden divided by the ‘callable genome’ (genome size of 3.1 Gb minus an average of 383 kb excluded from variant calling).
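
The rate calculations in this section reduce to simple arithmetic, sketched below. The genome size and excluded length come from the text; the function names and the example inputs are illustrative only.

```python
GENOME_SIZE_BP = 3_100_000_000   # 3.1 Gb, from the text
EXCLUDED_BP = 383_000            # average excluded from variant calling, from the text
CALLABLE_BP = GENOME_SIZE_BP - EXCLUDED_BP


def shm_rate(n_ighv_variants, ighv_length_bp):
    """On-target SHM rate: variants in the productive IGHV gene per base pair."""
    return n_ighv_variants / ighv_length_bp


def sbs9_burden(sbs9_proportion, snv_burden):
    """SBS9 burden: proportion of SBS9 mutations times the genome's SNV burden."""
    return sbs9_proportion * snv_burden


def sbs9_rate(sbs9_proportion, snv_burden):
    """SBS9 mutations per base pair of callable genome."""
    return sbs9_burden(sbs9_proportion, snv_burden) / CALLABLE_BP


# Invented inputs: 15 variants in a 300-bp IGHV gene; 25% SBS9 in 2,000 SNVs.
print(shm_rate(15, 300))
print(sbs9_rate(0.25, 2000))
```

Expressing both quantities per base pair is what allows the on-target and genome-wide rates to be compared directly.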

Distribution of germinal centre-associated mutations in B cells

We assessed the genomic distribution of the germinal centre-associated mutational signatures, SBS9 and the SHM signature, in memory B cells. We performed per-Mb de novo signature analyses with hdp (no a priori signatures), treating mutations across all normal memory B cells within a given Mb window as a sample. The extracted SHM signature (Supplementary Table 8) had a cosine similarity of 0.96 to the spectrum of memory B cell mutations in the immunoglobulin gene regions, supporting the assumption that it is indeed the signature of SHM. In this analysis, SBSblood and SBS1 resolved as a single combined signature that we refer to in the genomic feature regression (below) as SBSblood/SBS1.

We estimated the per-gene enrichment of SBS9 and SHM signatures across normal memory B and malignant B cell genomes (Burkitt lymphoma, follicular lymphoma, diffuse large B cell lymphoma, chronic lymphocytic leukaemia, and multiple myeloma). We first used sigfit to perform signature attribution of the signatures found in memory B cells (from the main signature analysis; SBSblood, SBS1, SBS8, SBS9, SBS17b or SBS18) and the extracted SHM signature from the above 1-Mb hdp analysis, considering each 1-Mb bin a sample. We subsequently calculated a signature attribution per variant. Gene coordinates were downloaded from UCSC (gencode.v30lift37.basic.annotation.geneonly.genename.bed). We calculated the mean attribution of variants in a given gene, representing the proportion of variants attributable to a given signature. We estimated the enrichment of SBS9 or SHM over genomic background per gene per cell type as the P-value of individual t-tests. While for this down-sampled dataset few genes were significant after multiple testing correction, analysis of full datasets with larger sample sizes shows statistically significant enrichment in most presented genes after multiple testing correction (data not shown).

Regression of SBS9 and genomic features

The hdp per-Mb memory B cell mutational signature results above were used to identify genomic features associated with the location of mutations attributable to a particular mutational signature. To achieve a finer-scale genomic resolution, each Mb bin was further divided up into 10-kb bins, and the proportion of each mutational signature in a Mb bin was used to calculate a signature attribution per 10-kb bin, based on the type and trinucleotide context of mutations in the 10-kb bin.

The number of mutations attributable to a particular mutational signature, per 10-kb window, was regressed on each of 36 genomic features (Supplementary Table 4). Noise was further removed from the replication timing data, using the GM12878 blood cell line data, and filtering the Wave Signal data by removing low sum signal (<95) regions, per Hansen et al.⁵³. SBS9 was analysed separately from the SBSblood/SBS1 combined signature. The number of mutations per signature per bin was calculated as the sum of the per-nucleotide probabilities per signature within a given bin. For the analysis of a given signature, a bin was only included if the average contribution of that signature was greater than 50%. This step ameliorates the problem of artificially high numbers of mutations being ascribed to a bin due to the combination of a trivially small attribution but a high overall mutation rate. This can occur in high SHM or SBS9 regions. This left 26,151 bins for SBS9 and 25,202 bins for SBSblood, out of 91,343 bins with mutations and 279,094 bins genome-wide. We also included a random sample of zero-mutation bins to equal 10% of the total bins.
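
The per-bin counting and inclusion rule can be illustrated as follows. This sketch assumes each mutation carries a chromosome, a position and a probability of belonging to the signature of interest, and it interprets "average contribution" as the mean of those probabilities within a bin; both the input layout and that interpretation are assumptions.

```python
from collections import defaultdict


def signature_counts_per_bin(mutations, bin_size=10_000, min_mean=0.5):
    """Sum per-mutation signature probabilities within fixed-width bins.

    mutations: iterable of (chromosome, position, probability).
    Returns {(chromosome, bin_index): summed probability} for bins whose
    mean probability exceeds min_mean.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for chrom, pos, prob in mutations:
        key = (chrom, pos // bin_size)
        sums[key] += prob
        counts[key] += 1
    return {key: total for key, total in sums.items()
            if total / counts[key] > min_mean}


# Invented data: the first bin averages 0.8 and is kept; the second (0.2) is dropped.
example = [("1", 5, 0.9), ("1", 9_000, 0.7), ("1", 15_000, 0.2)]
print(signature_counts_per_bin(example))
```

Filtering on the mean rather than the sum is what prevents a heavily mutated bin from being counted for a signature that explains only a small share of its mutations.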

We performed lasso-penalized generalized additive model regressions of the number of mutations per bin with the value of the genomic features. We used the gamsel function in R (package gamsel), with the lambda estimated from a fivefold cross-validation of training data (two-thirds of the data). To estimate individual effect sizes, we performed generalized additive model regressions per genomic feature using the function gam (R package mgcv). The same analysis was also performed on HSPC mutations. The results for the full and individual regression models for each of SBS9 and SBSblood/SBS1 in memory B cells and for all HSPC mutations can be found in Supplementary Table 4.

RAG and CSR motif analysis

We assessed the enrichment of V(D)J recombination (mediated by RAG) and class switch recombination (CSR, mediated by AID) associated motifs in regions proximal to lymphocyte structural variants. We identified the presence of full length and heptamer RSS motifs associated with RAG binding and endonuclease activity (RAG motifs) for the 50 bp flanking each structural variant breakpoint using the program FIMO⁵⁴ (P < 10⁻⁴). Clusters of AGCT and TGCA repeats, associated with AID cytosine deamination and CSR (CSR motifs), were identified in the 1,000 bp flanking each structural variant breakpoint using the program MCAST⁵⁵ (P < 0.1, maximum gap = 100, E < 10,000). In order to estimate a genomic background rate of these motifs, we generated 100 genomic control sets, randomly selected from regions of the genome not excluded from variant calling, and performed both the RAG and CSR motif analyses on these sets. The genomic background rate presented is the median of the 100 control datasets for each motif analysis. Both the RAG and CSR motif analyses were also performed for structural variants from the PCAWG cancer genomes included in the mutational signatures analysis and for acute lymphoblastic leukaemia genomes³.

Telomere length

We estimated the telomere length for HSPC and lymphocyte genomes (Supplementary Table 3) using the program Telomerecat⁵⁶. Telomere lengths for all genomes for a given donor were estimated as a group.

Timing of mutational processes

Following a procedure described previously^33,57, we modelled the distribution of somatic mutations along the genome from the density of chromatin immunoprecipitation–sequencing reads using random forest regression in a tenfold cross-validation setting and the LogCosh distance between observed and predicted profiles. Each mutation was attributed to the signature that most likely generated it and aggregated into 2,128 windows of 1 Mb spanning ~2.1 Gb of DNA. Signatures with an average number of mutations per window <1 were not evaluated due to lack of power. We determined the difference between models using a paired two-sided Wilcoxon test on the values from the tenfold cross-validation. Epigenetic data were gathered from different sources^58,59,60 (Supplementary Table 9) and consisted of 149 epigenomes representing 48 distinct blood cell types and differentiation stages and their replicates. Histone marks used included H3K27me3, H3K36me3, H3K4me1 and H3K9me3. To evaluate the specificity of SBS9 mutational profiles in memory B cells, we took the same number of mutations as in SBSblood with the highest association with SBS9 and compared models with an unpaired two-sided Wilcoxon test.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.
