May 26, 2024
Geographic variation of mutagenic exposures in kidney cancer genomes – Nature

Geographic variation of mutagenic exposures in kidney cancer genomes – Nature

Recruitment of cases and informed consent

The International Agency for Research on Cancer (IARC/WHO) coordinated case recruitment through an international network of more than 40 collaborators from the 11 participating countries (Table 1 and Supplementary Table 16). The inclusion criteria for patients were ≥18 years of age (range from 23 to 87, with a mean of 60 and a s.d. of 12), confirmed diagnosis of primary ccRCC and no prior cancer treatment. Informed consent was obtained for all participants. Patients were excluded if they had any condition that could interfere with their ability to provide informed consent or if there were no means of obtaining adequate tissues or associated data as per the protocol requirements. Ethical approvals were first obtained from each Local Research Ethics Committee and Federal Ethics Committee when applicable, as well as from the IARC Ethics Committee.

Bio-samples, data collection and expert pathology review

Dedicated standard operating procedures, following guidelines from the International Cancer Genome Consortium (ICGC), were designed by IARC/WHO to select appropriate case series with complete biological samples and exposure information as described previously34 (Supplementary Table 16). In brief, for all case series included, anthropometric measures were taken, together with relevant information regarding medical and familial history. Comparable smoking and alcohol history was available from all centres. Detailed epidemiological information on residential history was collected in Czech Republic, Romania, and Serbia. Potential limitations of using retrospective clinical data collected using different protocols from different populations were addressed by a central data harmonization to ensure a comparable group of exposure variables (Supplementary Table 16). All patient-related data as well as clinical, demographical, lifestyle, pathological and outcome data were pseudonymized locally through the use a dedicated alpha-numerical identifier system before being transferred to IARC/WHO central database.

Original diagnostic pathology departments provided diagnostic histological details of contributing cases through standard abstract forms. IARC/WHO centralized the entire pathology workflow and coordinated a centralized digital pathology examination of the frozen tumour tissues collected for the study as well as formalin-fixed, paraffin-embedded sections when available, via a web-based report approach and dedicated expert panel following standardized procedures as described previously34. A minimum of 50% viable tumour cells was required for eligibility to whole-genome sequencing.

In summary, frozen tumour tissues were first examined to confirm the morphological type and the percentage of viable tumour cells. A random selection of tumour tissues was independently evaluated by a second pathologist. Enrichment of tumour component was performed by dissection of non-tumoral part, if necessary. 90 cases overlapped with a previously published cohort recruited under the Cancer Genomics of the Kidney (CAGEKID) project5, which were also part of the Pan-Cancer Analysis of Whole Genomes (PCAWG) project7.

DNA extraction

Extraction of DNA from fresh frozen tumour and matched blood samples was centrally conducted at IARC/WHO except for Japan, which performed DNA extractions at the local centre following a similarly standardized DNA extraction procedure. Of the cases which proceeded to the final analysis (n = 962), germline DNA was extracted from either buffy-coat, whole blood, or from adjacent normal tissue (samples from Japan) using previously described protocols and methods34.

Whole-genome sequencing

In total, 1,583 renal cell carcinoma cases were evaluated, with 1,267 confirmed as ccRCC cases. One hundred and sixteen cases (9%) were excluded due to insufficient viable tumour cells (pathology level), or inadequate DNA (tumour or germline). DNA from 1,151 cases was received at the Wellcome Sanger Institute for whole-genome sequencing. Fluidigm single nucleotide polymorphism (SNP) genotyping with a custom panel was performed to ensure that each pair of tumour and matched normal samples originated from the same individual. Whole-genome sequencing (150 bp paired end) was performed on the Illumina NovaSeq 6000 platform with target coverage of 40X for tumours and 20X for matched normal tissues. All sequencing reads were aligned to the GRCh38 human reference genome using Burrows-Wheeler-MEM (v0.7.16a and v0.7.17). Post-sequencing quality control metrics were applied for total coverage, evenness of coverage and contamination. Cases were excluded if coverage was below 30X for tumour or 15X for normal tissue. For evenness of coverage, the median over mean coverage score was calculated. Tumours with median over mean coverage scores outside the range of values determined by previous studies to be appropriate for whole-genome sequencing (0.92–1.09) were excluded. Conpair39 (https://github.com/nygenome/Conpair) was used to detect contamination, cases were excluded if the result was greater than 3%40. A total of 962 cases passed all criteria and were included in subsequent analysis.

Somatic variant calling

Variant calling was performed using the standard Sanger bioinformatics analysis pipeline (https://github.com/cancerit). Copy number profiles were determined first using the algorithms ASCAT41 and BATTENBERG28, where tumour purity allowed. Single nucleotide variants (SNVs) were called with cgpCaVEMan42, indels were called with cgpPINDEL43, and structural rearrangements were called using BRASS. CaVEMan and BRASS were run using the copy number profile and purity values determined from ASCAT where possible (complete pipeline, n = 857). Where tumour purity was insufficient to determine an accurate copy number profile (partial pipeline, n = 105), CaVEMan and BRASS were run using copy number defaults and an estimate of purity obtained from ASCAT/BATTENBERG. For SNV additional filters on ASRD (read length-adjusted alignment score of reads showing the variant allele) and CLPM (median number of soft-clipped bases in variant supporting reads) (ASRD ≥ 140 and CLPM = 0) were applied to remove potential false positive calls. A second variant caller, Strelka2, was run for SNVs and indels as consensus variant calling was previously shown to eliminate algorithm specific artefacts and to generate highly reproducible mutational spectra compared to using a single variant calling algorithm34,44. Only variants called by both the Sanger variant calling pipeline and Strelka2 were included in subsequent analysis.

Validation of sequencing for Japanese cases

The matched normal tissue sequenced was blood for all countries with the exception of Japan, where adjacent normal kidney was used. As Japan displayed an enrichment of SBS12, matched blood was obtained from 28 of the 36 patients to confirm that the source of the matched normal tissue was not influencing the result. In all cases, the mutational spectra of Japanese ccRCC generated using either blood or adjacent normal kidney matched each other with a cosine similarity of >0.99.

Generation of mutational matrices

Mutational matrices for SBS, DBS and indels were generated using SigProfilerMatrixGenerator (https://github.com/AlexandrovLab/SigProfilerMatrixGenerator) with default options (v1.2.12)45.

Mutational signature analysis

Mutational signatures were extracted using two algorithms, SigProfilerExtractor (https://github.com/AlexandrovLab/SigProfilerExtractor), based on non-negative matrix factorization, and mSigHdp46 (https://github.com/steverozen/mSigHdp), based on the Bayesian hierarchical Dirichlet process. For SigProfilerExtractor, de novo mutational signatures were extracted from each mutational matrix using SigProfilerExtractor with nndsvd_min initialization (NMF_init = “nndsvd_min”) and default parameters (v1.1.9)47. Briefly, SigProfilerExtractor deciphers mutational signatures by first performing Poisson resampling of the original matrix with additional renormalization (based on a generalized mixture model approach) of hypermutators to reduce their effect on the overall factorization47. Non-negative matrix factorization (NMF) was performed using initialization with non-negative singular value decomposition and by applying the multiplicative update algorithm using the Kullback–Leibler divergence as an objective function47. NMF was applied with factorizations between k = 1 and k = 20 signatures; each factorization was repeated 500 times47. De novo SBS mutational signatures were extracted with SigProfilerExtractor for both SBS-288 and SBS-1536 contexts45. The results were largely concordant with the SBS-1536 de novo signatures allowing additional separation of mutational processes, therefore the SBS-1536 de novo signatures were taken forward for further analysis (Supplementary Table 2). Mutational signatures for DBS and indels were extracted in DBS-78 and ID-83 contexts respectively (Supplementary Tables 3 and 4). Where possible, SigProfilerExtractor matched each de novo extracted mutational signature to a set of previously identified COSMIC signatures4, for SBS-1536 signatures this requires collapsing the 1536 classification into the standard 96 substitution type classification with six mutation classes having single 3’ and 5’ sequence contexts (Supplementary Table 7). This step makes it possible to distinguish between de novo signatures which can be explained by a combination of the known catalogue of mutational process (which have not been completely separated during the extraction), and those which have not been previously identified. mSigHdp extraction of SBS-96 and ID-83 signatures was performed using the suggested parameters and using the country of origin to construct the hierarchy. SigProfilerExtractor’s decomposition module was subsequently used to match mSigHdp de novo signatures to previously identified COSMIC signatures4. Further details on the comparison of results between SigProfilerExtractor and mSigHdp, decomposition of de novo signatures into COSMIC reference signatures and support for the splitting of SBS40 components can be found in the Supplementary Note.

Attribution of activities of mutational signatures

The de novo (SigProfiler) and COSMIC signature (SigProfiler and mSigHdp) activities were attributed for each sample using the MSA signature attribution tool (v2.0, https://gitlab.com/s.senkin/MSA)48. For COSMIC attributions, only COSMIC reference signatures, which were identified in the decomposition of de novo signatures, were included in the panel for attribution, in addition to de novo signatures which could not be decomposed into COSMIC reference. At its core, the tool utilizes the non-negative least squares (NNLS) approach minimizing the L2 distance between the input sample and the one reconstructed using available signatures. To limit false positive attributions, an automated optimization procedure was applied by repeated removal of all signatures that do not increase the L2 similarity of a sample by >0.008 for SBS, >0.014 for DBS, and >0.03 for ID mutation types, as suggested by simulations. These optimal penalties were derived using an optional parameter (params.no_CI_for_penalties = false) utilizing a conservative approach in calculation of penalties. Finally, a parametric bootstrap approach was applied to extract 95% confidence intervals for each attributed mutational signature activity.

Driver mutations

A dNdS approach was used to identify genes under positive selection in ccRCC49. The analysis was performed both for the whole genome (q value < 0.01), and with restricted hypothesis testing for a panel of 369 known cancer genes49. Variants in any gene identified as under positive selection in global dNdS or in the 369-cancer gene panel were assessed as potential drivers49. Candidate driver mutations were annotated with the mode of action using the Cancer Gene Census (https://cancer.sanger.ac.uk/census) and the Cancer Genome Interpreter tool50 (https://www.cancergenomeinterpreter.org). Missense mutations were assessed using the MutationMapper tool51 (http://www.cbioportal.org/mutation_mapper). Variants were considered likely drivers if they met any of the following criteria: (1) Truncating mutations in genes annotated as tumour suppressors; (2) mutations annotated as probably or known oncogenic in MutationMapper; (3) truncating variants in genes with selection (q value < 0.05) for truncating mutations assumed to be tumour suppressors and thus likely drivers; (4) missense variants in all genes under positive selection and with dN/dS ratios for missense mutations above 5 (assuming 4 of every 5 missense mutations are drivers) labelled as likely drivers; or (5) in-frame indels in genes under significant positive selection for in-frame indels.

Evolutionary analysis

Subclonal architecture reconstruction was performed using the DPClust R package v2.2.8 (refs. 28,29), after obtaining cancer cell fraction (CCF) estimates by dpclust3p v1.0.8 (https://github.com/Wedge-lab/dpclust3p) based on the variant allele frequency provided by the somatic variant callers and the copy number profiles determined by the BATTENBERG algorithm. Only tumours with at least 40% purity according to BATTENBERG were considered for further evolutionary analysis. For each tumour with at least one subclone, the respective somatic mutations were split into clonal and subclonal mutations using the most probable cluster assignment for each mutation as per the DPClust output. Mutations not assigned to a cluster by DPClust were removed from further analysis. Clusters centred at a CCF > 1.5 and ones where chromosome X contributed the highest number of mutations were deemed artifactual, and the respective mutations were removed. Samples with a total number of clonal or subclonal mutations below 256 were also removed. Additionally, samples with poor separation between the clonal and subclonal distributions (e.g., subclone centred at a CCF > 0.80) were removed. Finally, only samples that had both a clone and at least one subclone post-filtering were retained for further analysis. This yielded a total of 223 samples, each with clonal and subclonal mutations. SigProfilerAssignment (v0.0.13)52 (https://github.com/AlexandrovLab/SigProfilerAssignment) was used to identify the activity of each mutational signature in each clone/subclone, and these activities were then normalized by the total number of mutations belonging to the clone or subclone (that is, clonal mutations were not included in the subclone). A two-sided Wilcoxon signed-rank test53 was used to assess the differences in the relative activity of each mutational signature between the clones and their respective subclones. P values were corrected using the Benjamini–Hochberg procedure54 and reported as q values in this Article.

Regressions

Signature attributions were dichotomized into presence and absence using confidence intervals, with presence defined as both lower and upper limits being positive, and absence as the lower limit being zero. If a signature was present in at least 75% of cases (SBS1, SBS40a, SBS40b, ID1 and ID5), it was dichotomized into above and below the median of attributed mutation counts. The binary attributions served as dependent variables in logistic regressions, and relevant risk factors were used as factorized independent variables. To adjust for confounding factors, sex, age of diagnosis, country, and tobacco status were added as covariates in regressions. The Bonferroni method was used to test for significant P values (that is, a total of 224 comparisons for regressions with signatures, and a total of 24 comparisons for regressions with mutation burden). P values reported are raw (not corrected). Regressions with incidence of renal cancer were performed as linear regressions with mutation burdens or signature attributions (those present in at least 75% of cases) with confidence intervals not consistent with zero as a dependent variable, and ASR of renal cancer obtained from Global Cancer Observatory (GLOBOCAN)11, sex and age of diagnosis as independent variables. ASR of renal cancer for regions of Czech Republic were obtained from SVOD web portal55. Signatures present in less than 75% of cases were dichotomized into presence and absence as previously mentioned and analysed using the logistic regressions. All regressions were performed on a sample basis.

Polygenic risk score analysis of lifestyle risk factors

In this analysis, we used the genome-wide association studies (GWAS) summary statistics estimated in European populations for well-established risk factors for ccRCC. For tobacco smoking status, we used results from the GSCAN consortium meta-analysis of smoking initiation (ever vs never status)56. For BMI, the results of UK Biobank and GIANT consortium meta-analysis of continuous BMI were used57. GWAS summary statistics related to hypertension, namely systolic blood pressure and diastolic blood pressure, as well as the ones related to diabetes58, such as fasting glucose and fasting insulin were also obtained using UK Biobank studies59.

Since all the GWAS summary statistics used in the current work were based on European populations, we used ADMIXTURE tool (v1.3.0)60 and PCA to infer the unsupervised cluster of individuals with European genetic background within ccRCC cases. Hapmap SNPs (n = 1,176,821 variants) were extracted from the ccRCC whole-genome sequence genotype data. After basic quality control using PLINK (v1.9b, www.cog-genomics.org/plink/1.9/), 333 variants were removed due to missing genotype rate > 5%, 1,236 variants failed Hardy–Weinberg equilibrium test (P values < 10−8), and 18,702 variants had MAF < 1% in our cohort. Additionally, 3 ambiguous variants and 21,358 variants within regions of long-range, high linkage disequilibrium in the human genome (hg38) were excluded. After pruning for linkage disequilibrium, 143,727 variants remained in ccRCC genotype data. The 1000 genome reference population genotype data (phase 3) for Europeans (N = 489), Africans (Yoruba in Ibadan, Nigeria, N = 108) and East Asians (N = 103 from China and 104 from Japan) (https://www.internationalgenome.org/data/) were filtered and merged with ccRCC genotype data based on the pruned set of variants present in both datasets. ADMIXTURE was run on the merged genotype data with k = 3, which would correspond to the 3 ancestral continental population groups that probably reflect the participants of our study. The ccRCC cases with European genetic fraction greater than 80% by the ADMIXTURE analysis were selected for the PRS analyses. To complement the ADMIXTURE analysis, PCA was run on the same samples.

The initial genotype data based on whole-genome sequence from 849 ccRCC cases with European genetic background consisted of biallelic SNPs with MAF > 0.01% (to exclude ultra-rare variants; N ≈ 30 million variants). After basic quality control, variants with missing genotype rate of greater than 5% (N = 7,519,196 variants) with strong deviation from Hardy–Weinberg equilibrium (P values < 10−8, N = 220,862) were excluded. For each GWAS trait, we restricted our analyses to the biallelic SNPs with minor allele frequency (MAF) greater than 1% in the 1000 Genomes reference for European populations. For the selection of the independent genome-wide significant hits (P values < 5 × 10−8) of each GWAS summary statistic used to generate the PRS, SNPs were clumped (r2 = 0.1 within a linkage disequilibrium window of 10 MB) using PLINK v1.9b61 (www.cog-genomics.org/plink/1.9/) based on the 1000 Genomes European reference population genotype data (N = 489; ~10 million variants). Where a selected GWAS hit was not found in ccRCC genotype data, we extracted proxies (r2 > 0.8 in 1000 Genomes) also present in ccRCC dataset where possible (Supplementary Table 17). The variance of each genetic trait explained by the genetic variants were calculated as previously suggested62. PRS was subsequently calculated as the sum of the individual’s beta-weighted genotypes using PRSice-2 software63. Associations were estimated per standard deviation increase in the PRS, which was normalized to have a mean of zero across ccRCC cases of European genetic ancestry.

Untargeted metabolomics association with signatures

Of the 962 subjects from the main analysis, 901 subjects were included in this sub-study—all Japanese samples (n = 36) as well as few cases from Czech Republic (n = 13), Romania (n = 5) and Russia (n = 3) were not included due to lack of available plasma samples. Samples were randomized and analysed as two independent analytical batches. Analysis was performed with a UHPLC-QTOF-MS system that consisted of a 1,290 Binary LC and a 6,550 QTOF mass spectrometer equipped with Jet Stream electrospray ionization source (Agilent Technologies), using previously described methods64. Pre-processing was performed using Profinder 10.0.2.162 and Mass Profiler Professional B.14.9.1 software (Agilent Technologies, https://www.agilent.com/). A ‘batch recursive feature extraction (small molecules)’ process was employed for samples and blanks to find [M + H]+ ions. The two batches were processes separately and the resulting features were aligned in Mass Profiler Professional. Chromatographic peak areas were used as a measurement of intensity. No normalization or transformation of raw data was performed prior to the downstream data analysis.

A total of 2,392 features were detectable in at least one of the 901 samples. Features present in only one of the two batches were filtered out. Recursive filtering elimination was applied to decrease redundancy from highly correlated variables (r ≥ 0.85, Pearson’s r calculated before any transformation/imputation) by selecting the features with least missing data within clusters of features. A total of 944 features were included in the statistical analysis. Features were pre-processed: missing values were replaced with one-fifth of the minimal value of the feature before applying mean centering and Pareto scaling. Each feature was regressed against both de novo and COSMIC signatures, adjusting for sex and age of diagnosis, as well as BMI and technical factors (batch, acquisition order) that could impact chromatographic peak area. Models for SBS22a and SBS22b were restricted to Romanian and Serbian samples to find potential pathways of aristolochic acid exposure in the Balkan region. Logistic models were used for zero-inflated signatures (≥30% zeros) while quasi-Poisson regressions were used for the least zero-inflated signatures (SBS1, SBS40a, and SBS40b). To derive specific false detection rates, random variables were created from permutations of the initial features and regressed against signatures in the same fashion as true features. Maximum P value thresholds from regressions with random features were compared to adjusted P value thresholds according to Bonferroni’s procedure. The more conservative approach was used in selecting features of interest. Random forest models were also used as cross-checking multivariate models to assess the relative importance of each feature in explaining the signature attribution. As with univariate models, regression models were used for the least zero-inflated signatures (<30% of zeros) while classification models were used for all other signatures, with restriction to Romanian and Serbian samples for SBS22a and SBS22b. Importance was estimated from the total decrease in node impurities from splitting on the variable, averaged over all trees. Node impurity was measured by the Gini index for classification, and by residual sum of squares for regression. The significance of importance metrics for Random forest models were estimated by permuting the response variable (https://github.com/EricArcher/rfPermute).

Features considered for identification, along with their highly correlated counterparts, were searched in Human Metabolome Database (HMDB), LipidMaps, Metlin and KEGG. Compound identity was confirmed by comparison of retention times and MS/MS fragmentation against chemical standards when available, or otherwise against reference MS/MS spectra. Since the feature [email protected] was strongly correlated with several features identified as TMAP (Supplementary Table 10), the integration of these features was inspected and corrected manually, and regressed against SBS40b using the same model applied to features selected for analysis. Creatinine was identified among the features by matching its retention time and MS/MS spectra against a reference standard and also regressed against SBS40b in the same fashion as other metabolites. Estimation of correlation between metabolic features was done using linear regression adjusting for batch and acquisition order.

Targeted metabolomics analyses

Circulating levels of per- and polyfluorinated substances (PFAS) and cystatin C compounds were investigated using targeted mass spectrometry-based methods as described previously30,65.

Out of the 962 subjects from the main analysis, plasma samples from 909 subjects (from all countries except Japan) were randomized and sent frozen in dry ice to each respective laboratory for analyses. Measurement of cystatin C from 906 subjects included its native form and isoforms (3Pro-OH cystatin C, cystatin C-desS, 3Pro-OH cystatin C-desS and cystatin C-desSSP) that were modelled individually and for the total concentration of cystatin C isoforms. Measurements of PFAS compounds included PFOA (total, branch, linear), PFOS (perfluorooctanoic acid; total, branch, linear), PFHxS (perfluorohexane sulfonate), PFNA (perfluorononanoic acid), PFDA (perfluorodecanoic acid), MePFOSAA (n-methylperfluoro-1 octanesulfonamido acetic acid) and EtPFOSAA (2-(N-ethyl-perfluorooctane. sulfonamido) acetic acid).

Multivariable quasi-Poisson (for the least sparse signatures SBS1, SBS40a and SBS40b) and logistic regression were used to estimate the association between plasma concentrations of the aforementioned substances and mutational signatures. All compounds were modelled continuously (log2-transformed) and categorically, with adjustments made by sex, age, date of recruitment, country, BMI, tobacco and alcohol status in the case of PFAS molecules and by sex, age and BMI, in the case of cystatin C.

Geospatial analyses

Geospatial analyses were performed to estimate the regional effect for signature attribution, particularly for signatures thought to be from exogenous exposure (SBS40b, unknown and SBS22a/SB22b, aristolochic acid). Residential history information was available for a large proportion of cases from the countries of interest: Czech Republic for SBS40b and Romania and Serbia for SBS22a and SBS22b, respectively. The 259 cases from Czech Republic within this study were recruited from four separate regions including Prague, České Budějovice (in Southern Bohemia), as well as Brno and Olomouc in the east of the country. Each individual residence was geocoded to its administrative region. All locations outside the country of recruitment were labelled as ‘abroad’. A multi-membership mixed model was used to account for the full list of regions in which each subject resided, as well as the proportion of life spent in that region before diagnosis. As dependent variable, signatures were inverse-normal transformed. Models were adjusted for sex and age of diagnosis (fixed effects). The regional effect was treated as random effect.

Statistics and reproducibility

Analyses were conducted using R version 4.1 (ref. 66) and python version 3.9.13 (ref. 67). Handling of geospatial and other data was conducted using the R packages lme4, matrixStats, Matrix, geojsonio, raster, rgeos, sf, sp, tmaptools, patchwork, leaflet, data.table, dplyr, haven, Hmisc, openxlsx, rgdal, scales, stringr, tidyr, tibble, xlsx, rfPermute, randomForest, forcats, and in python using the packages pandas, numpy, scipy, statsmodels, firthlogist, patsy and jupyter68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97. Figures were created using ggplot, ggnewscale, ggpattern, ggrepel, ggsflabel, ggspatial, ggpubr, cowplot, matplotlib, plotly (https://plot.ly), seaborn and TMB_plotter98,99,100,101,102,103,104,105,106,107,108. Open-source maps of Czech Republic, Romania and Serbia were obtained from the Global Administrative Areas project109 (https://gadm.org), with the surrounding borders obtained from the Natural Earth project110 (https://www.naturalearthdata.com/). Signature extraction was replicated two times independently at both Wellcome Sanger Institute and UCSD, with similar results. Signature attribution was replicated two times independently at both Wellcome Sanger Institute and IARC, with similar results. All attempts at replication were successful. No other experiments other than those mentioned here were replicated independently due to limited resources. Additional details relating to the methods used in this study can be found in Supplementary Figs. 127 and Supplementary Note Tables 110.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Source link