Nuclear-embedded mitochondrial DNA sequences in 66,083 human genomes - Nature - Alert Breaking News

Study samples

We studied 68,348 genomes from whole-blood DNA in Genomics England Rare Disease Project and 26,488 cancer genomes from Genomics England Cancer Project. DNA was extracted and processed based on the Genomics England Sample Handling Guidelines (https://legacy.genomicsengland.co.uk/about-genomics-england/the-100000-genomes-project/information-for-gmc-staff/sample-handling-guidance/). DNA samples were received in FluidX tubes (Brooks) and accessioned into Laboratory Management Information System (LIMS) at UK Biocentre. Following automated library preparation, libraries were quantified using automated quantitative PCR, clustered and sequenced. Libraries were prepared using the Illumina TruSeq DNA PCR-Free High Throughput Sample Preparation kit or the Illumina TruSeq Nano High Throughput Sample Preparation kit⁴⁶.

Ethical approval

Ethical approval was provided by the East of England Cambridge South National Research Ethics Committee under reference number 13/EE/0325, with participants providing written informed consent for this approved study. All consenting participants in the Rare Disease arm of the 100,000 Genomes Project were enrolled via 13 centres in the National Health Service (NHS) covering all NHS patients in England.

Quality control checks of rare disease genomes

All the samples were passed an initial QC check based on sequencing quality and coverage from the sequencing provider (Illumina) and Genomics England internal QC checks (https://research-help.genomicsengland.co.uk/display/GERE/Sample+QC). We only included the samples aligned to the Homo sapiens NCBI GRCh38 assembly with decoys (N = 58,335). All the samples were sequenced to produce at least 85 Gb of sequence data with sequencing quality of at least 30. Alignments covered at least 95% of the genome at 15x or above with well-mapped reads (mapping quality > 10) after discarding duplicates. Additionally, all included samples have passed a set of basic QC metrics: (1) sample contamination (VerifyBamID freemix⁴⁷) < 0.03, (2) ratio of single nucleotide variant (SNV) Heterozygous-to-Homozygous (Het-to-Hom) calls < 3, (3) total number of SNVs between 3.2 M–4.7 M, (4) array concordance > 90%, (5) median fragment size > 250 bp, (6) excess of chimeric reads < 5%, (7) percentage of mapper reads > 60%, and (8) percentage of AT dropout < 10%. 57,961 genomes were passed WGS QCs. We further excluded the samples with the average depth of mitochondrial genomes below 500x after re-aligned the mitochondrial reads (see details below). For the rare disease genomes study, we included 53,574 individuals, 25,436 male and 28,138 females, age from 0 to 99 years (Extended Data Fig. 1a,b). The average depth of WGS was 42x (s.d. = 7.7x) and average depth of mtDNA was 1,990x (s.d = 866x) (Extended Data Fig. 1c).

Family QC checks

In the family related analysis, WGS family selection quality checks are processed for rare disease genomes, reporting abnormalities of sex chromosomes and reported versus genetic sex summary checks (computed from family relatedness, mendelian inconsistencies, and sex chromosome checks). For the sex determination, the coverage data for the X and Y chromosomes was compared to the average coverage for the sample autosomes using PLINK v1.90⁴⁸ (www.cog-genomics.org/plink/1.9/). The resulting output is compared with the participant sex provided at sample collection. Relatedness checks were based on verification of the mendelian inconsistencies between members of a trio/family. The individual VCF files were merged into a family VCF with BCFTools (v1.3.1)⁴⁹ and the mendelian inconsistencies again checked with PLINK. The relationships are also checked by calculated genomic identity-by-descent values for all pairwise relationships in a family using PLINK and comparing with expected values for reported relationship (https://research-help.genomicsengland.co.uk/). We further processed an independent relatedness check using our previously published method⁵⁰. In brief, a list of 32,665 autosomal SNPs was selected to estimate relatedness. By filtering the merged VCF and the 1000 Genomes reference set⁵¹ with the selected SNPs, the pc-relate function from the GENESIS package was applied to obtain the pairwise relatedness⁵². The first 20 principal components were used to weight the population structure, and the reference set was used to increase genetic diversity accounted for by the principal component analysis. Finally, we included 8,201 families whose relatedness was consistent between two independent prediction methods and the clinical records.

QC checks of cancer genomes

We initially studied 26,488 cancer genomes from Genomics England Cancer Project. Samples were prepared using an Illumina TruSeq DNA Nano, TruSeq DNA PCR-Free or FFPE library preparation kit and then sequenced on a HiSeq X generating 150 bp paired-end reads. Germline samples were sequenced to produce at least 85 Gb of sequences with sequencing quality of at least 30. For tumour samples at least 212.5 Gb was required. Alignments for the germline sample covered at least 95% of the genome at 15x or above with well-mapped reads (mapping quality > 10) after discarding duplicates (https://research-help.genomicsengland.co.uk/).

For the sample cross-contamination checks, germline samples are processed with VerifyBamID⁴⁷ algorithm and PASS status is assigned to the samples with less than 3% of contamination. Tumour samples were processed with the ConPair algorithm⁵³ with a PASS status indicating contamination is below 1% as described in https://research-help.genomicsengland.co.uk/display/GERE/10.+Further+reading+and+documentation?preview=/38047056/45023724/Cancer%2520Analysis%2520Technical%2520Information%2520Document%2520v1-11%2520main.pdf#id-10.Furtherreadinganddocumentation-TechnicalDocumentation.

After the QC steps described above, 12,509 tumour–normal tissue pairs from 12,509 tumour samples and 11,913 matched normal tissue (germline) samples from 11,909 individuals remained. Samples were prepared using 5 different methods (FF, FFPF, CD128 sorted cells, EDTA and ASPIRATE) and three different library types (PCR, PCR-FFPE and PCR-free). We performed the additional QCs by comparing the average number of NUMTs were detected from the samples prepared by different methods and library types. We observed that the average number of NUMTs was significantly different between different groups (Supplementary Fig. 8a). To avoid possible bias caused by sample preparation and library type, we only included the 10,713 tumour–normal sample pairs prepared using FF and library type PCR-free from 9648 individuals across 21 cancer types (Extended Data Fig. 6a). The average WGS depth of tumour sample was 117x (s.d. 10.1x) and the average WGS depth of germline was 43x (s.d. 9.3x) (Supplementary Fig. 8b). The average mtDNA depth of tumour sample was 27,119x (s.d. 13,642x) and the average mtDNA depth of germline was 3,549x (s.d. 2,452x) (Supplementary Fig. 8c).

Inferencing ancestry from nuclear genome sequencing data

Broad genetic ancestries were estimated using ethnicities from the 1000 genomes project phase 3 (1KGP3)⁵¹ as the truth, by generating PCs for 1KGP3 samples and projecting all participants onto these. We included five broad super-populations: African (AFR), Admixed American (AMR), East Asian (EAS), South Asian (SAS) and European (EUR). The brief steps were as follows: (1) all unrelated samples were selected from the 1KGP3, (2) we selected 188,382 high quality SNPs in our dataset, (3) we further filtered for MAF > 0.05 in 1KGP3 (as well as in our data), (4) we calculated the first 20 principal components using GCTA⁵⁴, (5) we projected the individual data onto the 1KGP3 principal component loadings, (6) we trained a random forest model to predict ancestries based on (i) first 8 1KGP3 principal components, (ii) set Ntrees = 400, (iii) train and predict on 1KGP3 AMR, AFR, EAS, EUR and SAS super-populations. The full details can be found at https://research-help.genomicsengland.co.uk/display/GERE/Ancestry+inference. Genetic ancestry was also predicted and checked using our previously published method⁵⁰. The individuals who were not assigned to any of 5 super-populations were labelled as ‘OTHER’. We predicted 1,280 AFR, 170 AMR, 342 EAS, 5,758 SAS, 42,202 EUR and 3,363 OTHER in this study (Fig. 2a). In the cancer germline genomes, we included 312 AFR, 17 AMR, 71 EAS, 338 SAS, 8,348 EUR and 314 OTHER (Extended Data Fig. 6c,d).

We performed a uniform manifold approximation and projection (UMAP)⁵⁵ based on the NUMTs which were unique to each population in rare disease genomes. UMAP was analysed using the UMAP package with default parameters in R and visualized using the M3C package⁵⁶ in R.

Extracting mitochondrial DNA sequences and detecting variants

The subset of sequencing reads which aligned to the mitochondrial genome were extracted from each WGS BAM file using Samtools⁵⁷. We ran MToolBox (v1.0)⁵⁸ on the resulting smaller BAM files to generate the re-aligned mtDNA BAM files. The re-aligned BAM files were used to call the variants. We also used the second variant caller VarScan2⁵⁹ to call mtDNA variants from the re-aligned BAM files (–strand-filter 1, –min-var-freq 0.001, –min-reads2 1, –min-avg-qual 30). The mpileup files used in VarScan2 were generated by Samtools with options -d 0 -q 30 -Q 30. The allele fractions were extracted from VarScan2. We retained only single nucleotide polymorphisms (SNPs) with more than 2 reads on each strand for the minor allele. Variants falling within low-complexity regions (66–71, 300–316, 513–525, 3106–3107, 12418–12425 and 16182–16194) were excluded.

Mitochondrial DNA haplogroup assignment was performed using HaploGrep2^60,61.

Detecting NUMTs and breakpoints not present in the reference sequence

To detect NUMTs, we used a previously published and validated method^5,15. From the aligned WGS BAM files we extracted the discordant read pairs using samblaster⁶² and included the read pairs where one end aligns to nuclear genome and the other end aligns to the mtDNA reference sequence. The reads with mapping quality equal to zero were discarded. The discordant reads were then clustered together based on sharing the same orientation and whether they were within a distance of 500 bp. We detected the clusters supported by at least two pairs of discordant reads, and filtered out the clusters supported by less than five pairs of discordant reads in our main analysis. The NUMTs within a distance of 1,000 bp on both nuclear DNA and mtDNA were grouped as the same NUMT. We generated two sets of NUMTs based on the NUMTs supported by at least two pairs of discordant reads and at least five pairs of discordant reads (Supplementary Table 1). We observed a weak correlation of the average number of NUMTs and WGS depth (R² = 0.134, P < 2.2 × 10⁻¹⁶) and mitochondrial genome depth (R² = 0.092, P < 2.2 × 10⁻¹⁶) (Supplementary Figs. 9a,b) indicating that, although some NUMTs may be missed due to low depth, they are unlikely to have an impact on our conclusions. There was no detected difference of the number of detecting reads with the frequency of NUMTs, suggesting the detection of NUMTs were not biased by the sequencing quality (Supplementary Fig. 9c).

To identify putative breakpoints spanning nuclear DNA and a mtDNA-derived sequence (nuclear-mtDNA breakpoints), we searched for the split reads within a distance of 1,000 bp of discordant reads which were then re-aligned using BLAT⁶³. We further analysed the re-aligned reads where one end of the read mapped to nuclear DNA and the other end of the same read mapped to mtDNA-derived sequence. We defined the breakpoints by at least three split reads within the same NUMT. Each NUMT should have one nuclear breakpoint and two mitochondrial breakpoints, with the exception of NUMTs occurring with other nuclear genome structure variations. The breakpoints with 200 bp flanking regions on nuclear genome were annotated using gencode v29⁶⁴, gnomAD for pIL scores⁶⁵ and a list of datasets were downloaded from UCSC⁶⁶ and the publications (see details below). When the NUMTs were involved in multiple genes, we kept the genes with the highest pIL score. The breakpoints on the mitochondrial genome were annotated using MitoMap⁶⁷.

Detecting concatenated NUMTs

To detect putative concatenated NUMTs, first we searched for the breakpoints spanning two locations on the mtDNA-derived sequence (mtDNA–mtDNA breakpoints). We extracted the split reads which only aligned to mtDNA sequence. Those split reads were further re-aligned using BLAT. We analysed the reads where the two ends of the same read mapped to two locations on the mtDNA sequence. We then filtered the breakpoints as follows: (1) each breakpoint had at least 3 split reads observed in at least one individual, (2) each breakpoint had at least 2 split reads observed in the same individual, (3) we excluded the split reads mapped to nearby the start and end of mtDNA genome (the beginning and end of D-loop region), (4) we excluded two concatenated positions less than 50 bp away (they may be mtDNA deletions). Note our method had its limitations—we were not able to separate mtDNA–mtDNA breakpoints within NUMTs from true mtDNA if the breakpoints located around the beginning and end of D-loop region. Thus, our analysis likely missed the concatenated NUMTs where mtDNA–mtDNA breakpoints around the beginning and end of D-loop region. However, our aim was to detect confident concatenated NUMTs and show concatenated NUMTs exist in the humans. After applying the stringent filtering (above), we detected 8,686 breakpoints from 151 different mtDNA–mtDNA breakpoints in 8,450 individuals (Extended Data Fig. 3d). 279 out of 8,686 breakpoints (140 different breakpoints) from 148 individuals were ultra-rare (frequency < 0.1%). One breakpoint (12867–14977) was exceptionally common (frequency 38.4%), which was also commonly seen in an independent dataset in our previous study⁵. To confirm mtDNA–mtDNA breakpoints from the nuclear genome, we performed two independent analyses: (1) we compared the mtDNA–mtDNA breakpoints observed in the offspring and their two parents. If the mtDNA–mtDNA breakpoints were present in the offspring and their fathers, but not in their mothers, we defined them as father-transmitted mtDNA–mtDNA breakpoints. If the mtDNA–mtDNA breakpoints were present in the offspring and their mothers, but not in their fathers, we defined them as mother-transmitted mtDNA–mtDNA breakpoints. Note we were not able to identify the transmission patterns if the mtDNA–mtDNA breakpoints were present in all three family members using the short-read sequencing technique. (2) For the rare and ultra-rare mtDNA–mtDNA breakpoints (F < 1%), we checked whether the individuals carrying the same mtDNA–mtDNA breakpoints also carried the same NUMT.

Comparing to known NUMTs

Known NUMTs were downloaded from UCSC and previous publications^16,17,18,19. Bedtools⁴⁹ was used to search for the known NUMTs in our dataset. Using a conservative approach, we defined the NUMTs as known providing the known NUMTs within 1,000 bp NUMT flanks (upstream 500 bp + downstream 500 bp) detected in this study on the nuclear genome, regardless of the fragments of inserted mtDNA sequences.

Enrichment analysis

For the enrichment analysis on both nuclear and mtDNA genomes, we studied 1,637 different confident NUMTs with at least 5 discordant reads using a 2-tailed permutation test. Genomics duplications, simple repeats, dbRIP_HS-ME⁹⁰, regulatory elements, CpG islands, satellites, retrotransposons (including LINEs and SINEs) and TSS were downloaded from UCSC⁶⁶ (https://genome.ucsc.edu/). Using this information to compute the frequency of each dataset in 200 bp NUMT flanks (upstream 100 bp + downstream 100 bp). Empirical P values were calculated by resampling 1,000 sets of random positions matched to observed NUMTs. For the enrichment on each nuclear genome chromosome, we excluded the Y chromosome due to the complex duplicated structure of Y chromosome sequences limiting confident alignment.

To investigate the relationship between different chromosomes and NUMTs, we applied linear regression in R (http://CRAN.R-project.org/)⁶⁸.

$${rm{lm}},({rm{Nnumt}}sim {rm{Lchr}}+{rm{Pcentro}}+{rm{Pcpg}}+{rm{Pline}}+{rm{Pltr}}+{rm{Pretroposon}}+{rm{Psine}}+{rm{Pmicrosat}}+{rm{Prmsk}}+{rm{Prepeats}}+{rm{Pdups}}+{rm{Preg}})$$

where Nnumt is number of NUMTs detected in each chromosome, Lchr is the length of chromosome, Pcentro, Pcpg, Pline, Pltr, Pretroposon, Psine, Pmicrosat, Prmsk, Prepeats, Pdups and Preg are log₂-transformed proportions of centromere, CpG islands, LINES, LTRs, retroposon, SINEs, microsatellites, repeats, simple repeats, genomics duplications and regulatory elements on each chromosome.

Comparing NUMTs with mitochondrial DNA deletions

To study the relationship between NUMT insertion and mitochondrial deletion, we compared the frequency of NUMT breakpoint with the frequency of mitochondrial DNA deletion breakpoint. A list of 1,312 mtDNA deletions were downloaded from mitoBreak database⁶⁹. We calculated the frequencies of breakpoints in different mtDNA regions—D-loop, 13 coding genes, 2 RNAs and combined 22 tRNAs, and compared the distribution with the distribution of breakpoints for germline and tumour-specific NUMTs using linear regression.

Searching for de novo NUMTs in rare disease trios and tumour-specific NUMTs in cancer genomes

We used the most conservative methods to define the de novo NUMTs from father–mother–offspring trios. We only included NUMTs with at least five pairs of discordant reads in the offspring and none of discordant read detected in the parents.

We applied for the same approach to define tumour-specific NUMTs in cancer genomes. Tumour-specific NUMTs were defined by at least five pairs of discordant reads in the tumour samples and none of discordant reads in the matched normal samples. Lost NUMTs in cancer genomes were defined by at least five pairs of discordant reads in the normal samples and no more than one pair of discordant reads in the matched tumour samples.

Estimating the rate of de novo NUMTs in trios and tumour-specific NUMTs in cancer genomes

De novo NUMT insertion rate in trios and cancer genomes was estimated as follows:

$$rho ({rm{germline}})={rm{NumtTtrio}}/{rm{Ntrio}}$$

$$rho ({rm{tumour}})={rm{NumtTumour}}/{rm{Ngenome}}$$

where ρ(germline) is the rate of de novo NUMT insertion in trios, ρ(tumour) is the rate of tumour-specific NUMT insertion in tumour samples, NumtTtrio is the number of de novo NUMT event in trios, NumtTumour is the number of tumour-specific NUMTs, Ntrio is the number of total trios and Ngenome is the number of total normal–tumour pairs.

Analysing the correlation of tumour-specific NUMTs and cancer types

To understand the relationship between donor age, sex and the average number of NUMTs, we applied linear regression to each dataset using R (http://CRAN.R-project.org/).

Model 1 < − lm(N ∼ Age + Sex + DPmt)

Model 2 < − lm(Nsoma ∼ Age + Sex + DPmt)

Where N and Nsoma are average numbers of NUMTs and tumour-specific NUMTs, Age is donor age, Sex is donor sex and DPmt is average mitochondrial DNA sequencing depth.

Detecting cancer SNVs, indels and structural variants

Read alignment against human reference genome GRCh38-Decoy+EBV was performed with ISAAC (version iSAAC-03.16.02.19)⁷⁰, SNVs and short insertions–deletions (indels) variant calling together with tumour − normal subtraction was performed using Strelka (version 2.4.7)⁷¹. Strelka filters out the following germline variant calls: (1) all calls with a sample depth three times higher than the chromosomal mean, (2) site genotype conflicts with proximal indel call, (3) locus read evidence displays unbalanced phasing patterns, (4) genotype call from variant caller not consistent with chromosome ploidy, (5) the fraction of basecalls filtered out at a site > 0.4, (6) locus quality score < 14 for heterozygous or homozygous SNP, (7) locus quality score < 6 for heterozygous, homozygous or het-alt indels, (8) locus quality score < 30 for other small variant types or quality score is not calculated. Strelka filters out the following somatic variant calls: (1) all calls with a normal sample depth three times higher than the chromosomal mean, (2) all calls where the site in the normal sample is not a homozygous reference, (3) somatic SNV calls with empirically fitted VQSR score < 2.75 (recalibrated quality score expressing the phred scaled probability of the somatic call being a false positive observation), (4) somatic indels where fraction of basecalls filtered out in a window extending 50 bases to either side of the indel’s call position is > 0.3, (5) somatic indels with quality score < 30 (joint probability of the somatic variant and a homo ref normal genotype), (6) all calls that overlap LINE repeat region.

Structural variants (SVs) and long indel (>50 bp) calling was performed with Manta (version 0.28.0)⁷² which combines paired and split-read evidence for SV discovery and scoring. Copy number variants (CNVs) were called with Canvas (version 1.3.1)⁷³ that employs coverage and minor allele frequencies to assign copy number. These tools filter out the following variant calls: (1) Manta-called SVs with a normal sample depth near one or both variant break-ends three times higher than the chromosomal mean, (2) Manta-called SVs with somatic quality score < 30, (3) Manta-called somatic deletions and duplications with length > 10kb, (4) Manta-called somatic small variant (<1kb) where fraction of reads with MAPQ0 around either break-end > 0.4, (5) Canvas-called somatic CNVs with length < 10kb, (6) Canvas-called somatic CNVs with quality score < 10. The full details of bioinformatics pipeline can be found at https://research-help.genomicsengland.co.uk/pages/viewpage.action?pageId=38046624.

Searching for the evidence of the mechanism of NUMT insertions

PRDM9

PRDM9 determines the locations of meiotic recombination hotspots where meiotic DNA DSBs are formed. To investigate the mechanism of NUMT insertions, we compared the NUMTs with a set of 170,198 published PRDM9-binding peaks cross the genome⁷⁴. We counted the number of NUMTs overlapping PRDM9-binding peaks and performed the permutation analysis (see the details in ‘Enrichment analysis’). Next, we calculated the distance between the breakpoint of each NUMT (from both the germline and tumour-specific NUMTs) with the nearest PRDM9-binding site.

Human DNA repair genes

A list of known human DNA repair genes was downloaded from Human DNA Repair Genes website (https://www.mdanderson.org/documents/Labs/Wood-Laboratory/human-dna-repair-genes.html)^38,39. We extracted the somatic missense mutations in DNA repair genes from all cancer samples, and compared the relationship between samples carrying the mutations and tumour-specific NUMTs.

Somatic mutational signatures

Somatic mutation signatures are the consequence of multiple mutational processes that the human body is subjected to throughout life. Each different process generates a unique combination of mutation types that are called mutation signatures (https://cancer.sanger.ac.uk/signatures/signatures_v2/). Mutational signature was computed using the R package nnls (https://CRAN.R-project.org/package=nnls). The details of how the signatures were computed is described in Alexandrov et al., 2013⁷⁵ and online document https://research-help.genomicsengland.co.uk/pages/viewpage.action?pageId=38046624.

Assessing clinical significance

Rare disease participants with no known genetic diagnosis

The Genomics England PanelApp (https://panelapp.genomicsengland.co.uk/)⁷⁶ list of genes and genomic entities were used to provide a list of potential disease genes (N = 5,883). NUMTs were identified that had a frequency of < 1%, and their breakpoints within 200 bp flanking regions of one of these genes. Consequence annotation was done with gencode v29, including gene, intron, exon, CDS, start codon, stop codon, five prime UTR and three prime UTR regions⁶⁴. NUMTs which were annotated as falling in an exon were analysed in detail. For each gene, we considered the strength of evidence that the gene is associated with a disease, the inheritance pattern of the disorder, the reported types of pathogenic variants and reported mechanism of disease (for example, haploinsufficiency, gain of function or repeat expansion), using information from OMIM (https://omim.org/)⁷⁷ and by searching PubMed (https://pubmed.ncbi.nlm.nih.gov/). For the established disease genes, we considered available clinical information for each proband which included their Human Phenotype Ontology terms⁹¹, family history and age at enrolment. We assumed that the rare NUMT was present on one allele only, unless it was present in both parents or there was documented consanguinity (where parental data was not available). For recessive disorder genes containing a NUMT, we looked whether it was present in one or both parents (if available), whether there was a family history of consanguinity, and at the sequence data to see whether there was a second rare variant. The location of the NUMT insertion was explored in UCSC genome browser⁶⁶.

Rare disease participants with a genetic diagnosis

Participants with a confirmed genetic diagnosis were identified from the Genomic Medicine Centre exit questionnaire (https://research-help.genomicsengland.co.uk/pages/viewpage.action?pageId=38046767). Genomic coordinates of the causative variant were compared with the genomic coordinates of the NUMTs using bedtools⁴⁹.

Rare disease NUMTs in participants with mitochondrial DNA maintenance disorders

Participants with mitochondrial DNA maintenance disorders⁷⁸ were identified from the Genomic Medicine Centre exit questionnaire and from our previous analysis of participants with suspected mitochondrial disorders⁷⁹. We also identified affected family members who had genome sequencing data available. 122 NUMTs were detected from 20 individuals. Only 4 NUMTs (2 different NUMTs) from two families in exons. We compared the average number of NUMTs in these participants to the rest of the rare disease participants.

Cancer genomes

To determine whether a NUMT insertion was a driver mutation in the development of cancers, NUMTs with 200 base pairs flanking region were identified which were located genes of interest. Our genes of interest were defined as those on the COSMIC (Catalogue of Somatic Mutations in Cancer) Cancer Gene Census list (tier 1 and tier 2) which includes genes known to contain mutations causally implicated in cancer²⁸. We also used a list of known human DNA repair genes^38,39. The location of the NUMT insertion in relation to these gene lists was explored in the UCSC genome browser.

Validating the NUMTs using long-read sequencing

To validate NUMT detection in short-read sequencing, we carried out whole-genome sequencing on Oxford Nanopore PromethION in 39 individuals from rare disease genomes. To maximize sequencing yield, 4 μg of germline DNA from 100KGP participants was fragmented to 15–30 Kb with Covaris g-tubes (4,000 rpm, 1 min, 1–3 passes until the desired length was achieved) and then depleted of low molecular weight DNA (<10 Kb) with the Short Read Eliminator kit (Circulomics, SS-100-101-01) as described by the manufacturer. After checking DNA size distribution on an Agilent Femto Pulse system, a sequencing library was generated with the Oxford Nanopore SQK-LSK109 kit, starting from 1.2 µg of high molecular weight-enriched DNA. Samples were quantified with a Qubit fluorometer (Invitrogen, Q33226) and 500 ng loaded onto a PromethION R.9.4.1 flow cell following manufacturer’s instructions. In experiments where throughput was limited by a rapid increase in unavailable pores, the library was re-loaded following a nuclease flush ~20hrs after the initial run. Base-calling was performed with Guppy-3.2.6/3.2.8 in high accuracy mode. Full details of the protocol can be found at https://research-help.genomicsengland.co.uk/display/GERE/Genomic+Data+from+ONT?preview=/38046759/38047942/v1_protocol_ONT_LSK109.pdf. Sequencing reads were aligned to GRCh38 using minimap2⁸⁰ version 2.17. QC statistics and plots were generated using Nanoplot⁸¹ version 1.26.0. The full details of bioinformatics pipeline can be found at https://research-help.genomicsengland.co.uk/display/GERE/Genomic+Data+from+ONT?preview=/38046759/38047944/PromethION%20SV%20calling%20pipeline%20GRCh38.docx. We then extracted the long reads aligned to the same region where a NUMT detected using short-read sequencing from the same individual. The extracted long reads were re-aligned using BLAT. The observed NUMTs were also manually inspected on Integrated Genomics Viewer (IGV)⁸². 182 out of 184 NUMTs (29 out of 31 distinct NUMTs) detected using short-read sequencing were also seen in long-read sequencing data. Two NUMTs from the same individual were missing in long-read sequencing likely due to the low number of aligned reads in long-read sequencing.

Detecting methylation state of NUMTs using long-read sequencing

Whole-genome-wide methylation detection was carried out using call-methylation function from Nanopolish v0.13.3⁸³ in 39 individuals. The methylation detection output includes the position of the CG dinucleotide on the reference genome, the ID of the read that was used to make the call, and the log-likelihood ratio. We extracted the long reads mapped to mtDNA genome, and further grouped them into two groups: (1) long reads also mapped to nuclear genome, (2) long reads only mapped to mtDNA genome. Next, we calculated methylation frequency of each site using the calculate_methylation_frequency.py script from the package in each read group. The methylation calls detected by the 1st group were from NUMTs, and the calls detected by the 2nd group were from true mtDNA. We used the methylation profile of true mtDNA as reference, and NUMTs methylation was estimated as the log₂ ratio of methylation frequency of each site between NUMTs and true mtDNA from the same individual. Note, if the individuals carried concatenated NUMTs, the calls detected by 2nd group were from mixed true mtDNA and concatenated NUMTs. We were not able to separate the long reads mapped to the middle of concatenated NUMTs where the reads also only mapped to mtDNA genome and true mtDNA genome.

In this analysis, we focused on the concatenated NUMTs and the large NUMTs where long reads were confidently aligned to NUMTs. We only included the calls with at least 3 reads mapped to NUMTs and at least 10 reads mapped to true mtDNA sequences. We also used 4 reads, 5 reads, 6 reads, 7 reads, 8 reads 9 reads and 10 reads as the cut-offs to detect NUMTs methylation. We observed the same distribution of methylation frequency across different cut-offs (Fig. 3a), indicating read-thresholds did not affect our results.

Detecting mutations within the NUMT insertions

We performed a de novo assembly of all 335,891 NUMTs detected in this study. The steps of processes were: (1) we clustered the discordant reads detected from each NUMT in the same individual. (2) The consensus sequence of NUMT contig was generated using CAP3⁸⁴. (3) The contigs were then aligned against mitochondrial reference genome⁸⁵ using Blat⁶³ and Clustal Omega⁸⁶. (4) The aligned sequences from Clustal Omega were used to detect the nucleotide changes between NUMT sequences and mitochondrial reference genome sequences using BioPython⁸⁷. To ensure the confident calls, we applied the additional filtering as follows: (1) we only included NUMTs shorter than 1,000 bp; (2) we excluded the variants within 5 bp of NUMT breakpoints; (3) we removed the variants where the aligned reference allele were different from mtDNA reference genome at the same position; (4) we only included single nuclear variations; (5) we excluded the individuals carrying many more variants than the overall population (> mean number of variants + 3 × s.d.).

To define NUMT-specific variants, we applied the additional filtering: (1) we excluded variants present more than 50% individuals carrying the same common or rare NUMTs and 75% individuals carrying the same ultra-rare NUMTs. This stringent filtering strategy was designed to provide maximum confidence that any NUMT-specific variants were highly likely to have occurred after NUMT sequences have inserted into nuclear genome, compromising the sensitivity of the analysis. (2) We excluded variants only detected in 1 individual to minimize the likelihood of sequencing errors; (3) to obtain the most confident NUMT-specific mutations, we only included the variants detected in at least two individuals from the same family. In the main text, we reported 3 groups of NUMT-specific variants. Total group A, after applying step (1); subgroup B, after step (2); and subgroup C, after step (3).

Estimating the ages of NUMTs

The age of NUMTs was estimated using the method described previously¹⁹. We aligned the mitochondrial sequences from human, chimpanzee and the consensus sequence from each NUMT contig using Clustal Omega. The ancestral mitochondrial sequences from chimpanzee was downloaded from ENSEMBL(Pan_tro_3.0). The aligned sequences were used to generate the nucleotide changes using BioPython. We calculated the ratio of the number of sites that matched human allele to the total number of sites where the human and ancestral mitochondrial sequences differ within each NUMT region. The ratio was used to derive an approximate age for each NUMT, relative to an estimated human-chimpanzee divergence time of 6 million years. To ensure the confident results, we applied the filtering as follows: (1) we only included NUMTs with length between 50 and 1,000 bp; (2) we excluded NUMTs without different allele between human and chimpanzee; (3) the age was estimated from more than 50% of individuals carrying the same NUMT and at least in 2 individuals. After applying this filtering, we excluded all the private NUMTs which were only seen in one individual. (4) We excluded concatenated NUMTs.

Statistical analysis and plotting

All statistical analyses in this study were suggested in the text and performed using R⁶⁸ (http://CRAN.R-project.org/) and Python (http://www.python.org). Figures were generated using R and Matplotlib (https://matplotlib.org) in Python. Circos plots were made using Circos (http://circos.ca/)⁸⁸. Chromosome maps were made using chromoMap⁸⁹.

A web interface to deposit NUMTs detected in this study was developed using Shiny v1.7.1 (https://CRAN.R-project.org/package=shiny)(https://cran.r-project.org/web/packages/shiny/index.html)⁹².

Web resources

NUMTs detected in this study are publicly available through a web interface at https://wwei.shinyapps.io/numts/.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Source link