Single-cell analysis reveals context-dependent, cell-level selection of mtDNA - Nature - Alert Breaking News

Figures

The following figures were created with BioRender (agreement number QJ26EVEXW9; https://biorender.com): Figs. 1a, 2a,d,e and 4a. The following figures were created using GraphPad Prism (license number 27313 to MGB; https://graphpad.com): Figs. 1f, 3c,d,e,f, 3h,i, 4c,d,e, 5a,b; Extended Data Figs. 2b, 3a,b,c, 3e,f,g,h, 5a,b, 9a.

Reagents

All oligonucleotides and reagents are listed in Supplementary Tables 1–3.

Mammalian cell culture

293T (ATCC CRL-3216), HeLa (ATCC CCL-2) and K562 (ATCC CCL-243) cells were purchased from the ATCC (American Type Culture Collection). Nthy-ori 3-1 cells were purchased from the European Collection of Cell Cultures (ECACC) (90011609). Cell lines were authenticated by STR profiling by the supplier. 293T, HeLa and K562 cells were routinely cultured in DMEM with high glucose and pyruvate (Gibco), supplemented with 10% (v/v) FBS (Gibco). Nthy-ori 3-1 cells were routinely cultured in RPMI 1640, supplemented with 10% (v/v) FBS (Gibco). Cell cultures were maintained at 37 °C with 5% CO₂ and were tested negative for mycoplasma. For selected experiments described in the main text, cells were cultured in DMEM without glucose (Gibco), supplemented with 10% (v/v) dialysed FBS (Gibco), 25 mM glucose or galactose (Sigma), 1 mM sodium pyruvate (Gibco) and 50 μg ml⁻¹ uridine (Sigma). In the indicated experiments, cells were cultured in the presence of 2.5 nM oligomycin A (Sigma). In the indicated experiments, cells were cultured in the presence of 100 ng ml⁻¹ EtBr (Sigma). For selected experiments, described in the main text, cell cultures were maintained in a hypoxia chamber glove box at 37 °C with 1% O₂.

Proliferation and viability experiments

Cells were seeded in triplicate at 0.5 × 10⁶ (glucose conditions) or 2 × 10⁶ (galactose or oligomycin conditions) in 2 ml media in 6-well plates, and cell counts were performed after 3 days of culture using the Vi-Cell XR Cell Viability Analyzer (Beckman Coulter). Population doubling was calculated as log₂(final density/seeding density). Cell viability was calculated based on trypan blue staining performed with the Vi-Cell XR Cell Viability Analyzer (Beckman Coulter).

Transient transfection of 293T cells

Cells were seeded in triplicate at 0.5 × 10⁶ in 2 ml media in 6-well plates, 24 h before transfection. Lipofection was performed at a cell density of approximately 50%. Cells were transfected with a total of 1,000 ng of plasmid DNA (500 ng of each DdCBE monomer), using 3 μl Lipofectamine 2000 reagent (Thermo Fisher Scientific) per well. Cells were split 24 h after lipofection and used for experiments. See Supplementary Table 2 for a list of plasmids used for transfection.

Transient transfection of Nthy-ori 3-1 cells

Cells were transfected using the SF cell line 4D-nucleofector X kit (Lonza) with the CN-114 program according to the manufacturer’s protocol. One million cells were transfected with a total of 2,000 ng of plasmid DNA (1,000 ng of each DdCBE monomer). Cells were split 24 h after transfection and used for experiments. See Supplementary Table 2 for a list of plasmids used for transfection.

Cell tracing

K562 cells were stained with CellTrace Far Red Cell Proliferation reagent according to the manufacturer’s protocol. Cells were then cultured for 4 days and used for FACS.

Lineage tracing

293T cells were edited using LHON or SILENT DdCBEs according to the transient transfection protocol described above. Twenty-four hours after transfection, cells were transduced with a CloneTracker lentiviral barcode library (Cellecta) containing 10 million unique barcodes, enabling clonal expansion tracking. Cells were transduced at a multiplicity of infection (MOI) of 0.1 to ensure that each cell had one unique, expressed barcode. Infections were performed using 2 million cells in 10-cm culture dishes, and in media supplemented with 6 μg ml⁻¹ polybrene. Twenty-four hours after transduction, cells were selected for 48 h with 2 μg ml⁻¹ puromycin (Invitrogen). At that point, cells were counted and split 1/20 to decrease diversity of barcodes. Cultures were expanded for 3 days, and approximately 15% of cells were removed for SCI-LITE. The remaining cells were plated, and cultures were expanded for an additional 5 days and collected for SCI-LITE. SCI-LITE was performed to capture both the MT-ND4 transcript and the expressed lineage barcodes according to the detailed protocol, with the difference that PCRs to amplify MT-ND4 and lineage barcodes were performed separately, and all PCR products were pooled together after the PCR purification step.

SCI-LITE time course experiments design

For the LHON and SILENT SCI-LITE time course (Fig. 3g–i), the population size and splitting strategy (done in biological triplicate) were as follows: on day 0, we removed 1 million cells from approximately 4 million LHON-edited cells and approximately 7 million SILENT-edited cells for SCI-LITE, and seeded the remaining cells for further culture. Cells were passaged at days 3, 5, 8, 10 and 13 at 1:2 ratio for a 2-day interval and at 1:3 ratio for a 3-day interval. At three timepoints (days 5, 10 and 15), 1 million cells were collected for SCI-LITE from each replicate.

For the ONC SCI-LITE time course (Fig. 5c), the population size and splitting strategy (done in biological triplicate) were as follows: on day 0, we seeded 0.5 million cells for glucose and 2 million cells for galactose and oligomycin conditions. On day 3, 0.5 million of 4 million cells were removed for SCI-LITE from each replicate, and the remaining cultures were seeded further for glucose and galactose conditions. All cells were collected for oligomycin conditions, and 0.5 million cells were used for SCI-LITE. On day 6, 0.5 million cells from each replicate were removed for SCI-LITE, and the remaining cultures were seeded further for glucose and galactose conditions. On day 10, 0.5 million cells from each replicate were used for SCI-LITE.

FACS

Cells transfected with DdCBEs with fluorescent reporters were harvested 24 h after lipofection and used for FACS as described previously³⁰. Sorting was performed based on the intensity of eGFP and mCherry fluorescence. See Supplementary Fig. 2d for the representative gating strategy. Cells stained with the CellTrace reagent were harvested 4 days after staining and sorted based on the CellTrace intensity (AF647 channel). SH800S software version 2.1.6 and FCS Express software version 7.10.0007 were used to analyse FACS data.

Western blot

Cells were harvested, washed with PBS and lysed in 100 μl of ice-cold 1X RIPA buffer (Boston BioProducts) supplemented with protease inhibitor (Roche). Lysates were vortexed with maximum strength for 5 s, followed by incubation on ice for 5 min. Mixing and incubation steps were repeated three times, followed by centrifugation at 12,000g for 10 min at 4 °C. 10 µg of protein lysate was resuspended in Laemmli SDS-sample buffer (Boston BioProducts), and was used for gel electrophoresis, performed using Novex Tris-glycine 4–20% gels (Thermo Fisher Scientific), after boiling the lysates for 5 min at 95 °C. Samples were separated by electrophoresis at 200 V for 1 h in Tris-glycine running buffer (Bio-Rad). Semi-dry western blotting was performed using the Trans-Blot Turbo blotting system and nitrocellulose membranes (Bio-Rad). Obtained membranes were blocked in Odyssey Blocking Buffer (LI-COR) for 30 min at room temperature, and incubated with the primary antibodies anti-FLAG (Sigma; 1:1,000 dilution), anti-HA (BioLegend; 1:6,000 dilution) and anti-actin (Cell Signaling; 1:1,000 dilution) in 5% (w/v) BSA (Sigma) in TBST (50 mM Tris-HCl, 150 mM NaCl and 0.05% (v/v) Tween-20, pH 7.4) overnight at 4 °C. Afterwards, membranes were washed three times for 5 min with TBST, and incubated with IRDye-labelled secondary antibodies (goat anti-rabbit 680RD (Li-Cor) or goat anti-mouse 800CW (Li-Cor)) diluted 1:10,000 in 5% (w/v) Blotting-Grade Blocker (Bio-Rad) in TBST for 1 h at room temperature, followed by washing three times for 5 min with TBST. Signals were recorded using an Odyssey Imaging System (Li-Cor).

Mouse studies

Female NSG (NOD.Cg-Prkdcscid Il2rgtm1Wjl/SzJ) mice of 4–6 weeks of age were purchased from The Jackson Laboratory (RRID: IMSR_JAX:005557) and housed in the animal facility at the Massachusetts General Hospital under ethics oversight from the Massachusetts General Hospital Institutional Animal Care and Use Committee. Housing conditions for the mice were: 12-h dark–light cycle, ambient temperature of 65–75 °F and humidity of 40–60%. Human SV40 immortalized normal thyroid follicular epithelial cells Nthy-ori 3-1 (ref. ³³) (purchased from the ECACC; cat. 90011609) were nucleofected with either active or catalytically dead (mock) ONC DdCBE, and 2 × 10⁶ cells were injected subcutaneously into the flank of mice (along with 50 μl Matrigel). Tumour size was measured using digital callipers every week, with volumes estimated by the formula volume = ½ × length × width². The maximal tumour size allowed as per Institutional Animal Care and Use Committee guidelines is 1,000 mm³, and these limits were not exceeded in any of the experiments. No statistical methods were used to predetermine sample size. The experiments were not randomized, and the investigators were not blinded during experiments or outcome assessment.

DNA isolation and Sanger sequencing

Total DNA was extracted from cells using the DNeasy Blood & Tissue Kit (Qiagen) according to the manufacturer’s protocol. An MT-ND4 gene fragment spanning the SNP site in HeLa and 293T cells was amplified with the Phusion Hot Start II High-Fidelity PCR Master Mix (Thermo Fisher Scientific). The primers used for the PCR are listed in Supplementary Table 3. PCR products were purified with the QIAquick PCR Purification Kit (Qiagen) and subjected to Sanger sequencing at Azenta.

Analysis of relative mtDNA levels by qPCR

qPCRs were performed as previously described⁴. In brief, 4 ng of isolated DNA was used in the qPCR, performed with the use of the iQ SYBR Green Supermix (Bio-Rad) in a 10-μl reaction volume using the CFX Opus 384 machine (Bio-Rad). The relative abundance of the amplified ND1 gene fragment was normalized to the amplified B2M gene fragment.

RNA isolation and RT–qPCR

Total RNA was extracted from cells with the RNeasy Mini Kit (Qiagen). Isolated RNA (500 ng) was digested with ezDNAse enzyme (Invitrogen) and used for reverse transcription performed with SuperScript IV Reverse Transcriptase (Invitrogen) and 2 μM gene-specific reverse primers. The obtained cDNA was used for qPCR. Analysis of mtRNA abundance was performed with iQ SYBR Green Supermix (Bio-Rad) using the primers listed in Supplementary Table 3.

Oxygen consumption analysis by Seahorse XF Analyzer

A Seahorse plate was coated with 0.01% (w/v) poly-l-lysine (Sigma), and 1.5 × 10⁴ cells per well were seeded 16 h before analysis with the Seahorse Xfe96 Analyzer (Agilent). Analysis was performed in the Seahorse XF DMEM Medium pH 7.4 (Agilent), supplemented with 10 mM glucose (Agilent), 2 mM l-glutamine (Gibco) and 1 mM sodium pyruvate (Gibco). The Mito stress protocol was applied using 1.5 μM oligomycin, 1 μM FCCP and 1 μM piericidin + 1 μM antimycin. After the run, media was removed from the wells, and cells were stained with Hoechst 33342 (Thermo Fisher Scientific; final 1:5,000 dilution in PBS) for 15 min at room temperature. Next, the staining solution was removed, and cells were imaged in PBS using the BioTek Cytation 5 Cell Imaging Multimode Reader (Agilent). The oxygen consumption rate values were normalized to the number of cells per well, calculated as imaged nuclei in each well.

High-throughput amplicon sequencing of DNA samples

Sites of interest were amplified from isolated DNA samples and sequenced with the use of Illumina MiSeq system. The first round of PCR (PCR1) was performed with the use of primers amplifying the region of interest and containing partial Illumina adapters (Supplementary Table 3). Of total DNA, 200 ng was used for PCR1, performed with the use of Phusion Hot Start II High-Fidelity PCR Master Mix (Thermo Fisher Scientific) in 10 μl final volume using the following protocol: 98 °C for 30 s and then 16 cycles of 98 °C for 10 s, 56 °C for 30 s and 72 °C for 30 s, followed by a final 72 °C extension for 7 min. PCR1 was purified using AMPure XP beads (Beckman Coulter) according to the manufacturer’s protocol. PCR2, adding Illumina indexes, was performed using 2 µl of purified PCR1 product and amplified with Phusion Hot Start II High-Fidelity PCR Master Mix (Thermo Fisher Scientific) in a 10 μl final volume using the following protocol: 98 °C for 30 s and then 20 cycles of 98 °C for 10 s, 58 °C for 30 s and 72 °C for 30 s, followed by a final 72 °C extension for 7 min. PCR2 products were evaluated by electrophoresis in a 1.5% agarose gel, then up to 96 PCR2 products containing various Illumina barcode combinations were mixed and purified using the QIAquick PCR Purification Kit (Qiagen). The library concentration was measured using the Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific) and further verified by qPCR using the NEBNext Library Quant Kit for Illumina (New England Biolabs). Libraries amplified with primers without a heterogeneity spacer were sequenced using an Illumina MiSeq system with 50-bp single-end reads at a final concentration of 12 pM with 5% PhiX spike-in (Illumina), or with 150-bp paired-end reads at a final concentration of 6 pM with 20% PhiX spike-in (Illumina). If primers with a heterogeneity spacer were used, libraries were sequenced at a concentration of 12 pM, without addition of the PhiX spike-in.

High-throughput amplicon sequencing of RNA samples

Total RNA was extracted from cells using the RNeasy Mini Kit (Qiagen) and digested with ezDNase enzyme (Invitrogen). Of isolated RNA, 150 ng was used for reverse transcription, performed with SuperScript IV Reverse Transcriptase (Thermo Fisher Scientific) and target-specific reverse transcription primers listed in Supplementary Table 3. The obtained cDNA was used for preparing amplicon libraries as described above.

Barnyard experiment to evaluate doublet rate

For evaluating the doublet rate, a SCI-LITE experiment was performed according to the detailed protocol (see Supplementary Protocol containing SCI-LITE detailed workflow.) with a minor modification: at the beginning of the experiment, HeLa and 293T cells were harvested and counted using the Vi-Cell XR Cell Viability Analyzer (Beckman Coulter). Equal numbers of HeLa and 293T cells were mixed and used for SCI-LITE. See Supplementary Discussion for more details on the analysis and interpretation of the Barnyard data.

Initial processing and quality control of sequencing data

Illumina MiSeq runs were demultiplexed and converted to FASTQ format using bcl2fastq v2.20.0.422. The read quality in each FASTQ file was verified using the FASTQC tool v0.11.9 and aggregated into a single report per sequencing run using MultiQC v1.11.

Analysis of high-throughput amplicon sequencing data of DNA and RNA samples

To analyse the bulk amplicon sequencing data, we used a custom script to iterate over each read in each FASTQ file and test for the presence of the expected alleles at the expected positions, keeping a running count of each allele in each sample. Counts from reads with sequences that did not match the expected edits or wild-type alleles were discarded. LHON heteroplasmy for each sample was calculated as the sum of all reads with LHON-containing alleles detected, divided by the sum of all non-discarded reads in that sample. The silent heteroplasmy was calculated as the sum of all reads that showed only the silent edit, and no LHON edit, divided by the sum of all non-discarded reads. The analyses were performed using Python v3.7.12, with the following modules: matplotlib v3.4.2, numpy v1.21.0, pandas v1.1.5, plotly v5.16.1, pysam v0.16.0.1, scikit-learn v0.23.1, scipy v1.7.0 and seaborn v0.11.1.

Analysis of single-cell SCI-LITE data

We processed the SCI-LITE data using a custom Python pipeline that iterates over each read in each FASTQ file. The pipeline was run with Python v3.9.15, pysam v0.20.0, numpy v1.26.2, pandas v1.5.2, and matplotlib v3.8.2. The script parsed each read to extract the amplicon sequence, the reverse transcription barcode, the ligation barcode and the UMI sequence, before storing these in a table where information pertaining to each read composes one entry. Next, the script corrected all amplicon sequences, reverse transcription barcodes and ligation barcodes within a Hamming distance of 1 from an expected amplicon or barcode sequence, and filtered out any reads lacking matches to an expected amplicon sequence, reverse transcription barcode sequence or ligation barcode sequence. Cell IDs were constructed for the remaining reads by combining the Illumina barcode, the reverse transcription barcode and the ligation barcode, and then these reads were deduplicated for each cell ID based on the UMI sequences. On occasion, there were reads with the same UMI sequence but different MT-ND4 alleles. To handle these cases, we filtered out any UMIs with fewer than three reads and then assigned each remaining UMI to an allele if at least 2/3 of the reads support that allele. Any UMIs with multiple alleles but without a 2/3 majority of their reads supporting one of those alleles were discarded. Once UMI alleles were resolved, valid cells were called by setting a UMI coverage threshold at the ‘knee’ of a plot of log₁₀ UMI counts per cell versus the log₁₀ ranking of cells by UMI count (Fig. 1c) besides in Fig. 1f where all cells with more than one UMI were analysed. Finally, heteroplasmy was calculated in each cell for each amplicon sequence type as the number of UMIs for each possible allele divided by the total number of UMIs belonging to that amplicon sequence type. As with the bulk sequencing data, for the LHON-edited and SILENT-edited experiments, UMIs showing both the missense and the silent edits on the same molecule were included in the calculation of the missense heteroplasmy and excluded from the silent heteroplasmy.

Simple model of heteroplasmy dynamics

We created a simple simulation of a population of proliferating LHON-edited cells where heteroplasmy impacts cell fitness. In this simulation, we assumed that (1) the population doubles in size each day, (2) the heteroplasmy within a cell does not change, and (3) the relative probability that a cell will divide is a function of its heteroplasmy according to an inverse sigmoid function, which simulates a ‘biochemical threshold’ of heteroplasmy, beyond which cell fitness drops off rapidly (Extended Data Fig. 6d).

We initialized the day 0 distribution of heteroplasmy by sampling 2,500 cells, with replacement, from the empirically observed day 0 LHON-edited SCI-LITE data. To avoid having multiple cells with identical allele counts and heteroplasmy values if the same observed cell was sampled twice, we resampled the MT-ND4 mtRNA molecules for each cell by making random draws from a multinomial distribution that was parameterized by the heteroplasmy fraction of the cell for LHON, SILENT and wild-type alleles. The number of multinomial draws for each cell was one of the fitted model parameters (described below) and was scaled for each cell by the number of observed UMIs in that cell divided by the maximum number of UMIs present in any observed day 0 cell. Thus, the number of simulated molecules is related to the amount of data supporting the heteroplasmy estimates in the observed cell. Once the multinomial draws were complete, then the LHON, SILENT and wild-type heteroplasmy values were recalculated based only on the simulated molecules.

Next, we simulated growth over a 15-day time course. At each simulated day, the culture doubles in size by drawing cells with replacement, and the probability of being selected is inversely related to its missense heteroplasmy according to an inverse sigmoid function (described below). Once the dividing cells were selected, the heteroplasmy was adjusted using the same multinomial sampling strategy that we used to generate the day 0 cells. To match the experimental protocol that generated the SCI-LITE time course data, cell culture passaging was simulated on days 3, 5, 8, 10 and 13 by sampling, without replacement, 5,000 cells before a 2-day interval between passages, or 2,500 cells before a 3-day interval between passages. Finally, a number of cells equal to the number of observed day 0 cells were sampled, with replacement, from the simulated culture at days 5, 10 and 15 to simulate the collection of SCI-LITE data. The fit of the model was assessed by comparing (1) the observed and simulated mean LHON and SILENT heteroplasmy estimates, and (2) the observed and simulated joint heteroplasmy distributions at each timepoint in the time series.

The model fitting procedure optimized four model parameters in parallel: three to define the inverse sigmoid function, and one to set the number of molecules to simulate per cell. We describe the inverse sigmoid function, which is defined by the following equation,

$$y(x)=left(left(1-dright)times left[frac{{10}^{-pleft(x-lright)}}{1+{10}^{-pleft(x-lright)}}right]right)+d$$

where d specifies the depth of the sigmoid below 1.0 (for example, function values range from d to 1.0; this largely defines the relative growth rate of a homoplasmic mutant cell compared with a homoplasmic wild-type cell), l specifies the location of the sigmoid inflection point on the x axis, and (p) specifies the pitch of the transition from 1.0 to d (the combination of location and pitch defines the biochemical threshold, as well as the relative growth rate of a heteroplasmic mutant cell compared with a homoplasmic wild-type cell) (Extended Data Fig. 7a). We also tested different numbers of molecules to simulate per cell, a parameter that defines the precision of the heteroplasmy estimates in our cells. To estimate these four parameters, we took a random search approach, in which we created 20,000 random parameter sets, ran the models and chose the combination of parameters that led to modelling results most similar to our observed heteroplasmy distributions (Extended Data Fig. 7b). Our random search optimization minimized the combined mean squared error (MSE) between the simulated and observed missense and silent mean heteroplasmy values, and the simulated and observed joint heteroplasmy distributions, represented by 2D kernel density estimates, for days 5, 10 and 15. The two MSE calculations were combined by adding the 2D kernel density estimates MSE plus twice the mean heteroplasmy MSE. To smooth out noise in the random search, we set the combined MSE for each of the 20,000 iterations to be the average combined MSE of the 50 nearest neighbours in the search space. The model with the minimum smoothed combined MSE had an inverse sigmoid function with a depth (d) of 0.31, a pitch ((p)) of 4.95 and a location (l) of 0.75, and simulated 1,000 molecules per cell (Extended Data Fig. 7c,d).

Having identified optimal model parameters, we then ran the simulated growth experiment 100 times, each with a different random seed. We present the joint heteroplasmy distributions for one representative model (Extended Data Fig. 6b), and also plot the distributions of the 100 mean heteroplasmy values and compare them to the observed bulk heteroplasmy values (Extended Data Fig. 6c).

Lineage tracing data analysis

The data from the lineage tracing SCI-LITE experiment were processed in a way similar to the other SCI-LITE experiments, with some modifications to account for the lineage tracing barcodes. During initial read processing, sequences were checked against the list of expected lineage barcodes with an error tolerance of 1 mismatch. Lineage barcode UMIs underwent the UMI allele assignment in the same way as described above, with a requirement for at least 3 reads supporting the UMI, and then assignment of the allele shown by at least 2/3 of the reads. When calling valid cells using the knee plot, we did not include lineage barcode UMIs in the UMI coverage per cell, because it is most critical to maximize coverage of the mtDNA alleles to acquire good heteroplasmy estimates, and including lineage barcode UMIs in the knee plot yielded some cells with high coverage of the lineage barcode but low coverage of the mtDNA alleles. After cells were called, and heteroplasmy per cell was calculated as described above, we additionally filtered for cells that had a lineage barcode. We found that it was most common for a cell to have UMIs showing a single lineage barcode; however, some cells showed multiple lineage barcodes, and we corrected these in a similar way to UMIs that showed multiple alleles. In the cells showing multiple lineage barcodes, it was often the case that one lineage barcode had an overwhelming fraction of the UMIs supporting it. We found that in such cases, the second most common barcode often had fewer than 10 UMIs supporting it. So, in each cell, we removed any lineage barcodes with fewer than 10 UMIs detected, and then additionally required that the number of UMIs supporting the most common lineage barcode in a cell be at least four times the number of UMIs supporting the second most common lineage barcode in that cell. Cells were assigned to the lineage barcodes passing these filtering steps, and otherwise were left unassigned to a lineage barcode.

Kimura distribution fitting

First, we filtered our LHON and SILENT time course SCI-LITE data to remove cells with zero heteroplasmy, to focus on cells that had undergone editing. Next, we used the ‘heteroplasmy’ R package (v0.0.2.1; https://github.com/StochasticBiology/heteroplasmy-analysis)³⁷ to fit our LHON and SILENT time course data to the two-parameter Kimura distribution, Kimura(h_i|p,b), where h_i is the heteroplasmy at the current timepoint, p is the initial population heteroplasmy and b is the extent of drift. We fixed the initial population heteroplasmy, p, to be the mean heteroplasmy observed empirically at our day 0 timepoint, and then used maximum likelihood estimation to fit the drift parameter (b) based on the day 5 SCI-LITE heteroplasmy measurements. We fitted both our LHON-edited and SILENT-edited data, and compared our observed heteroplasmy distributions to the fitted Kimuras using the Kolmogorov–Smirnov test in the test_kimura_par() function, with num_MC = 100. The Kolmogorov–Smirnov D statistic highlights the poor fit of the Kimura to the LHON data (D = 0.05 versus D = 0.26 for SILENT edited versus LHON edited), and this difference remains if we instead fit the Kimura by minimizing the Kolmogorov–Smirnov distance (D = 0.05 versus D = 0.25 for SILENT versus LHON).

Note that using the day 0 data is critical to seeing the influence of selection; if we fit both p and b to day 5 data, then we find no difference in the Kimura fit between SILENT and LHON. In addition, fitting both parameters by minimizing the Kolmogorov–Smirnov distance yields fits with Kolmogorov–Smirnov P > 0.05 for both the SILENT and the LHON data.

For visualization, we used the dkimura() function from the kimura R package (v0.0.0.9001; https://github.com/lbozhilova/kimura) to compute the probability density function of our Kimura distributions, and numpy.histogram() with density=True to compute the empirical density distribution of our day 5 heteroplasmy estimates (Extended Data Fig. 5). The analysis was done with Python v3.10.12, matplotlib v3.8.0, numpy v1.24.4, pandas v2.1.1, rpy2 v3.5.11, seaborn v0.13.0, R v4.2.3, heteroplasmy v0.0.2.1 and kimura v0.0.0.9001.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Source link