May 20, 2024
Widespread somatic L1 retrotransposition in normal colorectal epithelium – Nature

Widespread somatic L1 retrotransposition in normal colorectal epithelium – Nature

Human tissues

For the in vitro establishment of clonal organoids from colorectal tissues, healthy mucosal tissues were obtained from surgical specimens of 19 patients undergoing elective tumour-removal surgery (Supplementary Table 1). Normal tissues (approximately 1 × 1 × 1 cm3 in size) were cut from a region more than 5 cm away from the primary tumour. Matched blood and colorectal tumour tissues from the same patients were also collected for bulk-tissue WGS.

Fresh biopsies from one patient with MUTYH-associated familial adenomatous polyposis were obtained by colonoscopy. Tissues (approximately 0.5 × 0.5 × 0.5 cm3 in size) were cut from four polyps. Matched blood and buccal mucosa tissue from the same patient were also collected.

All tissues were transported to the laboratory for organoid culture experiments within 8 h of the collection procedure. All procedures in this study were approved by the Institutional Review Board of Seoul National University Hospital (approval no. 1911-106-1080) and KAIST (approval no. KH2022-058), and informed consent was obtained from all study participants. This study was conducted in accordance with the Declaration of Helsinki and its later amendments. No statistical methods were used to predetermine sample size. The experiments were conducted without randomization and the investigators were not blinded during the experimental procedures and data analysis.

Publicly available datasets

We included publicly available whole-genome sequences of single-cell expanded clones to reach a more complete picture of L1 retrotransposition in various human tissues. We included 474 whole-genome sequences from two previous datasets, one for haematopoietic cells (140 clones from one individual)9 and one for mesenchymal fibroblasts from our previous work (334 clones from seven individuals)4. In addition, we included 259 whole-genome sequences produced from LCM-based patches dissected from 13 organs investigated in a previous study5,11. Furthermore, we explored 578 whole-genome sequences generated from LCM-based patches of colorectal tissues51 to investigate differences in sensitivity for soL1R detection between LCM and clonal expansion methods.

To understand the PAF of rc-L1s we collected 2,852 publicly available whole-genome sequences of normal tissues with known ethnicity information. These data were collected from various studies52,53,54,55,56,57.

To understand the impact of the level of genome instability on the frequency of soL1Rs in tumours, we further explored variant calls from the ICGC/TCGA Pan-Cancer Analysis of Whole-Genome (PCAWG) Consortium, which included 2,677 cancer and matched normal whole-genome sequences across around 40 tumour types17,53. SoL1Rs from PCAWG samples can be found in a previous paper17. Other somatic mutation calls (including TP53-inactivating mutations, structural variations and mutational signatures) generated by the consortium are available for download at https://dcc.icgc.org/releases/PCAWG. Our matrix used in the analysis is available in Supplementary Table 5, which includes driver mutations of 19 matched colorectal cancers identified using CancerVision (Genome Insight).

Organoid culture of colorectal crypts

All organoid establishment procedures and media compositions were adopted from the literature, with slight modifications58. Mucosal tissues were cut into sections of approximately 5 mm and washed with PBS. Tissues were transferred to 10 mM EDTA (Invitrogen) in 50 ml conical tubes, followed by shaking incubation for 30 min at room temperature. After incubation, the tubes were gently shaken to separate crypts from connective tissues. The supernatant was collected, and 20 μl of suspension was observed under a stereomicroscope to check for the presence of crypts. Crypt suspension was centrifuged at 300 relative centrifugal force for 3 min, and the pellet was washed once with PBS to reduce ischaemic time. Isolated crypts were embedded in growth-factor-reduced Matrigel (Corning) and plated on a 12-well plate (TPP). Plating of crypts was performed at limited dilution by modification of the protocol from a previous study59. In brief, approximately 2,000 crypts were transferred to 900 μl of Matrigel and 3 × 150 μl of droplets were plated in three wells of a 12-well plate. Next, 450 μl of Matrigel was added to the remaining dilution and plating of three droplets in three wells was repeated. Serial dilution was performed at least four times and the final remaining dilution was plated in six wells. Plates were transferred to an incubator at 37 °C for 5–10 min to solidify the Matrigel. Each well was overlaid with 1 ml of organoid culture media, the compositions of which are described in Supplementary Table 6.

Clonal expansion of single-crypt-derived organoid

Primary culture of bulk and diluted crypts was maintained for at least 10 days to ensure the initial mass of single-crypt-origin organoid. After growth of organoids, a single example was manually picked using a 200 μl pipette under an inverted microscope. The picked organoid was placed in an Eppendorf tube and dissociated using a 1 ml syringe with a 25 G needle under TrypLE Express (Gibco). Next, blocking of TrypLE by ADF+++ (Advanced DMEM/F12 with 10 mM HEPES, 1× GlutaMAX and 1% penicillin-streptomycin) was followed by centrifugation and washing. The pellet was placed in a single well of a 24-well plate. Plates were transferred to a humidified 37 °C/5% CO2 incubator and medium changed every 2–3 days. After successful passage, clonal organoids were transferred to a 12-well plate and further expanded. Confluent clones were collected for nucleic acid extraction and organoid stock.

Reclonalization of single-crypt-derived organoid

Cultured single-crypt-derived organoids were harvested and dissociated using TrypLE Express. After blocking of TrypLE and washing, organoids were resuspended using ADF+++. Organoid suspensions were filtered through a 40 μm strainer (Falcon), then single cells were sorted into a FACS tube by cell sorter (FACSMelody, BD Biosciences). Single cells were selected based on forward- and side-scatter characteristics according to the manufacturer’s protocol. Sorted cells were sparsely seeded with growth-factor-reduced Matrigel (500 per well) in 12-well plates. Grown reclonalized single organoids were manually picked and expanded by the methods described above.

Primary culture of skin fibroblasts

We obtained seven fibroblast clones for methylation analysis. Dermal skin fibroblasts were cultured by a method described previously4. In brief, skin samples were washed with PBS (Gibco) and adipose tissue and blood vessels removed. The remaining tissues were cut into small pieces (1–2 mm2) and treated with 1 mg ml–1 collagenase/dispase solution (Roche) at 37 °C for 1 h. After treatment, the epidermal layer was separated from the dermal layer and the latter washed with DMEM medium containing 20% FBS (Gibco) to inhibit collagenase/dispase activity. Dermal tissue was then minced into small pieces and cultured in collagen I-coated 24-well plates (Corning) with 200 µl of medium in a humidified incubator at 37 °C with 5% CO2 concentration.

Library preparation and WGS

For Illumina sequencing we extracted genomic DNA materials from clonally expanded cells, matched peripheral blood and colorectal tumour tissues using either the DNeasy Blood and Tissue kit (Qiagen) or the Allprep DNA/RNA kit (Qiagen) according to the manufacturer’s protocol. DNA libraries were generated using Truseq DNA PCR-Free Library Prep Kits (Illumina) and sequenced on either the Illumina HiSeq X Ten platform or the NovaSeq 6000 platform. Colorectal clones were whole-genome sequenced with a mean 17-fold depth of coverage. Matched peripheral blood and colorectal tumour tissues were sequenced with a mean coverage of 181- and 35-fold, respectively. For PacBio sequencing we extracted genomic DNA from colon organoids using the Circulomics Nanobind Tissue Big DNA kit (Circulomics) according to the manufacturer’s protocol. DNA libraries were prepared using the MRTbell express template prep kit 2.0 (PacBio) and sequenced on a PacBio Sequel IIe platform.

Whole-transcriptome sequencing of organoids

Total RNA was extracted from clonally expanded cells using the Allprep DNA/RNA kit (Qiagen). The total RNA sequencing library was constructed using the Truseq Stranded Total RNA Gold kit (Illumina) according to the manufacturer’s protocol.

Whole-genome DNA methylation sequencing of organoids

Genomic DNA was extracted from clonally expanded cells using either the DNeasy Blood and Tissue kit (Qiagen) or the Allprep DNA/RNA kit (Qiagen). The libraries were prepared from 200 ng of input DNA with control DNA (CpG methylated pUC19 and CpG unmethylated lambda DNA) using the NEBNext Enzymatic Methylation-seq kit (NEB) according to the manufacturer’s protocol. Paired-end sequencing was performed using the NovaSeq 6000 platform (Illumina).

Variant calling and filtering of WGS data

Sequenced reads were mapped to the human reference genome (GRCh37) using the Burrows–Wheeler aligner (BWA)–MEM algorithm60. Duplicated reads were removed by either Picard (available at http://broadinstitute.github.io/picard) or SAMBLASTER61. We identified SNVs and short indels as previously reported4. Briefly, base substitutions and short indels were called using Haplotypecaller2 (ref. 62) and VarScan2 (ref. 63). To establish high-confidence variant sets we removed variants with the following features: (1) 1% or more VAF in the panel of normal, (2) high proportion of indels or clipping (over 70%), (3) three or more mismatched bases in the variant reads and (4) frequent existence of error reads in other clones.

Calling structural variations

We identified somatic structural variations in a similar way to our previous report4. We called structural variations using DELLY64 with matched blood samples and phylogenetically distant clones to retain both early embryonic and somatic mutations. We then discarded variants with the following features: (1) the presence in the panel of normals, (2) insufficient number of supporting read pairs (fewer than ten read pairs with no supporting SA tag or fewer than three discordant read pairs with one supporting SA tag) and (3) many discordant reads in matched blood samples. To remove any remaining false-positive events and rescue false-negative events located near breakpoints, we visually inspected all the rearrangements passing the filtering process using Integrative Genomics Viewer65.

Calling L1 retrotransposition and other mobile element insertions

We called L1 retrotranspositions using MELT20, TraFiC-mem16, DELLY64 and xTea66 with matched blood samples and phylogenetically distant clones to retain both early embryonic and somatic mutations. Potential germline calls, overlapping with events found in unmatched blood samples, were removed. To confirm the reliability of calls and remove remaining false-positive events we visually inspected all soL1R candidates focusing on two supporting pieces of evidence: (1) poly-A tails and (2) target site duplications using Integrative Genomics Viewer65. Additionally we excluded variants with a low number of supporting reads (fewer than 10% of total reads) to exclude potential artefacts. We obtained the 5′ and 3′ ends of the inserted segment to both calculate the size of soL1Rs and determine whether L1-inversion or L1-mediated transduction was combined. When both ends of the insert were mapped on opposite strands, the variant was considered to be inverted. When the inserted segment was mapped to unique and non-repetitive genomic sequences, where a full-length L1 element is located within a 15 kb upstream region, we determined that the L1 insertion was combined with the 3′ transduction and derived from the L1 element on the upstream region of unique sequences. To calculate the VAF of soL1Rs we divided the number of L1-supporting read pairs by the total number of informative read pairs around insertion sites. A read pair was considered informative if the region covering its start and end spanned the insertion breakpoint. Furthermore, we counted the number of reference-supporting read pairs twice when calculating the total number of informative read pairs, because insertion is supported by reads pairs at both ends of the insert. To identify clonal L1 insertions in cancer samples we established a cutoff based on the minimum cell fraction value of shared soL1Rs in normal colorectal clones, because shared soL1Rs are considered true variants. We used the same approach for other mobile element insertions, including Alu and SVA.

Mutational signature analysis

To extract mutational signatures in our samples we used three different tools (in-house script, SigProfiler67 and hierarchical dirichlet processes68) to achieve a consensus set of mutational signatures for each type of colon sample, including normal epithelial cells, adenoma and carcinoma. In brief, our in-house script is based on non-negative matrix factorization with or without various mathematical constraints, and borrows core methods from the predecessor of SigProfiler69 such as using a measure of stability and reconstruction error for model selection; however, it provides greater flexibility in examining a broader set of possible solutions, including those that can be missed by SigProfiler, and enables a deliberate approach for determining the number of presumed mutational processes. As a result, we selected a subset of signatures that best explain the given mutational spectrum: SBS1, SBS5, SBS18, SBS40, SBS88, SBS89, ID1, ID2, ID5, ID9, ID18 and IDB for normal colorectal epithelial cells; SBS1, SBS5, SBS18, SBS36, SBS40, ID1, ID2, ID5 and ID9 for MUTYH-associated adenoma; and SBS1, SBS2, SBS5, SBS13, SBS15, SBS17a, SBS17b, SBS18, SBS21, SBS36, SBS40, SBS44, SBS88, ID1, ID2, ID5, ID9, ID12, ID14 and ID18 for colorectal cancers. All signatures are attributed to known mutational signatures available from v.3.2 of the COSMIC mutational signature (available at https://cancer.sanger.ac.uk/cosmic/signatures) and IDB, which is a newly found signature from previous research on normal colorectal epithelial cells51 but not yet catalogued in COSMIC mutational signature.

Reconstruction of early phylogenies

We reconstructed the phylogenetic tree of the colonies and the major clone of cancer tissue from an individual by generating an n × m matrix representing the genotype of n mutations of m samples, as previously conducted4. Briefly, SNVs and short indels from all samples of an individual were merged and only variants with five or more mapped reads in all samples were included to avoid incorrect genotyping for low coverage. Additionally, variants with VAF < 0.25 in all samples were removed to exclude potential sequencing artefacts. If the VAF of the ith mutation in the jth sample was more than 0.1, Mij was assigned 1; otherwise, 0. Mutations shared in all samples were regarded as germline variants and discarded. We grouped all mutations according to the types of samples in which they were found and established the hierarchical relationship between mutation groups. In short, if the samples of mutation group A contain all the samples of mutation group B in addition to other samples, mutation group B is subordinate to mutation group A. We then reconstructed the phylogenetic tree that best explains the hierarchy of the mutation groups. The final phylogenetic tree is a rooted tree in which each sample (colony) is attached to one terminal node of the tree, with the number of mutations in the corresponding mutation group being the length of the branch. For cancer samples, the length of branches represents clonal point mutations with cancer cell fractions greater than 0.7. To convert molecular time (number of early mutations) to physical cell generations we used a mutation rate of 2.4–3.8 pcpcd for the first two cell divisions and then 0.7–1.2 pcpcd, which were estimated from a previous work4,5.

Estimation of soL1R rates in various stages

When calculating soL1R rates we classified point mutations on phylogenetic trees into four different stages: pregastrulation, postgastrulation, ageing (postdevelopment) and tumourigenesis. Mutations shared by multiple clones and detected in bulk blood whole-genome sequences (mesodermal origin) were considered pregastrulational. Mutations in early branches4,51,70 but not found in bulk blood whole-genome sequences were considered postgastrulational. All other mutations in normal clones were considered to have accumulated during the ageing process. For mutations in ageing and tumourigenesis we counted those attributable to endogenous mutational processes (SBS1 and SBS5/40 for SNVs, ID1 and ID2 for indels), to exclude extra mutations by external carcinogen exposure. For mutations in tumours we counted clonal point mutations (cancer cell fractions greater than 0.7) to exclude subclonal mutations. Finally we calculated soL1R rates in each stage by dividing the number of soL1Rs by the total number of endogenous point mutations. The calculation of soL1R rate for tumourigenesis included only non-hypermutated tumours.

Population allele frequency of L1 sources

To calculate the PAF of rc-L1 sources we collected 2,852 publicly available and eight in-house (overall 2,860) whole-genome sequences of normal tissues with known ethnicity information (714 Africans, 588 Europeans, 538 South Asians, 646 East Asians and 374 Americans)52,53,54,55,56,57. Initially we determined whether individuals had rc-L1s in their genome. Briefly, we calculated the proportion of L1-supporting reads for non-reference L1 and the proportion of reads with small insert size opposing L1 deletion for reference L1, respectively. Only rc-L1s with a proportion of 15% or more were considered to exist in the genome. We then calculated the PAF of a specific rc-L1 as the proportion of individuals with the L1 in the population.

Long-read, whole-genome sequence analysis

Sequenced reads were mapped to the human reference genome (GRCh37) using pbmm2 (https://github.com/PacificBiosciences/pbmm2), a wrapper for minimap2 (ref. 71). Sequences for L1-supporting reads near source elements were extracted and mapped to the L1HS consensus sequences18 using BWA60. We next identified sequence variations of source elements, including truncating mutations, and assigned each source element to corresponding L1 subfamilies21.

Methylation analysis

Sequenced reads were processed using Cutadapt72 to remove adaptor sequences. Trimmed reads were mapped using Bismark73 to the genome combining human reference genome (GRCh37) modified by the incorporation of L1 consensus sequences at the non-reference L1 source sites, pUC19 and lambda DNA sequences. For a single CpG site, the number of reads supporting methylation (C or G), the number of reads supporting demethylation (A or T) and the proportion of former reads among total reads (methylation fraction) were calculated using Bismark. Conversion efficacy was estimated with reads mapped on CpG methylated pUC19 and CpG unmethylated lambda DNA. To observe overall methylation status we examined the methylation fraction in regions ranging from 600 bp upstream to 600 bp downstream from L1 transcription start site for each L1 source element. We then focused on CpG sites located between the L1 transcription start site and the 250 bp downstream region (+1 to +250) and classified each CpG site into one of three categories according to methylation fraction: homozygous demethylation (methylation fraction below 25%), heterozygous (methylation fraction at least 25% and methylation fraction below 75%) and homozygous methylation (methylation fraction at least 75%). Next, methylation scores were assigned to CpG sites (0 for homozygous demethylation, 5 for heterozygous and 10 for homozygous methylation) and summarized by averaging the score of all CpG sites on the +1 to +250 region of the L1 element. Finally we compared the methylation score across every sample and every known source element to determine the relationship between methylation status and source activation.

For the analysis of L1 promoter methylation level in bulk tissues we downloaded whole-genome bisulfite sequencing data of 16 different tissues from Roadmap Epigenomics74. The Roadmap codes are E050 BLD.MOB.CD34.PC.F (Mobilized_CD34_Primary_Cells_Female), E058 SKIN.PEN.FRSK.KER.03 (Penis_Foreskin_Keratinocyte_Primary_Cells_skin03), E066 LIV.ADLT (Adult_Liver), E071 BRN.HIPP.MID (Brain_Hippocampus_Middle), E079 GI.ESO (Esophagus), E094 GI.STMC.GAST (Gastric), E095 HRT.VENT.L (Left_Ventricle), E096 LNG (Lung), E097 OVRY (Ovary), E098 PANC (Pancreas), E100 MUS.PSOAS (Psoas_Muscle), E104 HRT.ATR.R (Right_Atrium), E105 HRT.VNT.R (Right_Ventricle) E106 GI.CLN.SIG (Sigmoid_Colon), E109 GI.S.INT (Small_Intestine) and E112 THYM (Thymus). The methylation fractions of CpG sites in referenced L1 sources were collected and summarized by averaging the fraction of all CpG sites on the +1 to +250 region of the L1 element, then compared the averaged L1 promoter methylation level across different tissues.

Gene expression analysis

Sequenced reads were processed using Cutadapt72 to remove adaptor sequences. Trimmed reads were mapped to the human reference genome (GRCh37) using the BWA–MEM algorithm60. Duplicated reads were removed by SAMBLASTER61. To identify the expression level of each L1 source element we collected reads mapped on regions up to 1 kb downstream from the 3′ end of the source element, and calculated the FPKM value. Only reads in the same direction with the source element were considered. If the source element was located on the gene and both were on the same strand, the FPKM value was not calculated because the origin of reads on the downstream region is ambiguous.

Association with genome features

The L1 insertion rate was calculated as the total number of soL1Rs per sliding window of 10 Mb, with an increment of 5 Mb. To examine the relationship between L1 insertion rate and other genomic features at single-nucleotide resolution we used a statistical approach described previously17,75. In brief, we divided the genome into four bins (0–3) for each of the genomic features, including replication time, DNA hypersensitivity, histone mark (H3K9me3 and H3K36me3), RNA expression and closeness to the L1 canonical endonuclease motif (here defined as either TTTT|R (where R is A or G) or Y|AAAA (where Y is C or T)). By comparison of breakpoint sequences with the L1 endonuclease motif, we assigned genomics regions with more than four (most dissimilar), three, two and fewer than one (most similar) mismatches to the L1 endonuclease motif into bins 0, 1, 2 and 3, respectively. DNA hypersensitivity and histone mark data from the Roadmap Epigenomics Consortium were summarized by averaging fold-enrichment signal across eight cell types. Genomic regions with fold-enrichment signal lower than 1 belonged to bin 0, and the remainder were divided into three equal-sized bins: bin 1 (least enriched), bin 2 (moderately enriched) and bin 3 (most enriched). RNA sequencing data were also obtained from Roadmap and FPKM and averaged across eight cell types. Regions with no expression (FPKM = 0) belong to bin 0 and the remainder were divided into three equal-sized bins: bin 1 (least expressed), bin 2 (moderately expressed) and bin 3 (most expressed). Replication time was processed by averaging eight ENCODE cell types, and genomic regions were stratified into four equal-sized regions: bin 0 contained regions with the latest replicating time and bin 3 contained regions with the earliest replicating time. For every feature, enrichment scores were calculated by comparison of bins 1–3 against bin 0. Therefore, the log value of the enrichment score for bin 0 should be equal to 0 and is not described on plots.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Source link