Mega-scale experimental analysis of protein folding stability in biology and design - Nature - Alert Breaking News

Selection of natural proteins for mutational scanning

We first collected all monomeric proteins in the PDB in the 30–100 amino acid length range in June 2021. We next excluded structures that had only a single helix, contained other molecules (for example, proteins, nucleic acids or metals), were annotated to have DNAse, RNAse or protease inhibition activity, or included more than four cystines. We then removed redundant sequences (amino acid sequence distance <2). We then predicted the structures of these PDB sequences using AlphaFold (even though the PDB structures were known), and used the AlphaFold models to trim amino acids from the N- and C termini that had a low number of contacts with any other residues. Finally, we selected domains with up to 72 amino acids after excluding N- or C-terminal flexible loops.

EEHH design method

EEHH protein design was performed in three steps: (1) backbone construction, (2) sequence design, (3) selection of designs for deep mutational analysis. Backbone construction (the de novo creation of a compact, three-dimensional backbone with a pre-specified secondary structure) was performed using a blueprint-based approach described previously^51,52. All blueprints are included as Blueprints_for_EEHH.zip in Source data.

Hallucination design method

We used a TrRosetta hallucination protocol described previously in the previous reports^31,32 and available at https://github.com/gjoni/trDesign/tree/master/02-GD to unconditionally generate protein backbones and sequences with lengths ranging from 46 to 69 amino acids by maximizing the Kullback–Leibler divergence between the predicted and background distance/angle distributions. Predicted distograms and anglegrams were used to obtain 3D structures of these models as described in the TrRosetta paper⁵³. We selected the best designs according to the predicted distogram and 3D structure match.

DNA oligonucleotide library construction

All sequences were reverse-translated and codon-optimized using DNAworks2.0⁵⁴. Sequences were optimized using E. coli codon frequencies because we used an in vitro translation kit derived from E. coli. Oligonucleotide libraries encoding amino acid sequences of Library 1 were purchased from Agilent Technologies.

Library 1

We selected ~250 designed proteins and ~50 natural proteins that are shorter than 45 amino acids. Then, we created amino acid sequences for deep mutational scanning followed by padding by Gly, Ala and Ser amino acids so that all sequences have 44 amino acids. The total number of sequences is ~244,000 sequences Purchased from Agilent Technologies, length 230 nt.

Library 2

We selected ~350 natural proteins that have PDB structures that are in a monomer state and have 72 or less amino acids after removing N and C-terminal linkers. Then, we created amino acid sequences for deep mutational scanning followed by padding by Gly, Ala and Ser amino acids so that all sequences have 72 amino acids. The total number of sequences is ~650,000 sequences. This library also includes scramble sequences to construct unfolded state model. Purchased from Twist Bioscience, length 250 nt.

Library 3

We selected ~150 designed proteins and created amino acid sequences for deep mutational scanning of the proteins. We also included comprehensive deletion and Gly or Ala insertion of all wild-type proteins included in Library 1 and Libary 2. Additionally, amino acid sequences for comprehensive double mutant analysis on polar amino acid pairs were also included. The total number of sequences is ~840,000 sequences. Purchased from Twist Bioscience, length 250 nt.

Library 4

Amino acid sequences for exhaustive double mutant analysis on amino acid pairs located in close proximity were included. We also include overlapped sequences to calibrate effective protease concentration and to check consistency between libraries. The total number of sequences is ~900,000 sequences. Purchased from Twist Bioscience, length 300 nt.

DNA and mRNA preparation for cDNA display proteolysis method

Oligonucleotide libraries were amplified by PCR using KOD PCR Master Mix (Toyobo) to add T7 promoter, PA tag to an N-terminal, and His tag to an C-terminal of the proteins. The number of cycles was chosen based on a test qPCR run to avoid overamplification using SsoAdvanced Universal SYBR Green Supermix (Bio-Rad). The PCR product was gel extracted to isolate the expected length product. Then we used T7-Scribe Standard RNA IVT Kit (Cellscript) to synthesize mRNA using the DNA fragment as a template.

Preparation of protein–cDNA complex

We followed the protocol essentially as described^24,55, with some modifications, described below.

Photo-crosslinking between mRNA and the puromycin linker

We prepared the photocrosslinking reaction solution (usually at 40 μl scale) using 100 mM NaCl, 20 mM Tris-HCl (pH 7.5), 1 μM cnvK linker (EME), 1 μM mRNA. The solution was incubated at 95 °C for 5 min, then slowly cooled down to 45 °C (0.1 °C s⁻¹) using a thermal cycler. Then the solution including the duplex was irradiated with UV light at 365 nm using a 6 W Handheld lamp (Thermofisher) for 15 min. At 40 μl scale (40 pmol cnvK linker and 40 pmol mRNA total), this produces crosslinked mRNA sufficient for 48 proteolysis reactions.

In vitro translation and reverse transcription

We used the PUREfrex 2.0 (GeneFrontier) translation system according to the manufacturer protocol. We typically used a 160 μl total reaction including 40 μl of the mRNA-cnvK linker duplex product from Step 1 and RiboLock RNase Inhibitor (Thermofisher). We incubated the reaction at 37 °C for 2 h. After the incubation, 500 mM EDTA (16 μl for a 160 μl reaction) was added to the sample to dissociate ribosomes. Then, an equal amount (160 μl for a 160 μl reaction) of 2× binding/washing buffer (20 mM Tris pH 7.5, 2 mM EDTA, 2M NaCl, 0.2% Tween) was added. The solution was added to Dynabeads MyOne Streptavidin C1 (Thermofisher, 200 μl for 40 pmol mRNA) to pull down the protein-mRNA complex and incubated at room temperature for 20 min. Before use, streptavidin beads were pre-washed with (1) 100 mM NaOH, 50 mM NaCl, then (2) 100 mM NaCl to remove any RNase activity. After streptavidin pull-down, the beads were washed by 1× binding/washing buffer once and rinsed twice by TBS (10 mM Tris-HCl pH7.5, 100 mM NaCl), and we added reverse transcription solution (PrimeScript RT Reagent Kit; Takara) onto the beads with protein mRNA complex, and incubated the beads at 37 °C for 30 min.

Purification of protein–cDNA complex

After the reverse transcription, the protein–cDNA complex was eluted with His-binding buffer (30 mM Tris pH7.4, 0.5 NaCl, 0.05% Tween) with RNase T1 (Thermofisher) usually in 400 μl scale. The eluent was added to His Mag Sepharose Ni (Cytiva) (800 μl for 40 pmol starting mRNA) and incubated at room temperature for 30 min. Then the complex was eluted by His-binding buffer with 400 mM imidazole (usually 400 μl) and the eluent was buffer-exchanged to PBS by Zeba Spin Desalting Column (Thermofisher). Then the complex was snap-frozen with liquid nitrogen and stored at −80 °C until the following protease assay. When starting from 40 pmol cnvK linker and 40 pmol mRNA for step 2, we would typically finish this step with 400 μl of protein–cDNA complex divided into 4 frozen tubes (100 μl each) for four sets of 12 protease experiments (48 conditions total).

Protease assay on protein–cDNA complex

Proteolysis reactions were performed in two ‘replicates’ of 12 conditions each (11 protease concentrations in a threefold dilution series and one condition with no protease). Replicate 1 used a maximum protease concentration of 25 μM and replicate 2 used 43.3 μM (25 x 3^0.5 μM). For one replicate (12 reactions), we started from ~25 μl complex, diluted this in PBS up to 240 μl, then added 20 μl to each of the 12 Protein LoBind tubes used for that replicate. Each reaction contained protein–cDNA complex equivalent to 0.83 pmol starting cnvK linker and 0.83 pmol starting mRNA. To start each reaction, we added 40 μl of protease solution to each tube. After 5 min protease digestion at room temperature, we added 200 μl chilled 2% BSA in PBS to quench the reaction, then the solution was added to 40 μl Dynabeads Protein G (Thermofisher) preincubated with anti-PA tag (Wako; Clone number: NZ-1; 1 μg antibody per 30 μl beads), and incubated at 4 °C for 1 h. Then the beads were washed by washing buffer (PBS including 800 mM NaCl and 1% Triton) three times and rinsed by PBS three times, then the complex was eluted with 50 μl PBS including 250 μg ml⁻¹ PA peptide (Wako) and 200 μg ml⁻¹ BSA (Thermofisher). Trypsin experiments used Trypsin-EDTA (0.25%) with phenol red (Thermo Fisher Scientific) for consistency with ref. ¹⁸ and chymotrypsin experiments used α-chymotrypsin from bovine pancreas (Sigma).

qPCR analysis of cDNA display proteolysis results on individual proteins

The cDNA amount for each specific sequence in the eluents was quantified by qPCR using SsoAdvanced Universal SYBR Green Supermix and specific primers for each sequence. The qPCR was performed using CFX96 Touch Real-Time PCR Detection System (Bio-Rad), and the qPCR cycles were determined by the CFX Maestro Software (Bio-Rad). Extended Data Fig. 1.

Next-generation sequencing sample preparation

For DNA library analysis, one-half volume (25 μl) of the eluted cDNA of the complex was amplified by PCR using SsoAdvanced Universal SYBR Green Supermix (BioRad) to add P5 and P7 NGS adapter sequences. The number of cycles was chosen based on a test qPCR run using the same PCR reagents to avoid overamplification. The DNA fragment length and concentration were confirmed by 4200 TapeStation System (Agilent), then the samples were analysed by NovaSeq 6000 System (Illumina).

Processing of next-generation sequencing data

Each library in a sequencing run was identified via a unique 6- or 8-bp barcode. Following sequencing, reads were paired using the PEAR program⁵⁶ then the adapter sequences were moved by Cutadapt⁵⁷. Reads were considered counts for a sequence if the read perfectly matched the ordered sequences at the nucleotide level.

Overall strategy for inferring K
₅₀ and ΔG from sequencing data

We used Bayesian inference to infer K₅₀ and ΔG values for all sequences in our library. This analysis uses two main models. The first model is called the ‘K₅₀ model’ and infers each sequence’s K₅₀ values based on the sequencing count data. The second model is called the ‘unfolded state model’ and predicts each sequence’s unfolded state K₅₀ value (K_50,U) based on its sequence. Both models are implemented in Python 3.9 using the Numpyro package⁵⁸ version 0.80. In Supplementary Notes, we describe the structure of each model and the procedure for fitting each model. Our scripts to reproduce the complete fitting process are provided in the Source Data.

Replicate analysis of K
₅₀

Instead of sampling K₅₀ values using 24 samples per protease at one time as described in step 5 above, we sampled K₅₀ values using one experiment set (that is, 12 samples) and obtained K₅₀ for trypsin replicates 1 and 2, and chymotrypsin replicates 1 and 2. Note that we still used the calibrated protease concentrations to improve consistency between replicates. The replicates were conducted on different days using the same preparation of the protein–cDNA complex.

Data for purified protein experiments

The data on purified proteins shown in Fig. 1g was taken from refs. ^{29,59,60,61,62,63,64,65,66,67,68,69,70,71}.

Data quality filtering and classification of datasets 1–3

Our data (Fig. 2) were filtered for quality in three stages. First, our Bayesian procedure produces confidence intervals for K₅₀ and ΔG estimates, producing a quality estimate for each individual measurement. Second, we evaluated the quality of each full mutational scan, classified these into categories, and removed the low quality categories from our main analysis (below). Third, we filtered our mutational scanning data to remove mutants that showed evidence of causing cleavage from the folded state or intermolecular disulfide cross-linking.

Analysis of Bayesian confidence intervals

Nearly all low-confidence ΔG estimates result from stabilities that are outside the main dynamic range of the assay (−1 to 5 kcal mol⁻¹). This is due to the very steep slope of ΔG with respect to K₅₀ in this range (see Fig. 1e). For all figures, we clip all ΔG estimates to the range of −1 to 5 kcal mol⁻¹ before further analysis. In the table of all data, the ‘dG_ML’ column categorizes sequences as ‘<−1’ and ‘>5’ if the 95% confidence interval is fully outside the range. Of the sequences with ΔG estimates between −1 and 5 kcal mol⁻¹, the median sequence had a 95% confidence interval width of 0.14 kcal mol⁻¹, and 99.9% of sequences had confidence intervals smaller than 0.96 kcal mol⁻¹. Although a very small fraction of ΔG estimates were low confidence (that is, had a wide confidence interval), we still included these sequences in all analyses. Note that these confidence intervals only reflect the model’s uncertainty stemming from the finite deep sequencing counts; other uncertainties (such as uncertainty in K_50,U, K_50,F, protease concentrations, the validity of the kinetic model, and so on) are not reflected in these confidence intervals.

Classification of mutational scans

All mutational scanning data were classified into 12 groups (0 to 11) according to the protocol in Extended Data Fig. 8. Groups 0 and 1 contain the mutational scans that passed all quality filters. Domains in group 0 have wild-type ΔG values below 4.75 kcal mol⁻¹ so that stabilizing mutations can still fall within the cDNA proteolysis assay’s dynamic range. Group 1 contains the remaining high-quality domains. Groups 2–11 contain mutational scans that failed one or more quality filters. All mutational scans are included in only one group, so a mutational scan classified as ‘group 5’ (for poor correlation between independent trypsin and chymotrypsin results) might also fail other filters (such as having a poor slope or intercept between trypsin and chymotrypsin results).

Below, we define each group, along with a short explanation of possible causes.

Group 0: Passing all quality filters.

Group 1: Passing all quality filters, but wild-type ΔG > 4.75 kcal mol⁻¹, so stabilizing mutants may not be resolved compared to the wild type.

Group 2: Poor expression in the assay, based on low counts in next-generation sequencing.

Group 3: The wild-type protein is too unstable to see sequence–stability relationships. This may be due to a truly unstable wild-type sequence or due to rapid cleavage of some segment in the folded state.

Group 4: The wild-type stability (ΔG) is inconsistent. We often observed this for high stability proteins in our first library where the wild-type stability exceeded the dynamic range of the assay.

Group 5: Poor correlation between trypsin experiment and chymotrypsin experiments. This can suggest that one or both proteases are not probing global unfolding, leading to different mutational patterns between the proteases.

Group 6: Poor slope between trypsin experiment and chymotrypsin experiments. This can suggest that some cleavage is occurring from folded state(s) for one or both proteases. If cleavage can occur from the folded state for one protease, the modelled K_50,F will be different from the true K_50,F, creating a slope between the inferred ΔG values and the true ΔG values (see Fig. 1e).

Group 7: Too many stabilizing mutants. In a typical well-folded domain, most mutations are neutral, so a very large fraction of stabilizing mutations suggests the wild-type ΔG may not have been measured accurately. Furthermore, when the large majority of hydrophobic substitutions at surface sites are stabilizing, this suggests the domain may be stabilized by non-specific intermolecular interactions. For these reasons, we removed domains showing these patterns.

Groups 8 and 9: Includes multiple cysteines with proper folding (G8) or misfolding (G9). Disulfide linkages have the potential to disrupt our assay by preventing the C-terminal cDNA from dissociating from the protein N terminus even after the protein is proteolysed. In general, we found that proteins with >1 Cys performed poorly in our assay and many of these proteins are found in groups 2–7. Owing to these results, we decided to remove the remaining proteins with >1 Cys (group 9). However, two proteins appeared to produce good results. Although we chose not to include these proteins in our main analysis, they have been separated into group 8 (high-quality data from proteins with >1 Cys).

Group 10: Poor intercept between trypsin experiment and chymotrypsin experiments. A poor intercept indicates that our trypsin and chymotrypsin experiments cannot agree on where ΔG = 0 is for the overall mutational scan. This depends on the unfolded model for each protease (the inference of K_50,U for each protease). Because the two proteases did not agree on the ΔG values for these sequences, the ΔG values are likely less reliable than those in group 0 and group 1. However, ΔΔG values for this group are still consistent across both proteases.

Group 11: Probably cleavable in folded states. In many cases, excessive cleavage from the folded state or partially folded states will lead to low wild-type stability (G3), poor correlation between the proteases (G5), or a poor slope (G6). However, we saw some evidence of folded state cleavage even in mutational scans that passed these filtering criteria. Specifically, we observed cases where mutating out a wild-type cut site led to increased protease resistance (higher K₅₀) and apparently higher stability (ΔG) to one specific protease but not the other (for example, R16 in Extended Data Fig. 6a,b). This increase in apparent stability for just one protease suggests that either the site can be cleaved from folded state(s) for that protease, or removing the cut site is decreasing unfolded state susceptibility (K_50,U) in a way that is not properly accounted for by our model. Because these conditions lower the reliability of our ΔG estimates, we removed these mutational scans from analysis. The code to perform this filtering is provided (Data_quality_filtering_script.ipynb).

Removal of individual mutants that may disrupt the assay

In the previous stage, we filtered out entire domains; here, we filtered out data from individual mutants in domains that otherwise passed filtering (that is, were in group 0 or group 1). We focused on two specific types of mutations that could disrupt our assay. First, we filtered out data where introducing new cleavage sites into poorly structured regions of a protein resulted in apparent destabilization. Because these mutants are located in poorly structured sites, the apparent destabilization may result from cleavage from folded or partially folded states. These mutants were identified based on (1) apparent destabilization from introducing the new cleavage site, and (2) a low variance in stability between the other amino acids, which indicates a poorly structured region of the protein where cleavage might occur in the folded state. Second, we filtered out data where introducing Cys mutants into poorly structured regions of a protein resulted in apparent stabilization. Again, because these mutants are located in poorly structured sites, the apparent stabilization may result from the formation of inter- or intramolecular disulfide linkages that prevent the dissociation of the C-terminal cDNA following protease cleavage. The code to perform this filtering is provided (Data_quality_filtering_script.ipynb).

All sequences in dataset 2 and dataset 3 are included in Tsuboyama2023_Dataset2_Dataset3_20230416.csv. All sequences in this file have an inferred ΔG estimate, but only sequences in dataset 3 have a tabulated ΔΔG estimate. Of course, one can calculate ΔΔG for the remaining sequences in dataset 2, but these ΔΔG values will be biased toward destabilizing mutations because stabilizing mutations would typically be indistinguishable from the wild-type stability. Note that datasets 2 and 3 include a very small number of sequences with low-quality data (wide confidence intervals) because these sequences come from mutational scans that are high quality overall. Although these tables include all K₅₀, ΔG and ΔΔG data (for dataset 3), low-quality data (including mutant data filtered in Stage 3) have been filtered out and replaced by a – symbol in the columns labelled ‘_ML’ (for machine learning).

Principal component analysis

We performed principal component analysis to determine the factors influencing stability of different amino acids (Fig. 3). To this end, we used 17,093 sites in the 365 domains that are classified as G0 in the above. All folding stability data were clipped between from −1 and 5 kcal mol⁻¹ because the folding stability outside the dynamic range is not reliable, and then the average of the stability for 20 amino acids for each site was subtracted from the data. Using the data, we performed principal components analysis using the scikit-learn library implemented in Python 3.

Side chain contacts and burial analysis

Burial values and contact counts (Fig. 3b and Extended Data Fig. 9h) were computed based on AlphaFold models¹ of all sequences using the included script Burial_side_chain_contact_Fig3_Fig6.ipynb based on Bio.PDB⁷² and BioPython⁷³. The calculation is based on the Rosetta ‘sidechain_neighbors’ LayerDesign method previously reported¹⁸. In brief, to calculate the burial or contacts of residue X, we added up the number of residues in a cone projecting out 9 Å away from the Cβ atom on residue X in the direction of the residue X Cα-Cβ vector. ‘Burial’ (Fig. 6h) indicates the number of Cα atoms in the cone. Contact counts (Fig. 3d) each count different atoms inside the cone: ‘side chain contact count’ (Fig. 3d) counts all Cβ atoms; ‘aromatic side chain contact count’ counts all CE2 atoms of Phe, Tyr, and Trp; ‘acidic side chain contact count’ counts all Glu OE1 and Asp OD1 atoms; and ‘Basic side chain contact count’ counts all Lys NZ and Arg NE atoms.

Secondary structure determination

Using the DSSP algorithm^74,75, we obtained secondary structure information based on AlphaFold models (Fig. 3b).

Selection method of site pairs for double mutational analysis

Double mutants (Fig. 4) were selected for analysis in two ways. First, we manually selected polar interactions where either amino acid appeared important for stability in single mutational analysis. These pairs were mainly included in library 3. Second, we used the program confind^76,77 to identify interacting residues. All confind pairs with notable interactions such as polar interactions and cation–π interactions were selected, along with a randomly chosen subset of more common interactions such as hydrophobic interactions. These pairs were included in library 4.

Thermodynamic coupling analysis

The thermodynamic coupling model and the procedure for fitting the model (Fig. 4) are described in Supplementary Notes.

Wild-type amino acid prediction model

The wild-type sequence prediction model (Fig. 5) and the procedure for fitting the model are described in Supplementary Notes.

GEMME analysis

To calculate the normalized averaged GEMME score, which represents the sensitivity of a wild-type amino acid to substitutions inferred from evolutionary information (ΔΔE in the previous reports^36,37), we ran GEMME⁴² on each natural amino acid sequence using the default parameters. We computed a single score for each site by averaging the scores of the 19 amino acids (except Cys), and then standardized each domain individually (subtracted the domain’s mean and divided by the domain’s standard deviation) so that the site scores within a domain had a mean of zero and a standard deviation of one. Finally, we flip the sign of the score so that positive values imply high susceptibility to mutations (that is, very negative raw GEMME scores for non-wild-type amino acids). We define this standardized score for each site as the normalized GEMME score. To build the input multiple sequence alignments, we performed five iterations of the profile HMM homology search tool Jackhmmer^78,79 against the UniRef100 database of non-redundant proteins⁸⁰ using the EVcouplings framework⁸¹. We used the default bitscore threshold of 0.5 bit per residue.

Structural modelling by AlphaFold

For most of the structural analysis, we used structural models predicted by AlphaFold¹. We ran AlphaFold using default parameters and chose the model with the highest pLDDT score for each sequence. For designed sequences, we skipped a step for generating multiple sequence alignment.

Statistics and reproducibility

We did not use statistical tests here. We did not perform multiple experiments under exactly the same conditions, but we used two different proteases and two different protease concentration sets to confirm reproducibility. In addition, we also confirmed that the same amino acid sequences show consistent K₅₀ values in different libraries (Extended Data Fig. 5).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Source link