Protein Sci. 2023 Dec 1;32(12):e4836. doi: 10.1002/pro.4836

SSDraw: Software for generating comparative protein secondary structure diagrams

Ethan A Chen¹, Lauren L Porter¹,²,✉

PMCID: PMC10680343 PMID: 37953705

Abstract

The program SSDraw generates publication‐quality protein secondary structure diagrams from three‐dimensional protein structures. To depict relationships between secondary structure and other protein features, diagrams can be colored by conservation score, B‐factor, or custom scoring. Diagrams of homologous proteins can be registered according to an input multiple sequence alignment. Linear visualization allows the user to stack registered diagrams, facilitating comparison of secondary structure and other properties among homologous proteins. SSDraw can be used to compare secondary structures of homologous proteins with both conserved and divergent folds. It can also generate one secondary structure diagram from an input protein structure of interest. The source code can be downloaded (https://github.com/ncbi/SSDraw) and run locally for rapid structure generation, while a Google Colab notebook allows easy use.

Keywords: metamorphic proteins, protein evolution, protein fold switching, protein folding, protein structure, structural bioinformatics

1. INTRODUCTION

Recent advancements in cryo‐electron microscopy (cryo‐EM; Yip et al., 2020), metagenomics (Garlapati et al., 2019; Wooley et al., 2010), and deep learning‐based protein structure prediction methods (Baek et al., 2021; Chowdhury et al., 2022; Jumper et al., 2021; Lin et al., 2023) have led to an explosion in the number of available protein structures. For instance, the number of entries in the Protein Data Bank (PDB; Berman et al., 2002; Burley et al., 2017), a repository of experimentally determined protein structures, has nearly doubled in the past 10 years. Earlier this year, the authors of ESMfold, a large language model that rapidly predicts three‐dimensional protein structures from single sequences, released a web‐based collection of >772 million structures predicted from metagenomic sequences (Lin et al., 2023). Similarly, >200 million structures predicted by AlphaFold2 (Jumper et al., 2021), a highly accurate deep‐learning‐based model for protein structure prediction, are available through a web repository for easy user access (Varadi et al., 2021), and many have been deposited into the UniProt sequence database (UniProt, 2021).

The enormous increase in available models of protein structure presents opportunities to identify large‐scale relationships between structure and other properties, such as sequence conservation or prediction confidence. Such relationships are often most effectively depicted when multiple protein structures are compared, motivating the development of structural alignment algorithms that match common elements of protein structure rather than amino acid sequence (Pettersen et al., 2021). Nevertheless, important relationships between protein structures can be obscured by three‐dimensional visualizations that do not effectively convey all structural features through one image. This shortcoming especially impacts homologous proteins with nonconserved structural features arising from insertions, deletions, or mutations that cause substantial changes in secondary structure. Indeed, the need for easily interpretable structure diagrams is underscored by several recent studies highlighting how protein structure can transform dramatically in response to seemingly minor sequence changes (Chakravarty, Sreenivasan, et al., 2023; Dishman et al., 2021; Liu et al., 2023; Ruan et al., 2023; Solomon et al., 2023). To observe these transformations accurately, secondary structures of the proteins of interest must be registered, meaning that amino acids with annotated secondary structures must be aligned with their corresponding amino acids in a multiple sequence alignment (MSA). Once registered, secondary structures of homologous proteins aligned within the MSA can be compared, and their respective secondary structure diagrams become comparative. That is, the secondary structure of Protein A at position X can be compared directly to the secondary structure of its homolog, Protein B, at position X if their secondary structures are both registered to the same MSA (Figure 1).
Comparative secondary structure diagrams also simplify the visualization of fold‐switching proteins, single sequences evolutionarily selected to remodel their secondary and tertiary structures in response to cellular stimuli (Murzin, 2008; Porter & Looger, 2018; Schafer & Porter, 2023). In short, as increasing evidence indicates that highly similar or identical protein sequences can assume folds with drastically different secondary structures (Porter, 2023), the need to graphically depict structural differences among homologous proteins and relate them to other protein properties increases.

FIGURE 1. Comparative secondary structure diagrams result from registering secondary structure annotations with their corresponding aligned amino acid sequences. Secondary structures of unregistered diagrams (above) are not aligned, disallowing reliable inferences about secondary structure evolution. Diagrams align when their secondary structures have been registered with their corresponding aligned amino acid sequences (comparative secondary structure diagrams, below), suggesting possible secondary structure evolution where α‐helices align with β‐sheets. Secondary structure diagrams were made from structures of bacterial response regulators FixJ (PDB ID: 5XSO, chain A) and KdpE (PDB ID: 4KFC, chain A). The C‐terminal domains of FixJ and KdpE are colored blue and red, respectively, indicating different folds (helix‐turn‐helix, blue; winged helix, red), while their structurally conserved N‐terminal domains are gray. Arrows pointing to the gray domains indicate homologous secondary structures; arrows pointing to the colored domains indicate divergent secondary structures. Previous phylogenetic analysis and ancestral reconstruction (Chakravarty, Sreenivasan, et al., 2023) indicate that the C‐terminal β‐sheet of the winged helix evolved from the C‐terminal α‐helix of the helix‐turn‐helix by stepwise mutation.

To effectively depict relationships between the structures of homologous proteins and other properties of interest, we present SSDraw, a Python‐based program that rapidly generates secondary structure diagrams from three‐dimensional protein coordinates. These linear diagrams are registered to a user‐inputted MSA and colored by any property of interest. Running SSDraw once generates a diagram of one protein from an MSA. Multiple diagrams from one MSA can be generated and stacked for easy comparison. These functionalities distinguish SSDraw images from other secondary structure visualizations (Gouet et al., 2003; Hutarova Varekova et al., 2021; Hutchinson & Thornton, 1990; Kocincova et al., 2017; Midlik et al., 2022; O'Donoghue et al., 2015; Stivala et al., 2011). For instance, ESPript (Gouet et al., 2003) relates secondary structures derived from one representative protein structure to multiple homologous sequences, usually divided on multiple lines of text. This format works well when the user seeks to visualize sequence conservation patterns in a protein family with conserved secondary structures. SSDraw may be preferable if the user seeks to compare structures of homologous proteins with divergent secondary structures by stacking each diagram and comparing structural differences. As another example, Aquaria (O'Donoghue et al., 2015) also generates stackable linear secondary structure diagrams but colors by sequence conservation only. SSDraw may be preferable if the user seeks to color the stacked diagrams by a property other than sequence conservation. In short, SSDraw was written to flexibly relate secondary structure differences between homologous proteins with other protein properties of interest.
While this software was originally designed for fold‐switching proteins (Porter & Looger, 2018) and homologous sequences with different secondary structures (Chakravarty, Sreenivasan, et al., 2023), it also serves as a tool to quickly generate secondary structure diagrams for individual proteins, with custom coloring by sequence position, in seconds (local install) to minutes (Google Colab notebook).

2. RESULTS

2.1. Software overview

SSDraw requires two inputs to run: (1) a file containing three‐dimensional protein coordinates in PDB format and (2) an MSA in FASTA format (Figure 2). SSDraw requires only alpha carbon coordinates to generate an image. The user may specify the chain ID if they input a multi‐chain PDB. The MSA can be generated with programs such as MUSCLE (Edgar, 2004), Clustal Omega (Sievers & Higgins, 2014), or HMMER (Finn et al., 2011), so long as it is inputted in FASTA format. The user may also input a single ungapped FASTA sequence if they are interested in generating a diagram from a single sequence.

FIGURE 2. SSDraw flowchart. Inputs to SSDraw include three‐dimensional coordinates of a protein structure in Protein Data Bank (PDB) format and a single sequence or multiple sequence alignment (MSA) in FASTA format. Protein secondary structure is determined from the input PDB using Define Secondary Structure of Proteins (DSSP). Stacked polygons represent α‐helices, arrows represent β‐sheets, rectangles represent loops, β‐turns, disordered regions, and secondary structure shorter than three to four residues; alignment gaps are empty spaces. Secondary structures are registered with the input sequence or alignment to account for gaps and drawn using the Matplotlib patches library for Python3. Finally, secondary structures are colored by sequence conservation scores, B‐factor, or another user‐defined input. In this figure, the final diagram is colored by default sequence conservation scores. Output structures can be saved as .png, .eps, .svg, .ps, or .tiff files at a user‐specified resolution. MSA depicted using the Alignment Viewer program.

By default, SSDraw computes secondary structure annotations for each amino acid using Define Secondary Structure of Proteins (DSSP; Joosten et al., 2011; Kabsch & Sander, 1983), which annotates secondary structure from three‐dimensional protein structures based on hydrogen bonding patterns (Section 4). In lieu of a PDB file, users may input alternative secondary structure annotations (Midlik et al., 2021; Srinivasan & Rose, 1999) or precomputed DSSP annotations in .horiz format.

Annotated secondary structures are then aligned in register with the input sequence alignment (Figure 2) in FASTA format. For proper alignment, the user inputs the name of the reference sequence in the alignment. Protein structures determined by x‐ray crystallography or cryo‐EM often have unresolved regions due to weak or missing electron density, leading to gaps in their experimentally determined structures. These structural gaps lead to alignment gaps between reference sequences and annotated secondary structures. Accordingly, SSDraw adjusts the reference sequence to be the same length as the secondary structure annotations taken from experimentally determined structures; experimentally unresolved regions are assumed to be disordered and are therefore visualized as loops.
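The registration step described above can be illustrated with a minimal sketch. It is not SSDraw's own code: the function name `register_annotations`, the toy sequence, and the one-letter annotation codes (H for helix, L for loop) are assumptions made for illustration. The sketch copies one annotation per residue onto the gapped reference row of the alignment and leaves alignment gaps as gaps.

```python
def register_annotations(aligned_seq, annotations, gap="-"):
    """Place per-residue annotations onto a gapped alignment row.

    aligned_seq: reference sequence as it appears in the alignment (with gaps).
    annotations: one character per residue (no gaps), e.g. "H" or "L".
    """
    n_residues = sum(1 for aa in aligned_seq if aa != gap)
    if n_residues != len(annotations):
        raise ValueError(
            f"{n_residues} residues in alignment row but "
            f"{len(annotations)} annotations"
        )
    remaining = iter(annotations)
    # Gaps stay gaps; every residue consumes the next annotation in order.
    return "".join(gap if aa == gap else next(remaining) for aa in aligned_seq)


# Hypothetical alignment row with a two-column gap after the second residue.
registered = register_annotations("MK--LLVE", "LHHHHL")
print(registered)  # LH--HHHL
```

Once every homolog has been registered to the same alignment in this way, column X of one registered string can be compared directly with column X of another.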

Secondary structures are then drawn with patches from the Matplotlib (Hunter, 2007) package for Python3 (Figure 2; Section 4). Successive slanted polygons represent α‐helices, arrows represent β‐sheets, rectangles represent loops, and empty spaces between secondary structures represent alignment gaps. Loops, β‐turns, disordered regions, and segments of regular secondary structure shorter than four (α‐helices) or three (β‐sheets) successive residues are drawn as thin rectangles layered under the other secondary structure elements (Section 4).

If desired, secondary structures can be colored by sequence conservation score, B‐factor, or another user‐defined input (Figure 1). This feature was originally developed to compare secondary structure conservation in a family of bacterial response regulators with some secondary structure elements that switch from α‐helix to β‐sheet in response to stepwise mutation (Chakravarty, Sreenivasan, et al., 2023). Sequence conservation scores are computed automatically from the input sequence alignment (Section 4), though scores from Rate4Site (Pupko et al., 2002), a more accurate conservation metric, may also be inputted. Alternatively, the image can be colored with a solid fill specified by the user. For instance, the first diagram in Figure 2 was generated using a white fill. Custom coloring schemes and custom colormaps may be specified by the user.

If the user wants to assign custom coloring scores to each residue, they have two options. The first is to upload a custom scoring file that contains residue‐specific scores. This file is formatted with two columns: column one contains one‐letter amino acid codes for each residue to be colored; column two contains scores corresponding to the amino acids in column one; columns are delimited by one space. The second option is to input a PDB file whose C‐alpha B‐factors hold the custom scores and to color the image by B‐factor. This option allows the user to easily visualize confidence scores from structure predictors such as AlphaFold2 (Jumper et al., 2021) and ESMfold (Lin et al., 2023), if desired. Any range of scores can be used for custom coloring: scores are normalized before the image is colored. Because SSDraw uses the Matplotlib (Hunter, 2007) Python package, any premade Matplotlib colormap may be used; users can also specify custom colormaps as input.
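As a rough illustration of the two‐column scoring file, the sketch below parses such a file and rescales the scores to the range 0 to 1. The function names and example values are hypothetical, and min‐max scaling is an assumption: the text states only that scores are normalized, not which normalization SSDraw applies.

```python
def read_scores(lines):
    """Parse lines of the form '<one-letter code> <score>'."""
    residues, scores = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        amino_acid, value = line.split()
        residues.append(amino_acid)
        scores.append(float(value))
    return residues, scores


def min_max_normalize(scores):
    """Rescale scores linearly to [0, 1] (assumed normalization)."""
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        return [0.0 for _ in scores]  # constant input: avoid division by zero
    return [(s - lowest) / (highest - lowest) for s in scores]


# Hypothetical file contents: one residue per line, one space between columns.
example_lines = ["M 12.5", "K 50.0", "L 87.5"]
residues, raw_scores = read_scores(example_lines)
normalized = min_max_normalize(raw_scores)
print(residues, normalized)  # ['M', 'K', 'L'] [0.0, 0.5, 1.0]
```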

For those desiring to visualize a protein region rather than the whole protein, starting and ending residues can be specified. The Google Colab notebook provides a sliding window that allows the user to select which portion of the alignment will be drawn. Residue numbers corresponding to PDB numbering can be inputted into the local install.

The final output is a linear secondary structure diagram, colored as the user specifies (Figure 2). Output files can be saved as .png, .eps, .svg, .ps, and .tiff files at a user‐specified resolution. By default, figures are saved as .png files at 600 ppi (pixels per inch), a publication‐quality resolution. The user also has the option to include ticks with residue numbers at any specified interval in these final figures.

SSDraw can be used to generate three sorts of outputs: single ungapped diagrams, single gapped diagrams, and multiple aligned and stacked diagrams (Figure 3). The first may be best when the user wishes to depict continuous secondary structure of one protein structure (Figure 3A), whereas the second may be preferred if the input structure has unresolved regions. In the latter case, the user may input a sequence with gaps corresponding to unresolved regions in the structure, which will then be depicted as gaps (Figure 3B). Finally, the user may wish to generate multiple secondary structure diagrams of homologous proteins for comparison (Figure 3C). To accurately compare these diagrams, the secondary structures of each input structure should be aligned to the same MSA. Secondary structures of homologous sequences aligned to different MSAs will likely be unregistered (Figure 1, top) and thus cannot be compared. Two examples of more advanced uses of SSDraw are now presented.

FIGURE 3. SSDraw has three modes of use. In the first mode, the user inputs a protein structure and an ungapped sequence; a single ungapped secondary structure diagram is outputted (A). In the second mode, the user inputs a protein structure and a gapped sequence; a single gapped secondary structure diagram is outputted (B). In the third mode, the user inputs multiple structures and a multiple sequence alignment that aligns their sequences; multiple stacked secondary structure diagrams are outputted (C). In all three panels, the experimentally determined structure of the transcriptional regulator RfaH (Belogurov et al., 2007; Zuber et al., 2018; PDB ID: 5OND, chain A, dark purple) and its sequence (in different alignments) are inputted. In panel (C), its homolog NusG (Kang et al., 2018; PDB ID: 6ZTJ, chain CF, light purple) is also inputted with a multiple sequence alignment (MSA). RfaH and NusG are members of the only known family of transcriptional regulators conserved from bacteria to humans (Werner, 2012). They share a structurally conserved N‐terminal domain, while their C‐terminal domains differ dramatically in the ground state (Burmann et al., 2012; Porter et al., 2022): RfaH's is all α‐helical, while NusG's is all β‐sheet.

2.2. Advanced example 1: Comparing distinct structures with highly identical sequences using a custom color map

SSDraw can be used to compare secondary structures of proteins with high levels of sequence identity but different folds (Figure 4). Extensive work has been performed to engineer (Alexander et al., 2007; Alexander et al., 2009; He et al., 2012; Ruan et al., 2023) and characterize (Sikosek et al., 2016; Solomon et al., 2023; Tian & Best, 2020) variants of the human serum albumin‐binding protein GA and the immunoglobulin binding protein GB. While GA folds into a trihelical bundle, GB folds into a 4β + α structure. One or several mutations can cause the protein to flip from one ground‐state fold to the other (Alexander et al., 2009; He et al., 2012). The distinct secondary structures of GA and GB variants can be visualized readily with SSDraw (Figure 4). The top structure (GB95) is the reference and therefore has no mutations. Three mutations (cyan) to GB95 switch its fold to the three‐helical bundle (GA95); two mutations to GA95 (magenta) switch it back to the 4β + α fold (GB98). GB98 can be switched back to the trihelical fold with one mutation (yellow), which can be switched back to the 4β + α fold with three mutations (GB98‐T25I, L20A, white). Finally, another single mutation (green) switches GB98‐T25I, L20A back to the trihelical fold (GB98‐T25I). Interestingly, fold‐switching mutations tend to occur in the central region of the protein (residues 20, 25, and 30) rather than at the termini, where the closest known fold‐switching mutation is 11 residues away from the C‐terminus (position 45). Furthermore, all mutations occur in regions of secondary structure rather than loops.

FIGURE 4. Comparing the structures of proteins with highly identical amino acid sequences but different folds. Diagrams show very different secondary structures derived from the nuclear magnetic resonance structures (Protein Data Bank [PDB] IDs in parentheses) of engineered variants of immunoglobulin binding protein GB (4β + α fold) and human serum albumin binding protein GA (trihelical bundle). This figure should be read from top to bottom. Position‐specific mutations required to switch a given fold from that of its predecessor, the diagram directly above it, are shown in different colors representing mutations unique to each protein. Black positions were not mutated relative to their immediate predecessors.

2.3. Advanced example 2: Comparing sequence conservation in similar structures with a default color map

SSDraw can also be used to relate sequence conservation to secondary structure in protein families with conserved folds. These comparisons for ubiquitin and ubiquitin‐like proteins (Walters et al., 2004) are shown in Figure 5. Not surprisingly, sequences in loop regions tend to be least conserved, while sequences that fold into secondary structures tend to be more conserved. One exception is the second β‐sheet, which has been identified as a SUMO1 binding site and putative NEDD8 binding motif by NMR spectroscopy (Song et al., 2004) and structural modeling (He et al., 2017), respectively. Thus, sequence variation in the second β‐sheet may foster different binding functions in different ubiquitin‐like proteins. Sequence conservation was calculated directly from the input sequence alignment (Section 4).

FIGURE 5. SSDraw diagrams for ubiquitin and ubiquitin‐like proteins colored by sequence conservation score (1.0 is highly conserved; 0.0 is least conserved). Sequences of secondary structure elements tend to be more conserved, with the notable exception of the second β‐sheet, whose binding functions vary among some ubiquitin‐like proteins.

3. DISCUSSION AND CONCLUSIONS

SSDraw generates publication‐quality secondary structure diagrams in seconds to minutes. These diagrams can be colored by conservation score, B‐factors, or a user‐specified metric, allowing relationships between secondary structure and other protein properties to be observed readily. SSDraw is expected to be most useful for comparing secondary structures of homologous proteins with different folds, an emerging class of proteins (Chakravarty, Schafer, & Porter, 2023) for which few computational tools are available. Nevertheless, SSDraw may also be used to (1) diagram single structures and color them by any property of interest and (2) compare secondary structures of homologous proteins with conserved folds.

4. METHODS

4.1. Secondary structure annotation

SSDraw uses DSSP (Joosten et al., 2011; Kabsch & Sander, 1983) to annotate secondary structure from three‐dimensional protein coordinates in PDB format. The local install uses the DSSP module in Biopython (Cock et al., 2009) to parse the annotations generated by separate compiled software. Only C‐alpha coordinates are necessary for annotation. In addition to regular secondary structure (α‐helices and β‐sheets), DSSP annotates various local structures such as β‐turns and 3₁₀ helices. These features are not displayed in SSDraw diagrams because they rarely occur in runs long enough to draw (Table 1). Due to limitations of the patches library, at least four consecutive identical annotations (e.g., HHHH or EEEE) would be needed to introduce additional structural elements into these diagrams. Table 1 shows that α‐helices, β‐sheets, and loops comprise 87% of all runs of at least four consecutive identical annotations; the next most frequent annotation is Turns, representing 4%. These statistics were calculated from DSSP annotations of 185,725 unique PDB files. Helices are drawn for at least four consecutive “H” annotations, and β‐sheets are drawn for at least three consecutive “E” or “B” annotations, combined in any way. All other annotations are visualized as loops. Short helices with <4 consecutive “H” annotations and short β‐sheets with <3 “E” or “B” annotations are also visualized as loops.
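The run‐length rules above can be sketched as a small function that collapses a DSSP string into three drawing classes. This is an illustrative reading of the rules in the text, not SSDraw's source code; the function name `to_three_state` and the example string are hypothetical.

```python
from itertools import groupby


def to_three_state(dssp, min_helix=4, min_strand=3):
    """Collapse DSSP codes to H (helix), E (strand), or L (loop).

    'H' maps to helix; 'E' and 'B' map to strand, combined in any way;
    every other code maps to loop. Helix runs shorter than min_helix and
    strand runs shorter than min_strand are demoted to loop.
    """
    classes = []
    for code in dssp:
        if code == "H":
            classes.append("H")
        elif code in ("E", "B"):
            classes.append("E")
        else:
            classes.append("L")

    out = []
    for kind, run in groupby(classes):
        length = len(list(run))
        too_short = (kind == "H" and length < min_helix) or (
            kind == "E" and length < min_strand
        )
        out.append(("L" if too_short else kind) * length)
    return "".join(out)


# Hypothetical annotation: a 5-residue helix, an E/B/E strand, a 3-10 helix
# (GGG), a 2-residue helix, and a 2-residue strand.
print(to_three_state("SHHHHHTTEBETGGGHHSEE"))  # LHHHHHLLEEELLLLLLLLL
```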

TABLE 1.

Frequencies of at least four consecutive DSSP annotations from 185,725 PDB files.

DSSP secondary structure annotation	Frequency of at least four identical consecutive annotations (%)
α‐helix (H)	37
β‐sheet (E,B)	36
Loop (″)	14
Hydrogen‐bonded Turn (T)	4
3₁₀ helix (G)	3
Bend (S)	3
5‐helix (I)	2
Polyproline II helix (P)	1


Abbreviations: DSSP, Define Secondary Structure of Proteins; PDB, Protein Data Bank.

Users who wish to install SSDraw locally may have difficulty installing DSSP with conda. In that case, SSDraw can instead be run with the PyDSSP library. PyDSSP is a simplified PyTorch‐based implementation of DSSP that makes three‐state secondary structure annotations (Helix, Sheet, Loop) that match DSSP 97% of the time (https://github.com/ShintaroMinami/PyDSSP).

4.2. Drawing secondary structures

Annotated secondary structures are grouped into three categories: Loop, Helix, and Strand. The lengths of each segment of structure in each category are calculated. Then, each category is drawn separately using the patches library from Matplotlib (Hunter, 2007) for Python3. First, Loops are drawn with the Rectangle patch. Loop lengths are calculated as the number of consecutive annotations divided by 6.0. When Loops connect elements of secondary structure, they are extended at both ends by 1.0/6.0. All loops have a zorder of 0 so that their images are layered under strand and helix diagrams. Then, coordinates for images of β‐sheets and α‐helices are stored to be drawn later for better performance. Strands are drawn using the FancyArrow patch with a width of 1.0, linewidth of 0.5, zorder equal to an index that increases over all secondary structures from left to right, head_width of 2.0, and head_length of 2.0/6.0. Strand length is defined as the number of consecutive annotations for the strand being drawn divided by 6.0; to avoid incorrect gapping, this length is extended by 1.0/6.0 if C‐terminal elements of secondary structure follow the strand. Helices are drawn as stacked Polygon patches with right‐leaning patches layered on top and left‐leaning patches layered underneath. The short sides of the polygons measure 1.0/6.0; the long sides measure 1.8/6.0. Helices begin and end with shorter polygons that align with other secondary structures (height of 1.4/6.0, width of 1.0/6.0). All lengths are proportional measures scaled to fit into a figure 25 inches long. Consequently, shorter proteins will have larger secondary structures in the horizontal dimension and vice versa. Vertical heights of all secondary structures are kept constant.
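A minimal Matplotlib sketch of the loop and strand geometry described above follows. It uses the stated 1/6 unit per residue and the stated FancyArrow parameters, but several details are assumptions made for illustration rather than SSDraw's settings: the loop height of 0.3, the white fill, the use of length_includes_head=True, the figure size, and the helper names. Helix polygons are omitted for brevity.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; no display required
import matplotlib.pyplot as plt
from matplotlib.patches import FancyArrow, Rectangle

UNIT = 1.0 / 6.0  # horizontal length per residue, as stated in the text


def loop_patch(start, n_res, height=0.3):
    """Thin rectangle for a loop; zorder 0 keeps it under other elements."""
    return Rectangle(
        (start * UNIT, -height / 2), n_res * UNIT, height,
        facecolor="white", edgecolor="black", linewidth=0.5, zorder=0,
    )


def strand_patch(start, n_res, zorder):
    """Arrow for a strand, using the width and head sizes given in the text."""
    return FancyArrow(
        start * UNIT, 0.0, n_res * UNIT, 0.0,
        width=1.0, head_width=2.0, head_length=2.0 * UNIT,
        length_includes_head=True,  # assumption: total length is n_res * UNIT
        linewidth=0.5, facecolor="white", edgecolor="black", zorder=zorder,
    )


fig, ax = plt.subplots(figsize=(6, 1.5))
loop = loop_patch(0, 16)               # loop spanning 16 residues
strand = strand_patch(4, 6, zorder=1)  # 6-residue strand drawn over the loop
ax.add_patch(loop)
ax.add_patch(strand)
ax.set_xlim(-0.2, 16 * UNIT + 0.2)
ax.set_ylim(-1.5, 1.5)
ax.axis("off")
fig.canvas.draw()
```

Because the loop has zorder 0 and the strand a higher zorder, the arrow hides the part of the rectangle that runs beneath it, which is the layering behavior the text describes.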

4.3. Coloring secondary structures

Secondary structures have black edges; their insides are filled by clipping an input colormap equal in size to the diagram. Groups of loops, helices, and strands are each converted to clipping paths using Matplotlib's mpath.Path command. These paths are then converted to patches with mpatch.PathPatch. Finally, an input colormap equal in size to the diagram is generated from user‐specified parameters or a solid color and clipped to fill the insides of the path (im.set_clip_path command); the rest of the colormap is discarded. Repeatedly generating the colormap slows performance considerably. For instance, generating one diagram of a 215‐residue response regulator with a mixture of helices and strands (PDB ID: 1A04) takes 1 min, 5 s when a separate colormap must be generated for each secondary structure element, including every polygon that makes up the helices. To improve performance, SSDraw generates colormaps only three times, once for each class of secondary structure. This improved implementation reduced image generation time for 1A04 to 2.6 s, a 25‐fold speed‐up. The Google Colab notebook takes about 2 min to generate its first secondary structure diagram because it must load outside software packages, such as DSSP, before running.
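The clipping approach can be sketched with standard Matplotlib calls. The example below is a simplified stand‐in, not SSDraw's implementation: two rectangles play the role of one class of secondary structure elements, the per‐residue scores are made up, and the helper name `rect_path` is hypothetical. One image is generated for the whole group and then clipped to the compound outline.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; no display required
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.patches import PathPatch
from matplotlib.path import Path


def rect_path(x0, x1, y0, y1):
    """Closed rectangular path standing in for one drawn element."""
    verts = [(x0, y0), (x1, y0), (x1, y1), (x0, y1), (x0, y0)]
    codes = [Path.MOVETO, Path.LINETO, Path.LINETO, Path.LINETO, Path.CLOSEPOLY]
    return Path(verts, codes)


scores = np.linspace(0.0, 1.0, 30)  # made-up per-residue scores

fig, ax = plt.subplots(figsize=(6, 1))

# Combine all elements of one class into a single compound path and patch.
outline = Path.make_compound_path(
    rect_path(0, 10, -0.5, 0.5), rect_path(15, 30, -0.5, 0.5)
)
patch = PathPatch(outline, facecolor="none", edgecolor="black", linewidth=0.5)
ax.add_patch(patch)

# Generate the colormap image once, spanning the whole diagram ...
im = ax.imshow(
    scores[np.newaxis, :], cmap="viridis", aspect="auto",
    extent=(0, 30, -0.5, 0.5), vmin=0.0, vmax=1.0,
)
# ... and keep only the part that falls inside the outlines.
im.set_clip_path(patch)

ax.set_xlim(0, 30)
ax.set_ylim(-1, 1)
ax.axis("off")
fig.canvas.draw()
```

Clipping one image per group, rather than one per element, is what keeps the number of generated colormaps small.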

4.4. Conservation scores

Conservation scores are computed directly from an input sequence alignment. First, the consensus sequence is determined by calculating the most common amino acids in each column of the alignment. A conservation score is then calculated by:

(1) Determining the number, N, of amino acids in column i with substitution scores ≥0 for the consensus residue in column i. Substitution scores are calculated using the BLOSUM62 (Henikoff & Henikoff, 1992) matrix supplied by Biopython (Cock et al., 2009).
(2) Normalizing N by the total number of amino acids in column i. Gaps are not included in the normalization.
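The two steps above can be sketched as follows. To keep the example free of dependencies, it uses a small hand‐copied excerpt of BLOSUM62 values rather than the full matrix from Biopython, which could instead be loaded with Bio.Align.substitution_matrices.load("BLOSUM62"). The function names, the toy alignment, the tie‐breaking rule for the consensus (first most common residue encountered), and the score of 0.0 for all‐gap columns are assumptions for illustration.

```python
from collections import Counter

# Excerpt of BLOSUM62 covering only the residue pairs in the toy alignment.
BLOSUM62_EXCERPT = {
    ("L", "L"): 4, ("L", "I"): 2, ("L", "D"): -4,
    ("K", "K"): 5, ("K", "R"): 2,
}


def pair_score(a, b, matrix):
    """Look up a symmetric substitution score stored under either key order."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def conservation_scores(alignment, matrix, gap="-"):
    """Per-column fraction of residues scoring >= 0 against the consensus."""
    scores = []
    for column in zip(*alignment):
        residues = [aa for aa in column if aa != gap]  # gaps are excluded
        if not residues:
            scores.append(0.0)  # assumption: all-gap columns score 0
            continue
        consensus = Counter(residues).most_common(1)[0][0]
        n_similar = sum(
            1 for aa in residues if pair_score(aa, consensus, matrix) >= 0
        )
        scores.append(n_similar / len(residues))
    return scores


toy_alignment = ["LK", "IR", "L-", "DK"]
column_scores = conservation_scores(toy_alignment, BLOSUM62_EXCERPT)
print(column_scores)  # [0.75, 1.0]
```

In the first column the consensus is L; I scores 2 against L and D scores -4, so three of four residues count (0.75). In the second column the gap is ignored and all three residues score ≥0 against the consensus K (1.0).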

SSDraw can also take ConSurf and Rate4Site scores as input. ConSurf scores are taken directly from the input file and used to color the output structure with no modification to the values. Rate4Site scores are normalized and grouped into nine bins as in Ashkenazy et al. (2016).

The multiple sequence alignment in Figure 2 was generated using Alignment Viewer: https://github.com/sanderlab/alignmentviewer. All three‐dimensional protein structures in Figures 2 and 3 were generated using PyMOL (Schrödinger, LLC, n.d.).

AUTHOR CONTRIBUTIONS

Lauren L. Porter: Conceptualization; funding acquisition; writing—original draft; methodology; visualization; writing—review and editing; software; supervision; resources. Ethan A. Chen: Writing—original draft; visualization; writing—review and editing; software; formal analysis.

ACKNOWLEDGMENTS

We thank Myeongsang Lee and Joseph Schafer for testing local installs of SSDraw and Leslie Ronish and Joseph Thole for testing the Google Colab notebook. L.L.P. thanks Loren Looger for inspiring the idea of SSDraw. This work was supported by the Intramural Research Program of the National Library of Medicine, National Institutes of Health (LM202011, L.L.P.).

Chen EA, Porter LL. SSDraw: Software for generating comparative protein secondary structure diagrams. Protein Science. 2023;32(12):e4836. 10.1002/pro.4836

Review Editor: Nir Ben‐Tal

DATA AVAILABILITY STATEMENT

The complete code, documentation, and examples for SSDraw can be found at https://github.com/ncbi/SSDraw. A Google Colab notebook is also available at: https://colab.research.google.com/github/ncbi/SSDraw/blob/main/SSDraw.ipynb. To upload local files into the Colab notebook, the user must run the notebook with Google Chrome.

REFERENCES

Alexander PA, He Y, Chen Y, Orban J, Bryan PN. The design and characterization of two proteins with 88% sequence identity but different structure and function. Proc Natl Acad Sci U S A. 2007;104(29):11963–11968. [DOI] [PMC free article] [PubMed] [Google Scholar]
Alexander PA, He Y, Chen Y, Orban J, Bryan PN. A minimal sequence code for switching protein structure and function. Proc Natl Acad Sci U S A. 2009;106(50):21149–21154. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ashkenazy H, Abadi S, Martz E, Chay O, Mayrose I, Pupko T, et al. Consurf 2016: an improved methodology to estimate and visualize evolutionary conservation in macromolecules. Nucleic Acids Res. 2016;44(W1):W344–W350. [DOI] [PMC free article] [PubMed] [Google Scholar]
Baek M, DiMaio F, Anishchenko I, Dauparas J, Ovchinnikov S, Lee GR, et al. Accurate prediction of protein structures and interactions using a three‐track neural network. Science. 2021;373(6557):871–876. [DOI] [PMC free article] [PubMed] [Google Scholar]
Belogurov GA, Vassylyeva MN, Svetlov V, Klyuyev S, Grishin NV, Vassylyev DG, et al. Structural basis for converting a general transcription factor into an operon‐specific virulence regulator. Mol Cell. 2007;26(1):117–129. [DOI] [PMC free article] [PubMed] [Google Scholar]
Berman HM, Battistuz T, Bhat TN, Bluhm WF, Bourne PE, Burkhardt K, et al. The Protein Data Bank. Acta Crystallogr D Biol Crystallogr. 2002;58(1):899–907. [DOI] [PubMed] [Google Scholar]
Burley SK, Berman HM, Kleywegt GJ, Markley JL, Nakamura H, Velankar S. Protein data bank (PDB): the single global macromolecular structure archive. Methods Mol Biol. 2017;1607:627–641. [DOI] [PMC free article] [PubMed] [Google Scholar]
Burmann BM, Knauer SH, Sevostyanova A, Schweimer K, Mooney RA, Landick R, et al. An alpha helix to beta barrel domain switch transforms the transcription factor RfaH into a translation factor. Cell. 2012;150(2):291–303. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chakravarty D, Schafer JW, Porter LL. Distinguishing features of fold‐switching proteins. Protein Sci. 2023;32(3):e4596. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chakravarty D, Sreenivasan S, Swint‐Kruse L, Porter LL. Identification of a covert evolutionary pathway between two protein folds. Nat Commun. 2023;14(1):3177. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chowdhury R, Bouatta N, Biswas S, Floristean C, Kharkare A, Roye K, et al. Single‐sequence protein structure prediction using a language model and deep learning. Nat Biotechnol. 2022;40:1611–1623. [DOI] [PMC free article] [PubMed] [Google Scholar]
Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: freely available python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25(11):1422–1423. [DOI] [PMC free article] [PubMed] [Google Scholar]
Dishman AF, Tyler RC, Fox JC, Kleist AB, Prehoda KE, Babu MM, et al. Evolution of fold switching in a metamorphic protein. Science. 2021;371(6524):86–90. [DOI] [PMC free article] [PubMed] [Google Scholar]
Edgar RC. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics. 2004;5:113. [DOI] [PMC free article] [PubMed] [Google Scholar]
Finn RD, Clements J, Eddy SR. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 2011;39(Web Server issue):W29–W37. [DOI] [PMC free article] [PubMed] [Google Scholar]
Garlapati D, Charankumar B, Ramu K, Madeswaran P, Ramana MM. A review on the applications and recent advances in environmental DNA (eDNA) metagenomics. Rev Environ Sci Bio/Technol. 2019;18:389–411. [Google Scholar]
Gouet P, Robert X, Courcelle E. ESPript/ENDscript: extracting and rendering sequence and 3D information from atomic structures of proteins. Nucleic Acids Res. 2003;31(13):3320–3323. [DOI] [PMC free article] [PubMed] [Google Scholar]
He S, Cao Y, Xie P, Dong G, Zhang L. The NEDD8 non‐covalent binding region in the Smurf HECT domain is critical to its ubiquitin ligase function. Sci Rep. 2017;7:41364. [DOI] [PMC free article] [PubMed] [Google Scholar]
He Y, Chen Y, Alexander PA, Bryan PN, Orban J. Mutational tipping points for switching protein folds and functions. Structure. 2012;20(2):283–291. [DOI] [PMC free article] [PubMed] [Google Scholar]
Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A. 1992;89(22):10915–10919. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hunter JD. Matplotlib: a 2D graphics environment. Comput Sci Eng. 2007;9(3):90–95. [Google Scholar]
Hutarova Varekova I, Hutar J, Midlik A, Horsky V, Hladka E, Svobodova R, et al. 2DProts: database of family‐wide protein secondary structure diagrams. Bioinformatics. 2021;37(23):4599–4601. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hutchinson EG, Thornton JM. HERA—a program to draw schematic diagrams of protein secondary structures. Proteins. 1990;8(3):203–212. [DOI] [PubMed] [Google Scholar]
Joosten RP, te Beek TA, Krieger E, Hekkelman ML, Hooft RW, Schneider R, et al. A series of PDB related databases for everyday needs. Nucleic Acids Res. 2011;39(Database issue):D411–D419. [DOI] [PMC free article] [PubMed] [Google Scholar]
Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–589. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen‐bonded and geometrical features. Biopolymers. 1983;22(12):2577–2637. [DOI] [PubMed] [Google Scholar]
Kang JY, Mooney RA, Nedialkov Y, Saba J, Mishanina TV, Artsimovitch I, et al. Structural basis for transcript elongation control by NusG family universal regulators. Cell. 2018;173(7):1650–1662.e14. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kocincova L, Jaresova M, Byska J, Parulek J, Hauser H, Kozlikova B. Comparative visualization of protein secondary structures. BMC Bioinformatics. 2017;18(Suppl 2):23. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary‐scale prediction of atomic‐level protein structure with a language model. Science. 2023;379(6637):1123–1130. [DOI] [PubMed] [Google Scholar]
Liu S, Chen H, Yin Y, Lu D, Gao G, Li J, et al. Inhibition of FAM46/TENT5 activity by BCCIPα adopting a unique fold. Sci Adv. 2023;9(14):eadf5583. [DOI] [PMC free article] [PubMed] [Google Scholar]
Midlik A, Hutarova Varekova I, Hutar J, Chareshneu A, Berka K, Svobodova R. OverProt: secondary structure consensus for protein families. Bioinformatics. 2022;38(14):3648–3650. [DOI] [PubMed] [Google Scholar]
Midlik A, Navratilova V, Moturu TR, Koca J, Svobodova R, Berka K. Uncovering of cytochrome P450 anatomy by SecStrAnnotator. Sci Rep. 2021;11(1):12345. [DOI] [PMC free article] [PubMed] [Google Scholar]
Murzin AG. Biochemistry. Metamorphic proteins. Science. 2008;320(5884):1725–1726. [DOI] [PubMed] [Google Scholar]
O'Donoghue SI, Sabir KS, Kalemanov M, Stolte C, Wellmann B, Ho V, et al. Aquaria: simplifying discovery and insight from protein structures. Nat Methods. 2015;12(2):98–99. [DOI] [PubMed] [Google Scholar]
Pettersen EF, Goddard TD, Huang CC, Meng EC, Couch GS, Croll TI, et al. UCSF ChimeraX: structure visualization for researchers, educators, and developers. Protein Sci. 2021;30(1):70–82. [DOI] [PMC free article] [PubMed] [Google Scholar]
Porter LL. Fluid protein fold space and its implications. Bioessays. 2023;45:e2300057. [DOI] [PMC free article] [PubMed] [Google Scholar]
Porter LL, Kim AK, Rimal S, Looger LL, Majumdar A, Mensh BD, et al. Many dissimilar NusG protein domains switch between alpha‐helix and beta‐sheet folds. Nat Commun. 2022;13(1):3802. [DOI] [PMC free article] [PubMed] [Google Scholar]
Porter LL, Looger LL. Extant fold‐switching proteins are widespread. Proc Natl Acad Sci U S A. 2018;115(23):5968–5973. [DOI] [PMC free article] [PubMed] [Google Scholar]
Pupko T, Bell RE, Mayrose I, Glaser F, Ben‐Tal N. Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics. 2002;18(Suppl 1):S71–S77. [DOI] [PubMed] [Google Scholar]
Ruan B, He Y, Chen Y, Choi EJ, Chen Y, Motabar D, et al. Design and characterization of a protein fold switching network. Nat Commun. 2023;14(1):431. [DOI] [PMC free article] [PubMed] [Google Scholar]
Schafer JW, Porter L. Evolutionary selection of proteins with two folds. Nat Commun. 2023;14(1):5478. [DOI] [PMC free article] [PubMed] [Google Scholar]
Schrödinger, LLC. The PyMOL Molecular Graphics System, Version 2.0. Schrödinger, LLC. [Google Scholar]
Sievers F, Higgins DG. Clustal Omega, accurate alignment of very large numbers of sequences. Methods Mol Biol. 2014;1079:105–116. [DOI] [PubMed] [Google Scholar]
Sikosek T, Krobath H, Chan HS. Theoretical insights into the biophysics of protein bi‐stability and evolutionary switches. PLoS Comput Biol. 2016;12(6):e1004960. [DOI] [PMC free article] [PubMed] [Google Scholar]
Solomon TL, He Y, Sari N, Chen Y, Gallagher DT, Bryan PN, et al. Reversible switching between two common protein folds in a designed system using only temperature. Proc Natl Acad Sci U S A. 2023;120(4):e2215418120. [DOI] [PMC free article] [PubMed] [Google Scholar]
Song J, Durrin LK, Wilkinson TA, Krontiris TG, Chen Y. Identification of a SUMO‐binding motif that recognizes SUMO‐modified proteins. Proc Natl Acad Sci U S A. 2004;101(40):14373–14378. [DOI] [PMC free article] [PubMed] [Google Scholar]
Srinivasan R, Rose GD. A physical basis for protein secondary structure. Proc Natl Acad Sci U S A. 1999;96(25):14258–14263. [DOI] [PMC free article] [PubMed] [Google Scholar]
Stivala A, Wybrow M, Wirth A, Whisstock JC, Stuckey PJ. Automatic generation of protein structure cartoons with Pro‐origami. Bioinformatics. 2011;27(23):3315–3316. [DOI] [PubMed] [Google Scholar]
Tian P, Best RB. Exploring the sequence fitness landscape of a bridge between two protein folds. PLoS Comput Biol. 2020;16(10):e1008285. [DOI] [PMC free article] [PubMed] [Google Scholar]
UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 2021;49(D1):D480–D489. [DOI] [PMC free article] [PubMed] [Google Scholar]
Varadi M, Anyango S, Deshpande M, Nair S, Natassia C, Yordanova G, et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein‐sequence space with high‐accuracy models. Nucleic Acids Res. 2021;50(D1):D439–D444. [DOI] [PMC free article] [PubMed] [Google Scholar]
Walters KJ, Goh AM, Wang Q, Wagner G, Howley PM. Ubiquitin family proteins and their relationship to the proteasome: a structural perspective. Biochim Biophys Acta Mol Cell Res. 2004;1695(1–3):73–87. [DOI] [PubMed] [Google Scholar]
Werner F. A nexus for gene expression‐molecular mechanisms of Spt5 and NusG in the three domains of life. J Mol Biol. 2012;417(1–2):13–27. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wooley JC, Godzik A, Friedberg I. A primer on metagenomics. PLoS Comput Biol. 2010;6(2):e1000667. [DOI] [PMC free article] [PubMed] [Google Scholar]
Yip KM, Fischer N, Paknia E, Chari A, Stark H. Atomic‐resolution protein structure determination by cryo‐EM. Nature. 2020;587(7832):157–161. [DOI] [PubMed] [Google Scholar]
Zuber PK, Artsimovitch I, NandyMazumdar M, Liu Z, Nedialkov Y, Schweimer K, et al. The universally‐conserved transcription factor RfaH is recruited to a hairpin structure of the non‐template DNA strand. eLife. 2018;7:e36349. [DOI] [PMC free article] [PubMed] [Google Scholar]

