Documentation


Abbreviations

Populations

AFR: African/African American
AMR: Latino
AMS: Amish
ASJ: Ashkenazi Jewish
EAS: East Asian
EUR: European (1000 genomes only)
FIN: Finnish
MDE: Middle Eastern
NFE: Non-Finnish European
SAS: South Asian
OTH: Other

N.B. if a population classification is not used by a database the value None is given.


Diseases

CM: Cardiomyopathy (any)
DCM: Dilated Cardiomyopathy
HCM: Hypertrophic Cardiomyopathy
ARVC: Arrhythmogenic Right Ventricular Cardiomyopathy
RCM: Restrictive Cardiomyopathy
MD: Muscle Disease
MmD: Multi-minicore Disease
MmD-HD: Multi-minicore Disease with Heart Disease
CNM: Centronuclear Myopathy
LGMD2J: Limb-Girdle Muscular Dystrophy Type 2J
TMD: Tibial Muscular Dystrophy
MFM: Myofibrillar Myopathy
HMERF: Hereditary Myopathy with Early Respiratory Failure
IH: Inguinal Hernia
T1P: Type 1 Fibre Predominance
CFTD: Congenital Fibre Type Disproportion
Other: Unknown or novel titinopathy

Variant and Annotation Data


Population and Disease-Associated Variant Data Collection

Population missense variant data were originally obtained from three sources: the 1000 Genomes Project (1KGP) [1], gnomAD database 2.1.1 (gnomAD2) and gnomAD database 3.1.2 (gnomAD3) [2]. Data from 1KGP were obtained from dbSNP [3], while data from gnomAD2 and gnomAD3 were obtained from the gnomAD Broad Institute web server [4]. The gnomAD2 release comprises 125,748 exomes and 15,708 genomes, while gnomAD3 covers 76,156 genomes. Data from 1KGP are present in the gnomAD databases, and there is minor overlap between the genomes surveyed in gnomAD2 and gnomAD3.

As of November 2023, gnomAD database 4.0.0 has been released, composed of 730,947 exomes and 76,215 genomes, a vast increase in both genome and variant quantity over versions 2 and 3. While all data from the version 2.1.1 and 3.1.2 data sets are reported to be present in the version 4.0.0 release, a non-negligible number of titin variants present in one or other of these older releases are not reported in the newest database. Consequently, TITINdb2 includes all three databases as separate sources of variation. Users should be aware that gnomAD database 4.0.0 (gnomAD4) has a much higher sampling of individuals with European ancestry than the older releases, owing to the inclusion of data from the UK Biobank [5]; we therefore recommend increased scrutiny of reported minor allele frequency (MAF) within this data set, and attention to the calculated allele frequency in ethnic subpopulations (a breakdown of the genetic ancestry of all gnomAD data sets can be found in the gnomAD v4.0.0 announcement [6]).
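To illustrate why subpopulation frequencies deserve attention, the following minimal sketch computes per-population and pooled allele frequencies from allele counts (AC) and allele numbers (AN). The counts are invented for illustration and are not taken from gnomAD; the `PopCount` class and `allele_frequencies` function are hypothetical helpers, not part of TITINdb2.

```python
from dataclasses import dataclass


@dataclass
class PopCount:
    ac: int  # allele count: number of observed alternate alleles
    an: int  # allele number: number of successfully genotyped alleles


def allele_frequencies(counts):
    """Return the allele frequency per population plus the pooled value ("ALL")."""
    freqs = {pop: (c.ac / c.an if c.an else None) for pop, c in counts.items()}
    total_ac = sum(c.ac for c in counts.values())
    total_an = sum(c.an for c in counts.values())
    freqs["ALL"] = total_ac / total_an if total_an else None
    return freqs


# Invented counts: one population is sampled far more heavily than the others.
example = {
    "NFE": PopCount(ac=12, an=1_000_000),
    "AFR": PopCount(ac=9, an=60_000),
    "EAS": PopCount(ac=0, an=40_000),
}
freqs = allele_frequencies(example)
for pop, af in freqs.items():
    print(pop, af)
```

In this invented case the pooled frequency is dominated by the most heavily sampled population, so a variant that is comparatively common in a smaller subpopulation appears rarer overall.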

Disease-associated titin nsSNVs were originally obtained from Chauveau et al. (2014) [7]. The current build of TITINdb2 includes all known reported disease-associated titin missense variants obtained from a search of the available literature up to the end of 2023; notably, this excludes certain variants reported in other databases of reported disease-associated variants, such as ClinVar [8], where no accompanying peer-reviewed literature in support is provided. The published form of this literature review includes additional information about each of the variants reported in TITINdb2 [9].


Saturation Mutagenesis

Saturation mutagenesis, or deep mutational scanning (DMS), refers to the systematic evaluation of the impacts of single protein residue alterations at all positions; in experimental protocols, this involves high-throughput assays assessing a measurable parameter for protein constructs, such as stability or binding affinity, for all possible single amino acid substitutions at all possible sequence positions in the protein [10]. Similarly, in silico prediction tools can be used to produce computational mutational scans, allowing for comparison of vast numbers of protein substitutions in large proteins such as titin. TITINdb2 incorporates saturation mutagenesis data from a number of different prediction tools, which may be freely accessed using the "SatMut SAVs" table on the domain structure page for any given domain.

Saturation mutagenesis pathogenicity predictions for all SAVs present in titin's canonical (N2AB) isoform were extracted from the pre-computed databases for PolyPhen-2 [11] using their online web server. These predictions were then used as input for saturation mutagenesis predictions for all N2AB SAVs located within domain boundaries for Rhapsody [12], a predictor which uses sequence-, structural-, and dynamics-based features to predict pathogenicity. As Rhapsody's web server does not have access to the full structural coverage of titin available in TITINdb2, we implemented a local pipeline incorporating pre-computed PolyPhen-2 predictions and titin domain models from AlphaFold2 (see below) to generate predictions. As in the previous build of TITINdb, all possible SAVs within domain boundaries in the inferred complete (IC) isoform of titin were assessed using mCSM [13] to predict protein stability by web server query, using homology models of titin domains generated by Modeller (see below). Protein stability predictions were also generated using DUET [14], for all possible nsSNVs only.
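The distinction between "all possible SAVs" (19 substitutions per position) and "all possible nsSNVs" (only those substitutions reachable by a single nucleotide change) can be sketched as follows. This is an illustrative example using the standard genetic code, not the pipeline used to build TITINdb2; the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
AMINO_ACIDS = sorted(set(AMINO) - {"*"})  # the 20 standard residues


def all_savs(wild_type):
    """All 19 single amino acid substitutions at a position."""
    return [aa for aa in AMINO_ACIDS if aa != wild_type]


def snv_reachable(codon):
    """Missense substitutions reachable from a codon by one nucleotide change."""
    wild_type = CODON_TABLE[codon]
    reachable = set()
    for i, base in product(range(3), BASES):
        if base == codon[i]:
            continue
        mutant = codon[:i] + base + codon[i + 1:]
        aa = CODON_TABLE[mutant]
        if aa not in (wild_type, "*"):  # skip synonymous and nonsense changes
            reachable.add(aa)
    return sorted(reachable)


print(len(all_savs("C")))    # number of SAVs at a cysteine position
print(snv_reachable("TGT"))  # subset of those SAVs that are also nsSNVs
```

For a cysteine encoded by TGT, only six of the 19 possible substitutions are reachable by a single nucleotide change, which is why tools restricted to nsSNVs cover fewer variants than full SAV scans.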

In addition to the generated saturation mutagenesis data, TITINdb2 also includes saturation mutagenesis predictions made available as pre-computed data for the community. Pathogenicity predictions for all possible nsSNVs in titin's canonical isoform are available for the meta-predictors Condel [15] and REVEL [16]; Condel predictions were obtained from FannsDB [17], while REVEL predictions were mapped from chromosomal to protein coordinates before being added to TITINdb2. Finally, the recent release of AlphaMissense [18] has included pre-computed predictions for all possible SAVs at every position in titin, both for the canonical N2AB isoform and for the inferred complete isoform. As these predictions contain minor differences, both are included in saturation mutagenesis data in TITINdb2 for the sake of comparison; both are sourced from the distribution of AlphaMissense predictions through Zenodo [19].


Pre-computed Variant Pathogenicity Predictions

Pathogenicity predictions for reported SNVs in population or pathogenic variant databases have been collected in various online repositories; we include predictions sourced from one of the most comprehensive of these databases, dbNSFPv4 [20], for nsSNVs reported therein. Of the predictors included in dbNSFPv4, we selected 13 pathogenicity prediction algorithms which were both commonly cited in comparisons of pathogenicity prediction accuracy and not sourced from other databases, and added these annotations to TITINdb2. A full list of all available variant impact prediction algorithms in TITINdb2, including both saturation mutagenesis and selected variant annotations, is given below:

Predictor Name | Source | Description | Availability | Reference
AlphaMissense | Predictions for AlphaMissense (Zenodo) [19] | Pathogenicity predictor (large language model trained on human/primate variant frequency databases) | Saturation mutagenesis of SAVs (all positions) | Cheng et al. (2023) [18]
CADD | dbNSFPv4 [20] | Pathogenicity meta-predictor (sequence conservation, other predictors) | Selected nsSNVs | Rentzsch et al. (2019) [21]
Condel | FannsDB [17] | Pathogenicity meta-predictor (FATHMM, MutationAssessor) | All possible nsSNVs | González-Pérez & López-Bigas (2011) [15]
DUET | Calculation with homology models (web server) | Protein stability meta-predictor (mCSM, SDM) | All possible nsSNVs | Pires et al. (2014b) [14]
FATHMM | dbNSFPv4 | Pathogenicity predictor (sequence homology, disease weighting) | Selected nsSNVs | Shihab et al. (2013) [22]
M-CAP | dbNSFPv4 | Pathogenicity predictor (sequence homology, genetic constraints) | Selected nsSNVs | Jagadeesh et al. (2016) [23]
mCSM | Calculation with homology models (web server) | Protein stability predictor (graph-based structural signatures) | Saturation mutagenesis of SAVs (all domains) | Pires et al. (2014a) [13]
MetaSVM | dbNSFPv4 | Pathogenicity meta-predictor (11 other predictors, allele frequency) | Selected nsSNVs | Dong, C. et al. (2015) [24]
MetaLR | dbNSFPv4 | Pathogenicity meta-predictor (11 other predictors, allele frequency) | Selected nsSNVs | Dong, C. et al. (2015) [24]
MPC | dbNSFPv4 | Pathogenicity predictor (sequence homology, genetic constraints) | Selected nsSNVs | Samocha et al. (2017) [25]
MutationAssessor | dbNSFPv4 | Pathogenicity predictor (sequence homology) | Selected nsSNVs | Reva et al. (2011) [26]
MutationTaster | dbNSFPv4 | Pathogenicity predictor (sequence homology) | Selected nsSNVs | Schwarz et al. (2014) [27]
MutPred | dbNSFPv4 | Pathogenicity predictor (sequence homology, conservation, structural change) | Selected nsSNVs | Pejaver et al. (2020) [28]
MVP | dbNSFPv4 | Pathogenicity predictor (sequence conservation, structural features, other predictors) | Selected nsSNVs | Qi et al. (2021) [29]
PolyPhen-2 | Retrieval from web server | Pathogenicity predictor (sequence homology, structural annotation) | Saturation mutagenesis of SAVs (N2AB domains) | Adzhubei et al. (2013) [11]
PrimateAI | dbNSFPv4 | Pathogenicity predictor (deep learning from sequence alignments) | Selected nsSNVs | Sundaram et al. (2018) [30]
PROVEAN | dbNSFPv4 | Pathogenicity predictor (sequence homology) | Selected nsSNVs | Choi et al. (2012) [31]
REVEL | REVEL genome segment files [28] | Pathogenicity meta-predictor (13 other predictors) | All possible nsSNVs | Ioannidis et al. (2016) [16]
Rhapsody | Calculation with AlphaFold2 models (local pipeline) | Pathogenicity predictor (sequence conservation, structural features, coarse-grained dynamics) | Saturation mutagenesis of SAVs (N2AB domains) | Ponzoni et al. (2020) [12]
SIFT | dbNSFPv4 | Pathogenicity predictor (sequence conservation) | Selected nsSNVs | Ng & Henikoff (2003) [33]

AlphaFold2 and AlphaMissense

Recent developments in computational prediction, driven by advances in artificial intelligence and large language models, have given researchers access to a wide variety of pre-computed data; chief among these are the protein structure prediction tool AlphaFold2 [34] and the previously mentioned variant pathogenicity prediction tool AlphaMissense [18], both developed by DeepMind. In addition to the methodologies, pre-computed data for both structural models of all proteins and predicted pathogenicity for all possible SAVs in the human proteome have been released; however, accessing the data for titin is not straightforward, owing to its immense size. Thus, in TITINdb2 both the AlphaFold2 structural models for titin domains and AlphaMissense predictions of pathogenicity for titin SAVs have been extracted from the proteome data files and are readily available alongside other predicted structures and variant impact scores. AlphaMissense predictions have been sourced from both canonical and isoform proteome data sets via Zenodo (see above); AlphaFold2 structures for both single and tandem titin domain constructs have been extracted from predicted proteome structures and can be viewed in the provided molecule viewer, while the source 1400-residue proteome structures can be downloaded from the structures page for a given protein domain (for details, see Protein Structure Data, below).


Protein Sequence Data


Definition of Titin Domain Boundaries

HMMER [35] was used to scan the protein sequence of the titin IC isoform (NP_001254479.2, obtained from the RefSeq database [36]) against Pfam seed libraries [37]. Where hits overlapped, the hit with the lowest E-value was accepted. When the lowest E-value hit for a region was greater than 0.0001, additional evidence was required to accept it. This was the case for domain Ig-94, which was identified with an E-value above the threshold but is verified by PDB structures 1WAA, 1TIU and 1TIT. Ig-88, Ig-89, Ig-90 and Ig-98 were also identified with E-values above the threshold; however, when the titin sequence was scanned using an HMM created from an alignment of all other (165) titin Ig domains, these were identified with high significance (5.9E-11, 1.2E-12, 2.9E-10, 7.4E-12).
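The hit-selection rule described above can be sketched as follows. This is a minimal illustration that assumes a greedy pass over hits sorted by E-value; the `Hit` class, the example coordinates and the `needs_review` list are hypothetical, and parsing of real HMMER output is not shown.

```python
from dataclasses import dataclass

E_VALUE_THRESHOLD = 1e-4  # hits above this need additional evidence


@dataclass(frozen=True)
class Hit:
    family: str
    start: int
    end: int
    evalue: float


def overlaps(a, b):
    """True if two hits share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def select_hits(hits, threshold=E_VALUE_THRESHOLD):
    """Keep the lowest E-value hit per overlapping region; flag weak hits for review."""
    accepted, needs_review = [], []
    for hit in sorted(hits, key=lambda h: h.evalue):
        if any(overlaps(hit, kept) for kept in accepted + needs_review):
            continue  # a better-scoring hit already covers this region
        (accepted if hit.evalue <= threshold else needs_review).append(hit)
    by_start = lambda h: h.start
    return sorted(accepted, key=by_start), sorted(needs_review, key=by_start)


# Invented hits for illustration only.
hits = [
    Hit("I-set", 10, 95, 1e-20),
    Hit("ig", 12, 90, 1e-8),
    Hit("fn3", 100, 190, 1e-15),
    Hit("I-set", 300, 385, 1e-2),
]
accepted, needs_review = select_hits(hits)
print(accepted)
print(needs_review)
```

Hits returned in `needs_review` correspond to cases such as Ig-94, where acceptance depends on supporting evidence (for example an experimental structure or a titin-specific HMM) rather than on the E-value alone.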

Sequence logos created using WebLogo [38] for aligned titin Fn3 sequences (see Fig 1) differ substantially from logos of the Pfam seed alignments, particularly towards the end of the sequence, where conservation drops off gradually. The boundary is therefore not clearly defined from sequence alone. When mapped onto structure, it becomes clear that the Pfam-defined boundaries do not cover the whole domain, so the Pfam Fn3 domain boundaries were judged inappropriate. Instead, Fn3 domains were initially identified using Pfam/HMMER, and the sequences of these domains, extended by 5 amino acids upstream and 16 amino acids downstream of the Pfam-defined boundary, were aligned using T-Coffee [39]. This alignment was trimmed using structural information from available titin Fn3 crystal structures, in particular 3LPW. An HMM was built from the trimmed alignment, and titin was scanned again with this HMM to redefine the Fn3 boundaries.

Fig 1. Sequence logos showing aligned titin domains and Pfam seed alignments for Ig and Fn3 domains. The Pfam Fn3 domain definition can be seen mapped onto structure in purple with structure absent from the Pfam definition in turquoise and blue.

Mapping of Titin Isoforms

Isoform sequences were obtained from RefSeq [36]. Stretcher [40] was used to align all titin isoforms to the IC isoform. Positions were mapped according to these alignments.

Isoform | Protein Length (aa) | UniProt ID | Ensembl Protein ID | NCBI RefSeq Protein ID
Inferred Complete (IC) | 35991 | Q8WZ42-12 | ENSP00000467141.1 | NP_001254479.2
Canonical (N2AB) | 34350 | Q8WZ42-1 | ENSP00000465570.1 | NP_001243779.1
Primary Cardiac (N2B) | 26926 | Q8WZ42-3 | ENSP00000434586.1 | NP_003310.4
Primary Skeletal (N2A) | 33423 | Q8WZ42-11 | ENSP00000343764.6 | NP_596869.4
Cardiac Novex-1 (novex1) | 27051 | Q8WZ42-10 | ENSP00000352154.5 | NP_597676.3
Cardiac Novex-2 (novex2) | 27118 | NA | ENSP00000340554.6 | NP_597681.4
Cardiac Novex-3 (novex3) | 5604 | Q8WZ42-6 | ENSP00000354117.4 | NP_596870.2
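The position mapping between an isoform and the IC isoform can be sketched from a pairwise alignment as follows. This is a minimal example on toy sequences, assuming the alignment is supplied as two equal-length gapped strings (as a global aligner such as Stretcher produces); the function name is hypothetical.

```python
def map_positions(aligned_ref, aligned_iso, gap="-"):
    """Map 1-based isoform positions to 1-based reference (IC) positions.

    Only columns where both sequences have a residue are mapped; isoform
    residues aligned to a gap have no reference position.
    """
    if len(aligned_ref) != len(aligned_iso):
        raise ValueError("aligned sequences must have equal length")
    mapping = {}
    ref_pos = iso_pos = 0
    for ref_res, iso_res in zip(aligned_ref, aligned_iso):
        if ref_res != gap:
            ref_pos += 1
        if iso_res != gap:
            iso_pos += 1
        if ref_res != gap and iso_res != gap:
            mapping[iso_pos] = ref_pos
    return mapping


# Toy alignment: the isoform lacks three residues present in the reference.
ref = "MTTQAPTFT"
iso = "MTT---TFT"
mapping = map_positions(ref, iso)
print(mapping)
```

Here isoform positions 4 to 6 map to reference positions 7 to 9, mirroring how a residue downstream of a skipped exon receives a higher number in the IC isoform.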

Annotation of Shared Structural Positions

While there are limited numbers of variants reported at each individual position in titin, the Ig and Fn3 domains that make up the protein have a high sequence identity, and information about variation at a given position in a domain may be provided by variants at the same structural position in homologous domains. Multiple sequence alignments of all 169 Ig domains and 132 Fn3 domains in titin were constructed and aligned positions were allocated; the structural position annotation, accessible as a link from the page for any residue located within a domain, allows the user to navigate to a page where all reported variants at that shared structural position, as well as the wild-type residues and the domains they occur in, can be viewed.

Consensus structural alignments were generated using PROMALS3D [41] with available experimental X-ray crystal structures from PDBe (see below) for both Ig and Fn3 domains; these were then mapped back onto the multiple sequence alignments of all Ig and Fn3 domains, respectively, and shared structural positions were annotated with consensus secondary structure information at that position.
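As an illustration of how shared structural positions can be derived, the sketch below groups residues from several domains by alignment column. The two short "domains", their start positions and the function name are invented for the example and do not reflect the real titin alignments.

```python
from collections import defaultdict


def shared_positions(msa, starts, gap="-"):
    """Group residues by alignment column.

    msa:    domain name -> gapped aligned sequence (all of equal length)
    starts: domain name -> protein position of the domain's first residue
    Returns column number (1-based) -> list of (domain, residue, protein position).
    """
    columns = defaultdict(list)
    for domain, aligned in msa.items():
        pos = starts[domain] - 1
        for col, res in enumerate(aligned, start=1):
            if res == gap:
                continue  # this domain has no residue at this structural position
            pos += 1
            columns[col].append((domain, res, pos))
    return dict(columns)


# Invented mini-alignment of two hypothetical domains.
msa = {"Ig-A": "PK-FT", "Ig-B": "PRLF-"}
starts = {"Ig-A": 101, "Ig-B": 501}
columns = shared_positions(msa, starts)
print(columns[4])
```

Variants reported at any of the protein positions listed for a column can then be displayed together as variants at the same shared structural position.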


Protein Structure Data


Protein Data Bank Experimental Structures

The Protein Data Bank in Europe (PDBe) [42] was searched for experimental structures of single or tandem titin domain constructs, solved with X-ray crystallography, electron microscopy, or NMR spectroscopy. The following experimental structures are available in TITINdb2 for visualisation of variants on their respective domains:

PDB ID | Reported Domain | UniProt Domain | TITINdb2 Domain | Chain | Solution Method | Resolution
2F8V | Titin-Nterm | Ig-like 1 | Ig-1 | A | X-Ray Diffraction | 2.75 Å
 | | Ig-like 2 | Ig-2 | | |
2A38 | Titin-Nterm | Ig-like 1 | Ig-1 | A | X-Ray Diffraction | 2.00 Å
 | | Ig-like 2 | Ig-2 | | |
6FWX | Titin-Z1 | Ig-like 1 | Ig-1 | A | X-Ray Diffraction | 3.00 Å
 | Titin-Z2 | Ig-like 2 | Ig-2 | | |
6SDB | Titin-Z1 | Ig-like 1 | Ig-1 | A | X-Ray Diffraction | 2.80 Å
 | Titin-Z2 | Ig-like 2 | Ig-2 | | |
1YA5 | Titin-Z1 | Ig-like 1 | Ig-1 | A | X-Ray Diffraction | 2.44 Å
 | Titin-Z2 | Ig-like 2 | Ig-2 | | |
6DL4 | Titin-Z10 | Ig-like 9 | Ig-9 | A | Solution NMR |
1G1C | Titin-I1 | Ig-like 10 | Ig-10 | A | X-Ray Diffraction | 2.10 Å
4QEG | Titin-I10 | Ig-like 16 | Ig-19 | A | X-Ray Diffraction | 2.00 Å
5JDJ | Titin-I10 | Ig-like 16 | Ig-19 | A | X-Ray Diffraction | 1.74 Å
5JDD | Titin-I9 | N/A | Ig-18 | A | X-Ray Diffraction | 1.53 Å
 | Titin-I10 | Ig-like 16 | Ig-19 | | |
 | Titin-I11 | Ig-like 17 | Ig-20 | | |
5JDE | Titin-I9 | N/A | Ig-18 | A | X-Ray Diffraction | 1.90 Å
 | Titin-I10 | Ig-like 16 | Ig-19 | | |
 | Titin-I11 | Ig-like 17 | Ig-20 | | |
2RIK | Titin I67 | Ig-like 64 | Ig-70 | A | X-Ray Diffraction | 1.60 Å
 | Titin I68 | Ig-like 65 | Ig-71 | | |
 | Titin I69 | Ig-like 66 | Ig-72 | | |
3B43 | Titin I65 | Ig-like 62 | Ig-68 | A | X-Ray Diffraction | 3.30 Å
 | Titin I66 | Ig-like 63 | Ig-69 | | |
 | Titin I67 | Ig-like 64 | Ig-70 | | |
 | Titin I68 | Ig-like 65 | Ig-71 | | |
 | Titin I69 | Ig-like 66 | Ig-72 | | |
 | Titin I70 | Ig-like 67 | Ig-73 | | |
5JOE | Titin I81 | N/A | Ig-84 | A | X-Ray Diffraction | 2.00 Å
7AHS | Titin-N2A Ig81 | Ig-like 77 | Ig-84 | A | X-Ray Diffraction | 2.04 Å
 | Titin-N2A Ig82 | Ig-like 78 | Ig-85 | | |
 | Titin-N2A Ig83 | Ig-like 79 | Ig-86 | | |
6YJ0 | Titin I83 | Ig-like 79 | Ig-86 | A | Solution NMR |
1TIT | Titin-I27 | N/A | Ig-94 | A | Solution NMR |
1TIU | Titin-I27 | N/A | Ig-94 | A | Solution NMR |
1WAA | Titin-I27 | N/A | Ig-94 | A | X-Ray Diffraction | 1.80 Å
4O00 | Titin-A3 | FnIII 3 | Fn3-3 | A | X-Ray Diffraction | 1.85 Å
8BXR | Titin I109 | N/A | Ig-109 | A | X-Ray Diffraction | 2.70 Å
 | Titin I110 | FnIII 4 | Fn3-4 | | |
 | Titin I111 | FnIII 5 | Fn3-5 | | |
8BW6 | Titin I110 | FnIII 4 | Fn3-4 | A | X-Ray Diffraction | 1.95 Å
8BVO | Titin I110 | FnIII 4 | Fn3-4 | A | X-Ray Diffraction | 2.55 Å
 | Titin I111 | FnIII 5 | Fn3-5 | | |
1BPV | Titin-A71 | FnIII 62 | Fn3-62 | A | Solution NMR |
3LPW | Titin-A77 | FnIII 66 | Fn3-66 | A | X-Ray Diffraction | 1.65 Å
 | Titin-A78 | FnIII 67 | Fn3-67 | | |
3LCY | Titin-A164 | Ig-like 140 | Ig-156 | A | X-Ray Diffraction | 2.50 Å
 | Titin-A165 | N/A | Ig-157 | | |
2J8H | Titin-A168 | Ig-like 141 | Ig-158 | A | X-Ray Diffraction | 1.99 Å
 | Titin-A169 | Ig-like 142 | Ig-159 | | |
2J8O | Titin-A168 | Ig-like 141 | Ig-158 | A | X-Ray Diffraction | 2.49 Å
 | Titin-A169 | Ig-like 142 | Ig-159 | | |
2ILL | Titin-A168 | Ig-like 141 | Ig-158 | A | X-Ray Diffraction | 2.20 Å
 | Titin-A169 | Ig-like 142 | Ig-159 | | |
2NZI | Titin-A168 | Ig-like 141 | Ig-158 | A | X-Ray Diffraction | 2.90 Å
 | Titin-A169 | Ig-like 142 | Ig-159 | | |
 | Titin-A170 | FnIII 132 | Fn3-132 | | |
4JNW | Titin Kinase | Protein kinase | Kinase-1 | A | X-Ray Diffraction | 2.06 Å
1TKI | Titin Kinase | Protein kinase | Kinase-1 | A | X-Ray Diffraction | 2.00 Å
6YGN | Titin-A168 | FnIII 132 | Fn3-132 | A | X-Ray Diffraction | 2.40 Å
 | Titin Kinase | Protein kinase | Kinase-1 | | |
 | Titin-M1 | Ig-like 143 | Ig-160 | | |
2BK8 | Titin-M1 | Ig-like 143 | Ig-160 | A | X-Ray Diffraction | 1.69 Å
6HCI | Titin-M3 | Ig-like 145 | Ig-162 | A | X-Ray Diffraction | 2.12 Å
6H4L | Titin-M4 | Ig-like 146 | Ig-163 | A | X-Ray Diffraction | 1.60 Å
3QP3 | Titin-M4 | Ig-like 146 | Ig-163 | A | X-Ray Diffraction | 2.00 Å
1NCT | Titin-M5 | Ig-like 147 | Ig-164 | A | Solution NMR |
1NCU | Titin-M5 | Ig-like 147 | Ig-164 | A | Solution NMR |
1TNN | Titin-??? | Ig-like 147 | Ig-164 | A | Solution NMR |
1TNM | Titin-??? | Ig-like 147 | Ig-164 | A | Solution NMR |
3PUC | Titin-M7 | Ig-like 149 | Ig-166 | A | X-Ray Diffraction | 0.96 Å
4C4K | Titin-M10 | Ig-like 152 | Ig-169 | T | X-Ray Diffraction | 1.95 Å
4UOW | Titin-M10 | Ig-like 152 | Ig-169 | 1 | X-Ray Diffraction | 3.30 Å
3Q5O | Titin-M10 | Ig-like 152 | Ig-169 | A | X-Ray Diffraction | 2.05 Å
2Y9R | Titin-M10 | Ig-like 152 | Ig-169 | T | X-Ray Diffraction | 1.90 Å
2WP3 | Titin-M10 | Ig-like 152 | Ig-169 | T | X-Ray Diffraction | 1.48 Å
2WWK | Titin-M10 | Ig-like 152 | Ig-169 | T | X-Ray Diffraction | 1.70 Å
2WWM | Titin-M10 | Ig-like 152 | Ig-169 | D | X-Ray Diffraction | 2.30 Å
3KNB | Titin-Cterm | Ig-like 152 | Ig-169 | A | X-Ray Diffraction | 1.40 Å

All of the above structures represent a complete domain or domains aligning to H. sapiens titin; incomplete structures or epitopes are excluded, as well as titin domains from non-human model organisms. Additionally, two human titin structures available in PDB are not present in TITINdb2: 6I0Y is a large TnaC-stalled ribosome complex to which titin Ig-94 is bound; 2RQ8 is a mutant domain Ig-94, which includes variant Y14326P for analysis of domain destabilisation and thus does not align to the domain sequence.


Homology Modelling of Titin Domains

Prior to the proliferation of Deep Learning-based protein structure prediction algorithms, visualisation of individual titin domains lacking reliable, experimentally solved structures was achieved through the use of template modelling approaches. In the first version of this database, an automated homology modelling pipeline was set up to provide structural coverage to domains lacking solved experimental structures. The pipeline takes a FASTA file of domain sequences as input and uses PDB structures publicly available prior to March 2017 as templates. The overall modelling process can be seen in Fig 2A and a flow diagram detailing the template selection process is depicted in Fig 2B. The template search, modelling and model assessment were performed using Modeller [43], the alignment of query and templates performed using 3DCoffee [39], and the overall pipeline written in Python 2.7.

Fig 2. Flow-diagrams showing A) an outline of the modelling pipeline B) template selection criteria.

The I-TASSER server [44] was used to model Ig-112 as a satisfactory (negative) zDOPE score could not be obtained using Modeller.


AlphaFold2 Predicted Structures

Template-based models in TITINdb2 have been supplemented by models predicted by newer deep learning-based approaches, namely AlphaFold2 [34]. AlphaFold has provided the structural biology community with predictions for all proteins in the human proteome via its online Protein Structure Database [45]; however, owing to the size of titin, its structure cannot be predicted as a single protein. Thus, we obtained the full proteome predictions from the database and extracted the predictions covering the titin protein. These comprise 166 overlapping "fragments" of 1400 residues each, which together provide redundant structural coverage across the entirety of the titin protein. Each individual domain is typically represented in 6-7 fragments; these were extracted using the domain boundaries defined above, and re-numbered using PyMol [46], to produce an ensemble of models for both single and two-domain tandem systems (see Fig. 3).

Fig 3. Overview of the extraction of AF2 domain structures from 1400-residue fragments. Each fragment is searched for a complete match to the sequence of each domain in titin, and matching residues are extracted to a new PDB object. For example, the domain Ig-9 is found in fragments 5-11, each of which produces a slightly different domain structure prediction.

The original AlphaFold2 fragments are named sequentially; for example, the fragment AF-Q8WZ42-F5-model_v1 is the fifth fragment in the proteome data, and covers residues 801-2200 of the canonical sequence of titin. The subsequent fragment, AF-Q8WZ42-F6-model_v1, the sixth fragment, covers residues 1001-2400 of N2AB titin. This continues until the final fragment is reached, AF-Q8WZ42-F166-model_v1, which covers residues 33001-34350. This naming convention continues for the extracted models of domain constructs; for example, the model TTN-IC_Ig-3_AF-Q8WZ42-F5-model_v1.pdb is the single domain Ig-3 (IC isoform) extracted from the fifth fragment of titin in AlphaFoldDB.
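The fragment numbering above implies a sliding window of 1400 residues advancing by 200 residues per fragment (the step is inferred from the quoted ranges 801-2200 and 1001-2400). A minimal sketch of that arithmetic follows; the function names are hypothetical and the example domain coordinates are invented, not the real boundaries of any titin domain.

```python
FRAGMENT_LENGTH = 1400    # residues per fragment
FRAGMENT_STEP = 200       # offset between consecutive fragments (inferred)
CANONICAL_LENGTH = 34350  # length of the canonical (N2AB) isoform


def n_fragments(seq_len=CANONICAL_LENGTH):
    """Number of overlapping fragments needed to reach the end of the sequence."""
    extra = max(0, seq_len - FRAGMENT_LENGTH)
    return -(-extra // FRAGMENT_STEP) + 1  # ceiling division, plus the first fragment


def fragment_range(n, seq_len=CANONICAL_LENGTH):
    """1-based inclusive residue range covered by fragment n."""
    start = (n - 1) * FRAGMENT_STEP + 1
    end = min(start + FRAGMENT_LENGTH - 1, seq_len)
    return start, end


def fragment_name(n, accession="Q8WZ42"):
    """Model name following the naming pattern quoted in the text."""
    return f"AF-{accession}-F{n}-model_v1"


def fragments_containing(dom_start, dom_end, seq_len=CANONICAL_LENGTH):
    """Fragments that cover a domain completely."""
    result = []
    for n in range(1, n_fragments(seq_len) + 1):
        start, end = fragment_range(n, seq_len)
        if start <= dom_start and dom_end <= end:
            result.append(n)
    return result


print(n_fragments())                     # 166
print(fragment_range(5))                 # (801, 2200)
print(fragment_range(166))               # (33001, 34350)
print(fragments_containing(2100, 2190))  # invented domain coordinates
```

A domain of roughly 90 residues is fully contained in six or seven consecutive windows under this scheme, consistent with the redundancy described above.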

To compare the relative quality of models from Modeller and AlphaFold2, we used a local version of MolProbity [47] to compute summary statistics for each individual model. Two metrics are reported for all models: MolProbity score, an estimate of "effective resolution" calculated from rotamer, Ramachandran, and clash scores; and Rama Z-score, a single-value summary of Ramachandran favoured/outlier residues. Available structures for each domain may be sorted by either of these metrics. By MolProbity score, AlphaFold2 models were on average estimated to be of substantially higher quality than Modeller homology models, and the best-ranked AlphaFold2 models for each domain compare favourably with the scores of available crystal structures. By Rama Z-score, however, there is little difference between the two sets of models, and almost all models are within accepted limits for model quality (absolute Rama Z-score < 2.0).

Fig 4. Comparison of estimated quality of models from AlphaFold2 and Modeller with the estimated quality of X-ray crystal structures.

As the canonical sequence was used by AlphaFold2 to model all fragments, there is no coverage of the domains located in exons 46, 47 and 48 of the TTN gene. These correspond to Ig-25 (IC, novex-1), Ig-26 (IC, novex-2), and the alternative novex-3 domains (denoted in TITINdb2 as Ig-25_novex3, Ig-26_novex3, Ig-27_novex3, Ig-28_novex3, Ig-29_novex3, and Ig-30_novex3). To remedy this, we used an adaptation of the AlphaFold2 pipeline distributed through Google Colaboratory, ColabFold [48], which allows for remote access to GPUs for prediction, and incorporates a faster but less exhaustive multiple sequence alignment search. The models produced are also not relaxed by the Amber force field because of memory limitations. The resulting ColabFold structures are of lower general quality than the AlphaFold2 predicted structures, but are still of high overall quality compared to template-based models.


Definition of Structural Elements

Residues in surface and core regions were defined using POPS [49]. Residues with a Q(SASA) (quotient solvent accessible surface area) > 0.3 were defined as being surface residues and those with a Q(SASA) ≤ 0.3 defined as core residues. Here Q(SASA) is defined as the quotient of the SASA (solvent-accessible surface area) and the surface area of the isolated residue, using the default probe radius of 1.4 Å. With the introduction of AlphaFold2 structural models to TITINdb2, we have also annotated domain-domain interface residues in domains; these reflect residues which appear surface-exposed when the single domain is assessed, but are buried when the adjacent domains are included in the model. To determine interface residues, tandem domains (domains separated by a linker of < 10 residues) were taken and the SASA of each residue was calculated. Residues were annotated as interface-buried if the SASA of the residue in the single-domain system decreased by at least 30 Å² in the tandem-domain system (see Fig. 5). This was repeated for all available fragments in which the domain tandem was represented, and residues confirmed as interface-buried if the majority of fragments were in agreement (> 50% interface-buried).

Fig 5. Definition of interface-buried residues. a) Example interface-buried region in titin Ig1, where SASA is decreased by 38.73 Å² when Ig2 is included in the construct (fragment F1 for comparison). b) Example distribution of interface-buried residues on domain of interest, on both proximal and terminal domain-domain interfaces (interface with preceding and following domains, respectively).

A second category, "boundary", was also defined for residues identified as interface-buried that lie within 10 residues of the N- or C-terminus of the domain; this prevents residues in linkers or at the domain boundaries from being erroneously labelled as interface-buried.
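The classification rules in this section can be sketched as below. This is an illustrative reading of the thresholds stated above, not the TITINdb2 implementation: SASA values are assumed to have been computed already (for example by POPS), the function names are hypothetical, and treating "within 10 residues of a terminus" as the first or last 10 domain positions is an assumption.

```python
SURFACE_Q_SASA = 0.3   # Q(SASA) above this value is "surface"
MIN_BURIAL = 30.0      # minimum SASA decrease (in square angstroms)
BOUNDARY_WINDOW = 10   # residues from either terminus (assumed inclusive)


def surface_or_core(q_sasa):
    """Classify a residue from its relative solvent accessibility."""
    return "surface" if q_sasa > SURFACE_Q_SASA else "core"


def is_buried(sasa_single, sasa_tandem):
    """True if SASA drops by at least MIN_BURIAL when the neighbour is added."""
    return sasa_single - sasa_tandem >= MIN_BURIAL


def interface_label(per_fragment_sasa, index, domain_length):
    """Label a residue from (single, tandem) SASA pairs, one pair per fragment.

    index is the 1-based position of the residue within the domain.
    Returns "interface-buried", "boundary", or None.
    """
    votes = [is_buried(single, tandem) for single, tandem in per_fragment_sasa]
    if not votes or sum(votes) / len(votes) <= 0.5:
        return None  # no majority of fragments agree
    near_terminus = (index <= BOUNDARY_WINDOW
                     or index > domain_length - BOUNDARY_WINDOW)
    return "boundary" if near_terminus else "interface-buried"


# Invented SASA values for one residue across three fragments.
example = [(80.0, 41.27), (78.0, 45.0), (75.0, 60.0)]
print(surface_or_core(0.45))
print(interface_label(example, index=45, domain_length=90))
print(interface_label(example, index=3, domain_length=90))
```

In the invented example two of three fragments show a decrease of at least 30 square angstroms, so the residue passes the majority vote; whether it is then reported as interface-buried or boundary depends only on its distance from the domain termini.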


Protein-Protein Interactions and Site Annotations

The impact of SNVs on protein-protein interactions was assessed using mCSM [13], where experimental binary complexes were available. The following structures were used for these calculations:

Domain | PDB Structure | Solution Method | Resolution | Interacting Protein
Ig-1 | 1YA5 | X-ray Diffraction | 2.44 Å | TCAP
Ig-2 | 1YA5 | X-ray Diffraction | 2.44 Å | TCAP
Ig-169 | 3KNB | X-ray Diffraction | 1.40 Å | OBSL1
Ig-169 | 4C4K | X-ray Diffraction | 1.95 Å | OBSCN

Putative PPI interface regions were predicted using SPPIDER II [50] with a balanced trade-off between sensitivity and specificity (SPPIDER estimates this based on a control data set of 149 protein chains with no sequence homology).

Site annotations, including ligand binding sites and modified residues, were obtained from UniProt [51].


TITINdb2 Variant Converter

The titin variant conversion tool was developed to help clinicians and researchers working with different standard formats for titin variants to translate variants quickly and reliably, in bulk, between protein, transcript, and chromosomal positions in the seven major isoforms (IC, N2AB, N2B, N2A, novex-1, novex-2, novex-3). In brief, the user enters the query variants into the provided textbox (whitespace separated) and selects the desired input and output formats; the tool then returns a list of the input variants translated to the output format, again whitespace separated.

The conversion from SAV to SNV relies on coercion of the wild-type codon to a mutant codon matching the mutated residue; if multiple codon substitutions match the amino acid substitution, the tool will only report the first match. Additionally, the tool only supports the conversion of SNVs and SAVs; it will not, for example, translate an SAV that corresponds to an indel at the genetic level (e.g. the SAV C31712R may be coerced to the SNV 95359T>C, but the theoretical SAV C31712K could not be coerced to the indel 95359_95360delinsAA). The format of input variants should follow the simple variant nomenclature syntax outlined below, and not the standard HGVS formatting for variants:

Format (IC) | Example HGVS Nomenclature | TITINdb2 Variant Converter
Chromosome | NC_000002.12:g.178546102A>G | 178546102A>G
Transcript | NM_001267550.2:c.95359T>C | 95359T>C
Protein | NP_001254479:p.Cys31712Arg | C31712R
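The codon coercion described above can be sketched as follows. This is an illustrative example, not the converter's code: the search order (codon position first, then base in T, C, A, G order) is an assumption, and the wild-type codon TGT used for C31712 is assumed for the example (the text only implies that the first base of the codon is T).

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def sav_to_snv(wt_codon, codon_start, mutant_aa):
    """Return the first single-nucleotide change giving mutant_aa, or None.

    wt_codon:    wild-type codon (DNA, upper case)
    codon_start: transcript position of the first base of the codon
    mutant_aa:   one-letter code of the substituted residue
    """
    for offset in range(3):
        ref = wt_codon[offset]
        for alt in BASES:
            if alt == ref:
                continue
            mutant = wt_codon[:offset] + alt + wt_codon[offset + 1:]
            if CODON_TABLE[mutant] == mutant_aa:
                return f"{codon_start + offset}{ref}>{alt}"
    return None  # the substitution needs more than one nucleotide change


print(sav_to_snv("TGT", 95359, "R"))  # C31712R
print(sav_to_snv("TGT", 95359, "K"))  # C31712K is not reachable by one SNV
```

Because only the first match is reported, a substitution that can arise from several different SNVs (for example serine to arginine from codon AGT) yields a single representative SNV rather than the full list.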

In addition to selecting isoform and format by name, users may select input and output formats using accession IDs derived from RefSeq [36], UniProt [51] and Ensembl [52]. These resources were also used to cross-reference sequence data for protein and transcript isoforms. Data for genetic sequences, exon boundaries, and chromosomal numbering were derived from CardioDB [53][54].

Note that the 225 bp upstream non-translated sequence is included in the numbering for all titin transcripts; thus, the ATG start codon for the initial methionine residue in all isoforms is numbered as 226-228. This may be inconsistent with other numbering schemes that omit these nucleotides, so users should be aware of these potential differences.
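The effect of this offset on numbering can be sketched as follows, assuming only the 225-nucleotide offset stated above; the function names are hypothetical.

```python
UTR_LENGTH = 225  # upstream non-translated nucleotides included in the numbering


def transcript_to_residue(tx_pos):
    """Convert a transcript position to (residue number, position within codon)."""
    cds_pos = tx_pos - UTR_LENGTH
    if cds_pos < 1:
        raise ValueError("position lies in the upstream non-translated sequence")
    residue_index, codon_index = divmod(cds_pos - 1, 3)
    return residue_index + 1, codon_index + 1


def residue_to_transcript(residue):
    """Return the transcript positions of the first and last base of a codon."""
    first = UTR_LENGTH + 3 * (residue - 1) + 1
    return first, first + 2


def to_coding_numbering(tx_pos):
    """Position in a scheme that starts counting at the A of the ATG codon."""
    return tx_pos - UTR_LENGTH


print(residue_to_transcript(1))      # (226, 228): the ATG start codon
print(transcript_to_residue(95359))  # (31712, 1): first base of codon 31712
print(to_coding_numbering(95359))    # 95134
```

The example variant 95359T>C therefore falls on the first base of codon 31712, in agreement with the protein-level variant C31712R.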


Citing TITINdb2

When using this tool in publication, please cite Laddach, A., M. Gautel and F. Fraternali (2017). "TITINdb2-a computational tool to assess titin's role as a disease gene". Bioinformatics 33(21): 3482-3485.

To cite the literature review for disease-associated missense variants in titin: Weston, T., Rees, M., Gautel, M. & Fraternali, F. "Walking with giants: The challenges of variant impact assessment in the giant sarcomeric protein titin". WIREs Mechanisms of Disease, e1638.


References


  1. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68-74 (2015).
  2. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434-443 (2020).
  3. Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29, 308-311 (2001).
  4. The Genome Aggregation Database. https://gnomad.broadinstitute.org/
  5. UK Biobank - Enabling scientific discoveries that improve human health. https://www.ukbiobank.ac.uk/
  6. gnomAD v4.0 Announcement. https://gnomad.broadinstitute.org/news/2023-11-gnomad-v4-0/
  7. Chauveau, C., Rowell, J. & Ferreiro, A. A Rising Titan: TTN Review and Mutation Update. Human Mutation 35, 1046-1059 (2014).
  8. Landrum, M. J. et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res 46, D1062-D1067 (2018).
  9. Weston, T. G. R., Rees, M., Gautel, M. & Fraternali, F. Walking with giants: The challenges of variant impact assessment in the giant sarcomeric protein titin. WIREs Mechanisms of Disease, e1638.
  10. Fowler, D. M. & Fields, S. Deep mutational scanning: a new style of protein science. Nat Methods 11, 801-807 (2014).
  11. Adzhubei, I., Jordan, D. M. & Sunyaev, S. R. Predicting Functional Effect of Human Missense Mutations Using PolyPhen-2. Curr Protoc Hum Genet Chapter 7, Unit7.20 (2013).
  12. Ponzoni, L., Peñaherrera, D. A., Oltvai, Z. N. & Bahar, I. Rhapsody: predicting the pathogenicity of human missense variants. Bioinformatics 36, 3084-3092 (2020).
  13. Pires, D. E. V., Ascher, D. B. & Blundell, T. L. mCSM: predicting the effects of mutations in proteins using graph-based signatures. Bioinformatics 30, 335-342 (2014).
  14. Pires, D. E. V., Ascher, D. B. & Blundell, T. L. DUET: a server for predicting effects of mutations on protein stability using an integrated computational approach. Nucleic Acids Res 42, W314-W319 (2014).
  15. González-Pérez, A. & López-Bigas, N. Improving the Assessment of the Outcome of Nonsynonymous SNVs with a Consensus Deleteriousness Score, Condel. Am J Hum Genet 88, 440-449 (2011).
  16. Ioannidis, N. M. et al. REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. Am J Hum Genet 99, 877-885 (2016).
  17. Functional ANnotations for Non-Synonymous SNVs. https://bbglab.irbbarcelona.org/fannsdb/
  18. Cheng, J. et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 381, eadg7492 (2023).
  19. Cheng, J. et al. Predictions for AlphaMissense. (2023) doi:10.5281/zenodo.8360242
  20. Liu, X., Li, C., Mou, C., Dong, Y. & Tu, Y. dbNSFP v4: a comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs. Genome Medicine 12, 103 (2020).
  21. Rentzsch, P., Witten, D., Cooper, G. M., Shendure, J. & Kircher, M. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res 47, D886-D894 (2019).
  22. Shihab, H. A. et al. Predicting the Functional, Molecular, and Phenotypic Consequences of Amino Acid Substitutions using Hidden Markov Models. Hum Mutat 34, 57-65 (2013).
  23. Jagadeesh, K. A. et al. M-CAP eliminates a majority of variants of uncertain significance in clinical exomes at high sensitivity. Nat Genet 48, 1581-1586 (2016).
  24. Dong, C. et al. Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Hum Mol Genet 24, 2125-2137 (2015).
  25. Samocha, K. E. et al. Regional missense constraint improves variant deleteriousness prediction. bioRxiv 148353 (2017) doi:10.1101/148353.
  26. Reva, B., Antipin, Y. & Sander, C. Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res 39, e118 (2011).
  27. Schwarz, J. M., Cooper, D. N., Schuelke, M. & Seelow, D. MutationTaster2: mutation prediction for the deep-sequencing age. Nat Methods 11, 361-362 (2014).
  28. Pejaver, V. et al. Inferring the molecular and phenotypic impact of amino acid variants with MutPred2. Nat Commun 11, 5918 (2020).
  29. Qi, H. et al. MVP predicts the pathogenicity of missense variants by deep learning. Nat Commun 12, 510 (2021).
  30. Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat Genet 50, 1161-1170 (2018).
  31. Choi, Y., Sims, G. E., Murphy, S., Miller, J. R. & Chan, A. P. Predicting the Functional Effect of Amino Acid Substitutions and Indels. PLOS ONE 7, e46688 (2012).
  32. REVEL: Rare Exome Variant Ensemble Learner - REVEL genome segment files. https://sites.google.com/site/revelgenomics/downloads/revel-genome-segment-files
  33. Ng, P. C. & Henikoff, S. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res 31, 3812-3814 (2003).
  34. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583-589 (2021).
  35. Finn, R. D., Clements, J. & Eddy, S. R. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res 39, W29-37 (2011).
  36. O'Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res 44, D733-745 (2016).
  37. Finn, R. D. et al. Pfam: the protein families database. Nucleic Acids Res 42, D222-230 (2014).
  38. Crooks, G. E., Hon, G., Chandonia, J.-M. & Brenner, S. E. WebLogo: a sequence logo generator. Genome Res 14, 1188-1190 (2004).
  39. Notredame, C., Higgins, D. G. & Heringa, J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 302, 205-217 (2000).
  40. Myers, E. W. & Miller, W. Optimal alignments in linear space. Comput Appl Biosci 4, 11-17 (1988).
  41. Pei, J., Kim, B.-H. & Grishin, N. V. PROMALS3D: a tool for multiple protein sequence and structure alignments. Nucleic Acids Res 36, 2295-2300 (2008).
  42. Armstrong, D. R. et al. PDBe: improved findability of macromolecular structure data in the PDB. Nucleic Acids Res 48, D335-D343 (2020).
  43. Eswar, N. et al. Comparative protein structure modeling using Modeller. Curr Protoc Bioinformatics Chapter 5, Unit-5.6 (2006).
  44. Zhang, Y. I-TASSER server for protein 3D structure prediction. BMC Bioinformatics 9, 40 (2008).
  45. AlphaFold Protein Structure Database. https://alphafold.ebi.ac.uk/
  46. Schrödinger, LLC. The PyMOL Molecular Graphics System, Version 1.8. (2015).
  47. Williams, C. J. et al. MolProbity: More and better reference data for improved all-atom structure validation. Protein Sci 27, 293-315 (2018).
  48. Mirdita, M. et al. ColabFold: making protein folding accessible to all. Nat Methods 19, 679-682 (2022).
  49. Cavallo, L., Kleinjung, J. & Fraternali, F. POPS: A fast algorithm for solvent accessible surface areas at atomic and residue level. Nucleic Acids Res 31, 3364-3366 (2003).
  50. Porollo, A. & Meller, J. Prediction-based fingerprints of protein-protein interactions. Proteins 66, 630-645 (2007).
  51. The UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res 47, D506-D515 (2019).
  52. Martin, F. J. et al. Ensembl 2023. Nucleic Acids Res 51, D933-D941 (2023).
  53. Roberts, A. M. et al. Integrated allelic, transcriptional, and phenomic dissection of the cardiac effects of titin truncations in health and disease. Sci Transl Med 7, 270ra6 (2015).
  54. Titin Variation in Dilated Cardiomyopathy - Cardiovascular Genetics & Genomics, Imperial College London. https://www.cardiodb.org/titin/titin_transcripts.php