A benchmark database for variations




Download

1. Variation datasets affecting protein tolerance

Dataset of neutral single nucleotide polymorphisms

This neutral dataset comprises 21,170 human non-synonymous coding SNPs with allele frequency >0.01 and chromosome sample count >49 from dbSNP build 131. Disease-associated SNPs were filtered out of this dataset. The variant position mappings were extracted from dbSNP. The dataset is available for download below as an Excel file.

Download: Neutral_Dataset
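
The selection criteria for the neutral dataset can be sketched as a simple filter. This is only an illustration: the record layout and field names are hypothetical and do not reflect the actual spreadsheet columns, and the thresholds assume the criteria are read as an allele frequency above 0.01 and a chromosome sample count above 49.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnpRecord:
    """Hypothetical record layout; the real spreadsheet columns may differ."""
    rs_id: str
    allele_frequency: float        # expressed as a fraction between 0 and 1
    chromosome_sample_count: int
    disease_associated: bool


def is_neutral_candidate(snp, min_frequency=0.01, min_samples=49):
    """Keep common, well-sampled SNPs with no known disease association.

    The default thresholds are assumptions taken from our reading of the
    dataset description, not values confirmed by the database authors.
    """
    return (
        snp.allele_frequency > min_frequency
        and snp.chromosome_sample_count > min_samples
        and not snp.disease_associated
    )


def select_neutral(snps):
    """Return only the records that pass every neutral-dataset criterion."""
    return [snp for snp in snps if is_neutral_candidate(snp)]
```

Keeping the thresholds as parameters makes it easy to repeat the selection with stricter or looser cut-offs.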

Dataset of pathogenic single nucleotide polymorphisms

This is the pathogenic dataset of 19,335 missense mutations obtained from the PhenCode database (downloaded in June 2009), IDbases, and 18 individual LSDBs. For this dataset, the variations and their position mappings to RefSeq protein (>=99% match), RefSeq mRNA, and RefSeq genomic sequences are available for download.

Download: Pathogenic_Dataset
Reference: Thusberg J, Olatubosun A, Vihinen M. Performance of mutation pathogenicity prediction methods on missense variants. Hum Mutat. 2011, 32(4):358-68.   PUBMED  

2. Variation datasets affecting protein stability

These benchmark datasets of variations affecting protein stability were collected from the literature. For these datasets, residue-to-residue mappings from the structure entries in PDB to the sequence entries in UniProt are available for download as Excel tables below.

Dataset 1

This dataset contains 1,784 mutations from 80 proteins with experimentally determined ΔΔG values in ProTherm (ProTherm update Dec. 19, 2008). It consists of 1,154 positive cases, of which 931 are destabilizing (ΔΔG ≥ 0.5 kcal/mol) and 222 are stabilizing (ΔΔG ≤ -0.5 kcal/mol), and 631 neutral cases (0.5 kcal/mol ≥ ΔΔG ≥ -0.5 kcal/mol).

Download: Dataset 1
References:
Khan S, Vihinen M. Performance of protein stability predictors. Hum Mutat. 2010, 31(6):675-684.
Kumar MD, Bava KA, Gromiha MM, Prabakaran P, Kitajima K, Uedaira H, Sarai A: ProTherm and ProNIT: thermodynamic databases for proteins and protein-nucleic acid interactions. Nucleic Acids Res 2006, 34(Database issue):D204-206.   PUBMED  
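
The ΔΔG thresholds used to label Dataset 1 can be sketched as a small classification function. This is a minimal illustration, not the authors' code. Because the stated ranges overlap at exactly ±0.5 kcal/mol, the sketch assumes boundary values belong to the destabilizing and stabilizing classes; the function name and the label strings are our own.

```python
def classify_ddg(ddg, threshold=0.5):
    """Label a ΔΔG value (kcal/mol) using the Dataset 1 cut-offs.

    Assumption: values exactly at +threshold or -threshold are treated as
    destabilizing or stabilizing respectively, not as neutral.
    """
    if ddg >= threshold:
        return "destabilizing"
    if ddg <= -threshold:
        return "stabilizing"
    return "neutral"
```

For example, `classify_ddg(1.2)` returns `"destabilizing"` and `classify_ddg(0.1)` returns `"neutral"`.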

Dataset 2

This dataset of 2,156 variations was compiled from a list of 964 single mutations (Guerois et al., 2002) and a set of 2,972 single variations obtained from the ProTherm database (Kumar et al., 2006), after filtering out duplicate entries. NMR-determined structures are excluded from this dataset, and where several ΔΔG values were present for a single variation, only their average is given.

Download: Dataset 2

Reference: Potapov V, Cohen M, Schreiber G. Assessing computational methods for predicting protein stability upon mutation: good on average but not in the details. Protein Eng Des Sel. 2009, 22(9):553-560.   PUBMED  
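
The averaging step described for Dataset 2 can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the procedure used by Potapov et al.: it assumes each measurement is a pair of a variation key and a ΔΔG value, and the key format (here a tuple of protein identifier and mutation string) is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def average_ddg(measurements):
    """Collapse repeated measurements into one mean ΔΔG per variation.

    `measurements` is an iterable of (variation_key, ddg) pairs, where
    variation_key is any hashable identifier of a single variation.
    """
    grouped = defaultdict(list)
    for key, ddg in measurements:
        grouped[key].append(ddg)
    # One entry per variation, holding the arithmetic mean of its values.
    return {key: mean(values) for key, values in grouped.items()}
```

A variation measured once keeps its single value, and a variation measured several times is reduced to the arithmetic mean of those values.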

Dataset 3

This dataset is composed of two sub-datasets. One is the training dataset, containing 339 mutants experimentally studied in nine proteins; the other is the test dataset, containing 625 variants from ProTherm.

Training dataset: 339 variants from 9 proteins.  Download: Dataset 3(a)
Blind test dataset: 625 variants from 28 proteins. Download: Dataset 3(b)

Reference: Guerois R, Nielsen JE, Serrano L. Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. J Mol Biol. 2002, 320(2):369-387.   PUBMED  

Dataset 4

This dataset is derived from the July 2003 release of ProTherm and contains two sub-datasets. The first one, S1615, was used for training and testing the neural network system. The second one, S388, was used as the test set and contains 388 variations collected only at physiological conditions. S388 is a subset of S1615. Only single variations with ΔΔG values in ProTherm and structures deposited in PDB are present in the datasets.

  1. Training dataset: S1615 - 1615 variants from 42 proteins. Download: Dataset 4 (a)
  2. Test dataset: S388 (subset of the first) - 388 variants from 17 proteins. Download: Dataset 4(b)

Reference: Capriotti E, Fariselli P, Casadio R. A neural-network-based method for predicting protein stability changes upon single point mutations. Bioinformatics. 2004, 20 Suppl 1:i63-68.   PUBMED  

3. Variation datasets affecting transcription factor binding sites

This dataset of 21 experimentally proven variations was collected by a literature search for known variations affecting transcription factor (TF) binding. These variations were used to set a threshold for a relevant change in TF binding affinity. The variation position mapping is available at the genomic (RefSeq), gene (RefSeq Gene) and cDNA (RefSeq mRNA) levels.

Download: TFBS data
Reference: Laurila K and Lähdesmäki H. Systematic analysis of disease-related regulatory variation classes reveals distinct effects on transcription factor binding. In Silico Biol. 2009, 9, 209-224.   PUBMED  

4. Variation datasets affecting mRNA splice sites

This dataset contains 19 variants, 13 in MLH1 and 6 in MSH2, identified in patients by DHPLC and sequencing of the MLH1 and MSH2 exonic regions. The variation positions in this dataset have been mapped to RefSeq mRNA and RefSeq protein accessions where applicable.

Download: mlh1 msh2 variants
Reference: Arnold S, Buchanan DD, Barker M, Jaskowski L, Walsh MD, Birney G, Woods MO, Hopper JL, Jenkins MA, Brown MA et al. Classifying MLH1 and MSH2 variants using bioinformatic prediction, splicing assays, segregation, and tumor characteristics.  Hum. Mutat. 2009, 30, 757-770.   PUBMED  

5. Links to the DBASS3 and DBASS5 databases

DBASS3 is a database of aberrant 3' splice sites induced by human disease-causing mutations. It contains 307 records (152 in exons and 155 in introns). DBASS5 is a similar database covering aberrant 5' splice sites induced by human disease-causing mutations. It contains 577 records (277 in exons and 300 in introns). Both databases are regularly updated and publicly accessible.

http://www.som.soton.ac.uk/research/geneticsdiv/dbass5/
http://www.som.soton.ac.uk/research/geneticsdiv/dbass3/

References

Buratti E, Chivers M, Kralovicova J, Romano M, Baralle M, Krainer AR, Vorechovsky I. Aberrant 5' splice sites in human disease genes: mutation pattern, nucleotide structure and comparison of computational tools that predict their utilization. Nucleic Acids Res. 2007, 35(13):4250-4263.   PUBMED  

Vorechovsky I. Aberrant 3' splice sites in human disease genes: mutation pattern, nucleotide structure and comparison of computational tools that predict their utilization. Nucleic Acids Res. 2006, 34(16):4630-4641.   PUBMED