Abstract / Latest Developments

HS3D (Homo Sapiens Splice Sites Dataset) is a data set of Homo Sapiens Exon, Intron and Splice regions extracted from GenBank Rel.123.

The aim of this data set is to give standardized material to train and to assess the prediction accuracy of computational approaches for gene identification and characterization.

From the complete GenBank (Primate Sequences Division) Rel.123 (162,557 entries), entries of Human Nuclear DNA including Complete CDS and more than one Exon have been selected, and 4523 exons and 3802 introns have been extracted from these entries.

Details about the extracted exons and introns are reported (Locus, number, Start and End position in the entry, sequence, length, G+C content, presence of non-AGCT characters (nucleotide scan check)).

Statistics are also reported (overall nucleotides, average G+C content, nucleotide scan check results, number of introns not starting with GT or not ending with AG, minimum / maximum / average length, length standard deviation).
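The per-sequence details and summary statistics listed above can be sketched in a few lines of Python. This is only an illustration, not the procedure used to build HS3D: the function names `describe` and `summarize_introns` are hypothetical, the average G+C content is assumed to be computed over all nucleotides pooled together, and the population standard deviation is assumed, since the page does not say which variant was used.

```python
import statistics

def describe(seq):
    """Length, G+C fraction and nucleotide scan check for one sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return {
        "length": len(seq),
        "gc_content": gc / len(seq) if seq else 0.0,
        # Scan check: True if any character is not one of A, G, C, T.
        "has_non_agct": any(base not in "AGCT" for base in seq),
    }

def summarize_introns(introns):
    """Summary statistics over a non-empty list of intron sequences."""
    introns = [s.upper() for s in introns]
    lengths = [len(s) for s in introns]
    total = sum(lengths)
    gc = sum(s.count("G") + s.count("C") for s in introns)
    return {
        "total_nucleotides": total,
        # Assumption: G+C content pooled over all nucleotides.
        "gc_content": gc / total,
        "not_gt_start": sum(1 for s in introns if not s.startswith("GT")),
        "not_ag_end": sum(1 for s in introns if not s.endswith("AG")),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": statistics.mean(lengths),
        # Assumption: population (not sample) standard deviation.
        "length_std": statistics.pstdev(lengths),
    }
```

Calling `describe` on each extracted exon or intron gives the per-entry fields, while `summarize_introns` gives the aggregate figures for a whole list.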

3799+3799 donor and acceptor sites have been extracted as windows of 140 nucleotides around each splice site. After discarding sequences that do not include canonical GT–AG junctions (65+74), that contain insufficient data (not enough material for a 140-nucleotide window) (686+589), that include non-AGCT bases (29+30), or that are redundant (218+226), there are 2796+2880 windows.
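The filtering steps above can be illustrated with a minimal Python sketch. It is not the HS3D extraction code. It assumes that each window has 70 nucleotides on either side of the exon/intron junction (the page only gives the total of 140), that junctions are 0-based indices of the first base after the boundary, and that "redundant" means an exact duplicate of a window already kept. The name `select_windows` and the discard labels are hypothetical.

```python
from collections import Counter

FLANK = 70  # assumption: 70 nt on each side of the junction, 140 nt in total

def select_windows(sites, kind, flank=FLANK):
    """Filter (sequence, junction) pairs into clean splice-site windows.

    kind is "donor" (GT expected right after the junction) or
    "acceptor" (AG expected right before the junction).
    Returns the kept windows and a Counter of discard reasons.
    """
    if kind == "donor":
        signal, offset = "GT", 0
    elif kind == "acceptor":
        signal, offset = "AG", -2
    else:
        raise ValueError("kind must be 'donor' or 'acceptor'")

    kept, seen, discarded = [], set(), Counter()
    for seq, junction in sites:
        seq = seq.upper()
        start = junction + offset
        # 1. Canonical junction check.
        if start < 0 or seq[start:start + 2] != signal:
            discarded["non_canonical"] += 1
            continue
        # 2. Enough material for a full window.
        lo, hi = junction - flank, junction + flank
        if lo < 0 or hi > len(seq):
            discarded["insufficient_data"] += 1
            continue
        window = seq[lo:hi]
        # 3. Only A, G, C, T allowed.
        if set(window) - set("AGCT"):
            discarded["non_agct"] += 1
            continue
        # 4. Redundancy (assumed here to mean exact duplicates).
        if window in seen:
            discarded["redundant"] += 1
            continue
        seen.add(window)
        kept.append(window)
    return kept, discarded
```

The checks are applied in the order in which the discard counts are listed, so each sequence is counted under the first reason that rejects it.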

Finally, there are 271,937+332,296 windows of false splice sites, selected by searching for canonical GT–AG pairs in non-splicing positions. False sites within +/- 60 nucleotides of a true splice site are marked as proximal.
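A rough Python sketch of the false-site search described above follows. It is an illustration under stated assumptions, not the original procedure: false donors are assumed to be GT dinucleotides and false acceptors AG dinucleotides that do not sit at a known true site, positions are 0-based indices of the first base of the dinucleotide, and the distance to a true site is measured between those indices. The function name `false_sites` is hypothetical.

```python
def false_sites(seq, true_positions, signal="GT", proximal_range=60):
    """Find occurrences of `signal` that are not true splice sites.

    Returns a list of (position, is_proximal) tuples, where is_proximal
    is True when the false site lies within `proximal_range` positions
    of any true splice site.
    """
    seq = seq.upper()
    true_set = set(true_positions)
    results = []
    pos = seq.find(signal)
    while pos != -1:
        if pos not in true_set:
            proximal = any(abs(pos - t) <= proximal_range for t in true_set)
            results.append((pos, proximal))
        # Continue the scan one base further along.
        pos = seq.find(signal, pos + 1)
    return results
```

For false acceptor sites the same function would be called with `signal="AG"` and the positions of the true acceptor dinucleotides.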

HS3D is available in the "Downloads" section of this site.

References

1. P. Pollastro, S. Rampone (2002). HS3D, a Dataset of Homo Sapiens Splice Regions, and its Extraction Procedure from a Major Public Database, International Journal of Modern Physics C, 13(8), 1105-1117. (please cite this paper)
2. P. Pollastro, S. Rampone (2003). HS3D: Homo Sapiens Splice Site Data Set, Nucleic Acids Research, 2003 Annual Database Issue.

 


 Updates
Release 1.2
June 16, 2003
 released by
Pasquale Pollastro
Salvatore Rampone

Università del Sannio

Last update: May 3, 2020