HS3D (Homo Sapiens Splice Sites Dataset) is a data set of Homo Sapiens Exon, Intron and
Splice regions extracted from GenBank Rel.123.
The aim of this data set is to give standardized
material to train and to assess the prediction accuracy of computational approaches for
gene identification and characterization.
From the complete GenBank (Primate Sequences Division) Rel.123 (162,557 entries),
entries of Human Nuclear DNA including a Complete CDS and more than one Exon
have been selected, and 4523 exons and 3802 introns have been extracted from
these entries.
Details about the extracted exons and introns are reported (Locus, number, Start and End
position in the entry, sequence, length, G+C content, presence of non-AGCT data
(nucleotide scan check)).
Statistics are also reported (overall nucleotides, average G+C content,
nucleotide scan check results, number of introns not starting with GT / not ending with AG,
minimum / maximum / average length, length standard deviation).
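The per-sequence details and summary statistics listed above could be computed along the following lines. This is only a minimal sketch, not the procedure used to build HS3D: the function names are hypothetical, "average G+C content" is assumed to be the mean of the per-sequence fractions, and the standard deviation is assumed to be the population one.

```python
from statistics import mean, pstdev


def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def has_non_acgt(seq):
    """Nucleotide scan check: True if seq contains any symbol other than A, C, G, T."""
    return any(base not in "ACGT" for base in seq.upper())


def summarize_introns(introns):
    """Summary statistics for a non-empty list of intron sequences.

    Assumptions (not stated in the text): mean of per-sequence G+C
    fractions, population standard deviation of the lengths.
    """
    lengths = [len(s) for s in introns]
    return {
        "total_nucleotides": sum(lengths),
        "mean_gc": mean(gc_content(s) for s in introns),
        "non_acgt_sequences": sum(has_non_acgt(s) for s in introns),
        "not_gt_start": sum(not s.upper().startswith("GT") for s in introns),
        "not_ag_end": sum(not s.upper().endswith("AG") for s in introns),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": mean(lengths),
        "length_stdev": pstdev(lengths),
    }
```

The same helpers apply to exons, except for the GT/AG boundary counts, which only make sense for introns.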
3799+3799 donor and acceptor sites have been extracted as windows of 140 nucleotides
around each splice site. After discarding sequences that do not include canonical GT-AG
junctions (65+74), that include insufficient data (not enough material for a
140 nucleotide window) (686+589), that include non-AGCT bases (29+30), or that are
redundant (218+226), there are 2796+2880 windows.
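The window extraction and the four discarding steps can be sketched as follows. The text states only the window length, so the sketch assumes the window is centred on the exon/intron boundary with 70 nucleotides on each side; the function names, the 0-based `boundary` convention and the order in which the filters are applied are likewise assumptions made for illustration.

```python
WINDOW = 140
HALF = WINDOW // 2  # assumed: 70 nucleotides on each side of the boundary


def extract_window(entry_seq, boundary):
    """Return the 140-nt window around `boundary`, or None if it does not fit.

    `boundary` is the 0-based index of the first nucleotide after the
    exon/intron junction (hypothetical convention).
    """
    start, end = boundary - HALF, boundary + HALF
    if start < 0 or end > len(entry_seq):
        return None  # insufficient data for a full window
    return entry_seq[start:end].upper()


def filter_windows(candidates, kind):
    """Filter (entry_seq, boundary) pairs; kind is 'donor' or 'acceptor'.

    Returns the kept windows and a count of discards per reason.
    """
    kept, seen = [], set()
    counts = {"non_canonical": 0, "insufficient": 0, "non_acgt": 0, "redundant": 0}
    for entry_seq, boundary in candidates:
        seq = entry_seq.upper()
        if kind == "donor":
            canonical = seq[boundary:boundary + 2] == "GT"  # intron starts with GT
        else:
            canonical = seq[max(boundary - 2, 0):boundary] == "AG"  # intron ends with AG
        if not canonical:
            counts["non_canonical"] += 1
            continue
        window = extract_window(seq, boundary)
        if window is None:
            counts["insufficient"] += 1
            continue
        if set(window) - set("ACGT"):
            counts["non_acgt"] += 1
            continue
        if window in seen:
            counts["redundant"] += 1
            continue
        seen.add(window)
        kept.append(window)
    return kept, counts
```

Calling `filter_windows` once with `kind="donor"` and once with `kind="acceptor"` yields the two sets of windows and the per-reason discard counts for each.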
Finally, there are 271,937+332,296 windows of false splice sites, selected by searching
for canonical GT-AG pairs in non-splicing positions. False sites within a range of +/- 60
nucleotides of a true splice site are marked as proximal.
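The search for false sites and the proximal marking can be illustrated with a short sketch. It assumes that false donor sites are GT dinucleotides and false acceptor sites are AG dinucleotides found away from the true splice positions, that positions are 0-based indices of the first base of the dinucleotide, and that the +/- 60 range is inclusive; none of these details is specified in the text, and the function name is hypothetical.

```python
def false_sites(entry_seq, true_positions, dinucleotide, proximal_range=60):
    """List (position, is_proximal) for each non-splicing dinucleotide occurrence.

    `true_positions` holds the 0-based positions of the true splice-site
    dinucleotides in the same entry (hypothetical convention).
    """
    seq = entry_seq.upper()
    true_set = set(true_positions)
    results = []
    pos = seq.find(dinucleotide)
    while pos != -1:
        if pos not in true_set:
            # Proximal: within the given range of any true splice site.
            proximal = any(abs(pos - t) <= proximal_range for t in true_set)
            results.append((pos, proximal))
        pos = seq.find(dinucleotide, pos + 1)
    return results
```

For example, `false_sites(entry_seq, donor_positions, "GT")` would list the false donor candidates of one entry; a 140-nucleotide window would then be cut around each of them in the same way as for the true sites.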
HS3D is available in the "Downloads" section of this site.
References
1. P. Pollastro, S. Rampone (2002). HS3D, a Dataset of Homo Sapiens Splice Regions, and its Extraction Procedure from a Major Public Database, International Journal of Modern Physics C, 13(8), 1105-1117. (please cite this paper)
2. P. Pollastro, S. Rampone (2003). HS3D: Homo Sapiens Splice Site Data Set, Nucleic Acids Research, 2003 Annual Database Issue.