ARCHIVED: How can I use SQL to access multiple data resources within CLSD?
The Indiana University Centralized Life Sciences Data (CLSD) service was retired June 18, 2010.
This document contains some example SQL queries that use the Centralized Life Sciences Data (CLSD) service at Indiana University. In particular, these examples show how having multiple data sources consolidated within the same system can facilitate exploration:
- The following example shows how to interrogate the National Center
for Biotechnology Information's (NCBI's) dbSNP database within the CLSD
collection of databases. This query should produce a list of all the
non-synonymous single nucleotide polymorphisms (SNPs) associated with TP53:
select
snp_id, allele, aa_position, residue
from
b126_SNPContigLocusId_36_1
where
locus_symbol like 'TP53' and fxn_class = 4
The result would be a table beginning with these rows:
SNP_ID ALLELE AA_POSITION RESIDUE 1042522 G 71 R 1800371 T 46 S 11540652 A 247 Q 11540654 T 109 L 17849781 G 277 A 17881470 G 365 A For more examples of querying a single database, see ARCHIVED: How can I use SQL in CLSD?
Note that NCBI does not allow users to query dbSNP using SQL queries. However, having the ability to issue SQL queries against dbSNP within CLSD should greatly facilitate some large-scale SNP studies. Studies using the other NCBI resources incorporated within CLSD (PubMed, UniGene, Nucleotide, Swiss-Prot) may also benefit from having direct SQL access.
- You can execute BLAST searches within CLSD. The next example shows
how to conduct a BLAST search with a fixed gapcost and a maximum E-value:
select
GB_ACC_NUM, DESCRIPTION
from
NCBI.BLASTN_NT
where
blastseq='AGTACTAGCTAGCTAGCTACTAGCTGACTGACTGACTGATGCATCGATGATGC'
and
GAPCOST = 8
and
E_VALUE < .05
The result of such a query would look like:
GB_ACC_NUM E_VALUE Description AE014134 0.0100839 Drosophila melanogaster chromosome 2L, complete sequence AC092228 0.0100839 Drosophila melanogaster, chromosome 2L, region 35X-35X, BAC clone BACR21J17, complete sequence AP008207 0.0398454 Oryza sativa (japonica cultivar-group) genomic DNA, chromosome 1 AP003197 0.0398454 Oryza sativa Japonica Group genomic DNA, chromosome 1, BAC clone:B1015E06 AP003105 0.0398454 Oryza sativa Japonica Group genomic DNA, chromosome 1, PAC clone:P0489A05
- You can also access data from multiple databases
as part of a single query by using a SQL command that
joins separate tables or embeds a query within a query.
This example shows how to display a list of non-synonymous SNPs related to any gene that is implicated in Alzheimer's disease according to the DiseaseGeneNet schema. (Note that the DiseaseGeneNet schema is only provisionally included within CLSD, as of May 2008.) The following SQL query uses an embedded select query to get a list of genes implicated in Alzheimer's, and then displays SNPs whose
select locus_symbol, locus_id, snp_id, allele, aa_position, residue from dbsnp128_human.B127_SNPCONTIGLOCUSID_36_2 where fxn_class = 4 and locus_id in ( select gene_id from disease_gene_net.disease_gene_connector where disease_id = 98 ) order by locus_symbol, aa_positionlocus_idvalues match any element in that list of genes:Following are the first and last few rows of the result:
LOCUS_SYMBOL SNP_ID ALLELE AA_POSITION RESIDUE A2M 12425728 A 359 Q A2M 12425728 A 359 Q PSEN2 28936379 G 238 V PSEN2 28936379 G 238 V The following query will tell you how many SNPs are associated with each gene associated with Alzheimer's:
select locus_symbol, count(*) from dbsnp128_human.B127_SNPCONTIGLOCUSID_36_2 where fxn_class = 4 and locus_id in ( select gene_id from disease_gene_net.disease_gene_connector where disease_id = 98 ) group by locus_symbol order by locus_symbolFollowing is the result of this query:
LOCUS_SYMBOL Count A2M 14 ACE 82 APBB2 8 APOE 38 APP 12 BLMH 6 MPO 12 NOS3 27 PAXIP1 15 PLAU 11 PSEN1 6 PSEN2 4
This document was developed with support from the National Science Foundation (NSF) under Grant No. 0503697 to the University of Chicago and subcontracted to Indiana University. Additional support was provided by IU through its participation in the TeraGrid, which is supported by the NSF under Grants No. 0833618, SCI451237, SCI535258, and SCI504075. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.
Last modified on June 18, 2010.







