Indiana University
University Information Technology Services
  
What are archived documents?
Login>>
Login

Login is for authorized groups (e.g., UITS, OVPIT, and TCC) that need access to specialized Knowledge Base documents. Otherwise, simply use the Knowledge Base without logging in.

Close

ARCHIVED: How can I use SQL to access multiple data resources within CLSD?

The Indiana University Centralized Life Sciences Data (CLSD) service was retired June 18, 2010.

This document contains some example SQL queries that use the Centralized Life Sciences Data (CLSD) service at Indiana University. In particular, these examples show how having multiple data sources consolidated within the same system can facilitate exploration:

  • The following example shows how to interrogate the National Center for Biotechnology Information's (NCBI's) dbSNP database within the CLSD collection of databases. This query should produce a list of all the non-synonymous single nucleotide polymorphisms (SNPs) associated with TP53: select snp_id, allele, aa_position, residue from b126_SNPContigLocusId_36_1 where locus_symbol like 'TP53' and fxn_class = 4

    The result would be a table beginning with these rows:

    SNP_ID ALLELE AA_POSITION RESIDUE
    1042522 G 71 R
    1800371 T 46 S
    11540652 A 247 Q
    11540654 T 109 L
    17849781 G 277 A
    17881470 G 365 A

    For more examples of querying a single database, see ARCHIVED: How can I use SQL in CLSD?

    Note that NCBI does not allow users to query dbSNP using SQL queries. However, having the ability to issue SQL queries against dbSNP within CLSD should greatly facilitate some large-scale SNP studies. Studies using the other NCBI resources incorporated within CLSD (PubMed, UniGene, Nucleotide, Swiss-Prot) may also benefit from having direct SQL access.

  • You can execute BLAST searches within CLSD. The next example shows how to conduct a BLAST search with a fixed gapcost and a maximum E-value: select GB_ACC_NUM, DESCRIPTION from NCBI.BLASTN_NT where blastseq='AGTACTAGCTAGCTAGCTACTAGCTGACTGACTGACTGATGCATCGATGATGC' and GAPCOST = 8 and E_VALUE < .05

    The result of such a query would look like:

    GB_ACC_NUM E_VALUE Description
    AE014134 0.0100839 Drosophila melanogaster chromosome 2L, complete sequence
    AC092228 0.0100839 Drosophila melanogaster, chromosome 2L, region 35X-35X, BAC clone BACR21J17, complete sequence
    AP008207 0.0398454 Oryza sativa (japonica cultivar-group) genomic DNA, chromosome 1
    AP003197 0.0398454 Oryza sativa Japonica Group genomic DNA, chromosome 1, BAC clone:B1015E06
    AP003105 0.0398454 Oryza sativa Japonica Group genomic DNA, chromosome 1, PAC clone:P0489A05


  • You can also access data from multiple databases as part of a single query by using a SQL command that joins separate tables or embeds a query within a query.

    This example shows how to display a list of non-synonymous SNPs related to any gene that is implicated in Alzheimer's disease according to the DiseaseGeneNet schema. (Note that the DiseaseGeneNet schema is only provisionally included within CLSD, as of May 2008.) The following SQL query uses an embedded select query to get a list of genes implicated in Alzheimer's, and then displays SNPs whose locus_id values match any element in that list of genes:

    select locus_symbol, locus_id, snp_id, allele, aa_position, residue from dbsnp128_human.B127_SNPCONTIGLOCUSID_36_2 where fxn_class = 4 and locus_id in ( select gene_id from disease_gene_net.disease_gene_connector where disease_id = 98 ) order by locus_symbol, aa_position

    Following are the first and last few rows of the result:

    LOCUS_SYMBOL SNP_ID ALLELE AA_POSITION RESIDUE
    A2M 12425728 A 359 Q
    A2M 12425728 A 359 Q
    PSEN2 28936379 G 238 V
    PSEN2 28936379 G 238 V

    The following query will tell you how many SNPs are associated with each gene associated with Alzheimer's:

    select locus_symbol, count(*) from dbsnp128_human.B127_SNPCONTIGLOCUSID_36_2 where fxn_class = 4 and locus_id in ( select gene_id from disease_gene_net.disease_gene_connector where disease_id = 98 ) group by locus_symbol order by locus_symbol

    Following is the result of this query:

    LOCUS_SYMBOL Count
    A2M 14
    ACE 82
    APBB2 8
    APOE 38
    APP 12
    BLMH 6
    MPO 12
    NOS3 27
    PAXIP1 15
    PLAU 11
    PSEN1 6
    PSEN2 4

This document was developed with support from the National Science Foundation (NSF) under Grant No. 0503697 to the University of Chicago and subcontracted to Indiana University. Additional support was provided by IU through its participation in the TeraGrid, which is supported by the NSF under Grants No. 0833618, SCI451237, SCI535258, and SCI504075. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.

This is document awxs in domains all and tgrid-all.
Last modified on June 18, 2010.

Comments/Questions/Corrections

Use this form to offer suggestions, corrections, and additions to the Knowledge Base. We welcome your input!

If you are affiliated with Indiana University and would like assistance with a specific computing problem, please use the Ask a Consultant form, or contact your campus Support Center.

Contact Information

Note: We will reply to your comment at this address. If your message concerns a problem receiving email, please enter an alternate email address.