ARCHIVED: How can I use SQL in CLSD?
The Indiana University Centralized Life Sciences Data (CLSD) service was retired June 18, 2010.
Centralized Life Sciences Data (CLSD) at Indiana University is housed in an IBM DB2 database, accessible via several clients and programming interfaces. To access CLSD data, you must prepare queries using Structured Query Language (SQL).
Following are some examples of using SQL to access data within CLSD. For a very brief introduction to SQL, see SQL and the relational algebra on the IU Research Technologies web site. For more detailed information about SQL, see these online tutorials:
For detailed information about using SQL to build queries for dbSNP data, see IU's Using dbSNP via SQL queries.
Note: Within DB2, CLSD appears as a collection of schemas, each of which includes a collection of tables. Many of the tables are virtual; they are not physically present on local hardware, but are accessed over the network as needed.
Getting general information about the CLSD database
- To get a list of all schemas and tables available within CLSD: select tabschema, tabname from syscat.tables
- To get all the information within
syscat.tablesabout each table: select * from syscat.tables - To find other information about the CLSD instance, consider this
material taken from the
Selector web site:
There are three default tablespaces (TABLESPACES):
- SYSCATSPACE
- TEMPSPACE
- USERSPACE1
For example:
select * from SYSCAT.TABLESPACESThere are four schemas within SYSCATSPACE:
- SYSIBM (System tables, read only)
- SYSCAT (Views of system tables, read only)
- SYSFUN (Stored functions)
- SYSSTAT (System use statistics)
Below are some useful tables. Note that for each entry in the SYSIBM column, there is a corresponding entry in the SYSCAT column.
Database object SYSIBM system table SYSCAT system table view SYSSTAT updateable view Tablespace SYSTABLES TABLESPACES Table SYSTABLES TABLES TABLES Schema SYSSCHEMATA SCHEMATA View SYSVIEWS VIEWS Column SYSCOLUMNS COLUMNS COLUMNS Below are examples from the SYSIBM system table area:
select * from SYSIBM.TABLES select * from SYSIBM.SYSTABLESFollowing are examples from the SYSCAT area:
select * from SYSCAT.SCHEMATA select * from SYSCAT.TABLES select typename from SYSCAT.DATATYPES - To get a list of all tables or tablenames in schema
dbSNP128_SHARED:
select * from syscat.tables where tabschema like '%DBSNP128%'
select tabname from syscat.tables where tabschema='DBSNP128_SHARED'
Note: The
likeoperand is case sensitive. - To get a list of rowcounts for all tables in schema dbSNP128_human: select * from dbsnp128_human.DN_TABLE_ROWCOUNT
- To get a description of each column in each table within schema dbSNP128_shared: select * from syscat.tables where tabschema = 'DBSNP128_SHARED'
Getting specific information from the database
Following are some queries that display table contents rather than table metadata:
- To get all columns of the alleles table of schema dbSNP128_shared:
select * from dbsnp128_shared.allele
Note: This contains more rows than allowed by the default maximum row count of 5,000 employed by the reference web interface. If you want to see all the entries, you will have to adjust the maximum row count parameter on the web-based form.
Alternatively, to limit the number of rows returned by the SQL command itself:
select * from dbsnp128_shared.allele fetch first 100 rows only - To count the number of entries in the allele database: select count(*) from dbSNP128_shared.allele
- To see the entire HapMap classification code table (nine entries): select * from dbsnp128_human.hapmapclasscode
- To get a list of just the codes and their abbreviations: select code,abbrev from dbsnp128_human.hapmapclasscode
- To run a BLASTN_NT search:
select
GB_ACC_NUM,DESCRIPTION from NCBI.BLASTN_NT
where
BLASTSEQ='AGTACTAGCTAGCTAGCTACTAGCTGACTGACTGACTGATGCATCGATGATGC'
Note: The string
BLASTNincludes a trailingN, and nucleotide and AA sequences specified in a query must be at least 15 bytes long.See Using BLAST via CLSD for more information.
- To get a list of BLAST results filtered by a Prosite pattern:
select
gb_acc_num,HSP_H_SEQ
from
ncbi.blastp_nr
where
blastseq='MSQICKRGLLISNRLAPAALRCKSTWFSEVQMGPPDAILGVTEAFKKDTNPKKI \
NLGAGAYRDDNTQPFVLPSVREAEKRVVSRSLDKEYATIIGIPEFYNKAIELALGKGSKRLAA \
KHNVTAQSISGTGALRIGAAFLAKFWQGNREIYIPSPSWGNHVAIFEHAGLPVNRYRYYDKDT'
and
DB2LS.LSPatternMatch(HSP_H_SEQ,
db2ls.LSPrositePattern('[GS]-[LIVMFYTAC]-[GSTA]-K-x(2)-[GSALVN].') ) > 0
Note: Use a period (
.) to terminate the Prosite pattern. The LSPatternMatch function returns the character position of the left-most substring matching the pattern, or zero if there is no match. - To get a list of genes containing "brain" in their LOCUS_NAME in
dbSNP128_shared:
select
*
from
DBSNP128_SHARED.GENEIDTONAME
where
locus_name like '%brain%'
Note: The string
brainis case specific. - To get a list of Bind Genes and their species: select GeneNameA,Organism from bind.bind_interaction
- To get a list of genes mentioning "HUMAN" in their descriptions in
KEGG:
select * from KEGG.GENE where description like '%HUMAN%'
Note: The
likeoperand is case sensitive. - To get a list of all columns in the "REAC%" tables: select * from syscat.columns where TABNAME like 'REAC%'
- To get a list of each reaction (showing reactants, enzymes and products from the ligand database) where the enzyme codes have been replaced with the enzyme names, as held in the enzyme database, for each pathway recorded in the ligand database: select pw.PATH_ID, pw.REAC_ID, re.COMP_ID, sx.DESCRIPTION, pr.COMP_ID from Ligand.PATHREAC pw, Ligand.REACREAC re, Ligand.REACENZY en, Ligand.REACPROD pr, Enzyme.general sx where en.REAC_ID=re.REAC_ID AND en.REAC_ID=pw.REAC_ID AND en.REAC_ID=pr.REAC_ID AND sx.EC_ID=en.EC_ID order by pw.PATH_ID,pw.REAC_ID
- To get information from PubMed: select PMID, ArticleTitle FROM NCBI.pmarticles where entrez.contains (ArticleTitle, 'granulation') = 1 AND entrez.contains (PubDate, '1992') = 1
- To get a list of sequences involving JAK3 from NCBI Nucleotide: SELECT PRIMARYACCESSION, SEQLENGTH, DEFINITION, SEQUENCE FROM NCBI.GBSEQ WHERE ( ENTREZ.CONTAINS( ORGANISM, 'HUMAN') = 1 ) AND ( ENTREZ.CONTAINS( DEFINITION, 'JAK3' ) = 1 )
Additional information
If you receive SQL error messages in response to SQL queries to CLSD, consult the IBM Warning, error and completion messages information.
This document was developed with support from the National Science Foundation (NSF) under Grant No. 0503697 to the University of Chicago and subcontracted to Indiana University. Additional support was provided by IU through its participation in the TeraGrid, which is supported by the NSF under Grants No. 0833618, SCI451237, SCI535258, and SCI504075. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.
Last modified on June 18, 2010.







