For programmatic access, we provide an easy to use REST API documentation. We have also released libraries for both Python and R, in order to provide a native experience to those fluent in those languages. For those particularly interested in semantic web technologies - we also provide an Resource Description Framework (RDF) representation of the OMA data, which can be queried using the RDF query language SPARQL. This is, however, outside the scope of these exercises - for more information, see here. These exercises can be performed in the language of your choice, however the solutions will be provided in Python. The R package is available on Bioconductor, whilst the Python module is available from PyPI and installable using
pip install omadb
using a terminal. For these exercises we will also use the extra parts of PyOMADB for HOG analysis and GO enrichment analyses. To ensure you install the required extra packages, use the command
pip install "omadb[hog_analysis,goea]"
After installing the package, we can load it in Python (for instance, within a Jupyter Notebook), as follows
from omadb import Client
c = Client()
Note: there are a number of examples using the PyOMADB package available here, which may help when completing these exercises.
Your colleague provides you with the following excerpt of a protein sequence
SCHTGLRRTAGWNVPIGTLRPFLNWTGPPEPIEAAVARFFSASCVPGADKGQFPNLCRLCAGTGEN
Unfortunately they have misplaced the sequence identifier, but can remember that it was a Human protein. Using the API, identify which OMA entry this is.
Note: OMA IDs contain the UniProt 5 character species code as the first 5 characters of the ID.
unknown_seq = 'SCHTGLRRTAGWNVPIGTLRPFLNWTGPPEPIEAAVARFFSASCVPGADKGQFPNLCRLCAGTGEN'
results = c.proteins.search(unknown_seq)
print('Number of results:', len(results.targets))
print('Identified by:', results['identified_by'])
human_ids = list(filter(lambda target: target.omaid.startswith('HUMAN'), results.targets))
print('Number of human entries:', len(human_ids))
human_entry = human_ids[0]
print('OMA ID:', human_entry.omaid)
print('Number of orthologs:', len(human_entry.orthologs))
og = human_entry.oma_group_url
print('OMA group: ', og.group_nr)
print('Fingerprint:', og.fingerprint)
species = {m.omaid[:5] for m in og.members}
print('Number of species:', len(species))
print('HOG ID:', human_entry.oma_hog_id)
print('HOG level:', human_entry.oma_hog_members.level)
print('Top-level (root) HOG ID:', human_entry.roothog_id)
We can use the GOATOOLs library to perform the GOEA. There is handy functionality in the PyOMADB library to load this automatically for us - it can even filter to a desired aspect of the GO. For instance, the following loads for the biological process aspect only.
goea = c.entries.gene_ontology(background, # Background list of IDs
aspect='BP', # Filtering (optional)
as_goatools=True, # To return GOATOOLS GOEA object
progress=True, # Show progress of GO retrieval
methods=['fdr_bh']) # Passed to GOATOOLS (methods for GOEA)
hog_members = human_entry.oma_hog_members.members
foreground = [m.omaid for m in hog_members]
euk_hog_members = c.hogs.members('TRFL_HUMAN', level='Eukaryota')
background = [m.omaid for m in euk_hog_members]
import pandas as pd
res = pd.DataFrame.from_records((dict(zip(z.get_prtflds_default(),
z.get_field_values(z.get_prtflds_default())))
for z in results))
results = goea.run_study(foreground)
Using the OMA API, we have access to the pairwise relationships that OMA computes. In the Python client, this is available through the Client.pairwise
method. In this exercise, we will compare the distribution of the evolutionary distance (in PAM units) between the 1:1 pairwise orthologs between human and Mus musculus (Mouse) as well as human and Canis lupus familiaris (Dog).
Hint: If you find this difficult, look at example 4 here (for R), or the notebook of examples (for PyOMADB).
import numpy as np
human_mouse_iter = c.pairwise(genome_id1='HUMAN',
genome_id2='MOUSE',
rel_type='1:1',
progress=True)
human_mouse = np.array([x.distance for x in human_mouse_iter])
human_dog_iter = c.pairwise(genome_id1='HUMAN',
genome_id2='CANLF',
rel_type='1:1',
progress=True)
human_dog = np.array([x.distance for x in human_dog_iter])
%matplotlib inline
import seaborn as sns
sns.set()
ax = sns.distplot(human_dog, label='Canis lupus familiaris')
ax = sns.distplot(human_mouse, label='Mus musculus', ax=ax)
ax.set_title('Distribution of Evolutionary Distance between\n1:1 Orthologous Pairs HUMAN-{CANLF, MOUSE}')
ax.set_xlabel('Evolutionary Distance (PAM)')
ax.set_ylabel('Density')
ax.legend()
print('Median:',
'\nHUMAN-MOUSE', np.median(human_mouse),
'\nHUMAN-CANLF', np.median(human_dog))
print('Mean:'
'\nHUMAN-MOUSE', np.mean(human_mouse),
'\nHUMAN-CANLF', np.mean(human_dog))
scipy.stats.ks_2samp
.
from scipy import stats
z = stats.ks_2samp(human_mouse, human_dog)
print('p-value:', z.pvalue)