OMA

Introduction to Automatically Querying the OMA Database

For programmatic access, we provide an easy to use REST API documentation. We have also released libraries for both Python and R, in order to provide a native experience to those fluent in those languages. For those particularly interested in semantic web technologies - we also provide an Resource Description Framework (RDF) representation of the OMA data, which can be queried using the RDF query language SPARQL. This is, however, outside the scope of these exercises - for more information, see here. These exercises can be performed in the language of your choice, however the solutions will be provided in Python. The R package is available on Bioconductor, whilst the Python module is available from PyPI and installable using

`pip install omadb`

using a terminal. For these exercises we will also use the extra parts of PyOMADB for HOG analysis and GO enrichment analyses. To ensure you install the required extra packages, use the command

`pip install "omadb[hog_analysis,goea]"`

After installing the package, we can load it in Python (for instance, within a Jupyter Notebook), as follows

from omadb import Client
c = Client()

Note: there are a number of examples using the PyOMADB package available here, which may help when completing these exercises.

Characterising a Protein with a Missing ID

Your colleague provides you with the following excerpt of a protein sequence SCHTGLRRTAGWNVPIGTLRPFLNWTGPPEPIEAAVARFFSASCVPGADKGQFPNLCRLCAGTGEN

Unfortunately they have misplaced the sequence identifier, but can remember that it was a Human protein. Using the API, identify which OMA entry this is.

Note: OMA IDs contain the UniProt 5 character species code as the first 5 characters of the ID.

1. How many targets contain this excerpt? How were these identified?

unknown_seq = 'SCHTGLRRTAGWNVPIGTLRPFLNWTGPPEPIEAAVARFFSASCVPGADKGQFPNLCRLCAGTGEN'
results = c.proteins.search(unknown_seq)
print('Number of results:', len(results.targets))
print('Identified by:', results['identified_by'])

2. What is the OMA ID of the human protein which contains this subsequence?

Hint: the 5-letter UniProt species code for Homo sapiens is HUMAN.

human_ids = list(filter(lambda target: target.omaid.startswith('HUMAN'), results.targets))
print('Number of human entries:', len(human_ids))
human_entry = human_ids[0]
print('OMA ID:', human_entry.omaid)

3. How many orthologs does this entry have?

print('Number of orthologs:', len(human_entry.orthologs))

4. How many species have a protein member of the OMA Group of which this protein is part?

og = human_entry.oma_group_url
print('OMA group: ', og.group_nr)
print('Fingerprint:', og.fingerprint)
species = {m.omaid[:5] for m in og.members}
print('Number of species:', len(species))

5. Which HOG is this protein a member of? What is the top level family ID?

print('HOG ID:', human_entry.oma_hog_id)
print('HOG level:', human_entry.oma_hog_members.level)
print('Top-level (root) HOG ID:', human_entry.roothog_id)

Performing a GO Enrichment Analysis

In this exercise, we will retrieve the Gene Ontology (GO) annotations for the members of a particular gene family, at the Eukaryota level. We can then perform a GO enrichment analysis (GOEA) in order to see if a sub-family, defined at a lower level (Amniota), is enriched in any particular GO terms.

This will focus on the gene family (top level HOG) found in the previous exercise

1. Identify the members of the HOG that the protein in the previous exercise is part of.

hog_members = human_entry.oma_hog_members.members
2. Generate a list of OMA IDs for these members. This is the foreground set.

foreground = [m.omaid for m in hog_members]
3. Generate a list of OMA IDs for the gene family (HOG) defined at the Eukarayota level. This defines the background set of genes in our GOEA.
euk_hog_members = c.hogs.members('TRFL_HUMAN', level='Eukaryota') background = [m.omaid for m in euk_hog_members]

We can use the GOATOOLs library to perform the GOEA. There is handy functionality in the PyOMADB library to load this automatically for us - it can even filter to a desired aspect of the GO. For instance, the following loads for the biological process aspect only.

goea = c.entries.gene_ontology(background,         # Background list of IDs
                           aspect='BP',        # Filtering (optional)
                           as_goatools=True,   # To return GOATOOLS GOEA object
                           progress=True,      # Show progress of GO retrieval
                           methods=['fdr_bh']) # Passed to GOATOOLS (methods for GOEA)

4. Run the GOEA, on the biological process annotations, using the foreground selected earlier.

Hint: you can use this piece of code to load the results into a pandas DataFrame.

import pandas as pd
res = pd.DataFrame.from_records((dict(zip(z.get_prtflds_default(), 
                                      z.get_field_values(z.get_prtflds_default()))) 
                             for z in results))

results = goea.run_study(foreground)

5. What do you think this result means?

Visualising PAM Distances in Taxonomic Space

Using the OMA API, we have access to the pairwise relationships that OMA computes. In the Python client, this is available through the Client.pairwise method. In this exercise, we will compare the distribution of the evolutionary distance (in PAM units) between the 1:1 pairwise orthologs between human and Mus musculus (Mouse) as well as human and Canis lupus familiaris (Dog). Hint: If you find this difficult, look at example 4 here (for R), or the notebook of examples (for PyOMADB).

1. Do you expect humans to be more similar to mice or dogs?

2. Load the 1:1 pairwise orthologs between human and mouse.

import numpy as np

human_mouse_iter = c.pairwise(genome_id1='HUMAN',
                          genome_id2='MOUSE',
                          rel_type='1:1',
                          progress=True)
human_mouse = np.array([x.distance for x in human_mouse_iter])

3. Repeat, loading the distances between 1:1 orthologs for human and dog.

Hint: the 5-letter code for dog is CANLF.

human_dog_iter = c.pairwise(genome_id1='HUMAN',
                        genome_id2='CANLF',
                        rel_type='1:1',
                        progress=True)
human_dog = np.array([x.distance for x in human_dog_iter])

4. Using a plotting library (for instance, seaborn in Python), plot the two distributions PAM distances.

%matplotlib inline
import seaborn as sns
sns.set()

ax = sns.distplot(human_dog, label='Canis lupus familiaris')
ax = sns.distplot(human_mouse, label='Mus musculus', ax=ax)

ax.set_title('Distribution of Evolutionary Distance between\n1:1 Orthologous Pairs HUMAN-{CANLF, MOUSE}')
ax.set_xlabel('Evolutionary Distance (PAM)')
ax.set_ylabel('Density')
ax.legend()

5. Look at the median and mean of the two samples (human - mouse and human - dog)

print('Median:',
  '\nHUMAN-MOUSE', np.median(human_mouse),
  '\nHUMAN-CANLF', np.median(human_dog))

print('Mean:'
  '\nHUMAN-MOUSE', np.mean(human_mouse),
  '\nHUMAN-CANLF', np.mean(human_dog))

6. Use the two-sample Kolmogorov-Smirnov test, to test whether the two samples are drawn from the same distribution.
Hint: the two-sample Kolmogorov-Smirnov test is available in scipy, as scipy.stats.ks_2samp.
from scipy import stats z = stats.ks_2samp(human_mouse, human_dog) print('p-value:', z.pvalue)
7. Are the results to (5) and (6) what you expected in (1)? Does it correspond with the literature?

The KS test returns a near-zero p-value. The median distance between dog and human is shorter than that of mouse and human (8.7 vs. 11.8). This is consistent with previous observations that the rodent has a longer branch than humans and carnivores, in part due to their shorter generation time.

Category	Prefixes
Genes	id, go, ec, description, domain, sequence
HOGs	hog, sequence
OMA Groups	omagroup, fingerprint, sequence
Taxon	species, taxid, taxon

Category	Prefixes
Genes	id, go, ec, description, domain, sequence
HOGs	hog, sequence
OMA Groups	omagroup, fingerprint, sequence
Taxon	species, taxid, taxon

Module 5: Automated Queries

Introduction to Automatically Querying the OMA Database

Characterising a Protein with a Missing ID

Performing a GO Enrichment Analysis

Visualising PAM Distances in Taxonomic Space