LMD

Linear Motif Discovery - some help

About the pages
There are two kinds of pages. The first is the simplest: find proteins that interact with many others, apply the filters and report the motifs. The second groups proteins according to a common domain. For example, you could take a single protein with an SH3 domain, or you could put all proteins that have SH3 domains in one set and consider all their possible interactions. The first strategy will perhaps find specific ELMs, the second more general ones. For example, in Fly we identified the canonical TQT motif as the binding site for Dynein domains, but found the more specific pattern A(T/I)QT(D/E) for the specific Dynein homologue Cdlc2. Other motifs were seen in only one of the protein or domain sets. For example, in Yeast a correct SH3 motif was found only in the domain sets, as no single SH3 domain protein had enough interaction partners for the motif to be found. The reverse was true in Fly, where the correct SH3 motif was found only in the protein set, since the domain set had too many proteins lacking the canonical motif.

How to judge whether a motif is true
There are several numbers to consider. The first is the binomial P, a fairly robust statistic that measures the likelihood of seeing a motif of a particular kind a particular number of times in a particular total number of sequences. More specifically, it is a binomial probability in which the prior probability is computed by counting how many times the motif occurs in a large background database (in these cases, the genome of the relevant organism).
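
As a rough illustration of the binomial P described above, the sketch below computes the probability of seeing a motif in k or more of n sequences, given a background probability p. It is a minimal sketch, not the LMD implementation: the function names are hypothetical, and the use of the upper tail (k or more) rather than exactly k is an assumption, since the text does not say which is used.

```python
from math import comb


def background_probability(motif_hits: int, background_size: int) -> float:
    """Estimate the prior as the fraction of background sequences containing the motif.

    Assumption: one count per background sequence; the exact counting scheme
    used by LMD is not stated in the text.
    """
    if background_size <= 0:
        raise ValueError("background_size must be positive")
    return motif_hits / background_size


def binomial_tail(k: int, n: int, p: float) -> float:
    """Probability that k or more of n sequences contain the motif by chance."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    # Sum the binomial probabilities for i = k .. n.
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # Illustrative numbers only: motif seen in 60 of 6000 background sequences,
    # and in 5 of 10 interaction partners.
    prior = background_probability(60, 6000)
    print(binomial_tail(5, 10, prior))
```

Rarer motifs in the background (smaller p) and more instances in the interaction set (larger k) both push the value down.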
The second is S_cons, which combines the binomial probability for the interaction sets with the conservation of the motif in closely related species. We considered orthologues in the four other completely sequenced yeast genomes (K. lactis, A. gossypii, D. hansenii, C. glabrata) for Yeast (S. cerevisiae) motifs, D. pseudoobscura for Fly (D. melanogaster), C. briggsae for Nematode (C. elegans), and M. musculus, R. norvegicus, G. gallus and F. rubripes for motifs found in Human (H. sapiens) proteins. S_cons cannot be interpreted as a probability in the strict statistical sense, but it greatly improves the sensitivity and selectivity of the results.
The third is the number of proteins within a set of interacting proteins that contain the motif. For instance, if protein ABC1 interacts with 10 others, a motif is only of interest if it is found in several of those ten. We consider motifs that occur in more than 3 sequences; one probably needs at least 5 to be sure, so in practice we choose motifs with at least 4 conserved instances. Motifs that occur only three times can, of course, be real and simply limited by the data, but the false-positive rate rises sharply, so be warned.
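
To make the third number concrete, the sketch below counts how many sequences in an interaction set contain a motif at least once. It assumes the motifs are written as regular expressions, so A(T/I)QT(D/E) becomes A[TI]QT[DE] and "." matches any residue. The function name and the toy sequences are invented for illustration and are not real proteins.

```python
import re


def count_sequences_with_motif(pattern: str, sequences: dict[str, str]) -> int:
    """Count sequences that contain the motif at least once.

    Each sequence contributes at most 1, however many times the motif occurs in it.
    """
    regex = re.compile(pattern)
    return sum(1 for seq in sequences.values() if regex.search(seq))


if __name__ == "__main__":
    # Toy interaction partners (made-up sequences, for illustration only).
    partners = {
        "toy1": "MSKATQTDLL",
        "toy2": "GGATQTEPPK",
        "toy3": "MMTQTAAAKL",
        "toy4": "PLKAVNNRST",
    }
    print(count_sequences_with_motif("TQT", partners))          # general motif
    print(count_sequences_with_motif("A[TI]QT[DE]", partners))  # more specific pattern
```

Counting sequences rather than total matches follows the text, which speaks of the number of proteins that contain the motif.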
In practice, P values smaller than 10^-13 are pretty reliable for motifs occurring 4 or more times. We demonstrated this by creating random sets of proteins of similar size and composition, extracted from a concatenated proteome. A higher P, or fewer sequences containing the motif, puts you in a twilight zone where true and false hits intermingle. The threshold for S_cons is more difficult to define, as it depends on the number and evolutionary distance of the species used. Thus, for the Human sets the values are significantly lower because of the evolutionary proximity of the genomes we used. Generally, values of ~10^-17 for Yeast, ~10^-15 for Fly, ~10^-15 for Worm and ~10^-38 for Human are pretty reliable.
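
The rules of thumb above can be written down as a small filter. This is only a sketch of one way to apply them. The function and constant names are hypothetical, the cutoffs are the approximate values quoted in the text, and it assumes that S_cons values at or below the quoted figure count as reliable, which the text implies but does not state exactly.

```python
# Approximate cutoffs quoted in the text; treat them as guidance, not hard limits.
P_CUTOFF = 1e-13
MIN_INSTANCES = 4
S_CONS_CUTOFF = {"Yeast": 1e-17, "Fly": 1e-15, "Worm": 1e-15, "Human": 1e-38}


def classify_by_p(p_value: float, n_instances: int) -> str:
    """Return 'reliable' if P is below the cutoff and the motif occurs often enough.

    Anything else falls in the 'twilight' zone where true and false hits mix.
    """
    if p_value < P_CUTOFF and n_instances >= MIN_INSTANCES:
        return "reliable"
    return "twilight"


def passes_s_cons(organism: str, s_cons: float) -> bool:
    """Check S_cons against the organism-specific cutoff.

    Assumption: smaller S_cons is better, and values at or below the cutoff pass.
    """
    return s_cons <= S_CONS_CUTOFF[organism]


if __name__ == "__main__":
    print(classify_by_p(1e-15, 5))        # low P, enough instances
    print(classify_by_p(1e-15, 3))        # low P, but only three instances
    print(passes_s_cons("Human", 1e-40))  # compared with the ~10^-38 Human cutoff
```

A motif in the twilight zone is not necessarily false; it just needs more scrutiny before being trusted.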

Rediscovery of known motifs
Many of the motifs we found are already known, especially in the Human set: more than 42% of the Human motifs passing the ~10^-38 threshold are known. The yeast two-hybrid data also contain several known motifs. For example, we knew about the Metallophos (PP1) -> K.V.F (RVxF; Drosophila protein lists) and the SH3 -> polyPro (Drosophila, Yeast and Worm lists) instances. We did not know about 2-Hacid_dh (CtBP) -> P.DLS, which was pointed out to us by Morten Mattingsdal (Bergen). Ewan Birney also told us about the SR/RS motifs (see the Drosophila protein list): these are phosphorylated serines/modified arginines involved in nuclear transport. They are known to be sometimes associated with RRM (RNP) domains (i.e. in the same polypeptide), and encouragingly we see these in the output.

There are most likely more real motifs in the data we present. If you have found a new motif, please contact us. We would be happy to hear about the usefulness of our data and to help you if necessary.
