Aligned features are a set of protein fragments (e.g., sequence motifs) that are present in multiple proteins. Each of these fragments is called an “aligned feature”, and the process by which they are discovered is called “feature discovery”.
The problem of finding all aligned features in two sets of proteins is referred to as “alignment” or “feature alignment”.
What is an example of a feature?
For example, collagen contains the aligned features GPGXXGXXX and HXXXXXH, where X represents any amino acid. Both GPGXXGXXX and HXXXXXH belong to the same set of aligned features.
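A motif written with X as a wildcard can be checked against a sequence by translating it into a regular expression. The sketch below is a minimal illustration, assuming single-letter amino acid codes; the helper names `motif_to_regex` and `find_motif` and the example sequence are hypothetical and do not come from any particular tool.

```python
import re

def motif_to_regex(motif):
    """Translate a motif such as 'HXXXXXH' into a compiled regex, treating X as any residue."""
    # Every X becomes '.', all other letters are matched literally.
    pattern = "".join("." if ch == "X" else re.escape(ch) for ch in motif.upper())
    # Wrapping the pattern in a lookahead lets overlapping occurrences be reported too.
    return re.compile(f"(?=({pattern}))")

def find_motif(motif, sequence):
    """Return (start, matched_fragment) for every occurrence of motif in sequence."""
    regex = motif_to_regex(motif)
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence.upper())]

# Made-up sequence for illustration only (not a real collagen sequence).
example = "AAGPGAAGTTTHKLMNPHAA"
print(find_motif("GPGXXGXXX", example))  # [(2, 'GPGAAGTTT')]
print(find_motif("HXXXXXH", example))    # [(11, 'HKLMNPH')]
```

Start positions are 0-based. A fragment that matches the same motif in several proteins would be a candidate aligned feature in the sense used here.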
What is the process of feature alignment?
Aligned features can be either homologous or non-homologous. If they are homologous, we use a scoring function and run a BLAST search to find the fragment with the highest score, and then add that protein fragment to our set. If they are non-homologous, we simply add them directly as input to the feature alignment algorithm.
Which two processes are involved in finding all of the aligned features?
First, each protein sequence is divided into subsequences of length n called “windows”. For each window, a list of all of its aligned features is found using a BLAST search.
Second, proteins are broken up into smaller sets called “segments” based on sequence homology or sequence similarity.
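The windowing step can be sketched as follows. This minimal illustration assumes non-overlapping windows and keeps a shorter final window when the sequence length is not a multiple of n; the text above does not specify either choice, and the function name `split_into_windows` is hypothetical.

```python
def split_into_windows(sequence, n):
    """Split a protein sequence into consecutive, non-overlapping windows of length n."""
    if n <= 0:
        raise ValueError("window length n must be positive")
    # Step through the sequence n residues at a time; the last window may be shorter.
    return [sequence[i:i + n] for i in range(0, len(sequence), n)]

print(split_into_windows("MKTAYIAKQR", 4))  # ['MKTA', 'YIAK', 'QR']
```

An overlapping (sliding) variant would advance by one residue at a time instead of n.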
How are the resulting protein fragments stored?
Each window is stored as a list of aligned feature strings, each of which contains information about one aligned feature within the window.
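One possible in-memory layout is a mapping from window index to the list of feature strings found in that window. The sketch below assumes this layout and uses a plain substring check as a stand-in for the BLAST search; the names `features_per_window` and `known_features` and the example sequences are hypothetical.

```python
def features_per_window(windows, known_features):
    """Map each window index to the list of known feature strings it contains."""
    table = {}
    for index, window in enumerate(windows):
        # Substring matching stands in for a real similarity search here.
        table[index] = [feature for feature in known_features if feature in window]
    return table

# Made-up windows and features, for illustration only.
windows = ["MKTAGPGA", "YIHKLAKQ", "RGPGHKLA"]
known_features = ["GPG", "HKL"]
print(features_per_window(windows, known_features))
# {0: ['GPG'], 1: ['HKL'], 2: ['GPG', 'HKL']}
```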
How are these results used to produce a phylogenetic tree?
These separate lists are passed to an algorithm called “feature alignment”, which produces many possible trees, each with its own score, showing how our proteins may have evolved. There are many reasons to do this (e.g., to identify homologous sequences or to perform evolutionary inference). The generally accepted hypothesis behind the approach is that our proteins evolved together with their most closely related proteins.
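The text above does not say how trees are built or scored. As one illustration of how per-protein feature lists could feed a tree-building method, the sketch below computes pairwise Jaccard distances between feature sets; a distance matrix of this kind is a common input to distance-based methods such as UPGMA or neighbor joining. The choice of Jaccard distance, the function names, and the protein labels are all assumptions, not the method described above.

```python
from itertools import combinations

def jaccard_distance(features_a, features_b):
    """Return 1 minus (shared features / all features) for two feature collections."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        return 0.0  # two empty feature sets are treated as identical
    return 1.0 - len(a & b) / len(union)

def pairwise_distances(features_by_protein):
    """Compute the Jaccard distance for every unordered pair of proteins."""
    return {
        (p, q): jaccard_distance(features_by_protein[p], features_by_protein[q])
        for p, q in combinations(sorted(features_by_protein), 2)
    }

# Hypothetical proteins and features, for illustration only.
features_by_protein = {
    "protA": ["GPG", "HKL"],
    "protB": ["GPG", "HKL", "WWD"],
    "protC": ["WWD"],
}
distances = pairwise_distances(features_by_protein)
closest = min(distances, key=distances.get)
print(closest, round(distances[closest], 3))  # ('protA', 'protB') 0.333
```

Under this assumed measure, proteins that share more features are closer together, which matches the idea that proteins group with their most closely related proteins.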