The Mouse, the Millipede, and the Mushroom

Earth’s biosphere, the network of ecosystems across the globe, is incredibly diverse. Scientists have currently identified millions of species ranging from the hardiest bacteria inhabiting deep sea hydrothermal vents to the domesticated dog. The stunning variation among organisms can make it difficult to imagine how they may be related, but, as previously discussed, all known organisms are based on DNA and the cell. With life sharing these basal characteristics, we can deduce that modern organisms had a common origin, that is, they shared a common ancestor. Even such disparate life forms as the mouse, a mammalian vertebrate, the millipede, a myriapod arthropod, and the mushroom, a diverse group of fungi, share a common ancestry. While we know that they shared a common ancestor, how can scientists determine how closely related a set of organisms are?

Traditionally, morphological characteristics, or physical characteristics, of organisms were used to determine their evolutionary relationships. However, a variety of confounding factors can make classification based on morphology an unreliable method. Instead, scientists take advantage of the fact that all organisms share DNA, and use sequencing technologies to compare organisms’ genetic code. Comparative genomics involves comparing the sequenced DNA of organisms to determine evolutionary interrelations for various medical, environmental, and biological applications. Before exploring the finer details of comparative genomics, we’ll take a brief overview of the core of the discipline: evolutionary theory.

Evolution: The Basis of Comparative Genomics

Whether it is known for its controversy or for its role in shaping our modern understanding of the natural world, evolutionary theory is among the cornerstones of natural science. Evolution is generally defined as a series of slow, gradational changes that take place in a population of organisms over time due to selective pressures. Changes attained through evolution occur through natural selection, a process in which the organisms best adapted to their current environment are able to pass on their genetic material. As a result, the advantageous traits are passed on to the offspring and eventually are dispersed across the population. The forces of natural selection are constantly acting on all forms of life, with evolutionary processes occurring over short periods such as years or decades, or long periods up to thousands or even millions of years. Just as evolutionary processes can occur over different spans of time, the scale of evolution can also be highly variable. Natural variation and changes within one species or one population within a species can arise due to selective pressures. Over longer spans of time, evolutionary processes can lead to a significant divergence among the traits of two organisms. This process, known as speciation, is how new species originate.

Some scientists suggest that LUCA and other early organisms may have originated at hydrothermal vents (shown above). Image credit: NOAA.

As discussed in previous lessons, DNA is built from nucleotides that each carry one of four nitrogenous bases: adenine (A), guanine (G), cytosine (C), and thymine (T). The specific order of nucleotides contained in a genetic sequence provides the information necessary to produce proteins through transcription and translation. The basis of evolutionary processes is the occurrence of random mutations, which are changes to the DNA sequence. Some mutations have no effect on the encoded protein, while others lead to the production of proteins that are either nonfunctional or function differently. Mutations can occur through various processes including errors in DNA replication (the process in which DNA is copied during cell division) or interactions with various organic and inorganic substances. Mutations themselves are highly diverse and can occur either at the nucleotide or the chromosomal level. Below are a few categories that are used to classify mutations.

  • Point Mutations (Substitutions): When a single nucleotide in the original genetic sequence is swapped out for a new nucleotide. Point mutations generally have a smaller impact on proteins since they affect at most one constituent amino acid.

  • Frameshift Mutations: A type of mutation caused by the insertion or deletion of a number of nucleotides that is not a multiple of three. Because codons are read three nucleotides at a time, frameshift mutations generally have more significant impacts than point mutations: they shift every codon, and potentially change every amino acid, following the location of the mutation.

  • Missense Mutations: A point mutation that causes a different amino acid to be incorporated into a protein. Missense mutations often alter the function of the protein.

  • Nonsense Mutations: A mutation that causes a regular codon to be replaced with a stop codon (a codon that signals the end of protein synthesis). This typically leads to truncated nonfunctional proteins.
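The categories above can be illustrated with a minimal Python sketch. It assumes coding sequences that start in frame, uses the standard genetic code, and works on an invented toy sequence; the function names translate and classify_mutation are hypothetical and chosen only for this illustration. It also reports one extra category not listed above, the silent substitution, in which the changed codon still encodes the same amino acid.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate an in-frame coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # "*" marks a stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


def classify_mutation(original, mutated):
    """Classify a mutation by comparing an original and a mutated sequence.

    Assumption for this sketch: both sequences begin at the same in-frame
    position, and a length difference is a single insertion or deletion.
    """
    if len(original) != len(mutated):
        if (len(mutated) - len(original)) % 3 != 0:
            return "frameshift"
        return "in-frame insertion/deletion"
    if len(original) % 3 != 0:
        raise ValueError("expected a whole number of codons")
    diffs = [i for i, (a, b) in enumerate(zip(original, mutated)) if a != b]
    if not diffs:
        return "no change"
    if len(diffs) > 1:
        return "multiple substitutions"
    start = diffs[0] - diffs[0] % 3  # first base of the affected codon
    before = CODON_TABLE[original[start:start + 3]]
    after = CODON_TABLE[mutated[start:start + 3]]
    if after == before:
        return "silent"
    if after == "*":
        return "nonsense"
    return "missense"


original = "ATGGAAGTGTGCTAA"  # toy sequence: Met-Glu-Val-Cys-stop
print(translate(original))                             # MEVC
print(classify_mutation(original, "ATGGATGTGTGCTAA"))  # missense (Glu -> Asp)
print(classify_mutation(original, "ATGGAAGTGTGATAA"))  # nonsense (Cys -> stop)
print(classify_mutation(original, "ATGGAGTGTGCTAA"))   # frameshift (one base deleted)
```

Note that missense and nonsense mutations describe the effect of a substitution on the protein, so a single point mutation can belong to either group.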

The fruit fly (Drosophila melanogaster) was among the first organisms to have a sequenced genome. Image credit: BBC.

Darwin’s finch, a native of the Galapagos Islands that was instrumental to the early development of evolutionary theory. Image credit: Shutterstock.

Speciation is responsible for the large diversity of organisms seen today from the mouse to the mushroom. Based on the mechanisms of natural selection and resulting speciation, modern evolutionary theory postulates that all living organisms have a common origin. This common ancestry is traced back to the Last Universal Common Ancestor (LUCA), and speciation over billions of years has resulted in modern levels of diversity.

When considering variation among modern organisms, it is easy to solely consider the observed, phenotypic characteristics that make them seem so different from one another. However, all physical traits are encoded in DNA, the molecule that serves as the origination point of this high level of diversity. Through the processes of transcription and translation, the genetic information encoded in DNA is used to produce proteins that generate the observed phenotypes. Before the discovery of DNA and the development of modern DNA sequencing techniques, evolution was often studied by observing the morphological characteristics of the organisms in question. With the availability of DNA sequencing technology, evolution can be studied by observing DNA and pinpointing the very genes that code for the observed characteristics. While we can take a look at DNA to study evolution, what are the changes in DNA that lead to physical changes in the organisms themselves?

A phenotypic change can result from mutations in DNA, such as that seen in this daisy flower. Image credit: Jake Wintermute.

Since mutations in DNA itself are the raw material worked by natural selection, studying sequenced DNA is among the most effective ways of investigating evolutionary processes. Comparative genomics compares the sequenced genomes of various organisms to determine their evolutionary relationships based on the genetic sequences they have in common. The field relies heavily on evolutionary theory to interpret the computational results obtained by analyzing raw genetic sequences. However, the ability to use DNA sequences and extract data through computational technologies has led to major improvements in evolution research.

A Deeper Look at Comparative Genomics

Computational analyses in comparative genomics rely on the sequenced genomes of two or more organisms that are being compared. Such organisms might be chimpanzees and humans for a phylogenetic analysis of human evolution, two varieties of corn for crop resistance research, or two infectious viruses for clinical applications. Before the sequences are compared, preparatory steps such as genome annotation (see lesson) are used to mark specific places on the genomes that are to be compared. Let’s take a look at some basic methods used in comparative genomics.

Sequence Alignment

One of the most fundamental techniques in comparative genomics is sequence alignment. Sequence alignment involves matching DNA sequences from two or more organisms to identify regions that are similar and dissimilar. Computational programs such as FASTA and BLAST are often used to match the sequences in such a way that similar regions can easily be identified. When sequence alignment has been performed, matrices or boxes are often constructed around segments that are identical in both sequences. Identical segments, known as conserved sequences, are parts of the genetic code that have persisted through time and that were likely present in the common ancestor of the organisms being compared. Those stretches of DNA that are not identical are generally considered to be responsible for differences between the organisms. Depending on the focus of the research and the number of organisms used, one of several sequence alignment techniques can be applied.

  • Global Alignment: Global alignment involves aligning the entire genome of an organism or all residues present from the sample taken. It is associated with the Needleman-Wunsch algorithm and is often used with organisms that are closely related.

  • Local Alignment: Local alignment involves aligning smaller portions of a DNA sequence to target specific portions of the genomes. This class of alignment is often achieved through use of the Smith-Waterman algorithm. Local alignment is often used when comparing genetic sequences that have fewer similarities but that contain known regions with suspected similarity.

  • Pairwise Alignment: A common type of sequence alignment that involves the comparison of the genomes of two organisms. Pairwise alignment seeks to find the most informative and accurate global or local alignment to identify similarities among the two genomes.

  • Multiple Sequence Alignment: A type of sequence alignment in which the genomes of three or more organisms are compared. Like pairwise sequence alignment, it aims to find the most informative alignment of all genomes involved. Multiple sequence alignment is commonly used when constructing phylogenetic trees.
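To make the difference between global and local alignment concrete, here is a minimal sketch of the scoring step of the Needleman-Wunsch and Smith-Waterman algorithms for a pair of sequences. The scoring values (match +1, mismatch -1, gap -1) and the toy sequences are assumptions chosen for illustration, and the sketch only computes the best score rather than reconstructing the alignment itself. Real analyses would normally rely on established tools rather than hand-written code.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]  # first row: gaps only
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # first column: gaps only
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score: only the best-matching region counts."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below zero, so a poor region can be abandoned.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Toy sequences sharing the region GATTACA inside unrelated flanks.
seq_1 = "TTTTGATTACATTTT"
seq_2 = "CCCGATTACACCC"
print(smith_waterman_score(seq_1, seq_2))    # 7: the shared region alone
print(needleman_wunsch_score(seq_1, seq_2))  # lower: the flanks are penalized
```

The only structural difference between the two functions is the floor of zero in the local version, which is what lets it focus on a region of suspected similarity within otherwise dissimilar sequences.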

Having identified some types of sequence alignment, let’s review an example to better understand the process. The example we’ll use will involve a local alignment and the diagram shown below.

Samples of locally aligned sequences of human and chimpanzee genomes. Note: the sequences above are not intended to be accurate and are only for demonstrative purposes.

In our example, specific sections of the genomes of a human and a chimpanzee are being compared. Suppose that this particular section of the genomes codes for Protein X in chimpanzees and Protein Y in humans. The sequences have been aligned using local alignment techniques, and a pairwise alignment has been achieved since only two sequences are being compared. The portions of the sequences that are identical (conserved) have been marked by grey boxes that serve as matrices. Since the sequences have been aligned, it can be inferred that the identical sequences correspond to one another and would likely have been found in the common ancestor of humans and chimpanzees. The sections of the genomes that are not enclosed by grey boxes indicate differences in the genomes that cause the changes seen in Proteins X and Y.
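The grey boxes described in this example can be mimicked with a short sketch that scans two already-aligned sequences and reports the runs of identical positions. The sequences below are invented for demonstration, the "-" character is assumed to mark an alignment gap, and the function name conserved_blocks and the min_length cutoff are hypothetical choices for this illustration.

```python
def conserved_blocks(seq_a, seq_b, min_length=3):
    """Return (start, end, bases) for each run of identical aligned positions.

    Both inputs must already be aligned, so they have equal length; "-" marks
    a gap. Positions are 0-based and `end` is exclusive.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    blocks = []
    start = None
    for i, (x, y) in enumerate(zip(seq_a, seq_b)):
        same = x == y and x != "-"
        if same and start is None:
            start = i  # a conserved run begins here
        elif not same and start is not None:
            if i - start >= min_length:
                blocks.append((start, i, seq_a[start:i]))
            start = None
    if start is not None and len(seq_a) - start >= min_length:
        blocks.append((start, len(seq_a), seq_a[start:]))
    return blocks


# Invented, pre-aligned toy sequences (not real human or chimpanzee data).
human = "ATGCCGTA-GGCTTAC"
chimp = "ATGCCTTAAGGCTGAC"
for start, end, bases in conserved_blocks(human, chimp):
    print(start, end, bases)
# 0 5 ATGCC
# 9 13 GGCT
```

Positions outside the reported blocks correspond to the unboxed regions, where the two sequences differ.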

Evolution as a Bigger Picture

Sequence alignment techniques are fundamental to comparative genomics, being the very basis of how scientists are able to compare genomes and divine evolutionary relationships. The example of sequence alignment in the previous section outlined a local comparison of the genomes of chimpanzees and humans to highlight regions of similarity. When comparing two genomes, the goal is often to see how closely they match, or to quantify their similarity. While this numerical value is certainly useful, it means little if there is not a broader context for relative comparisons. In other words, if we obtain a quantitative value for how closely the chimpanzee and human genomes match one another, it is difficult to interpret in terms of evolutionary relatedness if we do not have other organisms for comparison. So, how do we determine the evolutionary relatedness of species and their relationship to a common ancestor?
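One common way to quantify similarity is percent identity, the share of aligned positions at which two sequences carry the same base. The sketch below is a minimal illustration using invented sequences, and it assumes one particular convention (identical columns divided by the total alignment length, with gap columns counted as differences); other conventions exist. It also shows why a single number needs context: the human and chimpanzee value only becomes informative next to a comparison with a third organism.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of aligned columns where both sequences have the same base.

    Assumed convention for this sketch: gap columns ("-") count as
    differences and the denominator is the full alignment length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for x, y in zip(seq_a, seq_b) if x == y and x != "-")
    return 100 * matches / len(seq_a)


# Invented, pre-aligned toy sequences (not real genomic data).
human = "ATGCCGTAGGCTTACG"
chimp = "ATGCCGTAGGCTTACA"
mouse = "ATGACGTTGGCATACA"

print(percent_identity(human, chimp))  # 93.75
print(percent_identity(human, mouse))  # 75.0
```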

The answer may seem simple: multiple sequence alignment. Multiple sequence alignment is critical for obtaining quantitative data and determining how closely the genetic sequences of various organisms match one another, but proper interpretation of this data relies on an understanding of systems of biological classification. Biological classification, also known as taxonomy, is a scientific framework used to group organisms based on their relatedness to one another. To understand how multiple sequence alignment is used in evolutionary research, let’s take a closer look at taxonomy.

Organizing Evolution: A Taxonomic Tour

Evolution is complex and nature generally resists our attempts to constrain it to neat, organized boxes; however, the field of taxonomy aims to provide an organizational hierarchy for biological life. One of the earliest forms of taxonomy that came into widespread use was developed by naturalist Carl Linnaeus and is known as the Linnean classification system. The Linnean system is organized into well-defined categories (though there have been modifications over time) starting from the broadest groups and increasing in specificity until it reaches the organisms themselves. Traditionally, the Linnean system begins with Kingdom as the broadest group and descends the hierarchy with Phylum, Class, Order, Family, Genus, and Species. As one moves downwards from Kingdom, the organisms within each group become increasingly closely related until the individual species themselves are reached. The Linnean system has been in use for centuries and continues to be taught and recognized, but a number of shortcomings have led to significant changes in taxonomy. Firstly, many of the earliest valid classifications made with the system relied on a very limited knowledge of evolutionary processes and on the morphological characteristics of organisms. Secondly, the rigid grades erected by Linnaeus and others do not always properly convey the evolutionary relationships of various groups of organisms.

The Marabou stork is placed within the class Aves, but, like other birds, it is closely related to crocodilians.

The Linnean system has been largely replaced by the field of cladistics, a taxonomic classification system that relies on evolutionary relationships. In cladistics, morphological features can be employed as they are in the Linnean system, yet more recent studies have relied largely on DNA sequencing. When classifying organisms cladistically, their lineages are traced back to their most recent common ancestor. Starting from the common ancestor, the lineages of each organism are traced to where they are today. As the organisms acquire certain characteristics, they are said to ‘branch off’ from the organisms before them. Subsequently, more organisms acquire newer characteristics and branch off from the newly formed branches, thus forming a complex network often known as the Tree of Life. Each of these new branches is known as a clade. While cladistics can be thought of as a gradational system like Linnean taxonomy, it lacks the fixed hierarchy seen in the latter. Clades, which are simply groups of organisms that have a shared ancestry, can be large and encompass a diverse group of distantly related organisms or they can be highly specific and only encapsulate a very small, closely related group. For example, the large clade Sarcopterygii contains all lobe-finned fish and tetrapods (four-limbed vertebrates), while the specific clade Globidonta only contains modern alligators and a few of their extinct relatives. When expressing the evolutionary relationships in cladistics, diagrams known as cladograms are often constructed. Cladograms show the relationships between organisms and often can be used to trace how recently (relative to one another) organisms diverged and formed new clades. To better understand cladistics and the concept of cladograms, we will construct a simple cladogram.

The structure of a basic cladogram is shown in the diagram to the right side of the page. Each organism’s placement on the cladogram is marked by a distinct branch and the most recent common ancestor is labeled at the base of the diagram.

The American alligator is designated as Alligator mississippiensis under the Linnean system of classification. Image credit: Ianare Sevi

A common example of a pitfall of the Linnean system is the classification of reptiles. The class Reptilia within the Linnean system includes what we might typically think of as reptiles: snakes, lizards, crocodiles, tortoises and others. While many of these organisms are closely related to one another, crocodilians in particular reside within a group known as Archosauria. Evolutionary studies on crocodiles and the group Archosauria show that crocodilians are far more closely related to birds than to other reptiles. Birds, however, are classified in the completely separate class Aves. To an observer, crocodilians may seem very similar to lizards and snakes, but components of their anatomy and genome reveal their shared ancestry with birds. As such, the Linnean classification system fails to portray true evolutionary relationships by grouping crocodilians with reptiles instead of their closest living relatives, the birds. With our modern understanding of evolution and living organisms, another method of classification has risen to dominance in current research.

With the outgroup labeled, we can observe the homologies of the remaining organisms and identify any derived traits. Looking at the table, the most basal characteristics are to the left and the most derived characteristics are to the right. Taking this into account, we can place the remaining organisms on the cladogram as shown in the diagram to the right.

A cladogram showing the evolutionary relationships between various species of sunfish. Image credit: (Smith & Bailey, 1961).

Constructing a Cladogram

To begin making a cladogram, a group of organisms must be selected for classification. These organisms can range from being closely related to very distantly related, yet it is simplest if they have easily observable characteristics. Historically, cladograms were often constructed using morphological characteristics due to technological limitations. In modern research, DNA sequencing and multiple sequence alignment are used for cladogram construction, which is an application of comparative genomics! Despite advances in DNA sequencing and interpretation, modern scientists continue to use morphological characteristics to classify extinct organisms through fossil remains. In our example, we will be using select morphological characteristics of six species: Species T, V, W, X, Y, and Z. To begin, we will construct a table that lists each of these species and six physical characteristics that are thought to be important to their evolutionary history.

The table above lists each of the species previously identified and six important morphological characteristics. Using these characteristics, the evolutionary relationships of these organisms can be inferred. However, it is important to note that physical characteristics are not always as reliable as sequenced DNA, a point that will be discussed later. To begin building our cladogram, we must identify the organism that differs the most from the rest, or the outgroup. Generally, the outgroup will possess basal (ancestral) characteristics and will be the first organism to branch off from the most recent common ancestor of all the organisms in question. In the case of our table, the most basal characteristic is the notochord, which is shared by all of the species listed. Moving past the notochord, we notice that all of the organisms have the second most basal characteristic, jaws, with the exception of Species W. As such, we know that Species W is the outgroup and the first organism to diverge from the most recent common ancestor. Once the outgroup has been identified, we look for characteristics that are held in common among the organisms. When similar features are an artifact of shared evolutionary ancestry, they are termed homologies. In the table above, homologies can be identified by finding organisms that share the features. Lastly, we identify derived characteristics. As opposed to basal characteristics, derived characteristics are generally specially evolved features that appear near the end of branches on a cladogram. They are often highly specialized and can indicate that an organism has undergone extensive evolutionary change since it diverged from the original lineage. Now that we have outlined the steps, we can begin building our cladogram!
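The steps just outlined can be sketched in code. The character table below is an invented stand-in, not the lesson's actual table: only the notochord, the jaws, and the placement of Species W as the outgroup come from the text, while the remaining characters ("character 3" through "character 6") and the assignments for the other species are placeholders. The sketch also assumes the simplest possible case, in which each species has every characteristic of the species that branched off before it.

```python
# Characters listed from most basal (left) to most derived (right).
# Only "notochord" and "jaws" are named in the text; the rest are placeholders.
CHARACTERS = ["notochord", "jaws", "character 3", "character 4",
              "character 5", "character 6"]

# Hypothetical table: the set of characters each species possesses.
TABLE = {
    "T": set(CHARACTERS[:2]),
    "V": set(CHARACTERS[:3]),
    "W": set(CHARACTERS[:1]),  # lacks jaws, so it is the outgroup
    "X": set(CHARACTERS[:4]),
    "Y": set(CHARACTERS[:5]),
    "Z": set(CHARACTERS[:6]),
}


def branching_order(table):
    """Order species from the outgroup to the most derived species."""
    order = sorted(table, key=lambda species: (len(table[species]), species))
    # This sketch only handles strictly nested character sets.
    for earlier, later in zip(order, order[1:]):
        if not table[earlier] <= table[later]:
            raise ValueError("characters are not nested; cannot build a ladder")
    return order


def ladder_cladogram(order):
    """Write the branching order as nested parentheses, outgroup first."""
    tree = order[-1]
    for species in reversed(order[:-1]):
        tree = f"({species},{tree})"
    return tree


order = branching_order(TABLE)
print(order[0])                    # W (the outgroup)
print(sorted(TABLE["T"] & TABLE["V"]))  # homologies shared by T and V
print(ladder_cladogram(order))     # (W,(T,(V,(X,(Y,Z)))))
```

Real character tables are rarely this tidy, which is one reason dedicated phylogenetic software is used in practice.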

As noted above, we will start with the outgroup. Based on what we have discussed previously, we identified Species W as the outgroup in our example. Since the outgroup is the organism that diverged earliest in time, we can place Species W on the first branch of our cladogram as shown in the diagram on the left.

Congratulations, you have successfully constructed a cladogram! This example is clearly a simple cladogram and relies on morphological characteristics. As mentioned earlier, some modern studies continue to use morphological characteristics when constructing cladograms. This is particularly common when using fossil remains and significant improvements have been made in computational programs and machine learning algorithms that can accurately identify characteristics that are evolutionarily significant. Most modern studies, however, rely on sequenced DNA to build complex cladograms. Now that we have explored some taxonomy, let’s take a look at the role comparative genomics plays in the modern frontiers of taxonomy.

Comparative Genomics and Phylogenetic Trees

Comparative genomics dominates modern taxonomy, as comparison of DNA sequences is generally more reliable than comparison of morphological features. In certain cases, features that seem similar are not truly homologous but evolved independently in a case of convergent evolution. While convergence can also occur in DNA sequences, sequenced DNA is generally far more reliable for identifying evolutionary relationships between organisms. In phylogenetics, the broader field based on evolutionary relationships that includes cladistics, comparative genomics is used to make complex diagrams known as phylogenetic trees. Phylogenetic trees operate on the same principle as cladograms and seek to demonstrate the evolutionary relationships of organisms as they relate to their most recent common ancestor. However, phylogenetic trees are generally more reliable than cladograms and contain information about the genetic distance of the organisms involved.

A phylogenetic tree constructed for mosasaurs, an extinct group of marine reptiles. Image credit: (Makadi et al., 2012)

An extinct group of reptiles known as phytosaurs closely resemble crocodilians, but this is only a case of convergent evolution. Image credit: Nobu Tamura.

When constructing phylogenetic trees, multiple sequence alignment is used to compare the genomes of the various organisms being investigated. The genomes are compared using computational software and conserved sequences are identified. Once this has been done, DNA homology, or how closely the genomes match one another, is assessed to determine how closely related the organisms are relative to one another. Just as with cladograms, the organism with the lowest homology or with the greatest number of ancestral sequences is classified as the outgroup. From there, homologies are compared to find where organisms fall relative to one another and how they branch out from their respective lineages. In some cladistic analyses, specific alleles (genetic variants of a specific trait that can be inherited) can be compared across genomes to identify which are most similar to the ancestral population and which are more derived. Phylogenetic trees are extremely useful for determining how organisms arose and how closely related different species are to one another. They can be created for any group of organisms and even for viruses (although viruses are typically not considered organisms), and, as a result, can be used for a wide variety of applications.
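As a rough illustration of how sequence comparisons can be turned into a tree, the sketch below computes a simple genetic distance (the fraction of aligned positions that differ) for every pair of sequences and then joins the closest groups step by step. The joining method used here is UPGMA, a basic clustering technique chosen for this example; the text above does not name a specific method, and research-grade trees are usually built with more sophisticated approaches. The sequences and species labels are invented, and UPGMA additionally assumes that all lineages change at a roughly constant rate.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)


def upgma(sequences):
    """Cluster aligned sequences with UPGMA; return a nested-parentheses tree."""
    sizes = {name: 1 for name in sequences}  # cluster label -> number of leaves
    dist = {frozenset(pair): p_distance(sequences[pair[0]], sequences[pair[1]])
            for pair in combinations(sequences, 2)}
    while len(sizes) > 1:
        # Pick the closest pair of clusters (label order breaks exact ties).
        a, b = sorted(min(dist, key=lambda p: (dist[p], sorted(p))))
        merged = f"({a},{b})"
        for other in sizes:
            if other in (a, b):
                continue
            # Size-weighted average distance from the merged cluster.
            dist[frozenset((merged, other))] = (
                dist[frozenset((a, other))] * sizes[a]
                + dist[frozenset((b, other))] * sizes[b]
            ) / (sizes[a] + sizes[b])
        dist = {p: d for p, d in dist.items() if a not in p and b not in p}
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))


# Invented, pre-aligned toy sequences (not real genomic data).
sequences = {
    "human": "ATGCCGTAGGCTTACG",
    "chimp": "ATGCCGTAGGCTTACA",
    "mouse": "ATGACGTTGGCATACA",
    "fish":  "TTCACCTTGCAATTCA",
}
print(upgma(sequences))  # (((chimp,human),mouse),fish)
```

In the result, the most distant sequence joins last, which corresponds to the outgroup described above.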

Applications of Comparative Genomics

Comparative genomics is among the key fields within the larger discipline of genomics. We have already explored the use of sequence alignment as a fundamental technique in the field and its application for the determination of evolutionary relationships and the construction of phylogenetic trees. While its use in phylogenetics alone demonstrates its importance as a field, the impacts of comparative genomics reach far beyond our study of the evolution of life on Earth. Below are a few fields that rely on comparative genomics to solve modern day issues and advance our scientific understanding of the natural world.

  • Agriculture: Comparative genomics has become increasingly important in agriculture, particularly concerning genetically modified organisms (GMOs). When a certain variety of crop or a wild plant shows increased resistance to disease or an environmental condition, its DNA can be sequenced. Sequenced DNA can be analyzed and compared to that of other plants to determine which genes are conferring the advantageous traits. Breeding programs can be used to further distribute these genes throughout the population, or modification techniques can be used to incorporate these genes into the genomes of existing crops. Advances in comparative genomics and its use in agriculture allow us to develop faster growing and more resistant crops to feed a growing world population.

  • Vaccines: Vaccines are among our most important defenses against deadly viruses, and their development is aided by comparative genomics. Although viruses are not defined as being alive, they do contain genetic material (either DNA or RNA) and are subject to evolutionary processes. Their genetic code can be sequenced and analyzed to identify certain key genes in their genetic sequences and to determine the evolutionary relationships among viruses. Identifying genes associated with certain key traits and assessing their presence among several viral genera is instrumental to developing vaccines that can target viruses. Due to the rapid mutation rates of certain viruses and the diversity among closely related viruses, developing vaccines that can target more than one virus or continue to work despite the threat of mutations is at the forefront of clinical science. The use of comparative genomics to engineer more effective vaccines will continue to shape modern medicine.

  • Personalized Medicine: One of the most difficult aspects of clinical science is the variation among populations and the effects that can have on medical treatments. Due to their genetic composition, certain people may experience severe side effects when treated with certain medications or the medications may not be effective. Personalized medicine is a discipline that tailors medical treatment to the specific needs of the patient depending on their genome. This can often be a difficult and cost-intensive task to complete for each individual patient, yet it can be made far more efficient using comparative genomics. DNA from individuals across the population can be sequenced and compared using multiple sequence alignment. Specific genes can then be identified that may make a person or multiple individuals across the population more susceptible to a particular treatment option. As such, comparative genomics is deeply intertwined with the future of personalized medicine.

Corn is among the most common genetically modified crops. Image credit: Katherine Volkovski.

The rapid mutation rates of certain viruses make the development of more effective vaccines a focus of clinical science. Image credit: CDC.

Personalized medicine can help save many lives through safe and effective treatments. Image credit: Kayla Marauis.

From phylogenetic studies to vaccine development, comparative genomics is a scientific field essential to diverse areas of research. The possibility of comparing DNA sequences of any organisms has rapidly expanded our understanding of the natural world and has influenced our emerging biological technologies. With continued developments in computers and computational technology, our ability to practice comparative genomics will only expand to provide deeper scientific knowledge and increased accessibility to novel biological technology.