The Trenches of Discovery: The human machine: coding and decoding

The previous post in this series can be found here.

It's been an exciting week for molecular biologists, and should have been for everyone else too! This week, the Encyclopaedia of DNA Elements programme has revealed its first results about the role of the 99% of the human genome that has, until now, represented a fairly sizeable gap in our understanding of how DNA works. This has made big waves in biological circles and has to some extent penetrated the mainstream media, on the BBC for example, but I thought I'd herald this great work by giving you a brief explanation of what DNA is, how it works, and why so much of it was a bit of a mystery until now.

From humble beginnings

DNA is unbelievably complex yet unbelievably simple at the same time. The principles upon which it is based are extremely simple: a string of code made up of four chemical units (called nucleotides: G, C, A and T) on two intertwined strands where units on opposite strands are paired either G:C or A:T. The complexity that arises from such a basic principle emerges in much the same way that vastly complicated computer programs can emerge from the binary 1 and 0 system of computer code: in both cases a simple code stores information that, when read correctly, is vast. And when you have around 3 billion units of this code in every cell in your body, that vastness can quickly become unfathomable!
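For the programmers among you, the pairing rule and the binary analogy above can be sketched in a few lines of Python. This is only a toy illustration: the function names and the example sequence are made up, and the storage estimate simply assumes each of the four letters takes 2 bits, ignoring everything else about how DNA is packaged in a cell.

```python
# Pairing rules from the text: G pairs with C, A pairs with T.
PAIRS = {"G": "C", "C": "G", "A": "T", "T": "A"}

def partner_strand(strand: str) -> str:
    """Return the bases that sit opposite each base of `strand`, in the same order."""
    return "".join(PAIRS[base] for base in strand)

def reverse_complement(strand: str) -> str:
    """The partner strand read in its own direction (the two strands run antiparallel)."""
    return partner_strand(strand)[::-1]

# A made-up example sequence.
print(partner_strand("GATTACA"))      # CTAATGT
print(reverse_complement("GATTACA"))  # TGTAATC

# Four possible letters means 2 bits per nucleotide (2**2 == 4).
BITS_PER_BASE = 2
GENOME_BASES = 3_000_000_000  # "around 3 billion units", as quoted in the text
genome_megabytes = GENOME_BASES * BITS_PER_BASE / 8 / 1_000_000
print(genome_megabytes)  # 750.0
```

So, under those assumptions, one strand's worth of letters would fit in roughly 750 megabytes, and knowing one strand is enough to reconstruct the other.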

Nonetheless, we've come a hell of a long way in the last 60 years. It was only in 1952 that the Hershey-Chase experiment conclusively demonstrated that DNA and not protein, as had also been suggested, was the information-carrier of the cell. Just a year later Watson, Crick, and Franklin discovered the now famous 'double-helix' structure of DNA, and the race was well and truly under way to decipher this mysterious molecule.

Being the information-carrier of the cell, DNA must encode the instructions for the synthesis and organisation of all components of that cell - from the chemical structure of the fatty acids in its membranes, to the architecture for its own replication. Early researchers quickly realised that DNA could not be doing all this alone and instead must be delegating responsibility to other molecules: proteins. Proteins are similar to DNA in that they are made up of repeating units. However, they are much more chemically complex, as their units (amino acids) have much more varied chemical structures than the nucleotide units of DNA, and there are 20 of them rather than just the 4 units in DNA. Thanks to this chemical versatility, proteins are responsible for the majority of the active processes that occur in your cells, like molecule-sized robots fulfilling their pre-programmed function. Despite this, DNA is vital as it is its code that 'programs' the proteins to do their jobs properly. Deciphering this was one of the most significant achievements in early biochemistry, and gave rise to the science of molecular biology.

Reading the code

So, how does DNA influence the activity of proteins and thereby control everything about us? The answer is what has become known as the 'central dogma' of molecular biology, a kind of holy grail for biologists! The central dogma was deciphered in the late 1950s and refined in the following decades. It states that DNA acts as a template for the synthesis of far smaller molecules called messenger RNA (mRNA for short) in a process known as 'transcription'. mRNA has an identical sequence to a short strand of the DNA in your genome (aside from one or two cosmetic chemical changes that we won't go into) and so is a kind of temporary copy of the DNA code. mRNA then undergoes a second process called 'translation', in which protein factories known as ribosomes convert the mRNA code into a protein code, spitting out a protein in the process. Translation works on the basis that every three nucleotides in the mRNA code is matched to one amino acid in the protein code. So, for example, a sequence that reads CCGAGAGGG on the mRNA would be translated into a protein sequence of proline-arginine-glycine (all amino acids) because CCG encodes proline, AGA arginine and GGG glycine. This is known as the 'genetic code' and is the way in which the nucleotide sequence of your mRNA is decoded into a string of amino acids. Since the original DNA sequence stored in your genome determines the transcribed mRNA code, it is therefore responsible for the sequence of every protein produced in your body, and consequently dictates the activity and regulation of nearly every process in every cell. Whenever you hear someone mention a gene, they are talking about a specific segment of DNA that encodes a single protein. 
A single mutation in a gene can have disastrous consequences for the unfortunate owner: sickle cell anaemia, for example, can be caused by the swapping of a single A nucleotide for T in the beta-globin gene of haemoglobin, resulting in a glutamate amino acid being replaced with a valine in the final protein and so giving rise to the problems associated with the disease.
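If it helps to see the central dogma as code, here is a minimal Python sketch of transcription and translation using the examples above. The codon table is deliberately partial (only the five codons needed here; the real genetic code has 64), the function names are my own, and the sketch assumes we are reading the DNA strand whose sequence matches the mRNA, with U written in place of T.

```python
# Partial codon table: only the codons used in this example.
CODON_TABLE = {
    "CCG": "proline",
    "AGA": "arginine",
    "GGG": "glycine",
    "GAG": "glutamate",
    "GUG": "valine",
}

def transcribe(coding_dna: str) -> str:
    """mRNA carries the same sequence as the coding DNA strand, with U in place of T."""
    return coding_dna.replace("T", "U")

def translate(mrna: str) -> list[str]:
    """Read the mRNA three nucleotides at a time and look up each codon."""
    codons = [mrna[i:i + 3] for i in range(0, len(mrna) - 2, 3)]
    return [CODON_TABLE[codon] for codon in codons]

# The example from the text.
print(translate(transcribe("CCGAGAGGG")))  # ['proline', 'arginine', 'glycine']

# The sickle cell change: a single A swapped for T in one codon.
print(translate(transcribe("GAG")))  # ['glutamate']
print(translate(transcribe("GTG")))  # ['valine']
```

One letter changes in the DNA, one amino acid changes in the protein, and that is enough to cause the disease.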

DNA makes RNA makes protein: all hail the central dogma of molecular biology!

The genetic code - the Rosetta stone of biology (NB. in this figure, U=T)

DNA - why so much?

Having got this far, molecular biologists were giving themselves a well-deserved pat on the back in the latter half of the twentieth century. We had uncovered the secrets behind DNA's mysterious code and most definitely broken the back of the project that was understanding it! The next step was to determine the sequence of the entire human genome, which would then give us the sequence of all the proteins expressed in humans and we could then busy ourselves figuring out what all of the proteins actually do. This was the very profound goal of the famous Human Genome Project; biology's answer to the Large Hadron Collider! The HGP was launched in 1990 and completed in 2003 having sequenced over 99% of the 3 billion or so nucleotide pairs spread over the 23 chromosome pairs of the human genome. Its findings were announced with great fanfare to an eagerly waiting media desperate to know what the instruction manual for humanity had revealed. Well, in terms of its initial goals, the HGP was a resounding success. It did, indeed, map the entire human genome and reveal every single protein-encoding gene therein (around 23,000 of them in total). Having these sequences has revolutionised molecular biology - everyone working in the field (myself included) will use data collected by the HGP on a nearly daily basis and much current research would not be possible without it.

The HGP did, however, also throw up a very significant question: why do we have so much DNA? The 23,000-odd genes identified by the project accounted for a little over 1% of the total sequence that was decoded - what is the other 99% doing? Moreover, almost all of the genes we have are fairly similar to our nearest evolutionary neighbours (primates and other mammals) and about half are common to most complex organisms - why are all of these species so different when they have so much genetic common-ground?
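To get a feel for the scale of that puzzle, here is a rough back-of-the-envelope calculation in Python. It rounds "a little over 1%" down to exactly 1% and treats every gene as the same size, both of which are simplifying assumptions of mine rather than figures from the HGP, so the results are ballpark numbers only.

```python
GENOME_BASES = 3_000_000_000  # roughly 3 billion nucleotide pairs
CODING_PERCENT = 1            # assumption: "a little over 1%" rounded to 1%
GENE_COUNT = 23_000           # protein-coding genes, as quoted in the text

coding_bases = GENOME_BASES * CODING_PERCENT // 100
non_coding_bases = GENOME_BASES - coding_bases

# Average coding length per gene, and the protein length that implies
# at three nucleotides per amino acid.
bases_per_gene = coding_bases / GENE_COUNT
amino_acids_per_protein = bases_per_gene / 3

print(coding_bases)                    # 30000000
print(non_coding_bases)                # 2970000000
print(round(bases_per_gene))           # 1304
print(round(amino_acids_per_protein))  # 435
```

In other words, on these assumptions about 30 million letters spell out proteins, and nearly 3 billion are left unexplained.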

Reading between the lines

Answering these questions is the not unambitious goal of the Encyclopaedia of DNA Elements (or ENCODE, as some media-savvy person has coined it). ENCODE aims to identify every part of the human genome that has some biological functionality and work out precisely what it's doing. For a long time DNA that did not encode proteins was lumped together under the highly unsatisfactory term 'junk DNA' even though most molecular biologists were of the opinion that it couldn't be junk at all and must be doing something. These days such DNA is instead referred to as 'non-coding' to reflect the fact that it doesn't encode protein.

The simultaneous publication this week of 30 papers by the ENCODE consortium has marked a profound moment in our understanding of what this non-coding DNA is up to and is, in my opinion, probably more significant in helping to explain the differences between species than the work of the HGP. As you may have guessed, ENCODE has indeed confirmed that 'junk' DNA is far from it, and in fact at least 80% of the human genome is there for a reason. So, what's it doing? The vast majority of it seems to be involved in the regulation of transcription of different genes at different times, to different extents, and in different cell types. This is because genes are not all expressed equally: they can be turned on and off depending on signals originating from within and without the cell, and most are not active at any one time in any one cell. For example, your pancreatic cells are happily churning out insulin protein because the INS gene that encodes it is active in those cells. The rate of insulin production is partly regulated by the rate at which mRNA is transcribed from INS and is subject to regulation from complex signalling within the cell that reports things such as glucose and energy levels within the cell. Conversely, the neurons in your brain, or retinal cells in your eye will never produce insulin because the INS gene is inactive in those cells, even though it is present. Such stimulus- and tissue-specific regulation of gene expression seems to be the job of the vast majority of the DNA in your genome.

This is achieved in a number of ways. Some non-coding bits of your DNA don't code for protein but still manage to get transcribed onto mRNA either before or after (or tucked inside in some cases) the gene that it encodes; an example being riboswitches. Once there they are able to bind to various proteins or other nucleotide-based regulators and either promote or repress translation of that mRNA into protein and so regulate the activity of the gene in question. Other segments of DNA are transcribed into a number of different types of RNA that directly intervene in the activity of protein-coding mRNA. For example, small interfering RNA (siRNA) is able to silence the activity of genes that share a similar sequence to it: it is cut from longer double-stranded RNA by an enzyme called 'dicer' and then guides a protein complex called RISC, which chops up any mRNA with a matching sequence. Similarly, microRNA is able to directly bind to mRNA and promote its degradation, thereby silencing the gene in question. Other examples, such as piwi-interacting RNAs and antisense RNAs, also function via similar means.
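The common thread in these small RNAs is sequence matching: they find their target because their letters pair up with letters on the mRNA. A toy Python sketch of that idea follows. Everything in it is hypothetical, including the sequences and the function names, and it demands a perfect match over a 7-letter stretch, whereas real small RNAs are around 20 letters long and microRNAs in particular tolerate mismatches.

```python
# RNA pairing rules: A pairs with U, G pairs with C.
RNA_PAIRS = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement(rna: str) -> str:
    """Sequence that would pair with `rna`, read in the opposite direction."""
    return "".join(RNA_PAIRS[base] for base in reversed(rna))

def find_target_sites(mrna: str, small_rna: str) -> list[int]:
    """Return every position in `mrna` where `small_rna` could pair perfectly."""
    site = reverse_complement(small_rna)
    last_start = len(mrna) - len(site)
    return [i for i in range(last_start + 1) if mrna[i:i + len(site)] == site]

# Made-up sequences for illustration only.
mrna = "AUGCCGAGAGGGUAA"
small_rna = "CUCUCGG"

print(reverse_complement(small_rna))       # CCGAGAG
print(find_target_sites(mrna, small_rna))  # [3]
print(find_target_sites(mrna, "AAAAAAA"))  # []
```

An mRNA with a matching site gets flagged for silencing; one without it is left alone, which is how a short stretch of non-coding DNA can pick out one gene among thousands.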

The mechanisms described above are all post-transcriptional (i.e. they effect changes after transcription from DNA to mRNA has taken place), but the majority of non-coding DNA is actually involved in pre-transcriptional regulation that influences the extent to which mRNA is produced in the first place. The most fundamental of these are called 'promoters' and they do exactly that: promote the expression of a specific gene. Promoters are segments of DNA that sit just upstream of protein-coding genes and bind to the cellular machinery responsible for the transcription of that gene into mRNA. Without a promoter, a gene will not be transcribed and so will have no activity.

DNA is also capable of being bound directly by regulatory proteins and much of the genome is there to act as a scaffold for proteins that bind and influence gene expression. Proteins that directly influence the rate of transcription from genes are broadly termed 'transcription factors' and they can either promote or repress gene expression depending on what they are binding to. A very common mechanism by which cells enter different states of activity (such as the activation of a T cell, or the differentiation of a stem cell) is by having one key transcription factor that is activated by a certain stimulus and then promotes the expression of many other transcription factors that turn the relevant genes either on or off to achieve the desired outcome. The DNA segments that bind to transcription factors are usually called either 'enhancers' or 'silencers' depending on their effect, and these make up yet more of the non-coding DNA of the genome.

The p53 transcription factor binding to an enhancer region (from Visual Life Sciences)

An interesting aspect of transcription factors is that they can influence genes a long way from the enhancer or silencer region that they bind to. This is because these DNA segments may be very distant in terms of the genomic sequence, but very close together in space because DNA is coiled and organised by its interactions with DNA-packing proteins called histones. For this reason, some of the DNA in your genome has a primarily structural role; ensuring that genes and their transcription factors are correctly positioned in space to be able to interact. Predictably, these structural regions can also regulate expression by altering shape due to altered interactions with histones and other proteins, thereby influencing which transcription factors have the greatest influence over a specific gene.

The shape of DNA also regulates gene activity.

On top of all of this complexity is an added layer of regulation in which histones can be chemically modified by a variety of proteins to either promote or repress gene expression from attached DNA. The language of histone modifications is extremely complicated and we are only just coming to terms with how it works, but suffice it to say it requires the recognition of specific segments of non-coding DNA to work properly. 'Junk' indeed?

Complexity out of simple rules

The sheer level of complexity of this whole system is quite overwhelming, but it does help to explain how species with seemingly similar genomes can be so strikingly different, and how sophisticated biological machines such as you and me can emerge from just 23,000-odd active parts. It's much like the popular (in the UK at least) children's toy Meccano: the parts are relatively simple and limited in number, but they can be combined in a practically endless variety of ways to generate different cell types, organs, and species.

Where next?

First off, ENCODE is far from finished - it has only looked at 13 of its final 60 histone modifications, and 120 of about 1,800 transcription factors. Moreover, part of its goal is to map the activity of these different DNA segments in different tissues and under different conditions in order to help shed some light on how this network of regulation really works, and it still has a long way to go in that regard too! Nonetheless, it is striving to achieve the ultimate aim of molecular biology - to identify every interaction that occurs between molecules in the cell and how they affect each other. Once we have that (if we ever do), we will truly understand how we work in a biologically complete sense. Uncovering the central dogma was the fundamental aim of molecular biology in the middle of the twentieth century; in the twenty-first it looks like it is going to be understanding the myriad accessories and accoutrements that enshroud this core system and give such beautiful complexity out of humble origins.

The next post in this series can be found here.

8 comments:

Shaun Hotchkiss, September 27, 2012 at 3:25 AM
I finally got around to reading this. I have some not so serious questions.

My distant, distant, great grandparent had gills. Is it possible that there is a gene lurking somewhere in my DNA that knows how to grow gills, but it just isn't turned on any more? If the answer to that is yes, could one in principle insert the relevant DNA to somehow turn gills back on?

Essentially, the serious question behind that is how much of my DNA is vestigial and turned off as opposed to completely necessary but not yet understood?

James Felce, October 1, 2012 at 10:47 AM
Well what these data seem to suggest is that very little of your DNA is vestigial and that pretty much your whole genome is doing something. This makes sense really because turning off a gene is not like turning off a light switch, it requires several levels of regulation to keep a gene from being expressed and these can't just evolve overnight. Alternatively a gene can lose activity in one fell swoop by mutating so that it's complete nonsense (what's called a frameshift mutation), but for the vast majority of genes this would be a bad thing because it's doing something and so losing it just like that would be an evolutionary disadvantage.

So instead what happens is that genes will evolve in tandem with the regulatory networks that control them. This can have several outcomes. Firstly, you can end up with a gene that is almost completely unchanged but is regulated in a different way in different species and so does a different job - for example, it may be expressed in a different cell type and so interact with a whole different set of proteins. Secondly, the gene itself may mutate to fill a different role within a similar regulatory network - GPCRs are a huge family of receptor proteins that basically all came from one parent gene that has been duplicated and mutated to create 800 or so to do a massive range of jobs in various tissues. Thirdly, the gene and the regulation may change so that the role of the gene is entirely unrelated in different species. This is basically what's happened throughout all of evolution to give the genomic complexity that we now have, since all genes came from a single parent gene that existed billions of years ago. You can think of genes in the same way you think of species in that they are all related along evolutionary lines.

James Felce, October 1, 2012 at 10:47 AM
A good way to think of this is to imagine the genome as a language. The 'words' are the genes, but the grammar and structure of the language come from the regulatory elements in the genome. So the same word may mean entirely different things in different languages because of how they're organised, or entirely different words can mean exactly the same thing. Similarly you can make all sorts of different sentences out of the same selection of words, in the same way that you can make different tissues from the same genes.

So, to directly answer your question, no you couldn't stick in some DNA to reactivate a 'gill' gene. The genes involved in gill development do something else in you, most likely helping to control your foetal development to make sure your tissues were organised correctly. Trying to give a human gills would mean rewiring the whole system of signalling that controls development, which would mean messing about with a huge hunk of the genome!



Monday, September 10, 2012
