Saturday, July 14, 2007

Information in DNA

In a lecture at Berkeley, Richard Lewontin mentions the oft-ignored fact that DNA doesn't determine the organism. Rather, it is a combination of DNA and the environment, two factors that can interact in paradoxical ways, that 'determines' an organism. (The entire argument is more complex, and anyone interested should see the lecture; I can't explain it better than he does.) One specifically interesting matter was the way Lewontin put it:

You can't compute an organism from its DNA

That is, the naive view holds that, given the DNA of an organism, a 'sufficiently powerful computer' could calculate the organism that would develop from that DNA. That view is, just as the name implies, naive and incorrect. To perform such a computation, the computer would also need the environment (and there might also be random factors due to quantum physics, but let's ignore that for now).

In other words, DNA contains only part of the information necessary to compute the resulting organism. But things are more complex still. DNA does not directly generate the organism; DNA is used by complex machinery in the cell to generate proteins. Now, that complex machinery itself is generated by the same process, i.e., some interaction between DNA and that machinery itself (or previous copies of that machinery). There is therefore a subtle question here.

The question can be posed using computer science metaphors. Let's say that DNA is 'data', specifically compressed data (much as files on a computer can be compressed: GIF images, MP3 audio, etc.). The cellular machinery is a 'program' that decompresses the 'data'. Now, given a .gif image file, I can ask: Is there enough information in the file itself to regenerate the image from which it was made? There isn't, in the sense that I need both the file and a program to decompress it. We can even pose this question in a quantitative way (sort of): How much of the information in an image is in the GIF file generated from it, and how much is in the program used to decompress it?

An immediate objection to this is that the same decompression program is used for all GIF images. Yet an example can perhaps make my point clear. Say that an image file format's decompression program contains a little picture of a red gradient. The compressor can then exploit that fact: gradients are removed from the images and only 'notes' remain, something metaphorically like "there should be a gradient here, at angle X and size Y". So the actual gradient appears in the program, not in the compressed images. In that sense, when I compress a particular image, part of its 'information' is in the compressed image file, and part is in the program. (Yet, despite this concrete example, I intend this idea in a more general way.)
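For the programmers reading, here is a minimal sketch of the gradient idea. It does not use a real image format; instead it relies on the preset-dictionary feature of Python's standard zlib module, where compressor and decompressor share some bytes ahead of time. The names BUILT_IN_GRADIENT, compress and decompress are my own inventions for illustration, and the 'image' is just a toy byte string.

```python
import zlib

# Hypothetical stand-in for the "little picture of a red gradient" that ships
# inside the decompression program: 256 bytes ramping from 0 to 255.
BUILT_IN_GRADIENT = bytes(range(256))


def compress(image: bytes, built_in: bytes = b"") -> bytes:
    """Deflate `image`, optionally letting it refer to `built_in` data."""
    if built_in:
        packer = zlib.compressobj(level=9, zdict=built_in)
    else:
        packer = zlib.compressobj(level=9)
    return packer.compress(image) + packer.flush()


def decompress(blob: bytes, built_in: bytes = b"") -> bytes:
    """Inverse of compress(); must be given the same `built_in` data."""
    if built_in:
        unpacker = zlib.decompressobj(zdict=built_in)
    else:
        unpacker = zlib.decompressobj()
    return unpacker.decompress(blob) + unpacker.flush()


# A toy "image": a few unique bytes around a copy of the gradient.
image = b"sky " + BUILT_IN_GRADIENT + b" ground"

plain = compress(image)                     # all the information is in the file
noted = compress(image, BUILT_IN_GRADIENT)  # the gradient lives in the "program"

# Sizes of the original, the plain file, and the file that only holds "notes".
print(len(image), len(plain), len(noted))

# Both files decode to the same image, but the second needs the built-in data.
assert decompress(plain) == image
assert decompress(noted, BUILT_IN_GRADIENT) == image
```

The second file comes out much smaller than the first, because the gradient is no longer stored in it; only a short reference remains. The missing bytes have not vanished, though. They have moved into the data that the decompressing side must already have, which is the sense in which part of the image's information sits in the 'program'.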

So, we might ask,

How much of the information present in an organism's cells is in its DNA, and how much in the machinery that works on its DNA?

Here is one particular consequence of that question. Say that we recover the remains of an extinct animal, like the baby mammoth recently found in Siberia, and let's assume that its DNA is somehow miraculously preserved but the rest of each cell is too degraded to be of use. Do we then have any hope of creating a live mammoth, as in Jurassic Park, from the DNA alone? If there is a significant amount of information in the non-DNA portions of the cell, then we might have a problem. Now, the problem might be solved if the non-DNA portions of a modern elephant's cells are similar enough; metaphorically, if the same 'program' can be used to decompress both mammoth and elephant DNA. In fact, this might be expected if DNA is the primary vehicle of evolution and the rest of the cellular machinery is more stable, but that is still a question I do not believe biology has yet answered.
