Biological Computing

CS 321 2007 Lecture, Dr. Lawlor

(Warning: Dr. Lawlor is pretty far outside his expertise here!)

Glossary

Nucleotide: the fundamental unit of biological information storage (DNA) and communication (RNA). Nucleotides are to biology as bits are to normal computing. A nucleotide is one of the four letters A, G, C, or T/U; these stand for the chemical compounds Adenine, Guanine, Cytosine, or Thymine (in DNA) / Uracil (in RNA). Since there are four nucleotides, one nucleotide requires two bits of binary storage, but it's more common to use one ASCII byte per nucleotide, and just give the letter, like 'A'.
Codon: a group of three nucleotides representing a command--either start, stop, or an amino acid. This is the biological equivalent of one machine-language instruction. The start codon is AUG. There are actually three stop codons: UAG, UAA, and UGA.
Amino Acid: one fairly simple biological molecule. Specified in DNA/RNA with one codon. Normally strung together into proteins/enzymes.
Gene: a sequence of codons representing amino acids that get assembled into a single protein. The biological equivalent of one complete program. The human genome contains only twenty or twenty-five thousand genes. A typical human gene is tens of thousands of nucleotides long.
Protein: a useful and complicated biological molecule constructed from a sequence of amino acids (and sometimes a few metal ion "cofactors", like the iron in hemoglobin). There are proteins that collect waste, tough watertight proteins that form a cell's waxy coat, proteins that help DNA
Enzyme: a protein that helps convert one chemical to another, or "catalyzes" a chemical reaction. For example, the enzyme Amylase, found in saliva, breaks down starches into sugars. Enzymes exist for all the important reactions in your body--the metabolism of sugar and oxygen into carbon dioxide and water, the reduction of free radicals, and so on.
DNA: the biological program storage mechanism. Consists of a string of matching ("complementary") nucleotides held together by a backbone. DNA doesn't actually do any work itself--it's the master copy of the genetic code, and it stays inside the cell's nucleus packaged in chromosomes. The only thing DNA does is get copied ("transcribed") to RNA, which migrates out into the cell proper to do work.
RNA: a biological loaded-and-runnable program. Like DNA, RNA consists of a string of nucleotides, but RNA is missing the complementary double-helix--it's a single helix, with the nucleotides dangling out ready for use. RNA is used in one of two ways. "Coding" DNA gets copied to Messenger RNA, which contains three-nucleotide codon groups representing a sequence of amino acids to be assembled into a protein. "Non-coding" DNA (formerly known as "junk" DNA!) does not contain codons, but instead seems to use RNA's nucleotides directly to do useful work. Many of the functions performed by non-coding RNA are still being worked out.
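
The "two bits per nucleotide" point in the Nucleotide entry can be sketched in a few lines of Python. This is only an illustration: the particular bit assignment (A=00, C=01, G=10, T=11) and the names pack and unpack are my own assumptions, not any standard file format.

```python
# Assumed encoding (hypothetical, not from the lecture): A=00, C=01, G=10, T=11.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
LETTER = "ACGT"

def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four nucleotides per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for ch in chunk:
            byte = (byte << 2) | CODE[ch]
        # Left-align a short final chunk so unpack() can read every byte the same way.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)

def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` nucleotides from packed bytes."""
    letters = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            letters.append(LETTER[(byte >> shift) & 3])
    return "".join(letters[:length])

dna = "ATGCACGGGGGC"
packed = pack(dna)
print(len(dna), "ASCII bytes ->", len(packed), "packed bytes")
print("round trip ok:", unpack(packed, len(dna)) == dna)
```

The length has to be stored separately, because a final byte that is only partly used is padded with zero bits, which would otherwise read back as extra A's.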

Example: DNA to Protein

Say in your cell's nucleus, your DNA contains a gene with this unusually-short sequence of nucleotides:
...TA ATG CAC GGG GGC GGG TGG GGG CAA CCA TAG AAA G...

This will get transcribed into a short string of Messenger RNA with this sequence (replacing T - Thymine, with U - Uracil):
UA AUG CAC GGG GGC GGG UGG GGG CAA CCA UAG AAA G
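
As a rough sketch, the T-to-U replacement described here is a one-line string operation in Python. The function name transcribe is my own, and this follows the lecture's shortcut of copying the sequence directly; real transcription reads the complementary strand of the DNA, which this sketch ignores.

```python
def transcribe(dna: str) -> str:
    """Copy a DNA sequence to RNA letters by swapping T (Thymine) for U (Uracil)."""
    return dna.upper().replace("T", "U")

# The unusually short gene from this example, written with DNA letters.
dna = "TA ATG CAC GGG GGC GGG TGG GGG CAA CCA TAG AAA G"
print(transcribe(dna))
```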

Using the cell's ribosomes, this string of Messenger RNA will get executed as follows:

U, A -> nothing happens
AUG -> valid START sequence codon. The ribosomes bind to this spot, and begin assembling a protein.
CAC -> codon for the Histidine amino acid (table), which is the first amino acid in the new protein.
GGG -> codon for the Glycine amino acid, which is the second amino acid in the protein. The Glycine sticks to the existing Histidine, forming a two-acid "polypeptide".
GGC -> a different codon for the Glycine amino acid, which is the third amino acid in the chain. There are 64 possible codons, but only 20 used amino acids, so some amino acids are represented by several different codons.
GGG -> Glycine again, which gets added as the fourth amino acid.
UGG -> Tryptophan, which gets stuck on as the fifth acid.
GGG -> yet more Glycine.
CAA -> Glutamine.
CCA -> Proline.
UAG -> STOP codon. The ribosome lets go of the newly formed chain of amino acids, which is a new protein. The protein floats away.
A, A, A, G, etc. -> nothing happens. Stuff outside of START and STOP does not bind ribosomes, and so does not make proteins.

So this Messenger RNA has just created a new eight-amino-acid protein:
Histidine - (Glycine)₃ - Tryptophan - Glycine - Glutamine - Proline
(or HGGGWGQP using the confusing amino-acid-to-letter substitution).
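
The ribosome walkthrough above can be mimicked with a toy Python interpreter for Messenger RNA. Treat this as a sketch under assumptions: the function name translate is mine, the codon table is the standard genetic code written in the usual compact U, C, A, G ordering, and I follow this lecture's simplification that AUG only means START (in a real cell, AUG also adds the amino acid Methionine).

```python
from itertools import product

# Standard genetic code, one letter per codon, ordered UUU, UUC, UUA, UUG, UCU, ...
# '*' marks the three stop codons.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(mrna: str) -> str:
    """Return the one-letter amino acid chain between START and the first STOP."""
    rna = mrna.replace(" ", "").upper()
    start = rna.find("AUG")          # ribosome binds at the first START codon
    if start < 0:
        return ""                    # no START: nothing happens
    protein = []
    for i in range(start + 3, len(rna) - 2, 3):
        amino = CODON_TABLE[rna[i:i + 3]]
        if amino == "*":             # STOP codon: ribosome lets go
            break
        protein.append(amino)
    return "".join(protein)

mrna = "UA AUG CAC GGG GGC GGG UGG GGG CAA CCA UAG AAA G"
print(translate(mrna))  # HGGGWGQP
```

Note how the leading "UA" and the trailing "AAA G" never show up in the output, just as in the step-by-step trace.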

(I'm skipping over lots of complexity here. Real genes start with a promoter sequence that tends to attract the RNA transcription machinery, and often include introns that get cut out of the RNA before it's executed into a protein.)

Why You Care: Disease & Bioterror

One particular folding of the protein above is human prion protein 61-68, which is the cause of Creutzfeldt-Jakob disease, a brain-destroying disease that can either be inherited from the bad gene listed above, or acquired by eating the poorly-cooked brains of infected "mad" cows. The problem with this protein is that it functions as an enzyme--it converts other useful proteins into more copies of itself. Such self-catalyzing proteins are called prions. Prions aren't nearly as infectious as viruses (they have to be eaten in large quantities to cause infection, and take years to begin causing problems), but they're incurable and currently mostly undetectable.

Read that again. The gene sequence above, when executed into a protein, can kill you. For under a hundred dollars, online you can mail-order physically expressed copies of that gene sequence from a gene synthesis lab. You can order the copies as fully-assembled proteins (peptides), short RNA or DNA snippets (oligos), or even as working DNA inside living (non-human) cells like bacteria.

Bacteria are just little independent single-cell organisms living in your body. Viruses are more interesting--they're just DNA in a cheap protein coat. When executed, the DNA codes for... more viruses. So a virus just hijacks the code of a working cell to start manufacturing viruses--nanotechnology used for evil.

Here's the nucleotide sequence for smallpox (variola virus). It's 185.5 thousand nucleotides long, or 46.4KB in binary form. Luckily, it's currently not possible to artificially synthesize such extremely long-chain sequences into working DNA (the per-nucleotide error rate is too high), but in a few years these 46.4KB of *binary* data could be converted to *physical* form and cause horrific human suffering!

Also, cancer. Cancer is very simple--it's when your body's normal cells stop doing what they're supposed to do, and change their own DNA to start reproducing without bound, like little single-cell organisms. Your genes contain all sorts of interesting hacks to prevent this, like the ticking time-bomb of telomeres at the end of each chromosome, but cancer (evolution at work!) is pretty good at changing the cell DNA to evade these defenses. A woman, Henrietta Lacks, who died in 1951, had a cervical cancer culture taken that still lives on to this day, having evolved into a successful experimental and wild single-celled organism, which to this day will occasionally infect other people's cancer biopsy results.

Why You Care: Information Density

Again, online you can order fluorescent probe molecules to tag a particular protein or sequence you're interested in. These probes are short little proteins that have one glowy end (for example, that glows green under UV light), and one "sticky" end, where by "sticky" I mean that end is designed to bind to whatever biological object you like. For example, say you're interested in determining if a cow brain contains the prions above. So you design a probe that will stick to the prion. Then you just wash your cow brain (or plants, or toads, or whatever) with the probes, and then shine on a UV light--if it glows green, the probes have stuck to prions, so don't eat it!

How expensive are these useful little probes to fabricate? Well, there's a special where $100 will buy you 1 "nanomole" of probes. 1 mole is 6.022 x 10²³ molecules (Avogadro's number). So 1 nanomole is 10^-9 moles, or 6.022 x 10¹⁴ molecules. That's 6 trillion probe molecules per dollar!

This is really cheap compared with the price of, for example, cars (0.00009 Kia Rios per dollar) or even fast food (2 Taco Bell tacos per dollar). It's still cheap compared to CPU transistors (300 million transistors/$100 = 3 million transistors per dollar) or even DRAM storage cells (1GB/$50 = 8 billion bits/$50 = 160 million bits per dollar).
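
To double-check the per-dollar arithmetic, here is a small Python sketch. Every price in it is just the 2007 figure quoted in this lecture, not current data, and the variable names are my own.

```python
AVOGADRO = 6.022e23  # molecules per mole

# Probes: $100 buys 1 nanomole (10^-9 moles), per the lecture.
probes_per_dollar = 1e-9 * AVOGADRO / 100

# All prices below are the lecture's 2007 figures, not current ones.
per_dollar = {
    "probe molecules": probes_per_dollar,
    "DRAM bits": 8e9 / 50,           # 1GB (8 billion bits) for $50
    "CPU transistors": 300e6 / 100,  # 300 million transistors for $100
    "Taco Bell tacos": 2.0,
    "Kia Rios": 0.00009,
}

# Print from cheapest-per-item to most expensive-per-item.
for name, count in sorted(per_dollar.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:16s} {count:10.3g} per dollar")
```

Probe molecules come out tens of thousands of times cheaper per item than DRAM bits, which is the point of the comparison.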

Biological information storage is so cheap, in fact, that almost every cell in your body contains its own complete copy of your DNA. Human DNA has about 3 billion nucleotide pairs, or 6 billion bits, or 750MB of data--about one CD-ROM worth. There are something like 5 million cells per cubic centimeter of human flesh, which means (counting only the DNA) the information density of human flesh is over 3,000 terabytes per cubic centimeter! And that's not even trying very hard--pure DNA could be thousands of times more efficient than this, since DNA is only a tiny portion of the complete cell.
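
The density estimate above is easy to reproduce. This Python sketch only restates the lecture's own round numbers (3 billion base pairs, 5 million cells per cubic centimeter); the variable names are mine, and the inputs are rough estimates rather than measured values.

```python
base_pairs = 3e9          # nucleotide pairs in human DNA (lecture's figure)
bits_per_pair = 2         # four possible letters -> two bits
cells_per_cc = 5e6        # cells per cubic centimeter (lecture's rough figure)

genome_bits = base_pairs * bits_per_pair
genome_bytes = genome_bits / 8
bytes_per_cc = genome_bytes * cells_per_cc

print(f"genome: {genome_bits / 1e9:.0f} billion bits = {genome_bytes / 1e6:.0f} MB")
print(f"density: {bytes_per_cc / 1e12:.0f} terabytes per cubic centimeter")
```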

The bottom line is that DNA is a spectacularly awesome information storage mechanism--one pair of nucleotides is only a few dozen atoms across, and stores two bits. I feel like DNA and proteins represent amazing nanotechnology--atomic-scale fabrication done right.

Why You Care: Processing Speed

We saw above that $1 buys you six trillion (6 x 10¹²) probe proteins. At room temperature, they're all wiggling all over the place at molecular speeds, "trying" to react with something nearby.

For example, this page's NAMD simulations of the cell-membrane protein aquaporin show the crucial atoms inside the protein wiggling around. The atoms make complete wiggles on a timescale of picoseconds (10^-12 seconds).

Viewed as a computer, this means you've got trillion-way parallelism, and your clock rate is in the terahertz. This means you're doing trillions of trillions of total wiggles per second--in this case, something like 6 x 10²⁴ wiggles per second--per dollar! So if you can figure out how to express your computation in terms of atomic wiggles, you can get absolutely insane performance.
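
Here is the same back-of-the-envelope estimate in Python. It assumes, as the lecture does, that each molecule counts as one "processor" and each picosecond-scale wiggle counts as one "clock tick"; the function name is my own invention.

```python
def wiggles_per_second(molecules: float, wiggle_hz: float) -> float:
    """Total wiggles per second if every molecule wiggles independently."""
    return molecules * wiggle_hz

probes_per_dollar = 6.022e12   # from the $100-per-nanomole price
wiggle_hz = 1 / 1e-12          # one wiggle per picosecond = 1 terahertz

total = wiggles_per_second(probes_per_dollar, wiggle_hz)
print(f"{total:.1e} wiggles per second per dollar")
```

Whether a wiggle can be made to do anything as useful as a CPU instruction is, of course, the hard part.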

Ecosystem Design

A single cell uses a number of interesting design principles. First, because everything's on the scale of atoms (and wiggling around like crazy), it's quite easy for things to get knocked out of alignment, for crucial parts to break off, or for random unknown molecules to arrive and disrupt the functioning of the system. The cell has to work even in the face of all that, and it does a wonderful job of it. The main trick is simply replication--there's 500 copies of the Messenger RNA for every gene in the cell that matters, so losing one of the copies is no big deal. It's a totally different design philosophy than normal computers are based on.
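
To see why replication is such an effective trick, here is a toy Python calculation. It assumes each copy is lost independently with the same probability, and the loss probabilities are made-up illustration values, not measurements from any real cell.

```python
def prob_all_lost(p_loss: float, copies: int) -> float:
    """Chance that every one of `copies` independent copies is lost."""
    return p_loss ** copies

# Hypothetical per-copy loss probabilities, chosen only for illustration.
for p_loss in (0.5, 0.9, 0.99):
    single = prob_all_lost(p_loss, 1)
    many = prob_all_lost(p_loss, 500)
    print(f"p_loss={p_loss}: 1 copy -> {single:.2f}, 500 copies -> {many:.3g}")
```

Even if each individual copy has a 99% chance of being destroyed, under these assumptions the chance of losing all 500 at once is well under 1%.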

Many of these same cell-design principles are shared by healthy ecosystems, economic markets, functioning democracies, and piles of gravel. I've come to call these principles "ecosystem design":

Many independent decision-makers. Examples: A cell is a complicated collection of separate but interacting proteins. A market is a collection of interacting buyers and sellers. A gravel pile is a collection of interacting pebbles. Results: no central decision-maker means no single point of failure, so the loss of any few small pieces doesn't affect the overall result. Anything important is decided by hundreds of independent parts--and it's just inconceivable that they'd all get it wrong.
Dynamic equilibrium--lots of small-scale stuff is changing all the time, but the overall averages remain quite constant. Examples: the metabolic rate of a cell is the result of all its pieces working, but it's quite predictable overall. A market's overall prices tend to remain fairly stable. A gravel pile's average slope can be predicted to within a few degrees.
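
The "dynamic equilibrium" idea can be illustrated with a small simulation: each part flips randomly between active and idle, yet the overall average gets steadier as the number of parts grows. This is a toy model of my own (independent coin-flip parts, fixed random seed), not a model of any real cell or market.

```python
import random
import statistics

def average_activity(n_parts: int, rng: random.Random) -> float:
    """Fraction of n_parts independent parts that happen to be active right now."""
    return sum(rng.random() < 0.5 for _ in range(n_parts)) / n_parts

def spread(n_parts: int, trials: int = 200, seed: int = 0) -> float:
    """Standard deviation of the overall average across repeated snapshots."""
    rng = random.Random(seed)
    snapshots = [average_activity(n_parts, rng) for _ in range(trials)]
    return statistics.pstdev(snapshots)

for n_parts in (10, 100, 1000):
    print(f"{n_parts:5d} parts: average wobbles by about {spread(n_parts):.3f}")
```

In this model the wobble shrinks roughly as one over the square root of the number of parts, which is why a big pile of unreliable pieces can still have very predictable averages.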

As an example, I claim an automobile, CPU, dictatorship, and orderly stack of bricks use "machine design", not ecosystem design: these systems have crucial decisionmaker parts (e.g., the braking system, control unit, dictator, or bottom brick) whose failure can dramatically change the entire system. Machine design requires somebody to go in and fix these crucial parts now and then, which is inefficient and error-prone.

Ecosystems, by contrast, harness the power of probability--chaos--in order to get stuff done.

Fault Tolerance

A computer is really not at all a robust system--outside a very narrow temperature and electrical voltage range, it will stop working. Computers can be totally destroyed by dust, humidity, or even microscopic conductive "zinc whiskers".

Mammals are, of course, also quite easy to disrupt. Mammals depend on the circulation of air and blood to continue to operate, and contain these fluids within quite delicate structures, such that poking even a small .223 caliber hole in the aorta, for example, will cause virtually all mammals to stop working.

Even a single cell in some ways functions in a machinelike, non-ecosystem fashion--tearing a hole in a cell wall is called lysis, and results in death. However, note that many cells are quite difficult to kill.

For example, Deinococcus radiodurans, also known as "Conan the Bacterium", can survive radiation sufficient to kill even cockroaches (by reassembling its own DNA), hard vacuum (by forming spores), and various noxious chemicals. A strain of this bacterium was recently engineered to reclaim mercury and toluene-contaminated nuclear waste.

Even the tiny, recently-emerged 5kbp canine parvovirus can survive alcohol, acids, lye, freezing, and 120 degree water. The only known way to kill it is with bleach, which dissolves its tough coat.

But we can use ecosystem design to keep our machines running even in the face of these threats! For example, one beautiful design your body uses to repel invaders is a set of proteins with selectively sticky parts. These are designed to stick to foreign material such as a virus, and flag it for disposal by the immune system. When foreign material is found, the immune system also creates more proteins with that kind of stickiness. Even better, each protein, called an antibody, has two identical sticky pads, which tends to bind together antibodies and viruses into long folded-up chains that can easily be identified and destroyed. Within a few days, the immune system cranks up antibody production to the point where viruses floating around in the blood almost immediately get stuck to antibodies and eliminated. This is why you're immune to a disease you've already been exposed to, either through catching the disease naturally, or by having the disease proteins artificially introduced into your body during vaccination.

So, the bottom line is that biology is nanotech on an amazing scale, and offers spectacular possibilities for information density, processing speed, and reliability. The downsides are that designing systems that work is a lot trickier on the small scale, and also the possibility of a mankind-killing genetically-engineered superflu.