Tuesday, November 27, 2007

Black Boxes

There are different schools of thought on the best way to build a model. One of these, commonly known as ‘black-box’, emphasizes accurate behavior of the model with respect to input and output, stimulus and response, etc., regardless of whether the internal structure or other nature of the model itself bears any resemblance to the system it mimics. You treat the system as if it were contained in a black box that prevents you from seeing inside it, and so choose not to care about what you can’t see, as long as what you can see works as expected. I have heard the alternative called ‘white-box’, although that clearly misses the point, since you can’t see inside a white box any better than you can a black one. It would probably make more sense to call the first approach ‘opaque box’ so the second could be ‘transparent box’, or ‘closed box’ and ‘open box’. Regardless, the second approach emphasizes the correctness of the model as a reflection of the system it represents. In building such a model, you are not allowed any ‘then a miracle occurs’ steps; you must be able to justify the microscopic structure of the model, as well as its macroscopic behavior.

Of course, this dichotomy is not unique to modeling biological systems. For instance, most economists make a fundamental distinction between macro-economics and micro-economics. The former is concerned with characterizing, and ultimately predicting, the behavior of large-scale economic systems, while the latter attempts to explain the myriad small-scale decisions made by the individual agents making up such an economy. Some tenets do appear to succeed on both fronts, such as the law of supply and demand explaining macro-level price fluctuations in terms of micro-level decisions by consumers to buy, or by producers to invest in production. But there’s obviously a lot more to economics than that, or they wouldn’t keep handing out Nobel prizes, right? The macro/micro split is driven by the fact that economic systems are hideously complex, which presents a choice: do you want to try to describe the behavior of the whole system while not being able to explain any of it, or would you prefer a believable explanation of some small and artificially isolated portion of that system, and still be left flipping a coin when it comes to the system as a whole?

I’m not sure whether to claim that biological systems are more or less complex than economies. But they certainly appear to be complex enough to present the same choice when attempting to model them. Which way you decide depends on your goal: what is your motivation for building the model in the first place? Companies like Entelos excel at building large-scale models to simulate the behavior of complex biological systems. Their goal, or at least the goal of their customers, is explicitly to use these models to replace testing on live subjects, whether for reasons of morality or expense. The sheer complexity and incomplete understanding of the systems they model, coupled with the urgency of their goal, place them squarely in the macro/black-box camp. If a few gazillion euros and the lives of countless bunnies are on the line, what's the harm of a few fudge factors? Or rather, if your priority is a reasonably accurate facsimile of the behavior of a system that is not fully understood, you will inevitably be forced to invoke the black box at some point in your modeling. Note: I have no inside information on Entelos or its processes, beyond what I have seen in public presentations. I also mean no disrespect; I am simply observing the pressures involved in modeling beyond what is known.

So what then is the problem with black-box modeling? I can make two arguments, one of them practical, and the other more aesthetic. The practical argument comes up if you plan to use the model to make predictions outside of the data used to generate it. You can fit a model to the data without being able to validate its internal correctness, but once you stray from the region of the fit, you simply cannot be certain of anything. On the other hand, if a model accurately reflects the components and interactions that make up a system, it can be expected to behave pretty much the same as that system even in previously unexplored territory. My other argument is the naive notion that it is worth understanding how something works, even if you can't yet think of a practical justification. This is the essence of scientific pursuit. The public may support science--and governments may fund it--based on the promise of practical applications. But scientists throughout history have pursued it simply because they are driven to understand.
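The practical argument can be made concrete with a toy example. The sketch below is only an illustration under stated assumptions: the "real" system is a made-up saturating curve (the parameters VMAX and KM and every function name are hypothetical), the data are noise-free, and the mechanistic model is assumed to have the right functional form already. A straight line stands in for the black box, fitted only to a narrow range of inputs.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


# Hypothetical "true" system: a saturating response (assumed for illustration).
VMAX, KM = 10.0, 2.0


def true_system(x):
    return VMAX * x / (KM + x)


# Observations are taken only from a narrow region of inputs.
xs = [0.2, 0.4, 0.6, 0.8, 1.0]
ys = [true_system(x) for x in xs]

# Black box: a straight line that matches the data, with no claim
# that a line resembles how the system actually works.
slope, intercept = fit_line(xs, ys)


def black_box(x):
    return slope * x + intercept


# Mechanistic model: assume the saturating form and estimate its two
# parameters from the same data, using 1/y = (KM/VMAX) * (1/x) + 1/VMAX.
a, b = fit_line([1 / x for x in xs], [1 / y for y in ys])
vmax_fit, km_fit = 1 / b, a / b


def mechanistic(x):
    return vmax_fit * x / (km_fit + x)


# Compare the two models inside and far outside the fitted region.
for x in (0.6, 20.0):
    print(x, true_system(x), black_box(x), mechanistic(x))
```

Inside the fitted region the two models are hard to tell apart. At an input of 20, well outside it, the straight line keeps climbing and overshoots the saturating system several-fold, while the model that shares the system's structure still tracks it. With real, noisy data the mechanistic fit would not be this clean, but the contrast in extrapolation is the point.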

The astronomer Ptolemy resorted to black-box constructs such as epicycles and cosmic spheres to model the motion of the planets. His system worked well, and its predictions were quite accurate. Eventually, centuries of detailed measurement revealed subtle discrepancies, prompting the addition of less and less justifiable black boxes to keep the model afloat. Ultimately, this ungainly system of epicycles on epicycles was replaced by a heliocentric system, better reflecting the physical reality of the solar system, as we now understand it. Why did Ptolemy's system survive for roughly 14 centuries, and what really prompted its replacement? Did the minor inaccuracies in prediction really matter to the average person? Even now, would it really make a practical difference in your life if Jupiter were a hundredth of a degree away from where you thought it would be? I believe in the end it was not about the accuracy of the predictions, but the aesthetics of the model itself, and what it said about the place of humans in the universe. This was of course also the basis for resistance to the change.

Black-box models may be a practical or even necessary compromise when working with otherwise unapproachably complex systems. From an application or business point of view, that makes perfect sense. But the scientist in me will always find them unsatisfying.

Wednesday, November 21, 2007


I've been a computational biologist for a long time, a lot longer than I've called myself a computational biologist. Basically, I fell in love with computers the first time I saw one. I don't necessarily want to divulge exactly when that was, but suffice it to say that the average Googler was probably fully occupied in meiosis at the time. The computer itself was not very impressive by today's standards, of course, but that wasn't the point. Here was a box where my friends and I could type in simple instructions (in BASIC) and make things happen, like scrolling text or simple animations. I had a hard time explaining to other kids, or my parents for that matter, just what was so compelling about the word 'hello' scrolling back and forth across the screen. But I was hooked.

I did eventually move on to building more complex programs. But what I found I really enjoyed the most was debugging. This is a process of stepping through a program while it is running, line by line, and watching what happens as it unfolds. It's called debugging because usually the motivation is to track down bugs, by seeing exactly when something goes wrong so it can be fixed. Most programmers consider debugging to be, as Wikipedia puts it, "a cumbersome and tiring task". But to me it was the ultimate chance to connect the microscopic line-by-line structure of the computer program to its macroscopic behavior as it runs.

I also had a long-standing interest in understanding how living things work. This translated into majoring in biology once I got to college, with a particular interest in genetics. Once again I found myself fascinated with how changes at the microscopic level--mutations in genes--caused visible differences at the macroscopic level. It could be something as simple as making the connection that blue eyes are the result of a change that gums up the function of a protein involved in making brown pigment. It wasn't much of a leap to connect how I thought about and explained phenotype in terms of genotype, and how I explained the visible behavior of a running program in terms of defects at the coding level.

However similar my thought processes in understanding computational or biological systems, there was always one huge difference: the heroic, often ingenious, but always labor-intensive and time-consuming means needed to get the smallest bit of information about what is going on inside a biological system while it is "running". This stood in stark contrast to the ease with which I could step through a program with a debugger, and monitor or even change any variable I wanted on the fly, all in the space of a few minutes. Whatever amazing progress has been made in modern understanding of biological systems, just what would be possible if we had access to a debugger? At some level, this question has been driving everything I've done since.