John Koza's August 17, 1995 Comments on Cross Paradigm Comparisons of Genetic Programming with existing machine learning paradigms (Distributed to GP Mailing List)


Recent postings to the genetic programming list have reopened the perennial issue of whether or not it is productive to do cross paradigm comparisons between genetic programming (GP) and other machine learning and artificial intelligence paradigms.

The idea of doing a controlled experiment that compares two or more different paradigms seems like such a good idea and sounds so "scientific;" however, this alluring idea has significant conceptual and practical difficulties when one attempts to apply it to the present-day situation in machine learning and artificial intelligence.

First, cross-paradigm comparisons can, by definition, only be done on problems that already can be solved by two or more different paradigms. In other words, once you make the management decision to work on a cross-paradigm comparison, you are making a decision not to try to push the frontiers of any one particular field, but, instead, to work on a problem that has already been solved in multiple ways. When viewed in this way, it can seen that a cross-paradigm comparison is a quasi-administrative task whose intellectual content consists almost entirely of applying the chi-squared test to the statistics from a series of runs of an problem that is already known to be solvable in two or more ways.

Second, the quality of the results of a cross-paradigm comparison and the eventual persuasiveness of the results will inevitably hinge on the choice of problems that are included in the testbed. The selection of a suitable testbed is usually THE critical problem in conducting a cross-paradigm comparison. Unfortunately, there is no sound theoretical basis for choosing the testbed problems. There is no way to support any claim that the experimental conclusions derived from any particular small finite testbed of problems have any degree of generality or applicability to other problems.

Third, cross-paradigm comparisons inevitably down scale the difficulty of the problems being tackled. This "dumbing down" typically amounts to about two orders of magnitude below the level of whatever can be solved with currently available computational resources. The reason is that it usually takes a minimum of around 50 or 100 independent runs of the same problem using the two competing methods in order to achieve even a minimal level of statistical significance for the experimental results (for the kind of differences that one usually encounters). Let's take a concrete example. It takes about 8 days of computer time to do 100 one-hour runs using paradigm "A" plus 100 one-hour runs using paradigm "B." The same 8 days can alternatively be used to do a single 200-hour (8-day) run of a much more difficult and challenging problem using either paradigm. If experimenters today had six or eight orders of magnitude greater computational resources than they actually now have, this scaling down by two orders of magnitude might not be significant. However, in today's environment, sacrificing two orders of magnitude of problem difficulty for the testbed problems is usually tantamount to "dumbing" the testbed down so much that the testbed lacks all challenging aspects of realistic problem-solving. The result is that the cross-paradigm comparison is made on untypically easy (and often exceedingly misleading) testbed problems. For example, one often sees cross-paradigm comparisons being done on Boolean function learning problems. However, favorite fast testbed problems such as Boolean function learning problems have many inherently misleading structural features. For Boolean problems, there are no true local optimum points in the search space (Juels and Wattenberg 1994). However, the problem of becoming entrapped on local optima points in the search space is common all interesting and non-trivial problems. In fact, having potentially enticing local optima is almost a precondition to non-triviality in problem-solving. Because of this, cross-paradigm comparisons using Boolean function learning problems can lead to manifestly incorrect conclusions if the ability of a search technique to overcome local optima is an issue. For example, one can easily mislead oneself into believing that hill-climbing can outperform almost any other technique by using such overly simplistic testbeds --- even though this claim is as manifestly false as the claim that the earth is flat. A glance at recent issues of the Machine Learning journal and the proceedings of the machine learning conferences shows that the machine learning community still uses the LED problem (a 7-argument Boolean problem --- known to be susceptible to linear separation --- known to be friendly to hill climbing), the congressional voting problem (whose kernel is less difficult than the Boolean exclusive-or problem), the sick soybeans problem, and various other highly constrained classification problems. Another example if the use of other overly simplistic "mutational" problems (such as the block stacking problem) in which there are few, if any, opportunities for progressive improvement toward a solution. When, after the expenditure of considerable time and effort, such "flat earth" results are obtained, the researcher is usually so emotionally committed to his early choice of the testbed that he is unwilling to draw the correct conclusion --- namely, that his testbed was bad. (An exception to this general rule is the technical report (Juels and Wattenberg 1994) that candidly realized that the anomalous results were the result of the inadequacy of the Boolean functions in the testbed).

Fourth, cross-paradigm comparisons inevitably suffer from the "least common denominator" problem. In my opinion, the main weakness of the currently popular "machine learning" paradigms is that they are highly specialized algorithms that cannot ever solve any breadth of problems (or, for that matter, any difficult problems). This is ironic since the term "machine learning" was coined in the 1950s by Arthur Samuel who clearly visualized computers learning to program themselves to solve problems. In fact, a major fraction of the "machine learning" community has successfully managed to attach the term "machine learning" to problems that involve the mere classification of instances into categories and the subsequent testing of whether the learned classification scheme generalizes to unseen instances of "the same" underlying problem. Once one embarks on a cross-paradigm comparison involving a group of deficient paradigms, one inevitably has to restrict the study to problems to the least common denominator of what can be solved by the most impotent of the paradigms. If the competing methods were not as limited and constrained as they are today, that might not be a problem. However, in practice, cross-paradigm comparisons among the currently popular machine learning paradigms necessarily gravitate to utterly trivial problems. For example, as soon as you talk about doing a cross paradigm comparison, the decision tree people (quite understandably) step forward to say that it would be "unfair" to include a problem in the testbed if it happens to have inter-related real-valued variables. Their complaint is motivated by the fact that decision trees decision trees only work well on attribute tests having small discrete number of outcomes applied to one variable at a time. Indeed, decision trees explode on problems as simple as deciding whether a point (x,y) lies inside a circle (inscribed in a square) because both x and y are relevant to deciding whether the point is inside the circle. The decision tree people make the compelling (and correct) argument that the selection of the testbed preordains the result of the cross-paradigm comparison and that it would be "unfair" to include problems in the testbed for which decision trees are known to be unsuited. Similarly, the reinforcement learning people (quite understandably) step forward to say that you should only include problems with a highly constrained number of possible future states for the system being studied. Meanwhile, the inductive logic programming people say "speak to me in Horn clauses." Similarly, the vast majority of neural net research activity avidly avoids recurrent neural nets (where back propagation and the other popular connectionist learning algorithms simply don't operate). Thus, it becomes "unfair" to consider problems requiring memory. Then, the people who work with genetic algorithms operating on fixed-length strings step forward and want to avoid any problem that cannot be efficiently encoded into a fixed and unchanging structure. In fact, the neural network people and the genetic algorithm people join forces in favor of limiting the testbed to problems where the size and shape of the ultimate solution is already known in advance. Of course, in real problem-solving, the size and shape of the solution is often a major part of the problem. Indeed, for many realistic problems, the size and shape of the solution to the problem may indeed be THE problem. And, last and not least, everyone (outside of GP) subscribes to the gentlemen's agreement of not uttering the forbidden "S-word" (scalability). There is unanimous agreement among everyone (outside of GP) that no one wants to even contemplate a problem that contains any significant modularity, symmetry, homogeneity, or regularity that might naturally be exploited by reuse (i.e., subroutines or macros) or any problem that would benefit from parameterized reuse (i.e., subroutines with arguments). Needless to say, this gentlemen's agreement also excludes any problem with dynamic variations in the high-level architectures of these reused parts, parameterized parts. And, of course, don't even think about iteration and other commonly used tools of ordinary computer programming.

Thus, by the time you have selected a "fair" testbed of problems for a "scientific" cross-paradigm comparison, you have already eliminated almost every aspect of realistic problem-solving by


The fact is that a realistic cross-paradigm comparison (i.e., one that doesn't make the above exclusions) between genetic programming and any of the currently popular machine learning and artificial intelligence paradigms is a comparison between some finite measured amount of computational for genetic programming and NEVER.

Fifth, even if you go ahead and do a cross-paradigm comparison, everyone will have a good, sincere, and legitimate reason not to pay any attention to your results. If you don't cater to all of the above "least common denominator" constraints on the choice of problems to include in your testbed, you will be correctly accused of prejudicing your testbed against certain paradigms. Because you will be guilty of "unfairness," no one will pay any attention to your results. If you do cater to these objections, you end up in some nanoworld that lies at the least common denominator of all the limitations of the various paradigms. Of course, once you surrender to the least common denominator, you will be correctly accused of triviality and, again, no one will pay any attention to your results. If you select realworld problems for your testbed (e.g., Walter Tackett's use of the Army's image recognition database), you will be correctly accused of using an idiosyncratic testbed that is not a "clean" and "simple" and susceptible to collateral mathematical analysis. Consequently, no one will pay any attention to your results. On the other hand, if you use purely abstract problems that are susceptible to mathematical analysis (e.g., Walter Tackett's inter-twined doughnut problem), you will correctly accused of using "toy" problems that are not "realworld." Again, no one will pay any attention to your results. The bottom line is that whatever you do, no one will pay any attention to your results.

Sixth, the proponents of whatever paradigm that doesn't fare well in your cross-paradigm comparison can be counted on to be especially vigorous in not paying attention to your results. After you finish your cross-paradigm comparison, the analysis inevitably descends to (completely sincere) cocktail-party chit-chat about how the testbed is (or is not) truly reflective of the infinity of other problems that were not in the testbed. Of course, regardless of what testbed you used, there will be no way to validate (or invalidate) its suitability. Everyone gets to ventilate their sincere (and plausible) opinions about the characteristics of the infinite set of problems that were not included in your study.

Seventh, any contemplating doing a cross-paradigm comparison should recognize that such a project requires a major commitment of time and effort for the researcher. Such comparisons invariably require that the experimenter try to become reasonably proficient on one or more unfamiliar paradigms. A cross-paradigm comparison requires writing, debugging, and running computer programs for each of these other paradigms involved (or, alternately, coming up with an all-encompassing analytic theory definitively modeling the performance of the different paradigms). By the way, this feature does make cross-paradigm comparisons a useful educational exercise for the proverbial underworked student (and I do recommend them for that purpose). If education is the goal, that's fine. But, aside from learning about a new paradigm, a cross-paradigm comparison is really just a quasi-administrative exercise whose emotional high point involves performing the chi-squared test on the statistics from a series of runs of an problem that has already been solved in multiple ways.

Eighth, it is extremely difficult (perhaps impossible) to do a cross-paradigm comparison even-handedly. In what I call Hinton's Law (because I heard it eloquently stated by Geof Hinton from Toronto a few years back), the comparison usually ends up favoring the user's favorite paradigm. The unintentional bias usually comes from the fact that the experimenter is usually highly conversant with only one of the methods he is testing. Even after reading the literature in the unfamiliar paradigm (which is, regrettably, all too often silent on many important practical implementation issues), the experimenter often makes ridiculous mistakes that no one who regularly uses the paradigm would ever make. All of these paradigms have numerous control parameters and subtle operational options which play an important (and usually poorly understood) role in determining the outcome. Even when these parameters and options are explicitly specified in the literature, there are often unarticulated assumptions and preconditions as to their applicability. For example, one sometimes sees the genetic algorithm with extraordinarily tiny populations (e.g., 100) applied to various monumentally hard problems (e.g., the protein folding problem). Or, one sometimes sees choices for the number of hidden layers (or the magnitude of the range of initial weights, the learning rate or momentum parameter, etc.) in a neural network in an inept way that no experienced neural net user would ever do. (Of course, in many cases, the problem with cross-paradigm comparisons isn't unintentional bias, but the fact that the person doing the comparison has a not-so-hidden agenda in the first place).

Ninth, any contemplating doing a cross-paradigm comparison should carefully consider the question of resource management. All too many researchers spend far too little time to making explicit and conscious decisions concerning the management of their own their own time and their own computational and other resources among competing research alternatives. There's an infinity of plausible inquires that one may make about the universe. But when it comes to spending ones time and effort on cross-paradigm comparisons, here is the high-level management question that I think researchers should be asking before embarking on a cross-paradigm comparison: What would be accomplished, at this point in time, by knowing the result of a cross paradigm comparison (even if there were no question about the validity or generality of the result) in the development of the fields of automated machine learning and artificial intelligence at this time? Is comparative speed the most important issue at this point in the development of automated learning? Are there so many good ways to master the problem of automated learning that comparative speed is even an important issue at this time? If you ask management questions such as this, you will immediately see that the major challenge for machine learning at the present time is how to get non-trivial results on non-trivial problems --- not the relative speed by which trivial and useless results on nano-problems can be simultaneously produced by different paradigms.


· The home page of Genetic Programming Inc. at www.genetic-programming.com.

· For information about the field of genetic programming in general, visit www.genetic-programming.org

· The home page of John R. Koza at Genetic Programming Inc. (including online versions of most papers) and the home page of John R. Koza at Stanford University

· Information about the 1992 book Genetic Programming: On the Programming of Computers by Means of Natural Selection, the 1994 book Genetic Programming II: Automatic Discovery of Reusable Programs, the 1999 book Genetic Programming III: Darwinian Invention and Problem Solving, and the 2003 book Genetic Programming IV: Routine Human-Competitive Machine Intelligence. Click here to read chapter 1 of Genetic Programming IV book in PDF format.

· For information on 3,198 papers (many on-line) on genetic programming (as of June 27, 2003) by over 900 authors, see William Langdon’s bibliography on genetic programming.

· For information on the Genetic Programming and Evolvable Machines journal published by Kluwer Academic Publishers

· For information on the Genetic Programming book series from Kluwer Academic Publishers, see the Call For Book Proposals

· For information about the annual Genetic and Evolutionary Computation (GECCO) conference (which includes the annual GP conference) to be held on June 26–30, 2004 (Saturday – Wednesday) in Seattle and its sponsoring organization, the International Society for Genetic and Evolutionary Computation (ISGEC). For information about the annual NASA/DoD Conference on Evolvable Hardware Conference (EH) to be held on June 24-26 (Thursday-Saturday), 2004 in Seattle. For information about the annual Euro-Genetic-Programming Conference to be held on April 5-7, 2004 (Monday – Wednesday) at the University of Coimbra in Coimbra Portugal.


Last updated on August 27, 2003