Recent
postings to the genetic programming list have reopened the perennial issue of
whether or not it is productive to do cross paradigm comparisons between
genetic programming (GP) and other machine learning and artificial intelligence
paradigms.
The idea of doing a controlled experiment that compares two or more
different paradigms seems like such a good idea and sounds so
"scientific;" however, this alluring idea has significant conceptual
and practical difficulties when one attempts to apply it to the present-day
situation in machine learning and artificial intelligence.
First, cross-paradigm comparisons can, by definition, only be done
on problems that already can be solved by two or more different paradigms. In
other words, once you make the management decision to work on a cross-paradigm
comparison, you are making a decision not to try to push the frontiers of any
one particular field, but, instead, to work on a problem that has already been
solved in multiple ways. When viewed in this way, it can seen that a
cross-paradigm comparison is a quasi-administrative task whose intellectual
content consists almost entirely of applying the chi-squared test to the
statistics from a series of runs of an problem that is already known to be
solvable in two or more ways.
Second, the quality of the results of a cross-paradigm comparison
and the eventual persuasiveness of the results will inevitably hinge on the
choice of problems that are included in the testbed. The selection of a
suitable testbed is usually THE critical problem in conducting a cross-paradigm
comparison. Unfortunately, there is no sound theoretical basis for choosing the
testbed problems. There is no way to support any claim that the experimental
conclusions derived from any particular small finite testbed of problems have
any degree of generality or applicability to other problems.
Third, cross-paradigm comparisons inevitably down scale the
difficulty of the problems being tackled. This "dumbing down"
typically amounts to about two orders of magnitude below the level of whatever
can be solved with currently available computational resources. The reason is
that it usually takes a minimum of around 50 or 100 independent runs of the
same problem using the two competing methods in order to achieve even a minimal
level of statistical significance for the experimental results (for the kind of
differences that one usually encounters). Let's take a concrete example. It
takes about 8 days of computer time to do 100 one-hour runs using paradigm "A"
plus 100 one-hour runs using paradigm "B." The same 8 days can
alternatively be used to do a single 200-hour (8-day) run of a much more
difficult and challenging problem using either paradigm. If experimenters today
had six or eight orders of magnitude greater computational resources than they
actually now have, this scaling down by two orders of magnitude might not be
significant. However, in today's environment, sacrificing two orders of
magnitude of problem difficulty for the testbed problems is usually tantamount
to "dumbing" the testbed down so much that the testbed lacks all
challenging aspects of realistic problem-solving. The result is that the
cross-paradigm comparison is made on untypically easy (and often exceedingly
misleading) testbed problems. For example, one often sees cross-paradigm
comparisons being done on Boolean function learning problems. However, favorite
fast testbed problems such as Boolean function learning problems have many
inherently misleading structural features. For Boolean problems, there are no
true local optimum points in the search space (Juels and Wattenberg 1994).
However, the problem of becoming entrapped on local optima points in the search
space is common all interesting and non-trivial problems. In fact, having potentially
enticing local optima is almost a precondition to non-triviality in
problem-solving. Because of this, cross-paradigm comparisons using Boolean
function learning problems can lead to manifestly incorrect conclusions if the
ability of a search technique to overcome local optima is an issue. For
example, one can easily mislead oneself into believing that hill-climbing can
outperform almost any other technique by using such overly simplistic testbeds
--- even though this claim is as manifestly false as the claim that the earth
is flat. A glance at recent issues of the Machine Learning journal and the
proceedings of the machine learning conferences shows that the machine learning
community still uses the LED problem (a 7-argument Boolean problem --- known to
be susceptible to linear separation --- known to be friendly to hill climbing),
the congressional voting problem (whose kernel is less difficult than the
Boolean exclusive-or problem), the sick soybeans problem, and various other
highly constrained classification problems. Another example if the use of other
overly simplistic "mutational" problems (such as the block stacking
problem) in which there are few, if any, opportunities for progressive
improvement toward a solution. When, after the expenditure of considerable time
and effort, such "flat earth" results are obtained, the researcher is
usually so emotionally committed to his early choice of the testbed that he is
unwilling to draw the correct conclusion --- namely, that his testbed was bad.
(An exception to this general rule is the technical report (Juels and
Wattenberg 1994) that candidly realized that the anomalous results were the
result of the inadequacy of the Boolean functions in the testbed).
Fourth, cross-paradigm comparisons inevitably suffer from the
"least common denominator" problem. In my opinion, the main weakness
of the currently popular "machine learning" paradigms is that they
are highly specialized algorithms that cannot ever solve any breadth of
problems (or, for that matter, any difficult problems). This is ironic since
the term "machine learning" was coined in the 1950s by Arthur Samuel
who clearly visualized computers learning to program themselves to solve
problems. In fact, a major fraction of the "machine learning"
community has successfully managed to attach the term "machine
learning" to problems that involve the mere classification of instances
into categories and the subsequent testing of whether the learned
classification scheme generalizes to unseen instances of "the same"
underlying problem. Once one embarks on a cross-paradigm comparison involving a
group of deficient paradigms, one inevitably has to restrict the study to
problems to the least common denominator of what can be solved by the most
impotent of the paradigms. If the competing methods were not as limited and
constrained as they are today, that might not be a problem. However, in
practice, cross-paradigm comparisons among the currently popular machine
learning paradigms necessarily gravitate to utterly trivial problems. For
example, as soon as you talk about doing a cross paradigm comparison, the
decision tree people (quite understandably) step forward to say that it would
be "unfair" to include a problem in the testbed if it happens to have
inter-related real-valued variables. Their complaint is motivated by the fact
that decision trees decision trees only work well on attribute tests having
small discrete number of outcomes applied to one variable at a time. Indeed,
decision trees explode on problems as simple as deciding whether a point (x,y)
lies inside a circle (inscribed in a square) because both x and y are relevant
to deciding whether the point is inside the circle. The decision tree people
make the compelling (and correct) argument that the selection of the testbed
preordains the result of the cross-paradigm comparison and that it would be
"unfair" to include problems in the testbed for which decision trees
are known to be unsuited. Similarly, the reinforcement learning people (quite
understandably) step forward to say that you should only include problems with
a highly constrained number of possible future states for the system being
studied. Meanwhile, the inductive logic programming people say "speak to
me in Horn clauses." Similarly, the vast majority of neural net research
activity avidly avoids recurrent neural nets (where back propagation and the
other popular connectionist learning algorithms simply don't operate). Thus, it
becomes "unfair" to consider problems requiring memory. Then, the people
who work with genetic algorithms operating on fixed-length strings step forward
and want to avoid any problem that cannot be efficiently encoded into a fixed
and unchanging structure. In fact, the neural network people and the genetic
algorithm people join forces in favor of limiting the testbed to problems where
the size and shape of the ultimate solution is already known in advance. Of
course, in real problem-solving, the size and shape of the solution is often a
major part of the problem. Indeed, for many realistic problems, the size and
shape of the solution to the problem may indeed be THE problem. And, last and
not least, everyone (outside of GP) subscribes to the gentlemen's agreement of
not uttering the forbidden "S-word" (scalability). There is unanimous
agreement among everyone (outside of GP) that no one wants to even contemplate
a problem that contains any significant modularity, symmetry, homogeneity, or
regularity that might naturally be exploited by reuse (i.e., subroutines or
macros) or any problem that would benefit from parameterized reuse (i.e.,
subroutines with arguments). Needless to say, this gentlemen's agreement also
excludes any problem with dynamic variations in the high-level architectures of
these reused parts, parameterized parts. And, of course, don't even think about
iteration and other commonly used tools of ordinary computer programming.
Thus, by the time you have selected a "fair" testbed of
problems for a "scientific" cross-paradigm comparison, you have
already eliminated almost every aspect of realistic problem-solving by
The fact is that a realistic cross-paradigm comparison (i.e., one that doesn't
make the above exclusions) between genetic programming and any of the currently
popular machine learning and artificial intelligence paradigms is a comparison
between some finite measured amount of computational for genetic programming and
NEVER.
Fifth, even if you go ahead and
do a cross-paradigm comparison, everyone will have a good, sincere, and
legitimate reason not to pay any attention to your results. If you don't cater
to all of the above "least common denominator" constraints on the
choice of problems to include in your testbed, you will be correctly accused of
prejudicing your testbed against certain paradigms. Because you will be guilty
of "unfairness," no one will pay any attention to your results. If
you do cater to these objections, you end up in some nanoworld that lies at the
least common denominator of all the limitations of the various paradigms. Of
course, once you surrender to the least common denominator, you will be
correctly accused of triviality and, again, no one will pay any attention to
your results. If you select realworld problems for your testbed (e.g., Walter
Tackett's use of the Army's image recognition database), you will be correctly
accused of using an idiosyncratic testbed that is not a "clean" and
"simple" and susceptible to collateral mathematical analysis.
Consequently, no one will pay any attention to your results. On the other hand,
if you use purely abstract problems that are susceptible to mathematical
analysis (e.g., Walter Tackett's inter-twined doughnut problem), you will
correctly accused of using "toy" problems that are not
"realworld." Again, no one will pay any attention to your results.
The bottom line is that whatever you do, no one will pay any attention to your
results.
Sixth, the proponents of whatever paradigm that doesn't fare well
in your cross-paradigm comparison can be counted on to be especially vigorous
in not paying attention to your results. After you finish your cross-paradigm
comparison, the analysis inevitably descends to (completely sincere)
cocktail-party chit-chat about how the testbed is (or is not) truly reflective
of the infinity of other problems that were not in the testbed. Of course,
regardless of what testbed you used, there will be no way to validate (or
invalidate) its suitability. Everyone gets to ventilate their sincere (and
plausible) opinions about the characteristics of the infinite set of problems
that were not included in your study.
Seventh, any contemplating doing a cross-paradigm comparison
should recognize that such a project requires a major commitment of time and
effort for the researcher. Such comparisons invariably require that the
experimenter try to become reasonably proficient on one or more unfamiliar
paradigms. A cross-paradigm comparison requires writing, debugging, and running
computer programs for each of these other paradigms involved (or, alternately,
coming up with an all-encompassing analytic theory definitively modeling the
performance of the different paradigms). By the way, this feature does make
cross-paradigm comparisons a useful educational exercise for the proverbial
underworked student (and I do recommend them for that purpose). If education is
the goal, that's fine. But, aside from learning about a new paradigm, a
cross-paradigm comparison is really just a quasi-administrative exercise whose
emotional high point involves performing the chi-squared test on the statistics
from a series of runs of an problem that has already been solved in multiple
ways.
Eighth, it is extremely difficult (perhaps impossible) to do a
cross-paradigm comparison even-handedly. In what I call Hinton's Law (because I
heard it eloquently stated by Geof Hinton from Toronto a few years back), the
comparison usually ends up favoring the user's favorite paradigm. The
unintentional bias usually comes from the fact that the experimenter is usually
highly conversant with only one of the methods he is testing. Even after
reading the literature in the unfamiliar paradigm (which is, regrettably, all
too often silent on many important practical implementation issues), the
experimenter often makes ridiculous mistakes that no one who regularly uses the
paradigm would ever make. All of these paradigms have numerous control
parameters and subtle operational options which play an important (and usually
poorly understood) role in determining the outcome. Even when these parameters
and options are explicitly specified in the literature, there are often
unarticulated assumptions and preconditions as to their applicability. For example,
one sometimes sees the genetic algorithm with extraordinarily tiny populations
(e.g., 100) applied to various monumentally hard problems (e.g., the protein
folding problem). Or, one sometimes sees choices for the number of hidden
layers (or the magnitude of the range of initial weights, the learning rate or
momentum parameter, etc.) in a neural network in an inept way that no
experienced neural net user would ever do. (Of course, in many cases, the
problem with cross-paradigm comparisons isn't unintentional bias, but the fact
that the person doing the comparison has a not-so-hidden agenda in the first
place).
Ninth, any contemplating doing a cross-paradigm comparison should
carefully consider the question of resource management. All too many researchers
spend far too little time to making explicit and conscious decisions concerning
the management of their own their own time and their own computational and
other resources among competing research alternatives. There's an infinity of
plausible inquires that one may make about the universe. But when it comes to
spending ones time and effort on cross-paradigm comparisons, here is the
high-level management question that I think researchers should be asking before
embarking on a cross-paradigm comparison: What would be accomplished, at this
point in time, by knowing the result of a cross paradigm comparison (even if
there were no question about the validity or generality of the result) in the
development of the fields of automated machine learning and artificial
intelligence at this time? Is comparative speed the most important issue at
this point in the development of automated learning? Are there so many good
ways to master the problem of automated learning that comparative speed is even
an important issue at this time? If you ask management questions such as this,
you will immediately see that the major challenge for machine learning at the
present time is how to get non-trivial results on non-trivial problems --- not
the relative speed by which trivial and useless results on nano-problems can be
simultaneously produced by different paradigms.
· The home page of Genetic Programming Inc. at www.genetic-programming.com.
· For information about the field of genetic programming in general, visit www.genetic-programming.org
· The home page of John R. Koza at Genetic Programming Inc. (including online versions of most papers) and the home page of John R. Koza at Stanford University
· Information about the 1992 book Genetic Programming: On the Programming of Computers by Means of Natural Selection, the 1994 book Genetic Programming II: Automatic Discovery of Reusable Programs, the 1999 book Genetic Programming III: Darwinian Invention and Problem Solving, and the 2003 book Genetic Programming IV: Routine Human-Competitive Machine Intelligence. Click here to read chapter 1 of Genetic Programming IV book in PDF format.
· For information on 3,198
papers (many on-line) on genetic programming (as of June 27, 2003) by over 900
authors, see William Langdon’s bibliography on
genetic programming.
· For information on the Genetic Programming and Evolvable Machines journal published by Kluwer Academic Publishers
· For information on the Genetic Programming book series from Kluwer Academic Publishers, see the Call For Book Proposals
· For information about the
annual Genetic and Evolutionary
Computation (GECCO) conference (which includes the annual GP
conference) to be held on June 26–30, 2004 (Saturday – Wednesday) in Seattle
and its sponsoring organization, the International Society for Genetic and
Evolutionary Computation (ISGEC). For
information about the annual NASA/DoD Conference on Evolvable Hardware Conference (EH)
to be held on June 24-26 (Thursday-Saturday), 2004 in Seattle. For
information about the annual Euro-Genetic-Programming Conference to be held on
April 5-7, 2004 (Monday – Wednesday) at the University of Coimbra in Coimbra
Portugal.
Last updated on August 27, 2003