In a paper entitled "Hill Climbing Beats Genetic Search on a Boolean Circuit Synthesis Problem of Koza's" that was accepted and published by the 1995 International Machine Learning Conference, Kevin Lang proposed a hillclimbing algorithm employing two kinds of random mutation of program trees that, he claimed, "beat" genetic programming (Lang 1995a).
"Lang's claim that his hillclimbing algorithm "beat" genetic programming was based on a comparative experiment based on 100 runs of all 80 distinct 3-argument Boolean functions for the two algorithms."
I responded in a 24-page document distributed, by hand, on July 11, 1995 at the
ML-95 conference in Tahoe City, California entitled "A Response to the
ML-95 Paper Entitled 'Hill Climbing Beats Genetic Search on a Boolean Circuit
Synthesis Problem of Koza's' " (Koza 1995).
On July 31, 1995, Lang wrote "Comments on 'A response to ... ' "
subtitled "Hill Climbing Still Wins" (Lang 1995b) and submitted these
"Comments" to the ML mailing list (and distributed them on another
mailing list).
Mike Pazzani, moderator of the ML mailing list, invited me to respond to Lang's
latest "Comments" claiming that "Hill Climbing Still Wins."
Readers already familiar with Lang's ML-95 paper and my hand-distributed July 11 "Response" can skip this summary of the "first round" in sections 2.1 and 2.2 below.
Lang's ML-95 paper presents a specific alternative to the
population-based search method employed by the genetic algorithm. His proposed
search algorithm starts with a single random individual and, at each time step,
remembers the single best individual ever seen. Then, on each time step, one
candidate new individual is probabilistically created and compared with the
current single best individual. If the new candidate is at least as good as the
current best individual, the candidate becomes the new best individual.
The candidate is created differently on alternate time steps. On the
even-numbered steps (starting with time step 0), the new candidate is a
randomly created entirely new individual. On the odd-numbered steps, the new
candidate is the offspring produced by "crossing" the current best
individual with a randomly created new individual. In this single-parent
"crossing" operation (which is equivalent to the mutation operation
of genetic programming), a randomly chosen subtree from an especially created
new random individual is inserted at a randomly chosen point of the current best
individual (thereby replacing the subtree currently located at that point).
One would think that most researchers in automated learning
would be very surprised to see a tabulation of 100 runs of the family of 80
3-argument Boolean functions proffered as a suitable testbed for reaching a
generalizable conclusion about machine learning.
My "Response" asserted that
"When a scientific claim must be established by experimental evidence (as opposed to proof), the persuasive quality of the experimental testbed is necessarily a threshold issue."
I went on to say,
"Boolean functions of ANY arity are ... of questionable value for
basing generalizations about the performance of learning algorithms.
(Emphasis in original).
The inadequacy of Lang's testbed consisting of 80 3-argument Boolean functions
can be seen merely by increasing the number of Boolean arguments to 4 and 5.
100% of the runs of genetic programming solved the only-slightly-less-trivial
Boolean even-4-parity and the even-5-parity problems (in an average of about a
half million and three million steps, respectively).
In contrast, only 75% of a dozen runs of the Lang's mutation-based hillclimbing
algorithm solved the even-4-parity problem and only 25% solved the
even-5-parity problem. The remaining runs of Lang's hillclimbing algorithm
wandered around at various suboptimal fitness levels for the final 99% or so of
the 4,800,000 steps allowed in each run. Lang's hillclimbing runs didn't
succeed after giving it a generous number of steps (4,800,000) because of the
well known tendency of probabilistic hillclimbing algorithms to not make any
progress for very long stretches of time.
(Note that stochastic hillclimbing algorithms don't ever get stuck. Given
enough time, Lang's random mutation of program trees will, eventually, solve
any problem whatsoever).
The even-3-parity problem dramatizes the fallacy inherent in the Lang's
bombastic proclamation that probabilistic hillclimbing "beats"
genetic programming.
Referring to a histogram showing how well 1,000,000 randomly created programs
perform the even-3-parity task (table 26.1 of Koza 1994), 67.5% have fitness 4
(i.e., produce the correct answer for half of the 8 combinations of the 3
Boolean inputs); 15.6% have fitness 5 and 3 each; 0.5% have fitness 2 and 6;
only 0.33% have fitness 7 or 1; and none (out of 1,000,000) have fitness 8 (the
best) or 0 (the worst).
Because two thirds of the random population have fitness 4, a run of Lang's
proposed hillclimbing algorithm will typically start at time step 0 with a
randomly created individual with a fitness of 4. On step 1 (and every
succeeding odd-numbered step), a random GP mutation of the current best
individual will be executed. On step 2 (and every succeeding even-numbered
step), a randomly created entirely new individual will be generated. This
entirely new random individual (and every succeeding entirely new random
individual generated on these even-numbered steps) will have a 2/3 probability
of having fitness 4 and a 1/6 probability of having fitness 5 (as shown in the
histogram).
After alternating these odd and even time steps a few times, the current best
individual will probably work its way up to a fitness value of 5 and later to
6.
Once the current best individual reaches a fitness level of 6, the
even-numbered steps starts to become irrelevant since there is only a 0.33% (1
in 300) chance of creating a randomly created entirely new individual with
fitness 7. The reason is that the even-numbered steps always draw from the same
well (i.e., the known random distribution).
When the fitness of the current best individual reaches the level of 7 (most
likely due to an odd-numbered step), the chance that an even-numbered step will
create an entirely new individual with fitness 8 is considerably less than one
in a million. These odds are essentially "never" in relation to the
total of small number of steps involved (i.e., 1,250).
In other words, as soon as the fitness of the current best individual moves
just a few standard deviations away from the mean value for randomly created
new individuals, the even-numbered steps in the proposed hillclimbing algorithm
become totally ineffective and irrelevant in advancing the progress of the
search. In fact, the even-numbered steps matter only when the best current
individual in this hillclimbing algorithm is very close to the mean (e.g., at
the very beginning of the run).
The irrelevance of the even-numbered steps would have been obvious to the
paper's author had not made the threshold error (uncritically accepted by the
1995 International Machine Learning Conference) of employing the 80 very
trivial 3-argument Boolean functions as its chosen testbed and had, instead,
studied the ever-so-slightly less trivial 4-argument or 5-argument Boolean functions.
For example, 100% of 1,000,000 randomly created individuals for the
even-4-parity problem have fitness between 4 and 12. None of these 1,000,000
randomly created individuals even come close to the best level of fitness of 16
because a 16 is many standard deviations away from the mean of 8.
Similarly, 100% of the 1,000,000 randomly created individuals for the
even-5-parity task have fitness between 12 and 20. Again, a perfect score of 32
is many, many standard deviations away from the mean value of 16.
Thus, once the proposed hillclimbing algorithm moves even slightly above the
mean, there is essentially no chance that the even-numbered steps will produce
a new individual that will dethrone the current best individual. The entire
burden of advancing the progress of the search falls, almost immediately, upon
the odd-numbered steps (i.e., the ordinary GP operation of mutation).
Having eliminated the even-numbered steps from the picture, the question then
becomes whether the odd-numbered steps can successfully search a complicated
search space. We need not speculate on the answer to this question because the
answer is already in Lang's ML-95 paper. Lang apparently first tried an
approach consisting only of the odd-numbered steps and apparently discovered
that this approach didn't work even in the nanoworld of 3-argument Boolean
functions.
The paper says,
"Fully half of our candidates circuits were random, drawn from the same distribution that yielded poor performance for [random-generation-and-test]."
Tellingly, the paper continues,
"While this seems like a waste of resources, WE NEEDED TO HAVE SOME MECHANISM for escaping from the local optima that would have resulted from keeping only one good circuit around at a time. (Emphasis added)."
In other words, Lang's paper concedes the inability of the
odd-numbered steps alone for searching a non-linear space because of the
well-known problem of lingering for long stretches of time at various
suboptimal levels of fitness.
Given that the paper concedes the inability of the odd-numbered steps to work
and given the provable inability of the even-numbered steps to play any role in
the search (once the search breaks out of the immediate neighborhood of the
mean value of the initial random distribution), it is clear that NEITHER
PART OF LANG'S PROPOSED HILLCLIMBING ALGORITM WORKS. The proposed
hillclimbing algorithm is a pasting together of an approach that Lang's paper
concedes doesn't work with an ineffective and irrelevant approach that provably
can't work.
Presumably, Lang came to the realization that he "needed to have some
mechanism" from his own preliminary experimentation on the family of
3-argument Boolean functions. Then, because of the threshold error of
conducting these experiments in the trivial testbed of 3-argument Boolean
functions (an glaringly obvious error accepted uncritically by the
International Machine Learning Conference), all the search activity of
consequence was concentrated in the highly confined interval between a fitness
level of 6 and 8. In this highly misleading and trivial world, the author of
the paper managed to convince himself that his even-numbered steps (i.e., the
resort to totally random "generation 0" individuals) actually
contributed something to the progress of the search.
I believe my July 11 "Response" succeeded in demonstrating the
fallacy of Lang's claim in his ML-95 paper that hillclimbing algorithm
"beat" genetic programming ("of Koza's," as he put it).
I believe my "Response" showed that, as the testbed is scaled up to
these only-slightly-harder Boolean problems, there is a decreasing ability of
the hillclimbing algorithm (on a dramatically decreasing fraction of the runs)
to work, within any reasonable amount of time (while recognizing that Lang's
algorithm will always work, on any problem, when given enough time).
The best evidence of the success of my July 11 "Response" is the fact
that Lang's July 31 "Comments" claiming that "Hill Climbing
Still Wins" start off by abandoning the specific algorithm described in
his published ML-95 paper.
Indeed, in "Hill Climbing Still Wins," Lang unveils two new
algorithms not even hinted at in his published ML-95 paper.
Lang's first new algorithm is an "iterative" probabilistic
hillclimbing algorithm that runs for 75,000 steps and is then restarted. Note
that time is the essence in this debate. The only claim made in Lang's ML-95
paper is that hillclimbing "beats" genetic programming. And, the only
claim made by Lang's July 31 iterative algorithm is that "Hill Climbing
Still Wins" as to time. There is no dispute that both of Lang's algorithms
will eventually work, on any problem, when given enough time. Thus, the essence
of Lang's new "iterative" algorithm is the number 75,000 that defines
a series of jack-rabbit starts (instead of letting the algorithm run to its
natural and eventual guaranteed successful conclusion at some very distant
time).
What is the general principle by which one could possibly apply Lang's new
iterative algorithm to an arbitrary problem in order to yield this time
advantage?
Does the cutoff number of 75,000 work for the 6-parity or 7-parity problem? The
robotic box-pushing problem? The intertwined spiral problem (Lang and Witbrock
1989)? The transmembrane segment identification problem?
Is 75,000 a universal constant - much like e? Where does it come from?
The number 75,000 apparently comes from the fact that the three (25%) of MY
dozen reported runs of the 5-parity problem each happened to solve just before
step 75,000! The fact is that the only thing that permits Lang's new iterative
algorithm to "still win" over genetic programming is a hand-crafted
problem-specific cutoff number that saves his new algorithm from consuming what
would otherwise be an exceedingly large amount of time to reach its guaranteed
eventual successful conclusion.
Skipping past the origin of the magic cutoff number of
75,000, there is another important conceptual error in Lang's July 31
"Comments."
Lang says, "by restarting each run of RMHC after 75,000 candidates, one
could obtain a candidate-to-solution ratio of about 300,000." Apparently
he arrives at 300,000 because of his prior statement that "RMHC is four
times less likely than genetic search to find a {solution]" (with this
"four" coming from dividing 100% by 25%).
But, the number of steps needed to "solve a problem" using a
probabilistic algorithm is not obtained by multiplying this "4" by
75,000! A probabilistic algorithm that yields a solution 25% of the time does
not necessarily produce at least one solution in 4 tries.
One reasonable way (Koza 1992, 1994) (but, by no means, the only way) to
compute the needed number of steps for a probabilistic algorithm is to ask the
following question: "How many steps must be performed to get a 99%
probability of yielding at least one solution to the problem? The answer to
this particular question is 16 times 75,000 (i.e., 1,200,000). This 1,200,000,
of course, depends directly on the choice of the percentage (99%) and also
assumes the statistical validity of the observed 25% success rate (i.e., 3 out
of 12 runs).
Before we get yet another missive from Lang (perhaps to be entitled "Hill
Climbing Beats GP Yet Again") pointing out that 16 times 75,000 is less
than 3 million, let me just say that I think that it is clear that Lang's new
"iterative" algorithm based on the fixed magic number of 75,000 will
also again crumble as the number of Boolean arguments is again slightly scaled
up.
5. Returning to "The Threshold Issue"
In any event, the more important issue here still concerns what I correctly
called "the threshold issue" of the suitability of Boolean functions
as the tested in the first place.
As I said in my July 11 "Response,"
"Boolean functions of ANY arity are ... of
questionable value for basing generalizations about the performance of learning
algorithms."
"Boolean functions are unusual in that both the range and domain are DISCRETE
and their performance can be efficiently represented by FINITE TRUTH TABLES.
"
(Emphasis in original).
There is nothing new about observed good performance by stochastic hillclimbing
algorithms on problems from the discrete and idiosyncratic world of Boolean
functions.
Juels and Wattenberg (1994) describe a stochastic hillclimbing algorithm
(operating, effectively, in the same way as Lang's hillclimbing algorithm,
after removal of Lang's largely useless even-numbered steps). Their stochastic
hillclimbing algorithm outperforms genetic programming on the Boolean
11-multiplexer problem.
However, Juels and Wattenberg provide a clear explanation and insight as to
what is really happening when Boolean functions are involved:
"It is interesting to note -- perhaps partly in
explanation of the [stochastic hillclimbing] algorithm's success on this
problem -- that the [stochastic hillclimbing] algorithm formulated here defines
a neighborhood structure in which there are NO STRICT LOCAL MINIMA."
(Emphasis in original).
Juels and Wattenberg continue,
"This property holds not only for the
11-multiplexer problem, BUT FOR ANY PROBLEM WHICH INVOLVES THE REALIZATION
OF A SPECIFIC BOOLEAN FORMULA."
(Emphasis in original).
Juels and Wattenberg then prove this property in a proof that is specifically
tailored to the search space of computer programs used by genetic programming
(and by Lang's ML-95 hillclimbing algorithm).
So the bottom line is that hillclimbing may indeed sometimes work well WHERE
THE FITNESS LANDSCAPE HAS NO STRICT LOCAL MINIMA.
That is to say, the threshold failure of Lang's ML-95 paper (and his July 31
"Comments") is his continued use of a manifestly trivial testbed.
(Juels and Wattenbergs' proof can be easily reworded in terms of finite truth
tables to prove that a hillclimbing algorithm can learn any 5-argument Boolean
function in only 32 steps using the tabular representation peculiar to
discrete-valued functions over discrete-valued domains).
Click to get a Post Script copy of University of California - Berkeley Computer Science Technical Report CSD 94-834 by Marty Wattenberg and Ari Juels dated September 1994 entitled Stochastic Hillclimbing as a Baseline Method for Evaluating Genetic Algorithm
Bill Macready (on August 16, 1995 on the Connectionists mailing list) has raised the issue of the no-free-lunch principle (Wolpert and Macready 1995; Macready and Wolpert 1995) concerning any claims as to whether one machine learning algorithm can be said to "beat" another:
"In his recent posting, Kevin Lang continues a contest with John Koza of the 'my search algorithm beats your search algorithm ...' variety."
Barak Pearlmutter (on August 17 on the Connectionists mailing list) then
defended Lang from the no-free-lunch criticism by saying,
"It might be true that in a universe where everything was equally likely, all search algorithms would be equally awful. But that does not appear to be THE UNIVERSE THAT WE LIVE IN, so it is not unreasonable to ask whether some particular search algorithm performs well in practice." (Emphasis added).
But, regardless of whether you agree with Macready or Pearlmutter, Lang's
originally published ML-95 claim that hillclimbing "beat" genetic
programming was based on a comparative experiment based on 3-argument Boolean
functions and Lang's July 31 claim that "Hill Climbing Still Wins" is
based on the testbed of Boolean 4-parity and 5-parity functions. None of these
Boolean problems are "the universe that we live in."
Lang closes his July 31 "Comments" by introducing
yet another subject that was not mentioned at all in his published ML-95 paper,
namely multi-layer perceptrons.
He tells us
"My positive advice is this: when learning functions, use multi-layer perceptrons"
because this approach, in conjunction with the conjugate gradient search,
"has no control parameters to tweak."
Then, in almost the next sentence, he mentions that the multi-layer perceptron
had 5 hidden units; that the hyperbolic tangent function was used; and that the
random initial weights drawn uniformly from the interval [-1,+1] !
Presumably, all of this discussion about perceptrons has been introduced to
recast Lang's published ML-95 paper as something that it never was, namely,
now, a discussion about how perceptrons can also "beat" genetic
programming.
The fact is that Lang's ML-95 paper made a very strong (and wrong) claim that
hillclimbing "beats" genetic programming based on an inadequate
testbed. Lang's July 31 "Hill Climbing Still Wins"
retrospectively recasts his first strong (and wrong) claim with a second strong
(and wrong) claim.
The fact that Lang's ML-95 paper was uncritically accepted
and published by the Machine Learning conference is, in my opinion, a symptom
of a serious problem in the field of machine learning.
In my view, the present challenge for machine learning is how to get
non-trivial results on non-trivial problems -- NOT the relative speed by which
trivial and useless results can be obtained by various alternative paradigms.
Cross-paradigm comparisons (if they are ever useful) can necessarily only be
done on problems that can be solved by several different paradigms. Given the
severe structural limitations inherent in many commonly-studied machine
learning paradigms and their acknowledged inability to solve any significant
breadth of problems, cross-paradigm comparisons are necessarily limited to the
"least common denominator" of problem domains. In practice, that
means some completely trivial problem domain.
Once we get away from the trivial problem domains, the fact is that
cross-paradigm time comparisons between genetic programming and other methods
are (at least at the present time) a comparison between some finite amount of
time and NEVER.
Even if it were true that genetic programming were provably slower than N
other machine learning methods on M toy problems (and, I strongly
emphasize, I don't the slightest reason to believe that GP is either at the
slow OR FAST end of such a scale), the point is that GP is capable of working
on a breadth of non-trivial problems that cannot even be touched by other ML
methods. The reason for this is that GP operates in the right search space
(namely, the search space of computer programs). I believe that time will prove
out my expectation that operating in the right search space will give GP the
ability to solve some real problems. As I said in my first book (in what I
repeatedly tell people is simultaneously the most trite and most important
sentence in the book),
" ... if we are interested in getting computers to solve problems without being explicitly programmed, the structures that we really need are COMPUTER PROGRAMS"
(Emphasis in original).
Juels, Ari and Wattenberg, Martin. 1994. Stochastic Hillclimbing as a Baseline Method
for Evaluating Genetic Algorithms. University of California Computer Science
Department technical report CSD-94-834. September 28, 1994.
Koza, John R. 1992. Genetic Programming: On the Programming of Computers by
Means of Natural Selection. Cambridge, MA: The MIT Press.
Koza, John R. 1994. Genetic Programming II: Automatic Discovery of Reusable
Programs. Cambridge, MA: The MIT Press.
Koza, John R. 1995. "A response to the ML-95 paper entitled 'Hill climbing
beats genetic search on a Boolean circuit synthesis problem of Koza's' ".
Document distributed, by hand, on July 11, 1995 at the 1995 International
Machine Learning Conference in Tahoe City, California.
Lang, Kevin J. 1995a. Hill climbing beats genetic Search on a Boolean circuit
synthesis problem of Koza's. Proceedings of the Twelfth International
Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann.
Lang, Kevin J. 1995b. "Comments on 'A response to the ML-95 paper entitled
"Hill climbing beats genetic search on a Boolean circuit synthesis problem
of Koza's" ' ". Dated July 31, 1995 and distributed on Connectionist
mailing list on August 16, 1995 and on ML mailing list on August ---, 1995.
Lang, Kevin J., and Witbrock, Michael J. 1989. Learning to tell two spirals
apart. In Touretzky, David S., Hinton, Geoffrey E., and Sejnowski, Terrence J.
(editors). Proceedings of the 1988 Connectionist Models Summer School. San
Mateo, CA: Morgan Kaufmann Pages 52-59.
Macready, W.G. and Wolpert, D. H. 1995. What Makes an Optimization Problem
Hard? Santa Fe Institute Working Paper SFI-TR-95-05-046.
Wolpert, D. H., and Macready, W.G. 1995. No Free Lunch Theorems for Search.
Santa Fe Institute Working Paper SFI-TR-95-02-010.
· The home page of Genetic Programming Inc. at www.genetic-programming.com.
· For information about the field of genetic programming in general, visit www.genetic-programming.org
· The home page of John R. Koza at Genetic Programming Inc. (including online versions of most papers) and the home page of John R. Koza at Stanford University
· Information about the 1992 book Genetic Programming: On the Programming of Computers by Means of Natural Selection, the 1994 book Genetic Programming II: Automatic Discovery of Reusable Programs, the 1999 book Genetic Programming III: Darwinian Invention and Problem Solving, and the 2003 book Genetic Programming IV: Routine Human-Competitive Machine Intelligence. Click here to read chapter 1 of Genetic Programming IV book in PDF format.
· For information on 3,198
papers (many on-line) on genetic programming (as of June 27, 2003) by over 900
authors, see William
Langdon’s bibliography on genetic programming.
· For information on the Genetic Programming and Evolvable Machines journal published by Kluwer Academic Publishers
· For information on the Genetic Programming book series from Kluwer Academic Publishers, see the Call For Book Proposals
· For information about the
annual Genetic and
Evolutionary Computation (GECCO) conference (which includes the annual
GP conference) to be held on June 26–30, 2004 (Saturday – Wednesday) in Seattle
and its sponsoring organization, the International Society for Genetic and
Evolutionary Computation (ISGEC).
For information about the annual NASA/DoD Conference on Evolvable
Hardware Conference (EH) to be held on June 24-26 (Thursday-Saturday),
2004 in Seattle. For information about the annual Euro-Genetic-Programming
Conference to be held on April 5-7, 2004 (Monday – Wednesday) at the
University of Coimbra in Coimbra Portugal.
Last updated on August 27, 2003