John Koza's August 17, 1995 Response (Distributed to the ML mailing list) to Kevin Lang's July 31, 1995 "Comments" (Distributed to the ML mailing list)

1. Background

In a paper entitled "Hill Climbing Beats Genetic Search on a Boolean Circuit Synthesis Problem of Koza's" that was accepted and published by the 1995 International Machine Learning Conference, Kevin Lang proposed a hillclimbing algorithm employing two kinds of random mutation of program trees that, he claimed, "beat" genetic programming (Lang 1995a).

"Lang's claim that his hillclimbing algorithm "beat" genetic programming was based on a comparative experiment based on 100 runs of all 80 distinct 3-argument Boolean functions for the two algorithms."

I responded in a 24-page document distributed, by hand, on July 11, 1995 at the ML-95 conference in Tahoe City, California entitled "A Response to the ML-95 Paper Entitled 'Hill Climbing Beats Genetic Search on a Boolean Circuit Synthesis Problem of Koza's' " (Koza 1995).

On July 31, 1995, Lang wrote "Comments on 'A response to ... ' " subtitled "Hill Climbing Still Wins" (Lang 1995b) and submitted these "Comments" to the ML mailing list (and distributed them on another mailing list).

Mike Pazzani, moderator of the ML mailing list, invited me to respond to Lang's latest "Comments" claiming that "Hill Climbing Still Wins."

2. Summary of Lang's ML-95 Paper and My July 11 "Response"

Readers already familiar with Lang's ML-95 paper and my hand-distributed July 11 "Response" can skip this summary of the "first round" in sections 2.1 and 2.2 below.

2.1 Lang's ML-95 Paper

Lang's ML-95 paper presents a specific alternative to the population-based search method employed by the genetic algorithm. His proposed search algorithm starts with a single random individual and, at each time step, remembers the single best individual ever seen. Then, on each time step, one candidate new individual is probabilistically created and compared with the current single best individual. If the new candidate is at least as good as the current best individual, the candidate becomes the new best individual.

The candidate is created differently on alternate time steps. On the even-numbered steps (starting with time step 0), the new candidate is a randomly created entirely new individual. On the odd-numbered steps, the new candidate is the offspring produced by "crossing" the current best individual with a randomly created new individual. In this single-parent "crossing" operation (which is equivalent to the mutation operation of genetic programming), a randomly chosen subtree from an especially created new random individual is inserted at a randomly chosen point of the current best individual (thereby replacing the subtree currently located at that point).

2.2 My July 11 Response

One would think that most researchers in automated learning would be very surprised to see a tabulation of 100 runs of the family of 80 3-argument Boolean functions proffered as a suitable testbed for reaching a generalizable conclusion about machine learning.

My "Response" asserted that

"When a scientific claim must be established by experimental evidence (as opposed to proof), the persuasive quality of the experimental testbed is necessarily a threshold issue."

I went on to say,

"Boolean functions of ANY arity are ... of questionable value for basing generalizations about the performance of learning algorithms.
(Emphasis in original).

The inadequacy of Lang's testbed consisting of 80 3-argument Boolean functions can be seen merely by increasing the number of Boolean arguments to 4 and 5. 100% of the runs of genetic programming solved the only-slightly-less-trivial Boolean even-4-parity and the even-5-parity problems (in an average of about a half million and three million steps, respectively).
In contrast, only 75% of a dozen runs of the Lang's mutation-based hillclimbing algorithm solved the even-4-parity problem and only 25% solved the even-5-parity problem. The remaining runs of Lang's hillclimbing algorithm wandered around at various suboptimal fitness levels for the final 99% or so of the 4,800,000 steps allowed in each run. Lang's hillclimbing runs didn't succeed after giving it a generous number of steps (4,800,000) because of the well known tendency of probabilistic hillclimbing algorithms to not make any progress for very long stretches of time.

(Note that stochastic hillclimbing algorithms don't ever get stuck. Given enough time, Lang's random mutation of program trees will, eventually, solve any problem whatsoever).

The even-3-parity problem dramatizes the fallacy inherent in the Lang's bombastic proclamation that probabilistic hillclimbing "beats" genetic programming.

Referring to a histogram showing how well 1,000,000 randomly created programs perform the even-3-parity task (table 26.1 of Koza 1994), 67.5% have fitness 4 (i.e., produce the correct answer for half of the 8 combinations of the 3 Boolean inputs); 15.6% have fitness 5 and 3 each; 0.5% have fitness 2 and 6; only 0.33% have fitness 7 or 1; and none (out of 1,000,000) have fitness 8 (the best) or 0 (the worst).

Because two thirds of the random population have fitness 4, a run of Lang's proposed hillclimbing algorithm will typically start at time step 0 with a randomly created individual with a fitness of 4. On step 1 (and every succeeding odd-numbered step), a random GP mutation of the current best individual will be executed. On step 2 (and every succeeding even-numbered step), a randomly created entirely new individual will be generated. This entirely new random individual (and every succeeding entirely new random individual generated on these even-numbered steps) will have a 2/3 probability of having fitness 4 and a 1/6 probability of having fitness 5 (as shown in the histogram).
After alternating these odd and even time steps a few times, the current best individual will probably work its way up to a fitness value of 5 and later to 6.
Once the current best individual reaches a fitness level of 6, the even-numbered steps starts to become irrelevant since there is only a 0.33% (1 in 300) chance of creating a randomly created entirely new individual with fitness 7. The reason is that the even-numbered steps always draw from the same well (i.e., the known random distribution).

When the fitness of the current best individual reaches the level of 7 (most likely due to an odd-numbered step), the chance that an even-numbered step will create an entirely new individual with fitness 8 is considerably less than one in a million. These odds are essentially "never" in relation to the total of small number of steps involved (i.e., 1,250).
In other words, as soon as the fitness of the current best individual moves just a few standard deviations away from the mean value for randomly created new individuals, the even-numbered steps in the proposed hillclimbing algorithm become totally ineffective and irrelevant in advancing the progress of the search. In fact, the even-numbered steps matter only when the best current individual in this hillclimbing algorithm is very close to the mean (e.g., at the very beginning of the run).

The irrelevance of the even-numbered steps would have been obvious to the paper's author had not made the threshold error (uncritically accepted by the 1995 International Machine Learning Conference) of employing the 80 very trivial 3-argument Boolean functions as its chosen testbed and had, instead, studied the ever-so-slightly less trivial 4-argument or 5-argument Boolean functions.

For example, 100% of 1,000,000 randomly created individuals for the even-4-parity problem have fitness between 4 and 12. None of these 1,000,000 randomly created individuals even come close to the best level of fitness of 16 because a 16 is many standard deviations away from the mean of 8.
Similarly, 100% of the 1,000,000 randomly created individuals for the even-5-parity task have fitness between 12 and 20. Again, a perfect score of 32 is many, many standard deviations away from the mean value of 16.
Thus, once the proposed hillclimbing algorithm moves even slightly above the mean, there is essentially no chance that the even-numbered steps will produce a new individual that will dethrone the current best individual. The entire burden of advancing the progress of the search falls, almost immediately, upon the odd-numbered steps (i.e., the ordinary GP operation of mutation).
Having eliminated the even-numbered steps from the picture, the question then becomes whether the odd-numbered steps can successfully search a complicated search space. We need not speculate on the answer to this question because the answer is already in Lang's ML-95 paper. Lang apparently first tried an approach consisting only of the odd-numbered steps and apparently discovered that this approach didn't work even in the nanoworld of 3-argument Boolean functions.

The paper says,

"Fully half of our candidates circuits were random, drawn from the same distribution that yielded poor performance for [random-generation-and-test]."

Tellingly, the paper continues,

"While this seems like a waste of resources, WE NEEDED TO HAVE SOME MECHANISM for escaping from the local optima that would have resulted from keeping only one good circuit around at a time. (Emphasis added)."

In other words, Lang's paper concedes the inability of the odd-numbered steps alone for searching a non-linear space because of the well-known problem of lingering for long stretches of time at various suboptimal levels of fitness.

Given that the paper concedes the inability of the odd-numbered steps to work and given the provable inability of the even-numbered steps to play any role in the search (once the search breaks out of the immediate neighborhood of the mean value of the initial random distribution), it is clear that NEITHER PART OF LANG'S PROPOSED HILLCLIMBING ALGORITM WORKS. The proposed hillclimbing algorithm is a pasting together of an approach that Lang's paper concedes doesn't work with an ineffective and irrelevant approach that provably can't work.
Presumably, Lang came to the realization that he "needed to have some mechanism" from his own preliminary experimentation on the family of 3-argument Boolean functions. Then, because of the threshold error of conducting these experiments in the trivial testbed of 3-argument Boolean functions (an glaringly obvious error accepted uncritically by the International Machine Learning Conference), all the search activity of consequence was concentrated in the highly confined interval between a fitness level of 6 and 8. In this highly misleading and trivial world, the author of the paper managed to convince himself that his even-numbered steps (i.e., the resort to totally random "generation 0" individuals) actually contributed something to the progress of the search.

3. Tracking the Moving Target of Lang's July 31 Comments

I believe my July 11 "Response" succeeded in demonstrating the fallacy of Lang's claim in his ML-95 paper that hillclimbing algorithm "beat" genetic programming ("of Koza's," as he put it).

I believe my "Response" showed that, as the testbed is scaled up to these only-slightly-harder Boolean problems, there is a decreasing ability of the hillclimbing algorithm (on a dramatically decreasing fraction of the runs) to work, within any reasonable amount of time (while recognizing that Lang's algorithm will always work, on any problem, when given enough time).

The best evidence of the success of my July 11 "Response" is the fact that Lang's July 31 "Comments" claiming that "Hill Climbing Still Wins" start off by abandoning the specific algorithm described in his published ML-95 paper.

Indeed, in "Hill Climbing Still Wins," Lang unveils two new algorithms not even hinted at in his published ML-95 paper.

Lang's first new algorithm is an "iterative" probabilistic hillclimbing algorithm that runs for 75,000 steps and is then restarted. Note that time is the essence in this debate. The only claim made in Lang's ML-95 paper is that hillclimbing "beats" genetic programming. And, the only claim made by Lang's July 31 iterative algorithm is that "Hill Climbing Still Wins" as to time. There is no dispute that both of Lang's algorithms will eventually work, on any problem, when given enough time. Thus, the essence of Lang's new "iterative" algorithm is the number 75,000 that defines a series of jack-rabbit starts (instead of letting the algorithm run to its natural and eventual guaranteed successful conclusion at some very distant time).

What is the general principle by which one could possibly apply Lang's new iterative algorithm to an arbitrary problem in order to yield this time advantage?

Does the cutoff number of 75,000 work for the 6-parity or 7-parity problem? The robotic box-pushing problem? The intertwined spiral problem (Lang and Witbrock 1989)? The transmembrane segment identification problem?

Is 75,000 a universal constant - much like e? Where does it come from?

The number 75,000 apparently comes from the fact that the three (25%) of MY dozen reported runs of the 5-parity problem each happened to solve just before step 75,000! The fact is that the only thing that permits Lang's new iterative algorithm to "still win" over genetic programming is a hand-crafted problem-specific cutoff number that saves his new algorithm from consuming what would otherwise be an exceedingly large amount of time to reach its guaranteed eventual successful conclusion.

4. Erroneous Calculation

Skipping past the origin of the magic cutoff number of 75,000, there is another important conceptual error in Lang's July 31 "Comments."

Lang says, "by restarting each run of RMHC after 75,000 candidates, one could obtain a candidate-to-solution ratio of about 300,000." Apparently he arrives at 300,000 because of his prior statement that "RMHC is four times less likely than genetic search to find a {solution]" (with this "four" coming from dividing 100% by 25%).

But, the number of steps needed to "solve a problem" using a probabilistic algorithm is not obtained by multiplying this "4" by 75,000! A probabilistic algorithm that yields a solution 25% of the time does not necessarily produce at least one solution in 4 tries.

One reasonable way (Koza 1992, 1994) (but, by no means, the only way) to compute the needed number of steps for a probabilistic algorithm is to ask the following question: "How many steps must be performed to get a 99% probability of yielding at least one solution to the problem? The answer to this particular question is 16 times 75,000 (i.e., 1,200,000). This 1,200,000, of course, depends directly on the choice of the percentage (99%) and also assumes the statistical validity of the observed 25% success rate (i.e., 3 out of 12 runs).

Before we get yet another missive from Lang (perhaps to be entitled "Hill Climbing Beats GP Yet Again") pointing out that 16 times 75,000 is less than 3 million, let me just say that I think that it is clear that Lang's new "iterative" algorithm based on the fixed magic number of 75,000 will also again crumble as the number of Boolean arguments is again slightly scaled up.
5. Returning to "The Threshold Issue"
In any event, the more important issue here still concerns what I correctly called "the threshold issue" of the suitability of Boolean functions as the tested in the first place.

As I said in my July 11 "Response,"

"Boolean functions of ANY arity are ... of questionable value for basing generalizations about the performance of learning algorithms."

"Boolean functions are unusual in that both the range and domain are DISCRETE and their performance can be efficiently represented by FINITE TRUTH TABLES. "
(Emphasis in original).

There is nothing new about observed good performance by stochastic hillclimbing algorithms on problems from the discrete and idiosyncratic world of Boolean functions.

Juels and Wattenberg (1994) describe a stochastic hillclimbing algorithm (operating, effectively, in the same way as Lang's hillclimbing algorithm, after removal of Lang's largely useless even-numbered steps). Their stochastic hillclimbing algorithm outperforms genetic programming on the Boolean 11-multiplexer problem.

However, Juels and Wattenberg provide a clear explanation and insight as to what is really happening when Boolean functions are involved:

"It is interesting to note -- perhaps partly in explanation of the [stochastic hillclimbing] algorithm's success on this problem -- that the [stochastic hillclimbing] algorithm formulated here defines a neighborhood structure in which there are NO STRICT LOCAL MINIMA."
(Emphasis in original).

Juels and Wattenberg continue,

"This property holds not only for the 11-multiplexer problem, BUT FOR ANY PROBLEM WHICH INVOLVES THE REALIZATION OF A SPECIFIC BOOLEAN FORMULA."
(Emphasis in original).

Juels and Wattenberg then prove this property in a proof that is specifically tailored to the search space of computer programs used by genetic programming (and by Lang's ML-95 hillclimbing algorithm).

So the bottom line is that hillclimbing may indeed sometimes work well WHERE THE FITNESS LANDSCAPE HAS NO STRICT LOCAL MINIMA.

That is to say, the threshold failure of Lang's ML-95 paper (and his July 31 "Comments") is his continued use of a manifestly trivial testbed.

(Juels and Wattenbergs' proof can be easily reworded in terms of finite truth tables to prove that a hillclimbing algorithm can learn any 5-argument Boolean function in only 32 steps using the tabular representation peculiar to discrete-valued functions over discrete-valued domains).

Click to get a Post Script copy of University of California - Berkeley Computer Science Technical Report CSD 94-834 by Marty Wattenberg and Ari Juels dated September 1994 entitled Stochastic Hillclimbing as a Baseline Method for Evaluating Genetic Algorithm

6. The "No Free Lunch" Issue

Bill Macready (on August 16, 1995 on the Connectionists mailing list) has raised the issue of the no-free-lunch principle (Wolpert and Macready 1995; Macready and Wolpert 1995) concerning any claims as to whether one machine learning algorithm can be said to "beat" another:

"In his recent posting, Kevin Lang continues a contest with John Koza of the 'my search algorithm beats your search algorithm ...' variety."

Barak Pearlmutter (on August 17 on the Connectionists mailing list) then defended Lang from the no-free-lunch criticism by saying,

"It might be true that in a universe where everything was equally likely, all search algorithms would be equally awful. But that does not appear to be THE UNIVERSE THAT WE LIVE IN, so it is not unreasonable to ask whether some particular search algorithm performs well in practice." (Emphasis added).

But, regardless of whether you agree with Macready or Pearlmutter, Lang's originally published ML-95 claim that hillclimbing "beat" genetic programming was based on a comparative experiment based on 3-argument Boolean functions and Lang's July 31 claim that "Hill Climbing Still Wins" is based on the testbed of Boolean 4-parity and 5-parity functions. None of these Boolean problems are "the universe that we live in."

7. "Parameter-Free" Multi-Layer Perceptrons

Lang closes his July 31 "Comments" by introducing yet another subject that was not mentioned at all in his published ML-95 paper, namely multi-layer perceptrons.

He tells us

"My positive advice is this: when learning functions, use multi-layer perceptrons"

because this approach, in conjunction with the conjugate gradient search,

"has no control parameters to tweak."

Then, in almost the next sentence, he mentions that the multi-layer perceptron had 5 hidden units; that the hyperbolic tangent function was used; and that the random initial weights drawn uniformly from the interval [-1,+1] !

Presumably, all of this discussion about perceptrons has been introduced to recast Lang's published ML-95 paper as something that it never was, namely, now, a discussion about how perceptrons can also "beat" genetic programming.

The fact is that Lang's ML-95 paper made a very strong (and wrong) claim that hillclimbing "beats" genetic programming based on an inadequate testbed. Lang's July 31 "Hill Climbing Still Wins" retrospectively recasts his first strong (and wrong) claim with a second strong (and wrong) claim.

8. The Problem with Cross-Paradigm Comparisons in Nanoworlds

The fact that Lang's ML-95 paper was uncritically accepted and published by the Machine Learning conference is, in my opinion, a symptom of a serious problem in the field of machine learning.

In my view, the present challenge for machine learning is how to get non-trivial results on non-trivial problems -- NOT the relative speed by which trivial and useless results can be obtained by various alternative paradigms.

Cross-paradigm comparisons (if they are ever useful) can necessarily only be done on problems that can be solved by several different paradigms. Given the severe structural limitations inherent in many commonly-studied machine learning paradigms and their acknowledged inability to solve any significant breadth of problems, cross-paradigm comparisons are necessarily limited to the "least common denominator" of problem domains. In practice, that means some completely trivial problem domain.

Once we get away from the trivial problem domains, the fact is that cross-paradigm time comparisons between genetic programming and other methods are (at least at the present time) a comparison between some finite amount of time and NEVER.

Even if it were true that genetic programming were provably slower than N other machine learning methods on M toy problems (and, I strongly emphasize, I don't the slightest reason to believe that GP is either at the slow OR FAST end of such a scale), the point is that GP is capable of working on a breadth of non-trivial problems that cannot even be touched by other ML methods. The reason for this is that GP operates in the right search space (namely, the search space of computer programs). I believe that time will prove out my expectation that operating in the right search space will give GP the ability to solve some real problems. As I said in my first book (in what I repeatedly tell people is simultaneously the most trite and most important sentence in the book),

" ... if we are interested in getting computers to solve problems without being explicitly programmed, the structures that we really need are COMPUTER PROGRAMS"

(Emphasis in original).

References

Juels, Ari and Wattenberg, Martin. 1994. Stochastic Hillclimbing as a Baseline Method for Evaluating Genetic Algorithms. University of California Computer Science Department technical report CSD-94-834. September 28, 1994.

Koza, John R. 1992. Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambridge, MA: The MIT Press.

Koza, John R. 1994. Genetic Programming II: Automatic Discovery of Reusable Programs. Cambridge, MA: The MIT Press.

Koza, John R. 1995. "A response to the ML-95 paper entitled 'Hill climbing beats genetic search on a Boolean circuit synthesis problem of Koza's' ". Document distributed, by hand, on July 11, 1995 at the 1995 International Machine Learning Conference in Tahoe City, California.

Lang, Kevin J. 1995a. Hill climbing beats genetic Search on a Boolean circuit synthesis problem of Koza's. Proceedings of the Twelfth International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann.

Lang, Kevin J. 1995b. "Comments on 'A response to the ML-95 paper entitled "Hill climbing beats genetic search on a Boolean circuit synthesis problem of Koza's" ' ". Dated July 31, 1995 and distributed on Connectionist mailing list on August 16, 1995 and on ML mailing list on August ---, 1995.

Lang, Kevin J., and Witbrock, Michael J. 1989. Learning to tell two spirals apart. In Touretzky, David S., Hinton, Geoffrey E., and Sejnowski, Terrence J. (editors). Proceedings of the 1988 Connectionist Models Summer School. San Mateo, CA: Morgan Kaufmann Pages 52-59.

Macready, W.G. and Wolpert, D. H. 1995. What Makes an Optimization Problem Hard? Santa Fe Institute Working Paper SFI-TR-95-05-046.

Wolpert, D. H., and Macready, W.G. 1995. No Free Lunch Theorems for Search. Santa Fe Institute Working Paper SFI-TR-95-02-010.

· The home page of Genetic Programming Inc. at www.genetic-programming.com.

· For information about the field of genetic programming in general, visit www.genetic-programming.org

· The home page of John R. Koza at Genetic Programming Inc. (including online versions of most papers) and the home page of John R. Koza at Stanford University

· Information about the 1992 book Genetic Programming: On the Programming of Computers by Means of Natural Selection, the 1994 book Genetic Programming II: Automatic Discovery of Reusable Programs, the 1999 book Genetic Programming III: Darwinian Invention and Problem Solving, and the 2003 book Genetic Programming IV: Routine Human-Competitive Machine Intelligence. Click here to read chapter 1 of Genetic Programming IV book in PDF format.

· For information on 3,198 papers (many on-line) on genetic programming (as of June 27, 2003) by over 900 authors, see William Langdon’s bibliography on genetic programming.

· For information on the Genetic Programming and Evolvable Machines journal published by Kluwer Academic Publishers

· For information on the Genetic Programming book series from Kluwer Academic Publishers, see the Call For Book Proposals

· For information about the annual Genetic and Evolutionary Computation (GECCO) conference (which includes the annual GP conference) to be held on June 26–30, 2004 (Saturday – Wednesday) in Seattle and its sponsoring organization, the International Society for Genetic and Evolutionary Computation (ISGEC). For information about the annual NASA/DoD Conference on Evolvable Hardware Conference (EH) to be held on June 24-26 (Thursday-Saturday), 2004 in Seattle. For information about the annual Euro-Genetic-Programming Conference to be held on April 5-7, 2004 (Monday – Wednesday) at the University of Coimbra in Coimbra Portugal.

Last updated on August 27, 2003