Wow, 8 responses in 2 hours! Thanks, people.
speedchase: I agree with your approach. As you say, testing can be difficult because people’s ranks keep changing. (Here I’m thinking not of the newbies themselves but of the 21k’s we’re comparing them to -- as we go through the process of testing the newbies, the 21k’s get better, which distorts the test.) But if we design a computer program to be 23k (and have it play against a bunch of humans in the 14-16k range to confirm/fine-tune its rank), then we can pit the 23k program against the newbies, and the program won’t get any better. (Unfortunately I don't have the time or resources to carry this out myself...)
snorri: The Kano Yoshinori book you’re referring to originally said “35 Kyu to 25 Kyu” in its title, but then was changed to say “30 Kyu to 25 Kyu”. I’ve always imagined that the purpose of the change was to sacrifice accuracy for an increase in marketability. (That is, if you say that new players start out weaker than 30k, you don’t get taken seriously.) So where you imagine Go culture taking a cue from KY, I imagine KY taking a cue from Go culture.
hyperpape: Be careful about assuming you know what a person’s real question is.

* * * * *
So what evidence can *I* provide? Well, my rank is somewhere around 19k AGA, and I’m confident that I could beat most first-time players with a 17-stone handicap, 3 games out of 3. Counting one stone per rank, 19 + 17 = 36, so that suggests that the average new player is weaker than 36k. But my argument is weak, because I don't actually know that I'm 19k, and I haven't actually played a bunch of first-timers with 17-stone handicaps to see whether I win 3 games out of 3. (Also, the handicap system is generally considered to get less accurate after 9 stones.) It goes back to the general principle that using yourself as an example to prove a point is dangerous, because it’s hard to see yourself accurately and there's ego involved.
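
To make the arithmetic above concrete, here's a rough sketch in Python. It assumes the usual rule of thumb that one handicap stone equals one rank, which (as noted) gets shaky past 9 stones, and the function names are just made up for illustration. It also shows one reason a 3-out-of-3 result wouldn't prove much on its own: if the games were truly even and independent, a sweep would still happen by luck 1 time in 8.

```python
def implied_kyu(my_kyu, handicap_stones):
    """Kyu rank of an opponent who would be an even match at this handicap.

    Assumption: one handicap stone = one rank (a rule of thumb only).
    """
    return my_kyu + handicap_stones


def chance_of_sweep(win_prob, games):
    """Probability of winning every game, assuming the games are independent."""
    return win_prob ** games


even_match = implied_kyu(19, 17)
sweep_by_luck = chance_of_sweep(0.5, 3)

print(f"Even match for a 19k giving 17 stones: {even_match}k")
print(f"Chance of going 3 for 3 at even odds: {sweep_by_luck}")
```

If I really do win 3 out of 3 at 17 stones, the newbie is probably weaker than the "even match" rank, but three games is a small sample either way.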