 Post subject: Incremental network depth and AI training speed
Post #1 Posted: Wed Nov 08, 2017 3:02 am 
Beginner

Posts: 18
Liked others: 0
Was liked: 2
From the AlphaGo Zero paper, one can see that fewer "blocks", i.e. a shallower neural network, lead to a faster learning process but also to an earlier plateau. Later I read about the nature of the residual network in use and noticed one thing: by default, a residual "block" can pass the data from the previous block through to the next block as is, via its skip connection. So in theory one could interleave new residual blocks into an existing residual network and it would function the same.

So here is a thought: what if we train a shallower network first, and add blocks only after its improvement slows? Could we end up with a similarly strong AI player while saving computation time and resources?
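
To make the idea concrete, here is a rough sketch (in PyTorch, just for illustration; this is not AlphaGo Zero's actual code, and the block structure is simplified): a residual block computes x + F(x), so if F(x) happens to be zero, the block simply passes its input through.

Code:
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Simplified residual block: output = x + F(x), with a skip connection."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return x + self.conv2(torch.relu(self.conv1(x)))  # skip path carries x through unchanged

# A shallow trunk; "adding depth later" would mean splicing more ResBlocks in here.
trunk = nn.Sequential(*[ResBlock() for _ in range(4)])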

 Post subject: Re: Incremental network depth and AI training speed
Post #2 Posted: Wed Nov 08, 2017 4:09 am 
Dies in gote

Posts: 36
Liked others: 0
Was liked: 6
Rank: Europe 5 dan
KGS: Flashgoe
No, we can't. It just doesn't work that way. There is a specific branch of NN research called "transfer learning", and the main finding is that it is very difficult.

 Post subject: Re: Incremental network depth and AI training speed
Post #3 Posted: Wed Nov 08, 2017 4:40 am 
Beginner

Posts: 18
Liked others: 0
Was liked: 2
But transfer learning is all about reusing a trained NN for related new tasks (e.g. playing go with a different board size, ruleset, or komi), while I'm talking about expanding a NN for the same task.

 Post subject: Re: Incremental network depth and AI training speed
Post #4 Posted: Wed Nov 08, 2017 6:52 am 
Gosei

Posts: 1590
Liked others: 886
Was liked: 528
Rank: AGA 3k Fox 3d
GD Posts: 61
KGS: dfan
I see no reason that you couldn't do this, but I'm not sure how much gain you'd get from it. You need the power of the full residual network eventually anyway, so my intuition is that you might as well start using it right away, rather than spending early training time on a simplified network that you know doesn't have the capacity of your eventual network and might have to change in some fundamental ways. Given that your residual blocks are certainly going to end up doing something, the downstream layers are going to get different inputs in your residual net than in your original dense net, and will have to do some "unlearning" to figure out how to handle them. So I'm not sure whether the "dense net jumpstart" actually helps overall. It is an interesting idea, though!

You may be interested in another approach with similar motivation: Deep Networks with Stochastic Depth. They keep the same residual net from beginning to end, but randomly bypass some fraction of the residual layers during training to speed things up. It sounds crazy, but it is basically the same idea as dropout (which also sounds crazy at first), only magnified.
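
Roughly, each block keeps its skip connection, but its residual branch is switched off at random during training (and scaled by the survival probability at test time). A toy sketch of the idea in PyTorch (my own simplification, not the paper's code):

Code:
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Residual block whose residual branch is randomly bypassed during training."""
    def __init__(self, channels=64, survival_prob=0.8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.survival_prob = survival_prob

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() > self.survival_prob:
                return x                            # bypassed: only the skip connection
            return x + self.body(x)                 # active: full residual computation
        # At test time every block is active, scaled to match the training-time expectation.
        return x + self.survival_prob * self.body(x)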


This post by dfan was liked by: gamesorry
 Post subject: Re: Incremental network depth and AI training speed
Post #5 Posted: Wed Nov 08, 2017 7:50 am 
Lives in gote

Posts: 311
Liked others: 0
Was liked: 45
Rank: 2d
Seems hard to tell without actually trying, but I wouldn't expect it to work well. (This is generally true for most ideas in similar areas: 99% of them don't improve performance or outright fail.)

Bootstrapping a learning system is possible in a lot of ways, but the achievable gains vary. In this case, if you look at the strength graph, it changes fast at the beginning but soon becomes flatter. So most of the compute is spent strengthening an already strong system, and that part you cannot save (it needs the full network). You also introduce an extra phase in which the network adjusts itself to the structural changes - a further performance loss. It's also unclear how much information the bigger network can reuse from the earlier state - it may even need complete relearning. And there are opinions that Zero ended up stronger than Master precisely because of the "tabula rasa" approach - so starting from a non-zero state may even hurt the final strength.

On the other hand, neural networks are still relatively new, and a lot of improvements will surely be made. The inefficiency of the learning process does seem like an open area for such improvements.

 Post subject: Re: Incremental network depth and AI training speed
Post #6 Posted: Wed Nov 08, 2017 10:17 am 
Honinbo

Posts: 10905
Liked others: 3651
Was liked: 3374
One idea may be to do what the brain does. Instead of adding "neurons" or connections, subtract them. Below some activation threshold, just eliminate them over time. The result will be a sculpted, structured system, maybe even a modular one. OC, that process is intolerant of errors. ;)
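
Something like this already exists in current NN practice under the name "pruning". A toy sketch (PyTorch; it thresholds weight magnitudes rather than activations, just to show the shape of the idea):

Code:
import torch
import torch.nn as nn

def prune_small_connections(model: nn.Module, threshold: float = 1e-2) -> None:
    """Toy sketch: zero out connections whose weight magnitude falls below a threshold."""
    with torch.no_grad():
        for module in model.modules():
            if isinstance(module, (nn.Linear, nn.Conv2d)):
                weak = module.weight.abs() < threshold
                module.weight[weak] = 0.0   # the "eliminated" connections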

_________________
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.

 Post subject: Re: Incremental network depth and AI training speed
Post #7 Posted: Thu Nov 09, 2017 12:52 pm 
Gosei

Posts: 1435
Location: California
Liked others: 53
Was liked: 171
Rank: Out of practice
GD Posts: 1104
KGS: fwiffo
I'm guessing that adding new layers would initially cause performance to drop to basically zero, but it would probably train back to something similar to its old performance somewhat quickly. This is similar to pre-training: it's often helpful to pre-train a model on some simple task (e.g. autoencoding) before training it on the more complex task (e.g. object recognition).

There's no way to know what it would do without trying, but I highly doubt you'd get any benefit. The shortest path back to high performance would be to make the new layers simply an identity function, so they'd just turn into a really expensive no-op.

The fact that smaller models train faster but larger models have better final performance is totally normal and expected. Large models are more computationally expensive, making them slower in real time. Also, the gradients (the model adjustments from training) are spread out over a larger number of trainable weights, so they may train more slowly in terms of the number of training cycles.

There is the possibility of doing the reverse - this is known as distillation. You train a big, computationally expensive model, then use the output of that model to train (or pre-train) a small, fast model. This sometimes results in better performance than training the small model from the raw training data, because the smarter model ends up removing some of the noise in the training data.
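
In code, the core of it is just a different training target: the small model learns to match the big model's output distribution instead of the raw labels. A rough sketch (PyTorch, with hypothetical small_model / big_model policy networks that map board tensors to move logits):

Code:
import torch
import torch.nn.functional as F

def distillation_step(small_model, big_model, boards, optimizer, temperature=1.0):
    """Toy sketch: one training step of the small model against the big model's output."""
    with torch.no_grad():
        teacher_logits = big_model(boards)          # big, slow "teacher"
    student_logits = small_model(boards)            # small, fast "student"
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()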

_________________
KGS 4 kyu - Game Archive - Keyboard Otaku

 Post subject: Re: Incremental network depth and AI training speed
Post #8 Posted: Thu Nov 09, 2017 7:36 pm 
Gosei

Posts: 1590
Liked others: 886
Was liked: 528
Rank: AGA 3k Fox 3d
GD Posts: 61
KGS: dfan
fwiffo wrote:
I'm guessing that adding new layers would initially cause performance to drop to basically zero, but it would probably train back to something similar to its old performance somewhat quickly.

Adding new intermediate residual layers that are initialized to do nothing (don't add any perturbation to the result of passing the inputs straight through) would cause the network to perform exactly as it did before (just with extra no-ops), until you start training the new system.
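
For example (a quick PyTorch sketch, assuming the residual branch ends in a plain convolution): zero-initialize the last layer of the new block, and its output is exactly its input until training resumes.

Code:
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return x + self.conv2(torch.relu(self.conv1(x)))

new_block = ResBlock()
nn.init.zeros_(new_block.conv2.weight)   # residual branch now contributes exactly zero
nn.init.zeros_(new_block.conv2.bias)

x = torch.randn(1, 64, 19, 19)
assert torch.equal(new_block(x), x)      # the freshly spliced-in block is a perfect no-op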
