Life In 19x19: http://lifein19x19.com/
Neural networks optimising mapping functions: http://lifein19x19.com/viewtopic.php?f=18&t=18414
Page 1 of 1

Author: lightvector [ Tue Oct 19, 2021 9:15 am ]
Post subject: Re: Neural networks optimising mapping functions

Yep, that's a pretty nice overview. One quibble though:

Quote:
Even with linear activation functions, it doesn't have to follow that (A is a good move) + (B is a good move) = (playing A and B is good). At the input to the network, we enter the state of the game. We then multiply the representation of that state against various weights and sum into the next layers. It would be completely reasonable for a network to have learned "if I played at A, then the positions close by are less valuable now", which might include B.

If you are talking about the "policy" function (i.e. predicting where the next good move is), then sure, you could have linear weights such that a stone at either of A or B reduces the policy output value at the other location for it being the next move. Each location can have an initial positive value for being empty, and a negative weight for the presence of a stone at the other location. But dhu163 didn't use the phrase "good move"; he instead talked about whether having one or more stones is better or worse for a player, which is easier to interpret as a statement about the value function.

And for the value function, dhu163 is completely right. Suppose we have a board state X, plus states A (X plus one stone), B (X plus a different stone), and C (X plus both of those stones), and suppose we use a normal input encoding consisting of the presence/absence of a stone of each color at each location of the board. Then, encoded as vectors/tensors, C = X + (A - X) + (B - X).

If the value function f is linear, we have f(C) - f(X) = f(X + (A-X) + (B-X)) - f(X) = f(A-X) + f(B-X) = (f(A) - f(X)) + (f(B) - f(X)). In other words, the amount of value by which C is better than X is precisely the sum of the amount that A is better than X and the amount that B is better than X, so no interaction between the two stones can be expressed.
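lightvector's additivity argument can be checked in a few lines of numpy. The board encoding below is a toy stand-in (a flat 0/1 vector; the size, stone positions, and random weights are my own arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the input encoding: a flat 0/1 vector with one entry
# per (location, colour).  Size and stone positions are arbitrary.
n_features = 50
X = np.zeros(n_features)
X[[2, 11, 30]] = 1.0                     # some stones already on the board
A = X.copy(); A[5] = 1.0                 # X plus one extra stone
B = X.copy(); B[17] = 1.0                # X plus a different extra stone
C = X.copy(); C[5] = 1.0; C[17] = 1.0    # X plus both, so C = X + (A-X) + (B-X)

w = rng.normal(size=n_features)          # an arbitrary linear value function

def f(s):
    return w @ s

# For a linear f, the gain from adding both stones is exactly the sum of
# the two individual gains: no interaction between A and B is possible.
assert np.isclose(f(C) - f(X), (f(A) - f(X)) + (f(B) - f(X)))
```

Whatever weights w are chosen, the assertion holds; the non-interaction is a property of linearity itself, not of any particular trained network.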

Author: Polama [ Thu Oct 21, 2021 12:03 pm ]
Post subject: Re: Neural networks optimising mapping functions

Excellent point lightvector, you are correct that a linear value network won't represent the idea that stones can be good or bad in combination with each other, and that that's closer to dhu163's point. In essence we're saying "A XOR B is good" (exclusive or), and one of the key observations during the development of neural networks was that linear functions can't represent XOR.
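The XOR point can be made concrete numerically: the best any linear function w . x + b can do on the four XOR cases (found here by least squares; the setup is my own sketch) is to predict 0.5 everywhere, getting no case right:

```python
import numpy as np

# The four XOR input/output pairs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

# Best linear fit w . x + b, found by least squares.
A = np.hstack([X, np.ones((4, 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ coef

# The optimum is w = (0, 0), b = 0.5: the line cannot separate the
# "same inputs" cases from the "different inputs" cases at all.
print(pred)  # -> roughly [0.5, 0.5, 0.5, 0.5]
```

Adding a single hidden layer with a nonlinear activation is exactly what removes this limitation.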

Author: dhu163 [ Fri Oct 22, 2021 6:07 am ]
Post subject: Re: Neural networks optimising mapping functions

Thanks for all the explanations.

Author: dhu163 [ Tue Dec 28, 2021 4:50 pm ]
Post subject: Re: Neural networks optimising mapping functions

I would like to note: if no bias is introduced, then with n input bits and only ReLU activation, the all-zero input is always mapped to zero output in every neuron. And I think it turns out that however many hidden neurons you have, every cut-off hyperplane passes through the origin, so the whole network is positively homogeneous: scaling the input by c > 0 scales every neuron's output by c, and the gradient at x depends only on the ray from the origin through x (0 to x to infinity). But to produce such a neuron, only one hidden layer is necessary. That is, a sequence of nets, each with only one hidden layer, can produce in the limit any output neuron that can be produced by an arbitrary number of hidden layers.

I don't know what happens with a bias, but I assume that changes things completely, so that the net becomes completely general. I can see that you can create any function if you can create a delta function around zero (and then shift it with a bias), and something close to a delta function can be created in the limit in the 2nd hidden layer with only 2n offset neurons in the first hidden layer. If correct, it seems that theoretically only 2 hidden layers are required to produce any neuron you want. I have oversimplified, but perhaps this is not far from the truth; needs more thought.

edit (more thought):

THM: with ReLU, we cannot create an arbitrary neuron with just 1 hidden layer.

Proof: the cut-off (n-1)-plane of each hidden neuron includes an (n-2)-plane at infinity, and so do all the planes close to it. The value at (i.e. in the limit towards) each point of that (n-2)-plane is either infinite (when the point lies in the non-zero region of a contributing neuron whose cut-off plane is not parallel to it) or is the sum of the finite contributions from neurons whose cut-off planes are parallel to it. Hence for the output neuron to tend to zero at infinity in every direction, the sum of the finite contributions from neurons with cut-off plane parallel to a given (n-1)-plane must be zero, for every such (n-1)-plane. However, the output value at any point of the n-space is the sum of these contributions over all (n-1)-planes through it, and is therefore zero. (In the above, replace "zero" with "non-positive" as required if considering activation in the 2nd layer.) Hence we have proved that a non-zero output neuron in the 2nd hidden layer cannot tend to zero at infinity in all directions. QED (hopefully).

This could perhaps also be proved by inversion: consider 1/(output). I suspect this is true of any activation function that produces a kind of 1d result, but it might be special to ReLU.

Note that we can certainly make a bounded non-zero neuron by placing two neurons with parallel cut-offs but opposite signs next to each other. Call these pipe neurons.

This result implies we cannot create an arbitrary function with just 1 hidden layer. And yet we clearly can with 2 hidden layers: add two pipe neurons, each non-zero only near the zero set of its own (n-1)-plane, with non-parallel (n-1)-planes. The sum reaches its highest values only where the two (n-1)-planes intersect, and the rest can be discarded and set to zero with a bias. By taking multiples, we can make the highest point arbitrarily high, and can also make the neuron non-zero only in an arbitrarily small n-volume around the point we want. So:

THM: we can create any delta function in the limit with 1 hidden layer of ReLU (the bump itself appearing as a neuron of the next layer).

Hence:

THM: we can create any general non-negative function in the limit with 2 hidden layers of ReLU.
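The two ingredients used so far (the bias-free zero-at-zero behaviour, and the bounded bump made from parallel cut-offs of opposite sign) can be checked in a few lines. This is my own numpy sketch; the layer sizes, seed, and b = 0.3 are arbitrary, and the 1-D triangular bump is meant as the one-input analogue of a pipe neuron:

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)
rng = np.random.default_rng(0)

# Claim 1: with no biases, a ReLU net maps 0 to 0 and is positively
# homogeneous, f(c*x) = c*f(x) for c > 0, so the output is determined
# by the ray from the origin through x.  (Sizes are arbitrary.)
W1 = rng.normal(size=(8, 3))   # first-layer weights, no bias
W2 = rng.normal(size=(1, 8))   # output weights, no bias

def f(x):
    return (W2 @ relu(W1 @ x))[0]

x = rng.normal(size=3)
assert np.isclose(f(2.5 * x), 2.5 * f(x))
assert f(np.zeros(3)) == 0.0

# Claim 2: once biases are allowed, two parallel cut-offs of opposite
# sign bound the output: relu(t) - 2*relu(t - b) + relu(t - 2b) is a
# triangular bump, non-zero only on (0, 2b) and peaking at t = b.
b = 0.3
bump = lambda t: relu(t) - 2 * relu(t - b) + relu(t - 2 * b)
ts = np.linspace(-1.0, 2.0, 601)
vals = bump(ts)
assert np.all(np.abs(vals[ts <= 0]) < 1e-12)
assert np.all(np.abs(vals[ts >= 2 * b]) < 1e-12)
assert np.isclose(vals.max(), b)
```

In n dimensions the same three-term combination applied to a linear form w . x gives a slab ("pipe") rather than a localised bump, which is why the construction above needs a second layer to intersect two non-parallel pipes.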
We can create any general function in the limit with 2 hidden layers of ReLU if no activation function is applied to the output neuron.

THM: if there is only one input neuron, we can create any general non-negative function with 1 hidden layer of ReLU (exercise for the reader).

Does this matter for training? Probably not much, since we normally only care about the output on a small box and not at infinity, but we do care that there are holes in function space that can't be mapped.

Even so, I think the metric/topology of neural nets, and how different activation functions/architectures influence training, is surely still an interesting topic of study, because it helps map the net to what you want to use it for. It is also very hard (but non-equilibrium thermodynamics seems hard too, has been a hot topic for a while, and has produced some results).

For example: What is the effect on training of a large net having a hidden layer with far fewer neurons than the input and output layers? Does this slow down evolution to match that layer? Does it help to think of a net in entropy or energy terms, with forces at the input as well as at the output, propagating through the net? What do we optimise: training time, network size, accuracy, etc.? For a given accuracy, what architecture minimises (number of neurons) * (training time)? Given just two hidden layers of arbitrary size and a bias, how well does a net train? Does it always converge to global optima?

My intuition is that with 1 input neuron, 1 output neuron, and 1 hidden layer of arbitrary size initialised randomly, with a sufficiently small increment and, say, a weighted least-squares cost function, then given any positive function as the "goal" to match (on a bounded region), this net should train to perfection.
My intuition says that what the output orders is what it gets, and all orders can be met.

I wanted to say that a least-squares cost means that, weights being equal, there is greatest pressure on the neurons in the layer before to become a multiple of the ideal output function, but this is just false.

Well, since we are using gradient descent, if we use a sufficiently large batch and a small enough increment, the cost has a vanishingly small chance of increasing. Hence the goal is to show that there are no local optima except where the output matches the goal. This depends on the hidden layer having a sufficiently random set of weights and on it staying that way during training.

We can see that a local optimum occurs where the gradient is zero (so all hidden-layer neurons are orthogonal to the difference between the output and the ideal, after applying a filter to remove where either is zero) and the Hessian is positive definite (saddles are possible but of vanishingly low probability, so ignore them for now).

Perhaps the asymmetry that priority is given to neurons with higher weights can be beneficial here, to avoid symmetry and maintain randomness.

With randomised training, which local optima are reached more often? How do we optimise training so as to reach a good local optimum most of the time?

Jokes, but I can't help thinking of a neural net as a poor life-form that is made part of a circuit every time we train it. Backpropagation is Huxley-style electric shock therapy to train it to act as we wish. We add a connection to the input and the output and hit the power.

The only difference, perhaps, from humans is that human body architecture has many more inputs and outputs than engineers can access, as well as internal complexity (not just repeating patterns), and was "designed" for different purposes, so we can't alter the internals so easily.
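The single-input setup in that intuition can be put into a minimal numpy sketch: one input, one hidden ReLU layer, one linear output, trained by plain full-batch gradient descent on a positive target over [0, 1]. The layer size, target function, learning rate, and step count below are my own arbitrary choices, not anything from the thread:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1 input -> one hidden ReLU layer -> 1 linear output.
n_hidden = 30
W1 = rng.normal(0.0, 1.0, n_hidden)   # hidden weights (single input)
b1 = rng.normal(0.0, 1.0, n_hidden)   # hidden biases
W2 = rng.normal(0.0, 0.1, n_hidden)   # output weights
b2 = 0.0

xs = np.linspace(0.0, 1.0, 200)
ys = 1.5 + np.sin(3.0 * xs)           # an arbitrary positive target

def forward(x):
    h = np.maximum(0.0, np.outer(x, W1) + b1)   # (N, n_hidden)
    return h, h @ W2 + b2

_, pred0 = forward(xs)
mse_before = np.mean((pred0 - ys) ** 2)

lr = 0.005
for _ in range(3000):
    h, pred = forward(xs)
    err = pred - ys
    scale = 2.0 / len(xs)                         # d(mean err^2)/d(pred)
    gW2 = scale * (h.T @ err)
    gb2 = scale * err.sum()
    gh = scale * err[:, None] * W2[None, :] * (h > 0)
    gW1 = (gh * xs[:, None]).sum(axis=0)
    gb1 = gh.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

_, pred1 = forward(xs)
mse_after = np.mean((pred1 - ys) ** 2)
assert mse_after < mse_before   # gradient descent reduced the error
```

This only demonstrates that the error decreases under a small enough step, not that the net escapes every local optimum; whether it trains "to perfection" is exactly the open question of the post.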

Author: dhu163 [ Wed Dec 29, 2021 8:49 am ]
Post subject: Re: Neural networks optimising mapping functions

Some tensorflow experiments seemed initially to show I am wrong about convergence with 1 input and 1 hidden layer. For example, I tried an ideal function of sin(5x) and the net still mapped 0 to 1. It didn't have much success near x=0 or x=1, though it was fairly accurate in the middle.

My guess is that a sort of prisoner's dilemma issue is occurring, because two hidden-layer neurons have to work together to produce the initial kink, but each such neuron on its own is close to orthogonal to the error term. Alternatively: a sum of at least two neurons is required, but training causes the coefficient of one of them to decrease, which is the wrong direction.

This is presumably where a 2nd hidden layer helps a lot, by encouraging teamwork like a team captain. A caveat is that competition between team captains for team members' attention can cause trouble if they pull in opposing directions, since that makes the neuron freeze. Again, bad metaphors, but I can't help wondering whether forcing teamwork is part of the reason sexual reproduction has had more visible success than asexual reproduction.

My experiments suggest that, given n neurons per layer, there is a good minimum number of hidden layers h that works well:

n < 4: with h < 10, an RMS error of 0.2 is possible, but the net can't fit the initial kink and maps 0 to 1; with h > 10, it struggled to reduce RMS error below 0.6 even with many layers. I think there may be a vanishing-gradient issue with more layers, as the input area gets "colder", so the net doesn't notice what the input even is and just outputs a constant value. Is this akin to temperature?
n = 5: h = 8
n = 30: h = 5 (6 converges faster)
n = 100: h = 3
n = 10000: h = 1 (with 5 times more rounds of training data), but even then the RMS error was still 0.08, several orders of magnitude worse than the other tries.

I think that more neurons per layer always improves accuracy (with the same training data and a small enough increment), but more layers can make things worse, especially with few neurons per layer. My explanation:

The net struggles most when the error function is orthogonal to the neurons in the previous layer, though even this can be offset by changes to those neurons. When there are 10,000 neurons, you would expect that the error can't be orthogonal to all of them, even after the chaos of training. So I think the net also struggles when the error is of the form n1 + n2 + c, where n1, n2 are neurons in the previous layer but, due to the c term, the inner products with n1 and n2 are negative.

This is true for my problem above with 1 layer. The required offset to fix the problem around x=0 is a sum of neurons like ideal(x) - model(x) = (ReLU(x) - ReLU(x-b))/b - 1 with b ~ 0.3. However, the inner product of this with ReLU(x) over [0, 1] is about -b^2/6 < 0. Since the gradient looks like (inner product)*(weights), this reduces the contribution from ReLU(x), which can't be good. The inner product with ReLU(x-b) is zero, since the two are non-zero at different points (they have disjoint support).

How does another hidden layer greatly reduce the problem? It makes it more likely that a neuron of the form n1 + n2 is already floating around, which then plugs nicely into the error term, with maximal covariance (= inner product / norms). The connections also mean that such a neuron will readjust its coefficients to be more like n1 + n2 if necessary.
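The inner-product claim can be checked numerically. This is a quick sketch of my own: b = 0.3 and the residual form follow the post, while the [0, 1] grid and Riemann-sum discretisation are my choices. The integral of the residual against ReLU(x) works out to -b^2/6 ~ -0.015, negative as claimed, while the inner product with ReLU(x - b) vanishes because the supports are disjoint:

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

b = 0.3
xs = np.linspace(0.0, 1.0, 100001)
dx = xs[1] - xs[0]

# The residual from the post: ideal(x) - model(x) on [0, 1].
# It equals x/b - 1 on [0, b) and 0 on [b, 1].
residual = (relu(xs) - relu(xs - b)) / b - 1.0

# L2 inner products on [0, 1], approximated by a Riemann sum.
ip_with_relu_x  = np.sum(residual * relu(xs)) * dx      # exact integral: -b**2/6
ip_with_relu_xb = np.sum(residual * relu(xs - b)) * dx  # disjoint support

print(ip_with_relu_x)   # negative (about -0.015): the gradient pushes the
                        # ReLU(x) coefficient the wrong way
print(ip_with_relu_xb)  # essentially 0
```

The negative sign is the whole point: gradient descent on this one-layer net actively shrinks the very component needed to fix the kink near x = 0.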