Life In 19x19

Neural networks optimising mapping functions

Author:  dhu163 [ Sun Oct 17, 2021 6:10 am ]
Post subject:  Neural networks optimising mapping functions

This post is designed for a mathematical + compsci audience only. It is to demonstrate my understanding, and to share some thoughts.

As far as I recall:
There are around 3^361 possible Go board positions, of which roughly 10^170 are legal according to Tromp.
The number of weights in a 40-block neural net based on AlphaGo (such as KataGo) is around 10^8.
The number of neurons in a human brain is around 10^11.

Do the dimensions match? Is dimensions the right way to think about the problem?

For example given n input bits (e.g. to represent board positions), you want to output 1 number to represent the value of that position. The dimension of the function space is 2^n * 1 = 2^n.
However, does the number of weights in a neural network roughly correspond to dimensions? The key thing is that neural network outputs are not linear in their weights or inputs, so thinking of the functions as a vector space probably doesn't lead very far. Non-linearity is critical because otherwise linearity would imply that if a stone A helps player X, and a stone B helps player X, then having both stones A and B on the board will help player X. This is false, for example if A and B together make a self-atari or suicide.

On the other hand, matrices still play a key role in neural networks. The linearity point is why I was surprised when I was first told by a Deepmind employee that the AI was based on matrices. However the study of matrices is very deep, with highly optimised (e.g. parallelised) algorithms to work with them - adding, multiplying etc.
It is also easier to think in terms of linear things (since we are familiar with adding up normal numbers and matrices are just collections of numbers), and the calculus formulae are much simpler to write down with matrices (merely the chain rule: dAx/dx = A).

Neural networks in 5 paragraphs:

If there are n input bits, these n numbers are called the zeroth layer (with n input neurons). Each subsequent layer consists of say m neurons, each of which is a function of the neurons in the previous layer. After a good number of layers (I think it is 80 in a 40-block net?), the output layer consists of your output, e.g. the value and policy numbers (just 2 neurons). There is a weight (a real number) for every relation between a neuron in one layer and a neuron in the next, indicating how they are correlated.

In image-processing CNNs (which AlphaGo uses), not all pairs of neurons are linked between neighbouring layers. Instead, "nearby" neurons are linked. There are well-developed mathematical tricks to make the processing as simple as possible while still preserving good enough links. In Go, this mirrors the ruleset: a stone's immediate impact on the board is only on the liberty count of its 4 neighbours. Only through the connectedness of the board does this influence slowly propagate out. Note that a ladder is the only Go shape that can propagate influence arbitrarily far through empty space without decreasing in value, but that is a special tactical case where both sides have only one reasonable move to play in order to win that fight. Normally the influence of a stone/position decreases exponentially with distance. In this way, Go is a primarily local game.

The non-linearity is inserted by an "activation function" F at each neuron. So each neuron computes a linear sum of the previous layer's neurons (sum of weight w_k times neuron k's output), which is then passed through F to produce the neuron's output. Common choices are tanh, sigmoid (for which the derivative is very simple; it is used in Elo ratings as well), RELU and many other variants.
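A minimal numpy sketch of a single neuron (the weights and inputs here are made-up numbers, not from any real net):

```python
import numpy as np

def relu(x):
    # RELU zeroes out negative pre-activations
    return np.maximum(0.0, x)

def sigmoid(x):
    # squashes to (0, 1); its derivative is sigmoid(x) * (1 - sigmoid(x))
    return 1.0 / (1.0 + np.exp(-x))

def neuron(prev_layer, weights, activation=relu):
    # one neuron: weighted sum of the previous layer's outputs, then the activation
    return activation(np.dot(weights, prev_layer))

prev = np.array([1.0, -2.0, 0.5])   # made-up outputs of the previous layer
w = np.array([0.3, 0.1, 0.8])       # made-up weights
print(neuron(prev, w))              # relu(0.3 - 0.2 + 0.4) ≈ 0.5
```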

The training is done by using a loss function (e.g. cross entropy for sigmoid, or least squares) on training games and gradient descent. If a position resulted in a loss, the value output for that position should be closer to 0% than 100%. The loss function is chosen so that if a position results in a win p% of the time, the value output for that position should converge to p% (if only training on that position). The weights are adjusted by a size and direction (+ or -) according to how much they contribute in the current net to changing the value output for that position. Since large weights have the most impact on the output, they are also modified the most. This leads to problems such as exploding or vanishing gradients, and dead neurons. However, normally no neuron is completely dead (that view is too focused on the discrete dimensions we can see, and forgets the relations between the neurons). With different inputs, different neurons will be more or less important.

There is a constant feedback loop of adapting neurons as more training is done, learning and forgetting. Sometimes good simple rules learnt previously are replaced by more efficient better neuron rules. The random history of the net's past weights can affect how it fits the mapping function so keeping several different nets at one time can be a good idea rather than pure self-play. The idea is to get the net to converge to optimal play as best as possible.

RELU, gradients
Thinking in terms of sigmoid activation functions is quite confusing. However AFAIK, RELU seems to work just as well or even better, often with faster training times.
RELU(x) = x if x>0 else 0
The point is that this keeps most of the linearity but provides a cut-off point below which everything is a flat zero. This is a lot easier to analyse carefully.

We can think of each neuron as a function of the input space (extend {0,1}^n to R^n since the functions can take any number in R as input). It is convenient to represent this scalar function as a gradient field in R^n.

In the zeroth layer, each neuron represents a dimension, so the n neurons represent the n basis vectors.

In the first layer, if activation is not used, then each neuron is first a linear sum of the inputs, so as a function it is a hyperplane in the input x output space. The gradient is constant in the whole space (picture n=2: a plane where every point maps to a fixed vector). A RELU activation introduces a cutoff line, perpendicular to the gradient, below which the gradient is zeroed out. Crossing a cutoff line leads to a fixed jump in gradient depending on which direction you cross it in. When n>2, the cutoff is a hyperplane of dimension n-1, but for now we stick to visualising n=2.
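A tiny sketch of that picture, for a single bias-free first-layer neuron relu(w . x): the gradient with respect to the input is w on the active side of the cutoff line and zero on the other side (the vectors below are arbitrary examples):

```python
import numpy as np

def grad_relu_neuron(w, x):
    # gradient of relu(w . x) with respect to the input x:
    # equal to w on the active side of the cutoff line (w . x > 0),
    # zeroed out on the other side
    return w if np.dot(w, x) > 0 else np.zeros_like(w)

w = np.array([1.0, 2.0])                            # made-up first-layer weights
print(grad_relu_neuron(w, np.array([1.0, 1.0])))    # active side: [1. 2.]
print(grad_relu_neuron(w, np.array([-1.0, -1.0])))  # zeroed side: [0. 0.]
```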

In the second layer, you have a linear combination of outputs in the first layer with some cutoff. In this way (pre-activation), you can have a function with several different cutoff lines inherited from the first layer. These don't have to be parallel, and the gradient can point the opposite direction as well, cancelling out gradients. The activation step leads to a new cutoff "curve" which is always perpendicular to the gradient there, but which separates the spaces into >0 and <0, replacing the gradient with zero if the function is <0. Many segments of this curve will be parallel to cutoff lines in the first layer, but where nonzero gradients from different neurons overlap, the curve may bend to smooth out the differences in gradient.

Is there still a sense in which, if a neuron is created with m weights, varying those m weights means the function space has dimension m? I think so, yes. Especially with the standard technique of adding a constant bias neuron at each layer and connecting it with a weight to all neurons in the next layer: this means that addition of an arbitrary constant is possible in a neuron, so the cutoff line can be shifted back and forth as well.

Metrics and simplifying models
Go has its Elo ratings as a model, but as KataGo helps demonstrate, good strategy against weaker players can be very different from good strategy against stronger players.
This is because it is a two-player game and your results depend very much on your opponent. When you have three-player games, things tend to get even more complicated, and standard game-theory notions of "optimal play" regardless of your opponent's actions stop making sense, because if your opponents always cooperate against you, you always lose.

In a one player game, the final score works as a good metric.

Let us consider a one-player perfect-information game with two choices at each of n-1 moves and no repeated positions. The game ends after those n-1 moves and is scored. Then there are 2^0+2^1+...+2^(n-1) = 2^n - 1 possible states. It is natural to use n bits to encode the state.
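The state count is a one-line geometric series:

```python
def num_states(n):
    # 2^0 + 2^1 + ... + 2^(n-1) states for a game with n-1 binary moves
    return sum(2 ** k for k in range(n))

print(num_states(3))   # 7 = 2^3 - 1
```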

How much different does a well designed encoding make compared to an arbitrary one? For example, on a go board, similar local shapes crop up a lot, but is this enough to make a big impact?

To simplify things, we assume that the neural network only outputs a value number in R which is trained to match the expected score. During game play, the bot reads ahead one move and uses a softmax to calculate the probability of playing each next move. Say the two possible expected scores from the current position are a and b. Then the probability of playing to a is e^a/(e^a+e^b). A multiplier alpha can be inserted so it is e^(alpha*a)/(e^(alpha*a)+e^(alpha*b)).
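This selection rule is easy to sketch (the scores below are arbitrary numbers for illustration):

```python
import math

def move_probability(a, b, alpha=1.0):
    # probability of choosing the move with expected score a over the one
    # scoring b; a larger alpha makes the choice sharper (more greedy)
    ea, eb = math.exp(alpha * a), math.exp(alpha * b)
    return ea / (ea + eb)

print(move_probability(1.0, 0.0))             # ~0.731
print(move_probability(1.0, 0.0, alpha=10))   # near-greedy, > 0.999
```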

If the network is big enough (say 2^n weights), then you expect this to train to perfect accuracy with zero loss (but not necessarily perfect play). If there were 2^n independent weights training on each position, this is guaranteed to converge to perfection (if the weights don't bomb out and die with zero gradient). This is because all positions are reached with nonzero probability, and the 2^(n-1) final scored states will be trained to perfection since they are trained to match a fixed number. Other states will have more fluctuations, since the expected score from such positions depends on the probability distribution of moves chosen from that position, which depends on the imperfect neural network. However, once the 2^(n-1) final states are perfectly scored (within eps << 1), the distribution of moves into them is fairly fixed, so the 2^(n-2) states one move before the end of the game will be training to match an almost fixed number, and so on, gradually making the expected score for earlier states, all the way back to the start, more accurate. It is unclear to me how the non-independence of the weights in a neural network affects training accuracy.

A simple example
Using the assumptions above, let's set n=3 and use a very simple network with no hidden neurons, just 3 weights from input to output. (In theory we should need 7 weights to map all 7 states in general.)

If we ignore (0,0,0), let's say the game tree is

root = (1,0,0)
(parent: children) pairs:
(1,0,0) : (0,1,0) (1,1,0)
(0,1,0) : (0,0,1) (0,1,1)
(1,1,0) : (1,0,1) (1,1,1)

The rightmost nonzero bit indicates the level, and the bits to the left of that indicate the parent.
with final scores A,B,C,D in (0,0,1),(0,1,1),(1,0,1),(1,1,1) respectively.

I wrote a simple program (I just started learning tensorflow yesterday) to train on this model. RELU led to a dead network around 10% of the time, so I switched to LeakyReLU. I can't make any general observations yet, but I do note that nodes which lead to bad scores don't get many playouts, and tend to have much more inaccurate value estimates. This is because they occur less often, so they contribute less to the loss function.
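A minimal numpy sketch of the leaf-fitting part of this setup (not the full self-play loop; the scores A, B, C, D below are hypothetical, chosen XOR-like). With only 3 weights, the linear net cannot match all 4 leaf scores and settles on the least-squares compromise:

```python
import numpy as np

# the four leaf states of the toy tree and hypothetical final scores A, B, C, D
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([1.0, 0.0, 0.0, 1.0])   # chosen XOR-like for illustration

w = np.zeros(3)                      # the whole net: 3 weights, no hidden layer
lr = 0.1
for _ in range(2000):
    grad = X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
    w -= lr * grad

print(np.round(X @ w, 3))   # [0.5 0.5 0.5 0.5]: the least-squares compromise
```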

Have I really learnt anything yet? I seem to have made a small step towards answering my question on dimensions but otherwise, I still don't really understand the interaction between network architecture and ability to train to the right answer. I also don't much understand what changes for a 2 player game, or how the input data format can help or hinder information content/flow.

I suppose that strength in a 2-player game is about the confidence and skill to choose a line where you have to play m moves perfectly in a row in order to profit (compared to a simple variation), as well as having lots of ways to filter out bad moves to raise the lower bound on how bad a move you might play. Much of "aggression" probably amounts to assuming that you will play such tight (紧凑) sequences more perfectly than your opponent. In particular, being able to live in your opponent's moyos and save weak cutting stones.

What I really want to see is some relation between the network structure (e.g. neurons in a bottleneck) with a small number of key concepts (human or otherwise) that we can measure and interpret with some kind of symbolic logic/arithmetic. But I don't know if that is reasonable or what that would look like or how to make progress.

Author:  lightvector [ Sun Oct 17, 2021 4:23 pm ]
Post subject:  Re: Neural networks optimising mapping functions

It sounds like you've dug into this quite a lot! There are lots of minor things that are stated a little inaccurately, or ways things are phrased that are oversimplified, but also yes a lot of your understanding is reasonable and good.

The most notable of these minor things: the total number of possible Go board positions is not a relevant number; it's a vast, vast overestimate that you shouldn't pay much attention to.

Note: technically we should talk about game tree complexity, not state space size, but for simplicity I'm going to conflate and ignore that for this post.

For starters, even if you're trying to rigorously prove a solution to the game or to play optimally, with good algorithms you generally only care about the union of the set of positions reachable by either:

* Black plays optimal or near-optimal moves, while white plays anything. (To prove black's best result >= x, you must consider every white resistance, but for each of black's turns you need only demonstrate ONE good enough move.)
* White plays optimal or near-optimal moves, while black plays anything. (To prove black's best result <= x, you must consider every black resistance, but for each of white's turns you need only demonstrate ONE good enough move.)

You probably have this intuition as a Go player too - when analyzing a position, subvariations that can only result from multiple obvious mistakes by both sides are irrelevant to determining the overall result. So since at any given moment we are either entirely losing the branching factor from black, or entirely losing the branching factor from white, we roughly square-root the number of reachable positions that we care about (and good algorithms can obtain nearly a sqrt speedup as they go, even if they don't know for sure at the outset what moves will be optimal).

Now we can go further. In the context of neural nets, normally we also abandon the idea of rigorously solving the game, because with a neural net, even if it does learn to play optimally, you don't have much ability to rigorously prove that those particular millions of numeric parameters happen to encode an optimal player other than to just resort back to exhaustive search anyways. So rather than rigorous solution, we simply care that *in practice* it appears to be unbeatable vs anyone or anything. Or sometimes we care only about the lower standard of merely beating the best human players.

Then you can get another *vast* improvement beyond even the square root. If the game tree were absolutely arbitrary, with no correlation whatsoever between the outcomes of different moves, no repeated patterns or strategic regularities, etc, then square root via exhaustive search is the best you can do. But game trees for real life games are not arbitrary - they arise out of particular rules and mechanics, resulting in exponentially vast amounts of repeated patterns, correlations, regularities, strategic principles, etc. Just like humans, neural nets will learn these principles ("eyes", "good shape vs bad shape", etc) that generalize across vast numbers of positions, and then refine them and learn the exceptions, and then refine them further and learn the exceptions to the exceptions, and so on.

Again, this should be intuitive to you as a Go player. Consider a 5x5 board. The optimal result is black opening in the center and then owning the whole board. If we do the same ultra-crude estimation, we have sqrt(3^25) ~= 900k. Do you have to memorize the status of 900k completely distinct and independent things in order to always win the entire board on 5x5 as black? No, of course not. While fully exhaustive proof of black's win would require showing how black refutes :w2: A1, :w2: A2, :w2: A3,... all separately and exhaustively, and for each of them, how you refute every :w4: and then every :w6: and so on... if we only want optimal-in-practice rather than rigorous proof, we only need to read out or study about a limited tree of critical lines where White resists strongly, and for refuting all the other obviously bad moves like A1, A2,... it's enough to play according to simple strategic shapes and principles that *generalize* across all these exponentially many possibilities, and by even the most pessimistic ways of quantifying and counting this fuzzy value, there are going to be far fewer than 900k shapes and principles to learn.

I would guess with a good algorithm and architecture, you could get to optimal play on 5x5 from the opening position with a net that has somewhere between 100 and 5000 parameters. Which is a pretty massive improvement over sqrt(3^25), much less 3^25.

Anyways, the point is - the size of the state space or game tree is really more of an idle curiosity than a useful number, if all you care about is practical play. It's such an absurdly vast overestimate, and the improvements that neural nets can give you are so dependent on the nature of the game and the details of your algorithm and architecture, that even using it as a starting number from which you quantify what improvements you can get is not useful - you're better off ignoring it entirely and just looking at or trying small practical experiments to empirically measure what you care about directly.

Author:  Polama [ Tue Oct 19, 2021 7:38 am ]
Post subject:  Re: Neural networks optimising mapping functions

Even with linear activation functions, it doesn't have to follow that (A is a good move) + (B is a good move) = (playing A and B is good). At the input to the network, we enter the state of the game. We then multiply the representation of that state against various weights and sum into the next layers. It would be completely reasonable for a network to have learned "if I played at A, then the positions close by are less valuable now", which might include B.

Imagine an input neuron X, and then a complicated network, and then an output neuron Y. The value of X (1 or 0), gets multiplied by various weights and summed into other values and multiplied and summed and smears out through the network and eventually influences the value of Y. But since everything is linear, we can look at all the paths X took to get to Y, and sum them up, and get a function like Y = 5X - 8Z + 3W + ...

So essentially, a multi-layer network with linear weights ends up collapsing down into a simple linear equation anyways.
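This collapse is easy to verify directly (the layer sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # hypothetical layer-1 weights
W2 = rng.normal(size=(2, 4))   # hypothetical layer-2 weights
x = rng.normal(size=3)         # an arbitrary input

deep = W2 @ (W1 @ x)           # two linear layers applied in sequence...
collapsed = (W2 @ W1) @ x      # ...equal one layer with the product matrix

assert np.allclose(deep, collapsed)
```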

Once you add the nonlinearities, the paths the weight X takes are all complicated, and don't just sum directly into a final influence weight, so you can represent much more intricate decision boundaries.

Do the dimensions match? Is dimensions the right way to think about the problem?

Not really? First, the weights in a network are analogous to synapses in the brain. So although the "neuron" counts are getting close, it's not a meaningful comparison.
Second, you only need to represent the full dimensionality of your problem if your problem is pure noise. If every Go position was a 100% unique problem where you could apply absolutely nothing from your previous experience, you'd need enough capacity to memorize the right answer to every problem. In actuality, there are general concepts, and common shapes, and locality where a corner group is dead even if we change stones around in the far corner. So playing at some level X (even if X is perfect play) needs some tiny fraction of the space it would take to encode the right answer for every single possible position.

I still don't really understand the interaction between network architecture and ability to train to the right answer.

Nonlinearities make the math work. Empirically, the larger the network, the better it performs. The theoretical underpinnings of that aren't great, but it shows up in practice. If a network were just barely large enough to solve a problem, then only a very specific, small set of weights would work. If the network is far larger than needed, then there are lots of ways to parametrize it to solve the problem, and so it's "easier" to train. Why this continues to matter so much at ever-increasing sizes is harder to explain.

The network is also the way you connect up the layers. This usually encodes some human intuition about a problem, like: "if I rotate a picture, it should still be the same picture" or "it should be easy for a network to throw away unneeded information as it processes the words in a sentence". You could think about information theory, and the complexity needed to represent various decision boundaries in a given architecture, but there's quite a bit of "throw stuff at the wall and see what sticks".

What I really want to see is some relation between the network structure (e.g. neurons in a bottleneck) with a small number of key concepts (human or otherwise) that we can measure and interpret with some kind of symbolic logic/arithmetic. But I don't know if that is reasonable or what that would look like or how to make progress.

Explainable AI, and connecting symbolic logic with deep learning, are major outstanding challenges in the field. People try lots of things (like looking for statistical correlations with individual neurons firing and some feature of the board, like the presence of a dead group), but it's really still early days. So if you care enough to dig to the cutting edge, you can advance human knowledge! But on the flip side, it's not the sort of question where there's ready answers.

Author:  lightvector [ Tue Oct 19, 2021 9:15 am ]
Post subject:  Re: Neural networks optimising mapping functions

Yep, that's a pretty nice overview. One quibble though:

Even with linear activation functions, it doesn't have to follow that (A is a good move) + (B is a good move) = (playing A and B is good). At the input to the network, we enter the state of the game. We then multiply the representation of that state against various weights and sum into the next layers. It would be completely reasonable for a network to have learned "if I played at A, then the positions close by are less valuable now", which might include B.

If you are talking about the "policy" function (i.e. predicting where a next good move is), then sure, you could have linear weights such that a stone at either of A or B reduces the policy output value at the other location for it being the next move. Each location can have an initial positive value for being empty, and a negative weight for the presence of a stone at the other location. But dhu163 didn't use the phrase "good move"; he instead talked about whether having one or more stones is better or worse for a player, which is easier to interpret as a statement about the value function.

And for the value function, dhu163 is completely right. Suppose we have board states X, and A (which is X + one stone) and B (which is X plus a different stone) and C (which is X plus both of the stones from A and B), and suppose we use a normal input encoding consisting of the presence/absence of a stone of each color on each location of the board, such that when encoded as vectors/tensors, C = X + (A-X) + (B-X)

Then if the value function f is linear, we have f(C) - f(X) = f(X + (A-X) + (B-X)) - f(X) = f(A-X) + f(B-X) = (f(A) - f(X)) + (f(B) - f(X)). In other words, the amount of value by which C is better than X is precisely the sum of the amount that A is better than X and the amount that B is better than X.
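This identity can be checked numerically for a random linear f (the encoding below is a toy 6-bit board, not a real Go input encoding):

```python
import numpy as np

rng = np.random.default_rng(1)
v = rng.normal(size=6)               # a random linear value function f(s) = v . s
f = lambda s: np.dot(v, s)

X = np.array([0, 1, 0, 1, 0, 0], dtype=float)   # toy base position
A = X + np.array([1, 0, 0, 0, 0, 0])            # X plus one stone
B = X + np.array([0, 0, 0, 0, 1, 0])            # X plus a different stone
C = X + (A - X) + (B - X)                       # X plus both stones

lhs = f(C) - f(X)                      # value gained by adding both stones
rhs = (f(A) - f(X)) + (f(B) - f(X))    # sum of the two individual gains
assert np.isclose(lhs, rhs)
```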

Author:  Polama [ Thu Oct 21, 2021 12:03 pm ]
Post subject:  Re: Neural networks optimising mapping functions

Excellent point lightvector, you are correct that a linear value network won't represent the idea that stones can be good or bad in combination with each other, and that that's closer to dhu163's point. In essence we're saying A XOR B is good (exclusive or), and one of the key observations during the development of neural networks was that linear functions can't represent XOR.
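Both halves of this observation can be checked in a few lines: the best linear fit to XOR is stuck at 0.5 for every input, while one hidden ReLU layer represents it exactly (using the identity XOR(x1, x2) = relu(x1 + x2) - 2 * relu(x1 + x2 - 1)):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])    # XOR truth table

# best linear fit w . x + b, by least squares: 0.5 everywhere
A = np.hstack([X, np.ones((4, 1))])   # append a bias column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(A @ coef, 3))          # [0.5 0.5 0.5 0.5]

# one hidden ReLU layer gets XOR exactly:
# XOR(x1, x2) = relu(x1 + x2) - 2 * relu(x1 + x2 - 1)
h = np.stack([relu(X.sum(1)), relu(X.sum(1) - 1)], axis=1)
print(h @ np.array([1.0, -2.0]))      # [0. 1. 1. 0.]
```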

Author:  dhu163 [ Fri Oct 22, 2021 6:07 am ]
Post subject:  Re: Neural networks optimising mapping functions

Thanks for all the explanations.

Author:  dhu163 [ Tue Dec 28, 2021 4:50 pm ]
Post subject:  Re: Neural networks optimising mapping functions

I would like to note:

If no bias is introduced, then with n input bits and only RELU activation, the all-zero input is always mapped to zero output in every neuron. And I think it turns out that however many hidden neurons you have, you always get an output where the cut-off hyperplanes go through zero. Any two such planes only intersect at zero, so the gradient at x only depends on the ray from the origin (0 to x to infinity). But to produce such a neuron, only one hidden layer is necessary. That is, a sequence of nets each with only one hidden layer can produce in the limit any output neuron that can be produced by an arbitrary number of hidden layers.

I don't know what happens with a bias, but I assume that changes things completely so that it is completely general. I can see you can create any function if you can create a delta function around zero (and then shift it with a bias), and something close to a delta function can be created in the limit in the 2nd hidden layer with only 2n offset neurons in the first hidden layer. If correct, it seems that theoretically, only 2 hidden layers are required to produce any neuron you want. I have oversimplified, but perhaps this is not far from the truth, needs more thought.

edit: (more thought)
THM: with RELU, we cannot create an arbitrary neuron with just 1 hidden layer
hmm, actually the cut-off (n-1)-plane includes an (n-2)-plane at infinity and so do all the elements close to that. But the value at (i.e. in the limit) each point on the (n-2) plane is either infinity (it is in the non-zero region of a contributing neuron with cut-off plane not parallel to it) or it is the sum of the finite contributions from neurons with cut-off plane parallel to its plane. Hence for the limit of an output neuron to be zero far enough away, i.e. tending to zero at infinity in every direction would imply that the sum of finite contributions from neurons with cut-off plane parallel to its plane is zero for every such (n-1)-plane. However, the output value at a point in the n-space is the sum of these contributions from all (n-1) planes that go through it, which is therefore zero.

(in the above replace zero with non-positive as required if considering activation in the 2nd layer)

Hence we have proved that any non-zero output neuron in the 2nd hidden layer cannot tend to zero at infinity in all directions. QED (hopefully).

This could perhaps be proved by inversion, consider 1/(output). I suspect this to be true of any activation function that produces a kind of 1d result, but it might merely be RELU?

Note that we can certainly make a bounded non-zero neuron by placing two neurons with parallel cut-offs but different sign next to each other. Call these pipe neurons.

This result implies we cannot create any general function with just 1 hidden layer. And yet we clearly can with 2 hidden layers, by adding two pipe neurons, each only non-zero around the zero point of their own (n-1) planes, but with non-parallel (n-1) planes. Then this only reaches its highest point where the two (n-1) planes intersect and the rest can be discarded and set to zero with a bias. By taking multiples, we can make the highest point arbitrarily high, and can also make it nonzero only in an arbitrarily small n-volume around the point we want.

THM: we can create any delta function in the limit with 1 hidden layer of RELU
THM: we can create any general non-negative function in the limit with 2 hidden layers of RELU. We can create any general function in the limit with 2 hidden layers of RELU if no activation function is applied to the output neuron.

THM: if there is only one input neuron, we can create any general non-negative function with 1 hidden layer of RELU (exercise for reader)
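A sketch of the 1-input case (the knots and the sample target are arbitrary choices, not from any theorem above): placing one ReLU kink at each knot plus a bias spans all piecewise-linear functions with kinks at those points, so the target is reproduced exactly at every knot:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# 1 input neuron, 1 hidden RELU layer with a bias:
# f(x) = sum_k a_k * relu(x - t_k) + c
knots = np.linspace(0.0, 1.0, 11)                   # arbitrary kink points t_k
target = lambda x: np.maximum(np.sin(5 * x), 0.0)   # a sample non-negative goal

# hidden-layer outputs at the knots, plus a constant bias column; this basis
# spans all piecewise-linear functions with kinks at the chosen points
H = np.hstack([relu(knots[:, None] - knots[None, :-1]), np.ones((11, 1))])
a, *_ = np.linalg.lstsq(H, target(knots), rcond=None)

fit = H @ a
assert np.allclose(fit, target(knots), atol=1e-8)   # exact at every knot
```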

Does this matter for training?
Probably not much, since we normally only care about the output on a small box and not at infinity, but we do care that there are holes in function space that can't be mapped.

But even then, I think the metric/topology of neural nets, and how different activation functions/architectures influence training, is surely still an interesting topic of study, because it helps map the net to what you want to use it for. It is also very hard (but non-equilibrium thermodynamics seems hard too, has been a hot topic for a while, and has produced some results).

For example:
  1. What is the effect on training of having a large net with a hidden layer with a number of neurons much less than input and output? Does this slow down evolution to match that layer?
  2. Does it help to think of a net in entropy or energy terms with forces at the input as well as forces at the output, propagating through the net?
  3. What do we optimise? Training time, network size, accuracy etc?
  4. For a given accuracy, what architecture minimises (number of neurons) * (training time)?
    1. Given just two hidden layers of arbitrary size and a bias, how well does a net train? Does it always converge to global optima?
    2. My intuition is that with 1 input neuron and 1 output neuron and 1 hidden layer of arbitrary size initialised randomly, and sufficiently small increment and say a weighted least squares cost function, then given any positive function that is the "goal" to match (on a bounded region), this net should train to perfection. My intuition says what the output orders is what it gets, and all orders can be met.
      I wanted to say that least squares cost means that weights being equal, there is greatest pressure on neurons in the layer before to be a multiple of the ideal output function, but this is just false.

      Well, since we are using gradient descent, if we use a sufficiently large batch and small enough increment, the cost has vanishingly small chance of increasing. Hence, the goal is to show that there are no local optima except when the output matches the goal. This depends on the hidden layer having a sufficiently random set of weights and for it to stay that way during training.

      We can see that local optima will occur if gradient is zero (so all hidden layer neurons are orthogonal to the difference between output and ideal after applying a filter to remove where either is zero) and Hessian is positive definite (saddles are possible but vanishingly low probability so ignore for now).

      Perhaps the asymmetry that priority is given to neurons with higher weights can be beneficial here to avoid symmetry and maintain randomness.
  5. With randomised training, which local optima are reached more often? How do we optimise training to allow us to reach a good local optima most of the time?

Jokes, but I can't help thinking of a neural net as a poor life-form that is made part of a circuit every time we train it. Backpropagation is a Huxley-style electric-shock therapy to train it to act as we wish. We add a connection to the input and the output and hit the power on.

The only difference perhaps to humans is that human body architecture has a lot more inputs and outputs than engineers can access as well as internal complexity (not repeating patterns) and was "designed" for different purposes, so we can't alter the internals so easily.

Author:  dhu163 [ Wed Dec 29, 2021 8:49 am ]
Post subject:  Re: Neural networks optimising mapping functions

Some TensorFlow experiments initially seemed to show I am wrong about convergence with 1 input and 1 hidden layer. For example, I tried an ideal function of sin(5x) and the net still mapped 0 to 1. It didn't have much success mapping near x=0 or x=1, though it was fairly accurate in the middle.
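A minimal pure-Python sketch of the experiment above, assuming ReLU activation, a full-batch least-squares loss and plain gradient descent (the original used TensorFlow, and the layer width, learning rate and step count here are made up):

```python
import math
import random

random.seed(0)

# 1 input -> hidden ReLU layer of H neurons -> 1 linear output,
# trained by full-batch gradient descent against sin(5x) on [0, 1].
H, LR, STEPS = 20, 0.02, 1000
w1 = [random.uniform(-1, 1) for _ in range(H)]   # input -> hidden
b1 = [random.uniform(-1, 1) for _ in range(H)]
w2 = [random.uniform(-1, 1) for _ in range(H)]   # hidden -> output
b2 = 0.0

xs = [i / 50 for i in range(51)]                 # grid on [0, 1]
ys = [math.sin(5 * x) for x in xs]               # the "ideal" function

def forward(x):
    h = [max(0.0, w1[j] * x + b1[j]) for j in range(H)]
    return h, sum(w2[j] * h[j] for j in range(H)) + b2

def mse():
    return sum((forward(x)[1] - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

loss_before = mse()
for _ in range(STEPS):
    g1, gb1, g2 = [0.0] * H, [0.0] * H, [0.0] * H
    gb2 = 0.0
    for x, y in zip(xs, ys):
        h, out = forward(x)
        d = 2 * (out - y) / len(xs)              # d(loss)/d(out)
        gb2 += d
        for j in range(H):
            g2[j] += d * h[j]
            if h[j] > 0:                          # ReLU gate is open
                g1[j] += d * w2[j] * x
                gb1[j] += d * w2[j]
    for j in range(H):
        w1[j] -= LR * g1[j]
        b1[j] -= LR * gb1[j]
        w2[j] -= LR * g2[j]
    b2 -= LR * gb2

loss_after = mse()
print(loss_before, loss_after)                   # loss should drop
```

Whether this run reproduces the failure near x=0 depends on the initialisation; the point is only that the setup is easy to replay.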

My guess is that a sort of prisoner's dilemma is occurring: two hidden-layer neurons have to work together to produce the initial kink, but each such neuron on its own is close to orthogonal to the error term. Alternatively, we can say that a sum of at least two neurons is required, but training causes the coefficient of one of them to decrease, which is the wrong direction.

This is presumably where a 2nd hidden layer helps a lot, by encouraging teamwork like a team captain. A caveat is that competition between team captains for the team members' attention can cause trouble if they pull in opposing directions, since that makes the neuron freeze.

Again, bad metaphors, but I can't help but wonder if forcing teamwork is part of the reason sexual reproduction has had more visible success than asexual reproduction.

My experiments suggest that given n neurons per layer, there is a minimum number of hidden layers h that works well:

n      h
<4     h<10: an RMS error of 0.2 is possible, but the net can't fit the initial kink and maps 0 to 1; h>10: struggled to reduce the RMS error below 0.6 even with many layers. I think there may be a vanishing-gradient issue with more layers, as the input area gets "colder": the net doesn't notice what the input even is and just outputs a constant value. Is this akin to temperature?
5      8
30     5 (6 converges faster)
100    3
10000  1 (with 5 times more rounds of training data), but even then the RMS error was still 0.08, several orders of magnitude worse than the other tries.

I think that more neurons per layer always improves accuracy (with the same training data and a small enough increment), but more layers can make things worse, especially with few neurons per layer.

My explanation is:
The net struggles most when the error function is orthogonal to the neurons in the previous layer, though even this can be offset by changes to those neurons. When there are 10,000 neurons, you would expect the error can't be orthogonal to all of them, even after the chaos of training. So I think the net also struggles when the error is of the form n1+n2+c, where n1, n2 are neurons in the previous layer but, due to the c term, the inner product with n1 and with n2 is negative.

This is true for my problem above with 1 layer. The offset required to fix the problem around x=0 is a sum of neurons like ideal(x)-model(x) = (ReLU(x)-ReLU(x-b))/b - 1 with b ~ 0.3. However, the inner product of this with ReLU(x) over [0,1] works out to -b^2/6 < 0.
Since the gradient looks like (inner product)*(weights), this reduces the contribution from ReLU(x), which can't be good.

The inner product with ReLU(x-b) is zero, since each is non-zero at different points (they have disjoint supports).
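A numeric check of these two inner products on [0,1], with b = 0.3 (the grid resolution is an arbitrary choice):

```python
# The error term (ReLU(x) - ReLU(x-b))/b - 1 and its inner products
# with ReLU(x) and ReLU(x-b), approximated by a midpoint sum on [0, 1].
relu = lambda t: max(0.0, t)
b = 0.3
err = lambda x: (relu(x) - relu(x - b)) / b - 1.0   # ideal(x) - model(x)

N = 20000
xs = [(i + 0.5) / N for i in range(N)]              # midpoint rule on [0, 1]
ip = lambda g: sum(err(x) * g(x) for x in xs) / N

ip_relu_x = ip(lambda x: relu(x))        # comes out negative
ip_relu_xb = ip(lambda x: relu(x - b))   # zero: the supports are disjoint
print(ip_relu_x, ip_relu_xb)
```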

How does another hidden layer greatly reduce the problem? It makes it more likely that a neuron of the form n1+n2 is already floating around, at which point it plugs nicely into the error term, with maximal covariance (= inner product / norms). The connections also mean that such a neuron will readjust its coefficients to be more like n1+n2 if necessary.

Author:  dhu163 [ Wed Dec 29, 2021 11:54 am ]
Post subject:  Re: Neural networks optimising mapping functions

Back to

Is it useful to apply "energy/entropy" to neural net training?

I try to compare some metaphors/analogies to familiar physics. Please note this is merely an outsider's perspective.

Firstly, some popular-science-style thoughts on life/earth energy/entropy. Slightly scattered.

Equilibrium entropy is much easier to understand, but it assumes no interaction with the environment.

On Earth, we have a periodic energy source of sunlight which drives activity on Earth, as well as heat radiation back into space. Other than conservation of energy and momentum (connected to the space-time symmetries of physics by Noether's theorem), we don't have many other key laws. Momentum conservation suggests more heat will be radiated from Earth away from the Sun than towards the Sun, but the Earth itself could also absorb the momentum by shifting its orbit slightly, so we can't be sure.

Now, entropy theory tends to suggest that all else equal, energy will spread to all possible dimensions evenly. My interpretation is that this suggests that even with a non-equilibrium source of momentum (sunlight), there will be a natural balance between how much sunlight is reflected back to the Sun, how much is absorbed on Earth, and how much is reemitted in various directions. Quantum physics can study how special "potentials" interact and solve equations in extreme/ideal limits (complete reflection, zero reflection etc.), showing how to design equipment that has such properties, as well as helping intuition interpolate for what happens in less extreme conditions.

Life is part of this cycle as the Sun's energy goes into plants, heating the ocean and land, turning CO2 into O2 etc. Animals add equilibrium with their job of storing and recycling this excess O2 back into CO2, releasing heat in the process. Equilibrium thermodynamics would call us part of the dissipation system for the excess energy. A stronger understanding of a periodic (and essentially fixed) energy source would probably treat the whole system as a series of transitions of structures that enable and react to the imbalance caused by the low-entropy property that the Sun has so much more energy than the Earth. Perhaps a better understanding of the time arrow of entropy and memory would give us even more insight.

Why does life seem to contradict the 2nd law of thermodynamics that says energy becomes more even over time? The 2nd law suggests things become more chaotic and random, which normally is viewed as less repeating patterns.

Well, the duplication property of DNA seems strange, since there is some imbalance between the development of increasingly complex instructions and the 2nd law. Dawkins has compared life to crystal (non-living) growth. My interpretation is that sometimes repeating patterns, such as in a solid, are the lowest-energy forms of that material (when like attracts like etc.), but it very much depends on the environment. Knocking a bottle of super-cooled water can cause it to freeze because it was already cold but simply hadn't connected into crystal form yet; once a small part of it forms into ice due to random chance (caused by heat from the knock), other molecules slot into place naturally (and the kinetic energy of the liquid molecules goes into vibrations of the ice as heat/sound that escapes through the bottle into the environment). Probably there are many similar patterns that we don't have the senses to notice unless we pay careful attention, and I'm sure DNA is far from the only way hereditary complex instructions are passed down, even though we are most aware of its long and successful history. Also, extinctions show how fragile life is, so we shouldn't expect a never-ending increase in complexity. It depends on which dimensions you measure in, and that is biased by how you are designed and trained.

Humans are small enough compared to the Sun that you can probably expect something to come of the Sun's energy, and since the time scale of the Sun's life-cycle is billions of years, the energy source is unlikely to change much; at the human scale, entropy effects on the Sun are not noticeable without a long time-scale of data.

However, the origin and nature of our life/intelligence, and potential other life-forms, remain very puzzling, as does how nature has guided humanity to technology and population leaps over the last few hundred years. Can we interpret this situation as having been set up for us millennia ago? What are the driving factors? Access to oil, electricity, global warming, globalisation, trade/information networks?

Entropy in a computer
A computer is a machine carefully set up with an essentially infinite energy source so that it responds (like lego) to tiny button presses that flip a bit. I suppose that, other than the designer and manufacturer of the parts (including how nature set up the silicon states), the main low-entropy source here comes from the user/programmer and the choice of programs they run. We say the user has a motivation/intent/goal, so that even if the computer has issues/errors, the user works through them to get what they want. This includes encouraging the manufacturer to make longer-lasting designs which protect key components from environmental interference.

In this way, the entropy that really counts (when energy is infinite) is information. In a neural net, this comes from the input/output. But I don't really understand how to measure this, or if it even makes sense globally. Probably some kind of basis is needed.

Energy/power in a neural net
Training a net takes work. This matters physically, since we need to change the weights and compute the network's output. But in an abstract sense, we can perhaps say that the gradient descent method means that when we train, we apply a certain "power" that shifts the weights in proportion to the gradient, and this flows in one direction (guided by the method, not by a natural force field).

Since gradient descent means the change in weights per step (steps are like time, so this is like a velocity) is proportional to the gradient of the loss function (over the weights), this is different from having acceleration proportional to the gradient (of a potential energy field). So we don't exactly have energy conservation. We can perhaps think of it as time going faster when the gradient is higher, while low-weight connections are in hibernation. Perhaps it is slightly like the opposite of being sucked into a black hole: as the weights spiral into a local minimum, the step size gets vanishingly small, so they never reach the minimum.
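The "never reaches the minimum" behaviour is easy to see on a 1-d quadratic well (a toy example of my own, not from the experiments above):

```python
# Gradient descent on the 1-d quadratic well L(w) = w^2: the step is
# proportional to the gradient (a "velocity"), so w decays geometrically
# and never actually reaches the minimum at w = 0.
lr = 0.1
w = 1.0
trajectory = [w]
for _ in range(100):
    w -= lr * 2 * w          # gradient of w^2 is 2w, so w *= (1 - 2*lr)
    trajectory.append(w)

print(trajectory[:3])         # the start of the geometric decay
```

Each step shrinks w by the constant factor (1 - 2*lr), so w stays positive forever while getting arbitrarily close to 0.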

It isn't energy flow as such since the activity is dominated by a driving mechanism that brings in enough energy as needed.

Loss satisfies a heat equation
The closest thing I can think of that gradient descent on a quadratic well satisfies is the heat equation, which says velocity is proportional to difference: the (weight-space) velocity is proportional to the difference in loss at nearby points. This differs from the standard picture, where the temperature change at a point is proportional to the difference in temperature at nearby points. In our picture, the loss as a function of the weights doesn't change; instead the position in weight space changes. In some sense, the loss is absorbed by the weights.

Entropy is still relevant in this picture since we can treat the weight space as our (possibly infinite) state space with various deterministic trajectories during training. However, given initialisation is random, if many trajectories die or otherwise end up at local optima, entropy can help explain why.

Circuit analogy of a neural net
Well, a neural net is a bit like a circuit diagram, though flow is discrete and only occurs in one direction at a time. One end (the input) is held fixed, and the other nodes must balance the "voltage difference" between the output and the input by shifting their connections accordingly.

If we talk about training "power", is there a corresponding current and resistance?

Supply/demand in a neural net
Perhaps it is more like economic flow?

If neurons in successive layers can get more and more complex, we can think of a net as "mining" and "refining" the natural resource of functions of various shapes trying to meet demand as best as possible. More layers allow more refinement. More neurons in a layer allow more types of resources and hence more useful combinations. Too many layers means there is a disconnect between demand and supply.

During backpropagation (of the gradient during training), a neuron strengthens/weakens connections to its inputs based on the feedback it gets from its "client" neurons (those that take its output) AND gives feedback on what it demands to its input neurons. Sometimes this demand can be met; sometimes it can't. If the weights are randomly initialised, this randomness is a source of all sorts of material, but with more training, input neurons can become stale, clump together and become hard to change. The most obvious case is when all input neurons tend towards the same function, which reduces the dimension to 1, and the symmetry of training may mean this situation continues forever. So it is possible for resources to dry up, though it is hard to explicitly describe them or measure which have and haven't. (It depends on the training process.)

In a way, we can view backpropagation as client neurons paying their input neurons in gradient tokens, which those neurons can then use to pay their own input neurons. However, these tokens are a bit weird in that their magnitude matters most (positive or negative), and they are converted by a multiplier on the way to the next neuron along.

The most connected neurons are most in demand and are under the greatest pressure when a change (say in the ideal function) occurs: gradient flow is highest through them. This can break them if the learning rate of the optimiser (i.e. the step size) is too high, and they would only be in such demand if they were useful, so breaking them is the worst case. Smaller neurons shift more slowly and are ignored by the controller, but they may gradually rise later too.

NNs respond to the slightest training, and bad training can break them.

Gradient vanishing
In Go and other combinatorial games, to solve the game tree we can in theory slowly work backwards from the endgame with little reference to the opening. In a neural network, however, optimising the weights must happen by coordinating between input and output simultaneously. If the gradient vanishes, the flow of communication is blocked and no more change can occur. We only want this to happen when the loss is very low, not at a bad local optimum.

We get congestion when all input neurons are orthogonal to the gradient, or when a non-zero number have negative covariance with it even though their contribution needs to increase (or vice versa), since their covariance with the contribution needed from the other input neurons counters their own. One idea is to unblock this by adding a new random connection at that point: a new connection (perhaps a new neuron) that connects backwards many layers, to a more general state. I understand this sort of thing has been tried, and "block" networks are a version of it.

Like surgery, we don't want to do more damage than good. Letting the net sort itself out is normally automatic and requires the least care. Some neurons will go extinct if other neurons match the result so perfectly that they no longer contribute. But that is helpful for trimming the size of the net.

Contribution of a neuron
How do you measure the contribution of a neuron to success in the output? The gradient gives a nice measure of this, and there must be some balance equations here. Some neurons go to sleep with some inputs, just like day/night.

For example, the chain rule offers a balance equation that says (contribution from mth layer)+(contribution from weights between mth and (m+1)th layers) = (contribution from (m+1)th layer)

But perhaps a better formula for a neuron's contribution is the sum of the contributions from all the weights connected to it (what happens if they are zeroed out?).
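A toy check of these two notions of contribution, using a tiny fixed 2-layer ReLU net whose weights are made up for illustration: the backpropagated (chain-rule) gradient of one output weight agrees with a finite difference, and zeroing a neuron's output weight gives a second, cruder measure of what that neuron contributes.

```python
import math

relu = lambda t: max(0.0, t)

w1 = [0.9, -0.4, 0.6]            # input -> hidden (3 neurons), made up
b1 = [0.1, 0.5, -0.2]
w2 = [0.7, -0.3, 0.8]            # hidden -> output, made up

xs = [0.1 * i for i in range(11)]
target = lambda x: math.sin(3 * x)

def loss(w2_):
    total = 0.0
    for x in xs:
        h = [relu(a * x + b) for a, b in zip(w1, b1)]
        out = sum(v * hj for v, hj in zip(w2_, h))
        total += (out - target(x)) ** 2
    return total / len(xs)

# backpropagated gradient of the loss wrt w2[0] (chain rule)
g_bp = 0.0
for x in xs:
    h = [relu(a * x + b) for a, b in zip(w1, b1)]
    out = sum(v * hj for v, hj in zip(w2, h))
    g_bp += 2 * (out - target(x)) * h[0] / len(xs)

# the same gradient by a finite difference
eps = 1e-6
fd = (loss([w2[0] + eps, w2[1], w2[2]]) - loss(w2)) / eps

# contribution of hidden neuron 0, measured by zeroing its output weight
zero_contrib = loss([0.0, w2[1], w2[2]]) - loss(w2)
print(g_bp, fd, zero_contrib)
```

Note the two measures need not agree in sign: the gradient is local, while zeroing out is a finite change.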

ReLU vs Sigmoid
My impression:
ReLU is flexible in creating tools you need from scratch even if they can't be interpreted. It is simple nuts and bolts with decision boundaries.
Sigmoid has more of a probability (distribution) or yes/no flavour and might be more easily interpreted. It is more logic based with pretty maths.

What happens at a bottleneck in a neural net?

Author:  dhu163 [ Wed Dec 29, 2021 12:35 pm ]
Post subject:  Re: Neural networks optimising mapping functions

Architecture: 1 neuron per layer, 1 input, 1 output
THM: Suppose an NN has 1 neuron per layer with ReLU activation as well as a bias input. The set of possible outputs consists of functions f which are linear on some (possibly infinite) interval (x1,x2) and constant elsewhere.
Proof: If the input to the ReLU is of this form f, so the output is ReLU(wf+b), then as ReLU has derivative zero or one, the chain rule implies that the output is constant except on the intersection of the input's linear interval (x11, x12) with the region where the ReLU's input wf+b is positive. On this intersection the output has gradient w times the non-zero gradient of the input there. As multiplication by w and addition of b are monotonic, this intersection must also be an interval; in fact it must have at least one endpoint equal to x11 or x12. QED
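A concrete instance of the theorem, with two 1-neuron ReLU layers and hand-picked weights: the composite is linear on (0, 0.6) and constant on either side.

```python
relu = lambda t: max(0.0, t)

# layer 1: ReLU(1.0*x + 0.0); layer 2: ReLU(-0.5*h + 0.3)
f = lambda x: relu(-0.5 * relu(1.0 * x + 0.0) + 0.3)

# constant 0.3 for x <= 0, slope -0.5 on (0, 0.6), constant 0 for x >= 0.6
print(f(-0.5), f(0.2), f(1.0))
```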

We can see that the dependence of such an output on the weights remains non-zero for all weights only on the (possibly empty) linear interval. In the constant region near the interval, it might have non-zero dependence on the weight immediately before, but likely has zero dependence on the weights before that.

It is plain that
THM: Suppose an NN has 1 neuron per layer with ReLU activation, and we randomly initialise the weights (in a bounded interval, WLOG [-1,1]). As the number of layers n increases, the probability that the entire output is constant tends towards 1.

The point is that so many filters have been applied that in total they likely obscure the entire input. The power of entropy is too high here compared to the driving motor.

In contrast, if the filters are all aligned so they don't obscure, then they are also not very useful as layers, and their job is merely to carry the input from the previous layer to the next layer. Preservation, not computation.

Conjecture: Suppose we randomly initialise weights (in a bounded interval, WLOG [-1,1]). If we fix each hidden layer to have k neurons, then as the number of layers n increases, the probability that the entire output is constant tends towards 1.

Some quantifying results

Suppose an NN has 1 neuron per layer with ReLU activation. The outputs of such a net are bounded between a minimum and maximum [l,h] (which may be infinite). As ReLU is non-negative, we have h>=l>=0. After another layer of ReLU with random weights initialised uniformly in [-1,1], the interval [l,h] maps to one with endpoints wl+b and wh+b. So if w,b<0, which happens with probability 1/4, both endpoints are negative, hence the whole interval maps to something negative, which is mapped to zero by ReLU. So the output will be constant.

This implies that after n layers, the probability that the output is constant is at least 1-(3/4)^n, which tends to 1 as required. Even if w and b aren't both negative, the probability that the interval shrinks is very high, and this is another way the output tends towards a constant.
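A Monte Carlo sketch of this bound (the layer count, grid and trial count are arbitrary choices): count how often a chain of n random layers x -> ReLU(wx+b), with w and b uniform in [-1,1], is constant on a grid over [0,1]. Whenever some layer has both w<0 and b<0, the output is provably constant from that layer on, which is exactly the 1-(3/4)^n argument.

```python
import random

random.seed(1)

relu = lambda t: max(0.0, t)
n, trials = 10, 2000
grid = [i / 50 for i in range(51)]               # inputs in [0, 1]

constant = 0   # chains whose output is (numerically) constant on the grid
flagged = 0    # chains with some layer w<0 and b<0: provably constant
for _ in range(trials):
    ws = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(n)]
    if any(w < 0 and b < 0 for w, b in ws):
        flagged += 1
    outs = []
    for x in grid:
        h = x
        for w, b in ws:
            h = relu(w * h + b)
        outs.append(h)
    if max(outs) - min(outs) < 1e-12:
        constant += 1

print(flagged / trials, constant / trials)   # lower bound vs observed rate
```

The observed constant rate sits above the flagged rate, as expected: shrinking intervals kill off the remaining chains too.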

THM: The probability the output is unbounded (so h=infinity) is 2^(-n).
Proof: This occurs if and only if every weight w is positive.

The conjecture is similarly true by basic Markov chain theory. Explicitly:
With k neurons per layer, the probability that a given neuron in the next layer is dead is at least the probability that all its relevant weights (including the bias) are negative, which is 2^(-k-1). The probability that at least one of the k neurons is dead is higher (close to k*2^(-k-1)), and similarly the probability that all k neurons are dead is non-zero, around 2^(-k(k+1)).

The probability of the entire layer being constant after n rounds is at least 1-(1-2^(-k(k+1)))^n. Again, this tends (very slowly) to 1.

In practice
Although we are only considering initialisation, and training can move the weights out of [-1,1], this analysis remains relevant for entropy purposes. If there are a lot of states with dead neurons hanging around, the vanishing-gradient problem will be significant, and training is likely to be slower or even to stall.

However, please note that while a network with all weights zero is completely dead and unresponsive to training, as long as there is any non-zero weight connected to the output neuron, the network can be revived: the bias neurons are non-zero, so connections can slowly be made again (after a cold winter).

Entropy and non-equilibrium
If we continue the analogy to supercooled water freezing at a knock, this seems like a situation where we have a fairly flat set of nearby energy states (many liquid molecules running around bumping into each other), which probably has higher entropy (more possible states) than the fixed frozen state (where some dimensions of kinetic energy are dropped). However, there is a vortex at an energy level just above, which leads down a well, at the bottom of which is the fully frozen state, with lower energy than the liquid. Reaching the vortex requires some energy input first.

Time for a thought experiment. Consider dropping a stone into a pond. This is clearly a non-equilibrium situation but why? How does the entropy increase after normal physics is followed?

My interpretation.

There are forces and energy exchange even in (dynamic) equilibrium states. For example, the pond has lively liquid molecules running around and bumping into each other. However, all such molecules are identical, so many of these states are indistinguishable and overall they all seem similar with our measuring ability, even if we can see some waves etc. There may be salt dissolved in the water, but at equilibrium it should be uniformly spread throughout the pond, perhaps slightly concentrated below the surface (due to higher density). The boundaries of the pond are hardly at equilibrium, since a (dead) planet at its lowest energy state should have layers organised according to density, and stone and water shouldn't be at the same level. This hasn't happened, partially because "work" and energy input are required to mix and move them, and the water doesn't have enough energy to manage that so quickly (but don't underestimate its power), and partially because weather cycles/rain mean water fills underground caverns faster than they cave in.

So the stone drops into the lake. This is very low entropy, since there is so much gravitational potential energy difference between the stone and the water/air. In addition, stone is not the same as water, so there is some mixing/dissolving tendency. Gravity tends to put like densities together, so we expect the stone to settle at the bottom, which is a much lower energy state. The quickest energy conversion is gravitational potential energy into kinetic energy of the stone, then of the water, then of the bottom upon impact. It is a much lower energy state to have water above stone. However, the energy dissipates in waves and ripples, taking time to spread out. It is as if the energy has a momentum of its own that is carried away until blocked by something (i.e. a shockwave). It is not clear to me (I didn't study fluids) how much energy is absorbed as permanent kinetic energy (heat) of the water locally and how much gets radiated outwards (even if the wave doesn't increase energy locally due to friction, the pressure still reduces as it spreads out over 2 or 3 dimensions, i.e. inverse or inverse-square or something in between).

Over time, this wave might be reflected back by the boundaries, and the pond should even out in temperature again and flatten. Small effects may mean that a heavier stone is covered by sediment and less likely to rise during torrential flooding etc. This will make the densities align further, even if very slowly.

There should be a way to measure the timescale at which energy transitions in non-equilibrium situations occur based on the interface between energy states. I haven't researched to check what work has been done.

We don't experience much of the energy transition ourselves as we watch, but we can appreciate some of it, such as the sight of the motion of a falling object (perhaps some excited emotions as we anticipate its effect), as well as the sound of the splash and sight of some ripples. But most of the energy conversion barely reaches us and is barely noticeable.

In our neural network example, it seems like the loss function sends repeated shockwaves through the net, but they can be reduced and blocked by the zeroing out property of ReLU, or reductions in the loss function. If the neurons are orthogonal to the gradient, the waves just pass straight through the neuron with no effect, and get absorbed at the input/bias neurons.

Legacy functionality needs protection of vital parts to prevent rot
If training shifts over time, I'd expect the net to quickly adapt and forget valuable lessons it has learnt. With a pool of neural nets, I think that selecting from competition adds some selection bias towards protection of valuable and generalisable lessons. I'm not sure how this would appear in practice, but I imagine a series of filters that guard the output region of the network so that it is resistant to being changed by backpropagation except through specific channels (metaparameters?). These filters cover all parts of the gradient that aren't orthogonal to the neurons, meaning that forward computation is let through (so the network contributes functionality), but backpropagating gradient just scatters upon impact.

I suspect that it is this sort of defense that leads to local optima that are hard to escape from, since it isn't easy to upgrade the legacy functionality.

Author:  dhu163 [ Thu Dec 30, 2021 8:28 am ]
Post subject:  Re: Neural networks optimising mapping functions

I've just written up a few formulae trying to get conservation laws/inequalities during neural network training. I've got something I've called throughput, though unfortunately that is already an NN term. It is simple enough that I think people likely understand the theory quite well already, but I enjoy working things out myself anyway.

In the process, I've noticed that backpropagation produces wave effects that are quite akin to quantum double-slit effects.

Edit: 20220102
I suppose the multiplier could be thought of like a transformer for electricity. And my throughput seems a little like the Poynting vector, though that kind of assumes 3d space.

I've seen neural networks architecture diagrams used to model and prove the Euler-Lagrange (least action) equation. As well as logical formulae.

I suppose my summary is that neural networks are so general that their structure represents a lot about the way humans reason. There are discrete objects with links and fuzzy logic connecting them. Here "fuzzy" doesn't mean there isn't a simple equation connecting them, merely that it isn't boolean logic with 0s and 1s.

What ideas do neural networks struggle to represent?

I'm not really sure yet. But I can see that while neural networks with cycles can be designed (which can act as homeostatic machines whose stable states can be mapped consistently to info, akin to eigenvectors and eigenvalues), they don't seem easy to solve for/completely understand without running them, but I expect they demonstrate many cool qualities.

Also, they throw a lot of garbage out with their filters. It might not be so easy to understand the effects of random rounding errors or the dimensions of what information they are throwing away.

Classical info theory vs info theory needed for NNs
In Go, the rules set up a system; we hook it up to an NN training system and state information transfers across. It seems interesting to study the entropy, the blockers, and the most common information that is transferred.

Classical info theory is about discrete representations with a finite dimension of symbols (encoding when both input and output are finite dimensional and simple). With NNs we deal with infinite dimensions of info, but our neurons, weights and possible transforms are finite.

Interfaces, laws, and focusing on just a small aspect seem to be how humans have tackled such problems.

A note for myself
In many ways, the input acts as a root.

To aid intuition, think of a puppet fixed to the floor with strings pulling on its limbs. Parallel strings pull most taut and may even unravel (i.e. training to f(x)=x is normally very speedy unless there is some knot in the middle). Otherwise, even if one component unravels a lot (with large weight magnitude), getting very fat in contribution and bearing most of the burden, it still has connections that it relies on, and if those don't stretch far, it can't stretch far either.

Sketchy thoughts on NN/entropy
NNs adjacent in something like KLD? If they start similar, do they stay similar (and is that stable)?

In the initial training burst, the speed may be fast in order to minimise loss, but that can put the filters "into alignment", and it can be hard to untangle them again: the most relevant dimensions may be lost to entropy and filtering (much like the long thin ReLU net). "Healthy" info flows normally go through many routes simultaneously.

Otherwise, we can study how different inputs have different gradients. Poynting is cross product, but we have related values (like det) on the different functions (of different neurons in a layer).

If A=B, we expect gradient to be similar (but as output weights may be different, different things may happen), and shifts may be different due to different inputs, but the demand sent to inputs will be the same, just with different multipliers.

If A+B=C (in the same layer) and the input and output weights satisfy that equality too, then A+B should evolve exactly the same way as C, I think.

If so, to understand what will evolve the least, we just need to understand which linear combinations are zeroed out by neurons in the same layer (much like x<0 is zeroed by ReLU).

Following on from the above, a layer has a local degenerate symmetry if the input weights W1 and output weights W2 are such that vW1=0 and W2v=0 have a common non-zero solution v.

This symmetry might not be stable, but it will be locally maintained in the short term. It is interesting to work out when the left kernel of W1 and the right kernel of W2 come close to alignment and how often that occurs. In combination with a bias, it is also relevant.
