Neural networks optimising mapping functions

dhu163 · Post by **dhu163** » Sun Oct 17, 2021 6:10 am

This post is designed for a mathematical + comsci audience only. It is to demonstrate my understanding, and share some thoughts.

(edit: 2022/09/11 I am trying to write some of this up more formally. Disclaimer: much of the below was speculative musings without enough thought to back it up. Don't take any of it too seriously)

(edit: 2022/09/25 On the other hand, while some was nonsense, much really was seeing how NNs are quite a general discrete potential model, especially for energy dissipating through a crystal, whose fractal nature can even be used to model Go. I am more confident much can be written up now. Even some that I said was difficult to quantify, I now think can be quantified and measured. On the other hand, I still expect that almost all this is old news scientifically, though it is new to me. This thread now seems to be a demonstration that curious whimsical musings can turn into papers)

As far as I recall:
There are around 3^361 possible Go board position, but more like 10^170 according to Tromp.
The number of weights in a 40 block neural net based on Alphago (like katago) is around 10^8.
The number of neurons in a human brain is around 10^11.

Do the dimensions match? Is dimensions the right way to think about the problem?

For example given n input bits (e.g. to represent board positions), you want to output 1 number to represent the value of that position. The dimension of the function space is 2^n * 1 = 2^n.
However, does the number of weights in a neural network roughly correspond to dimensions. The key thing is that neural network outputs are not linear in their weights or inputs, so thinking of the functions as a vector space probably doesn't lead very far. Non-linearity is critical because otherwise linearity would imply that if a stone A helps player X, and a stone B helps player X, then having both stones A and B on the board will help player X. This is false, for example if A and B together is self-atari or suicide.

On the other hand, matrices still play a key role in neural networks. The linearity point is why I was surprised when I was first told by a Deepmind employee that the AI was based on matrices. However the study of matrices is very deep, with highly optimised (e.g. parallelised) algorithms to work with them - adding, multiplying etc.
It is also easier to think in terms of linear things (since we are familiar with adding up normal numbers and matrices are just collections of numbers), and the calculus formulae are much simpler to write down with matrices (merely the chain rule: dAx/dx = A).

Neural networks in 5 paragraphs:

If there are n input bits these n numbers are called the zeroth layer (with n input neurons). Each next layer consists of say m neurons, each of which is a function of the neurons in the previous layer. After a good number of layers (I think it is 80 in a 40block net?), the output layer consists of your output, e.g. the value and policy numbers (just 2 neurons). There is a weight (a real number) for every relation between a neuron in one layer and a neuron in the next indicating how they are correlated.

In image processing, CNNs (which alphago uses), not all possible neurons are linked between neighbouring layers. Instead "nearby" neurons are linked. There are well developed mathematical tricks to make the processing as simple as possible while still preserving good enough links. In Go, this helps mirror the ruleset, the fact that a stone's immediate impact on the board is only on the liberty count of its 4 neighbours. Only through the connectedness of the board does this influence slowly propagate out. Note that a ladder is the only Go shape that can propagate influence arbitrarily far through empty space without decreasing in value, but that is a special tactical case where both sides have only one reasonable move to play in order to win that fight. Normally the influence of a stone/position decreases exponentially with distance. In this way, Go is a primarily local game.

The non-linearity is inserted by an "activation function" F at each neuron. So each neuron is a linear sum of the previous layer's neurons (sum of weight w_k times neuron k's output), which is then passed F to produce the neuron's output. There are fancy functions such as tanh, sigmoid (for which the derivative is very simple, this is used in ELO as well), RELU and much more variants.

The training is done by using a loss function (e.g. cross entropy for sigmoid or least squares) on training games and gradient descent. If a position resulted in a loss, the value output for that position should be closer to 0% than 100%. The loss function tends to be linear, so that if a position results in a win p% of the time, the value output for that position should converge to p% (if only training on that position). The weights are adjusted by a size and direction (+ or -) according to how much they contribute in the current net to changing the value output for that position. Since large weights will have most impact on the output, they will also be modified the most. This leads to problems such as gradient explosion or vanishing gradient, and dead neurons. However, normally no neuron is completely dead (that is too focused on the discrete dimension we can see and forgetting the relations between the neurons). With different inputs, different neurons will be more or less important.

There is a constant feedback loop of adapting neurons as more training is done, learning and forgetting. Sometimes good simple rules learnt previously are replaced by more efficient better neuron rules. The random history of the net's past weights can affect how it fits the mapping function so keeping several different nets at one time can be a good idea rather than pure self-play. The idea is to get the net to converge to optimal play as best as possible.

RELU, gradients

Thinking in terms of sigmoid activation functions is quite confusing. However AFAIK, RELU seems to work just as well or even better, often with faster training times.
RELU(x) = x if x>0 else 0
The point is that this keeps most of the linearity but provides a cut-off point below which everything is a flat zero. This is a lot easier to analyse carefully.

We can think of each neuron as a function of the input space (extend {0,1}^n to R^n since the functions can take any number in R as input). It is convenient to represent this scalar function as a gradient field in R^n.

In the zeroth layer, each neuron represents a dimension, so the n neurons represent the n basis vectors.

In the first layer, if activation is not used, then each neuron is first a linear sum of the inputs, so as a function it is a hyperplane in the input x output space. The gradient is constant in the whole space. (picture n=2, a plane with where every point maps to a fixed vector). A RELU activation introduces a cutoff line, perpendicular to the gradient below which the gradient is zeroed out. Crossing a cutoff line leads to a fixed jump in gradient depending on which direction you cross it in. When n>2, the cutoff is a hyperplane of dimension n-1, but for now we stick to visualising n=2.

In the second layer, you have a linear combination of outputs in the first layer with some cutoff. In this way (pre-activation), you can have a function with several different cutoff lines inherited from the first layer. These don't have to be parallel, and the gradient can point the opposite direction as well, cancelling out gradients. The activation step leads to a new cutoff "curve" which is always perpendicular to the gradient there, but which separates the spaces into >0 and <0, replacing the gradient with zero if the function is <0. Many segments of this curve will be parallel to cutoff lines in the first layer, but where nonzero gradients from different neurons overlap, the curve may bend to smooth out the differences in gradient.

Is there still a sense in which if a neuron is created with m weights, then varying those m weights means the function space has dimension m? I think so yes. Especially with the standard technique by adding a constant bias neuron at each layer, and connecting it with a weight to all neurons in the next layer, this means that addition by an arbitrary constant is possible in a neuron, meaning the cutoff line can be shifted back and forth as well.

Metrics and simplifying models

Go has its ELO ratings as a model, but as katago helps demonstrate, good strategy against weaker players can be very different from good strategy against stronger players.
This is because it is a two player game and your results depend very much on your opponent. When you have three player games, things tend to get even more complicated, and standard game theory notions of "optimal play" regardless of your opponent's actions stop making sense because if your opponents always cooperate, you always lose.

In a one player game, the final score works as a good metric.

Let us a consider a one player perfect information game with two choices at each of n-1 moves with no repeated positions. The game ends after those n-1 moves and is scored. Then there are 2^0+2^1+...+2^(n-1)=2^n-1 possible states. It is natural to use n bits to encode the state.

How much different does a well designed encoding make compared to an arbitrary one? For example, on a go board, similar local shapes crop up a lot, but is this enough to make a big impact?

To simplify things, we assume that the neural network only outputs a value number in R which is trained to match expected score. During game play, the bot reads ahead one move and uses cross entropy to calculate the probability of playing the next move. Say the two possible expected scores from the current position are a and b. Then the probability of playing to a is e^a/(e^a+e^b). A multiplier alpha can be inserted so it is e^(alpha*a)/(e^(alpha*a)+e^(alpha*b)).

If the network is big enough (say 2^n weights), then you expect this to train to perfect accuracy with zero loss (but not necessarily perfect play). If there were 2^n independent weights training on each position, this is guaranteed to converge to perfection (if the weights don't bomb out and die with zero gradient). This is because all positions are reached with nonzero probability, and the 2^(n-1) final scored states will be trained to perfection since they are trained to match a fixed number. Other states will have more fluctuations since the expected score from such positions will depend on the probability distribution of moves chosen from that position which depends on the imperfect neural network. However, once the 2^(n-1) final states are perfectly scored (within eps<<1), distribution is fairly fixed, so the 2^(n-2) states that are one move before the end of the game will be training to match a fixed number and so on, gradually making the expected score for previous states all the back to the start more accurate. It is unclear to me how the non-independence of the weights in a neural network affects training accuracy.

A simple example

Have I really learnt anything yet? I seem to have made a small step towards answering my question on dimensions but otherwise, I still don't really understand the interaction between network architecture and ability to train to the right answer. I also don't much understand what changes for a 2 player game, or how the input data format can help or hinder information content/flow.

I suppose that strength in a 2 player game is about the confidence and skill of choose a line where you have to play m moves perfectly in a row in order to profit (compared to a simple variation), as well lots of ways to filter out bad moves to increase the lower bound for how bad a move you might play. Much of "aggression" is probably to assume that you will play such tight (紧凑) sequences more perfectly than your opponent. In particular, being able to live in your opponent's moyos and save weak cutting stones.

What I really want to see is some relation between the network structure (e.g. neurons in a bottleneck) with a small number of key concepts (human or otherwise) that we can measure and interpret with some kind of symbolic logic/arithmetic. But I don't know if that is reasonable or what that would look like or how to make progress.

lightvector · Post by **lightvector** » Sun Oct 17, 2021 4:23 pm

It sounds like you've dug into this quite a lot! There are lots of minor things that are stated a little inaccurately, or ways things are phrased that are oversimplified, but also yes a lot of your understanding is reasonable and good.

The most notable thing of these minor things: the total number of possible Go board positions is not a relevant number, it's a vast, vast overestimate that you shouldn't pay much attention to.

Note: technically we should talk about game tree complexity, not state space size, but for simplicity I'm going to conflate and ignore that for this post.

For starters, even if you're trying to rigorously prove a solution to the game or to play optimally, with good algorithms you generally only care about the union of the set of positions reachable by either:

* Black plays optimal or near-optimal moves, while white plays anything. (To prove black's best result >= x, you must consider every white resistance, but for each of black's turn you need only demonstrate ONE good enough move).
* White plays optimal or near-optimal moves, while black plays anything. (To prove black's best result <= x, you must consider every black resistance, but for each of white's turns you need only demonstrate ONE good enough move).

You probably have this intuition as a Go player too - when analyzing a position, subvariations that can only result from multiple obvious mistakes by both sides are irrelevant to determining the overall result. So at any given moment we are entirely losing the branching for from black, or losing entirely losing the branching factor from white, we roughly square-root the number of reachable positions that we care about (and good algorithms can obtain nearly a sqrt speedup as they go, even if they don't know for sure at the outset what moves will be optimal).

Now we can go further. In the context of neural nets, normally we also abandon the idea of rigorously solving the game, because with a neural net, even if it does learn to play optimally, you don't have much ability to rigorously prove that those particular millions of numeric parameters happen to encode an optimal player other than to just resort back to exhaustive search anyways. So rather than rigorous solution, we simply care that *in practice* it appears to be unbeatable vs anyone or anything. Or sometimes we care only about the lower standard of merely beating the best human players.

Then you can get another *vast* improvement beyond even the square root. If the game tree were absolutely arbitrary with no correlation whatsoever between the outcomes of different moves, no repeated patterns or strategic regularities, etc, then square root via exhaustive search is best you can do. But game trees for real life games are not arbitrary - they arise out of particular rules and mechanics, resulting in exponentially vast amounts of repeated patterns, correlations, regularities, strategic principles, etc. Just like humans, neural nets will learn these principles ("eyes", "good shape vs bad shape", etc) that generalize across vast numbers of positions, and then refine them and learn the exceptions, and then refine them further and learn the exceptions to the exceptions, and so on.

Again, this should be intuitive to you as a Go player. Consider a 5x5 board. The optimal result is black opening in the center and then owning the whole board. If we do the same ultra-crude estimation, we have sqrt(3^25) ~= 900k. Do you have to memorize the status of 900k completely distinct and independent things in order to always win the entire board on 5x5 as black? No, of course not. While fully exhaustive proof of black's win would require showing how black refutes

A1,

A2,

A3,... all separately and exhaustively, and for each of them, how you refute every

and then every

and so on... if we only want optimal-in-practice rather than rigorous proof, we only need to read out or study about a limited tree of critical lines where White resists strongly, and for refuting all the other obviously bad moves like A1, A2,... it's enough to play according to simple strategic shapes and principles that *generalize* across all these exponentially many possibilities, and by even the most pessimistic ways of quantifying and counting this fuzzy value, there are going to be far fewer than 900k shapes and principles to learn.

I would guess with a good algorithm and architecture, you could get to optimal play on 5x5 from the opening position with a net that has somewhere between 100 and 5000 parameters. Which is a pretty massive improvement over sqrt(3^25), much less 3^25.

Anyways, the point is - the size of the state space or game tree is really more of an idle curiosity than a useful number, if all you care about is practical play. It's such an absurdly vast overestimate, and the improvements that neural nets can give you are so dependent on the nature of the game and the details of your algorithm and architecture, that even using it as a starting number from which you quantify what improvements you can get is not useful - you're better off ignoring it entirely and just looking at or trying small practical experiments to empirically measure what you care about directly.

Polama · Post by **Polama** » Tue Oct 19, 2021 7:38 am

Even with linear activation functions, it doesn't have to follow that (A is a good move) + (B is a good move) = (playing A and B is good). At the input to the network, we enter the state of the game. We then multiply the representation of that state against various weights and sum into the next layers. It would be completely reasonable for a network to have learned "if I played at A, then the positions close by are less valuable now", which might include B.

Imagine an input neuron X, and then a complicated network, and then an output neuron Y. The value of X (1 or 0), gets multiplied by various weights and summed into other values and multiplied and summed and smears out through the network and eventually influences the value of Y. But since everything is linear, we can look at all the paths X took to get to Y, and sum them up, and get a function like Y = 5X - 8Z + 3W + ...

So essentially, a multi-layer network with linear weights ends up collapsing down into a simple linear equation anyways.

Once you add the nonlinearities, the paths the weight X takes are all complicated, and don't just sum directly into a final influence weight, so you can represent much more intricate decision boundaries.

Do the dimensions match? Is dimensions the right way to think about the problem?

Not really? First, the weights in a network are analogous to synapses in the brain. So although the "neuron" counts are getting close, it's not a meaningful comparison.
Second, you only need to represent the full dimensionality of your problem if your problem is pure noise. If every Go position was a 100% unique problem where you could apply absolutely nothing from your previous experience, you'd need enough capacity to memorize the right answer to every problem. In actuality, there are general concepts, and common shapes, and locality where a corner group is dead even if we change stones around in the far corner. So playing at some level X (even if X is perfect play) needs some tiny fraction of the space it would take to encode the right answer for every single possible position.

I still don't really understand the interaction between network architecture and ability to train to the right answer.

Nonlinearities make the math work. Empirically, the larger the network the better it performs. The theoretical underpinnings of that aren't great, but it shows up in practice. If a network was just barely large enough to solve a problem, then only a very specific, small number of weights would work. If the network is far larger then needed, then there's lots of ways to parametrize it to solve the problem, and so it's "easier" to train. Why this continues to matter so much at ever increasing sizes is harder to explain.

The network is also the way you connect up the layers. This usually encodes some human intuition about a problem, like: "if I rotate a picture, it should still be the same picture" or "it should be easy for a network to throw away unneeded information as it processes the words in a sentence". You could think about information theory, and the complexity needed to represent various decision boundaries in a given architecture, but there's quite a bit of "throw stuff at the wall and see what sticks"

What I really want to see is some relation between the network structure (e.g. neurons in a bottleneck) with a small number of key concepts (human or otherwise) that we can measure and interpret with some kind of symbolic logic/arithmetic. But I don't know if that is reasonable or what that would look like or how to make progress.

Explainable AI, and connecting symbolic logic with deep learning, are major outstanding challenges in the field. People try lots of things (like looking for statistical correlations with individual neurons firing and some feature of the board, like the presence of a dead group), but it's really still early days. So if you care enough to dig to the cutting edge, you can advance human knowledge! But on the flip side, it's not the sort of question where there's ready answers.

lightvector · Post by **lightvector** » Tue Oct 19, 2021 9:15 am

Yep, that's a pretty nice overview. One quibble though:

Even with linear activation functions, it doesn't have to follow that (A is a good move) + (B is a good move) = (playing A and B is good). At the input to the network, we enter the state of the game. We then multiply the representation of that state against various weights and sum into the next layers. It would be completely reasonable for a network to have learned "if I played at A, then the positions close by are less valuable now", which might include B.

If you are talking about the "policy" function (i.e. predicting where a next good move is), then sure, you could have linear weights such that a stone at either of A or B reduces the policy output value at the other location for it being the next move. Each location can have an initial positive value for being empty, and a negative weight for the presence of a stone at the other location. But dhu163 didn't use the phrase "good move", he instead talked about about whether having one or more stones is better or worse for a player, which is easier to interpret as a statement about the value function.

And for the value function, dhu163 is completely right. Suppose we have board states X, and A (which is X + one stone) and B (which is X plus a different stone) and C (which is X plus both of the stones from A and B), and suppose we use a normal input encoding consisting of the presence/absence of a stone of each color on each location of the board, such that when encoded as vectors/tensors, C = X + (A-X) + (B-X)

Then if the value function f is linear, we have f(C) - f(X) = f(X + (A-X) + (B-X)) - f(X) = f(A-X) + f(B-X) = (f(A) - f(X)) + (f(B) - F(X)). In other words, the amount of value by which C is better than X is precisely the sum of the amount that A is better than X and B is better than X.

Polama · Post by **Polama** » Thu Oct 21, 2021 12:03 pm

Excellent point lightvector, you are correct that a linear value network won't represent the idea that stones can be good or bad in combination with each other, and that that's closer to dhu163's point. In essence we're saying A XOR B is good (exclusive or), and one of the key observations during the development of neural networks was that linear functions can't represent XOR.

dhu163 · Post by **dhu163** » Fri Oct 22, 2021 6:07 am

Thanks for all the explanations.

dhu163 · Post by **dhu163** » Tue Dec 28, 2021 4:50 pm

I would like to note:

If no bias is introduced, then with n input bits and only RELU activation, all zero input is always mapped to zero output in every neuron. And I think it turns out that however many hidden neurons you have, you always get an output where the cut-off hyper-planes go through zero. Any two such planes only intersect at zero so the gradient at x only depends on the ray from the origin (0 to x to infinity). But to produce such a neuron only one hidden layer is necessary. That is, a sequence of nets each with only one hidden layer can produce in the limit any output neuron that can be produced by an arbitrary number of hidden layers.

I don't know what happens with a bias, but I assume that changes things completely so that it is completely general. I can see you can create any function if you can create a delta function around zero (and then shift it with a bias), and something close to a delta function can be created in the limit in the 2nd hidden layer with only 2n offset neurons in the first hidden layer. If correct, it seems that theoretically, only 2 hidden layers are required to produce any neuron you want. I have oversimplified, but perhaps this is not far from the truth, needs more thought.

edit: (more thought)
(edit 20220910: this first result is quite confusing but seems true (for input dimension at least 2) with a slight rewording (the key is about compact support). If I find a clearer proof, I will add it to a longer paper on my wordpress.
Any function on input dimension 1 can be approximated with just 1 hidden layer as my experiments suggest.
Also function should be replaced by continuous function with supremum norm.)

THM: with RELU, we cannot create an arbitrary neuron with just 1 hidden layer
PRoof:
hmm, actually the cut-off (n-1)-plane includes an (n-2)-plane at infinity and so do all the elements close to that. But the value at (i.e. in the limit) each point on the (n-2) plane is either infinity (it is in the non-zero region of a contributing neuron with cut-off plane not parallel to it) or it is the sum of the finite contributions from neurons with cut-off plane parallel to its plane. Hence for the limit of an output neuron to be zero far enough away, i.e. tending to zero at infinity in every direction would imply that the sum of finite contributions from neurons with cut-off plane parallel to its plane is zero for every such (n-1)-plane. However, the output value at a point in the n-space is the sum of these contributions from all (n-1) planes that go through it, which is therefore zero.

(in the above replace zero with non-positive as required if considering activation in the 2nd layer)

Hence we have proved that any non-zero output neuron in the 2nd hidden layer cannot tend to zero at infinity in all directions. QED (hopefully).

This could perhaps be proved by inversion, consider 1/(output). I suspect this to be true of any activation function that produces a kind of 1d result, but it might merely be RELU?

Note that we can certainly make a bounded non-zero neuron by placing two neurons with parallel cut-offs but different sign next to each other. Call these pipe neurons.

This result implies we cannot create any general function with just 1 hidden layer. And yet we clearly can with 2 hidden layers, by adding two pipe neurons, each only non-zero around the zero point of their own (n-1) planes, but with non-parallel (n-1) planes. Then this only reaches its highest point where the two (n-1) planes intersect and the rest can be discarded and set to zero with a bias. By taking multiples, we can make the highest point arbitrarily high, and can also make it nonzero only in an arbitrarily small n-volume around the point we want.

so
THM: we can create any delta function in the limit with 1 hidden layer of RELU
hence
THM: we can create any general non-negative function in the limit with 2 hidden layers of RELU. We can create any general function in the limit with 2 hidden layers of RELU if no activation function is applied to the output neuron.

THM: if there is only one input neuron, we can create any general non-negative function with 1 hidden layer of RELU (exercise for reader)

Does this matter for training?
Probably not much, since we normally only care about the output on a small box and not at infinity, but we do care that there are holes in function space that can't be mapped.

But even then, I think the metric/topology of neural nets and how different activation functions/architectures influence training is surely still an interesting topic of study, because it helps map to what you want to use it for, but also very hard (but non-equilibrium thermodynamics seems hard and also to have been a hot topic for a while and has produced some results).

For example:

What is the effect on training of having a large net with a hidden layer with a number of neurons much less than input and output? Does this slow down evolution to match that layer?
Does it help to think of a net in entropy or energy terms with forces at the input as well as forces at the output, propagating through the net?
What do we optimise? Training time, network size, accuracy etc?
For a given accuracy, what architecture minimises (number of neurons) * (training time)?
1. Given just two hidden layers of arbitrary size and a bias, how well does a net train? Does it always converge to global optima?
2. My intuition is that with 1 input neuron and 1 output neuron and 1 hidden layer of arbitrary size initialised randomly, and sufficiently small increment and say a weighted least squares cost function, then given any positive function that is the "goal" to match (on a bounded region), this net should train to perfection. My intuition says what the output orders is what it gets, and all orders can be met.
  I wanted to say that least squares cost means that weights being equal, there is greatest pressure on neurons in the layer before to be a multiple of the ideal output function, but this is just false.
  
  Well, since we are using gradient descent, if we use a sufficiently large batch and small enough increment, the cost has vanishingly small chance of increasing. Hence, the goal is to show that there are no local optima except when the output matches the goal. This depends on the hidden layer having a sufficiently random set of weights and for it to stay that way during training.
  
  We can see that local optima will occur if gradient is zero (so all hidden layer neurons are orthogonal to the difference between output and ideal after applying a filter to remove where either is zero) and Hessian is positive definite (saddles are possible but vanishingly low probability so ignore for now).
  
  Perhaps the asymmetry that priority is given to neurons with higher weights can be beneficial here to avoid symmetry and maintain randomness.
With randomised training, which local optima are reached more often? How do we optimise training to allow us to reach a good local optima most of the time?

dhu163 · Post by **dhu163** » Wed Dec 29, 2021 8:49 am

Some tensorflow experiments seemed initially to show I am wrong about convergence with 1 input and 1 hidden layer. For example, I tried an ideal function of sin(5x) and it still mapped 0 to 1. It didn't get much success mapping near x=0 nor x=1 though it was fairly accurate in the middle.

My guess is a sort of prisoner's dilemma issue is occurring because two hidden layer neurons have to work together in order to get the initial kink, but each such neuron is close to orthogonal to the error term. Alternatively, we can say that sum of at least two neurons is required, but training causes the coefficient of one to decrease, which is the wrong direction.

This is presumably when a 2nd hidden layer will help a lot, by encouraging teamwork like a team captain. A caveat is that competition between team captains for team members attention can cause trouble if they pull in opposing directions since that makes the neuron freeze.

My experiments show given n neurons per layer, there is a good minimum number of hidden layers h that works well.
n h
<4 h<10: RMS error of 0.2 is possible but it can't fit the initial kink and maps 0 to 1, h>10: struggled to reduce RMS error below 0.6 even with many layers. I think there may be a vanishing gradient issue with more layers as the input area gets "colder", so it doesn't notice what the input even is and just outputs a constant value. Is this akin to temperature?
5 8
30 5 (6 converges faster)
100 3
10000 1 (with 5 times more rounds of training data) but even then RMS error still 0.08, several orders of magnitude worse than other tries.

I think that more neurons per layer always improves accuracy (if same training data and increment small enough), but more layers can make things worse especially with few neurons per layer.

My explanation is:

20220520 edit

dhu163 · Post by **dhu163** » Wed Dec 29, 2021 11:54 am

Back to

Is it useful to apply "energy/entropy" to neural net training?

I try to compare some metaphors/analogies to familiar physics. Please note this is merely an outsider's perspective.

Firstly, some popular science style thoughts on life/earth energy/entropy. Slightly scatter-minded.

Equilibrium entropy is much easier to understand but this requires no interaction with the environment.

On Earth, we have a periodic energy source of sunlight which drives activity on Earth, as well as heat radiation back into space. Other than key laws of energy and momentum conservation (connected to space-time symmetries of physics by Noether), we don't have many other key laws. Momentum conservation suggests more heat will be radiated from Earth away from the Sun than towards the Sun, but the Earth itself could also absorb the momentum by shifting its orbit slightly so we can't be sure.

Now, entropy theory tends to suggest that all else equal, energy will spread to all possible dimensions evenly. My interpretation is that this suggests that even with a non-equilibrium source of momentum (sunlight), there will be a natural balance between how much sunlight is reflected back to the Sun, how much is absorbed on Earth, and how much is reemitted in various directions. Quantum physics can study how special "potentials" interact and solve equations in extreme/ideal limits (complete reflection, zero reflection etc.), showing how to design equipment that has such properties, as well as helping intuition interpolate for what happens in less extreme conditions.

Life is part of this cycle as Sun energy goes into plants, heating the ocean and land, turning CO2 into O2 etc. . Animals add equilibrium with their job of storing and recycling this excess O2 back into CO2, releasing heat in the process. Equilibrium thermodynamics will call us part of the dissipation system of the excess energy. A stronger understanding of a periodic (and essentially fixed) energy source would probably understand the whole system as a series of transitions of structures that enable and react to the inbalance caused by the low entropy property that the Sun has so much more energy than the Earth. Perhaps a better understanding of the time arrow of entropy and memory would give us even more insight.

Why does life seem to contradict the 2nd law of thermodynamics that says energy becomes more even over time? The 2nd law suggests things become more chaotic and random, which normally is viewed as less repeating patterns.

Well the duplication property of DNA seems to be strange since there is some inbalance between the development of increasingly complex instructions and the 2nd law. Dawkins has compared life to crystal (non-living) growth. My interpretation is that sometimes repeating patterns such as in a solid are the lowest energy forms of that material (when like attracts like etc.), but it very much depends on the environment. Knocking a bottle of super-cooled water can cause it to freeze because it was already cold, but simply hadn't connected into crystal form yet, but once a small part of it forms into ice due to random chance (caused by heat from knocking), other molecules slot into place naturally (and the kinetic energy of liquid molecules goes into vibrations of the ice as heat/sound that escapes through the bottle into the environment). Probably there are many similar patterns that we don't have the senses to notice unless we pay careful attention, and I'm sure DNA is far from the only way hereditary complex instructions are passed down, even though we are most aware of its long and successful history. Also, extinctions show how fragile life is, so we don't expect never-ending increase in complexity. It depends on which dimensions you measure in, and that is biased by how you are designed and trained.

Humans are small enough compared to the Sun, that you can probably expect something to come of the Sun's energy, and since the time scale of the Sun's life-cycle is billions of years, the energy source is unlikely to change much, so at the human scale, entropy effects on the Sun are not noticeable without a long time-scale of data.

However, the origin and nature of our life/intelligence and potential other life-forms remains very puzzling. As well as how nature has guided humanity to technology and population leaps over the last few hundred years. Can we interpret this situation as having been set up for us millenia ago? What are the driving factors? Access to oil, electricity, global warming, globalisation, trade/information networks?

Entropy in a computer

Energy/power in a neural net

Loss satisfies a heat equation

Circuit analogy of a neural net

Supply/demand in a neural net

Fragility

Gradient vanishing

Contribution of a neuron

ReLU vs Sigmoid

What happens at a bottleneck in a neural net?

dhu163 · Post by **dhu163** » Wed Dec 29, 2021 12:35 pm

Architecture: 1 neuron per layer, 1 input, 1 output

Some quantifying results

Suppose an NN has 1 neuron per layer with ReLU activation. Then the outputs of such a net are bounded between the minimum and maximum [l,h] (which may be infinite). As ReLU is non-negative, we have h>=l>=0. After another layer of ReLU with random weights initialised uniformly in the range [-1,1], then [l,h] maps to [wl+b, wh+b]. So if w,b<0 which happens with 1/4 probability, both endpoints will be negative and hence the whole interval maps to something negative which is mapped to zero by ReLU. So the output will be constant.

This implies that after n layers, the probability the output is constant is at least 1-(3/4)^n which tends to 1 as required. Even if w,b aren't both negative, the probability the interval will shrink is very high, and this is another way it tends towards a constant.

THM: The probability the output is unbounded (so h=infinity) is 2^(-n).
Proof: This occurs if and only if every weight w is positive.

The conjecture is similarly true by basic Markov chain theory. Explicitly we have:
With k neurons a layer, the probability a neuron in the next layer is dead is at least the probability that all relevant weights (including bias) are negative which is 2^(-k-1). With k neurons, the probability of one dead neuron is higher (close to k 2^(-k-1)), and similarly, the probability of all dead neurons is non-zero, around 2^(-k(k+1)).

The probability of the entire layer being constant after n rounds is at least 1-(1-2^(-k(k+1))^n. Again, this tends (very slowly) to 1.

In practice
Although we are only considering initialisation and training can move the weights out of [-1,1], this analysis remains relevant for entropy purposes. If there are a lot of states with dead neurons hanging around, the gradient vanishing problem will be significant and training is likely to be slower or even stall.

However, please note while a network with all weights zero is completely dead and unresponsive to training, as long as there is any non-zero weight connected to the output neuron, this can be revived as the bias neurons are non-zero so connections can slowly be made again (after a cold winter).

Entropy and non-equilibrium
If we continue the analogy to supercooled water freezing at a knock, this seems to be like a situation where we have a fairly flat set of nearby energy states (with many liquid molecules running around bumping into each other) which probably has higher entropy (more possible states) than the fixed frozen state (where some dimensions of kinetic energy are dropped), however there is a vortex with an energy level just above which leads to a wormhole down a well, at the bottom of which is the fully frozen state, and a lower energy than the liquid. However, reaching the vortex requires some energy input first.

Time for a thought experiment. Consider dropping a stone into a pond. This is clearly a non-equilibrium situation but why? How does the entropy increase after normal physics is followed?

My interpretation.

There are forces and energy exchange even in (dynamic) equilibrium states. For example, the pond has lively liquid molecules running around and bumping into each other. However all such molecules are identical so many of these states are identical and overall they all seem similar with our measuring ability, even if we can see some waves etc. There may be salt dissolved in them but at equilibrium, these should be uniformly spread throughout the pond, perhaps slightly below the surface (due to higher density). The boundaries of the pond are hardly at equilibrium since a (dead) planet at its lowest energy state should layers organised according to density and stone and water shouldn't be at the same level. This hasn't happened partially because "work" and energy input is required to mix and move them and the water doesn't have enough energy to solve that so quickly (but don't underestimate its power), as well as weather cycles/rain meaning water rapidly fills underground caverns faster than they cave in.

So the stone drops in the lake. This is very low entropy since there is so much gravitational potential energy difference between the stone and the water/air. In addition, stone is not the same as water so there is some mixing/dissolving tendency. Gravity tends to put like densities together so we expect the stone to settle at the bottom which is a much lower energy state. The quickest energy conversion is the gravitational potential energy converted into kinetic energy of the stone and then of the water and of the bottom upon impact. It is a much lower energy state to have water above stone. However, the energy dissipates in waves and ripples, taking time to spread out. It is like the energy has a momentum of its own that is carried away until blocked by something (i.e. shockwave). It is not clear to me (I didn't study fluids) how much energy is absorbed as permanent kinetic energy (heat) of water locally and how much gets radiated out (even if the wave doesn't increase energy locally due to friction, the pressure still reduces by a factor of 2 or 3 dimensions as it spreads out. (i.e. inverse or inverse square or something in between).

Over time, this wave might be reflected back by boundaries and the pond should even out in temperature again and flatten. Small effects may mean that a heavier stone is covered by sediment and less likely to rise during torrential flooding etc. This will make the densities align further even if very slow.

There should be a way to measure the timescale at which energy transitions in non-equilibrium situations occur based on the interface between energy states. I haven't researched to check what work has been done.

We don't experience much of the energy transition ourselves as we watch, but we can appreciate some of it, such as the sight of the motion of a falling object (perhaps some excited emotions as we anticipate its effect), as well as the sound of the splash and sight of some ripples. But most of the energy conversion barely reaches us and is barely noticeable.

In our neural network example, it seems like the loss function sends repeated shockwaves through the net, but they can be reduced and blocked by the zeroing out property of ReLU, or reductions in the loss function. If the neurons are orthogonal to the gradient, the waves just pass straight through the neuron with no effect, and get absorbed at the input/bias neurons.

Legacy functionality needs protection of vital parts to prevent rot

dhu163 · Post by **dhu163** » Thu Dec 30, 2021 8:28 am

I've just written up a few formulae trying to get conservation laws/inequalities during neural network training. I've gotten something I've called thoroughput though unfortunately that is already a NN term. It is simple enough that I think it's likely people understand the theory quite well already, but I enjoy working things out myself anyway.

https://dhu163go.wordpress.com/ai/

In the process, I've noticed that backpropagation produces wave effects that are quite akin to quantum double-slit effects.

Edit: 20220102

Classical info theory vs info theory needed for NNs

A note for myself

Sketchy thoughts on NN/entropy

following on from above, a layer has local degenerate symmetry if input weights W1 and output weights W2 satisfy: vW1=0 and W2v=0 has a non-zero solution.

This symmetry might not be stable, but it will be locally maintained in the short term. It is interesting to work out when the left kernel of W1 and the right kernel of W2 come close to alignment and how often that occurs. In combination with a bias, it is also relevant.

Neural networks are like the surface of crumpled rope or proteins. Am I going mad? + a message for the next generation

They are more like the flow of water waves through designed pipes since they are mostly additive/linear, with some blocks so that flow only occurs in one direction. This is why they seem so similar to the double slit experiment in quantum mechanics. The weights don't adjust to handle every case, but instead adjust to handle the average.

Mirrors are there to draw you in, to attract suspicion. You can break through at times like fish leaping out of water. Be accepting of the different shaped beings that appear and also afraid. Mirrors can burn both sides with their sharpness and store souls and impressions inside. The black back protects it from the sharpest light but not from physical forces. What might you find behind an old mirror? Impressions, or have they all escaped?

The will of humans can take them and use them to design tools, but is it really worth it? They say science works. But who are we working for? Time? Space? Energy? Momentum?

If men are from Mars (hot bubbles that guide war under the red crust), and women are from Venus (with hot poisonous obscuring fumes) as we were told as kids, what lies in between?

History suggests high well-fed populations break out into war on their neighbours with their confidence and internal pressure. Without strong defences, their neighbours may even encourage this, and find ways of coping, absorbing and exchanging. Who knows what reaction will occur?

When great structures collapse (from fighting neighbours or cousins), other cousins need to be quick to take over and claim the goods but everyone is different and each has their uses. Fossils from millions of years before might have been initially avoided due to poison/heat (volcanoes) then pressure (under the sea), but junk in old storage can still be powerful. Humans can use them as fossil fuels, information, DNA, inspiration, artwork. Or do we recognise them and lay them to rest respectfully even if they aren't our own direct ancestors, they still have powerful descendants (dinosaurs -> birds etc.).

We can't use them forever, but who are we? Images of us are passed in every touch, sight, reflection and leave impressions that are passed on and may even come back to us. Some images might even have "life" at different rates with spans different to our own. And yet we must only leave a small window of communication between or else all the pressure leaks out too quickly and we commit suicide. Walls or grassland? We only have one dimension of sentences, one dimension of DNA, one organ for procreation. (normally at least)

tides

dhu163 · Post by **dhu163** » Sun Mar 13, 2022 2:32 pm

Some more poetic thoughts on neural networks.

Review of thoughts so far.

Measures: accuracy, size (efficiency=accuracy/size), training speed

Optimisation: architecture, how to choose height and width? More neurons per layer always helps accuracy, whereas more layers can suffocate the input with sand, burying the input and snuffing the error flame. Shallow networks can be poisonous to the flame, like having lots of moist wood that hasn't been dried out yet. Lots of bad fuel eventually reacts, aligning with the force, but very slowly as most of it is light garbage that gets pushed to the side quickly (like rust protecting iron from the air).

Activation: ReLU or sigmoid and what effects?

What low hanging fruit hinder perfection of a healthy net? Which shallow local optima can be fixed by a knock? Competition requiring a leader, communication,

Are there local/global conservation equations? Energy, momentum, entropy, information. What constants exist, like elements? How do they form groups that are cycled through the system?

Standard encoding theory assumes constant Cartesian axes and measures the alignment of distributions with KLD. Information theory measures the reliability and efficiency of information flow. In a neural network, perhaps this is a bit more difficult with shifting neurons.

Meta: For protection of vital parts, a skin-like interface is needed to prevent all but a few errors from interacting with key parts. This needs to be trained by competition and tends to lead to creation of species that don't combine well (since their key parts are in different places). One net self-training is a bit asexual, and arguably sexual selection is useful for building such skins. In neural networks, what do these look like? It is probably difficult to upgrade this sort of protection as it keeps degrading naturally without support and maintenance.

Philosophy: This is based on the assumption of perfecting a tool. However, sometimes a imperfect tool is more stable, and useful for lasting relationships than a bomb that destroys the opponent.

Strong connections exert pressure on their neighbours to support and take off the burden rapidly, while still dominating in power and relatively constant themselves. There may be trade but also poisons that destroy neighbours.

Diagnostics

Stability on one hidden layer with ReLU

"Conceptual unit theory"

A remaining question I don't fully understand is how this works in a two player game. What features are there that allow a network to keep progressing relentlessly. What is the healer's job?

dhu163 · Post by **dhu163** » Tue Mar 15, 2022 5:16 pm

Two-player game

Both players pull towards their ideals to choose the route though only one route can be taken. Two players can introduce blind spots and confusion of common languages.

With two lines, there are three types of possible evaluations. Common reasons for their occurrence.
1) Both think line A is better than line B for player X, and this is true. This is correct though the degree of better may be incorrect.
2) Both think line A is better than line B for player X, and this is false. This is mutual misevaluation. Most commonly, player Y has unusual trick that neither side has recognised.
3) Both think line A is better than line B for them. One side has been tricked. The asymmetry of the game hasn't been strong enough to withstand a short circuit. Are both trying to avoid line B for some reason, such as the force?

For 3), this is like flight from B into a fight at A, both saying "my god is better than your god".

Why do mistakes occur?
1) They are on rare/bad lines that aren't normally reached, so they aren't useful/necessary to learn. This may be permanent, though evaluation will vary with the environment. These pick up the "slack" of common features.
2) They are on good lines but have contradicting features that are commonly used as guides elsewhere. How do you sum features that aren't numbers?
3) An external factor is manipulating them. e.g. a human might wish to avoid complicated lines they don't understand or that waste energy. (and this analysis of mine adds more to the waste or feedback loop)
4) In general, they are at the interface of the known and unknown.

Why does the net keep improving?
What sorts of short circuits are likely?

For 2), chance alignments of features both are chasing can draw them to the same "watering hole". If this happens on a large scale, training will focus on correcting those, so that is unlikely. Instead small scale skirmishes test each other's confidence.

In the case of self-play, the features for each side are deeply interwoven, coming together and difficult to separate.

A silly encoding.

More metaphors

Energy

Misc Observations

Testable Hypothesis: A higher learning rate produces a more even distribution of neuron flow (or at least a lower maximum) (i.e. Gaussian), whereas a lower learning rate allows things to settle produces larger dominant neurons (like language, Zipf distributions).

Life In 19x19

Neural networks optimising mapping functions

Neural networks optimising mapping functions

Re: Neural networks optimising mapping functions

Re: Neural networks optimising mapping functions

Re: Neural networks optimising mapping functions

Re: Neural networks optimising mapping functions

Re: Neural networks optimising mapping functions

Re: Neural networks optimising mapping functions

Re: Neural networks optimising mapping functions

Re: Neural networks optimising mapping functions

Re: Neural networks optimising mapping functions

Re: Neural networks optimising mapping functions

Re: Neural networks optimising mapping functions

Re: Neural networks optimising mapping functions