 Post subject: Neural networks optimising mapping functions
Post #1 Posted: Sun Oct 17, 2021 6:10 am 
Lives in gote

Posts: 470
Liked others: 62
Was liked: 278
Rank: UK 2d Dec15
KGS: mathmo 4d
IGS: mathmo 4d
This post is intended for a mathematical + compsci audience only. It is meant to demonstrate my understanding and share some thoughts.

(edit: 2022/09/11 I am trying to write some of this up more formally. Disclaimer: much of the below was speculative musings without enough thought to back it up. Don't take any of it too seriously)

(edit: 2022/09/25 On the other hand, while some was nonsense, much really was seeing how NNs are quite a general discrete potential model, especially for energy dissipating through a crystal, whose fractal nature can even be used to model Go. I am more confident much can be written up now. Even some that I said was difficult to quantify, I now think can be quantified and measured. On the other hand, I still expect that almost all this is old news scientifically, though it is new to me. This thread now seems to be a demonstration that curious whimsical musings can turn into papers)

As far as I recall:
There are around 3^361 possible Go board positions (most of them illegal); the number of legal positions is more like 10^170 according to Tromp.
The number of weights in a 40 block neural net based on Alphago (like katago) is around 10^8.
The number of neurons in a human brain is around 10^11.

Do the dimensions match? Are dimensions the right way to think about the problem?

For example given n input bits (e.g. to represent board positions), you want to output 1 number to represent the value of that position. The dimension of the function space is 2^n * 1 = 2^n.
However, does the number of weights in a neural network roughly correspond to dimensions? The key thing is that neural network outputs are not linear in their weights or inputs, so thinking of the functions as a vector space probably doesn't lead very far. Non-linearity is critical because otherwise linearity would imply that if a stone A helps player X, and a stone B helps player X, then having both stones A and B on the board will help player X. This is false, for example if A and B together are self-atari or suicide.

On the other hand, matrices still play a key role in neural networks. The linearity point is why I was surprised when I was first told by a Deepmind employee that the AI was based on matrices. However the study of matrices is very deep, with highly optimised (e.g. parallelised) algorithms to work with them - adding, multiplying etc.
It is also easier to think in terms of linear things (since we are familiar with adding up normal numbers and matrices are just collections of numbers), and the calculus formulae are much simpler to write down with matrices (merely the chain rule: dAx/dx = A).

Neural networks in 5 paragraphs:

If there are n input bits, these n numbers form the zeroth layer (with n input neurons). Each subsequent layer consists of say m neurons, each of which is a function of the neurons in the previous layer. After a good number of layers (I think it is 80 in a 40-block net?), the output layer consists of your output, e.g. the value and policy numbers (just 2 neurons). There is a weight (a real number) for every connection between a neuron in one layer and a neuron in the next, indicating how they are correlated.

In CNNs (which AlphaGo uses, borrowed from image processing), not all pairs of neurons in neighbouring layers are linked. Instead "nearby" neurons are linked. There are well developed mathematical tricks to make the processing as simple as possible while still preserving good enough links. In Go, this helps mirror the ruleset: a stone's immediate impact on the board is only on the liberty count of its 4 neighbours. Only through the connectedness of the board does this influence slowly propagate out. Note that a ladder is the only Go shape that can propagate influence arbitrarily far through empty space without decreasing in value, but that is a special tactical case where both sides have only one reasonable move to play in order to win that fight. Normally the influence of a stone/position decreases exponentially with distance. In this way, Go is a primarily local game.

The non-linearity is inserted by an "activation function" F at each neuron. So each neuron is a linear sum of the previous layer's neurons (sum of weight w_k times neuron k's output), which is then passed through F to produce the neuron's output. There are fancy functions such as tanh, sigmoid (whose derivative is very simple; it is used in Elo ratings as well), RELU and many more variants.
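
For concreteness, here is a minimal numpy sketch of one such neuron (a weighted sum of the previous layer's outputs passed through ReLU); the numbers are made up.

Code:
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

prev_outputs = np.array([0.2, -1.0, 0.7])   # outputs of the previous layer's neurons (made up)
weights      = np.array([0.5,  0.3, -0.8])  # one weight per connection from the previous layer
bias         = 0.1

pre_activation = weights @ prev_outputs + bias   # the linear sum described above
output = relu(pre_activation)                    # the activation F, here ReLU
print(pre_activation, output)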

The training is done by using a loss function (e.g. cross entropy for sigmoid or least squares) on training games and gradient descent. If a position resulted in a loss, the value output for that position should be closer to 0% than 100%. The loss function tends to be linear, so that if a position results in a win p% of the time, the value output for that position should converge to p% (if only training on that position). The weights are adjusted by a size and direction (+ or -) according to how much they contribute in the current net to changing the value output for that position. Since large weights will have most impact on the output, they will also be modified the most. This leads to problems such as gradient explosion or vanishing gradient, and dead neurons. However, normally no neuron is completely dead (that is too focused on the discrete dimension we can see and forgetting the relations between the neurons). With different inputs, different neurons will be more or less important.
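
As a toy illustration of the update rule (not the actual loss or architecture of any real Go net), here is one gradient-descent step on a squared-error loss for a single linear output; the position encoding, weights and learning rate are made up.

Code:
import numpy as np

x = np.array([1.0, 0.0, 1.0])    # an encoded position (made up)
w = np.array([0.4, -0.2, 0.9])   # current weights of a toy linear "value head" (made up)
target = 0.0                     # the game from this position was lost, so push the value down
lr = 0.1                         # learning rate (step size)

pred = w @ x
loss = (pred - target) ** 2
grad = 2 * (pred - target) * x   # d(loss)/dw: weights whose inputs contribute most get the biggest updates
w = w - lr * grad
print(loss, w)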

There is a constant feedback loop of adapting neurons as more training is done, learning and forgetting. Sometimes good simple rules learnt previously are replaced by more efficient better neuron rules. The random history of the net's past weights can affect how it fits the mapping function so keeping several different nets at one time can be a good idea rather than pure self-play. The idea is to get the net to converge to optimal play as best as possible.


RELU, gradients
Thinking in terms of sigmoid activation functions is quite confusing. However AFAIK, RELU seems to work just as well or even better, often with faster training times.
RELU(x) = x if x>0 else 0
The point is that this keeps most of the linearity but provides a cut-off point below which everything is a flat zero. This is a lot easier to analyse carefully.

We can think of each neuron as a function of the input space (extend {0,1}^n to R^n since the functions can take any number in R as input). It is convenient to represent this scalar function as a gradient field in R^n.

In the zeroth layer, each neuron represents a dimension, so the n neurons represent the n basis vectors.

In the first layer, if activation is not used, then each neuron is first a linear sum of the inputs, so as a function it is a hyperplane in the input x output space. The gradient is constant in the whole space (picture n=2: a plane where every point maps to a fixed vector). A RELU activation introduces a cutoff line, perpendicular to the gradient, below which the gradient is zeroed out. Crossing a cutoff line leads to a fixed jump in gradient depending on which direction you cross it in. When n>2, the cutoff is a hyperplane of dimension n-1, but for now we stick to visualising n=2.

In the second layer, you have a linear combination of outputs in the first layer with some cutoff. In this way (pre-activation), you can have a function with several different cutoff lines inherited from the first layer. These don't have to be parallel, and the gradient can point the opposite direction as well, cancelling out gradients. The activation step leads to a new cutoff "curve" which is always perpendicular to the gradient there, but which separates the spaces into >0 and <0, replacing the gradient with zero if the function is <0. Many segments of this curve will be parallel to cutoff lines in the first layer, but where nonzero gradients from different neurons overlap, the curve may bend to smooth out the differences in gradient.
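
A small numerical check of this picture (with made-up weights): the gradient of a tiny 2-input ReLU net is constant inside each region carved out by the first-layer cutoff lines, and jumps (or vanishes) when a line is crossed.

Code:
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Tiny 2-input ReLU net with made-up weights: two first-layer neurons, one output.
W1 = np.array([[1.0, -0.5],
               [0.3,  1.0]])
b1 = np.array([0.0, -0.2])
w2 = np.array([1.0, -2.0])

def f(x):
    return w2 @ relu(W1 @ x + b1)

def numerical_grad(x, eps=1e-5):
    g = np.zeros(2)
    for i in range(2):
        d = np.zeros(2)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

# One point per region: both neurons active, only one active, none active.
for x in ([2.0, 2.0], [-2.0, 2.0], [-2.0, -2.0]):
    print(x, numerical_grad(np.array(x)))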

Is there still a sense in which if a neuron is created with m weights, then varying those m weights means the function space has dimension m? I think so, yes. Especially with the standard technique of adding a constant bias neuron at each layer and connecting it with a weight to all neurons in the next layer: this means addition of an arbitrary constant is possible in each neuron, so the cutoff line can be shifted back and forth as well.


Metrics and simplifying models
Go has its ELO ratings as a model, but as katago helps demonstrate, good strategy against weaker players can be very different from good strategy against stronger players.
This is because it is a two player game and your results depend very much on your opponent. When you have three player games, things tend to get even more complicated, and standard game theory notions of "optimal play" regardless of your opponent's actions stop making sense because if your opponents always cooperate, you always lose.

In a one player game, the final score works as a good metric.

Let us consider a one player perfect information game with two choices at each of n-1 moves with no repeated positions. The game ends after those n-1 moves and is scored. Then there are 2^0+2^1+...+2^(n-1)=2^n-1 possible states. It is natural to use n bits to encode the state.

How much different does a well designed encoding make compared to an arbitrary one? For example, on a go board, similar local shapes crop up a lot, but is this enough to make a big impact?

To simplify things, we assume that the neural network only outputs a value number in R which is trained to match expected score. During game play, the bot reads ahead one move and uses cross entropy to calculate the probability of playing the next move. Say the two possible expected scores from the current position are a and b. Then the probability of playing to a is e^a/(e^a+e^b). A multiplier alpha can be inserted so it is e^(alpha*a)/(e^(alpha*a)+e^(alpha*b)).
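
A quick sketch of that selection rule (the expected scores are made up):

Code:
import math

def move_probs(a, b, alpha=1.0):
    # Probability of choosing the move with expected score a rather than b,
    # using the softmax-style rule above with multiplier alpha.
    ea, eb = math.exp(alpha * a), math.exp(alpha * b)
    return ea / (ea + eb), eb / (ea + eb)

print(move_probs(0.7, 0.3))             # alpha = 1
print(move_probs(0.7, 0.3, alpha=10))   # larger alpha -> closer to always picking the better move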

If the network is big enough (say 2^n weights), then you expect this to train to perfect accuracy with zero loss (but not necessarily perfect play). If there were 2^n independent weights training on each position, this is guaranteed to converge to perfection (if the weights don't bomb out and die with zero gradient). This is because all positions are reached with nonzero probability, and the 2^(n-1) final scored states will be trained to perfection since they are trained to match a fixed number. Other states will have more fluctuations since the expected score from such positions will depend on the probability distribution of moves chosen from that position, which depends on the imperfect neural network. However, once the 2^(n-1) final states are perfectly scored (within eps<<1), the distribution is fairly fixed, so the 2^(n-2) states that are one move before the end of the game will be training to match an (almost) fixed number, and so on, gradually making the expected score for previous states, all the way back to the start, more accurate. It is unclear to me how the non-independence of the weights in a neural network affects training accuracy.


A simple example
Using the assumptions above, let's set n=3 and use a very simple network with no hidden neurons, just 3 weights from input to output. (In theory we would need 7 weights to map all 7 states in general.)

If we ignore (0,0,0), let's say the game tree is

root = (1,0,0)
(parent: children) pairs:
(1,0,0) : (0,1,0) (1,1,0)
(0,1,0) : (0,0,1) (0,1,1)
(1,1,0) : (1,0,1) (1,1,1)

The rightmost nonzero bit indicates the level, and the bits to the left of that indicate the parent.
with final scores A,B,C,D in (0,0,1),(0,1,1),(1,0,1),(1,1,1) respectively.

I wrote a simple program (I just started learning tensorflow yesterday) to train on this model. RELU led to a dead network around 10% of the time, so I switched to LeakyReLU. I can't make any general observations yet, but I do note that nodes which lead to bad scores don't get many playouts and tend to have much more inaccurate value estimates. This is because they occur less often and so contribute less to the loss function.
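
A hypothetical reconstruction of the setup (not the actual code) might look roughly like this: the scores A..D and the optimiser settings are made up, and only the terminal states are trained here, whereas the real experiment also generated playouts from the root.

Code:
import numpy as np
import tensorflow as tf

# Terminal states of the toy game and their (made-up) final scores A, B, C, D.
terminal_states = np.array([[0, 0, 1],
                            [0, 1, 1],
                            [1, 0, 1],
                            [1, 1, 1]], dtype=np.float32)
scores = np.array([1.0, -1.0, 2.0, 0.0], dtype=np.float32)

# "No hidden neurons, just 3 weights from input to output", with LeakyReLU on the output.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, input_shape=(3,)),
    tf.keras.layers.LeakyReLU(),
])
model.compile(optimizer=tf.keras.optimizers.SGD(0.05), loss="mse")
model.fit(terminal_states, scores, epochs=500, verbose=0)
print(model.predict(terminal_states, verbose=0).ravel())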


Have I really learnt anything yet? I seem to have made a small step towards answering my question on dimensions but otherwise, I still don't really understand the interaction between network architecture and ability to train to the right answer. I also don't much understand what changes for a 2 player game, or how the input data format can help or hinder information content/flow.

I suppose that strength in a 2 player game is about the confidence and skill to choose a line where you have to play m moves perfectly in a row in order to profit (compared to a simple variation), as well as lots of ways to filter out bad moves to raise the lower bound on how bad a move you might play. Much of "aggression" is probably to assume that you will play such tight (紧凑) sequences more perfectly than your opponent. In particular, being able to live in your opponent's moyos and save weak cutting stones.

What I really want to see is some relation between the network structure (e.g. neurons in a bottleneck) with a small number of key concepts (human or otherwise) that we can measure and interpret with some kind of symbolic logic/arithmetic. But I don't know if that is reasonable or what that would look like or how to make progress.


Last edited by dhu163 on Sat Sep 24, 2022 4:36 pm, edited 6 times in total.

This post by dhu163 was liked by: Elom0
 Post subject: Re: Neural networks optimising mapping functions
Post #2 Posted: Sun Oct 17, 2021 4:23 pm 
Lives in sente

Posts: 757
Liked others: 114
Was liked: 916
Rank: maybe 2d
It sounds like you've dug into this quite a lot! There are lots of minor things that are stated a little inaccurately, or ways things are phrased that are oversimplified, but also yes a lot of your understanding is reasonable and good.

The most notable thing of these minor things: the total number of possible Go board positions is not a relevant number, it's a vast, vast overestimate that you shouldn't pay much attention to.

Note: technically we should talk about game tree complexity, not state space size, but for simplicity I'm going to conflate and ignore that for this post.

For starters, even if you're trying to rigorously prove a solution to the game or to play optimally, with good algorithms you generally only care about the union of the set of positions reachable by either:

* Black plays optimal or near-optimal moves, while white plays anything. (To prove black's best result >= x, you must consider every white resistance, but for each of black's turns you need only demonstrate ONE good enough move).
* White plays optimal or near-optimal moves, while black plays anything. (To prove black's best result <= x, you must consider every black resistance, but for each of white's turns you need only demonstrate ONE good enough move).

You probably have this intuition as a Go player too - when analyzing a position, subvariations that can only result from multiple obvious mistakes by both sides are irrelevant to determining the overall result. So since at any given moment we entirely drop the branching factor from black, or entirely drop the branching factor from white, we roughly square-root the number of reachable positions that we care about (and good algorithms can obtain nearly a sqrt speedup as they go, even if they don't know for sure at the outset what moves will be optimal).

Now we can go further. In the context of neural nets, normally we also abandon the idea of rigorously solving the game, because with a neural net, even if it does learn to play optimally, you don't have much ability to rigorously prove that those particular millions of numeric parameters happen to encode an optimal player other than to just resort back to exhaustive search anyways. So rather than rigorous solution, we simply care that *in practice* it appears to be unbeatable vs anyone or anything. Or sometimes we care only about the lower standard of merely beating the best human players.

Then you can get another *vast* improvement beyond even the square root. If the game tree were absolutely arbitrary with no correlation whatsoever between the outcomes of different moves, no repeated patterns or strategic regularities, etc, then square root via exhaustive search is best you can do. But game trees for real life games are not arbitrary - they arise out of particular rules and mechanics, resulting in exponentially vast amounts of repeated patterns, correlations, regularities, strategic principles, etc. Just like humans, neural nets will learn these principles ("eyes", "good shape vs bad shape", etc) that generalize across vast numbers of positions, and then refine them and learn the exceptions, and then refine them further and learn the exceptions to the exceptions, and so on.

Again, this should be intuitive to you as a Go player. Consider a 5x5 board. The optimal result is black opening in the center and then owning the whole board. If we do the same ultra-crude estimation, we have sqrt(3^25) ~= 900k. Do you have to memorize the status of 900k completely distinct and independent things in order to always win the entire board on 5x5 as black? No, of course not. While fully exhaustive proof of black's win would require showing how black refutes :w2: A1, :w2: A2, :w2: A3,... all separately and exhaustively, and for each of them, how you refute every :w4: and then every :w6: and so on... if we only want optimal-in-practice rather than rigorous proof, we only need to read out or study about a limited tree of critical lines where White resists strongly, and for refuting all the other obviously bad moves like A1, A2,... it's enough to play according to simple strategic shapes and principles that *generalize* across all these exponentially many possibilities, and by even the most pessimistic ways of quantifying and counting this fuzzy value, there are going to be far fewer than 900k shapes and principles to learn.

I would guess with a good algorithm and architecture, you could get to optimal play on 5x5 from the opening position with a net that has somewhere between 100 and 5000 parameters. Which is a pretty massive improvement over sqrt(3^25), much less 3^25.

Anyways, the point is - the size of the state space or game tree is really more of an idle curiosity than a useful number, if all you care about is practical play. It's such an absurdly vast overestimate, and the improvements that neural nets can give you are so dependent on the nature of the game and the details of your algorithm and architecture, that even using it as a starting number from which you quantify what improvements you can get is not useful - you're better off ignoring it entirely and just looking at or trying small practical experiments to empirically measure what you care about directly.

 Post subject: Re: Neural networks optimising mapping functions
Post #3 Posted: Tue Oct 19, 2021 7:38 am 
Lives with ko

Posts: 248
Liked others: 23
Was liked: 148
Rank: DGS 2 kyu
Universal go server handle: Polama
Even with linear activation functions, it doesn't have to follow that (A is a good move) + (B is a good move) = (playing A and B is good). At the input to the network, we enter the state of the game. We then multiply the representation of that state against various weights and sum into the next layers. It would be completely reasonable for a network to have learned "if I played at A, then the positions close by are less valuable now", which might include B.

Imagine an input neuron X, and then a complicated network, and then an output neuron Y. The value of X (1 or 0), gets multiplied by various weights and summed into other values and multiplied and summed and smears out through the network and eventually influences the value of Y. But since everything is linear, we can look at all the paths X took to get to Y, and sum them up, and get a function like Y = 5X - 8Z + 3W + ...

So essentially, a multi-layer network with linear weights ends up collapsing down into a simple linear equation anyways.
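
A quick numpy check of that collapse, with made-up layer sizes:

Code:
import numpy as np

rng = np.random.default_rng(0)
# Three "layers" of purely linear weights (no activation), made-up sizes.
W1, W2, W3 = rng.normal(size=(5, 4)), rng.normal(size=(3, 5)), rng.normal(size=(1, 3))

x = rng.normal(size=4)
deep_output = W3 @ (W2 @ (W1 @ x))          # pass x through the layers one by one
collapsed = (W3 @ W2 @ W1) @ x              # the same thing as a single linear map
print(np.allclose(deep_output, collapsed))  # True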

Once you add the nonlinearities, the paths the value of X takes are all complicated, and don't just sum directly into a final influence weight, so you can represent much more intricate decision boundaries.

Quote:
Do the dimensions match? Is dimensions the right way to think about the problem?


Not really? First, the weights in a network are analogous to synapses in the brain. So although the "neuron" counts are getting close, it's not a meaningful comparison.
Second, you only need to represent the full dimensionality of your problem if your problem is pure noise. If every Go position was a 100% unique problem where you could apply absolutely nothing from your previous experience, you'd need enough capacity to memorize the right answer to every problem. In actuality, there are general concepts, and common shapes, and locality where a corner group is dead even if we change stones around in the far corner. So playing at some level X (even if X is perfect play) needs some tiny fraction of the space it would take to encode the right answer for every single possible position.


Quote:
I still don't really understand the interaction between network architecture and ability to train to the right answer.


Nonlinearities make the math work. Empirically, the larger the network the better it performs. The theoretical underpinnings of that aren't great, but it shows up in practice. If a network was just barely large enough to solve a problem, then only a very specific, small number of weights would work. If the network is far larger then needed, then there's lots of ways to parametrize it to solve the problem, and so it's "easier" to train. Why this continues to matter so much at ever increasing sizes is harder to explain.

The network is also the way you connect up the layers. This usually encodes some human intuition about a problem, like: "if I rotate a picture, it should still be the same picture" or "it should be easy for a network to throw away unneeded information as it processes the words in a sentence". You could think about information theory, and the complexity needed to represent various decision boundaries in a given architecture, but there's quite a bit of "throw stuff at the wall and see what sticks"

Quote:
What I really want to see is some relation between the network structure (e.g. neurons in a bottleneck) with a small number of key concepts (human or otherwise) that we can measure and interpret with some kind of symbolic logic/arithmetic. But I don't know if that is reasonable or what that would look like or how to make progress.


Explainable AI, and connecting symbolic logic with deep learning, are major outstanding challenges in the field. People try lots of things (like looking for statistical correlations with individual neurons firing and some feature of the board, like the presence of a dead group), but it's really still early days. So if you care enough to dig to the cutting edge, you can advance human knowledge! But on the flip side, it's not the sort of question where there's ready answers.

 Post subject: Re: Neural networks optimising mapping functions
Post #4 Posted: Tue Oct 19, 2021 9:15 am 
Lives in sente

Posts: 757
Liked others: 114
Was liked: 916
Rank: maybe 2d
Yep, that's a pretty nice overview. One quibble though:

Quote:
Even with linear activation functions, it doesn't have to follow that (A is a good move) + (B is a good move) = (playing A and B is good). At the input to the network, we enter the state of the game. We then multiply the representation of that state against various weights and sum into the next layers. It would be completely reasonable for a network to have learned "if I played at A, then the positions close by are less valuable now", which might include B.


If you are talking about the "policy" function (i.e. predicting where a next good move is), then sure, you could have linear weights such that a stone at either of A or B reduces the policy output value at the other location for it being the next move. Each location can have an initial positive value for being empty, and a negative weight for the presence of a stone at the other location. But dhu163 didn't use the phrase "good move", he instead talked about whether having one or more stones is better or worse for a player, which is easier to interpret as a statement about the value function.

And for the value function, dhu163 is completely right. Suppose we have board states X, and A (which is X + one stone) and B (which is X plus a different stone) and C (which is X plus both of the stones from A and B), and suppose we use a normal input encoding consisting of the presence/absence of a stone of each color on each location of the board, such that when encoded as vectors/tensors, C = X + (A-X) + (B-X)

Then if the value function f is linear, we have f(C) - f(X) = f(X + (A-X) + (B-X)) - f(X) = f(A-X) + f(B-X) = (f(A) - f(X)) + (f(B) - f(X)). In other words, the amount of value by which C is better than X is precisely the sum of the amount that A is better than X and B is better than X.
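
A small numerical check of this identity for an arbitrary linear f (toy encoding, made-up weights):

Code:
import numpy as np

rng = np.random.default_rng(1)
n = 10
f_weights = rng.normal(size=n)                 # an arbitrary linear value function f(v) = w . v

X = rng.integers(0, 2, size=n).astype(float)   # some base position encoding
X[:2] = 0.0                                    # the two points we add stones at start empty
A = X.copy(); A[0] = 1.0                       # X plus one stone (toy one-hot encoding)
B = X.copy(); B[1] = 1.0                       # X plus a different stone
C = X.copy(); C[:2] = 1.0                      # X plus both stones; note C = X + (A-X) + (B-X)

f = lambda v: f_weights @ v
lhs = f(C) - f(X)
rhs = (f(A) - f(X)) + (f(B) - f(X))
print(np.isclose(lhs, rhs))                    # True for any linear f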

 Post subject: Re: Neural networks optimising mapping functions
Post #5 Posted: Thu Oct 21, 2021 12:03 pm 
Lives with ko

Posts: 248
Liked others: 23
Was liked: 148
Rank: DGS 2 kyu
Universal go server handle: Polama
Excellent point lightvector, you are correct that a linear value network won't represent the idea that stones can be good or bad in combination with each other, and that that's closer to dhu163's point. In essence we're saying A XOR B is good (exclusive or), and one of the key observations during the development of neural networks was that linear functions can't represent XOR.
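
A quick check of that classical fact: the best affine fit w.x + b to the XOR truth table is just the constant 0.5, with nonzero residual.

Code:
import numpy as np

# XOR truth table: no affine function w.x + b reproduces it exactly.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

A = np.hstack([X, np.ones((4, 1))])               # columns: x1, x2, bias
coef, residual, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)      # best affine fit: roughly [0, 0, 0.5]
print(residual)  # nonzero sum of squared errors: the fit just predicts 0.5 everywhere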

 Post subject: Re: Neural networks optimising mapping functions
Post #6 Posted: Fri Oct 22, 2021 6:07 am 
Lives in gote

Posts: 470
Liked others: 62
Was liked: 278
Rank: UK 2d Dec15
KGS: mathmo 4d
IGS: mathmo 4d
Thanks for all the explanations.

 Post subject: Re: Neural networks optimising mapping functions
Post #7 Posted: Tue Dec 28, 2021 4:50 pm 
Lives in gote

Posts: 470
Liked others: 62
Was liked: 278
Rank: UK 2d Dec15
KGS: mathmo 4d
IGS: mathmo 4d
I would like to note:

If no bias is introduced, then with n input bits and only RELU activation, all zero input is always mapped to zero output in every neuron. And I think it turns out that however many hidden neurons you have, you always get an output where the cut-off hyper-planes go through zero. Any two such planes only intersect at zero so the gradient at x only depends on the ray from the origin (0 to x to infinity). But to produce such a neuron only one hidden layer is necessary. That is, a sequence of nets each with only one hidden layer can produce in the limit any output neuron that can be produced by an arbitrary number of hidden layers.

I don't know what happens with a bias, but I assume that changes things completely so that it is completely general. I can see you can create any function if you can create a delta function around zero (and then shift it with a bias), and something close to a delta function can be created in the limit in the 2nd hidden layer with only 2n offset neurons in the first hidden layer. If correct, it seems that theoretically, only 2 hidden layers are required to produce any neuron you want. I have oversimplified, but perhaps this is not far from the truth, needs more thought.

edit: (more thought)
(edit 20220910: this first result is quite confusing but seems true (for input dimension at least 2) with a slight rewording (the key is about compact support). If I find a clearer proof, I will add it to a longer paper on my wordpress.
Any function on input dimension 1 can be approximated with just 1 hidden layer as my experiments suggest.
Also function should be replaced by continuous function with supremum norm.)
THM: with RELU, we cannot create an arbitrary neuron with just 1 hidden layer
Proof:
hmm, actually the cut-off (n-1)-plane includes an (n-2)-plane at infinity and so do all the elements close to that. But the value at (i.e. in the limit) each point on the (n-2) plane is either infinity (it is in the non-zero region of a contributing neuron with cut-off plane not parallel to it) or it is the sum of the finite contributions from neurons with cut-off plane parallel to its plane. Hence for the limit of an output neuron to be zero far enough away, i.e. tending to zero at infinity in every direction would imply that the sum of finite contributions from neurons with cut-off plane parallel to its plane is zero for every such (n-1)-plane. However, the output value at a point in the n-space is the sum of these contributions from all (n-1) planes that go through it, which is therefore zero.

(in the above replace zero with non-positive as required if considering activation in the 2nd layer)

Hence we have proved that any non-zero output neuron in the 2nd hidden layer cannot tend to zero at infinity in all directions. QED (hopefully).

This could perhaps be proved by inversion, consider 1/(output). I suspect this to be true of any activation function that produces a kind of 1d result, but it might merely be RELU?

Note that we can certainly make a bounded non-zero neuron by placing two neurons with parallel cut-offs but different sign next to each other. Call these pipe neurons.

This result implies we cannot create any general function with just 1 hidden layer. And yet we clearly can with 2 hidden layers, by adding two pipe neurons, each only non-zero around the zero point of their own (n-1) planes, but with non-parallel (n-1) planes. Then this only reaches its highest point where the two (n-1) planes intersect and the rest can be discarded and set to zero with a bias. By taking multiples, we can make the highest point arbitrarily high, and can also make it nonzero only in an arbitrarily small n-volume around the point we want.

so
THM: we can create any delta function in the limit with 1 hidden layer of RELU
hence
THM: we can create any general non-negative function in the limit with 2 hidden layers of RELU. We can create any general function in the limit with 2 hidden layers of RELU if no activation function is applied to the output neuron.

THM: if there is only one input neuron, we can create any general non-negative function with 1 hidden layer of RELU (exercise for reader)
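
A sketch of how the one-input case works: three ReLU neurons in a single hidden layer (using the bias for shifts) make a "tent" bump of height 1 around a chosen centre, and sums of scaled copies of such bumps can approximate any continuous non-negative function on a bounded interval. The centre c and width w below are arbitrary.

Code:
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# One input, one hidden layer: a tent bump around x = c built from three ReLUs.
def bump(x, c=0.0, w=0.5):
    return (relu(x - (c - w)) - 2 * relu(x - c) + relu(x - (c + w))) / w

xs = np.linspace(-2, 2, 9)
print(np.round(bump(xs), 3))   # zero away from c, peaking at 1 at x = c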

Does this matter for training?
Probably not much, since we normally only care about the output on a small box and not at infinity, but we do care that there are holes in function space that can't be mapped.


But even then, I think the metric/topology of neural nets and how different activation functions/architectures influence training is surely still an interesting topic of study, because it helps match the net to what you want to use it for. It is also very hard (but non-equilibrium thermodynamics seems hard too, has been a hot topic for a while, and has produced some results).

For example:
  1. What is the effect on training of having a large net with a hidden layer with a number of neurons much less than input and output? Does this slow down evolution to match that layer?
  2. Does it help to think of a net in entropy or energy terms with forces at the input as well as forces at the output, propagating through the net?
  3. What do we optimise? Training time, network size, accuracy etc?
  4. For a given accuracy, what architecture minimises (number of neurons) * (training time)?
    1. Given just two hidden layers of arbitrary size and a bias, how well does a net train? Does it always converge to global optima?
    2. My intuition is that with 1 input neuron and 1 output neuron and 1 hidden layer of arbitrary size initialised randomly, and sufficiently small increment and say a weighted least squares cost function, then given any positive function that is the "goal" to match (on a bounded region), this net should train to perfection. My intuition says what the output orders is what it gets, and all orders can be met.
      I wanted to say that least squares cost means that weights being equal, there is greatest pressure on neurons in the layer before to be a multiple of the ideal output function, but this is just false.


      Well, since we are using gradient descent, if we use a sufficiently large batch and small enough increment, the cost has vanishingly small chance of increasing. Hence, the goal is to show that there are no local optima except when the output matches the goal. This depends on the hidden layer having a sufficiently random set of weights and for it to stay that way during training.

      We can see that local optima will occur if gradient is zero (so all hidden layer neurons are orthogonal to the difference between output and ideal after applying a filter to remove where either is zero) and Hessian is positive definite (saddles are possible but vanishingly low probability so ignore for now).

      Perhaps the asymmetry that priority is given to neurons with higher weights can be beneficial here to avoid symmetry and maintain randomness.
  5. With randomised training, which local optima are reached more often? How do we optimise training to allow us to reach a good local optima most of the time?

Jokes aside, I can't help thinking of a neural net as a poor life-form that is made part of a circuit every time we train it. Backpropagation is a Huxley-style electric shock therapy to train it to act as we wish. We add a connection to the input and the output and hit the power on.

The only difference perhaps to humans is that human body architecture has a lot more inputs and outputs than engineers can access as well as internal complexity (not repeating patterns) and was "designed" for different purposes, so we can't alter the internals so easily.


Last edited by dhu163 on Sun Sep 11, 2022 8:03 am, edited 6 times in total.
 Post subject: Re: Neural networks optimising mapping functions
Post #8 Posted: Wed Dec 29, 2021 8:49 am 
Lives in gote

Posts: 470
Liked others: 62
Was liked: 278
Rank: UK 2d Dec15
KGS: mathmo 4d
IGS: mathmo 4d
Some tensorflow experiments seemed initially to show I am wrong about convergence with 1 input and 1 hidden layer. For example, I tried an ideal function of sin(5x) and it still mapped 0 to 1. It didn't get much success mapping near x=0 nor x=1 though it was fairly accurate in the middle.
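
A rough reconstruction of that experiment, for anyone who wants to reproduce it (the layer size, optimiser and training length below are guesses, not the originals):

Code:
import numpy as np
import tensorflow as tf

# 1 input, 1 hidden ReLU layer, 1 output, trained to match sin(5x) on [0, 1].
x = np.linspace(0.0, 1.0, 200).reshape(-1, 1).astype(np.float32)
y = np.sin(5 * x)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(30, activation="relu", input_shape=(1,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=1000, verbose=0)

pred = model.predict(x, verbose=0)
print("RMS error:", float(np.sqrt(np.mean((pred - y) ** 2))))
print("value at x=0:", pred[0, 0])   # compare with the sticking behaviour described above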

My guess is a sort of prisoner's dilemma issue is occurring because two hidden layer neurons have to work together in order to get the initial kink, but each such neuron is close to orthogonal to the error term. Alternatively, we can say that sum of at least two neurons is required, but training causes the coefficient of one to decrease, which is the wrong direction.

This is presumably when a 2nd hidden layer will help a lot, by encouraging teamwork like a team captain. A caveat is that competition between team captains for team members' attention can cause trouble if they pull in opposing directions, since that makes the neuron freeze.

Again, bad metaphors, but I can't help but wonder if forcing teamwork is part of the reason sexual reproduction has had more visible success than asexual reproduction.


My experiments show that given n neurons per layer, there is a good minimum number of hidden layers h that works well:
* n < 4: with h < 10, an RMS error of 0.2 is possible but it can't fit the initial kink and maps 0 to 1; with h > 10, it struggled to reduce RMS error below 0.6 even with many layers. I think there may be a vanishing gradient issue with more layers as the input area gets "colder", so it doesn't notice what the input even is and just outputs a constant value. Is this akin to temperature?
* n = 5: h = 8
* n = 30: h = 5 (6 converges faster)
* n = 100: h = 3
* n = 10000: h = 1 (with 5 times more rounds of training data), but even then the RMS error is still 0.08, several orders of magnitude worse than other tries.

I think that more neurons per layer always improves accuracy (if same training data and increment small enough), but more layers can make things worse especially with few neurons per layer.

My explanation is:
The net struggles most when the error function is orthogonal to the neurons in the previous layer, but even this can be offset by changes to the neurons in the previous layer. When there are 10,000 neurons, you would expect that it can't be orthogonal to all of them even after the chaos of training. So I think it also struggles when the error is of the form n1+n2+c, where n1,n2 are neurons in the previous layer but due to the c term, the inner product with n1 and n2 is negative.

This is true for my problem above with 1 layer. The required offset to fix the problem around x=0 requires a sum of neurons like ideal(x)-model(x)=(RELU(x)-RELU(x-b))/b-1 with b~0.3. However, the inner product of this with RELU(x) is something like -b/2<0.
Since the gradient looks like the (inner product)*(weights), this reduces the contribution from RELU(x) which can't be good.

The inner product with RELU(x-b) is zero since each is non-zero at different points ("they have independent support").

How does another hidden layer greatly reduce the problem? It makes it more likely that a neuron of the form n1+n2 is already floating around, at which point that will just plug nicely into the error term, with maximal covariance (=inner product /norms). The connections also mean that such a neuron will readjust coefficients to be more like n1+n2 if necessary.


20220520 edit
Further experiments show that just 5 hidden neurons with 1 hidden layer produce an error of 0.12 with only 1000 rounds of training. In a sense, the many more neurons are not helping yet.

Even just 2 neurons per layer with 4 layers managed to reach error 0.02 with 600 training rounds.


Last edited by dhu163 on Fri May 20, 2022 6:44 am, edited 3 times in total.
 Post subject: Re: Neural networks optimising mapping functions
Post #9 Posted: Wed Dec 29, 2021 11:54 am 
Lives in gote

Posts: 470
Liked others: 62
Was liked: 278
Rank: UK 2d Dec15
KGS: mathmo 4d
IGS: mathmo 4d
Back to

Is it useful to apply "energy/entropy" to neural net training?

I try to compare some metaphors/analogies to familiar physics. Please note this is merely an outsider's perspective.

Firstly, some popular science style thoughts on life/earth energy/entropy. Slightly scatter-minded.

Equilibrium entropy is much easier to understand but this requires no interaction with the environment.

On Earth, we have a periodic energy source of sunlight which drives activity on Earth, as well as heat radiation back into space. Other than key laws of energy and momentum conservation (connected to space-time symmetries of physics by Noether), we don't have many other key laws. Momentum conservation suggests more heat will be radiated from Earth away from the Sun than towards the Sun, but the Earth itself could also absorb the momentum by shifting its orbit slightly so we can't be sure.

Now, entropy theory tends to suggest that all else equal, energy will spread to all possible dimensions evenly. My interpretation is that this suggests that even with a non-equilibrium source of momentum (sunlight), there will be a natural balance between how much sunlight is reflected back to the Sun, how much is absorbed on Earth, and how much is reemitted in various directions. Quantum physics can study how special "potentials" interact and solve equations in extreme/ideal limits (complete reflection, zero reflection etc.), showing how to design equipment that has such properties, as well as helping intuition interpolate for what happens in less extreme conditions.

Life is part of this cycle as Sun energy goes into plants, heating the ocean and land, turning CO2 into O2 etc. Animals add equilibrium with their job of storing and recycling this excess O2 back into CO2, releasing heat in the process. Equilibrium thermodynamics will call us part of the dissipation system of the excess energy. A stronger understanding of a periodic (and essentially fixed) energy source would probably treat the whole system as a series of transitions of structures that enable and react to the imbalance caused by the low entropy property that the Sun has so much more energy than the Earth. Perhaps a better understanding of the time arrow of entropy and memory would give us even more insight.

Why does life seem to contradict the 2nd law of thermodynamics that says energy becomes more even over time? The 2nd law suggests things become more chaotic and random, which normally is viewed as less repeating patterns.

Well, the duplication property of DNA seems strange since there is some imbalance between the development of increasingly complex instructions and the 2nd law. Dawkins has compared life to crystal (non-living) growth. My interpretation is that sometimes repeating patterns such as in a solid are the lowest energy forms of that material (when like attracts like etc.), but it very much depends on the environment. Knocking a bottle of super-cooled water can cause it to freeze because it was already cold, but simply hadn't connected into crystal form yet, but once a small part of it forms into ice due to random chance (caused by heat from knocking), other molecules slot into place naturally (and the kinetic energy of liquid molecules goes into vibrations of the ice as heat/sound that escapes through the bottle into the environment). Probably there are many similar patterns that we don't have the senses to notice unless we pay careful attention, and I'm sure DNA is far from the only way hereditary complex instructions are passed down, even though we are most aware of its long and successful history. Also, extinctions show how fragile life is, so we don't expect never-ending increase in complexity. It depends on which dimensions you measure in, and that is biased by how you are designed and trained.

Humans are small enough compared to the Sun, that you can probably expect something to come of the Sun's energy, and since the time scale of the Sun's life-cycle is billions of years, the energy source is unlikely to change much, so at the human scale, entropy effects on the Sun are not noticeable without a long time-scale of data.

However, the origin and nature of our life/intelligence and potential other life-forms remains very puzzling. As well as how nature has guided humanity to technology and population leaps over the last few hundred years. Can we interpret this situation as having been set up for us millennia ago? What are the driving factors? Access to oil, electricity, global warming, globalisation, trade/information networks?


Entropy in a computer
A computer is a machine carefully set up with an essentially infinite energy source so that it responds (like lego) to tiny button presses that flip a bit. I suppose that other than the designer and manufacturer of the parts (including how nature set up the silicon states), the main low entropy source here primarily comes from the user/programmer and the choice of programs they run. We say the user has a motivation/intent/goal so that even if the computer has issues/errors, the user works through them to get what they want. This includes encouraging the manufacturer to make longer lasting designs which protect key components from environmental interference.

In this way, the entropy that really counts (when energy is infinite) is information. In a neural net, this comes from the input/output. But I don't really understand how to measure this or if it even makes sense globally. Probably some kind of basis is needed.


Energy/power in a neural net
Training a net takes work. This matters physically since we need to change the weights and compute the network's output. However, in an abstract sense, we can perhaps sense that the gradient descent method means that when we train, we apply a certain "power" to shift the weights that is proportional to the weights, and this flows in one direction (guided by the method and not a natural force field).

Since gradient descent means the change in weights per step (like time, so this is like velocity) is proportional to the gradient of the loss function (of the weights), this is different to having acceleration proportional to the gradient (of a potential energy field). So we don't exactly have energy conservation. We can perhaps think of it as time goes faster when the gradient is higher, whereas low weight connections are in hibernation. Perhaps it is slightly like the opposite of getting sucked into a black hole. As the weights spiral into a local minimum, step size gets vanishingly small so that it never reaches the minimum.
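
To make the contrast concrete, here is a toy 1-D comparison on the quadratic loss L(w) = w^2/2: gradient descent sets the velocity proportional to the gradient, while a "physical" particle in the same potential would have its acceleration proportional to the gradient (the numbers are arbitrary).

Code:
import numpy as np

grad = lambda w: w     # gradient of L(w) = w^2 / 2
lr, dt = 0.1, 0.1

w_gd = 2.0             # gradient descent: step per iteration ~ -gradient
w_ph, v_ph = 2.0, 0.0  # "physical" particle: acceleration ~ -gradient
for _ in range(100):
    w_gd -= lr * grad(w_gd)   # creeps into the minimum with ever smaller steps
    v_ph -= dt * grad(w_ph)   # keeps oscillating instead of settling
    w_ph += dt * v_ph

print(round(w_gd, 4), round(w_ph, 4))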

It isn't energy flow as such since the activity is dominated by a driving mechanism that brings in enough energy as needed.


Loss satisfies a heat equation
The closest thing I can think of that gradient descent on a quadratic well satisfies is the heat equation, which says velocity is proportional to difference. The (weight) space velocity is proportional to the difference in loss to nearby points. This is different to the standard picture where the temperature change at a point is proportional to the difference in temperature at nearby points. In our picture, the loss as a function of weights doesn't change, but instead the position in weight space changes. In some sense, the loss is absorbed by the weights.

Entropy is still relevant in this picture since we can treat the weight space as our (possibly infinite) state space with various deterministic trajectories during training. However, given initialisation is random, if many trajectories die or otherwise end up at local optima, entropy can help explain why.


Circuit analogy of a neural net
Well a neural net is a bit like a circuit diagram, though flow is discrete and only occurs in one direction at a time. One end (the input) is held fixed, and the other nodes must balance the "voltage difference" between the output and the input by shifting connections accordingly.

If we talk about training "power", is there a corresponding current and resistance?


Supply/demand in a neural net
Perhaps it is more like economic flow?

If neurons in successive layers can get more and more complex, we can think of a net as "mining" and "refining" the natural resource of functions of various shapes trying to meet demand as best as possible. More layers allow more refinement. More neurons in a layer allow more types of resources and hence more useful combinations. Too many layers means there is a disconnect between demand and supply.

During backpropagation (of gradient during training), a neuron strengthens/weakens connections to inputs based on the feedback it gets from its "client" neurons (that take its output) AND gives feedback on what it demands to its input neurons. Sometimes this demand can be met. Sometimes it can't. If weights are randomly initialized, this randomness is a source of all sorts of material, but with more training, input neurons can become stale and clump together and become hard to change. Most obvious is if all input neurons tend towards the same function, after which this reduces the dimension to 1, and the symmetry of training may mean this situation continues forever. So it is possible for resources to dry up, though it is hard to explicitly describe them or measure which have and haven't. (It depends on the training process).

In a way, we can view backpropagation as client neurons paying input neurons in gradient tokens that they can then use themselves to pay their own input neurons. However these tokens are a bit weird in that their magnitude matters most (positive or negative), and is converted by a multiplier to the next neuron along.


Fragility
The most connected neurons are most in demand and are under greatest pressure when a change (say in the ideal function) occurs; gradient flow is highest through them. This can break them if the learning rate of the optimiser (i.e. step size) is too high, and they would only be in such demand if they were useful, so breaking them is the worst case. Smaller neurons shift more slowly and are ignored by the controller, but may gradually rise later too.

NNs respond to the slightest training, and bad training can break them.


Gradient vanishing
In Go and other combinatorial games, to solve the game tree, we can in theory slowly work backwards from endgame with little reference to the opening. However, in a neural network, optimising the weights must happen by coordinating between input and output simultaneously. If gradient vanishes, then flow of communication is blocked and no more change can occur. We only want this to happen when loss is very low and not at a bad local optimum.

We get congestion when all input neurons are orthogonal to the gradient or a non-zero number have negative covariance despite their contribution needed to increase (or vice versa), since their covariance with the contribution needed from other input neurons counters their own. One idea is to unblock by adding a new random connection at this point, a new connection (perhaps a new neuron) that connects backwards many layers before to a more general state. I understand this sort of thing has been tried and "block" networks are a version of this.

Like a surgery, we don't want to damage more than help. Letting the net sort itself out is normally automatic and requires the least care. Some neurons will go extinct if other neurons match the result so perfectly that they no longer contribute. But that is helpful for trimming the size of the net.


Contribution of a neuron
How do you measure the contribution of a neuron to success in the output? The gradient gives a nice measure of this and there must be some balance equations here. Some neurons go to sleep with some inputs just like day/night.

For example, the chain rule offers a balance equation that says (contribution from mth layer)+(contribution from weights between mth and (m+1)th layers) = (contribution from (m+1)th layer)

But perhaps a better formula for a neuron is the sum of contributions from all the weights connected to it (what happens if they are zeroed out?)


ReLU vs Sigmoid
My impression:
ReLU is flexible in creating tools you need from scratch even if they can't be interpreted. It is simple nuts and bolts with decision boundaries.
Sigmoid has more of a probability (distribution) or yes/no flavour and might be more easily interpreted. It is more logic based with pretty maths.


What happens at a bottleneck in a neural net?


Last edited by dhu163 on Thu Dec 30, 2021 4:59 am, edited 3 times in total.
 Post subject: Re: Neural networks optimising mapping functions
Post #10 Posted: Wed Dec 29, 2021 12:35 pm 
Lives in gote

Posts: 470
Liked others: 62
Was liked: 278
Rank: UK 2d Dec15
KGS: mathmo 4d
IGS: mathmo 4d
Architecture: 1 neuron per layer, 1 input, 1 output
THM: Suppose an NN has 1 neuron per layer with ReLU activation as well as a bias input. The set of possible outputs are functions f which are linear on some (possibly infinite) interval (x1, x2) and constant elsewhere.
Proof: If the input to RELU is of this form f, so output is RELU(wf+b), then as RELU has derivative zero or one, the chain rule implies that the output is constant except on the intersection of the input's linear interval (x11, x12) and where the input is positive. On this intersection the output has gradient w times the non-zero gradient of the input there. As the multiplication by w and addition by b is monotonic, hence this intersection must also be an interval. In fact, this interval must have at least one endpoint being x11 or x12. QED
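
A quick numerical illustration of the theorem, with made-up random weights: composing several single-neuron ReLU layers leaves a function that is linear on at most one interval and constant elsewhere.

Code:
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(3)
xs = np.linspace(-3, 3, 13)

# Chain of layers with a single ReLU neuron each (random weight and bias per layer).
out = xs.copy()
for _ in range(4):
    w, b = rng.uniform(-1, 1, size=2)
    out = relu(w * out + b)

# Linear on (at most) one interval and constant elsewhere, matching the theorem above.
print(np.round(out, 3))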

We can see that the dependence of such an output on weights remains non-zero for all weights only in the (possibly empty) linear interval. In the constant region, near to the interval, it might have non-zero dependence on the weight immediately before but likely has zero dependence on weights before that.

It is plain that
THM: Suppose an NN has 1 neuron per layer with ReLU activation. Suppose we randomly initialise weights (in a bounded interval, WLOG [-1,1]). As the number of layers n increases, the probability that the entire output is constant tends towards 1.

The point is that so many filters have been applied that in total they likely obscure the entire input. The power of entropy is too high here compared to the driving motor.

In contrast, if the filters are all aligned so they don't obscure, then they are also not very useful as layers, and their job is merely to carry the input from the previous layer to the next layer. Preservation, not computation.

Conjecture:
Suppose we randomly initialise weights (in a bounded interval WLOG [-1,1]). If we fix each hidden layer to have k neurons, then as n the number of layers increases, the probability the entire output is constant tends towards 1.


Some quantifying results

Suppose an NN has 1 neuron per layer with ReLU activation. Then the outputs of such a net are bounded between the minimum and maximum [l,h] (which may be infinite). As ReLU is non-negative, we have h>=l>=0. After another layer of ReLU with random weights initialised uniformly in the range [-1,1], then [l,h] maps to [wl+b, wh+b]. So if w,b<0 which happens with 1/4 probability, both endpoints will be negative and hence the whole interval maps to something negative which is mapped to zero by ReLU. So the output will be constant.

This implies that after n layers, the probability the output is constant is at least 1-(3/4)^n which tends to 1 as required. Even if w,b aren't both negative, the probability the interval will shrink is very high, and this is another way it tends towards a constant.

THM: The probability the output is unbounded (so h=infinity) is 2^(-n).
Proof: This occurs if and only if every weight w is positive.

The conjecture is similarly true by basic Markov chain theory. Explicitly we have:
With k neurons a layer, the probability a neuron in the next layer is dead is at least the probability that all relevant weights (including bias) are negative which is 2^(-k-1). With k neurons, the probability of one dead neuron is higher (close to k 2^(-k-1)), and similarly, the probability of all dead neurons is non-zero, around 2^(-k(k+1)).

The probability of the entire layer being constant after n rounds is at least 1-(1-2^(-k(k+1)))^n. Again, this tends (very slowly) to 1.
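
These bounds can be checked by simulation. A rough Monte Carlo sketch (the width, depth, ranges and the way "constant" is tested on a sample of inputs are arbitrary choices, not part of the argument above):

Code:
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def output_is_constant(k, n_layers, n_inputs=50, rng=None):
    # Random net with k ReLU neurons per hidden layer, weights and biases ~ U[-1, 1].
    # Call the output "constant" if it is identical on a sample of random inputs.
    if rng is None:
        rng = np.random.default_rng()
    h = rng.uniform(-1, 1, size=(n_inputs, k))
    for _ in range(n_layers):
        W = rng.uniform(-1, 1, size=(k, k))
        b = rng.uniform(-1, 1, size=k)
        h = relu(h @ W + b)
    out = h @ rng.uniform(-1, 1, size=k)
    return np.allclose(out, out[0])

rng = np.random.default_rng(0)
for n_layers in (5, 20, 80):
    p = np.mean([output_is_constant(2, n_layers, rng=rng) for _ in range(200)])
    print(n_layers, p)   # the estimated probability grows with depth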

In practice
Although we are only considering initialisation and training can move the weights out of [-1,1], this analysis remains relevant for entropy purposes. If there are a lot of states with dead neurons hanging around, the gradient vanishing problem will be significant and training is likely to be slower or even stall.

However, please note while a network with all weights zero is completely dead and unresponsive to training, as long as there is any non-zero weight connected to the output neuron, this can be revived as the bias neurons are non-zero so connections can slowly be made again (after a cold winter).


Entropy and non-equilibrium
If we continue the analogy to supercooled water freezing at a knock, this seems to be like a situation where we have a fairly flat set of nearby energy states (with many liquid molecules running around bumping into each other) which probably has higher entropy (more possible states) than the fixed frozen state (where some dimensions of kinetic energy are dropped), however there is a vortex with an energy level just above which leads to a wormhole down a well, at the bottom of which is the fully frozen state, and a lower energy than the liquid. However, reaching the vortex requires some energy input first.

Time for a thought experiment. Consider dropping a stone into a pond. This is clearly a non-equilibrium situation but why? How does the entropy increase after normal physics is followed?

My interpretation.

There are forces and energy exchange even in (dynamic) equilibrium states. For example, the pond has lively liquid molecules running around and bumping into each other. However all such molecules are identical so many of these states are identical and overall they all seem similar with our measuring ability, even if we can see some waves etc. There may be salt dissolved in them but at equilibrium, these should be uniformly spread throughout the pond, perhaps slightly below the surface (due to higher density). The boundaries of the pond are hardly at equilibrium since a (dead) planet at its lowest energy state should have layers organised according to density, and stone and water shouldn't be at the same level. This hasn't happened partially because "work" and energy input is required to mix and move them and the water doesn't have enough energy to solve that so quickly (but don't underestimate its power), as well as weather cycles/rain meaning water rapidly fills underground caverns faster than they cave in.

So the stone drops into the lake. This is a very low entropy state, since there is so much gravitational potential energy difference between the stone and the water/air. In addition, stone is not the same as water, so there is some mixing/dissolving tendency. Gravity tends to put like densities together, so we expect the stone to settle at the bottom, which is a much lower energy state. The quickest energy conversion is the gravitational potential energy converted into kinetic energy of the stone, then of the water, and then of the bottom upon impact. It is a much lower energy state to have water above stone. However, the energy dissipates in waves and ripples, taking time to spread out. It is as if the energy has a momentum of its own that is carried away until blocked by something (i.e. a shockwave). It is not clear to me (I didn't study fluids) how much energy is absorbed locally as permanent kinetic energy (heat) of the water and how much gets radiated out; even if friction doesn't dissipate the wave's energy locally, the pressure still falls off as it spreads through 2 or 3 dimensions (i.e. inverse or inverse square, or something in between).

Over time, this wave might be reflected back by boundaries, and the pond should even out in temperature and flatten again. Small effects may mean that a heavier stone is covered by sediment and less likely to rise during torrential flooding etc. This will make the densities align further, even if very slowly.

There should be a way to measure the timescale at which energy transitions occur in non-equilibrium situations, based on the interface between energy states. I haven't researched what work has been done.

We don't experience much of the energy transition ourselves as we watch, but we can appreciate some of it, such as the sight of the motion of a falling object (perhaps some excited emotions as we anticipate its effect), as well as the sound of the splash and sight of some ripples. But most of the energy conversion barely reaches us and is barely noticeable.

In our neural network example, it seems like the loss function sends repeated shockwaves through the net, but they can be reduced and blocked by the zeroing out property of ReLU, or reductions in the loss function. If the neurons are orthogonal to the gradient, the waves just pass straight through the neuron with no effect, and get absorbed at the input/bias neurons.
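A tiny NumPy sketch of that blocking effect (all numbers illustrative): for a single ReLU neuron, the shockwave from the loss only propagates back to the incoming weights and inputs when the pre-activation is positive; otherwise the gradient is zeroed on impact.

[code]
import numpy as np

def relu_neuron_grad(w, b, x, upstream_grad):
    """Backprop through one ReLU neuron y = relu(w.x + b).
    Returns gradients w.r.t. w, b and x."""
    z = w @ x + b
    gate = 1.0 if z > 0 else 0.0          # ReLU derivative: the gradient gate
    g = upstream_grad * gate              # zeroed out if the neuron is 'dead' here
    return g * x, g, g * w                # dL/dw, dL/db, dL/dx

w = np.array([0.5, -0.3])
b = -1.0
live = relu_neuron_grad(w, b, np.array([3.0, 1.0]), upstream_grad=1.0)   # z > 0: gradient passes
dead = relu_neuron_grad(w, b, np.array([0.1, 0.1]), upstream_grad=1.0)   # z < 0: gradient blocked
print(live)
print(dead)
[/code]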


Legacy functionality needs protection of vital parts to prevent rot
If training shifts over time, I'd expect the net to quickly adapt and forget valuable lessons it has learnt. With a pool of neural nets, I think that selecting from competition adds some selection bias towards protecting valuable and generalisable lessons. I'm not sure how this would appear in practice, but I imagine a series of filters guarding the output region of the network so that it is resistant to being changed by backpropagation except through specific channels (metaparameters?). These filters cover all parts of the gradient that aren't orthogonal to the neurons, meaning that forward computation is let through (so that the network contributes functionality), but the backpropagating gradient just scatters upon impact.

I suspect that it is this sort of defense that leads to local optima that are hard to escape from, since it isn't easy to upgrade the legacy functionality.


Last edited by dhu163 on Mon Jan 24, 2022 10:52 am, edited 1 time in total.
 
Offline
 Post subject: Re: Neural networks optimising mapping functions
Post #11 Posted: Thu Dec 30, 2021 8:28 am 
Lives in gote

Posts: 470
Liked others: 62
Was liked: 278
Rank: UK 2d Dec15
KGS: mathmo 4d
IGS: mathmo 4d
I've just written up a few formulae trying to get conservation laws/inequalities during neural network training. I've got something I've called throughput, though unfortunately that is already an NN term. It is simple enough that I think people likely understand the theory quite well already, but I enjoy working things out myself anyway.

https://dhu163go.wordpress.com/ai/

In the process, I've noticed that backpropagation produces wave effects that are quite akin to quantum double-slit effects.

Edit: 20220102
I suppose the multiplier could be thought of like a transformer for electricity. And my throughput seems a little like the Poynting vector, though that kind of assumes 3D space.

I've seen neural network architecture diagrams used to model and prove the Euler-Lagrange (least action) equation, as well as logical formulae.

I suppose my summary is that neural networks are so general that their structure represents a lot about the way humans reason. There are discrete objects with links and fuzzy logic connecting them. Here fuzzy doesn't mean that there isn't a simple equation connecting them, but merely that it isn't Boolean logic with 0s and 1s.

What ideas do neural networks struggle to represent?

I'm not really sure yet. But I can see that while neural networks with cycles can be designed (which can act as homeostatic machines whose stable states can be mapped consistently to information, akin to eigenvectors and eigenvalues), they don't seem easy to solve for or completely understand without running them, though I expect they demonstrate many cool qualities.

Also, they throw a lot of garbage out with their filters. It might not be so easy to understand the effects of random rounding errors or the dimensions of what information they are throwing away.


Classical info theory vs info theory needed for NNs
In Go, the rules set up a system; we hook it up to an NN training system and state information transfers across. It seems interesting to study the entropy, the blockers, and the most common information that is transferred.

Classical info theory is about discrete representations with a finite dimension of symbols (encoding when both input and output are finite dimensional and simple). With NNs we deal with infinite dimensions of info, but our neurons, weights and possible transforms are finite.

Interfaces, laws, and focusing on just a small aspect seem to be how humans have tackled such problems.


A note for myself
In many ways, the input acts as a root.

To aid intuition, think of a puppet fixed to the floor with strings pulling on its limbs. Parallel strings pull most taut and may even unravel (i.e. training to f(x)=x is normally very speedy unless there is some knot in the middle). Otherwise, even if one component unravels a lot (with large weight magnitude), getting very fat in contribution and bearing most of the burden, it still has connections that it relies on, and if those don't stretch far, then it can't stretch far either.


Sketchy thoughts on NN/entropy
Is NN adjacency like KLD? If two nets start similar, do they stay similar (and is that stable)?

In the initial training burst, the speed may be fast in minimising loss, but that can put the filters "into alignment" and it can be hard to untangle them again: the most relevant dimensions may be lost to entropy and filtering (much like the long thin ReLU net above). "Healthy" information flows must normally go through many routes simultaneously.

Otherwise, we can study how different inputs have different gradients. The Poynting vector is a cross product, but we have related quantities (like determinants) over the different functions (of different neurons in a layer).

If A=B (two neurons in the same layer), we expect their gradients to be similar (though as the output weights may differ, different things may happen), and the shifts may differ due to different inputs, but the demand sent back to the inputs will be the same, just with different multipliers.

If A+B=C (neurons in the same layer) and the input and output weights satisfy that relation too, then A+B should evolve in exactly the same way as C, I think.

If so, to understand what will evolve the least, we just need to understand which linear combinations are zeroed out by neurons in the same layer (much like x<0 is zeroed by ReLU).


Following on from the above, a layer has a local degenerate symmetry if the input weights W1 and output weights W2 are such that vW1=0 and W2v=0 have a common non-zero solution v.

This symmetry might not be stable, but it will be locally maintained in the short term. It is interesting to work out when the left kernel of W1 and the right kernel of W2 come close to alignment and how often that occurs. In combination with a bias, it is also relevant.
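A small numerical check of this condition, as a NumPy sketch (the matrices here are random placeholders, not weights from a real net): intersect the left kernel of W1 with the right kernel of W2 via an SVD, and any surviving direction v is a locally degenerate symmetry in the sense above.

[code]
import numpy as np

def degenerate_directions(W1, W2, tol=1e-10):
    """Return an orthonormal basis of vectors v with v^T W1 = 0 and W2 v = 0,
    i.e. the intersection of the left kernel of W1 and the right kernel of W2."""
    # Stack the two conditions: v^T W1 = 0  <=>  W1^T v = 0, together with W2 v = 0.
    A = np.vstack([W1.T, W2])
    _, s, Vt = np.linalg.svd(A)
    null_mask = np.concatenate([s, np.zeros(A.shape[1] - len(s))]) < tol
    return Vt[null_mask]

k = 4
W1 = np.random.randn(k, 6)   # weights into the layer (k neurons, 6 inputs)
W2 = np.random.randn(3, k)   # weights out of the layer (3 outputs)
print(degenerate_directions(W1, W2))   # usually empty for random full-rank weights
[/code]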

Neural networks are like the surface of crumpled rope or proteins. Am I going mad? + a message for the next generation
They are more like the flow of water waves through designed pipes, since they are mostly additive/linear, with some blocks so that flow only occurs in one direction. This is why they seem so similar to the double-slit experiment in quantum mechanics. The weights don't adjust to handle every case, but instead adjust to handle the average.

Mirrors are there to draw you in, to attract suspicion. You can break through at times like fish leaping out of water. Be accepting of the different shaped beings that appear and also afraid. Mirrors can burn both sides with their sharpness and store souls and impressions inside. The black back protects it from the sharpest light but not from physical forces. What might you find behind an old mirror? Impressions, or have they all escaped?

The will of humans can take them and use them to design tools, but is it really worth it? They say science works. But who are we working for? Time? Space? Energy? Momentum?

If men are from Mars (hot bubbles that guide war under the red crust), and women are from Venus (with hot poisonous obscuring fumes) as we were told as kids, what lies in between?

History suggests high well-fed populations break out into war on their neighbours with their confidence and internal pressure. Without strong defences, their neighbours may even encourage this, and find ways of coping, absorbing and exchanging. Who knows what reaction will occur?

When great structures collapse (from fighting neighbours or cousins), other cousins need to be quick to take over and claim the goods but everyone is different and each has their uses. Fossils from millions of years before might have been initially avoided due to poison/heat (volcanoes) then pressure (under the sea), but junk in old storage can still be powerful. Humans can use them as fossil fuels, information, DNA, inspiration, artwork. Or do we recognise them and lay them to rest respectfully even if they aren't our own direct ancestors, they still have powerful descendants (dinosaurs -> birds etc.).

We can't use them forever, but who are we? Images of us are passed in every touch, sight, reflection and leave impressions that are passed on and may even come back to us. Some images might even have "life" at different rates with spans different to our own. And yet we must only leave a small window of communication between or else all the pressure leaks out too quickly and we commit suicide. Walls or grassland? We only have one dimension of sentences, one dimension of DNA, one organ for procreation. (normally at least)


tides
I'm realising that tides / London (van der Waals) forces are closely related to NN training, since they require and create resonance on both sides until they ring in time. How do you measure the strength of resonance?

 
Offline
 Post subject: Re: Neural networks optimising mapping functions
Post #12 Posted: Sun Mar 13, 2022 2:32 pm 
Lives in gote

Posts: 470
Liked others: 62
Was liked: 278
Rank: UK 2d Dec15
KGS: mathmo 4d
IGS: mathmo 4d
Some more poetic thoughts on neural networks.

The controller has near complete power over little computer experiments, in accordance with their intelligence, which is limited by metaproperties in the environment. Each iteration, each neuron is updated by the same algorithm. The input neuron reacts with output errors to produce a new neuron (using energy and data storage from the computer) that reduces the overall error (gradient descent). This symmetry already suggests a crystal, populated by the same sort of elements.

If errors are like a wind doctor, a perfect neuron will be orthogonal (no attraction), and will delete the error from passing back down the ranks rather than letting the tester/leech continue feeding on its supporters. Alternatively, they greedily consume the error like fire and direct it to the computer's bin.

Each possible input can represent a Cartesian axis-like string as it is processed by the net. These interact in interesting ways, and lasting strings may not be parallel to these axes.

What is memory in a dissipation system?
Relics of preparations against past reactions that can still be used memorably for similar future reactions.

What can be called life-forms, which are interactions, which are reactions, which are disasters? Does like attract like? Although the neurons themselves can't move (like spatial axes), they can send proxies via functions that combine (and are exact copies), and surely some of these can be described as life forms? Or do only cycles (such as Earth's rotation etc.) cause life that we can recognise?

In order to match the ideal final pattern (like the master/driver/guide), like males courting a woman (or vice versa), work is done to build up the internals to match the ideal interface, requiring fire that burns the old systems to dust. In this model, there is one final trained net, whose weights can easily be copied over (like DNA) and cloned on a computer of sufficiently similar architecture.


Review of thoughts so far.

Measures: accuracy, size (efficiency=accuracy/size), training speed

Optimisation: architecture; how to choose depth and width? More neurons per layer generally helps accuracy, whereas more layers can suffocate the input with sand, burying the input and snuffing the error flame. Shallow networks can be poisonous to the flame, like having lots of moist wood that hasn't been dried out yet. Lots of bad fuel eventually reacts, aligning with the force, but very slowly, as most of it is light garbage that gets pushed to the side quickly (like rust protecting iron from the air).

Activation: ReLU or sigmoid, and what are the effects?

What low-hanging fruit hinders the perfection of a healthy net? Which shallow local optima can be fixed by a knock? Competition requiring a leader, communication, ...

Are there local/global conservation equations? Energy, momentum, entropy, information. What constants exist, like elements? How do they form groups that are cycled through the system?

Standard encoding theory assumes constant Cartesian axes and measures the alignment of distributions with KLD. Information theory measures the reliability and efficiency of information flow. In a neural network, perhaps this is a bit more difficult with shifting neurons.

Meta: For protection of vital parts, a skin-like interface is needed to prevent all but a few errors from interacting with key parts. This needs to be trained by competition and tends to lead to creation of species that don't combine well (since their key parts are in different places). One net self-training is a bit asexual, and arguably sexual selection is useful for building such skins. In neural networks, what do these look like? It is probably difficult to upgrade this sort of protection as it keeps degrading naturally without support and maintenance.

Philosophy: This is based on the assumption of perfecting a tool. However, sometimes an imperfect tool is more stable, and more useful for lasting relationships, than a bomb that destroys the opponent.

Strong connections exert pressure on their neighbours to support them and take off the burden rapidly, while still dominating in power and remaining relatively constant themselves. There may be trade, but also poisons that destroy neighbours.


Diagnostics
Dead neurons have zero output.

Test 1: What is the distribution of neurons on the edge of death (i.e. only just still non-zero)? Is it uniform? If it isn't, that may indicate a lack of resources.
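One possible implementation of Test 1, as a NumPy sketch (the layer sizes and the "edge of death" threshold eps are arbitrary assumptions of mine): push a batch of inputs through a ReLU net and, per layer, count neurons that are dead on the whole batch and neurons that are only barely alive.

[code]
import numpy as np

def layer_death_report(weights, biases, X, eps=1e-2):
    """For each ReLU layer, count neurons that are dead on every input in the
    batch X, and neurons whose maximum activation is positive but below eps
    (the 'edge of death')."""
    h = X
    for i, (W, b) in enumerate(zip(weights, biases)):
        h = np.maximum(0.0, h @ W + b)
        peak = h.max(axis=0)                       # max activation per neuron over the batch
        dead = int(np.sum(peak == 0.0))
        near = int(np.sum((peak > 0.0) & (peak < eps)))
        print(f"layer {i}: {dead} dead, {near} near-dead of {h.shape[1]}")
    return h

rng = np.random.default_rng(1)
sizes = [8, 16, 16, 16, 4]                         # illustrative widths
weights = [rng.uniform(-1, 1, (a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [rng.uniform(-1, 1, b) for b in sizes[1:]]
layer_death_report(weights, biases, rng.uniform(0, 1, (64, sizes[0])))
[/code]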

The largest (flow) neurons are shifted the most each iteration. Smaller neurons are more delicate, but sometimes when things align, they suddenly come into high demand.

More layers add more complex combinations and possible interweavings of neurons. It is just random chance, but it increases the chance of one coming into better alignment. It also allows more complex filters of errors, to prevent them destroying earlier neurons. However, sometimes there are so many filters that the first light of input is unable to get through. Stability occurs when all inputs are orthogonal to the errors.

In each layer, a game is played with resources from previous layer, to match demand in the next layer.

Over time, the chance of neurons in the same layer becoming similar (linearly dependent) to each other is quite high, since they are competing to match the same demand. It is the responsibility of the connections to the next layer to keep the neurons separate to prevent such symmetry and degeneracy. More random inputs may be required to fix such a situation.

Hypothesis: tapering down is often a good idea. For example in the final layers, going from 10 to 5 to 2 to 1 neurons may be better than 10 to 1 directly.

Type of degeneracy:
1) Identical neurons in same layer
2) Linearly dependent neurons in same layer
3) Looking backwards, identical connections to neurons in the next layer
4) Looking backwards, linearly dependent connections to neurons in the next layer
5) Meta: from a lack of neurons there may not be enough dimensions

Each layer of neurons acts like a finite set of dimensional axes (like a series approximation) from an infinite possible set. The output is projected onto this and the axes try to adjust accordingly. Just as the next layer tugs on the output, the previous layer also pulls back on inputs.

Questions:
1) Can you work out the complexity and optimal architecture by directly looking at the ideal function?
2) How does instability in the ideal function (over time or from error) affect training?
3) How does the learning rate vary (or need to vary)?

Note that the long slog of training will produce a net that is likely to be quite stable, sitting in a good locally optimal well. Human ingenuity may produce designs with flaws and holes, even if they fit a temporary purpose.


Stability on one hidden layer with ReLU
As above, we can think of reducing errors equivalently as the inputs trying to make the errors orthogonal to them.

Which functions are more difficult to match? Presumably those with more kinks, such as fractal zigzags. Since they are overall orthogonal to a flat-line function, they quickly get zeroed out, and only chance events allow a matching.

With two hidden layers, if the first hidden layer manages to produce a neuron that fits one zig, then the second hidden layer will quickly "crystallise" to produce the zags that complete the fit, like fireworks. In this way earlier layers are designed to try to fit the most common repeating patterns, while later layers manage their combination.

Perhaps this is much like a utopian model of economic production or teaching etc.

If there is only one hidden layer, each zig needs to be fitted independently.
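As a concrete (and entirely illustrative) way to test this picture, the following PyTorch sketch fits a zigzag target with a single wide hidden layer and with two narrower hidden layers; the claim above predicts the deeper net should crystallise the zags more easily. The widths, learning rate, step count and the choice of PyTorch are my own assumptions, not anything established above.

[code]
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-1, 1, 256).unsqueeze(1)
y = torch.abs((4 * x) % 2 - 1)          # a zigzag (triangle wave) target

def fit(model, steps=2000, lr=1e-2):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

one_hidden = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
two_hidden = nn.Sequential(nn.Linear(1, 16), nn.ReLU(),
                           nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 1))

print("one hidden layer :", fit(one_hidden))
print("two hidden layers:", fit(two_hidden))
[/code]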

Let's call this theory "conceptual unit theory".


"Conceptual unit theory"
This suggests the head layers need enough neurons per hidden layer to produce the w different key concepts. The tail layers need enough depth to fit the concepts together to match the function. They will also borrow from other concepts and edge functions to round things off for a smoother output.

There may be different ways to discretise the concepts, but the number of dimensions is unlikely to be much different. A different network with more space or different resources may find it more efficient to split one standard concept into several, or merge several into one, but still reach the goal.

The key is repetition.

Concepts will be important because
1) They can be produced
2) They help match the output well
i) Either filling a big hole
ii) Or several similarly shaped holes

The explanation for why a network with one very large hidden layer will eventually converge but very very slowly is
1) There are enough basic resources but they are all crude and sharp edged, like a square peg in a round hole.
2) They don't fit the output well.
3) A healer needs to add a lot of luck to the initialisation so that the error is small, or adjust the ideal function gradually from an easy match, training the features required for later before freezing them in storage. This artificially simulates the existence of more layers by using intelligence. That is to say, the teacher has a neural network themselves, and it needs to lock on (with information boundary) to support this dumber network to grow. KLD makes sense here, but note that information needs an encoding/interface (interpretation) to have any meaning beyond the local random effects.

In physics, we have a theory that has strong predictive power. We can say that if experimental data is the interface, then the overall KLD is very low. However, how do we describe how the information travels through the scientists, experimental machines and even generational education? It doesn't make much sense to judge the local information transfer without the larger context or a fixed set of local rules/ideals.

The mathematics of such can align to describe many different objects and relations, but also add confusion and orthodoxy as different versions of the same theories compete for attention.
This is why I can love maths.

The explanation for why a network with too many layers converges even slower is
1) The input is almost completely obscured by filters. The bias neurons are closest to the output and the most connected but once again, they are crude.
2) A healer needs to add more connections from earlier in the network or artificially revive existing dead connections.

The ideal network


A remaining question I don't fully understand is how this works in a two-player game. What features allow a network to keep progressing relentlessly? What is the healer's job?


This post by dhu163 was liked by: Elom0
 
Offline
 Post subject: Re: Neural networks optimising mapping functions
Post #13 Posted: Tue Mar 15, 2022 5:16 pm 
Lives in gote

Posts: 470
Liked others: 62
Was liked: 278
Rank: UK 2d Dec15
KGS: mathmo 4d
IGS: mathmo 4d
Two-player game

Both players pull towards their ideals when choosing the route, though only one route can be taken. Two players can introduce blind spots and confusion of common languages.

With two lines, there are three types of possible evaluations, with common reasons for their occurrence:
1) Both think line A is better than line B for player X, and this is true. This is correct though the degree of better may be incorrect.
2) Both think line A is better than line B for player X, and this is false. This is mutual misevaluation. Most commonly, player Y has an unusual trick that neither side has recognised.
3) Both think line A is better than line B for them. One side has been tricked. The asymmetry of the game hasn't been strong enough to withstand a short circuit. Are both trying to avoid line B for some reason, such as the force?

For 3), this is like flight from B into a fight at A, both saying "my god is better than your god".

Why do mistakes occur?
1) They are on rare/bad lines that aren't normally reached, so they aren't useful/necessary to learn. This may be permanent, though evaluation will vary with the environment. These pick up the "slack" of common features.
2) They are on good lines but have contradicting features that are commonly used as guides elsewhere. How do you sum features that aren't numbers?
3) An external factor is manipulating them. e.g. a human might wish to avoid complicated lines they don't understand or that waste energy. (and this analysis of mine adds more to the waste or feedback loop)
4) In general, they are at the interface of the known and unknown.

Why does the net keep improving?
What sorts of short circuits are likely?

For 2), chance alignments of features both are chasing can draw them to the same "watering hole". If this happens on a large scale, training will focus on correcting those, so that is unlikely. Instead small scale skirmishes test each other's confidence.

In the case of self-play, the features for each side are deeply interwoven, coming together and difficult to separate.


A silly encoding:
Power BI might be more than that.
pid/mongke


More metaphors
Although a net can be seen as putting copies of the input together to match the output, we can also look one step ahead and see that the output is used to make a decision.

So the net is a machine to separate good moves from the rest of the game tree.

If we think of concepts in the network in the same way, can we measure the "isolatedness" of different threads in the network, or are they all interwoven like rope? (Although neurons are discrete, they likely still satisfy complex periodicity around optima; see discrete difference equations.) We can do this for different things that we view as "threads", whether the kinks (gradient changes) or the gradients themselves.


Energy
There are simple slippery slopes with one hidden layer, as neurons that are almost zero everywhere except one spot (specialists) naturally fit that region. This should smooth out the tail end of the error. Note that a line smoothing a dip with its top at zero will only start around 1/4 of the way through, focusing on raising the lowest point while not raising the end zero by too much.

Likewise the previous peak will also be lowered (ideally), which again lowers the last dip. Note that the positions of the peaks/dips don't change; only the overall gradients do, with a new kink added where the smoothing kink lay. This flattens out the peaks/dips and pulls them backwards (the smoothing kink and the peak/dip should attract each other if applied correctly). Overall, as edge effects are stronger, they will repel the edge kinks the most, while peaks/dips in the middle will shift less.

If that produces a good overall fit, then the next correction may flow naturally on and on.

However, often there is an activation barrier, and the second hidden layer helps by mixing things up with little effort, like a catalyst.


Misc Observations
LeakyReLU vs ReLU. Initially, a higher alpha seems both to train much faster and to be less likely to have died. Perhaps it works better with a lower learning rate.
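For reference, a minimal NumPy sketch of the two activations and their gradients (the alpha values are illustrative): the small slope of LeakyReLU means the backward gradient is never fully zeroed, so neurons are harder to kill outright, which is one reading of the observation above.

[code]
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.1):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.1):
    # alpha=0 recovers the ReLU gradient, which is exactly zero for z < 0
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(z), leaky_relu(z))
print(leaky_relu_grad(z, alpha=0.0), leaky_relu_grad(z, alpha=0.1))
[/code]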

Learning rate. This acts like a tuner/frequency. If too high, it simply can't catch the more precise downwards streams, like a fat snake can't pass through a small hole.

Similarly, the more a single large neuron dominates a layer of the network, the more likely the learning rate will cause it to overshoot, topple, and rewind to a more primitive state, i.e. receiving an overflow of resources from the previous layer. More stable (recently successful) networks are more likely to have a mix of large and small neurons that have supported each other. Then complacency may occur as the network settles into its new home, getting into a groove with the new surroundings.

ReLU is great for guiding our understanding since it is simple to understand the zeroing out feature and use it as a metaphor. It helps make it more obvious that whatever the activation (even if not ReLU), the more layers a network has, the weaker the connection between input and output and the more likely the network has degenerated into nonsense.


Testable hypothesis: A higher learning rate produces a more even distribution of neuron flow, or at least a lower maximum (i.e. Gaussian-like), whereas a lower learning rate allows things to settle and produces larger dominant neurons (like language: Zipf distributions).
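One way to make this measurable, as a NumPy sketch: here "flow" is taken to mean the mean absolute activation per neuron over a batch, which is my own stand-in definition rather than anything fixed above. The summary numbers (maximum flow, share of the top neuron, evenness) could then be compared between runs trained at different learning rates.

[code]
import numpy as np

def flow_stats(activations):
    """activations: (batch, neurons) array for one layer.
    Returns summary statistics of per-neuron 'flow' to compare across learning rates."""
    flow = np.mean(np.abs(activations), axis=0)
    share = np.sort(flow)[::-1] / (flow.sum() + 1e-12)
    return {
        "max_flow": float(flow.max()),
        "top_neuron_share": float(share[0]),                        # Zipf-like if one neuron dominates
        "evenness": float(1.0 / np.sum(share ** 2) / len(share)),   # 1 = perfectly even distribution
    }

rng = np.random.default_rng(0)
print(flow_stats(rng.exponential(1.0, size=(128, 32))))             # placeholder activations
[/code]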

20220329

Dear future self, if I survive, please forgive my actions in a time of potential war. "never forget" is probably too severe, but it can still hurt after so many years. code pancho: are you stupid.

Keep in mind that in multiplayer games, many compete for the same space.


20220524
NNs seem curiously segmented (fixed like an insect?). The policy says where to read, and the engine focuses on deep reading of complicated lines. This seems good for fighting variations and general strength.

The bias helps use this information for similar lines and shapes. This doesn't seem as perfect and general as an amorphous blob could be. It isn't perfectly designed for the game, but works more as a general-purpose tool. For example, it can miss "obvious" lines without using its reading (like CGT) to update the policy, not understanding reading patterns/tactics that the policy didn't consider in its output, such as sente, order/timing. A lot of information worked out in the network is hidden and forgotten. This makes it more organised, so design and use are easier (perfection is relative to the environment). Perhaps this is some insight as to why Katago's choice to put more work into control estimates was so helpful: probably the least work and the most efficient update. Like a pro computer scientist?

However, more could probably be done cleverly by working out implications of reading (like the bias methods), but mathematicians should probably get involved for that.


This post by dhu163 was liked by: Elom0
 