Nvidia RTX 30xx

For discussing go computing, software announcements, etc.
RobertJasiek
Judan
Posts: 6272
Joined: Tue Apr 27, 2010 8:54 pm
GD Posts: 0
Been thanked: 797 times
Contact:

Nvidia RTX 30xx

Post by RobertJasiek »

Code: Select all

Model         Tensor Cores   Tensor TFlops   Storage                               ALU Cores       TFlops 32b / 64b       SLI possible   USD (net)

RTX 2080 Ti   544            113.8           GDDR6  11GB         616 GB/s          4352            13.45 / 0.42           yes            ~1000
RTX 3080      272 = 0.5x     238   = 2.1x    GDDR6X 10GB = 0.9x  760 GB/s = 1.2x   8704  = 2.0x    29.77 / 0.93 = 2.2x    no               700
RTX 3090      328 = 0.6x     285   = 2.5x    GDDR6X 24GB = 2.2x  936 GB/s = 1.5x   10496 = 2.4x    35.58 / 1.11 = 2.6x    yes             1500
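A minimal sketch to recompute the "= x" ratio factors in the table from the raw numbers, with the RTX 2080 Ti as baseline:

```python
# Recompute the "= x" ratio factors from the table, RTX 2080 Ti as baseline.
# Columns: tensor cores, tensor TFlops, VRAM GB, bandwidth GB/s, ALU cores, fp32 TFlops
specs = {
    "RTX 2080 Ti": (544, 113.8, 11, 616, 4352, 13.45),
    "RTX 3080":    (272, 238.0, 10, 760, 8704, 29.77),
    "RTX 3090":    (328, 285.0, 24, 936, 10496, 35.58),
}

baseline = specs["RTX 2080 Ti"]
for model, values in specs.items():
    ratios = ", ".join(f"{v / b:.1f}x" for v, b in zip(values, baseline))
    print(f"{model}: {ratios}")
```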
Do Tensor TFlops refer to the combined speed of all Tensor Cores?

Do these values mean that 1x RTX 3080 is roughly as fast for Go-AI as 2x RTX 2080 TI (with SLI)?

Is the advantage of using 2 GPUs (with SLI) just 2x the speed of 1 GPU or is there an additional advantage?

Is having 0.5x the number of Tensor Cores any disadvantage, or is the factor 2.1x of Tensor TFlops the only relevant value?

Does 24GB instead of 10GB of storage just mean 1.2x faster transfers (936 GB/s instead of 760 GB/s), or can 2.4x more Go positions also be stored? How do these sizes of GPU storage and, say, 64GB of RAM cooperate for storing more Go positions?

Re: Nvidia RTX 30xx

Post by RobertJasiek »

Speaking about Nvidia RTX graphics cards.

Some hardware manufacturer says that 2x 2080 TI with SLI is 15% faster than 2x 2080 TI without SLI. So although the 3080 does not support SLI, my speculation now is that 2x 3080 would be slower only by a factor of 1.15 than 2x 3080 TI with SLI (if that were available).

Therefore, since 3080 is significantly faster than 2080 TI, hardware-wise 2x 3080 should at least be slightly faster than 2x 2080 TI with SLI, but at the acceptable price of $1400 instead of >$2000.
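The speculation above can be made explicit. A sketch, assuming the manufacturer's flat 15% SLI gain and using tensor TFlops from the table as a crude proxy for Go AI speed (a strong simplification):

```python
# Claimed: a 2080 Ti pair with SLI is 15% faster than the same pair without.
SLI_GAIN = 1.15

tflops_2080ti = 113.8   # tensor TFlops from the table
tflops_3080 = 238.0

pair_2080ti_sli = 2 * tflops_2080ti * SLI_GAIN  # 2x 2080 Ti, SLI enabled
pair_3080_no_sli = 2 * tflops_3080              # 2x 3080, no SLI available

# If TFlops were all that mattered, 2x 3080 would still be well ahead.
print(round(pair_3080_no_sli / pair_2080ti_sli, 2))
```

Under these assumptions the 3080 pair comes out ahead even against the SLI-boosted 2080 Ti pair, which is the conclusion of the paragraph above.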

Will this also be so software-wise? That is, do the Go AI neural net programs (KataGo et al.) use 2 installed graphics cards? Or would there be 2 graphics cards in the PC but only 1 graphics card actually used by the programs for a given task when there is no SLI?
thirdfogie
Lives with ko
Posts: 131
Joined: Tue May 15, 2012 10:08 am
Rank: British 3 kyu
GD Posts: 0
KGS: thirdfogie
Has thanked: 151 times
Been thanked: 30 times

Re: Nvidia RTX 30xx

Post by thirdfogie »

Robert,

I don't have any answers to your questions, but there is something else for you
to think about.

I have an NVIDIA GPU (GeForce GTX 1660) which I use to run old versions of
Lizzie and Leela Zero. When analysing a game, the 4 Intel CPU cores are kept busy
at an average load of 60%. Presumably, this load is needed to feed the GPU
with data and handle the results. The analysis also uses a lot of my main
memory: 8Gbytes is just enough. I usually have an SGF editor (Quarry) open at
the same time to record the results as SGF comments and labels, and the system
loading definitely makes editing with Quarry slow and difficult. It is
possible that the programs use as much memory as they can get: I have not read the code.

My point is that even if you can use two GPUs in parallel, your main CPUs might
not be able to keep the GPUs busy. You would also need to think carefully about
cooling, but you probably know that.

For reference, my PC has a four-core Intel i5 processor clocked at 3GHz. It runs
Linux (Debian version 10, Linux kernel 4.19). Other operating systems may have
more efficient GPU drivers. I have not read any other comments about the CPU
load when running a GPU for Go.

I hope this helps.
Polama
Lives with ko
Posts: 248
Joined: Wed Nov 14, 2012 1:47 pm
Rank: DGS 2 kyu
GD Posts: 0
Universal go server handle: Polama
Has thanked: 23 times
Been thanked: 148 times

Re: Nvidia RTX 30xx

Post by Polama »

Not an expert, so I also lack all the answers, but here's info that hopefully helps:

The new tensor cores are 4x faster. So half the cores, but 2x the overall processing power. I don't think the number of cores would often (or ever) matter, just the raw TFlops?
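The per-core factor can be checked directly from the table's totals:

```python
# Per-core tensor throughput, from total TFlops divided by core count.
per_core_2080ti = 113.8 / 544   # Turing
per_core_3080 = 238.0 / 272     # Ampere

# Each new tensor core delivers roughly 4x the throughput of an old one.
print(round(per_core_3080 / per_core_2080ti, 1))
```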

I think SLI is for using the GPU for its original graphical purposes? The deep learning libraries can farm out independent jobs to multiple GPUs either way. I'm not sure whether that requires the deep learning code to be written with that in mind or not: you might find it using just one GPU in practice.

The GPU RAM speed and size are distinct. The GB/s figure is the card's memory bandwidth: how quickly the GPU cores can read from and write to the on-card RAM. The RAM size is how much you can fit on the GPU at once. If you're training deep networks, more RAM lets you run larger batches, which speeds up training, but sometimes there's a point of diminishing returns where more RAM doesn't really help. I don't see why the GPU couldn't be evaluating many positions at once with more GPU RAM, but that comes down to the code. Basically, more speed will definitely help (if your CPU/RAM can keep it fed), and a bigger GPU RAM _could_ help even more, or not at all, depending on the code.

GPU RAM is distinct from your normal RAM. You want at least as much system RAM (or else you won't be able to keep enough data on hand to keep the GPU fed), plus a good buffer for normal computer stuff. Only board positions in the GPU are going to be processed, so having excess system RAM stops helping at some point.
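As a rough illustration of these sizes, here is a back-of-the-envelope count of how many encoded board positions fit in 10GB of GPU RAM. The 18 input planes are a hypothetical encoding assumed for the sketch, not any particular program's actual value:

```python
# Back-of-the-envelope: encoded Go positions per 10GB of GPU RAM.
BOARD_POINTS = 19 * 19
INPUT_PLANES = 18        # hypothetical feature-plane count, not any program's actual value
BYTES_PER_VALUE = 4      # float32

bytes_per_position = BOARD_POINTS * INPUT_PLANES * BYTES_PER_VALUE
positions_per_10gb = 10 * 1024**3 // bytes_per_position

print(bytes_per_position)     # ~26 KB per position
print(positions_per_10gb)     # hundreds of thousands of positions
```

In practice most GPU RAM goes to network weights and layer activations rather than raw positions, so the point of the sketch is only that position storage itself is not the constraint.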

Hopefully people will benchmark the 30xx chips, because that'll be the best way to see what the net impact of all the variables is. For most problems, training is harder than evaluating, so it's probably a bigger deal if you're training networks than just using them. But Go could certainly be different for all I know.

Re: Nvidia RTX 30xx

Post by RobertJasiek »

Since I would build a new gaming PC (Windows preferred), first I decide on the graphics card(s), then the other hardware, except that I am already convinced to implant at least 64GB RAM.

The later decision about the CPU is limited by money; more cores are better, but this is an open-ended parameter. Fewer than 6 cores make no sense, paying for 8 should be possible, 12 / 16 / more would be nice but the prices rise quickly.

Currently, my real concern is to get at least roughly the graphics card speed of 2x 1080 TI or (ca. 35% faster) 2x 2080 TI, because that means "usually stronger play than 9p". So I first need to find out how to achieve this without paying $3077 (net price) for 2x 3090 SLI and without paying too much for used 2080 TIs (whose value should now be at most $400 each, as already a 3070 is faster at $499 new).

If a gaming PC cannot be used for a second task (such as opening an SGF editor), this is no serious problem because I have my office PC.

I do not know yet if I also want to train a net. More likely, I just want to use existing nets.

Why do you think that tensor cores of the 3rd generation are 4 times faster than those of the preceding generation? I recall having heard the factor 2. One Nvidia diagram shows the factor 2.7 for applied use of tensor cores, but we have to be careful because we do not know all the presumed circumstances and parameters.

Having watched some youtube videos, I have learned that one cannot simply compare counts of hardware items, such as raw numbers of ALU cores or Tensor cores.

In a different thread, somebody with 2x 2080 TI SLI has said that it works well for some Go AI.

Right, much depends on how code is written, so we need statements from each programmer: SLI? Nvlink? GPU RAM size? RAM size? Recommended CPU cores? Etc.

Don't you think that the programs can dynamically store in the RAM instead of only using the GPU RAM?

Surely people in the web will benchmark 30xx cards during the coming weeks and months starting from September 17. Unfortunately, most will test 3D gaming while we are interested in deep learning tests.

For Go, training is harder than using nets.

Re: Nvidia RTX 30xx

Post by Polama »

RobertJasiek wrote:...Why do you think that tensor cores of the 3rd generation are 4 times faster than the preceding 1st generation? I recall to have heard the factor 2. One Nvidia diagram shows the factor 2.7 for applied use of tensor cores but we have to be careful because we do not know all presumed circumstances and parameters....
I came across it on Tom's Hardware, which usually seems trustworthy.

To be clear: the individual tensor cores are 4x, but there's half as many of them. From your numbers, you saw 0.5x cores and 2.1x more tflops total throughput (meaning about 4x per core). And as you note, benchmarks reported by the manufacturer can be misleading.
RobertJasiek wrote:Don't you think that the programs can dynamically store in the RAM instead of only using the GPU RAM?
The goal is to keep the GPU cores running as close to flat out as possible. As they finish calculations, you need to shove new network weights in. Reaching all the way out to computer RAM is a bottleneck and won't keep the GPU cores running at full speed. So basically any position that you want to evaluate in the next few milliseconds should be in the GPU RAM.

That said, you can have game histories, giant databases of pro games, whatever in RAM. You can have lots of trees that aren't being explored cached out there. But the AI won't get "smarter" with larger CPU RAM (past the point where the GPU RAM is being well fed): it's a question of how many board positions it can reason through (and through how big of networks), and that'll come down to GPU RAM.
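To put a number on the network weights that have to stay resident on the card, here is a rough estimate for a residual convolution tower of the AlphaZero family; the block count and channel width are hypothetical sample values, not any program's actual configuration:

```python
# Rough weight-memory estimate for a residual conv tower (hypothetical sizes).
BLOCKS = 40        # residual blocks, assumed
CHANNELS = 256     # conv filters per layer, assumed
KERNEL = 3 * 3     # 3x3 convolutions
BYTES = 4          # float32

# Each residual block: two 3x3 convolutions mapping CHANNELS -> CHANNELS.
weights_per_block = 2 * KERNEL * CHANNELS * CHANNELS
total_mb = BLOCKS * weights_per_block * BYTES / 1024**2

print(round(total_mb))   # 180 (MB) for these sample sizes
```

Even with generous sizes the weights land well under 1 GB, so extra GPU RAM mostly buys larger evaluation batches rather than room for the net itself.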

Re: Nvidia RTX 30xx

Post by RobertJasiek »

Polama wrote:the individual tensor cores are 4x, but there's half as many of them. From your numbers, you saw 0.5x cores and 2.1x more tflops total throughput (meaning about 4x
per core). And as you note, benchmarks reported by the manufacturer can be misleading.
Ok, right. 4x is the theoretical order of magnitude per core; 2.7x is what Nvidia promises but is probably only an upper bound, so the 2.1x more TFlops total throughput for the 3080 compared to the 2080 is somewhat closer to the truth.

However, a first ALU core test puts the promised 2x into perspective. Nvidia selected 8 sample 3D games, and their test resulted in an average 1.8x improvement from 2080 to 3080. Given Nvidia's bias, that must be an upper limit, too. Since the 2080 TI is circa 1.3x as fast as the 2080 for 3D games, we get 1.8 / 1.3 ~= 1.4 as the factor from 2080 TI to 3080 for 3D games.

For tensor cores, it might be a bit more.

Similar guesstimates for 3090 give circa 1.7x as the factor from 2080 TI to 3090 for 3D games.

So I doubt that 3080 or 3090 can quite reach 2x compared to 2080 TI for deep learning.
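The guesstimate chain can be reproduced in one place; all inputs are the rough figures quoted above, not measurements:

```python
# Reproduce the guesstimate chain; inputs are rough benchmark figures.
gain_3080_vs_2080 = 1.8      # Nvidia's 8-game average, 2080 -> 3080
gain_2080ti_vs_2080 = 1.3    # commonly quoted 3D-gaming gap, 2080 -> 2080 Ti

gain_3080_vs_2080ti = gain_3080_vs_2080 / gain_2080ti_vs_2080
print(round(gain_3080_vs_2080ti, 2))   # ~1.4, as estimated above
```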

Nevertheless, close to 2x might be good enough: at the EGC Pisa, which ended on 2018-08-05, a professor of computer science from, IIRC, San Francisco (sorry, I forgot his name) said that 2x 1080 TI (or was it 2x 2080 TI?) roughly equalled the 4 TPUs of AlphaGo Zero. Since the 2080 TI was launched only afterwards, on 2018-09-27, I think he must have said 2x 1080 TI. Hence, if the 3090 is circa 2x a 1080 TI, a 3090 would be good enough, although 2x 2080 TI would still be faster, but only for programs actually using SLI.

Then there is the option to await the 3080 TI, hoping it will have SLI, but I guess we speak of 2x net $1100 or $1200 to achieve very roughly 1.35x the speed of 2x 2080 TI.

So far my current tea-leaf reading. The principal options for alleged >9p play are:

2x 2080 TI (currently available used at approximately reasonable prices only in the USA)
3080 (probably not enough, although more than good enough for kyu learners)
2x 3080 (presumes the programs to use them despite missing SLI)
3090 (probably not quite, but maybe good enough nevertheless; advantage of avoiding SLI troubles)
2x 3080 TI (if this will have SLI)
2x 3090 (clear case but too expensive by far)


Gomoto
Gosei
Posts: 1733
Joined: Sun Nov 06, 2016 6:56 am
GD Posts: 0
Location: Earth
Has thanked: 621 times
Been thanked: 310 times

Re: Nvidia RTX 30xx

Post by Gomoto »

I am not sure that you need 2x 2080Ti to reach superhuman strength.

Do you have a source for 2x 2080Ti are needed for "usually stronger playing than 9p"? Thank you.

Re: Nvidia RTX 30xx

Post by RobertJasiek »

1) The professor's statement (probably about 2x 1080 TI).

2) goame's experience suggesting 2x 2080 TI SLI in the thread viewtopic.php?f=18&t=17715&start=0

3) Various descriptions of 1x 2080 TI being insufficient for consistent superhuman strength.

Re: Nvidia RTX 30xx

Post by Gomoto »

Thank you

Re: Nvidia RTX 30xx

Post by Gomoto »

I am not convinced ;-)
Mike Novack
Lives in sente
Posts: 1045
Joined: Mon Aug 09, 2010 9:36 am
GD Posts: 0
Been thanked: 182 times

Re: Nvidia RTX 30xx

Post by Mike Novack »

Every so often I feel I have to jump into a discussion to point something out. A statement like "one 2080 ti is not powerful enough" is neither right nor wrong. It is MEANINGLESS unless time control is discussed. Computers do not differ in "power", in what problems they can manage (if it is computable, it is computable on a Turing machine); they differ in how long it takes them to do it.

You have to bring time into it.

A statement like "one 2080 ti would take twice as long per move as two 2080 ti's to have >9p strength, and the time control is shorter than twice that" is sensible. But without that reference to time, it is nonsense.
Uberdude
Judan
Posts: 6727
Joined: Thu Nov 24, 2011 11:35 am
Rank: UK 4 dan
GD Posts: 0
KGS: Uberdude 4d
OGS: Uberdude 7d
Location: Cambridge, UK
Has thanked: 436 times
Been thanked: 3718 times

Re: Nvidia RTX 30xx

Post by Uberdude »

A Turing machine has infinite memory, real computers don't.

Re: Nvidia RTX 30xx

Post by Gomoto »

Mike, care to share your opinion on the time settings for >9p 2080Ti?

I for one assume clicking through a game at a reasonable pace ;-)

Re: Nvidia RTX 30xx

Post by Uberdude »

What does "usually stronger playing than 9p" mean? Firstly, by 9p do you mean a top 10 pro, top 100 pro, top 1000 pro, average strength of actual 9ps, or a weak old 9p? "Consistent superhuman strength" is rather different from my interpretation. Some more explicit phrasings:
1) usually (>50%) beats a "9p" in an even game with typical internet time controls. You don't need multiple GPUs, a modern phone will do.
2) usually (>50%) beats a 9p in an even game with serious tournament time controls. Ditto?
3) practically always (>99.999%) beats a 9p in an even game with X time controls
4) when given a realistic (from strong players) whole-board position from a game, picks an equal or better move than the 9p >50% of the time
5) when given a realistic (from strong players) whole-board position from a game, picks an equal or better move than the 9p >99.999% of the time
6) when given an artificial whole-board position, picks an equal or better move than the 9p >99.999% of the time
7) when given sub-board problems is able to consider that local area as an abstraction and give the local best move better than the 9p >99.99% of the time
8) when given pathological bot-trap positions, is able to give answers at least as good as the 9p
9) after studying several tens of thousands of whole-board, local and pathological positions over the next 2 years, there will not be one instance of the bot giving a worse move (after 30 seconds of thought) than one I was able to convince myself was correct due to logical reasoning, because if I do I want my money back because L19 gave me bad advice.
10) after studying several tens of thousands of whole-board, local and pathological positions over the next 2 years, there will not be one instance of the bot giving a worse move (after 24 hours of thought) than one I was able to convince myself was correct due to logical reasoning, because if I do I want my money back because L19 gave me bad advice.

Knowing Robert, I suspect it's a 9 or 10. :D