"RAMGATE" and Nai's Benchmark is somewhat suspicious.


“RAMGATE” seems to be going on at full storm.

Meanwhile I found Nai’s Benchmark code at this website:

I copied the text/source code, cleaned it up with TextPad, and copy & pasted the helper_math.h code into it. Then I studied it.

Then I compiled/built it and ran it via Visual Studio 2010 with incremental linking disabled and the CUDA 6.5 toolkit, probably some beta installed.

My conclusion is the following:

  1. The benchmarking tool behaves weirdly on my GT 520… it reports unbelievably high numbers. So this is a clear indication something is wrong, at least with Visual Studio and CUDA 6.5.

  2. Perhaps the division code is flawed… perhaps () must be added to make the divisions happen in the proper order, though this doesn’t seem to be the problem.

  3. The benchmark is too short… only 10 loops over a 128 MB block are tested. Almost no GPU load is applied according to GPU-Z.

  4. The kernel code itself seems suspicious… all inputs are added to a single temporary variable.

  5. Perhaps the Visual Studio compiler or the CUDA compiler detects that the code doesn’t do anything useful and simply removes the useless code.

  6. I was not able to view any generated PTX. Is there a setting in Visual Studio that allows this ? I found this weird. I guess I could modify the command line parameters or so to generate PTX… but this should be on by default, at least for output files ?! I feel sorry for C/C++ programmers who have to deal with this runtime crap… it’s nice to write fast simple CUDA programs like this… except they have no idea what the hell is going on… the PTX probably gets included inside the executable somewhere… How do you know the kernel is actually loaded and running successfully ? I guess you don’t…

  7. I do see the memory being allocated.

  8. There does seem to be some slight gpu load activity but barely.

This makes me conclude there is something fishy going on with this benchmark… at least with Visual Studio 2010 and CUDA toolkit 6.5 and/or my system… but this doesn’t surprise me at all… after my “CUDA turned into crap” YouTube video, which also included OpenGL interaction.

Therefore I will also not be running his executable just in case he is trying to infect systems.

I am not saying his benchmark is totally flawed or anything like that… It’s not producing expected results on my C/C++ system as he posted it ?!?

I also tried changing BenchmarkCount from 10 to 100 or 1000; this completely freaks out the benchmark… sometimes returning 0 or negative numbers.

I also tried changing the float to double where the gigabytes/second calculation is done.

I wonder if the rapid launch of multiple kernel calls is maybe affecting the output. (I do have many browsers open though… maybe that is interfering with CUDA… or maybe CUDA is just totally failing on the GT 520… my own benchmark from long ago does seem to work a bit.)

Anyway, maybe those rapid kernel calls are not syncing properly or whatever.

Or perhaps the copy & paste operation from HTML to text screwed something up… to me it doesn’t seem like it.

I write this posting to any interested cuda coder. It should be quite easy to copy & paste his source code from that german link I gave you.

Could you try compiling/building his code on your system… then running it… examining the results… maybe posting them here…

And then later modify it a little bit like:

int BenchmarkCount = 1000;

I’d be interested to see if there is a difference in results if it is increased to 100 or 1000 or maybe even beyond that… but maybe the milliseconds will overflow… or maybe the watchdog will kick in.
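
A rough back-of-the-envelope check of that worry, as a minimal host-side C++ sketch. The 128 MiB chunk size comes from the benchmark output; the 9 GB/s bandwidth and the 2 second watchdog timeout (the usual Windows TDR default, as far as I know) are assumptions, and `estimated_ms` and `exceeds_assumed_watchdog` are hypothetical helpers, not part of Nai's code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: expected time in milliseconds to read `chunk_bytes`
// `passes` times at an assumed sustained bandwidth in bytes per second.
double estimated_ms(std::uint64_t chunk_bytes, int passes, double bytes_per_second) {
    assert(bytes_per_second > 0.0);
    double total_bytes = static_cast<double>(chunk_bytes) * passes;
    return total_bytes / bytes_per_second * 1000.0;
}

// Assumed watchdog limit in milliseconds (not read from the system).
const double kAssumedWatchdogMs = 2000.0;

// True if the whole timed loop, counted as one unit, would run longer
// than the assumed watchdog limit.
bool exceeds_assumed_watchdog(std::uint64_t chunk_bytes, int passes, double bytes_per_second) {
    return estimated_ms(chunk_bytes, passes, bytes_per_second) > kAssumedWatchdogMs;
}
```

With a 134217728 byte chunk at 9e9 bytes/s, 10 passes come out at roughly 149 ms and 1000 passes at roughly 14.9 s. A float holding milliseconds is nowhere near overflowing at those values, so the watchdog looks like the more realistic risk, although whether it judges single launches or a whole batch of them probably depends on the driver.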

On my system at least the results are totally whack.

Perhaps later when I have some more time I may write my own benchmark… but might take a different approach just to make sure all the rules of the cuda/driver api architecture and such are followed… I will have to brush up a bit on my cuda programming skills… fortunately I can probably look at my old code…

Anyway some questions: Is his way of coding safe ? In other words:

Performing 10 kernel calls between event start and event stop ?
for (int j = 0; j < BenchmarkCount; j++)
    BenchMarkDRAMKernel<<<BlockCount, BlockSize>>>(Pointers[i]);


I have no further time to look into this any further right now… but maybe this loop should be around it and not in it… just some hints which may or may not be wrong.

I do believe there is a problem with the GTX 970 though… because of many gamers mentioning stutter…

So let’s consider the issues I am having with this code and the CUDA 6.5 toolkit… something bizarre by itself. Unless others are having problems as well re-creating his benchmark and getting believable results.

Just to be clear, here are the results of the current build from Visual Studio 2010:

Nai’s Benchmark
Allocating Memory . . . Chunk Size = 134217728 Byte
Press any key to continue . . .
Allocated 7 Chunks
Benchmarking DRAM
Press any key to continue . . .
DRAM-Bandwidth of 0. Chunk: 838860.812500 GByte/s
DRAM-Bandwidth of 1. Chunk: 699050.687500 GByte/s
DRAM-Bandwidth of 2. Chunk: 822412.562500 GByte/s
DRAM-Bandwidth of 3. Chunk: 822412.562500 GByte/s
DRAM-Bandwidth of 4. Chunk: 762600.750000 GByte/s
DRAM-Bandwidth of 5. Chunk: 806596.937500 GByte/s
DRAM-Bandwidth of 6. Chunk: 806596.937500 GByte/s
Press any key to continue . . .
Benchmarking L2-Cache
Press any key to continue . . .
L2-Cache-Bandwidth of 0. Chunk: 2567941.250000 GByte/s
L2-Cache-Bandwidth of 1. Chunk: 2567941.250000 GByte/s
L2-Cache-Bandwidth of 2. Chunk: 2567941.250000 GByte/s
L2-Cache-Bandwidth of 3. Chunk: 2621440.000000 GByte/s
L2-Cache-Bandwidth of 4. Chunk: 2567941.250000 GByte/s
L2-Cache-Bandwidth of 5. Chunk: 2621440.000000 GByte/s
L2-Cache-Bandwidth of 6. Chunk: 2621440.000000 GByte/s
Press any key to continue . . .

I don’t think my GT 520 can do 838860 GByte/s, do you ?

Maybe he updated his source code to correct errors… don’t know… don’t think so…

If you do believe this number is correct then hmmm…

I think I can explain the 0 or negative numbers… this was probably when I changed the printf from float to double, then changed it back to float and forgot to change %d back to %f… but other than that… the numbers seem way too inflated ?!

The author (Nai) himself seems to be unsure about the benchmark… something may be wrong with it, or maybe not… he claims it’s good for testing swapping behaviour. But he also seems to mention something about maybe the CPU or CPU/main RAM executing CUDA code. So my theory could be that the CUDA runtime detects that the kernel code is so simple that it might be better to run it on a CPU ??? Seems somewhat far-fetched… but something like that would not surprise me… having seen similar compiler optimizations… maybe the latest NVIDIA/CUDA driver has some kind of wacky runtime optimization…

I think it would be best if the kernel actually returned some useful data, just to prevent such optimizations. The only piece of code I do not understand is lines like these:

if (length(Temp) == -12354)
	In[0] = Temp;

What does length and some arbitrary negative number have to do with all of this ? hmmm…

If it is inverted (notted) it becomes -1 if 16 bit… maybe some kind of magical number/floating point number… not sure… maybe it’s a way of writing 65535 or so with a negative number… but the bit pattern seems different… weird.

Others and he himself seem to have spotted some troubles with his own code… later on in the german thread he uploads a new piece of software, “click spoiler” to view new source code.

I will try his newer source code and examine it and see what’s different.



I’m the “German programmer” who initially programmed this benchmark. As I’ve already stated in this thread several times and immediately after posting the download: This benchmark isn’t suitable for benchmarking the VRAM bug. You’re just benchmarking the swapping behaviour of the global memory space in CUDA. However, some people spread this benchmark all around the internet without reading, understanding or knowing its issues. Even some major news sites disgraced themselves by using this benchmark without investigating it further.

There’s also a more recent version of the source code hidden within the spoiler down the page. Furthermore this benchmark isn’t very well designed, since I’ve written it within 20 minutes and I’ve never dreamed of it becoming so notoriously “popular”. Thus the quality of the code is very poor. Also I’m very sorry for the unjustified uproar, which this benchmark is causing.

To 1. : Those high numbers are caused by errors when the kernel launch fails. A failed kernel launch has a runtime of about 0 ms. Since the bandwidth is calculated as size/time, the estimated bandwidth becomes very high. Error handling would avoid this problem, but I was too lazy to program it. Thus your high bandwidth suggests that there is an error. Maybe the Windows watchdog? Maybe wrong project settings?

To 2. : Code is ok.

To 3. : A larger problem size would indeed increase the accuracy. But increasing the problem size would increase the runtime. A higher runtime might cause the kernel to fail because of the watch dog. I was again too lazy to avoid this.

To 4. : A simple hack to measure the read bandwidth. The compiler cannot determine whether “Temp” will be written back or not. Thus it cannot omit the load instruction.
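
The size/time point from "To 1." can be made concrete with a minimal host-side C++ sketch. It assumes the printed figure is total bytes divided by elapsed time, with a GByte counted as 2^30 bytes; both are guesses about the benchmark, and `looks_like_failed_launch` with its threshold is a hypothetical sanity check, not something from the original code. The proper remedy would still be checking the error status the CUDA runtime reports after each launch (for example via cudaGetLastError).

```cpp
#include <cassert>

// Bandwidth in GiB/s (2^30 bytes per GiB) for `bytes` moved in `elapsed_ms`.
double gib_per_second(double bytes, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    const double gib = 1024.0 * 1024.0 * 1024.0;
    return (bytes / gib) / (elapsed_ms / 1000.0);
}

// Hypothetical sanity check: treat a measurement as suspect when the
// elapsed time is below a chosen floor (the floor itself is an assumption).
bool looks_like_failed_launch(double elapsed_ms, double min_plausible_ms) {
    return elapsed_ms < min_plausible_ms;
}
```

Ten passes over a 128 MiB chunk are 1342177280 bytes, or 1.25 GiB. Timed at 150 ms that is about 8.3 GiB/s; timed at 0.0015 ms, which is what a launch that never really ran might look like, it becomes about 833333 GiB/s, the same order of magnitude as the figures reported for the GT 520.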

“Therefore I will also not be running his executable just in case he is trying to infect systems.”
Damn! I wanted your pc for my botnet, too! :)

@ Anybody else:
As a CUDA programmer I’m also kind of interested, why the bandwidth drops so much. I assume that it is caused by the undocumented swapping behaviour of the virtual global memory space. But what are the precise explanations for those drops? My investigations suggest that the global memory swapping is one way associative. They also suggest that a page fault doesn’t cause the GPU to upload the page from CPU DRAM to GPU DRAM. Thus the GPU copies the data over the PCI-E for each access again and again, just like pinned memory. Is this correct?

Regards Nai

(I have not read Nai’s comment above yet, I first want to post this. This is the result of his second version, which I copied from that forum (I am not sure how many versions there are…). However, do note the different look of the outputted text, MiByte and such):

Nai’s Benchmark
Allocating Memory . . .
Chunk Size: 128 MiByte
Allocated 7 Chunks
Allocated 896 MiByte
Benchmarking DRAM
DRAM-Bandwidth of Chunk no. 0 (0 MiByte to 128 MiByte):2621440.00 GByte/s
DRAM-Bandwidth of Chunk no. 1 (128 MiByte to 256 MiByte):2419790.75 GByte/s
DRAM-Bandwidth of Chunk no. 2 (256 MiByte to 384 MiByte):1997287.63 GByte/s
DRAM-Bandwidth of Chunk no. 3 (384 MiByte to 512 MiByte):2419790.75 GByte/s
DRAM-Bandwidth of Chunk no. 4 (512 MiByte to 640 MiByte):2516582.50 GByte/s
DRAM-Bandwidth of Chunk no. 5 (640 MiByte to 768 MiByte):2419790.75 GByte/s
DRAM-Bandwidth of Chunk no. 6 (768 MiByte to 896 MiByte):2516582.50 GByte/s
Benchmarking L2-Cache
L2-Cache-Bandwidth of Chunk no. 0 (0 MiByte to 128 MiByte):12839707.00 GByte/s
L2-Cache-Bandwidth of Chunk no. 1 (128 MiByte to 256 MiByte):11439011.00 GByte/s
L2-Cache-Bandwidth of Chunk no. 2 (256 MiByte to 384 MiByte):11650844.00 GByte/s
L2-Cache-Bandwidth of Chunk no. 3 (384 MiByte to 512 MiByte):11870672.00 GByte/s
L2-Cache-Bandwidth of Chunk no. 4 (512 MiByte to 640 MiByte):12098954.00 GByte/s
L2-Cache-Bandwidth of Chunk no. 5 (640 MiByte to 768 MiByte):11234743.00 GByte/s
L2-Cache-Bandwidth of Chunk no. 6 (768 MiByte to 896 MiByte):12839707.00 GByte/s
Press any key to continue . . .

Again the numbers do not make sense for my GT 520. Even if the card were accidentally testing the CPU’s cache memory, these figures still would not make any sense at all… 16 GByte/sec is probably the cache limit of an AMD X2 3800+ dual core. I go investigate further :)

Ok, I am reading Nai’s German replies on the initial thread… my German is not too great… but apparently both versions of this benchmark somehow fail on the Fermi/GT 520… why, I don’t know… apparently Nai wrote the benchmark for the Titan… I have no idea why it’s failing at the moment… though my hunch would maybe be the start/stop thing… continuing to read the German thread…

Ok, I think Nai also gives an explanation why the pagefile and virtual memory are still necessary. It’s because the operating system itself can have multiple graphics applications running, and they all want memory from the graphics card. Windows, for example, will swap the GPU memory to disk/virtual memory… to make room and so forth… so it’s because of multi-tasking/multi-threading/multi-applications that a pagefile might still be necessary ! So people beware ! Make sure to have enough pagefile memory ! ;) :)

Ok, no further versions of Nai’s benchmark. At the end of the thread Nai seems to apologize a little bit for all the confusion. He seems to write that the benchmark is not suited for “discovering the gtx 970” bug.

My conclusion could be that this bug does not actually exist ? I did read some people claiming this bug existed on GTX 980 as well.

What might be going on is that people expect the FULL 4 GB to be available, while in reality they may have other applications running that are consuming near 500 MB of VRAM. However that seems somewhat unlikely. Meanwhile NVIDIA has also looked into this.

There is now some speculation/confirmation that this might be caused by disabling SMs (streaming multiprocessors) inside the GTX 970, which would lead to this bad memory performance of the top 500 MB.
I find that explanation somewhat fishy. If that were the cause, then this might be a hardware issue where some kind of chip has a limited connection to some kind of RAM chip. However, there are also rumors that maybe a BIOS flash might correct this and turn a GTX 970 into a GTX 980.

To me it all makes little sense, at least in light of this benchmark… the benchmark does spawn an awful lot of threads, but only per 128 MB block… why the top blocks would suddenly have bad performance is really odd to me. I cannot imagine how a chip design <-> memory interface <-> memory chips communication design could suddenly limit bandwidth !? That seems really weird. This could prove that this is indeed an artificially created limitation, perhaps in the BIOS…

Let me be clear about this one more time: As a programmer I just cannot imagine why suddenly communication to a different part of the memory addressing space would suddenly lead to such a big drop in bandwidth ???

Perhaps the benchmark is flawed… and must not be taken too seriously… however this could still mean there is a processing issue while gaming.

Perhaps this limited bandwidth issue only happens when the GPU is at full processing capacity. But the weird thing is… the GPU IS at full processing capacity during the benchmark for the lower 128 MB blocks… so again this is really weird and totally odd ? This is the weirdest thing concerning RAM I have ever seen (indirectly since I don’t have a GTX 970 and never will lol).

I am also not satisfied with NVIDIA’s explanation. I hope a better, more technically detailed explanation will follow in the future. Now I go read Nai’s English reply.

However I am not done yet… I still need to analyze version 2 of Nai’s benchmark… but it’s immediately apparent to me that it’s probably flawed (I am not entirely sure because I am used to the NVIDIA driver API for CUDA); this launch code seems bad:

BenchMarkDRAMKernel<<<1966080, 128>>> etc

As far as I know, at least my Fermi/GT 520 is limited to 65535 for these kinds of parameters. I think I saw documentation for the newer CUDA 6 where these limits have been raised to 1 million or 1 billion or so; that could explain why his benchmark does work on newer graphics cards and not on older ones like mine. This kinda sucks. But such is life. NVIDIA is basically a little bit to blame for not providing an easier-to-use technique for distributing the elements to be processed. However, it’s also a programmer issue… Nai could have done a better job of being backwards compatible by calculating a more acceptable parameter range for older graphics cards, but he probably was a bit lazy, which he admitted too… which is just fine with me… if you have a newer card… why bother with older ones eh ? ;) This does show issues with NVIDIA’s design… and how this leads to backwards compatibility problems, which is something not to be too happy about. They could still solve it if they wanted to… but maybe they don’t know how… or maybe they don’t want to put another lighter API over CUDA to solve these distribution issues. Quite frankly this sucks a bit… thanks to Nai though for making a nice example to show these kinds of issues at play.
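
One way to get under that limit is to spread the blocks over two grid dimensions. The following is only a host-side C++ sketch of the arithmetic, assuming a per-dimension limit of 65535 as on compute capability 2.x cards (as far as I know, newer cards allow far more in the x dimension); `Grid2D`, `split_blocks` and `linear_block` are hypothetical names, not part of the benchmark.

```cpp
#include <cassert>
#include <cstdint>

struct Grid2D { unsigned x; unsigned y; };

// Hypothetical helper: spread `total_blocks` over a 2D grid whose
// dimensions each stay at or below `max_dim` (65535 assumed for Fermi).
Grid2D split_blocks(std::uint64_t total_blocks, unsigned max_dim = 65535) {
    assert(total_blocks > 0 && max_dim > 0);
    std::uint64_t rows = (total_blocks + max_dim - 1) / max_dim;  // ceil(total / max_dim)
    assert(rows <= max_dim);  // beyond this a third dimension would be needed
    std::uint64_t cols = (total_blocks + rows - 1) / rows;        // ceil(total / rows)
    return Grid2D{ static_cast<unsigned>(cols), static_cast<unsigned>(rows) };
}

// Linear block number computed from a 2D block index
// (inside a kernel: blockIdx.y * gridDim.x + blockIdx.x).
std::uint64_t linear_block(const Grid2D& g, unsigned bx, unsigned by) {
    return static_cast<std::uint64_t>(by) * g.x + bx;
}
```

For the 1966080 blocks in the launch above this gives a 63422 x 31 grid, which is 1966082 blocks, so two of them are surplus. A kernel using this layout would have to compare its linear block number against the real block count and return early when it is out of range.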

I may examine the benchmark more later on… but I think it’s time I read Nai’s english reply to my thread ;)

Ok I read Nai’s reply… yeah perhaps kernel is failing… perhaps not… will have to look into that.

I have one question for you though:

Could you explain this code ? (I think you hinted in german about some kind of floating point degenerate or something not sure if it had to do with this code):

if (length(Temp) == -12354)
	In[0] = Temp;

I don’t understand that negative number ? Is it some special floating point number or bit pattern ? What does length have to do with it ?

Is it perhaps just some “garbage code” to fool the compiler into believing the kernel is doing something useful, to prevent code elimination or other strange optimizations ?

Yes it is.
length will not become negative. Thus there won’t be any writeback. But the compiler is too stupid to recognize this. Because of that it also cannot use dead code removal for the load instructions. The number is just randomly chosen at my whim (it should have been -12345, but I made a typo). By the way: you’re free to modify this benchmark to your liking :). Some people have already done this. There are also some slightly improved versions around the internet. You could easily modify it for your old Fermi card by replacing the 1D thread blocks by 2D thread blocks.
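
The reasoning can be checked on the host with a small C++ stand-in. This is only a sketch: `Float4` imitates CUDA's float4, `length` is assumed to be the usual Euclidean length (square root of the dot product of the vector with itself), and `guard_fires` is a hypothetical name for the condition in the kernel.

```cpp
#include <cassert>
#include <cmath>

struct Float4 { float x, y, z, w; };  // stand-in for CUDA's float4

// Euclidean length: square root of a sum of squares, so never negative.
float length(const Float4& v) {
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z + v.w * v.w);
}

// Mirrors the kernel's guard: true only if the length equals -12354.
bool guard_fires(const Float4& temp) {
    return length(temp) == -12354;
}
```

Since a square root is either non-negative or NaN, and NaN compares unequal to everything, `guard_fires` returns false for every input. The write is therefore never executed, yet the compiler has to keep the loads that feed `Temp` unless it can prove that fact itself.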


Let me rectify myself a little bit:
Maybe some of those measured drops on a Geforce 980 GTX were caused by this VRAM bug.
But people also observed similar drops on other GPUs, which were probably the majority of all complaints because of this benchmark. I suppose that those drops are caused by the global memory swapping behaviour of the GPU. And now I’m wondering how this swapping behaviour exactly works. To me this is much more interesting than this VRAM bug. :)

Here is my Test CUDA Memory Bandwidth Performance application, with nice gui, block size setup, round setup, graph/chart, log/error messages, kernel source and ptx source.

The packed folder contains a winrar file containing the 3 files, 2 of them are necessary to run the application (*.exe and *.ptx).


My GT520 is showing approx 1.5 GigaByte/sec bandwidth with these float4 and kernel and short running time. It should be able to achieve 9 GigaByte/Sec so not sure why it’s so low… (maybe kernel launch parameters could be better) I am kinda curious what other graphics cards will show.

Also this a first version/release (0.03), maybe later I will update it a little bit, so it has some better launch parameters/optimal calculation support for newer graphics cards, for now this will have to do ! ;)

The unpacked folder contains the 3 files unpacked in case anybody is having troubles with extracting them.

I just added a little “save chart to file” button, which saves the chart into two files one “bitmap” and one “wmf” which is a new kind of graphics format which is much smaller. So that basically all windows systems should be able to read that file, you could then open it in ms-paint and re-save it as a jpg or so.

Here is example of single run:

Here is example of multiple runs:

And finally I will convert the single one to jpg so it can be shown here:

I hope you enjoy it… maybe this little app will shine some more light on things ;) :)

I’d be curious to see some charts of the GTX 970 and perhaps other models like that as well, to see if there is indeed some truth to it all ?! ;) :)

Ouch… actually my worst fears are becoming true (I don’t own one, but it’s a sssad story for them owners, though I feel confident a driver update can fix this issue and limit the card to 3.5 GB?)

(Actually I could imagine something like this… but I just hoped it didn’t exist… but I guess with parallel chip designed systems… this multiple interfaces to d-rams makes sense… and thus because of defects in chips and still wanting to sell… we might be seeing more of this in future or maybe not… or maybe this was just deliberately disabled… quite weird really) anyway… didn’t read full story yet… but that diagram basically already tells it… just ouch… ouch ouch ouch. I just hope my new bandwidth test can also spot it… ;) That be interesting… anyway back to more reading… cheer up gtx 970 owners… at least you can still play pretty decently… 500 MB to miss is almost just 10% of RAM no biggie… just like windows takes some RAM :) :) I’d be happy with a gtx 970 if it was 5 to 15 watts of power usage/heat output or so ;) Maybe in future ;)

Actually I couldn’t… why is one L2 cache/memory controller disabled ? Weird… Because of defects ? Or deliberate ? Weird…