Turning off coalescing


The GPU seems to like doing “coalescing”.

For massively random access with just little 32 bit integers this behaviour might backfire and just make things worse.

Therefore it could be interesting if the GPU had an option/api call/setting to turn off coalescing.

So far setting a memory pointer/array to “volatile” seems to help for massively random access. (Gives 50% more performance ?!)

I wonder if there is perhaps anything else which can be done to improve massive random access ?

So far one other little trick seems to be to limit thread blocks to just 16 threads but the performance benefit was very small.


We should just throw GDDR5 away altogether :biggrin:

I bought a cheap GT 520 for 65 euros which is passively cooled and draws 35 watts or something; it goes up to 40 or 50 degrees.

Anyway I think I know what is going on… I guess I confused the shader frequency with the memory frequency… and I think I also now know how to calculate maximum random access performance without cache benefits:

The memory frequency of this card is: 810 megahurtz lol hurtz.

Anyway… the memory bus is 64 bit.

This means it can transfer 8 bytes per clock.

So the maximum number of bytes it can transfer per second is clock * bus width (in bytes).

However because of coalescing the amount of memory transactions per second becomes: (clock * bus) / memory transaction size

So this is:

(810.000.000 * 8) / 128 = 50.625.000

Only 50 million memory transactions per second, which is really low.
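
As a rough sketch, the estimate above could be written out in Python like this. The function name `transactions_per_second` and its parameter names are just made up for illustration, and it assumes one bus-wide transfer per clock and that every random access costs one full 128 byte transaction with no cache hits at all.

```python
def transactions_per_second(clock_hz, bus_bytes, transaction_bytes):
    """Upper bound on memory transactions per second.

    Assumes one bus-wide transfer per clock and no cache hits.
    """
    bytes_per_second = clock_hz * bus_bytes
    return bytes_per_second / transaction_bytes


# Numbers from the post: 810 MHz, 64 bit (8 byte) bus, 128 byte transactions.
estimate = transactions_per_second(810_000_000, 8, 128)
print(f"{estimate:,.0f} transactions per second")
```

Plugging in a different clock, bus width or transaction size should give a first guess for other cards, as long as the same assumptions hold.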

So it’s a little bit surprising that I have been seeing 64 million and now even 92 million, which almost seems beyond its spec.

This is probably because of lucky caching ;) :) (And also the memory test doesn’t do full random access but block-based random access which might get more cache-lucky).

Anyway I hope these formulas/calculations are correct. This will allow me to figure out which card I might want to purchase in the future ! ;)

I am looking for “maximum memory transactions per second”.

Oh by the way the card has DDR3 ;)

Also turning off memory coalescing would make sense if it was possible.

Suppose it was possible to use the lanes in 4 byte memory transaction sizes then this would give:

(810.000.000 * 8 / 4) = 1.620.000.000

A hell of a lot more memory transactions !
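
To put the two cases side by side, here is a small Python sketch. The names `CLOCK_HZ`, `BUS_BYTES` and `max_transactions` are made up for this example, and the 4 byte case is purely hypothetical: it ignores any minimum burst size the DRAM itself may impose and assumes no caching.

```python
CLOCK_HZ = 810_000_000   # memory clock used in the post
BUS_BYTES = 8            # 64 bit bus


def max_transactions(transaction_bytes):
    """Idealised transactions per second for a given transaction size."""
    return CLOCK_HZ * BUS_BYTES // transaction_bytes


coalesced = max_transactions(128)    # 128 byte transactions
hypothetical = max_transactions(4)   # imagined 4 byte transactions
speedup = hypothetical // coalesced

print(coalesced, hypothetical, speedup)
```

Under these assumptions the gain is simply 128 / 4, so 32 times more transactions per second for fully random 32 bit reads.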

I have seen some documents about ATI cards being able to do 32 bit memory transactions, so I am becoming a little bit curious about ATI cards ?!?

However the test program I wrote is perhaps a bit extreme, perhaps not… so in practice there might be some coalescing but that’s based on luck… I’d rather have guaranteed performance than just luck.

Ok, my sharpness is declining as it gets close to bedtime for me.

Apparently I got confused or didn’t look further than my nose is long.

The mentioned megahurtz is probably something else… what it is… I don’t know…

But the website mentions different timings:


According to this website the memory is actually 1200 MHz. The accuracy of the website might be in doubt, but usually it’s pretty close.

(I noticed this because I thought the higher performing cards also have 800 MHz… ouch… is that really true… and then I noticed higher performing cards have 4000 MHz for memory speed ?!;))

So I am gonna try my formula again (I am not sure which numbers to plug into my formula so I am just gonna try):

(1200 * 1000 * 1000 * 8) / 128 = 75.000.000

That’s much closer to the number I saw: 64.000.000

The rest of the bytes may be overhead ?!?

Anyway… now 92.000.000 doesn’t seem so far off anymore ;)
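
A quick Python sketch of how the measured numbers compare to the 1200 MHz estimate. The function name and its default arguments (8 byte bus, 128 byte transactions) are my own choices for illustration, and it still assumes one transfer per clock at the quoted speed and no cache hits.

```python
def transactions_per_second(effective_clock_hz, bus_bytes=8, transaction_bytes=128):
    """Upper bound on transactions per second, ignoring caches."""
    return effective_clock_hz * bus_bytes / transaction_bytes


theoretical = transactions_per_second(1_200_000_000)

# The two throughput figures seen in the test program.
for measured in (64_000_000, 92_000_000):
    ratio = measured / theoretical
    print(f"{measured:,} measured = {ratio:.0%} of {theoretical:,.0f}")
```

So 64 million would be a bit below the estimate, and 92 million is still above it, which would again point at some cache luck.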

There are a lot of things you’ll have to read if you want to have a better understanding of memory.

Prefetch buffer

Hmm… Parallel Nsight mentions only half of the memory speed, it mentions:


I have seen this before on websites… some double it or some halve it or something…

I suspect the doubling on some websites is either wrong or it compensates for 2 reads per clock cycle ?!?
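
If the doubling is indeed the two transfers per clock of double data rate (DDR) memory, the relation could be sketched in Python as below. The 600 MHz figure is an assumption (simply half of the website's 1200 MHz, not a value read from the tool), and `effective_rate_mhz` is a made-up helper name.

```python
def effective_rate_mhz(clock_mhz, transfers_per_clock=2):
    """Data rate in millions of transfers per second.

    DDR memory transfers on both clock edges, hence the default of 2.
    """
    return clock_mhz * transfers_per_clock


# Assumed: the tool shows 600 MHz, half of the website's 1200 MHz.
reported_clock_mhz = 600
effective = effective_rate_mhz(reported_clock_mhz)

# 64 bit bus = 8 bytes per transfer.
bandwidth_bytes = effective * 1_000_000 * 8

print(effective, bandwidth_bytes)
```

If that is right, the doubled number would be the one to plug into the transactions formula, since it already counts both transfers per clock.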

Maybe sometime I will read up on DRAM and such…

For now it would help if somebody else who knows a lot about DRAM could provide me with some formula to be able to calculate how many random access reads and writes it can do per second ?