Improve performance using volatile

I came across the thread “Turning off coalescing” on the NVIDIA forum, where it is stated: “So far setting a memory pointer/array to “volatile” seems to help for massively random access. (Gives 50% more performance ?!)”.

I am doing finite difference computation (3D stencil computation) on a GPU (Fermi) using CUDA and want to improve its performance. Since access along the Z axis of the 3D array is random (the array is laid out Z,Y,X from slowest- to fastest-varying dimension), I feel like using volatile would be a better choice. Currently I am using shared memory:

__shared__ float 2dplane[32][32];

When I tried using volatile as

volatile float **plane = 2dplane;

I get this error.

Can anyone tell me how to use volatile on a 2D array?

Try this:

volatile float test1[16][16];

(Just add the keyword volatile in front of it; this requires compute capability 1.1.)
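A minimal sketch of how that looks inside a kernel (kernel name and 16x16 tile size are my own assumptions, not from your code; historically this was the pattern used to stop the compiler from caching shared-memory values in registers):

```cuda
// Sketch (assumes 16x16 thread blocks and a row-major, nx-wide 2D input
// sized to the grid).  The volatile qualifier forces every access to
// tile[][] to go to shared memory instead of being kept in a register.
__global__ void stencil_x(const float *in, float *out, int nx)
{
    volatile __shared__ float tile[16][16];

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    tile[threadIdx.y][threadIdx.x] = in[y * nx + x];
    __syncthreads();

    // Interior threads read their X neighbors straight from shared memory.
    if (threadIdx.x > 0 && threadIdx.x < 15)
        out[y * nx + x] = tile[threadIdx.y][threadIdx.x - 1]
                        + tile[threadIdx.y][threadIdx.x + 1];
}
```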

You wrote that the Z-axis access is random; what happens exactly? Is only one element read, or are multiple elements read sequentially from then on?

If there is any form of sequential access then I don’t think volatile will help and might even make things worse.

However, I have only tested on a graphics card with one multiprocessor and 48 CUDA cores, and the testing conditions were very extreme (pretty much completely random single-element accesses within blocks of 8000 integers).

However, all testing and reported results are welcome! =D

Also volatile is probably only relevant for graphics cards which have “caches”.

I think only compute 2.0 graphics cards and up have caches…

Volatile simply bypasses the caches, which could save memory copies, I guess…

So if the program does nothing but “thrash the cache”, then volatile can probably help…

I came across this thread which is highly interesting too:

I have been noticing this too… ptx seems to use many new registers; volatile could help here as well.

Somebody even mentioned a “register” directive, which could be cool.

I was also thinking about a “cached” directive. This would turn programming upside down :) because apparently the default is to “cache” everything, which is nice for sequential algorithms and such.

So perhaps the opposite might be nice too: “dont_cache”, but that’s probably what volatile already does.

But according to the thread I posted above it has other side effects; I am not sure if these are still present in the compiler, but I am going to give it a try.

I think the compiler sees the “volatile” directive and thinks to itself: “oh geez, I better put that variable in a register because it’s not going to be cached and that would hurt performance”.
^ side effect ^ :)

Hmm, those guys weren’t sure what it does when volatile is placed in front of locals… or maybe they were… but now I have enough reason to doubt it, at least for the locals…

The test kernel I wrote isn’t compute intensive and probably not register intensive so it doesn’t have a performance effect.

The ptx code did become longer though with all these volatiles… that’s probably not so good.

The register usage did go up, which could confirm my hypothesis that the compiler will start using more registers; this goes against what somebody else wrote.

Somebody else wrote that volatile might always read/write directly from memory… but that’s probably not possible with CUDA, since CUDA needs a register to load values into and store values from… I am also not completely sure about that last statement… maybe there are other instructions which can load/write from/to global memory as well.

For now I am sticking with what the ptx manual wrote about it: “volatile prohibits cache operations”, which I think means it bypasses the caches.
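One way to see this for yourself (a sketch; the kernel and names are mine): with a volatile-qualified global pointer, the compiler emits ld.volatile.global / st.volatile.global instructions in the PTX (visible by compiling with nvcc -ptx), whereas without volatile it is free to keep the loaded value in a register:

```cuda
// Sketch: with volatile, each access below should show up in the PTX as
// ld.volatile.global.f32 / st.volatile.global.f32, bypassing the cache;
// without volatile the load could be cached or kept in a register.
__global__ void copy_uncached(volatile float *in, volatile float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];  // read and write go straight to memory
}
```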

Any other effect is probably a side effect and shouldn’t be relied on.

The question is: what is considered “cache”? Is a register considered cache too?

I don’t think so… registers are something by themselves and are not considered cache?!…

One other guy was nagging about the C language giving a different meaning to volatile… hmmm… well, I am not a true C programmer, so I am not sure about that… and for me it doesn’t matter.

Whatever the cuda doc says, the ptx doc says, and the ptx output says goes with me! :)

So where is this 3D array that you speak of? You have shown a 2D array in shared memory.
Also make sure that you place volatile in the correct location, which depends on whether you want to declare the value or the pointer to a value as volatile.

This should solve your problems. (Note that C identifiers cannot begin with a digit, so 2dplane itself will not compile; rename it to something like plane2d first.) Then a volatile pointer to the rows of the 2D array is:

volatile float (*plane)[32] = plane2d;

(Not float **: a 2D array decays to a pointer to its first row, not to a pointer-to-pointer.) Either that or you can just make it a 1D array, which is better practice anyway:

__shared__ float plane2d[1024];

volatile float *plane = plane2d;
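Putting that together, a minimal kernel sketch (kernel name, the rename to plane2d, and the one-block-per-1024-elements mapping are my own assumptions):

```cuda
// Sketch: a 1D shared buffer treated as a 32x32 tile, accessed through a
// volatile pointer so every access really touches shared memory.
// Assumes 32x32 thread blocks and one 1024-element tile per block.
__global__ void smooth(const float *in, float *out)
{
    __shared__ float plane2d[1024];
    volatile float *plane = plane2d;

    int tid = threadIdx.y * 32 + threadIdx.x;
    plane[tid] = in[blockIdx.x * 1024 + tid];
    __syncthreads();

    // Example access pattern: average with the X neighbor.
    if (threadIdx.x < 31)
        out[blockIdx.x * 1024 + tid] = 0.5f * (plane[tid] + plane[tid + 1]);
}
```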

According to the documentation it’s also possible to combine the qualifiers:

volatile __shared__

or

__shared__ volatile

The question remains whether for his code it’s better to use shared or global memory, and whether volatile will actually help or not.

Either one of those I meant to write… not sure which one actually…