I came across this thread, "Turning off coalescing", on the NVIDIA forum, where it is stated: "So far setting a memory pointer/array to 'volatile' seems to help for massively random access. (Gives 50% more performance?!)".
I am doing a finite difference computation (3D stencil computation) on a GPU (Fermi) using CUDA and want to improve its performance. Since access along the z axis of the 3D array is random (the array is laid out Z, Y, X from slowest- to fastest-varying), I feel like using volatile would be a better choice. Currently I am using shared memory:
__shared__ float plane2d[32][32];

When I tried using volatile as

volatile float **plane = plane2d;

I get this error.
Can anyone tell me how to use volatile on a 2D array?
(Just add the keyword volatile in front of the declaration; this requires compute capability 1.1.)
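For instance, a minimal sketch of what that looks like (the tile name plane2d and the 32x32 size are taken from your post; the kernel body and the pitch parameter are just an illustration):

```cuda
// Sketch: declaring a shared-memory tile as volatile.
// "plane2d" and the 32x32 size come from the question above;
// the kernel body and "pitch" are only for illustration.
__global__ void stencil_slice(const float *in, float *out, int pitch)
{
    // volatile goes in front of the whole declaration; every access to
    // plane2d then goes back to shared memory instead of being kept in
    // a register copy by the compiler.
    volatile __shared__ float plane2d[32][32];

    int x = threadIdx.x;
    int y = threadIdx.y;

    plane2d[y][x] = in[y * pitch + x];
    __syncthreads();

    out[y * pitch + x] = plane2d[y][x];
}
```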
You wrote that access along the Z-axis is random. What happens exactly? Is only one element read, or are multiple read sequentially from then on?
If there is any form of sequential access then I don’t think volatile will help and might even make things worse.
However, I have only tested on a graphics card with 1 multiprocessor and 48 CUDA cores, and the testing conditions were very extreme (pretty much completely random single-element access within blocks of 8000 integers).
However, all testing and reported results are welcome! =D
I came across this thread which is highly interesting too:
I have been noticing this too… the ptx seems to use many new registers; volatile could help here as well.
Somebody even mentioned a "register" directive, which could be cool.
I was also thinking about a "cached" directive. (This would turn programming upside down :), because apparently the default is to cache everything, which is nice for sequential algorithms and such.)
So perhaps the opposite might be nice too: "dont_cache", but that's probably what volatile already does.
But according to the thread I posted above it has other side effects. I am not sure if these are still present in the compiler, but I am going to give it a try.
I think the compiler sees the "volatile" directive and thinks to itself: "oh geez, I'd better put that variable in a register, because it's not going to be cached and that would hurt performance".
^ side effect ^ :)
Hmm, those guys weren't sure what it does when volatile is placed in front of locals… or maybe they were… but now I have enough reason to doubt it, at least for the locals…
The test kernel I wrote isn't compute intensive and probably not register intensive, so it doesn't show a performance effect.
The ptx code did become longer though with all these volatiles… that’s probably not so good.
The register usage did go up, which would confirm my hypothesis that the compiler will start using more registers; this goes against what somebody else wrote.
Somebody else wrote that volatile might always read/write directly from memory… but that's probably not possible with CUDA, since CUDA needs a register to load stuff into and write stuff from… I am also not completely sure about that last statement… maybe there are other instructions which can load/write from/to global memory directly as well.
For now I am sticking with what the ptx manual wrote about it: "volatile prohibits cache operations", which I think means it bypasses the cache.
Any other effect is probably a side effect and shouldn’t be relied on.
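For what it's worth, a volatile read from global memory still has to pass through a register; only the load itself changes form in the ptx. A sketch (the exact ptx output depends on the compiler version):

```cuda
// Sketch: a volatile read from global memory. The value still has to
// be loaded into a register; volatile only changes the load itself,
// so the ptx uses the volatile form of the load (ld.volatile.global.f32
// rather than a plain ld.global.f32). Exact output depends on the
// compiler version.
__global__ void copy_volatile(volatile const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = in[i];   // loaded into register v via a volatile load
    out[i] = v;
}
```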
The question is: what is considered "cache"? Is a register considered cache too?
I don't think so… registers are something by themselves and are not considered cache?!…
One other guy was nagging about the C language giving a different meaning to volatile… hmmm… well, I am not a true C programmer, so I am not sure about that… and for me it doesn't matter.
Whatever the cuda doc says, the ptx doc says, and the ptx output says goes with me! :)
So where is this 3D array that you speak of? You have shown a 2D array in shared memory.
Also make sure that you place volatile in the correct location, which depends on whether you want to declare the pointed-to value or the pointer itself as volatile.