Texture memory performance: no speedup from using texture memory

I have two constant arrays in my program, so I thought I should put them in texture memory as linear memory. But performance did not change when I used texture memory.
My kernel works out of global memory and does about 50 operations per thread. The array I moved to texture memory is 8 times the number of threads in size. I tried thread counts from 1024 to 102400, but performance is the same whether I use textures or not.
I was wondering whether I am expecting too much from texture memory, or whether I may be doing something wrong. Thanks!
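For reference, reading a linear global array through a 1D texture in the CUDA 1.x texture reference API looks roughly like the sketch below. Names such as `d_data`, `texRef`, and `N` are placeholders, not the poster's actual code.

```cuda
// Sketch: binding linear device memory to a texture reference (CUDA 1.x API).
texture<float, 1, cudaReadModeElementType> texRef;

__global__ void readThroughTexture(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texRef, i);  // cached read via the texture unit
}

// Host side (error checking omitted):
//   float *d_data;
//   cudaMalloc((void **)&d_data, N * sizeof(float));
//   cudaBindTexture(0, texRef, d_data, N * sizeof(float));
```

The texture cache only pays off when reads hit it with good locality; for already-coalesced streaming reads it adds nothing, which matches what the replies below explain.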

It’s not clear what your access pattern is. If your reads are already coalesced, you won’t get any benefit from using texture memory.

Thanks. I guess that’s it. I think my memory accesses are coalesced.

Use the profiler to check whether your global memory accesses are coalesced. Texture memory won’t give you higher bandwidth; it only helps when your accesses are not coalesced but the access pattern exhibits good locality.
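To illustrate the distinction (these are generic kernels, not the poster’s code): on G80-class hardware, a half-warp’s loads coalesce only when thread k reads element k of a properly aligned segment.

```cuda
// Coalesced: consecutive threads read consecutive addresses.
__global__ void coalescedCopy(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];           // one memory transaction per half-warp
}

// Uncoalesced: a stride > 1 scatters the half-warp's addresses,
// forcing separate transactions. This is the pattern where reading
// through a texture (with good locality) can recover bandwidth.
__global__ void stridedCopy(float *out, const float *in, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i * stride];
}
```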


How can I do that? I am using CUDA 1.0, and I think the profiler doesn’t support that option in this version. Or does it?

You are correct. The 1.0 profiler doesn’t have the signal counters needed to check for coalescing. Is there a particular reason you can’t upgrade to 1.1? It’s quite an improvement over 1.0.

Well, my openSUSE 10.2 has had a lot of trouble with the latest CUDA driver, so I have decided to keep the old one. Besides, I am not very competent at compiling kernel modules in Linux… :(

Get the profiler docs (a txt file) from the 1.1 release and try that configuration with 1.0. You may get some of the functionality.

Thanks, Paulius. Can you tell me how to interpret the log? For example, how can I tell whether my memory accesses are coalesced? Also, I think that to get correct results from the profiler I should profile the release build, not the debug build. Is that right?

My profiler log looks like this. I noticed that my cputime is very high. Can anyone tell me what it means?

method=[ memcopy ] gputime=[ 3.648 ]
method=[ memcopy ] gputime=[ 2.880 ]
method=[ memcopy ] gputime=[ 86.912 ]
method=[ memcopy ] gputime=[ 170.336 ]
method=[ memcopy ] gputime=[ 170.368 ]
method=[ _Z15integrateBodiesILb1EEvP6float4S1_S1_S1_fffffi ] gputime=[ 46.240 ] cputime=[ 374.574 ] occupancy=[ 0.333 ]
method=[ _Z15integrateBodiesILb1EEvP6float4S1_S1_S1_fffffi ] gputime=[ 46.656 ] cputime=[ 5157.686 ] occupancy=[ 0.333 ]
method=[ _Z15integrateBodiesILb1EEvP6float4S1_S1_S1_fffffi ] gputime=[ 48.576 ] cputime=[ 15298.748 ] occupancy=[ 0.333 ]
method=[ _Z15integrateBodiesILb1EEvP6float4S1_S1_S1_fffffi ] gputime=[ 47.776 ] cputime=[ 11397.357 ] occupancy=[ 0.333 ]
method=[ _Z15integrateBodiesILb1EEvP6float4S1_S1_S1_fffffi ] gputime=[ 48.288 ] cputime=[ 15828.707 ] occupancy=[ 0.333 ]
method=[ _Z15integrateBodiesILb1EEvP6float4S1_S1_S1_fffffi ] gputime=[ 46.880 ] cputime=[ 11169.108 ] occupancy=[ 0.333 ]
method=[ _Z15integrateBodiesILb1EEvP6float4S1_S1_S1_fffffi ] gputime=[ 48.576 ] cputime=[ 22038.609 ] occupancy=[ 0.333 ]
method=[ _Z15integrateBodiesILb1EEvP6float4S1_S1_S1_fffffi ] gputime=[ 48.608 ] cputime=[ 8370.658 ] occupancy=[ 0.333 ]

I’m not sure why your CPU time is so high. Normally, CPU time is only a few tens of microseconds higher than GPU time.

To get the coalesced/uncoalesced signals, set an environment variable giving the location and name of a configuration file. Inside the configuration file, specify up to four signals you want to profile. You can get the specifics on the signals from the documentation file that comes with the CUDA 1.1 toolkit. As I said before, some of the functionality may be available in 1.0, so give it a try, but I’d recommend upgrading to CUDA 1.1 to get all the benefits.
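As a concrete sketch of the setup described above (the file path is an arbitrary choice; the signal names are from the early CUDA profiler documentation):

```shell
# Enable the command-line profiler and point it at a config file.
export CUDA_PROFILE=1
export CUDA_PROFILE_CONFIG="$HOME/cuda_profile.cfg"

# List up to four signals to collect per run. gld_incoherent vs
# gld_coherent count uncoalesced vs coalesced global loads; the
# gst_* pair does the same for stores.
cat > "$CUDA_PROFILE_CONFIG" <<'EOF'
gld_incoherent
gld_coherent
gst_incoherent
gst_coherent
EOF
```

A nonzero gld_incoherent count in the resulting log is the sign that your global loads are not coalescing.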


Apparently, my high CPU time is due to OpenGL commands, according to this and this.
I managed to use the profiler and the Visual Profiler and found out that I have many uncoalesced accesses.