Shared Mem caching strategy Comparison of benchmark results

Hello everybody,

I would like to share some experience of using shared memory and hope to get some feedback or perhaps also a discussion about caching schemes.

Below I added results of my raycaster therefore - with and without using shared memory.

Since programming manual suggests using shared memory whereever possible, I also included it but now I am not sure anymore if it was a good idea…

Here the framerates:

1 cpu for caching, 64 threads : 10 fps

2 cpus for caching, 64 threads : 14 fps

4 cpus for caching, 64 threads : 17 fps

8 cpus for caching, 64 threads : 22 fps

16 cpus for caching, 64 threads : 33 fps

no caching, 64 threads : 45 fps

no caching, 128 threads : 50 fps

The results show that the cached version is surprisingly much slower than the

non-cached version.

My current caching strategy is very simple -

here a brief overview

(for 64 threads; the actual code caches more than 64 ints however)

...

struct SMEM

{

   unsigned int request [64];

   unsigned int data[64];

}

...

extern __shared__ int sdata[];

SMEM &smem =  *((SMEM*)sdata);

...

int  cpu_id = threadIdx.x;

// each cpu stores the requested offset in the array

smem.request [ cpu_id ] = some offset;

// be sure all are finished

__syncthreads;

// only 1 cpu reads to avoid concurrency

if ( cpu_id == 0 )

{

   int last_req = -1; // last requested offset

   int last_data;       // last requested data

  // loop over all cpus

   for ( int i = 0; i < 64; i++)

   {

       // necessary to read from global memory?

       if ( last_req != smem.request [ i ] )

       {

          last_req = smem.request [ i ];

          last_data = gmem[ last_req ];

       }

       smem.data [ i ] = last_data;

   }

}

__syncthreads;

...

Any suggestions or comments are welcome.

cheers, Sven

The way you are reading data is not so nice. Better to read in the data with all 64 threads using a texture. Like that you don’t have trouble when accessing the same memory location from different threads.

Also shared memory should be used when USEFUL. From the code snippet you posted, I think it will not do you any good, it is not like several threads need access to the same memory over and over again.

Thats true - reading from texture might be the best way. I did it in another example. It was however not that much faster ( I expected like 2x performance - at the end it turned out to be a few percent )

Actually I was trying to use the shared mem as I was reading in another thread that shared mem is superior to the texture cache; but that seems to depend on the application…

only if you are re-using the values you read more than once. If you use them only once, then not. You really should look at the SDK examples to see where you need to use shared memory. There are quite some situations where it does not buy you anything.
(and other situations where it is extremely beneficial)

Finally I also have some benchmarks for comparing global memory
and texture memory - however, the results are not really like what I expected…

Storing the data in global memory: 50-51ms rendering time / frame
Storing the data in texture memory: 49-50ms rendering time / frame

Its about 1-2% difference… not really outstanding.
Does anybody have more significant results using texture memory?
I would be interested to hear about some of your experiences.

Benchmarking one portion of my code that has random memory accesses results in ~20ms to compute. After reordering the particles in memory, the computation only takes ~3ms. Full details are in this paper: http://dx.doi.org/10.1016/j.jcp.2008.01.047 , sections 2.3 and 3.2 are the pertinent ones.

Are all of your texture accesses within a warp spatially local? You need this in order to get the full benefit of the texture cache.

How many FLOPs does your kernel perform for each memory access? 10’s, or 100’s? If the latter, you may be bound by computation, not memory.

What is your occupancy? If it is really low, say less than 33%, then you probably don’t have enough memory/computation interleaving going on to make full use of the memory bandwidth.

From 20ms to 3ms is pretty good. Yeah, it’s true that the occupation of my kernel is just 16%. I would like to increase it with more parallel threads, but the compiled code uses too many registers, which limits the number to 64-128 Threads.

The reads in texture memory of my kernel should be good for caching. Each thread is reading linearly (from a 2D start point to a 2D end point ) in the texture; there are also overlapping areas between the threads where the cache might help.

As the rendering mostly consists of range checks, its difficult to estimate the FLOPS - I guess a lot of time is spent with conditional jumps, which cut the occupancy down… I can’t get rid of them unfortunately…

Hmm, seems like you have everything covered.

The only other suggestion/question I have is: Are you using 2D textures with cudaArray for your 2D locality reads? A 1D texture bound to device memory (cudaBindTexture) will not perform well with a 2D read pattern.

Yes,it is a 2D int array ( texture<unsigned int, 2, cudaReadModeElementType> texture_array; ). I actually think it would be possible to get the same program running with less registers, to get more parallel threads - but then I would have to code everything by hand in assembly language, if this is even possible.

You can use the decuda (unofficial, not NVIDIA-supported) to disassemble/reassemble code cubin files:

http://www.cs.rug.nl/~wladimir/decuda/

(the author goes by “wumpus” in the forums)

Cubins are the output of ptxas, which does the register allocation.