Shared Mem caching strategy Comparison of benchmark results

spacerat · May 8, 2008, 11:25am

Hello everybody,

I would like to share some experience of using shared memory and hope to get some feedback or perhaps also a discussion about caching schemes.

Below I added results of my raycaster therefore - with and without using shared memory.

Since programming manual suggests using shared memory whereever possible, I also included it but now I am not sure anymore if it was a good idea…

Here the framerates:

1 cpu for caching, 64 threads : 10 fps

2 cpus for caching, 64 threads : 14 fps

4 cpus for caching, 64 threads : 17 fps

8 cpus for caching, 64 threads : 22 fps

16 cpus for caching, 64 threads : 33 fps

no caching, 64 threads : 45 fps

no caching, 128 threads : 50 fps

The results show that the cached version is surprisingly much slower than the

non-cached version.

My current caching strategy is very simple -

here a brief overview

(for 64 threads; the actual code caches more than 64 ints however)

...

struct SMEM

{

   unsigned int request [64];

   unsigned int data[64];

}

...

extern __shared__ int sdata[];

SMEM &smem =  *((SMEM*)sdata);

...

int  cpu_id = threadIdx.x;

// each cpu stores the requested offset in the array

smem.request [ cpu_id ] = some offset;

// be sure all are finished

__syncthreads;

// only 1 cpu reads to avoid concurrency

if ( cpu_id == 0 )

{

   int last_req = -1; // last requested offset

   int last_data;       // last requested data

  // loop over all cpus

   for ( int i = 0; i < 64; i++)

   {

       // necessary to read from global memory?

       if ( last_req != smem.request [ i ] )

       {

          last_req = smem.request [ i ];

          last_data = gmem[ last_req ];

       }

       smem.data [ i ] = last_data;

   }

}

__syncthreads;

...

Any suggestions or comments are welcome.

cheers, Sven

DenisR · May 8, 2008, 11:55am

The way you are reading data is not so nice. Better to read in the data with all 64 threads using a texture. Like that you don’t have trouble when accessing the same memory location from different threads.

Also shared memory should be used when USEFUL. From the code snippet you posted, I think it will not do you any good, it is not like several threads need access to the same memory over and over again.

spacerat · May 8, 2008, 2:40pm

Thats true - reading from texture might be the best way. I did it in another example. It was however not that much faster ( I expected like 2x performance - at the end it turned out to be a few percent )

Actually I was trying to use the shared mem as I was reading in another thread that shared mem is superior to the texture cache; but that seems to depend on the application…

DenisR · May 8, 2008, 7:42pm

only if you are re-using the values you read more than once. If you use them only once, then not. You really should look at the SDK examples to see where you need to use shared memory. There are quite some situations where it does not buy you anything.
(and other situations where it is extremely beneficial)

spacerat · May 11, 2008, 11:51am

Finally I also have some benchmarks for comparing global memory
and texture memory - however, the results are not really like what I expected…

Storing the data in global memory: 50-51ms rendering time / frame
Storing the data in texture memory: 49-50ms rendering time / frame

Its about 1-2% difference… not really outstanding.
Does anybody have more significant results using texture memory?
I would be interested to hear about some of your experiences.

MisterAnderson42 · May 11, 2008, 12:15pm

Benchmarking one portion of my code that has random memory accesses results in ~20ms to compute. After reordering the particles in memory, the computation only takes ~3ms. Full details are in this paper: [url=“Redirecting”]http://dx.doi.org/10.1016/j.jcp.2008.01.047[/url] , sections 2.3 and 3.2 are the pertinent ones.

Are all of your texture accesses within a warp spatially local? You need this in order to get the full benefit of the texture cache.

How many FLOPs does your kernel perform for each memory access? 10’s, or 100’s? If the latter, you may be bound by computation, not memory.

What is your occupancy? If it is really low, say less than 33%, then you probably don’t have enough memory/computation interleaving going on to make full use of the memory bandwidth.

spacerat · May 11, 2008, 12:34pm

From 20ms to 3ms is pretty good. Yeah, it’s true that the occupation of my kernel is just 16%. I would like to increase it with more parallel threads, but the compiled code uses too many registers, which limits the number to 64-128 Threads.

The reads in texture memory of my kernel should be good for caching. Each thread is reading linearly (from a 2D start point to a 2D end point ) in the texture; there are also overlapping areas between the threads where the cache might help.

As the rendering mostly consists of range checks, its difficult to estimate the FLOPS - I guess a lot of time is spent with conditional jumps, which cut the occupancy down… I can’t get rid of them unfortunately…

MisterAnderson42 · May 11, 2008, 3:12pm

Hmm, seems like you have everything covered.

The only other suggestion/question I have is: Are you using 2D textures with cudaArray for your 2D locality reads? A 1D texture bound to device memory (cudaBindTexture) will not perform well with a 2D read pattern.

spacerat · May 11, 2008, 5:50pm

Yes,it is a 2D int array ( texture<unsigned int, 2, cudaReadModeElementType> texture_array; ). I actually think it would be possible to get the same program running with less registers, to get more parallel threads - but then I would have to code everything by hand in assembly language, if this is even possible.

seibert · May 11, 2008, 6:44pm

You can use the decuda (unofficial, not NVIDIA-supported) to disassemble/reassemble code cubin files:

http://www.cs.rug.nl/~wladimir/decuda/

(the author goes by “wumpus” in the forums)

Cubins are the output of ptxas, which does the register allocation.

Topic		Replies	Views
Really slow constant memory Random access to constant memory CUDA Programming and Performance	13	4354	December 4, 2009
Shared Memory usage slows kernel with texture fetch CUDA Programming and Performance	8	4143	June 20, 2011
Question about texture/shared memory enhance the computing efficiency CUDA Programming and Performance	3	5381	December 4, 2007
CUDA texture memory performance CUDA Programming and Performance	4	33544	January 13, 2009
Confusion on using texture? CUDA Programming and Performance	14	4929	September 4, 2009
Memory performance in image processing example CUDA Programming and Performance	9	1603	March 24, 2011
Disappointing shared memory performance CUDA Programming and Performance	3	734	September 8, 2011
Texture memory performance CUDA Programming and Performance	4	4973	June 1, 2009
Shared memory problem CUDA Programming and Performance	3	2259	February 8, 2008
Texture vs. Global Memory CUDA Programming and Performance	4	2012	August 6, 2009

Shared Mem caching strategy Comparison of benchmark results

Related topics