Hello everybody,
I would like to share some experience with shared memory, and I hope to get some feedback or perhaps start a discussion about caching schemes.
Below I have added the results of my raycaster - with and without shared memory.
Since the programming manual suggests using shared memory wherever possible, I included it too, but now I am not sure anymore whether it was a good idea…
Here are the framerates:
1 thread for caching, 64 threads : 10 fps
2 threads for caching, 64 threads : 14 fps
4 threads for caching, 64 threads : 17 fps
8 threads for caching, 64 threads : 22 fps
16 threads for caching, 64 threads : 33 fps
no caching, 64 threads : 45 fps
no caching, 128 threads : 50 fps
The results show that, surprisingly, the cached version is much slower than the
non-cached version.
My current caching strategy is very simple - here is a brief overview
(for 64 threads; the actual code caches more than 64 ints, however):
...
struct SMEM
{
    unsigned int request[64];
    unsigned int data[64];
};
...
extern __shared__ int sdata[];
SMEM &smem = *((SMEM*)sdata);
...
int thread_id = threadIdx.x;
// each thread stores the offset it wants to read
smem.request[ thread_id ] = some_offset; // pseudocode: per-thread offset
// be sure all requests are written
__syncthreads();
// only thread 0 reads from global memory, to avoid redundant reads
if ( thread_id == 0 )
{
    int last_req  = -1; // last requested offset
    int last_data = 0;  // last fetched data
    // loop over all threads' requests
    for ( int i = 0; i < 64; i++ )
    {
        // read from global memory only if the offset changed
        if ( last_req != smem.request[ i ] )
        {
            last_req  = smem.request[ i ];
            last_data = gmem[ last_req ]; // gmem: global-memory pointer (kernel argument)
        }
        smem.data[ i ] = last_data;
    }
}
__syncthreads();
...
Any suggestions or comments are welcome.
cheers, Sven