comparison: shared mem <=> global mem, actually no difference

Hi,

I wanted to test how much faster shared mem is compared with global mem. Even though the kernels do very little computation, there is almost no difference in runtime. I wonder whether this can really be true or whether I made some basic mistake. It even holds for a small number of blocks and threads per block.
The function “globalReadTest()” just starts the test.
The kernel “globalReadEach()” reads/writes one float from/to global mem per thread.
The kernel “globalReadNone()” does not read/write global mem at all.

Here’s the code:

static __device__ void calc(float* erg);       // forward declarations so the
__global__ void globalReadEach(float* global); // kernels and helper can be
__global__ void globalReadNone(float* global); // called before their definitions

void globalReadTest() { // method to test the speed of access to shared and global memory

int blockNum=16;
int threadNum=32;

CUT_DEVICE_INIT();

float* data;
cudaMalloc((void**)&data, 1024*1024*10); // allocate 10 MB of global mem

int iter=2000; // number of iterations for the test

unsigned int timer = 0;
CUT_SAFE_CALL( cutCreateTimer( &timer));
CUT_SAFE_CALL( cutStartTimer( timer));
for (int u=0; u<iter; u++) { // within this loop, a kernel is invoked that
                             // accesses global mem in each thread
globalReadEach<<<blockNum, threadNum, sizeof(float)*threadNum>>>(data);
}
CUT_SAFE_CALL( cutStopTimer( timer));
float timerDiff=cutGetTimerValue( timer);
CUT_SAFE_CALL( cutDeleteTimer( timer));

unsigned int timer2 = 0;
CUT_SAFE_CALL( cutCreateTimer( &timer2));
CUT_SAFE_CALL( cutStartTimer( timer2));
for (int u=0; u<iter; u++) { // within this loop, a kernel is invoked that
                             // does not access global mem at all
globalReadNone<<<blockNum, threadNum, sizeof(float)*threadNum>>>(data);
}
CUT_SAFE_CALL( cutStopTimer( timer2));
float timerDiff2=cutGetTimerValue( timer2);
CUT_SAFE_CALL( cutDeleteTimer( timer2));

cudaFree(data); // cudaFree takes the device pointer itself, not its address

printf("Duration globalReadEach: %f\nDuration globalReadNone: %f\nA small difference in time might be caused by the order of execution!\n", timerDiff, timerDiff2);

}

__global__ void globalReadEach(float* global) { // one read and one write access to global mem
extern __shared__ float shared[];

int myPos = blockDim.x * blockIdx.x + threadIdx.x; // coalesced position for global mem access

shared[ threadIdx.x ] = global[ myPos ]; // read from global mem

__syncthreads();

float erg=0;
calc(&erg); // some calculations
global[myPos] = erg; // write to global mem

}

__global__ void globalReadNone(float* global) { // no global mem access
extern __shared__ float shared[];

int myPos = blockDim.x * blockIdx.x + threadIdx.x;
shared[ threadIdx.x ] = myPos; // fill shared mem without touching global mem

__syncthreads();

float erg=0;
calc(&erg); // some calculations

}

static __device__ void calc(float* erg) { // just some calculations to generate work
extern __shared__ float shared[];
for (int b=0; b<1; b++) {
for (int i=0; i < 3*blockDim.x; i++ ) {
*erg = sin(shared[i % blockDim.x]) * cos(shared[i % blockDim.x]) / i / sin((float)b);
}
}
}

Add a cudaThreadSynchronize() before the cutStopTimer call. Kernel launches are asynchronous; I doubt you’re timing the kernels themselves at all, just the time it takes to launch them.

(shared memory is waaaaaay faster than global memory)
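If you’d rather not sprinkle synchronize calls around, you can also time with CUDA events, which are recorded in the GPU’s own command stream. A rough sketch of the idea (my own untested illustration, reusing iter, blockNum, threadNum, and data from your code, in place of the cutil timers):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);               // recorded in the GPU command stream
for (int u=0; u<iter; u++) {
    globalReadEach<<<blockNum, threadNum, sizeof(float)*threadNum>>>(data);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);              // wait until all work before 'stop' finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
printf("globalReadEach: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);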

hi & thx for your fast reply.

I added the cudaThreadSynchronize() to my loops:

for (int u=0; u<iter; u++) {
globalReadEach<<<blockNum, threadNum, 4*threadNum>>>(data);
cudaThreadSynchronize();
}

for (int u=0; u<iter; u++) {
globalReadNone<<<blockNum, threadNum, 4*threadNum>>>(data);
cudaThreadSynchronize();
}

and tested different iteration counts, numbers of blocks and threads, and amounts of calculation. The difference, however, was never more than 1%!
I use an 8800 Ultra & 2 Xeons (2 cores each; 3.2 GHz).

Any suggestions? If someone doesn’t trust my results, please try the code on your own.

I haven’t tried the code, but there’s an important point to understand about device memory access.

Device memory reads are not slow! They just take a long time!

The above is not a joke… it’s an important behavior to understand. The “slow” part of global memory reads and writes is latency. The exact amount is mostly GPU specific, but 200-300 clocks is reasonable. But that huge delay between requesting and receiving the memory doesn’t have to cost you anything, because your thread is suspended while it waits for the reply, so no compute cycles are wasted. This is part of the excellent design of Nvidia’s modern GPUs… the latency is unavoidable, but the easy, lightweight, and massive thread parallelism of CUDA makes it possible to hide it. So your reading thread is sleeping, waiting for its answer, but meanwhile your other warps are chugging away, doing math. If you have enough computation to keep your cores busy, you’ll never notice the latency!
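If you want to see latency hiding in action, here’s the kind of experiment I’d suggest (a sketch I’m making up for illustration, not tested code): launch the same memory-bound, grid-stride kernel with one block and then with many, and time each launch. The total memory traffic is identical every time; only the number of warps available to hide the latency changes.

#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride kernel: total work is fixed at n elements no matter how many
// blocks are launched, so only the amount of available parallelism changes.
__global__ void touchAll(const float* in, float* out, int n) {
    for (int i = blockDim.x * blockIdx.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        out[i] = in[i] * 2.0f;   // one global read, one global write per element
}

int main() {
    const int n = 1 << 22;       // 4M floats, 16 MB each way
    float *in, *out;
    cudaMalloc((void**)&in,  n * sizeof(float));
    cudaMalloc((void**)&out, n * sizeof(float));

    const int threads = 256;
    for (int blocks = 1; blocks <= 512; blocks *= 8) {  // 1, 8, 64, 512 blocks
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start, 0);
        touchAll<<<blocks, threads>>>(in, out, n);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%4d blocks: %8.3f ms\n", blocks, ms);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    cudaFree(in);
    cudaFree(out);
    return 0;
}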

Admittedly there’s another complication with device memory access, coalesced reads and writes, and that’s a very important issue too, and one that often isn’t hidden for you. But understanding the latency of global memory, and how that latency is hidden, is perhaps even more important. [The programming guide concentrates on the coalesced access rules… but you have to remember about latency yourself.]
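For reference, coalescing is purely about the access pattern; a toy example of what I mean (my sketch, not from the guide):

// Two kernels doing identical work with different access patterns.
__global__ void coalescedAccess(float* d) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    d[i] += 1.0f;   // thread k touches element k: neighboring threads
                    // read neighboring addresses, one combined transaction
}

__global__ void stridedAccess(float* d, int stride) {
    int i = (blockDim.x * blockIdx.x + threadIdx.x) * stride;
    d[i] += 1.0f;   // with stride > 1 the half-warp's accesses can no longer
                    // be combined into one transaction on G80-class hardware
}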

So if you try to measure the “speed” of global memory, you have to decide what you’re measuring. If you write a benchmark using a single thread and a single block, you strip out all latency hiding (and all coalescing issues) and you’ll see painfully slow device memory access, since the latency becomes your entire bottleneck.
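A sketch of what such a raw-latency measurement could look like (again my own illustration, not tested): a single thread chasing pointers, so every read has to wait for the previous one to complete.

#include <cstdio>
#include <cuda_runtime.h>

// One thread follows a chain of indices; each read depends on the previous
// one, so no latency can be hidden: you measure pure round-trip time.
__global__ void chase(const int* next, int steps, int* result) {
    int p = 0;
    for (int s = 0; s < steps; s++)
        p = next[p];
    *result = p;   // write the result so the loop isn't optimized away
}

int main() {
    const int n = 1 << 20;
    const int steps = 1 << 20;

    // Build a chain with a large stride so successive reads aren't adjacent.
    int* h_next = new int[n];
    for (int i = 0; i < n; i++)
        h_next[i] = (i + 9973) % n;   // arbitrary odd stride, covers all of n

    int *d_next, *d_result;
    cudaMalloc((void**)&d_next, n * sizeof(int));
    cudaMalloc((void**)&d_result, sizeof(int));
    cudaMemcpy(d_next, h_next, n * sizeof(int), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    chase<<<1, 1>>>(d_next, steps, d_result);   // one thread, one block
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%.1f ns per dependent read\n", ms * 1e6f / steps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_next);
    cudaFree(d_result);
    delete[] h_next;
    return 0;
}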

thank you very much.
I’m just surprised that the latency can be hidden when I have only 32 threads and 16 blocks on an 8800 Ultra.

What happens when I try to read a regular linear array from global memory in an overlapping manner? As in: block 1 reads elements 0 to 512, block 2 reads 256 to 768, etc. Does the latency also get hidden with many threads and blocks?
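For concreteness, the pattern I have in mind would look roughly like this (just a sketch):

// Each block loads a 512-float window into shared memory; the windows of
// neighboring blocks overlap by 256 elements.
__global__ void overlappedLoad(const float* global, int n) {
    __shared__ float tile[512];
    int base = blockIdx.x * 256;        // block 0: 0..511, block 1: 256..767, ...
    for (int i = threadIdx.x; i < 512; i += blockDim.x) {
        if (base + i < n)
            tile[i] = global[base + i]; // adjacent threads still read adjacent
    }                                   // addresses, so the reads coalesce;
    __syncthreads();                    // the overlap just means some data is
                                        // fetched twice, by two blocks
    // ... work on tile[] ...
}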

This topic has been discussed many times. Please look at the CUDA Optimization slides from the presentation at Supercomputing 2007.

Paulius