comparison: shared mem <=> global mem, actually no difference

Hi,

I wanted to test how much faster shared mem is compared with global mem. Even though the kernels do very little computation, there is almost no difference in runtime. I wonder whether this can really be true or whether I made some basic mistake. It even holds for a small number of blocks and threads per block.
The function “globalReadTest()” just starts the test.
The kernel “globalReadEach()” reads/writes one float from/to global mem per thread.
The kernel “globalReadNone()” does not read/write global mem at all.

Here’s the code:

static __device__ void calc(float* erg);       // forward declarations so the
__global__ void globalReadEach(float* global); // kernels and helper can be
__global__ void globalReadNone(float* global); // called before their definitions

void globalReadTest() { // method to test the speed of access to shared and global memory

int blockNum=16;
int threadNum=32;

CUT_DEVICE_INIT();

float* data;
cudaMalloc((void**)&data, 1024*1024*10); // allocate 10 MB of global mem

int iter=2000; // number of iterations for the test

unsigned int timer = 0;
CUT_SAFE_CALL( cutCreateTimer( &timer));
CUT_SAFE_CALL( cutStartTimer( timer));
for (int u=0; u<iter; u++) { // within this loop, a kernel is invoked that
                             // accesses global mem in each thread
globalReadEach<<<blockNum, threadNum, sizeof(float)*threadNum>>>(data);
}
CUT_SAFE_CALL( cutStopTimer( timer));
float timerDiff=cutGetTimerValue( timer);
CUT_SAFE_CALL( cutDeleteTimer( timer));

unsigned int timer2 = 0;
CUT_SAFE_CALL( cutCreateTimer( &timer2));
CUT_SAFE_CALL( cutStartTimer( timer2));
for (int u=0; u<iter; u++) { // within this loop, a kernel is invoked that
                             // does not access global mem at all
globalReadNone<<<blockNum, threadNum, sizeof(float)*threadNum>>>(data);
}
CUT_SAFE_CALL( cutStopTimer( timer2));
float timerDiff2=cutGetTimerValue( timer2);
CUT_SAFE_CALL( cutDeleteTimer( timer2));

cudaFree(data); // cudaFree takes the device pointer itself, not its address

printf("Duration globalReadEach: %f\nDuration globalReadNone: %f\nA small difference in time might be caused by the order of execution!\n", timerDiff, timerDiff2);

}

__global__ void globalReadEach(float* global) { // one read and one write access to global mem
extern __shared__ float shared[];

int myPos = blockDim.x * blockIdx.x + threadIdx.x; // coalesced position for global mem access

shared[ threadIdx.x ] = global[ myPos ]; // read from global mem

__syncthreads();

float erg=0;
calc(&erg); // some calculations
global[myPos] = erg; // write to global mem

}

__global__ void globalReadNone(float* global) { // no global mem access
extern __shared__ float shared[];

int myPos = blockDim.x * blockIdx.x + threadIdx.x;
shared[ threadIdx.x ] = myPos; // fill shared mem without touching global mem

__syncthreads();

float erg=0;
calc(&erg); // some calculations

}

static __device__ void calc(float* erg) { // just some calculations to generate work
extern __shared__ float shared[];
for (int b=0; b<1; b++) {
for (int i=0; i < 3*blockDim.x; i++ ) {
*erg = sin(shared[i % blockDim.x]) * cos(shared[i % blockDim.x]) / i / sin((float)b);
}
}
}

Add a cudaThreadSynchronize() before the cutStopTimer call. Kernel launches are asynchronous; I doubt you’re timing the kernels themselves at all, just the time it takes to launch them.

(shared memory is waaaaaay faster than global memory)
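If you’d rather not sprinkle synchronize calls around, you can also time with CUDA events, which are recorded in the GPU’s own command stream. A rough sketch of the idea (my own untested illustration, reusing iter, blockNum, threadNum, and data from your code, in place of the cutil timers):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);               // recorded in the GPU command stream
for (int u=0; u<iter; u++) {
    globalReadEach<<<blockNum, threadNum, sizeof(float)*threadNum>>>(data);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);              // wait until all work before 'stop' finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
printf("globalReadEach: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);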

hi & thx for your fast reply.

I added the cudaThreadSynchronize() to my loops:

for (int u=0; u<iter; u++) {
globalReadEach<<<blockNum, threadNum, 4*threadNum>>>(data);
cudaThreadSynchronize();
}

for (int u=0; u<iter; u++) {
globalReadNone<<<blockNum, threadNum, 4*threadNum>>>(data);
cudaThreadSynchronize();
}

and tested different iteration counts, numbers of blocks and threads, and amounts of calculation. The difference, however, was never more than 1%!
I use an 8800 Ultra & 2 Xeons (2 cores each; 3.2 GHz).

Any suggestions? If someone doesn’t trust my results, please try the code on your own.

I haven’t tried the code, but there’s an important point to understand about device memory access.

Device memory reads are not slow! They just take a long time!

The above is not a joke… it’s an important behavior to understand. The “slow” part of global memory reads and writes is latency. The exact amount is mostly GPU specific, but 200-300 clocks is reasonable. But that huge delay between requesting and receiving the memory doesn’t have to cost you anything, because your thread is suspended while it waits for the reply, so no compute cycles are wasted. This is part of the excellent design of Nvidia’s modern GPUs… the latency is unavoidable, but the easy, lightweight, and massive thread parallelism of CUDA makes it possible to hide it. So your reading thread is sleeping, waiting for its answer, but meanwhile your other warps are chugging away, doing math. If you have enough computation to keep your cores busy, you’ll never notice the latency!
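If you want to see latency hiding in action, here’s the kind of experiment I’d suggest (a sketch I’m making up for illustration, not tested code): launch the same memory-bound, grid-stride kernel with one block and then with many, and time each launch. The total memory traffic is identical every time; only the number of warps available to hide the latency changes.

#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride kernel: total work is fixed at n elements no matter how many
// blocks are launched, so only the amount of available parallelism changes.
__global__ void touchAll(const float* in, float* out, int n) {
    for (int i = blockDim.x * blockIdx.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        out[i] = in[i] * 2.0f;   // one global read, one global write per element
}

int main() {
    const int n = 1 << 22;       // 4M floats, 16 MB each way
    float *in, *out;
    cudaMalloc((void**)&in,  n * sizeof(float));
    cudaMalloc((void**)&out, n * sizeof(float));

    const int threads = 256;
    for (int blocks = 1; blocks <= 512; blocks *= 8) {  // 1, 8, 64, 512 blocks
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start, 0);
        touchAll<<<blocks, threads>>>(in, out, n);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%4d blocks: %8.3f ms\n", blocks, ms);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    cudaFree(in);
    cudaFree(out);
    return 0;
}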

Admittedly there’s another complication with device memory access, coalesced reads and writes, and that’s a very important issue too, and one that often isn’t hidden for you. But understanding the latency of global memory, and how that latency is hidden, is perhaps even more important. [The programming guide concentrates on the coalesced access rules… but you have to remember about latency yourself.]
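For reference, coalescing is purely about the access pattern; a toy example of what I mean (my sketch, not from the guide):

// Two kernels doing identical work with different access patterns.
__global__ void coalescedAccess(float* d) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    d[i] += 1.0f;   // thread k touches element k: neighboring threads
                    // read neighboring addresses, one combined transaction
}

__global__ void stridedAccess(float* d, int stride) {
    int i = (blockDim.x * blockIdx.x + threadIdx.x) * stride;
    d[i] += 1.0f;   // with stride > 1 the half-warp's accesses can no longer
                    // be combined into one transaction on G80-class hardware
}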

So if you try to measure the “speed” of global memory, you have to decide what you’re measuring. If you write a benchmark using a single thread and a single block, you strip out all latency hiding (and all coalescing issues) and you’ll see painfully slow device memory access, since the latency becomes your entire bottleneck.
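A sketch of what such a raw-latency measurement could look like (again my own illustration, not tested): a single thread chasing pointers, so every read has to wait for the previous one to complete.

#include <cstdio>
#include <cuda_runtime.h>

// One thread follows a chain of indices; each read depends on the previous
// one, so no latency can be hidden: you measure pure round-trip time.
__global__ void chase(const int* next, int steps, int* result) {
    int p = 0;
    for (int s = 0; s < steps; s++)
        p = next[p];
    *result = p;   // write the result so the loop isn't optimized away
}

int main() {
    const int n = 1 << 20;
    const int steps = 1 << 20;

    // Build a chain with a large stride so successive reads aren't adjacent.
    int* h_next = new int[n];
    for (int i = 0; i < n; i++)
        h_next[i] = (i + 9973) % n;   // arbitrary odd stride, covers all of n

    int *d_next, *d_result;
    cudaMalloc((void**)&d_next, n * sizeof(int));
    cudaMalloc((void**)&d_result, sizeof(int));
    cudaMemcpy(d_next, h_next, n * sizeof(int), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    chase<<<1, 1>>>(d_next, steps, d_result);   // one thread, one block
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%.1f ns per dependent read\n", ms * 1e6f / steps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_next);
    cudaFree(d_result);
    delete[] h_next;
    return 0;
}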

thank you very much.
I’m just surprised that the latency can be hidden when I have only 32 threads and 16 blocks on an 8800 Ultra.

What happens when I try to read a regular linear array from global memory in an overlapping manner? As in: block 1 reads elements 0 to 512, block 2 reads 256 to 768, etc. Does the latency also get hidden with many threads and blocks?
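For concreteness, the pattern I have in mind would look roughly like this (just a sketch):

// Each block loads a 512-float window into shared memory; the windows of
// neighboring blocks overlap by 256 elements.
__global__ void overlappedLoad(const float* global, int n) {
    __shared__ float tile[512];
    int base = blockIdx.x * 256;        // block 0: 0..511, block 1: 256..767, ...
    for (int i = threadIdx.x; i < 512; i += blockDim.x) {
        if (base + i < n)
            tile[i] = global[base + i]; // adjacent threads still read adjacent
    }                                   // addresses, so the reads coalesce;
    __syncthreads();                    // the overlap just means some data is
                                        // fetched twice, by two blocks
    // ... work on tile[] ...
}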

This topic has been discussed many times. Please look at the CUDA Optimization slides from the presentation at Supercomputing 2007.

Paulius