Effective global memory bandwidth?

I’m trying to reach the “global memory bandwidth” on a GeForce 8800 GTX using CUDA, by transferring data from the global memory (768 MB) to the processors.

The documentation says “86.4 GB/s”, which plays very nicely with 1.8 GHz * 6*64 bits / 8 bits/Byte.

I’ve succeeded in getting “42.4 GB/s” by loading 128 bit aligned data from 160 threads times 16 blocks, but I’m stuck there (see attachment… well, no, for some obscure reason the forum rejects *.cu and *.c files).

That’s not bad, but it’s almost exactly half the expected bandwidth…


  • I’m missing something…

  • the “advertised” bandwidth is to be understood as half up, half down, in which case I’m already there.

The other related issue is that ISTM there is no “preload” possible, i.e. a global memory access is basically synchronous.

float4 f = data[tid];


There is no way to do something else while “f” is being loaded, even if the variable is not used immediately, possibly because the processors are very simple and do not implement such logic. The latency can only be hidden by other threads.

Thanks for any comment.

32-bit coalesced loads are faster than 128-bit ones in my testing. I achieve ~70 GB/s reading, and the same speed writing (in a separate kernel). Texture reads with good locality can also approach ~70 GB/s.

Indeed, I now get 66 GB/s with 32-bit aligned data.

Thanks a lot for the tip! :))

Although it required 512 threads in each of 16 blocks, which does not leave a lot of registers available… With a more realistic 256 threads I’m back to 40 GB/s.

Any idea about whether memory accesses are synchronous as they seem?

Thanks again.

You can have each thread read several values (maintaining coalescing, of course). In many cases it helps to cluster the global memory reads together. A global memory read is non-blocking in the sense that succeeding instructions are issued for execution as long as they don’t require the value being read. So, by having a cluster of reads you effectively pipeline the memory operations.

Let me know if this improves your observed throughput.


That looks great, but I did not observe much benefit by trying things in that direction, say by switching:

while (...) {
    data d = array[tid];
    array += step; // nthreads*nblocks
    // use d...
}

to:

data dn = array[tid];
array += step;
while (...) {
    data d = dn;
    dn = array[tid];
    array += step;
    // use d...
}

But I could not get better performance that way: it increases the register pressure, so I have to reduce the number of threads, and overall performance is lower.

I understand from your point that global memory reads are not “synchronous”, but I cannot say that I have seen actual effects of that in my various experiments.

Or maybe I did not quite understand your suggestion?

So not really, but I guess that with 64-bit alignment and over 50 GB/s throughput (against the theoretical 86 GB/s of the hardware) I’m about as far as I can go with my code.

Thanks anyway for the suggestion!

You might see a difference going from each thread reading one value, to each thread reading two (or more) values.

I’d also suggest checking out the bandwidth sample in the SDK. It measures various memory transfers.


With this code snippet I can get 104 GB/s throughput (99%).

I have never achieved a comparable exploitation of the hardware’s peak performance on a CPU:



unsigned int num_threads = THREAD_NUM;
unsigned int nnx = 5*1024;
unsigned int nny = 5*1024;
unsigned int grid_ny = 5*1024;
int tend = 100;
unsigned int i,j,k,t;
// nx, ny each include a ghost layer
// the computational domain runs from 1..nx-2, 1..ny-2
unsigned int nx = nnx;
unsigned int ny = nny;
unsigned int size_Mat = nx*ny;
unsigned int mem_size_Mat = sizeof(float) * size_Mat;
float ftime, durchsatz;
printf("threads: %d\n", num_threads);
printf("grid: %d %d\n", 1, grid_ny);
printf("schleifendurchlaeufe: %d %d\n", nnx/num_threads, nny/grid_ny);  // loop iterations
printf("nx,ny: %d %d\n", nx, ny);
// 2 * mass fractions (f_new + f_old) on the device, stored in one vector, k = j*nx+i
float* fr0;
float* fr1;
// pointers for swapping f_old <--> f_new
float* frNeu;
float* frAlt;
// memory on host
// geometry matrix on the host
// 1 * mass fractions on the host, stored in one vector, k = j*nx+i
float* frH = (float*) malloc(mem_size_Mat);


// initialize host memory
for(j=0; j<ny; j++){
	for(i=0; i<nx; i++){
		k = nx*j + i;
		// ... (initialization of frH[k] omitted)
	}
}
// allocate device memory
CUDA_SAFE_CALL(cudaMalloc((void**) &fr0, mem_size_Mat));
CUDA_SAFE_CALL(cudaMalloc((void**) &fr1, mem_size_Mat));
// copy host memory to device
CUDA_SAFE_CALL(cudaMemcpy(fr0, frH, mem_size_Mat, cudaMemcpyHostToDevice));
CUDA_SAFE_CALL(cudaMemcpy(fr1, frH, mem_size_Mat, cudaMemcpyHostToDevice));
// setup execution parameters
dim3 threads(num_threads, 1, 1);
dim3 grid(1, grid_ny);
unsigned int timer = 0;




CUT_SAFE_CALL(cutCreateTimer(&timer));
CUT_SAFE_CALL(cutStartTimer(timer));
for(t=0; t<tend; t++){
	// set the pointers (flip-flop f_old <--> f_new)
	frAlt = (t % 2 == 0) ? fr0 : fr1;
	frNeu = (t % 2 == 0) ? fr1 : fr0;

	LBKernel2<<< grid, threads >>>( 0, 0, nx, ny, frAlt, frNeu);
	// check if kernel execution generated an error
	CUT_CHECK_ERROR("Kernel execution failed");
}



//printf("Processing time: %f (ms)\n", cutGetTimerValue(timer));

ftime = cutGetTimerValue(timer);

durchsatz  = 2*4 * 1000.0 * t * (nx-2) * (ny-2) / (ftime*1.0E9);

printf("Processing time: %f (ms)\n", ftime);

printf("Durchsatz in GB/sec: %f\n", durchsatz);



__global__ void LBKernel2(int xoff, int yoff, int nx, int ny, float* f0, float* f1)
{
	// indices for array accesses on the device
	int i, j, k;
	// number of threads per block
	int num_threads = blockDim.x;
	// number of grid blocks
	// int grid_nx = gridDim.x;
	int grid_ny = gridDim.y;
	// thread index = local i index
	int tx = threadIdx.x;
	// block index y
	int by = blockIdx.y;
	// global x index
	int xStart = tx;
	// global y index
	int yStart = by;
	// if(tx==0) printf("Tid=0, xStart,yStart: %d %d\n",xStart,yStart);
	// printf("xStart,yStart: %d %d\n",xStart,yStart);
	float ftmp;
	for(j=yStart; j<ny; j+=grid_ny){
		for(i=xStart; i<nx; i+=num_threads){
			// printf("Tid, i, j: %d %d %d\n",tx,i,j);
			// load one value from device memory and write it back out
			k = nx*j + i;
			ftmp = f0[k];
			f1[k] = ftmp;
		}
	}
}

I’m not entirely sure if this applies to your problem but shouldn’t there be a call to cudaThreadSynchronize() before you measure the time with cutStopTimer() to ensure that the kernel actually finished?

Indeed, I guess it should even be within the [font=“Courier”]t[/font] loop, otherwise I’m not quite sure about what is computed as concurrent kernels operate on the same data (even with the flip-flop).

With this code and the added host/device synchronization, I’m down to 70 GB/s (from 85 GB/s) on my hardware.

Really? But I thought that when a kernel function returns, the kernel has definitely finished! And I call the kernel 100 times within a for loop. I just did the measurements as in the examples… Did I do something wrong?

Section of the “CUDA Programming Guide 1.0” says:

See also the section entitled “Asynchronicity” in the same document, which discusses cudaThreadSynchronize().

So it seems that what happens with your code is unclear.

Oof, I did not read the manual properly. Thank you!

actually, CUT_CHECK_ERROR() already does a synchronize before it tells you if there’s been an error.

wow… incredibly impressive.

so i guess the lesson here is that you should alternate reads and writes?

Strange, that seems the exact opposite of MisterAnderson’s advice. Or do you mean that you should repeat the loads several times consecutively? (float4 f = a[tid] vs float f0 = a[tid]; float f1 = a[tid+256]; float f2 = a[tid+512]; float f3 = a[tid+768])

Tech articles say memory accesses run in a different thread in another part of the chip at 1/2 the frequency of the ALUs. GPUBench clearly shows that fetching and processing can overlap for OpenGL shaders. The PTX manual also tells you to place ld instructions as early as possible.

But whether this works for .cu is a whole other question. One thing I noticed is that the .cu compiler likes to assign each ld a different register in the PTX it produces. So it seems it’s then up to ptxas to order things properly. It’s a shame we can’t access cubin directly, but has anyone tried writing PTX to test overlap? Has anyone succeeded in writing .cu that overlaps? And a related question I’d really like to know the answer to: is it possible to get the bandwidth benefits of 512 threads just by issuing the ld early enough?

p.s. another thought… is it possible that 104 GB/s is achieved because jtoelke’s memcpy kernel somehow executes entirely on the fetch units and never touches the ALUs?

Only if [font=“Courier”]_DEBUG[/font] is defined, otherwise it is an empty macro, at least in my version of the CUDA SDK.

Oh :( I see. Actually, wow. I should have realized this before. I’d been doing emulation runs in the Debug config and hardware runs in Release, thinking that was logical. Now it’s obvious why I never got an error when a kernel failed. That was really bothering me, actually.

Ok, performed the tests on my hardware, 8600GT with memory at 1.6GHz. Card also functions as graphics adapter. Windows 2k3. Release config.

Theoretical: 25.60 GB/s

Code snippet as posted: 23.49 GB/s (92%)

One cudaThreadSynchronize() before stopping the timer: 18.38 GB/s (72%)

Doubling the loop iterations: still 18.38 GB/s

no sync, double iterations: 21 GB/s (82%)

no sync, half the iterations: 41 GB/s (160%)

yup… jtoelke just happened to pick a very lucky number.


Tried paulius’ advice, but it seems to have the opposite effect.

doubling-up the fetches: 18.38 GB/s (same time as no doubling)

four fetches: 17.33 GB/s

eight fetches: 13.78 GB/s

//CASE 1: two independent fetches per thread
for(i=xStart; i<nx; i+=2*num_threads){
	k = nx*j + i;
	float f_a = f0[k];
	float f_b = f0[k+num_threads];
	f1[k] = f_a;
	f1[k+num_threads] = f_b;
}

//CASE 2: four independent fetches per thread
for(i=xStart; i<nx; i+=4*num_threads){
	k = nx*j + i;
	float f_a = f0[k];
	float f_b = f0[k+num_threads];
	float f_c = f0[k+2*num_threads];
	float f_d = f0[k+3*num_threads];
	f1[k] = f_a;
	f1[k+num_threads] = f_b;
	f1[k+2*num_threads] = f_c;
	f1[k+3*num_threads] = f_d;
}




Thanks a lot, I have now altered my code and on my overclocked Ultra (104 GB/sec max) I get 86 GB/s throughput, so 83% of the peak performance… nevertheless I’m very happy with this.

I’ll look at this when I get a moment. The idea is that multiple fetches from a thread can be overlapped (pipelined) as long as they are independent. That does increase register usage, since you need a separate register for each result. That, in turn, can reduce occupancy, which would reduce latency hiding.

Do you by any chance have register counts for the kernel cases above? Also, can you check what kind of occupancy you achieve during execution?