Effective global memory bandwidth?

I’m trying to reach the “global memory bandwidth” on a GeForce 8800 GTX using CUDA, by transferring data from the global memory (768 MB) to the processors.

The documentation says “86.4 GB/s”, which matches 1.8 GHz * 6 * 64 bits / 8 bits per byte nicely.

I’ve succeeded in getting “42.4 GB/s” by loading 128-bit aligned data from 160 threads times 16 blocks, but I’m stuck there (see attachment… well, no, for some obscure reason the forum rejects *.cu and *.c files).

That’s not bad, but it’s almost exactly half the expected bandwidth…

So either:

  • I’m missing something…

  • the “advertised” bandwidth is to be understood half up, half down, in which case I’m already there.

The other related issue is that ISTM that there is no “preload” possible, i.e. a global memory access is basically synchronous.

float4 f = data[tid];

...

There is no way to do something else while “f” is being loaded, even if the variable is not used immediately, possibly because the processors are very simple and do not implement such logic. The latency can only be hidden by other threads.

Thanks for any comment.

32-bit coalesced loads are faster than 128-bit ones in my testing. I achieve ~70 GB/s reading, and the same speed writing (in a separate kernel). Texture reads with good locality can also approach ~70 GB/s.
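For illustration, a minimal sketch of the coalesced 32-bit access pattern being described (written as a copy kernel; the kernel name and launch configuration are made up here, not the actual test code):

__global__ void copy32(const float* src, float* dst, int n)
{
    // consecutive threads touch consecutive 32-bit words, so each
    // half-warp's accesses coalesce into a single memory transaction
    int stride = blockDim.x * gridDim.x;   // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        dst[i] = src[i];
}

// launched e.g. as: copy32<<<16, 256>>>(d_src, d_dst, n);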

Indeed, I now get 66 GB/s with 32-bit aligned data.

Thanks a lot for the tip!

Although it required 512 threads per 16 blocks, which does not leave a lot of registers available… With a more realistic 256 threads I’m back to 40 GB/s.

Any idea about whether memory accesses are synchronous as they seem?

Thanks again.

You can have each thread read several values, maintaining coalescing, of course. In many cases it helps to cluster the global memory reads together. A global memory read is non-blocking in the sense that subsequent instructions are still issued for execution as long as they don’t require the value being read. So, by having a cluster of independent reads you effectively pipeline the memory operations.
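As a sketch of what that clustering could look like in .cu (illustrative only, assuming a simple copy loop; names are made up):

__global__ void copy2(const float* src, float* dst, int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    // assumes n is a multiple of 2*stride for simplicity
    for (int i = tid; i < n; i += 2 * stride) {
        float a = src[i];            // load #1 issued
        float b = src[i + stride];   // load #2 issued before 'a' is needed
        dst[i]          = a;         // first use of 'a' (this is where a stall can occur)
        dst[i + stride] = b;
    }
}

Because the second load does not depend on the first, both can be in flight at the same time.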

Let me know if this improves your observed throughput.

Paulius

That looks great, but I did not observe much benefit by trying things in that direction, say by switching from:

while (...)
{
   data d = array[tid];
   array += step; // step = nthreads*nblocks
   // use d...
}

to:

data dn = array[tid];
array += step;
while (...)
{
   data d = dn;
   dn = array[tid];
   array += step;
   // use d...
}

But I could not get better performance with that. It increases the register pressure, so I have to reduce the number of threads, and overall performance is lower.

I understand from your point that the global memory reads are not “synchronous”, but I cannot say that I have seen actual effects of that in my various experiments.

Or maybe I did not quite understand your suggestion?

So not really, but I guess that with 64-bit alignment and over 50 GB/s throughput (out of the theoretical 86 GB/s of the hardware) I’m about as far as I can go with my code.

Thanks anyway for the suggestion!

You might see a difference going from each thread reading one value, to each thread reading two (or more) values.

I’d also suggest checking out the bandwidth sample in the SDK. It measures various memory transfers.
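For reference, the device-to-device part of such a measurement could look roughly like this (only a sketch using the cutil timer, not the SDK sample itself; buffer size and iteration count are arbitrary):

unsigned int nbytes = 32 * 1024 * 1024;   // 32 MB per buffer; contents don't matter
int iters = 100;
float *d_src, *d_dst;
CUDA_SAFE_CALL(cudaMalloc((void**) &d_src, nbytes));
CUDA_SAFE_CALL(cudaMalloc((void**) &d_dst, nbytes));

unsigned int timer = 0;
CUT_SAFE_CALL(cutCreateTimer(&timer));
CUT_SAFE_CALL(cutStartTimer(timer));
for (int t = 0; t < iters; t++)
    CUDA_SAFE_CALL(cudaMemcpy(d_dst, d_src, nbytes, cudaMemcpyDeviceToDevice));
CUDA_SAFE_CALL(cudaThreadSynchronize());   // make sure all copies have finished
CUT_SAFE_CALL(cutStopTimer(timer));

// each copy reads nbytes and writes nbytes; cutGetTimerValue() returns milliseconds
float gbps = 2.0f * nbytes * iters / (cutGetTimerValue(timer) * 1.0e6f);
printf("device-to-device: %f GB/s\n", gbps);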

Paulius

With this code snippet I can get 104 GB/s throughput (99%).

I never reached a comparable exploitation of the hardware’s performance on a CPU:

======================================

#define THREAD_NUM 128

unsigned int num_threads = THREAD_NUM;

unsigned int nnx = 5*1024;

unsigned int nny = 5*1024;

unsigned int grid_ny = 5*1024;

int tend=100;

unsigned int i,j,k,t;

// nx, ny each include a ghost layer
// the computational domain runs from 1..nx-2, 1..ny-2

unsigned int nx=nnx;

unsigned int ny=nny;

unsigned int size_Mat = nx*ny;

unsigned int mem_size_Mat = sizeof(float) * size_Mat;

float ftime,durchsatz; 

printf("threads: %d\n", num_threads);

printf("grid: %d %d\n", 1 , grid_ny);

printf("loop iterations: %d %d\n", nnx/num_threads, nny/grid_ny);

printf("nx,ny: %d %d\n", nx,ny);

// 2 * mass fractions (fneu+falt) on the device, stored in one vector with k = j*nx+i

float* fr0;

float* fr1;	

// pointers for swapping falt <--> fneu

float* frNeu;

float* frAlt;	

// memory on host

// Geomat on the host
// 1 * mass fractions (fneu+falt) on the host, stored in one vector with k = j*nx+i

float* frH  = (float*) malloc(mem_size_Mat);

//CUT_CHECK_DEVICE();

// initialize host memory

for(j=0 ; j< ny ; j++){

	for(i=0 ; i< nx ; i++){

		k = nx*j+i;

		frH[k]=1.0E12;

	}

}

// allocate device memory

CUDA_SAFE_CALL(cudaMalloc((void**) &fr0, mem_size_Mat));

CUDA_SAFE_CALL(cudaMalloc((void**) &fr1, mem_size_Mat));



// copy host memory to device

CUDA_SAFE_CALL(cudaMemcpy(fr0,  frH,  mem_size_Mat,  cudaMemcpyHostToDevice) );

CUDA_SAFE_CALL(cudaMemcpy(fr1,  frH,  mem_size_Mat,  cudaMemcpyHostToDevice) );

// setup execution parameters

dim3 threads(num_threads, 1, 1);

dim3 grid(1, grid_ny);

unsigned int timer = 0;

CUT_SAFE_CALL(cutCreateTimer(&timer));

CUT_SAFE_CALL(cutStartTimer(timer));

for(t=0;t<=tend;t++){

	// set the pointers (swap old/new)

	if(t%2==0){

		frAlt=fr0;

		frNeu=fr1;

	}

	else{

		frAlt=fr1;

		frNeu=fr0;

	}

	LBKernel2<<< grid, threads >>> ( 0, 0,nx,ny,frAlt,frNeu);

    // check if kernel execution generated an error

	CUT_CHECK_ERROR("Kernel execution failed");

}

CUT_SAFE_CALL(cutStopTimer(timer));

//printf("Processing time: %f (ms)\n", cutGetTimerValue(timer));

ftime = cutGetTimerValue(timer);

durchsatz  = 2*4 * 1000.0 * t * (nx-2) * (ny-2) / (ftime*1.0E9);

printf("Processing time: %f (ms)\n", ftime);

printf("Throughput in GB/s: %f\n", durchsatz);

CUT_SAFE_CALL(cutDeleteTimer(timer));

////////////////////////////////////////////////////////////////////////////////

__global__ void LBKernel2( int xoff, int yoff, int nx, int ny, float* f0, float* f1)

{

// access to the arrays on the device

int  i,j,k;

// number of elements of the vector for thread processing

int num_threads = blockDim.x;

// number of grid elements

// int grid_nx = gridDim.x;	

int grid_ny = gridDim.y;	

// thread index = local i index

int tx = threadIdx.x;

// Block index y

int by = blockIdx.y;

// global x index

int xStart = tx;

// global y index

int yStart = by;



// if(tx==0) printf("Tid=0, xtart,ystart: %d %d\n",xStart,yStart);

// printf("xtart,ystart: %d %d\n",xStart,yStart);

float ftmp;

for(j=yStart; j<ny; j+=grid_ny){

	for(i=xStart; i<nx; i+=num_threads){

		// printf("Tid, i, j: %d %d %d\n",tx,i,j);

		// load data from device memory into a register

		k = nx*j + i;

		ftmp=0.5*f0[k];

		f1[k]=ftmp; 

	}

}

}

jtoelke,

I’m not entirely sure if this applies to your problem but shouldn’t there be a call to cudaThreadSynchronize() before you measure the time with cutStopTimer() to ensure that the kernel actually finished?

Indeed, I guess it should even be within the t loop, otherwise I’m not quite sure about what is computed, as concurrent kernels operate on the same data (even with the flip-flop).

With this code and the added host/device synchronization, I’m down to 70 GB/s (from 85 GB/s) on my hardware.
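For reference, a sketch of where the synchronization could go in the timing loop from the posted snippet (keeping the pointer flip as before):

for(t=0;t<=tend;t++){
	// pointer flip as before ...
	LBKernel2<<< grid, threads >>> ( 0, 0,nx,ny,frAlt,frNeu);
	CUT_CHECK_ERROR("Kernel execution failed");
	// wait for this launch to finish before flipping the pointers again
	CUDA_SAFE_CALL(cudaThreadSynchronize());
}
CUT_SAFE_CALL(cutStopTimer(timer));   // the timer now covers completed work only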

Really? But I thought that when leaving a kernel function, the kernel is definitely finished! And I call the kernel 100 times within a for loop. I just used the measurements as done in the examples… Did I do something wrong?

Section 4.2.1.4 of the “CUDA Programming Guide 1.0” covers this: kernel launches are asynchronous, so control returns to the host before the device has completed the work.

See also Section 4.5.1.5, entitled “Asynchronicity”, in the same document, which discusses cudaThreadSynchronize().

So it seems that what happens with your code is unclear.

Uff, I did not read the manual properly. Thank you!

Actually, CUT_CHECK_ERROR() already does a synchronize before it tells you if there’s been an error.

wow… incredibly impressive.

So I guess the lesson here is that you should alternate reads and writes?

Strange, that seems the exact opposite of MisterAnderson’s advice. Or do you mean that you should repeat the loads several times consecutively? (float4 = a[tid] vs float = a[tid]; float = a[tid+256]; float = a[tid+512]; float = a[tid+768])

Tech articles say memory accesses run in a different thread in another part of the chip at 1/2 the frequency of the ALUs. GPUBench clearly shows that fetching and processing can overlap for OpenGL shaders. The PTX manual also tells you to place ld instructions as early as possible.

But whether this works for .cu is a whole other question. One thing I noticed is that the .cu compiler likes to assign each ld a different register in the PTX it produces. So it seems it’s then up to ptxas to order things properly. It’s a shame we can’t access cubin directly, but has anyone tried writing PTX to test overlap? Has anyone succeeded in writing .cu that overlaps? And a related question I’d really like to know the answer to: is it possible to get the bandwidth benefits of 512 threads just by issuing the ld early enough?

P.S. Another thought… is it possible that 104 GB/s is achieved because jtoelke’s memcpy kernel somehow executes entirely on the fetch cores and doesn’t travel to the ALUs?

Only if _DEBUG is defined, otherwise it is an empty macro, at least in my version of the CUDA SDK.

Oh :( I see. Actually, wow, I should have realized this before. I’d been doing emulation in the Debug config and hardware runs in Release, thinking that was logical. Now it’s obvious why I never receive an error when the kernel fails. That was really bothering me, actually.
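As an aside, a sketch of an error check that also works in Release builds (this is not the SDK macro, just an illustration):

// synchronize, then query the last error explicitly; prints a message on failure
void checkKernel(const char* msg)
{
	cudaError_t err = cudaThreadSynchronize();
	if (err == cudaSuccess)
		err = cudaGetLastError();
	if (err != cudaSuccess)
		printf("%s: %s\n", msg, cudaGetErrorString(err));
}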

OK, I performed the tests on my hardware: an 8600 GT with memory at 1.6 GHz. The card also functions as the graphics adapter. Windows 2k3, Release config.

Theoretical: 25.60 GB/s

Code snippet as posted: 23.49 GB/s (92%)

One threadsynchronize before stopping timer: 18.38 GB/s (72%)

Doubling the loop iterations: still, 18.38 GB/s

no sync, double iterations: 21 GB/s (82%)

no sync, half the iterations: 41 GB/s (160%)

yup… jtoelke just happened to pick a very lucky number.

Followup:

tried paulius’ advice, but it seems to have the opposite effect.

doubling-up the fetches: 18.38 GB/s (same time as no doubling)

four fetches: 17.33 GB/s

eight fetches: 13.78 GB/s

//CASE 1:

for(i=xStart; i<nx; i+=2*num_threads){

	k = nx*j + i;

	ftmp0=0.5*f0[k+0*num_threads];

	ftmp1=0.5*f0[k+1*num_threads];

	f1[k+0*num_threads]=ftmp0;

	f1[k+1*num_threads]=ftmp1;

}

//CASE 2:

for(i=xStart; i<nx; i+=4*num_threads){

	k = nx*j + i;

	ftmp0=0.5*f0[k+0*num_threads];

	ftmp1=0.5*f0[k+1*num_threads];

	ftmp2=0.5*f0[k+2*num_threads];

	ftmp3=0.5*f0[k+3*num_threads];

	f1[k+0*num_threads]=ftmp0;

	f1[k+1*num_threads]=ftmp1;

	f1[k+2*num_threads]=ftmp2;

	f1[k+3*num_threads]=ftmp3;

}

Thanks a lot, I have now altered my code and on my overclocked Ultra (104 GB/s max) I get 86 GB/s throughput, so 83% of the peak performance… nevertheless I’m very happy with this.

I’ll look at this when I get a moment. The idea is that multiple fetches from a thread can be overlapped (pipelined) as long as they are independent. That does increase register usage, since you need separate registers for each result. That, in turn, can reduce occupancy, which would reduce latency hiding.

Do you by any chance have register counts for the kernel cases above? Also, can you check what kind of occupancy you achieve during execution?
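If it helps: assuming the usual toolchain of this generation, register counts can be checked at compile time, for example (file name is just a placeholder):

nvcc -cubin kernel.cu      // the generated .cubin lists per-kernel usage, e.g. "reg = ..."
nvcc -Xptxas -v kernel.cu  // ptxas prints register/lmem/smem usage per kernel

Occupancy can then be estimated with the CUDA Occupancy Calculator spreadsheet.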

Paulius