Maximum bandwidth?

Hi,

  1. I have an 8800GTX and I would like to know the maximum bandwidth I can achieve.

  2. Actually, I’d like to measure this bandwidth to see if I can speed up my kernel.
    So, I have written a simple kernel where each thread writes 3 times to global memory and reads 2 times from global memory.
    I know that the term “GB” is ambiguous. Is 1 GB = 1000^3 bytes or 1 GB = 1024^3 bytes?

Thanks,
Vince

Hi,

The maximum bandwidth is 86.4 GB/s; I have achieved 75 GB/s so far, where 1 GB = 1000^3 bytes.

To measure performance, simply count the accesses to memory (no matter whether read or write), multiply by 4 (that's for the float), and divide by the seconds taken.

If your kernel is too fast, then simply run the same kernel again and again and multiply by the number of iterations.

Hope that answers your question for a start?

Johannes

I do this

cudaEventRecord(start, 0);
for (int i = 0; i < IT; i++){
	myKernel<<<grid,threads>>>(tab_dev);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedTime, start, stop);

and I have an elapsed time of -1.#QNAN0 ms…

Is anything wrong in my code?

My kernel is the following

__global__ void myKernel(float* tab){
	unsigned int xIndex = blockIdx.x * NB + threadIdx.x;
	unsigned int yIndex = blockIdx.y * NB + threadIdx.y;
	tab[yIndex * NB + xIndex] = 1.0;
	tab[yIndex * NB + xIndex] = tab[yIndex * NB + xIndex] + 1;
	tab[yIndex * NB + xIndex] = tab[yIndex * NB + xIndex] + 1;
	tab[yIndex * NB + xIndex] = tab[yIndex * NB + xIndex] + 1;
	tab[yIndex * NB + xIndex] = tab[yIndex * NB + xIndex] + 1;
}

I’ve written a very small program and I get these results:

Nb bytes r&w : 92160000 Bytes

Elapsed time : 0.938219 ms

Bandwidth    : 98.228658 GB/s

My kernel (see below) does only 9 reads and writes. Is it possible to get this kind of bandwidth? Have I made a mistake?

#include <stdio.h>
#include "cuda.h"

#define NB 1600  // WIDTH AND HEIGHT
#define BS 16    // BLOCK SIZE
#define IT 1000  // NB OF ITERATIONS

__global__ void myKernel(float* tab){
	unsigned int xIndex = blockIdx.x * BS + threadIdx.x;
	unsigned int yIndex = blockIdx.y * BS + threadIdx.y;
	unsigned int l      = yIndex * NB + xIndex;
	tab[l] = 1.0;
	tab[l] = tab[l] + 1;
	tab[l] = tab[l] + 1;
	tab[l] = tab[l] + 1;
	tab[l] = tab[l] + 1;
}

int main(void){
	// My array
	float* tab_dev;
	cudaMalloc( (void **) &tab_dev, NB*NB*sizeof(float));

	// Bytes read and written
	int nbBytesReadAndWrite;

	// Grid and threads
	dim3 grid(NB/BS, NB/BS, 1);
	dim3 threads(BS, BS, 1);

	// Timer
	cudaEvent_t start, stop;
	float elapsedTimeInMs;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	// Loop with timer
	cudaEventRecord(start, 0);
	for (int i = 0; i < IT; i++){
		myKernel<<<grid,threads>>>(tab_dev);
	}
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);
	cudaEventElapsedTime(&elapsedTimeInMs, start, stop);
	cudaEventDestroy(start);
	cudaEventDestroy(stop);

	// Bandwidth computation
	nbBytesReadAndWrite  = 9*sizeof(float)*NB*NB;
	elapsedTimeInMs     /= IT;
	printf("Nb bytes r&w : %d Bytes\n", nbBytesReadAndWrite);
	printf("Elapsed time : %f ms\n", elapsedTimeInMs);
	printf("Bandwidth    : %f GB/s\n", nbBytesReadAndWrite/(elapsedTimeInMs*1000000));

	// Free memory
	cudaFree(tab_dev);
}

The compiler may be optimizing out the multiple reads/writes to tab[l]. You can check by compiling with the -ptx option and examining the PTX code.
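One way to test that hypothesis (a sketch, not benchmarked on this hardware): qualify the pointer as volatile, which tells the compiler it may not cache or eliminate accesses through it, so all 9 accesses must actually be issued:

```cuda
// Variant of the kernel above with a volatile pointer; the repeated
// reads and writes to tab[l] can no longer be collapsed by the compiler.
__global__ void myKernelVolatile(volatile float* tab){
	unsigned int xIndex = blockIdx.x * BS + threadIdx.x;
	unsigned int yIndex = blockIdx.y * BS + threadIdx.y;
	unsigned int l      = yIndex * NB + xIndex;
	tab[l] = 1.0;
	tab[l] = tab[l] + 1;
	tab[l] = tab[l] + 1;
	tab[l] = tab[l] + 1;
	tab[l] = tab[l] + 1;
}
```

If the measured bandwidth drops back below the 86.4 GB/s peak with this version, the compiler was indeed collapsing the accesses.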

To obtain the bandwidth of ~70 GiB/s in a benchmark, you need fully coalesced reads and writes of a 4-byte or 8-byte type and you need to copy a large enough chunk of data so that the overhead of launching the kernel doesn’t affect the results.
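For reference, a minimal fully coalesced kernel might look like this (a sketch; the names are made up, and the launch configuration must supply one thread per element):

```cuda
// Each thread copies one float; consecutive threads in a warp touch
// consecutive addresses, so reads and writes are fully coalesced.
__global__ void copyKernel(const float* in, float* out, int n){
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n)
		out[i] = in[i];
}
// When counting bytes for the bandwidth figure: each element is read
// once and written once, so bytes moved = 2 * n * sizeof(float).
```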

For the whole 9 yards of bandwidth reads/writes/copies, see my bandwidth test code near the bottom of this thread: http://forums.nvidia.com/index.php?showtopic=52806