Maximum bandwidth?

Hi,

  1. I have an 8800GTX and I would like to know the maximum bandwidth I can achieve.

  2. Actually, I’d like to measure this bandwidth to see if I can speed up my kernel.
    So, I have written a simple kernel where each thread writes 3 times to global memory and reads 2 times from global memory.
    I know that the term “GB” is ambiguous. Is 1 GB = 1000^3 bytes or 1 GB = 1024^3 bytes?

Thanks,
Vince

Hi,

The maximum bandwidth is 86.4 GB/s; I have achieved 75 GB/s so far, where 1 GB = 1000^3 bytes.

To measure performance, simply count the accesses to memory (no matter whether read or write), multiply by 4 (that's for the float), and divide by the seconds taken.

If your kernel is too fast, then simply run the same kernel again and again and multiply by the number of iterations.

Hope that answers your question for a start?

Johannes

I do this

cudaEventRecord(start, 0);
for (int i = 0; i < IT; i++){
	myKernel<<<grid,threads>>>(tab_dev);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedTime, start, stop);

and I have an elapsed time of -1.#QNAN0 ms…

Is anything wrong in my code?

My kernel is the following

__global__ void myKernel(float* tab){
	unsigned int xIndex = blockIdx.x * NB + threadIdx.x;
	unsigned int yIndex = blockIdx.y * NB + threadIdx.y;
	tab[yIndex * NB + xIndex] = 1.0;
	tab[yIndex * NB + xIndex] = tab[yIndex * NB + xIndex] + 1;
	tab[yIndex * NB + xIndex] = tab[yIndex * NB + xIndex] + 1;
	tab[yIndex * NB + xIndex] = tab[yIndex * NB + xIndex] + 1;
	tab[yIndex * NB + xIndex] = tab[yIndex * NB + xIndex] + 1;
}

I’ve written a very small program and I get these results:

Nb bytes r&w : 92160000 Bytes

Elapsed time : 0.938219 ms

Bandwidth    : 98.228658 GB/s

My kernel (see below) does only 9 reads and writes. Is it possible to get this kind of bandwidth? Have I made a mistake?

#include <stdio.h>
#include "cuda.h"

#define NB 1600  // WIDTH AND HEIGHT
#define BS 16    // BLOCK SIZE
#define IT 1000  // NB OF ITERATIONS

__global__ void myKernel(float* tab){
	unsigned int xIndex = blockIdx.x * BS + threadIdx.x;
	unsigned int yIndex = blockIdx.y * BS + threadIdx.y;
	unsigned int l      = yIndex * NB + xIndex;
	tab[l] = 1.0;
	tab[l] = tab[l] + 1;
	tab[l] = tab[l] + 1;
	tab[l] = tab[l] + 1;
	tab[l] = tab[l] + 1;
}

int main(void){
	// My array
	float* tab_dev;
	cudaMalloc( (void **) &tab_dev, NB*NB*sizeof(float));

	// Bytes read and written
	int nbBytesReadAndWrite;

	// Grid and threads
	dim3 grid(NB/BS, NB/BS, 1);
	dim3 threads(BS, BS, 1);

	// Timer
	cudaEvent_t start, stop;
	float elapsedTimeInMs;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	// Loop with timer
	cudaEventRecord(start, 0);
	for (int i = 0; i < IT; i++){
		myKernel<<<grid,threads>>>(tab_dev);
	}
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);
	cudaEventElapsedTime(&elapsedTimeInMs, start, stop);
	cudaEventDestroy(start);
	cudaEventDestroy(stop);

	// Bandwidth computation
	nbBytesReadAndWrite  = 9*sizeof(float)*NB*NB;
	elapsedTimeInMs     /= IT;
	printf("Nb bytes r&w : %d Bytes\n", nbBytesReadAndWrite);
	printf("Elapsed time : %f ms\n", elapsedTimeInMs);
	printf("Bandwidth    : %f GB/s\n", nbBytesReadAndWrite/(elapsedTimeInMs*1000000));

	// Free memory
	cudaFree(tab_dev);
}

The compiler may be optimizing out the multiple reads/writes to tab[l]. You can check by compiling with the -ptx option and examining the PTX code.
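One way to test that hypothesis (a sketch, not benchmarked on this hardware): qualify the pointer as volatile, which tells the compiler it may not cache or eliminate accesses through it, so all 9 accesses must actually be issued:

```cuda
// Variant of the kernel above with a volatile pointer; the repeated
// reads and writes to tab[l] can no longer be collapsed by the compiler.
__global__ void myKernelVolatile(volatile float* tab){
	unsigned int xIndex = blockIdx.x * BS + threadIdx.x;
	unsigned int yIndex = blockIdx.y * BS + threadIdx.y;
	unsigned int l      = yIndex * NB + xIndex;
	tab[l] = 1.0;
	tab[l] = tab[l] + 1;
	tab[l] = tab[l] + 1;
	tab[l] = tab[l] + 1;
	tab[l] = tab[l] + 1;
}
```

If the measured bandwidth drops back below the 86.4 GB/s peak with this version, the compiler was indeed collapsing the accesses.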

To obtain the bandwidth of ~70 GiB/s in a benchmark, you need fully coalesced reads and writes of a 4-byte or 8-byte type and you need to copy a large enough chunk of data so that the overhead of launching the kernel doesn’t affect the results.
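For reference, a minimal fully coalesced kernel might look like this (a sketch; the names are made up, and the launch configuration must supply one thread per element):

```cuda
// Each thread copies one float; consecutive threads in a warp touch
// consecutive addresses, so reads and writes are fully coalesced.
__global__ void copyKernel(const float* in, float* out, int n){
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n)
		out[i] = in[i];
}
// When counting bytes for the bandwidth figure: each element is read
// once and written once, so bytes moved = 2 * n * sizeof(float).
```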

For the whole 9 yards of bandwidth reads/writes/copies, see my bandwidth test code near the bottom of this thread: http://forums.nvidia.com/index.php?showtopic=52806