Speed-up and bandwidth

Hi,
I have been using CUDA for a month now. My code runs 100 times faster with CUDA than the same code in plain C. That’s a good start!
I’d like to estimate the maximum speed-up I can achieve with CUDA.
I’ve read a bit about bandwidth, and I understand that I can speed up my code until I reach the theoretical maximum bandwidth of the graphics card. So my questions are:

  1. Is this correct (everything I said about bandwidth, etc.)?
  2. How do I estimate the bandwidth?

Thanks for your help.
Vincent

I have seen some slides about this. What was done was basically this:

  • calculate how many GByte/s you are processing (read & write together)
  • calculate how many GFLOPS you are processing

Compare both to the possible peak to see if you have achieved maximum performance for your algorithm. You need to do both, because your kernel can be either bandwidth-bound or processing-bound.
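To answer question 2: the theoretical peak bandwidth follows from the card’s specs as memory clock × (bus width in bytes) × 2 for DDR memory. For example, on an 8800 GTX (900 MHz memory clock and a 384-bit bus, if I remember the specs right) that gives 900e6 × 48 × 2 ≈ 86.4 GB/s; well-optimized kernels typically reach only 70-80% of that in practice.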

Cool… But how to do that? How to compute GB/s and GFLOPS?

Well, how much data do you read and write in your kernel (GB)? Divide it by the time it takes your kernel to run (s) -> GB/s
Count how many floating-point operations you have in your kernel (FLOP) and divide it by the time it takes your kernel to run (s) -> FLOPS
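For example, with made-up numbers: a kernel that reads and writes 1 GB in total over a run of 20 ms achieves 1 GB / 0.02 s = 50 GB/s, and if it performs 10^9 FLOP in that same run, that is 1e9 / 0.02 s = 50 GFLOPS.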

Clear answer!
Just one clarification: do I have to count the reads and writes to global memory plus the reads and writes to shared memory?
Thanks

You have to count everything that is done on your GPU, I think. So memcopies etc.

Thanks :)

Global memory only; that is where you have the bottleneck of ~70 GB/s.
Shared memory is much faster.

There is something I don’t understand.
The time spent by a kernel is made up of memory reads/writes plus some computation.
So, isn’t it possible to measure separately the time spent on memory accesses and the time spent on FLOPs?

A kernel does not exclusively perform one or the other. Multiple warps are kept in flight on each multiprocessor. While some are reading memory, others are performing computations.

So, if you are limited by memory, you will have an effective bandwidth of 70 GiB/s and only, say, 10 GFLOP/s. If you are instead limited by computation and not memory, then you will have ~150 GFLOP/s (or more if your code uses lots of MADs) and only, say, 1 GiB/s.

In the first memory limited case, you get all of the computations “for free” because of the interleaving of execution and memory operations.
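A rough rule of thumb follows from those two peaks (~150 GFLOP/s and ~70 GB/s): the hardware can sustain roughly 150/70 ≈ 2 FLOP per byte of global memory traffic. A kernel doing fewer FLOP per byte than that will end up bandwidth-bound; one doing more will end up compute-bound.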

I’ve done my first test; tell me if I’m wrong.

My kernel is:

__global__ void addOne(float* t, int nb){
	unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
	unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
	t[yIndex*nb+xIndex] += 1;
}

Given a matrix t of size nb*nb, the kernel adds 1 to each element.

In the main function, I do the following:

dim3 grid(nb/BLOCK_DIM, nb/BLOCK_DIM, 1);
dim3 threads(BLOCK_DIM, BLOCK_DIM, 1);

cudaEventRecord(start, 0);
addOne<<<grid,threads>>>(tab_dev, nb);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedTime, start, stop);  // elapsedTime is in ms

printf("Duration  : %f ms\n", elapsedTime);
// bytes / (ms * 1e6) gives GB/s, since 1 GB = 1e9 bytes and 1 ms = 1e-3 s
printf("Bandwidth : %f GBytes/s\n", ((nb*nb*2)*sizeof(float))/(elapsedTime*1000000));
printf("GFLOPS    : %f\n", (7*nb*nb)/(elapsedTime*1000000));

Each thread reads and writes once in global memory => 2*nb*nb reads/writes

In the kernel, I count 7 operations (+ and *) per thread => 7*nb*nb operations

So, for nb=8192 and BLOCK_DIM=16, I get these results:

Duration  : 10.438016 ms
Bandwidth : 51.434192 GBytes/s
GFLOPS    : 45.004918
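(Sanity check on my formulas: 2*8192*8192*4 bytes = 536,870,912 bytes, and 536,870,912 bytes / 0.010438 s ≈ 51.4e9 bytes/s, which matches the printed bandwidth; likewise 7*8192*8192 FLOP / 0.010438 s ≈ 45.0e9 FLOP/s.)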

Is this correct, or have I made a mistake?

__global__ void addOne(float* t, int nb){
	unsigned int xIndex = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
	unsigned int yIndex = __mul24(blockIdx.y, blockDim.y) + threadIdx.y;
	t[yIndex*nb+xIndex] += 1;
}

Is what I would write. But I don’t think that will make a lot of difference.
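(If I recall correctly, __mul24 uses the GPU’s 24-bit integer multiplier, which is cheaper than a full 32-bit multiply on current hardware, and index values like these easily fit in 24 bits.)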

int num_loop = 100;

dim3 grid(nb/BLOCK_DIM, nb/BLOCK_DIM, 1);
dim3 threads(BLOCK_DIM, BLOCK_DIM, 1);

// warm-up launch, not timed (absorbs one-time startup overhead)
addOne<<<grid,threads>>>(tab_dev, nb);

cudaEventRecord(start, 0);
for (int k = 0; k < num_loop; k++)
  addOne<<<grid,threads>>>(tab_dev, nb);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedTime, start, stop);

printf("Average Duration  : %f ms\n", elapsedTime/num_loop);
printf("Bandwidth : %f GBytes/s\n", ((nb*nb*2)*sizeof(float))/(elapsedTime/num_loop*1000000));
printf("GFLOPS    : %f\n", (7*nb*nb)/(elapsedTime/num_loop*1000000));

But this will probably make a lot of difference. You were timing the first execution of the kernel, which includes some one-time overhead, so you should always time subsequent calls. Averaging over a few kernel calls is also smart.
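One more detail, in case it is not already in your code: the events must be created before use and should be destroyed afterwards, roughly like this:

cudaEvent_t start, stop;
cudaEventCreate(&start);   // create the timing events once
cudaEventCreate(&stop);
// ... the timing code above goes here ...
cudaEventDestroy(start);   // clean up when done
cudaEventDestroy(stop);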

For the GFLOPS calculation, would you count this kernel as having 7 flops or 1? It looks like 6 of them (*, +) are for index calculations…