 # Speed-up and bandwidth

Hi,
I have been using CUDA for a month now. My code is about 100 times faster with CUDA than the same code in plain C. That’s a good point!
I’d like to estimate the maximum speed-up I can achieve with CUDA.
I’ve read some things about bandwidth, and I’ve read that I can speed up my code up to the theoretical maximum bandwidth of the graphics card. So, my questions are:

1. Is this correct (everything I said about bandwidth, etc.)?
2. How do I estimate the bandwidth?
Thanks for your help.
Vincent

• calculate how many GByte/s you are processing (read & write together)
• calculate how many GFLOPS you are processing

Compare both to the possible peak to see whether you have achieved the maximum performance for your algorithm. You need to do both, because your kernel can be either bandwidth-bound or processing-bound.

Cool… But how do I do that? How do I compute GB/s and GFLOPS?

Well, take how much data you read and write in your kernel (GB) and divide it by the time your kernel takes to run (s) -> GB/s.
Count how many FLOP you perform in your kernel (FLOP) and divide it by the time your kernel takes to run (s) -> FLOPS.

Just one clarification: do I have to count the reads and writes of global memory plus the reads and writes of shared memory?
Thanks

You have to count everything that is done on your GPU, I think. So memcopies etc.

Thanks :)

Global memory only; that is where you hit the bottleneck of ~70 GB/s.
Shared memory is much faster.

There is something I don’t understand.
The time spent by a kernel is made up of reads/writes to memory and of some computation.
So is it not possible to measure separately the time spent on memory accesses and the time spent on FLOPs?

A kernel does not exclusively perform one or the other. Multiple warps are kept in flight on each multiprocessor. While some are reading memory, others are performing computations.

So, if you are limited by memory, you will have an effective bandwidth of 70 GiB/s and only, say, 10 GFLOP/s. If you are instead limited by computation and not memory, then you will have ~150 GFLOP/s (or more if your code uses lots of MADDs) and only, say, 1 GiB/s.

In the first memory limited case, you get all of the computations “for free” because of the interleaving of execution and memory operations.

I’ve done my first test; tell me if I’m wrong:

My kernel is :

``````
__global__ void addOne(float* t, int nb) {
    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
    t[yIndex * nb + xIndex] += 1;
}
``````

Given a matrix t of size nb*nb, the kernel adds 1 to each element.

In the main function, I do the following:

``````
dim3 grid(nb / BLOCK_DIM, nb / BLOCK_DIM, 1);
dim3 threads(BLOCK_DIM, BLOCK_DIM, 1);

cudaEventRecord(start, 0);
addOne<<<grid, threads>>>(t, nb);   // the kernel launch being timed
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedTime, start, stop);

printf("Duration  : %f ms\n", elapsedTime);
printf("Bandwidth : %f GBytes/s\n", ((nb * nb * 2) * sizeof(float)) / (elapsedTime * 1000000));
printf("GFLOPS    : %f\n", (7 * nb * nb) / (elapsedTime * 1000000));
``````

In the kernel, I count 7 operations (+ and *) per thread => 7\*nb\*nb operations.

So, for nb=8192 and BLOCK_DIM=16, I have these results

``````
Duration  : 10.438016 ms
Bandwidth : 51.434192 GBytes/s
GFLOPS    : 45.004918
``````

Is this correct, or have I made a mistake?

``````
__global__ void addOne(float* t, int nb) {
    unsigned int xIndex = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
    unsigned int yIndex = __mul24(blockIdx.y, blockDim.y) + threadIdx.y;
    t[yIndex * nb + xIndex] += 1;
}
``````

Is what I would write. But I don’t think that will make a lot of difference.

``````
int num_loop = 100;
dim3 grid(nb / BLOCK_DIM, nb / BLOCK_DIM, 1);
dim3 threads(BLOCK_DIM, BLOCK_DIM, 1);

cudaEventRecord(start, 0);
for (int k = 0; k < num_loop; k++)
    addOne<<<grid, threads>>>(t, nb);   // time many launches, then average
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedTime, start, stop);

printf("Average Duration  : %f ms\n", elapsedTime / num_loop);
printf("Bandwidth : %f GBytes/s\n", ((nb * nb * 2) * sizeof(float)) / (elapsedTime / num_loop * 1000000));
printf("GFLOPS    : %f\n", (7 * nb * nb) / (elapsedTime / num_loop * 1000000));
``````

But this will probably make a lot of difference. You were timing the first execution of the kernel, which includes some one-time overhead, so you should always time subsequent calls. Averaging over a few kernel calls is also smart.

For the GFLOPS calculation, would you count this kernel as having 7 flops or 1? It looks like 6 of them (*, +) are for index calculations…