Bandwidth calculation: newbie question

Hi there,

I want to calculate the bandwidth for a simple kernel:

// One thread per element: 3 reads (b, c, d) and 1 write (a).
__global__ void kernel(float* a, float* b, float* c, float* d)
{
    unsigned int index = blockIdx.x * blockDim.x + threadIdx.x;
    a[index] = b[index] + c[index] * d[index];
}

Since each element takes 1 write and 3 reads, I think I should calculate this:

bandwidth = 4 * sizeof(float) * arrayLength / executionTime

Is this the right way?

I tried this and I get 120 GB/s, although I have an FX4600, which has 46 GB/s of local memory bandwidth. I do not understand.

Thanks for your help.

Maybe the execution time is wrong? Try timing it both with a host-side synchronize after the launch (cudaThreadSynchronize()) and with the event method.
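For the event method, a minimal sketch could look like the following. It assumes the kernel from the first post with an added length parameter and bounds check; `kernelChecked`, `timeKernelMs`, and the block size passed in by the caller are made up for illustration and are not from the original code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel from the first post, with an added bounds check (n is not in the original).
__global__ void kernelChecked(float* a, const float* b, const float* c,
                              const float* d, unsigned int n)
{
    unsigned int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n)
        a[index] = b[index] + c[index] * d[index];
}

// Hypothetical helper: times one launch with CUDA events.
// Returns milliseconds, or a negative value if the launch failed.
float timeKernelMs(unsigned int n, int threadsPerBlock)
{
    const size_t bytes = (size_t)n * sizeof(float);
    float *a = 0, *b = 0, *c = 0, *d = 0;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&c, bytes);
    cudaMalloc(&d, bytes);
    cudaMemset(b, 0, bytes);
    cudaMemset(c, 0, bytes);
    cudaMemset(d, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Round up so every element is covered by a thread.
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    // Warm-up launch, not timed.
    kernelChecked<<<blocks, threadsPerBlock>>>(a, b, c, d, n);

    cudaEventRecord(start);
    kernelChecked<<<blocks, threadsPerBlock>>>(a, b, c, d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel and the stop event are done

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    cudaFree(d);
    return ms;
}
```

Calling `timeKernelMs(4194304, 256)` would give a time to plug into the formula from the first post: `4.0 * n * sizeof(float) / (ms * 1e-3)` bytes per second.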

I don't think the time is wrong. I have tried the profiler time and also a simple timer, and I get the same results.

In my program, this code is called 1000 times, and the execution time is almost the same each time.

Around 1.11 ms for an array of 4194304 floats.

In my experience, usually when this happens you’ll be reading the wrong number of elements or something like that.

Just a guess.

Hmm, do you mean I am not mapping the blocks correctly?

My launch parameters seem OK, and the results are the same as the CPU version :S

I’m really perplexed. Is my formula correct?

Thanks again.

Just guessing, but maybe you can read and write at the same time, so you should actually be multiplying by 3, not 4. Although this would still give you 90 GB/s (which I thought is more or less what CUDA cards generally achieve).

I’ve tried the bandwidth test (in the CUDA SDK projects), and I get 46 GB/s device-to-device bandwidth. That’s what I am comparing to. By the way, this is quite far from the theoretical 67.2 GB/s… :S

Yeah, reaching theoretical bandwidth is rarely, if ever, possible. From what I’ve gathered, about 2/3 of theoretical is normal. I get around 48 GB/s. These differences can come from motherboard and driver issues, and other hardware issues. I presume there are other reasons as well, but these are common.

Maybe just an honest error in calculating the bandwidth?

I get:

4*4194304 floats * 4 bytes/float / 1.11e-3 seconds / 1024^3 bytes/GiB = 56.3063063 GiB/s

That is still higher than your 46 theoretical though, which is odd…
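The same arithmetic can be written as a small host-side helper, which also makes the GB/s (10^9 bytes) versus GiB/s (1024^3 bytes) distinction explicit. This is only a sketch; the function names are invented for illustration, and the 4 accesses per element assume the 3 reads and 1 write counted in the first post.

```cuda
#include <cstddef>

// Effective bandwidth in bytes per second:
// (accesses per element) * (bytes per element) * (number of elements) / time.
double bytesPerSecond(size_t arrayLength, int accessesPerElement,
                      size_t elementSize, double seconds)
{
    return (double)accessesPerElement * (double)elementSize *
           (double)arrayLength / seconds;
}

// Decimal gigabytes per second (the unit vendors use for theoretical peak).
double toGBs(double bps)  { return bps / 1e9; }

// Binary gibibytes per second (1024^3 bytes).
double toGiBs(double bps) { return bps / (1024.0 * 1024.0 * 1024.0); }
```

With the numbers from this thread, `bytesPerSecond(4194304, 4, sizeof(float), 1.11e-3)` is about 60.46 GB/s, or about 56.31 GiB/s.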

I have tried on a friend’s computer. He has an 8800GTX and he gets 80% of its theoretical bandwidth on the bandwidth SDK example. Anyway, as you say, it may be coming from somewhere else…

You are right, I made an error…

The theoretical bandwidth is 67.2 GB/s. Getting 56.3 is quite realistic!

The last thing I don’t get is why I get a higher result with this kernel than with the bandwidth SDK test?

Anyway… Sorry for my mistake and thanks a lot for your help guys =)

oYo

Yep, the original 8800 GTX consistently gets 70 GiB/s bandwidth in kernels like this and the peak is 86 GiB/s.

The bandwidth SDK test benchmarks a device-to-device cudaMemcpy, which is a little different from running a kernel, so the two numbers don’t have to be the same.
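To compare the two on your own card, something like the sketch below could time a device-to-device cudaMemcpy with events. The helper names are hypothetical, and counting each copied byte twice (one read plus one write) is an assumption about how to make the figure comparable to the kernel formula above, not a statement about what the SDK test reports.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: times one device-to-device cudaMemcpy of `bytes` bytes.
// Returns milliseconds, or a negative value on failure.
float timeDeviceCopyMs(size_t bytes)
{
    void *src = 0, *dst = 0;
    if (cudaMalloc(&src, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&dst, bytes) != cudaSuccess) { cudaFree(src); return -1.0f; }
    cudaMemset(src, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up copy, not timed.
    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);

    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait for the copy and the stop event

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(src);
    cudaFree(dst);
    return ms;
}

// Assumption: every copied byte is read once and written once,
// so the memory traffic is 2 * bytes. Result is in GB/s (10^9 bytes).
double copyBandwidthGBs(size_t bytes, float ms)
{
    return 2.0 * (double)bytes / (ms * 1e-3) / 1e9;
}
```

For example, `copyBandwidthGBs(bytes, timeDeviceCopyMs(bytes))` with a buffer of a few tens of megabytes gives a number that can be set next to the kernel result.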