Bandwidth & Kernel problems: performance degredation.

Hey folks,

I’m rather new to CUDA so as expected I’m running in to a few problems (hopefully minor).

I have a kernel which simply does a set amount of floating point operations and records the min, max and mean GLFOPS of each kernel run. Here is what the kernel was originally:

template <int REP>

__global__ void mykernel(float * __restrict__ a, int reps) {

int idx = blockIdx.x*blockDim.x + threadIdx.x;

float r = a[idx];

  #pragma unroll

  for (int n=0;n<REP;n++) {

    r = 0.0001f+r*1.00002f; // nb. the "f"s are very important!

  }

  a[idx] = r; // copy result from local to global memory

}

With this we got some pretty nice results (using Nvidia GTX 480 cards).

With ~7680 threads, a kernel doing 5000 FLOPS/thread gave us an average of about 1050 GLFOPS.

However the problem is, when we tried different memory layout patterns, the performance degraded vastly.

Here is an example of a new kernel we tried:

template <int REP>

__global__ void mykernel(float * __restrict__ a, int reps) {

int idx = blockIdx.x*blockDim.x + threadIdx.x;

float r = a[idx]; // copy data from global to local memory (ie. to a register)

  float s = a[N+idx];

  float t = a[(2*N)+idx];

  #pragma unroll

  for (int n=0;n<REP;n++) {

    r = 0.0001f+r*1.00002f; // nb. the "f"s are very important!

    s = 0.0001f+s*1.00002f;

    t = 0.0001f+t*1.00002f;

  }

  a[idx] = r; // copy result from local to global memory

  a[N+idx] = s;

  a[(2*N)+idx] = t;

}

It’s basically the same kernel but now operating on 3 floats instead of 1.

The calculations my supervisor made showed that this new kernel was operating with a bandwidth of around 2.4 to 2.6 GB/s (don’t worry, he gave the ok to put this stuff on this forum and I may have worded some stuff incorrectly but oh well!).

We’re both stumped as to why the performance has degraded so much, this 2.6GB/s is nowhere near the theoretical limit of ~177GB/s. That’s not to say we’re expecting to be near 177GB/s but we certainly weren’t expecting something as low as this.

So I guess my question is, any ideas on why this is happening?

The alternative kernel also produces bizarre results, as you can see from this chart:

http://imgur.com/xGvMV.png

The new kernel also behaves pretty strangely at ~220 FLOPs/thread

(vertical axis represents GFLOPS and horizontal axis represents FLOPS/thread)

As with before, 7680 threads were ran.

are you sure that your kernel is launched?
do you check error code by cudaGetLastError()?

I’m a bit confused about the different metrics you give for the kernels you compare. The first performs at 1050 GFLOP/s, while the second achieves 2.4 GB/s.
Doing the maths backwards (and assuming no parameters changed between the two kernels), I arrive at 2.4 GB/s / 8 bytes * 5000 FLOP = 1500 GFLOP/s, which is above the theoretical maximum of the GTX 480.
Can you give the full numbers?

And of course you are not getting close to the theoretical bandwidth, because these kernels are computation limited, not bandwidth limited.

Hi, thanks for the replies. The kernel is definitely launched.
As for the numbers, our mathematics may be wrong and only approximations were used since we did the maths quickly on paper:

of bytes: 432*7680 (size of float on our system * number of floats used * number of reads/writes to memory * # of threads) = approx 180kb

time = [difficult to read unfortunately since we were in a rush but it involves 7680*5000 as the numerator of a fraction] = approx 80milliseconds

=> 180/80 = 2.25gb/s

Sorry if I wasn’t clear enough: In order to compare anything, you need to give the same quantity for both kernels. E.g., give a bandwidth number (in GB/s) for both, or an arithmetic throughput (GFLOP/s) for both.
Or, if you don’t want to compare, give both numbers for the same kernel, and we will be able to determine whether that kernel is computationally bound or bandwidth bound.

Hi CarlosTheDuck,

what is N in this code?

Woops, silly me! Thank you for the replies again.
Lets call the kernels k1 and k2.
k1 gives ~1050 GFLOPs for 7680 threads and 5000 FLOPs per thread
k2 gives ~380 GFLOPs for 7680 threads and 5000 FLOPs per thread
The only difference between k1 and k2 is 2 more float variables.
edit: umm I only just realised something. If k2 works on 3 times more floats than k1…would that actually mean it’s a 15000 FLOP kernel rather than a 5000 FLOP one?

The graph for k1 (x-axis being FLOPs per thread and y-axis being GFLOPs) produces a graph that looks like a smooth natural log graph (similar to this: http://www.wolframalpha.com/input/?i=Plot+ln(x%2B0.25)+%2B+1+from+x%3D+0+to+6). It is mostly smooth.
The graph for k2 produces a bizarre kink at around 250 FLOPs per thread (http://i.imgur.com/xGvMV.png) but after that produces a graph that is similar to natural log but much less smoother (i.e. a lot of small jumps). k2’s graph uses the same axes as k1’s graph.

N in the code is 6 (we’ll be using 6D vectors and 6*6 matrices in a lot of the calculations).
Although it doesn’t seem to matter which N we choose as long as appropriate steps have been taken (such as making sure we dont access an out of bounds index or making sure we have allocated enough memory etc.), the results in terms of performance seem to be the same.

Ok, so you figured yourself that k2 actually achieves ~1140 GFLOP/s, which is slightly more that k1 because it does more work overall.

The kink is due to the kernel changing from bandwidth bound to computationally bound. I’d guess you will also see it with k1 once you fix the GFLOP/s scale. k1 probably produces a smoother plot because of the smaller quantization.

I understood, that you have silly results.

This is one point, the performance another.

I might be wrong,

but r,s and are retrieved from array a

as a[idx], a[N+idx] and a[2*N+x].

Then they are written back with different values

to the same Location.

The reading part of thread x might

begin, before the writing part of thread x minus whatever

is finished, and it would , since we are dealing with parallel

threads, x - whatever write too late into the same array element.

You could write 3 Kernels, one for r, one for s and one for t,

it looks like the calcualtion of s,r and t is independent

from each other within one thread.

The overall performance is compromised

by uncoalesced access to global memory.

The threads are GPU-organized in warps,

which hold 32 Threads.

Whenever a thread accesses the global memory,

a 32 int/float wide memory segment=128byte is adressed.

If e.g. thread 12 of the warp acceses the 12th float

of the segment, this is a tremendous effort,

the segment has to be allocated, prepared or whatever

but we have to do this anyway, since we want result back in

global memory.

Fine, when the thread #11 writes to int #11 of the segment

no additional work for the GPU, the segment is already prepared and so on.

Fast Process, like in K1

But when threads of a warp write to floats into the segment,

which do not match the thread no,

then it is some more work for the GPU.

Even worse, when threads try to reach

floats outside the 128 segment, that have been

adressed before, i.e. the warp has to address

other new segements all the time.

thread 0,idx is 0, a[000],a[006],a[012], this fits into one segment

                a[001],a[007],a[013] still

                a[002],a[008],a[014] still

thread 30 a[030],a[036],a[042] has to address a new segment and has to write to two segments

and so on.

You have to write results to shared memory

and then you have to write it back to global memory,

that just one segment is accessed by one warp.

See also the best practice guide: Ensure global memory accesses are coalesced whenever possible. (Section 3.2.1)

http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Best_Practices_Guide.pdf

May split into 3 kernels already helps the problem.