Hey folks,
I’m rather new to CUDA so, as expected, I’m running into a few problems (hopefully minor).
I have a kernel which simply does a set amount of floating point operations and records the min, max and mean GFLOPS of each kernel run. Here is what the kernel was originally:
template <int REP>
__global__ void mykernel(float * __restrict__ a, int reps) {
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    float r = a[idx];
    #pragma unroll
    for (int n=0;n<REP;n++) {
        r = 0.0001f+r*1.00002f; // nb. the "f"s are very important!
    }
    a[idx] = r; // copy result from local to global memory
}
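For context, the host side looks roughly like this (a simplified sketch rather than the exact code; the cudaEvent timing, the 256-thread block size and counting each loop iteration as 2 FLOPs, one multiply plus one add, are just for illustration, and the real harness also tracks the min/max over many runs):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int nThreads  = 7680;  // total threads
    const int blockSize = 256;   // threads per block (example value)
    const int reps      = 2500;  // loop iterations; 2 FLOPs each -> 5000 FLOPs/thread

    float *d_a;
    cudaMalloc(&d_a, nThreads * sizeof(float));
    cudaMemset(d_a, 0, nThreads * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    mykernel<2500><<<nThreads / blockSize, blockSize>>>(d_a, reps); // kernel from above
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    double flops = (double)nThreads * reps * 2.0; // 2 FLOPs per loop iteration
    printf("GFLOPS: %.1f\n", flops / (ms * 1e-3) * 1e-9);

    cudaFree(d_a);
    return 0;
}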
With this we got some pretty nice results (using Nvidia GTX 480 cards).
With ~7680 threads, a kernel doing 5000 FLOPs/thread gave us an average of about 1050 GFLOPS.
However, the problem is that when we tried different memory layout patterns, the performance degraded drastically.
Here is an example of a new kernel we tried:
template <int REP>
__global__ void mykernel(float * __restrict__ a, int reps) {
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    float r = a[idx]; // copy data from global to local memory (ie. to a register)
    float s = a[N+idx];
    float t = a[(2*N)+idx];
    #pragma unroll
    for (int n=0;n<REP;n++) {
        r = 0.0001f+r*1.00002f; // nb. the "f"s are very important!
        s = 0.0001f+s*1.00002f;
        t = 0.0001f+t*1.00002f;
    }
    a[idx] = r; // copy result from local to global memory
    a[N+idx] = s;
    a[(2*N)+idx] = t;
}
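For reference, N here is just a compile-time constant set to the total number of threads, so a now holds 3*N floats in three contiguous slices. The setup is roughly like this (again a simplified sketch, reusing the kernel above from the same file):

#include <cuda_runtime.h>

const int N = 7680;   // total threads; also the stride between the three slices of a

int main() {
    float *d_a;
    // one array of 3*N floats: a[0..N) holds r, a[N..2N) holds s, a[2N..3N) holds t
    cudaMalloc(&d_a, 3 * N * sizeof(float));
    cudaMemset(d_a, 0, 3 * N * sizeof(float));
    mykernel<2500><<<N / 256, 256>>>(d_a, 2500); // the 3-float kernel above
    cudaDeviceSynchronize();
    cudaFree(d_a);
    return 0;
}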
It’s basically the same kernel but now operating on 3 floats instead of 1.
The calculations my supervisor made show that this new kernel achieves a memory bandwidth of only around 2.4 to 2.6 GB/s. (Don’t worry, he gave the OK to put this stuff on the forum; I may have worded some of it incorrectly, but oh well!)
We’re both stumped as to why the performance has degraded so much: 2.6 GB/s is nowhere near the theoretical limit of ~177 GB/s. That’s not to say we were expecting to be anywhere near 177 GB/s, but we certainly weren’t expecting something this low.
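Just to show where a number like that comes from, here is my own back-of-the-envelope version of the bandwidth calculation (not necessarily how my supervisor computed it, and the kernel time below is a made-up example value):

#include <cstdio>

int main() {
    // Rough effective-bandwidth estimate for the 3-float kernel.
    const double nThreads       = 7680.0;
    const double bytesPerThread = 3 * 2 * 4.0; // 3 floats read + 3 floats written, 4 bytes each
    const double kernelTimeSec  = 70e-6;       // hypothetical kernel time (~70 microseconds)
    const double gbPerSec = nThreads * bytesPerThread / kernelTimeSec / 1e9;
    printf("effective bandwidth: %.2f GB/s\n", gbPerSec); // ~2.6 GB/s with these example numbers
    return 0;
}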
So I guess my question is, any ideas on why this is happening?
The alternative kernel also produces bizarre results, as you can see from this chart:
The new kernel also behaves pretty strangely at around 220 FLOPs/thread (the vertical axis represents GFLOPS and the horizontal axis represents FLOPs/thread).
As before, 7680 threads were run.