You have a race condition in that kernel. Either that, or out-of-bounds shared memory accesses (depending on whether threadId is simply threadIdx.x or threadIdx.x + blockDim.x*blockIdx.x).
Your timing is wrong because kernel launches are asynchronous - the host-side timer stops before the kernel has actually finished. With that many kernels launched, it may be irrelevant because the launch queue fills up anyway.
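A minimal sketch of how the timing could be done with CUDA events instead (the launch parameters and the d_in pointer are just placeholders for whatever your harness actually uses):

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    test<<<numBlocks, blockSize>>>(d_in, N);   // asynchronous launch
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                // block until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);    // elapsed time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);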
You’re launching blocks whose sizes aren’t multiples of the warp size. That’s a performance loss right out of the box.
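For example, something along these lines keeps the block size at a multiple of 32 and rounds the grid up to cover all N elements (names are illustrative, not from your code):

    const int blockSize = 256;                        // 8 full warps per block
    const int numBlocks = (N + blockSize - 1) / blockSize;
    // note: the kernel would need an if (threadId < N) guard
    // if N isn't a multiple of blockSize
    test<<<numBlocks, blockSize>>>(d_in, N);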
You haven’t understood what shared memory is for. It’s not necessary in a calculation like this at all.
You just can’t code in CUDA. Don’t blame it on the hardware or the environment.
By the way, this can’t be the full code, since you’re not supplying any device pointer to test in runtest, only N.
Both NVIDIA and AMD access memory in a similar fashion - using coordinated (coalesced) reads/writes of blocks of memory. It’s absolutely the fastest way to feed SIMD throughput machines.
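A minimal illustration of what a coalesced pattern looks like (a hypothetical copy kernel, not from the original post):

    // Consecutive threads in a warp touch consecutive 4-byte elements,
    // so the hardware can merge the accesses into a few wide transactions.
    __global__ void copyCoalesced(const float* in, float* out, int n)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n)
            out[i] = in[i];    // thread i reads/writes element i: coalesced
    }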
There are plenty. Let’s do a little math.
To reach 1000 GFLOPS on an NVIDIA GPU, one basically has to pump out MAD and MUL instructions, which may be dual-issued (executed concurrently).
A MAD instruction is:
a = b + c*d;
and a MUL is:
e = f*g;
That’s five input arguments in total. Now, if you wanted to sustain 1000 GFLOPS this way, you’d need to pull 1000 billion * 5 floats from memory per second. At 4 bytes per float, that’s 20 000 GB/s.
A GTX 285 has a memory bandwidth of 155 GB/s. An AMD HD 4890 has 122 GB/s. See the problem? And this assumes your program consists entirely of interleaved MADs and MULs and that everything else is perfect (no latencies, no register dependencies, perfect scheduling, you never write the results back, etc.).
This is why real apps, be it on AMD or NVIDIA GPUs, using any programming model, shaders, assembly, whatever, usually don’t reach peak GFLOPS. FFT is an example. Unless AMD’s FFT does 1TFLOP?
Of course you could get more GFLOPS per GB/s by working only on registers/shared memory/cached data and not fetching much from global memory, but real apps generally want to stream through huge datasets, not churn over small ones repeatedly. Caches don’t help there if it’s a throughput problem, and neither does DMA: it’s just a way of fetching data, it doesn’t get around the physical width and clock of the memory bus, it only allows some asynchronous access. NVIDIA cards already perform memory accesses asynchronously and automatically - the proof being that it’s relatively easy to reach 80-90% of peak physical bandwidth.
In a real-life situation you’d most likely need around 200+ arithmetic instructions per global memory access to even come close to saturating the ALUs; otherwise you’ll only hit big GB/s numbers (which is still good and might get you 100x+ speed-ups compared to a CPU). This is the same for AMD cards, and it’s one of the reasons why GFLOPS is neither the only nor the best performance metric there is.
Does FFT do 200+ arithmetic operations on every data element it loads? If so, for what sizes of datasets and what N?
BTW: I’ve modified your code so that it makes sense. I reach 412 GFLOPS out of the maximum 640, effectively maxing out the hardware at 2 floating-point operations per clock (the MAD), leaving out the concurrent MUL that could have been dual-issued for a third. That’s 97% of the theoretical FLOPS possible without dual-issuing MAD+MUL. Here’s the code:
__global__ void test(float* in, int N)
{
    int threadId = threadIdx.x + blockDim.x * blockIdx.x;
    float data = in[threadId];    // one global load per thread
    float sum = 0.0f;

    #pragma unroll 200            // pretty important for performance here
    for (int i = 0; i < N; ++i)
    {
        sum = sum + data * data;  // one MAD = 2 FLOPs per iteration
    }

    in[threadId] = sum;           // one global store per thread
}
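For reference, the GFLOPS figure can be derived from a timing like the one sketched earlier, roughly like this (launch parameter names are again placeholders; my exact harness isn’t shown):

    // One MAD per loop iteration = 2 floating-point operations per iteration.
    double flops  = 2.0 * (double)N * (double)numBlocks * (double)blockSize;
    double gflops = flops / (ms * 1.0e6);   // ms from cudaEventElapsedTime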
When I add a MUL into that loop, just for fun, I get 520 GFLOPS, reaching 81% of peak GFLOPS. Only ~81% because dual issue doesn’t work perfectly on compute capability 1.1 cards. I hear that’s been worked on in 1.2+?
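A sketch of what that MUL variant could look like inside the loop (the second accumulator is my own illustration; the exact change isn’t shown above):

    float sum  = 0.0f;
    float prod = 1.0f;             // second accumulator to keep the MUL alive
    #pragma unroll 200
    for (int i = 0; i < N; ++i)
    {
        sum  = sum + data * data;  // MAD: 2 FLOPs
        prod = prod * data;        // MUL: 1 FLOP, can dual-issue with the MAD
    }
    in[threadId] = sum + prod;     // use both results so neither is optimized away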