I am a beginner in CUDA and have been trying to write a kernel that computes the absolute value of a complex number.
I have a 1024x1024 array and, based on the matrix multiplication examples and CUDA videos, I am using 64x64 blocks with a BLOCK_SIZE of 16.
This code takes about 28 ms, which is rather long compared to other complex operations (a complex matrix multiplication of two 1024x1024 matrices takes about 1 ms).
First off - I'd do it all in 1D. Fewer additions are required on the kernel side to compute the index. This means you want a 1D block (dim3 threads2(256)) and then just a 1D grid (dim3 grid2(4096)) - note that 4096 * 256 = 1024 * 1024, so every element is still covered. pos is then simply bx*256 + tx. It doesn't save much computation, but it seems a lot more understandable to me.
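A minimal sketch of what I mean (the kernel and variable names are mine, and I'm assuming the output is a plain float array):

```cuda
// 1D layout: 4096 blocks of 256 threads covers all 1024*1024 elements.
__global__ void complexAbs1D(float *A, const cuComplex *B)
{
    int pos = blockIdx.x * blockDim.x + threadIdx.x;  // bx*256 + tx
    cuComplex value = B[pos];
    A[pos] = sqrtf(value.x * value.x + value.y * value.y);
}

// Host side:
dim3 threads2(256);
dim3 grid2(4096);
complexAbs1D<<<grid2, threads2>>>(d_A, d_B);
```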
(I should probably point out here that I’m not entirely certain what type cuComplex is under the hood. I’d guess it’s a float2. Having said that, it shouldn’t really matter for the rest of this post.)
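(For what it's worth, the cuComplex.h header in the toolkits I've seen does make it a float2, and it also ships a ready-made magnitude helper - roughly:)

```cuda
#include <cuComplex.h>

// In cuComplex.h, cuComplex is a typedef of cuFloatComplex, which is
// itself a typedef of float2 -- so .x is the real part and .y the
// imaginary part. The header also provides cuCabsf(), usable in both
// host and device code, which returns the magnitude:
//   float m = cuCabsf(value);
```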
Now… the rest. I'm a bit confused as to why it's slow. You have two coalesced reads and one coalesced store. Have you been messing with your input pointers at all (as in, are they the same pointers as assigned by cudaMalloc)? If so, you could have messed up the alignment, and that would slow you down, especially on older cards.
Failing that, it may be the compiler being foolish. I would have thought it would have to be really quite foolish, though. Might be an idea to write it a bit more explicitly:
cuComplex value = B[pos];
A[pos] = sqrtf(value.x * value.x + value.y * value.y);
Shared memory won't help at all - in fact, as written it should give you the wrong results. Shared memory is only useful for intra-block communication. What you've done is otherwise OK; just ditch the __shared__ specifier. Given you saw no improvement, I think the compiler probably wasn't being stupid.
Next question - how are you doing your timings? Are you sure they’re correct?
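Remember that kernel launches are asynchronous: if you stop a host timer right after the launch without synchronizing, you measure almost nothing, and if you time the very first launch you may be including context-setup cost. A hedged sketch using CUDA events (error checking omitted):

```cuda
cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
// ... launch the kernel under test here ...
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);   // wait until the kernel has actually finished

cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
printf("kernel: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

It's also worth averaging over several launches after a warm-up run.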