Is my kernel too simple to get a speed increase from CUDA?

It only runs marginally faster than on the CPU. I tried doing, for example, eight of these atan2 calculations in one kernel, which just made it slower. Is this kernel too simple?

Which hardware is this running on?
You may want to use atan2f instead of atan2 to make sure it is not executed in double precision (if you have GT200 hardware).
If not, you are getting uncoalesced memory reads in both of your memory transactions, which will significantly slow things down.

It depends: which GPU, which CPU, which input array size? But generally - yes, even without much thought one would say that not much performance can be expected from this kernel. The number of arithmetic operations is small; on the other hand, atan2() is used, which is computed on the SFU units (only 2 of these per SM, versus 8 SP units per SM); and the input values are read from global memory in a pattern that may be somewhat problematic regarding coalescing, etc.

I am using a GeForce 8800 GT, and I notice I still get a small speed boost from atan2f over atan2. I’ll look into coalescing my global memory reads.

It’s not too simple; you should use shared memory, which would speed things up. I have a similar kernel that applies a sigmoid to every element of an array, and I get 250 ms for 1,000,000 floats single-threaded on a 2.6 GHz CPU, vs 0.5 ms on a GeForce 8600 GT, so that’s 500 times faster!

Would shared memory actually help in my kernel? I am only accessing each array element once; would it still help anyway?

No, using shared memory won’t help here.

Actually it might. In the code, thread 0 accesses data[0] and data[1], thread 1 accesses data[2] and data[3], thread 2 accesses data[4] and data[5], and so on…

This is obviously not coalesced, certainly not on an old 8800.

You might try something like this:

__shared__ float smdata[ BLOCK_SIZE * 2 ];

// Something along those lines...

smdata[ threadIdx.x ] = inputArray[ threadIdx.x ];

smdata[ threadIdx.x * 2 ] = inputArray[ BLOCK_SIZE + threadIdx.x ];

__syncthreads();

// and now access shared memory data instead of global memory...

You can also use textures. On old hardware the shared memory approach may work better.

eyal

Why is BLOCK_SIZE needed? Why do the indices for smdata include threadIdx? I am confused.

I stand corrected - what eyalhir74 suggested may indeed help. The idea is to use shared memory to coalesce global memory accesses, and for that you need successive threads in a block to access successive locations in global memory. Since each of your threads consumes two (successive) floats from global memory, you allocate shared memory (shared memory is allocated per block) for 2*BLOCK_SIZE floats. Then each block first copies data from global memory into this shared memory, this time with successive threads reading successive global memory locations. Because shared memory is per-block, you use only threadIdx (and not blockIdx) in the indices on the left-hand side of those statements. On the other hand, I think eyalhir74 made some small mistakes in the indexing - I think your kernel should actually be changed to:

__global__ void detectionBlock_Phase_Kernel(float* inputArray, float *outputArray, int arrayLength)
{
	__shared__ float smdata[BLOCK_SIZE*2];
	int idx = threadIdx.x + blockIdx.x*blockDim.x;
	int blockStartIdx = blockIdx.x*blockDim.x*2;
	// Here you don't care about which data is going to be used by which thread;
	// you only try to coalesce the global memory accesses.
	smdata[threadIdx.x] = inputArray[blockStartIdx + threadIdx.x];
	smdata[threadIdx.x + BLOCK_SIZE] = inputArray[blockStartIdx + threadIdx.x + BLOCK_SIZE];
	__syncthreads();
	// Now use smdata[2*threadIdx.x] and smdata[2*threadIdx.x+1] to calculate outputArray[idx].
	// Note that the shared memory accesses won't be coalesced now, but it is still
	// better to have the global memory accesses coalesced.
}

But: please make sure you understand the idea first, and then double-check the code above - it’s way too easy to make a mistake when posting untested code, so I may have made some in the code above too. Also: if BLOCK_SIZE is not a compile-time constant, specify the shared memory size from your host code (as the third kernel launch configuration parameter) and use blockDim.x instead of BLOCK_SIZE in the kernel code.

The code works, but it only appears to speed me up by .01 secs at most, and at worst it merely matches my previous best time. It seems that this function is a weak candidate for CUDA. The CPU takes .307 secs vs the GPU at best .225 secs. Thank you all for the help. Is there anything else I can try?

What does the .225 timing include? PCI as well?

Can you post the host code, including your grid settings?

eyal

What do you mean by PCI?

My timing is measured from right before I load the float array onto the card to right after I unload it.

Here is my code:

__global__ void detectionBlock_Phase_Kernel(float* inputArray, float *outputArray, int arrayLength)
{
	__shared__ float smdata[256*2];
	int idx = threadIdx.x + blockIdx.x*blockDim.x;
	int blockStartIdx = blockIdx.x*blockDim.x*2;
	// Here you don't care about which data is going to be used by which thread;
	// you only try to coalesce the global memory accesses.
	smdata[threadIdx.x] = inputArray[blockStartIdx + threadIdx.x];
	smdata[threadIdx.x + 256] = inputArray[blockStartIdx + threadIdx.x + 256];
	__syncthreads();
	if (idx < (arrayLength/2))
	{
		outputArray[idx] = atan2f(smdata[2*threadIdx.x + 1], smdata[2*threadIdx.x]);
	}
}

void detectionBlock_Phase(float *inputArray, float *outputArray, int arrayLength)
{
	dim3 block(256);
	dim3 grid((unsigned int)ceil((arrayLength/2)/(float)block.x));
	detectionBlock_Phase_Kernel<<<grid, block>>>(inputArray, outputArray, arrayLength);
}

You move data from the CPU RAM to the GPU over the PCI bus, and as you say you measure the timings from right before you load the float array onto the card to right after you unload it, that includes the PCI overhead - which has its price :)

What is the value of arrayLength?

eyal

I include the PCI overhead in my GPU times.

arrayLength is just how many elements I have to process.

OK, let me rephrase this :)

What is arrayLength’s value??? You need to create enough blocks (obviously depending on your data set size) to keep the GPU busy. So what is the common/max value for this parameter?

eyal

I have it set to 15 million now for testing purposes. I’m hoping this is enough. I can’t get it much higher than 30 million even though I have a 512 MB 8800 GT.

With arrayLength = 15M, the grid above launches ceil(7.5M / 256) = 29297 blocks (each thread produces one of the arrayLength/2 outputs). That is within the 65535 per-grid-dimension limit, but run deviceQuery from the SDK to see your card’s actual limits. If you launch too many blocks, your kernel might not run at all - make sure it runs correctly in release mode, that the results are correct, and that no errors are returned.

Can you break down the timings? How long does it take to upload the data to the GPU, how long does the kernel itself take (make sure you have a cudaThreadSynchronize after the kernel and before stopping the timer), and how long does it take to get the data from the GPU back to the CPU?

I guess most of the time is indeed the PCI transfers. If that is the case, there’s probably not much to do, and your initial guess that the kernel is too simple is right :)

eyal

0.099200 seconds elapsed when I didn’t take the loading to and from the device into account. That is more than 3x faster than the CPU. That means more time is spent moving data to and from the device than calculating the results :(

I am guessing this is something future architectures will lessen.