Non-Sequential Copy to Shared Memory


I have a float2 output from an FFT, and I want to take only the .x part and copy it into constant memory. I then want to take the data from constant memory and save it in shared memory, but increment the data every 8 samples. Is there an elegant way to do this other than using for loops in my two copy operations?


Copying from constant to shared is certainly slower than from global to shared, because accesses by the threads of a warp to different addresses of the constant space are serialised.

That aside, it looks like you are dealing with a very small output array. Why do you think for loops are not elegant?

Note that after writing results to constant memory you will need to launch a new kernel before you can read back correct values (it is called constant memory for a reason…). Also, copying from constant memory to shared memory will be quite slow, as the accesses will get completely serialized. Why do you want to go through constant memory at all?
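To make the "skip constant memory" suggestion concrete, here is a rough sketch of staging the data straight from global memory into shared memory. It assumes "every 8 samples" means keeping every 8th element, and that the FFT output holds 8192 float2 values; STAGE_STRIDE, STAGE_COUNT and stageAndSum are names made up for this example, and the final sum only stands in for whatever the real kernel does with the data.

```cuda
#include <cuda_runtime.h>

#define STAGE_STRIDE 8     // assumption: keep every 8th sample
#define STAGE_COUNT  1024  // assumption: 8192 samples / 8

// Each block gathers every STAGE_STRIDE-th real part from global memory
// into its own shared array. Threads of a warp read different addresses,
// which constant memory would serialise; here the reads are strided
// (not coalesced) global loads instead.
__global__ void stageAndSum(const float2 *fft, float *blockSums)
{
    __shared__ float s[STAGE_COUNT];  // 4 KB, fits in 16 KB of shared memory

    // Block-stride loop: works for any block size, e.g. 400 threads.
    for (int i = threadIdx.x; i < STAGE_COUNT; i += blockDim.x)
        s[i] = fft[i * STAGE_STRIDE].x;
    __syncthreads();  // all of s[] is filled before anyone reads it

    // Placeholder use of the staged data: thread 0 sums it up.
    if (threadIdx.x == 0) {
        float acc = 0.0f;
        for (int i = 0; i < STAGE_COUNT; ++i)
            acc += s[i];
        blockSums[blockIdx.x] = acc;
    }
}
```

It would be launched with the configuration from this thread, e.g. stageAndSum<<<8, 400>>>(d_fft, d_sums), where d_fft and d_sums are device pointers you allocate yourself.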

What is the compute capability of your device?

I did a speed test using event timing on a compute capability 1.1 device, and kernel completion time was significantly faster when I kept my read-only data in constant memory (32 ms in constant memory, compared to 70 ms in device memory). The data size is 8192 and is needed by all of my threads (<<<8, 400>>>).

I just got a compute capability 2.1 device today, so I might test it again. Although if it ain't broke, why fix it?

A for loop requires that I read the data out to my CPU, arrange it properly, then copy it back out (to constant or device memory). I was hoping there was a data address generator that I could use to copy non-sequential data into a sequential block. I suppose even in the case where I am moving data non-sequentially within device memory, I need to do a bunch of cudaMemcpy calls with cudaMemcpyDeviceToDevice.
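One thing that comes close to an "address generator" without a round trip through the CPU is cudaMemcpy2D: treat every kept sample as a one-float-wide row, and let the source pitch do the skipping. The sketch below is only an illustration; gatherRealParts is a made-up helper name, and stride = 8 is my reading of "every 8 samples". It is a single call, though not necessarily a fast one.

```cuda
#include <cuda_runtime.h>

// Gathers d_src[0].x, d_src[stride].x, d_src[2*stride].x, ... into a
// contiguous float array, entirely on the device (hypothetical helper).
cudaError_t gatherRealParts(float *d_dst, const float2 *d_src,
                            size_t count, size_t stride)
{
    return cudaMemcpy2D(d_dst, sizeof(float),            // dst, dst pitch: packed floats
                        d_src, stride * sizeof(float2),  // src, src pitch: skip 'stride' samples
                        sizeof(float),                   // row width: just the .x member
                        count,                           // number of rows = samples kept
                        cudaMemcpyDeviceToDevice);
}
```

With stride = 1 this simply strips the imaginary parts; with stride = 8 and count = 1024 it packs every 8th real part of an 8192-element array.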

Write your own FFT code and output the .x part of all elements in a single array. You can do the increment while outputting this array. Then just make a call on host to copy it to the constant symbol.
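If writing your own FFT is more than you want to take on, a small post-processing kernel on the existing FFT output gets you the same packed array (that is a substitute for the suggestion above, not the same thing). A minimal sketch, assuming 1024 kept samples and a stride of 8; c_real, extractReal, useConst and loadConstFromFft are names invented for the example, and d_scratch is a device buffer you provide:

```cuda
#include <cuda_runtime.h>

#define N_CONST 1024  // assumption: 8192 samples / 8

__constant__ float c_real[N_CONST];

// Packs every 'stride'-th real part of the FFT output into 'real'.
__global__ void extractReal(const float2 *fft, float *real, int count, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        real[i] = fft[i * stride].x;
}

// Runs the packing kernel, then fills the constant symbol with a
// device-to-device copy, so the data never visits the CPU.
cudaError_t loadConstFromFft(const float2 *d_fft, float *d_scratch, int stride)
{
    extractReal<<<(N_CONST + 255) / 256, 256>>>(d_fft, d_scratch, N_CONST, stride);
    return cudaMemcpyToSymbol(c_real, d_scratch, N_CONST * sizeof(float),
                              0, cudaMemcpyDeviceToDevice);
}

// A kernel launched afterwards sees the new constant data. Every thread
// reads the same c_real[i] in each iteration, which is the access pattern
// constant memory handles well.
__global__ void useConst(float *out)
{
    float acc = 0.0f;
    for (int i = 0; i < N_CONST; ++i)
        acc += c_real[i];
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}
```

Both calls in loadConstFromFft go to the default stream, so the copy into the symbol runs after the packing kernel has finished.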