Non-sequential memory access coalescing

Hi fellows,

I need your help with something. I want to know whether it is possible to move data from one global memory location to another in a single transaction, as follows:

out[Idy] = data[Idx];

where Idy = 0, 128, 256, 384, 512, 640, 768, 896, 1024 and Idx = 0, 1, 2, 3, 4, 5, 6, 7, 8, and the number of threads per block is equal to the total number of elements to be copied.

My GPU is a GeForce GT 630.


I use cudaMemcpy with the cudaMemcpyDeviceToDevice flag as the last argument. It works for my problem.
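For illustration, here is a minimal host-side sketch of a device-to-device copy that also handles the strided destination from the question. It swaps in cudaMemcpy2D for plain cudaMemcpy, treating each element as a one-element-wide row; that substitution, the buffer names, and the test values are assumptions, not something stated in the thread. Note that this call is issued from the host, not from inside a kernel, and it is not expected to be a single memory transaction.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main() {
    const int n = 9;         // number of elements to copy (Idx = 0..8)
    const int stride = 128;  // destination spacing in elements (Idy = 0, 128, ...)
    const int outLen = (n - 1) * stride + 1;

    std::vector<int> hSrc(n);
    for (int i = 0; i < n; ++i) hSrc[i] = 100 + i;  // arbitrary test values

    int *dSrc = nullptr, *dDst = nullptr;
    assert(cudaMalloc(&dSrc, n * sizeof(int)) == cudaSuccess);
    assert(cudaMalloc(&dDst, outLen * sizeof(int)) == cudaSuccess);
    cudaMemcpy(dSrc, hSrc.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dDst, 0, outLen * sizeof(int));

    // n "rows" of one int each: source rows are packed back to back,
    // destination rows start `stride` ints apart.
    cudaError_t err = cudaMemcpy2D(dDst, stride * sizeof(int),  // dst, dst pitch (bytes)
                                   dSrc, sizeof(int),           // src, src pitch (bytes)
                                   sizeof(int), n,              // row width (bytes), row count
                                   cudaMemcpyDeviceToDevice);
    assert(err == cudaSuccess);

    std::vector<int> hDst(outLen);
    cudaMemcpy(hDst.data(), dDst, outLen * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) assert(hDst[i * stride] == hSrc[i]);
    printf("strided device-to-device copy verified\n");

    cudaFree(dSrc);
    cudaFree(dDst);
    return 0;
}
```
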

Hi Pasoleatis,

Is it really possible to use cudaMemcpy inside a kernel function? My case is not just copying the data to another location, but copying elements from one specific memory location to another specific memory location. Let's say I have image data of size 64 x 64 stored as a 1D array. I want to process the 1st row of this image, at indices 0, 1, 2, 3, …, 63, and copy the data to the 1st column of the output buffer, at indices 0, 64, 128, 192, 256, …, 4032.

The number of threads per block is equal to the number of elements per transaction. So my question is: how can I copy a row of data (consecutive in memory) to a column of the output buffer (non-consecutive in memory) in a single transaction, with all threads participating?

The write would violate the memory coalescing rules. Even if you did it within one warp or half-warp, the memory controller would have to split the transaction into individual writes. This serialization will be detrimental to performance.
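To make the access pattern concrete, here is a minimal sketch of the copy from the original question as a kernel. The kernel name scatterCopy, the element type int, and the test values are assumptions. With 4-byte elements and a stride of 128, consecutive threads write 128 × 4 = 512 bytes apart, which is more than the 32-byte or 128-byte segments NVIDIA GPUs typically service per transaction, so no two writes can share a segment. The read side, data[idx], is contiguous and coalesces normally.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: thread idx reads data[idx] (contiguous across threads)
// and writes out[idx * stride] (strided across threads).
__global__ void scatterCopy(int *out, const int *data, int n, int stride) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        out[idx * stride] = data[idx];
    }
}

int main() {
    const int n = 9;         // Idx = 0..8
    const int stride = 128;  // Idy = 0, 128, ..., 1024
    const int outLen = (n - 1) * stride + 1;

    std::vector<int> hData(n);
    for (int i = 0; i < n; ++i) hData[i] = 100 + i;  // arbitrary test values

    int *dData = nullptr, *dOut = nullptr;
    assert(cudaMalloc(&dData, n * sizeof(int)) == cudaSuccess);
    assert(cudaMalloc(&dOut, outLen * sizeof(int)) == cudaSuccess);
    cudaMemcpy(dData, hData.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, outLen * sizeof(int));

    // One block, one thread per element, as described in the question.
    scatterCopy<<<1, n>>>(dOut, dData, n, stride);
    assert(cudaDeviceSynchronize() == cudaSuccess);

    std::vector<int> hOut(outLen);
    cudaMemcpy(hOut.data(), dOut, outLen * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) assert(hOut[i * stride] == hData[i]);
    printf("scatter copy verified\n");

    cudaFree(dData);
    cudaFree(dOut);
    return 0;
}
```

The result is correct either way; only the number of memory transactions needed for the write differs from the contiguous case.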

If I understand correctly, you would like to transpose a matrix of integer elements representing a 2D image. If so, you may be interested in the following whitepaper on efficient matrix transposition:

A recent blog post on the same topic:
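As a rough sketch of the shared-memory tiling usually applied to this kind of transpose (written for this thread, not taken from either link; the kernel name transposeTiled, the TILE size of 32, and the test harness are assumptions): each block reads a tile row by row, stores it in shared memory, and writes it back out with the block coordinates swapped, so that both the global read and the global write walk along consecutive addresses.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE 32  // assumed tile size; 32 x 32 threads per block

// in is height rows by width columns; out is width rows by height columns.
__global__ void transposeTiled(int *out, const int *in, int width, int height) {
    __shared__ int tile[TILE][TILE + 1];  // extra column pads against bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height) {
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // threadIdx.x walks an input row
    }

    __syncthreads();

    // Swap block coordinates so threadIdx.x walks along an output row.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width) {
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
    }
}

int main() {
    const int width = 64, height = 64;  // image size from the question
    const int count = width * height;

    std::vector<int> hIn(count), hOut(count);
    for (int i = 0; i < count; ++i) hIn[i] = i;

    int *dIn = nullptr, *dOut = nullptr;
    assert(cudaMalloc(&dIn, count * sizeof(int)) == cudaSuccess);
    assert(cudaMalloc(&dOut, count * sizeof(int)) == cudaSuccess);
    cudaMemcpy(dIn, hIn.data(), count * sizeof(int), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    transposeTiled<<<grid, block>>>(dOut, dIn, width, height);
    assert(cudaDeviceSynchronize() == cudaSuccess);

    cudaMemcpy(hOut.data(), dOut, count * sizeof(int), cudaMemcpyDeviceToHost);
    for (int r = 0; r < height; ++r)
        for (int c = 0; c < width; ++c)
            assert(hOut[c * height + r] == hIn[r * width + c]);
    printf("tiled transpose verified\n");

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```

With this layout, row 0 of the input (indices 0 to 63) ends up in column 0 of the output (indices 0, 64, …, 4032), which is the mapping asked about above, but the strided step happens in shared memory instead of global memory.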