Ahh, thanks for the explanation. The add() was just an example; what I'm trying to do is actually a bit different. Do you have any suggestions on how to handle this? See below.
So what I'm trying to do is a 2-step algorithm. The first step takes elements from the array 2 at a time and performs some function on them. The second step then takes elements from the array 2 at a time, offset by 1, and performs another function on them. In other words, the 2nd step's operations overlap the 1st step's operations and depend on the results from the first step. It's kind of hard to explain; here's a visual guide:
array = ABCDEFGHI…
1st step:
function(A, B )
function(C, D )
function(E, F )
function(G, H )
//The above functions are the same and are performed in parallel
2nd step:
function(B, C )
function(D, E )
function(F, G )
function(H, I )
//The above functions are the same and are performed in parallel
Simple example:
float *d_array;
cudaMalloc((void**)&d_array, 1025*sizeof(float));
...
// Step 1
add2<<<2, 256>>>((float2 *)d_array, 1);
// Step 2
add2<<<2, 256>>>((float2 *)(d_array + 1), 1);

__global__ void add2(float2 *array, int amount){
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    array[tid].x = array[tid].x + amount;
    array[tid].y = array[tid].x * amount;
}
See, in this example I've cast the array from float* to float2* when calling add2. This way each thread can load 2 float values from the array in a coalesced manner, all in 1 access. It works fine in step 1; the problem arises in step 2. By offsetting by 1 float, I've broken the alignment like you said, and the kernel's accesses are no longer coalesced. Got any ideas how I can avoid this problem?
Could I avoid this by creating a second, copied array using cudaMemcpy? If so, is cudaMemcpyDeviceToDevice fast enough that I don't need to worry about it? Example below:
float *d_array, *d_array2;
cudaMalloc((void**)&d_array, 1025*sizeof(float));
cudaMalloc((void**)&d_array2, 1025*sizeof(float));
...
// Step 1
add2<<<2, 256>>>((float2 *)d_array, 1);
// Step 2
cudaMemcpy(d_array2, d_array+1, 1024*sizeof(float), cudaMemcpyDeviceToDevice);
add2<<<2, 256>>>((float2 *)d_array2, 1);
Any hints on that? Thanks for the reply btw.