Thanks for your good ideas!
I tried out three of them:
First:
__global__ void CopyUnsignedChar3ToUchar4(unsigned char *input, uchar4 *output)
{
    const int x = blockDim.x * blockIdx.x + threadIdx.x;
    const int y = blockDim.y * blockIdx.y + threadIdx.y;
    const int index = Index_ZS(y, x);

    *(output + y * IMAGE_WIDTH + x) = make_uchar4(*(input + index),
                                                  *(input + index + 1),
                                                  *(input + index + 2),
                                                  0x00);
}
Second:
__global__ void CopyUnsignedChar3ToUchar4_S(unsigned char *input, uchar4 *output)
{
    const int x = blockDim.x * blockIdx.x + threadIdx.x;
    const int y = blockDim.y * blockIdx.y + threadIdx.y;
    const int index = Index_ZS(y, x);

    // The tile must be indexed with block-local coordinates; the global
    // x and y run past the 8x8 array. Assumes a launch with 8x8 thread blocks.
    __shared__ uchar4 storage[8][8];
    storage[threadIdx.y][threadIdx.x] = make_uchar4(*(input + index),
                                                    *(input + index + 1),
                                                    *(input + index + 2),
                                                    0x00);
    __syncthreads();

    *(output + y * IMAGE_WIDTH + x) = storage[threadIdx.y][threadIdx.x];
}
Third:
cudaMemcpy2D(dest_uchar4, 4, pcInputImage, 3, 704, 576, cudaMemcpyDeviceToDevice);
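For comparison, here is a minimal sketch of how a pitched copy can express the 3-byte to 4-byte repacking, treating every pixel as one "row" of the 2D copy: width is 3 bytes, height is the number of pixels, the source pitch is 3 and the destination pitch is 4. As far as I understand the pitch arguments, the width in bytes has to fit within both pitches, so it seems worth checking the returned cudaError_t of any cudaMemcpy2D call before trusting its timing. The helper name packRgbToRgbx is hypothetical and not part of the measurements in this post.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper (not from the post): repack tightly packed 3-byte
// pixels into 4-byte pixels with a pitched device-to-device copy.
// Each "row" of the 2D copy is a single pixel.
cudaError_t packRgbToRgbx(uchar4 *dst, const unsigned char *src, size_t numPixels)
{
    // The copy only writes 3 of every 4 destination bytes, so clear the
    // destination first to get a defined 0x00 in the fourth component.
    cudaError_t err = cudaMemset(dst, 0, numPixels * sizeof(uchar4));
    if (err != cudaSuccess) return err;

    // Return the status so the caller can detect a rejected argument set.
    return cudaMemcpy2D(dst, sizeof(uchar4),  // destination pitch: 4 bytes
                        src, 3,               // source pitch: 3 bytes
                        3,                    // width: bytes copied per row
                        numPixels,            // height: one row per pixel
                        cudaMemcpyDeviceToDevice);
}
```

With the buffers from this post the call would look like packRgbToRgbx(dest_uchar4, pcInputImage, 704 * 576). I have not timed this variant, so how it compares to the two kernels is open.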
Results:
Method                   1         1         2         2         3         3
Number of measurements   250       10000     250       10000     250       10000
Median [ms]              1.080130  1.080480  1.076800  1.082360  0.008363  0.008508
Variance [ms]            0.046579  0.062817  0.006150  0.076088  0.001715  0.003705
Uncertainty [ms]         0.002946  0.000628  0.000389  0.000761  0.000108  0.000037
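A quick consistency check on how the last row relates to the others. This assumes the "Variance" row is in fact a standard deviation s (it is given in ms) and that the last row is the standard error of the mean over N measurements; both are an interpretation of the numbers, not something stated explicitly:

```latex
\[
  \frac{s}{\sqrt{N}} = \frac{0.046579}{\sqrt{250}} \approx 0.002946\ \text{ms},
  \qquad
  \frac{0.062817}{\sqrt{10000}} \approx 0.000628\ \text{ms}
\]
```

Both values match the last row for method 1 (250 and 10000 measurements), and the other columns follow the same relation.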
I also tried using shared memory as a buffer in a kernel that performs more calculations than the ones above. This gave me an improvement of about 1-2 ms. That kernel processed a 704x576x3 (unsigned char) image.
I used an 8800 GTS with 640 MB.