I want my CUDA program's output to be an unsigned short array, but its input is a float array. Casting float elements to unsigned short directly does not give good performance (due to alignment problems).
What could the solution be for getting good performance?
I don’t quite understand your question. You want your output to be short but the input is float? Casting float values to shorts should be fast in CUDA.
Alignment problems? Is the short packed together with something else, or why is this a problem? If the issue is that going from 2-byte to 4-byte elements takes you across an alignment boundary, then put in some padding, or make the struct larger to make room. Wasting some space to get the data aligned could be a good idea.
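A minimal sketch of that padding idea (the struct name and layout here are my own illustration, not from the thread):

```cuda
// Hypothetical example: pad a 2-byte element up to a 4-byte slot.
// CUDA's __align__ qualifier enforces the alignment; the pad field
// wastes 2 bytes per element but keeps every element on a 4-byte
// boundary, which the memory controller can load efficiently.
struct __align__(4) PaddedUShort {
    unsigned short value;
    unsigned short pad;
};
```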
__global__ void convertKernel( // kernel signature assumed; this line was missing from the post
    float *d_Result,
    unsigned short *d_Data,
    int data_w
){
    // IMUL is the SDK's 24-bit integer multiply macro (__mul24)
    int i = IMUL(IMUL(blockIdx.y, blockDim.y) + threadIdx.y, data_w)
          + IMUL(blockIdx.x, blockDim.x) + threadIdx.x;
    d_Result[i] = (float) d_Data[i];
}
Using the Visual Profiler you can see a high number of gld_incoherent events (~6554112 over 15 calls), which is relevant for performance. If I use IPP from Intel it is faster; of course we are talking about milliseconds (msec are important for me :( )
read 2 values in each thread, store to shared memory (make sure you keep things coalesced)
convert 2 (adjacent) values per thread from shared memory, and store as a single 32 bit value to global memory. That way you will keep things coalesced.
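The two steps above can be sketched roughly like this (the kernel name, block size, and launch layout are my assumptions, not from the thread; it also assumes the element count is a multiple of 2 * BLOCK and that the output buffer can be addressed as 32-bit words):

```cuda
#define BLOCK 256

// Sketch of the coalesced float -> unsigned short conversion:
// each thread loads two floats with coalesced reads, stages the
// converted 16-bit results in shared memory, then packs two adjacent
// shorts into one 32-bit word so the global stores are coalesced too.
__global__ void floatToUShort2(unsigned int *d_Out, const float *d_In)
{
    __shared__ unsigned short s_val[2 * BLOCK];

    int base = 2 * BLOCK * blockIdx.x;

    // Coalesced reads: consecutive threads touch consecutive floats.
    s_val[threadIdx.x]         = (unsigned short) d_In[base + threadIdx.x];
    s_val[threadIdx.x + BLOCK] = (unsigned short) d_In[base + threadIdx.x + BLOCK];
    __syncthreads();

    // Pack two adjacent 16-bit values into one 32-bit word
    // (little-endian: the first element goes in the low half).
    // Consecutive threads write consecutive 32-bit words: coalesced.
    unsigned int lo = s_val[2 * threadIdx.x];
    unsigned int hi = s_val[2 * threadIdx.x + 1];
    d_Out[base / 2 + threadIdx.x] = lo | (hi << 16);
}
```

The `__syncthreads()` matters because each thread reads back shared-memory entries written by other threads before packing them.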
Another option is to use a compute capability 1.2 or higher device (like the GTX 280); the new relaxed coalescing rules will probably help a lot in this case.