from float to unsigned short

Hi, I have a simple question:

I want the output of my CUDA program to be an unsigned short array, but its input is a float array. Casting the float elements to unsigned short directly does not perform well (due to alignment problems).

What can I do to get good performance?

Thanks

I don’t quite understand your question. You want your output to be short but the input is float? Casting float values to shorts should be fast in CUDA.

Alignment problems? Is the short packed together with something else, or why is this a problem? If the problem is that going from 2 bytes to 4 bytes takes you across a page boundary, then put in some padding or make the struct larger to make room. Wasting some space to get the data aligned could be a good idea.

Martin

Sorry I was not clear before. Here is an example:

code:

// Converting an unsigned short array to a float array.
// IMUL is presumably the CUDA SDK sample macro wrapping __mul24().
__global__ void Conver16To32(
    float *d_Result,
    unsigned short *d_Data,
    int data_w
){
    // Flattened 2D index: row * data_w + column
    int i = IMUL(IMUL(blockIdx.y, blockDim.y) + threadIdx.y, data_w) + IMUL(blockIdx.x, blockDim.x) + threadIdx.x;
    d_Result[i] = (float) d_Data[i];
}

Using the Visual Profiler, you can see a high number of gld_incoherent (~6554112 with 15 calls), which matters for performance. Intel IPP is faster for this conversion. Of course we are talking about milliseconds, but milliseconds are important for me :(

So, is it a problem?

Thanks

Ahh, sorry, the previous example was 16 to 32 bits, but I have already solved that problem by reading the manual :).

Actually, the problem is related to going from 32 to 16 bits. Again, here is the example:

// Converting a float array to an unsigned short array.
__global__ void Convert_32f16u(
    unsigned short *d_Result,
    float *d_Data,
    int data_w
){
    // Flattened 2D index: row * data_w + column
    int i = IMUL(IMUL(blockIdx.y, blockDim.y) + threadIdx.y, data_w) + IMUL(blockIdx.x, blockDim.x) + threadIdx.x;
    d_Result[i] = (unsigned short) d_Data[i];
}

Using the Visual Profiler, I noticed that this kernel has a high number of gst_incoherent (17055232 with 15 calls).

In other words, I have too many non-coalesced accesses to global memory :(

One option is:

Read 2 values in each thread and store them to shared memory (make sure you keep things coalesced).

Convert 2 (adjacent) values per thread from shared memory, and store them as a single 32-bit value to global memory. That way you will keep things coalesced.

The other option is to use a compute capability 1.2 or higher device (like the GTX 280); the new relaxed coalescing rules will probably help a lot in this case.
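
The first option (staging through shared memory and writing packed 32-bit words) could look roughly like the sketch below. This is only an illustration under some assumptions, not tested on a compute 1.0/1.1 device: the kernel name Convert_32f16u_packed, the THREADS block size and the convertOnDevice host helper are made up for this example, the data is treated as a flat 1D array of n floats instead of the 2D indexing used above, and the input values are assumed to fit in the unsigned short range.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define THREADS 256  // hypothetical block size, chosen for this sketch

// Each block converts 2 * THREADS floats and writes THREADS packed 32-bit words.
__global__ void Convert_32f16u_packed(unsigned int *d_Result, const float *d_Data, int n)
{
    __shared__ float tile[2 * THREADS];

    int base = blockIdx.x * 2 * THREADS;  // first float handled by this block

    // Two 32-bit loads per thread; consecutive threads read consecutive floats.
    int i0 = base + threadIdx.x;
    int i1 = base + THREADS + threadIdx.x;
    tile[threadIdx.x]           = (i0 < n) ? d_Data[i0] : 0.0f;
    tile[THREADS + threadIdx.x] = (i1 < n) ? d_Data[i1] : 0.0f;
    __syncthreads();

    // Each thread converts two adjacent values taken from shared memory.
    unsigned int lo = (unsigned short) tile[2 * threadIdx.x];
    unsigned int hi = (unsigned short) tile[2 * threadIdx.x + 1];

    // One 32-bit store per thread; consecutive threads write consecutive words.
    // The lower 16 bits hold the element at the lower address (little-endian).
    int out = blockIdx.x * THREADS + threadIdx.x;
    if (2 * out < n)
        d_Result[out] = lo | (hi << 16);
}

// Hypothetical host helper: copies the input to the device, runs the kernel,
// and copies the result back as an unsigned short array. Error checks omitted.
std::vector<unsigned short> convertOnDevice(const std::vector<float> &in)
{
    int n = (int) in.size();
    std::vector<unsigned short> out(n);
    if (n == 0) return out;

    int words = (n + 1) / 2;  // output size in 32-bit words, padded if n is odd
    float *d_in = 0;
    unsigned int *d_out = 0;
    cudaMalloc((void **) &d_in, n * sizeof(float));
    cudaMalloc((void **) &d_out, words * sizeof(unsigned int));
    cudaMemcpy(d_in, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (words + THREADS - 1) / THREADS;
    Convert_32f16u_packed<<<blocks, THREADS>>>(d_out, d_in, n);

    cudaMemcpy(out.data(), d_out, n * sizeof(unsigned short), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

Calling convertOnDevice(input) on a std::vector<float> returns the converted values in the same order. Note that reading tile[2 * threadIdx.x] from shared memory is a strided access, so on older hardware it may cause some bank conflicts; that is usually much cheaper than uncoalesced global stores, but it is worth checking in the profiler.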