I want my CUDA program's output to be an unsigned short array, but its input is a float array. Casting float elements to unsigned short directly does not give good performance (due to alignment problems).
What could the solution be for getting good performance?
I don’t quite understand your question. You want your output to be short but the input is float? Casting float values to shorts should be fast in CUDA.
Alignment problems? Is the short packed together with something else, or why is this a problem? If the issue is that going from 2-byte to 4-byte elements takes you across an alignment boundary, then put in some padding, or make the struct larger to make room. Wasting some space to get the data aligned could be a good idea.
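A minimal sketch of that padding idea (the struct name and layout here are my own illustration, not from the thread):

```cuda
// Hypothetical example: pad a 2-byte element up to a 4-byte slot.
// CUDA's __align__ qualifier enforces the alignment; the pad field
// wastes 2 bytes per element but keeps every element on a 4-byte
// boundary, which the memory controller can load efficiently.
struct __align__(4) PaddedUShort {
    unsigned short value;
    unsigned short pad;
};
```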
__global__ void convertKernel( // kernel signature assumed; this line was missing from the post
    float *d_Result,
    unsigned short *d_Data,
    int data_w
){
    // IMUL is the SDK's 24-bit integer multiply macro (__mul24)
    int i = IMUL(IMUL(blockIdx.y, blockDim.y) + threadIdx.y, data_w)
          + IMUL(blockIdx.x, blockDim.x) + threadIdx.x;
    d_Result[i] = (float) d_Data[i];
}
Using the Visual Profiler you can see a high number of gld_incoherent events (~6554112 over 15 calls), which is relevant for performance. If I use IPP from Intel it is faster; of course we are talking about milliseconds (msec are important for me :( )
read 2 values in each thread, store to shared memory (make sure you keep things coalesced)
convert 2 (adjacent) values per thread from shared memory, and store as a single 32 bit value to global memory. That way you will keep things coalesced.
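The two steps above can be sketched roughly like this (the kernel name, block size, and launch layout are my assumptions, not from the thread; it also assumes the element count is a multiple of 2 * BLOCK and that the output buffer can be addressed as 32-bit words):

```cuda
#define BLOCK 256

// Sketch of the coalesced float -> unsigned short conversion:
// each thread loads two floats with coalesced reads, stages the
// converted 16-bit results in shared memory, then packs two adjacent
// shorts into one 32-bit word so the global stores are coalesced too.
__global__ void floatToUShort2(unsigned int *d_Out, const float *d_In)
{
    __shared__ unsigned short s_val[2 * BLOCK];

    int base = 2 * BLOCK * blockIdx.x;

    // Coalesced reads: consecutive threads touch consecutive floats.
    s_val[threadIdx.x]         = (unsigned short) d_In[base + threadIdx.x];
    s_val[threadIdx.x + BLOCK] = (unsigned short) d_In[base + threadIdx.x + BLOCK];
    __syncthreads();

    // Pack two adjacent 16-bit values into one 32-bit word
    // (little-endian: the first element goes in the low half).
    // Consecutive threads write consecutive 32-bit words: coalesced.
    unsigned int lo = s_val[2 * threadIdx.x];
    unsigned int hi = s_val[2 * threadIdx.x + 1];
    d_Out[base / 2 + threadIdx.x] = lo | (hi << 16);
}
```

The `__syncthreads()` matters because each thread reads back shared-memory entries written by other threads before packing them.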
Another option is to use a compute capability 1.2 or higher device (like the GTX 280); the new relaxed coalescing rules will probably help a lot in this case.