Do I have to consider the time spent on data type conversion when calculating the ideal running time of a kernel?

Hi, all

My data is stored as "char" in global memory and converted to "float" for calculation in the kernel. Where is the type conversion executed? Does it run in parallel with the calculation? Please ignore the time spent on global memory accesses for this question.

This is what you describe: load char → convert char to float → float computation

On a per-thread basis, that is a dependency chain, so the steps cannot overlap with each other: each one needs the result of the previous one. The char-to-float conversion requires an I2F instruction, which has to go through the single-precision execution unit. Keep in mind that the most expensive part of this dependency chain is likely the load, unless the “float computation” part is fairly involved.
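To make the dependency chain concrete, here is a minimal sketch of such a kernel. The kernel name `scale_chars`, the `scale` parameter, and the single multiply standing in for the "float computation" are assumptions for illustration only; they are not taken from the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread handles one element.
// Every step consumes the result of the step before it.
__global__ void scale_chars(const char *in, float *out, float scale, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        char c = in[i];        // 1. 8-bit load from global memory
        float f = (float)c;    // 2. integer-to-float conversion
        out[i] = f * scale;    // 3. float computation, needs the result of step 2
    }
}

int main()
{
    const int n = 1024;
    char *in;
    float *out;

    // Managed memory keeps the sketch short; a real application may
    // prefer explicit cudaMalloc/cudaMemcpy.
    cudaMallocManaged(&in, n * sizeof(char));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (char)(i % 100);

    scale_chars<<<(n + 255) / 256, 256>>>(in, out, 0.5f, n);
    cudaDeviceSynchronize();

    printf("out[3] = %f\n", out[3]);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

If you want to see the conversion instruction the compiler actually emitted, you can disassemble the compiled binary with `cuobjdump -sass`. The exact mnemonic and its modifiers may differ between GPU architectures.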

You should be able to easily check trade-offs between data storage formats and conversion overhead by doing a quick prototype. It is generally best to load 32 bits (or more) of data at once, so it is worth examining whether you could arrange storage as uchar4 and process four items at a time.
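A quick prototype along these lines might look like the sketch below, which times a one-byte-per-thread kernel against a `uchar4` version using CUDA events. The kernel names, the problem size, the `scale` multiply, and the use of unsigned data are all assumptions made for illustration; substitute your own computation and data layout.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Baseline: one 8-bit load and one conversion per thread.
__global__ void scale_u8(const unsigned char *in, float *out, float scale, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (float)in[i] * scale;
}

// Packed: one 32-bit load brings in four items per thread.
// n4 is the number of uchar4 elements, i.e. n / 4.
__global__ void scale_u8x4(const uchar4 *in, float4 *out, float scale, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        uchar4 v = in[i];
        out[i] = make_float4(v.x * scale, v.y * scale, v.z * scale, v.w * scale);
    }
}

int main()
{
    const int n = 1 << 24;       // assumed size, must be a multiple of 4
    const int threads = 256;
    const int blocks1 = (n + threads - 1) / threads;
    const int blocks4 = (n / 4 + threads - 1) / threads;

    unsigned char *in;
    float *out;
    cudaMalloc(&in, n);                    // cudaMalloc pointers are aligned
    cudaMalloc(&out, n * sizeof(float));   // well enough for uchar4/float4 access
    cudaMemset(in, 1, n);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms = 0.0f;

    // Warm-up launch so the first timed kernel does not pay start-up costs.
    scale_u8<<<blocks1, threads>>>(in, out, 0.5f, n);

    cudaEventRecord(start);
    scale_u8<<<blocks1, threads>>>(in, out, 0.5f, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("one char per thread:  %.3f ms\n", ms);

    cudaEventRecord(start);
    scale_u8x4<<<blocks4, threads>>>(reinterpret_cast<const uchar4 *>(in),
                                     reinterpret_cast<float4 *>(out), 0.5f, n / 4);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("uchar4 per thread:    %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

A single timed launch is noisy, so for a real comparison it is worth averaging over several runs. For signed data, the built-in `char4` type can be used in the same way.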