Fast way (on device) to convert from byte to float

Does anyone have any experience converting data types on the card?

I’m attempting to implement something and the transfer time to the board is killing us. We have byte data, but need float for the FFT on the board.

What I’d like to do is send byte data to the board, then convert to float on the board, then FFT the data.

Any suggestions on what might be a fast way to do this on the device?

-Jordan

Such a conversion function, which will have to read and write to global memory, is probably going to be I/O bound anyway, so the most important thing will be to ensure your memory access pattern is optimal. You’ll want to pack the bytes into 32-bit words, and ensure that the conversion kernel has each thread read them in contiguous order so that the reads can be coalesced.

You could try the following:

  1. load the byte data to the device
  2. run a conversion program:
    2a) copy subblocks of your byte data into shared memory
    - make sure to coalesce: each thread should read a multiple of 4 bytes, and every 16 threads (a half-warp) should read a contiguous, aligned memory region.
    2b) convert bytes to floats (with a cast) and write to the output memory location
    - again, make sure that your writes to global memory are coalesced
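
The steps above could be sketched roughly as follows. This is only a minimal sketch under some assumptions of mine: the kernel name bytesToFloatStaged, the block size of 256, and the test data are made up for illustration, the byte count is assumed to be a multiple of 4, and error checking is left out.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define THREADS 256  // threads per block (an assumed value; tune for your card)

// Step 2a: each thread loads one 32-bit word (4 packed bytes) into shared memory.
// Step 2b: the block then writes 4*THREADS floats so that consecutive threads
// always write consecutive addresses.
__global__ void bytesToFloatStaged(const unsigned int *in, float *out, size_t nWords)
{
    __shared__ unsigned int words[THREADS];

    size_t blockBase = (size_t)blockIdx.x * THREADS;  // first word handled by this block
    size_t w = blockBase + threadIdx.x;

    words[threadIdx.x] = (w < nWords) ? in[w] : 0u;   // contiguous 4-byte reads
    __syncthreads();

    for (int k = 0; k < 4; ++k) {
        int local = k * THREADS + threadIdx.x;        // byte index inside this block's tile
        size_t gIdx = blockBase * 4 + local;          // byte index in the whole array
        if (gIdx < nWords * 4) {
            unsigned int word = words[local / 4];
            // The device is little-endian: byte j of a word sits in bits 8j..8j+7.
            unsigned int byte = (word >> (8 * (local % 4))) & 0xFFu;
            out[gIdx] = (float)byte;                  // contiguous float writes
        }
    }
}

int main()
{
    const size_t nBytes = 1 << 20;                    // must be a multiple of 4 in this sketch
    const size_t nWords = nBytes / 4;

    unsigned char *hIn = (unsigned char *)malloc(nBytes);
    float *hOut = (float *)malloc(nBytes * sizeof(float));
    for (size_t i = 0; i < nBytes; ++i) hIn[i] = (unsigned char)(i * 7 + 3);

    unsigned char *dIn;
    float *dOut;
    cudaMalloc((void **)&dIn, nBytes);
    cudaMalloc((void **)&dOut, nBytes * sizeof(float));
    cudaMemcpy(dIn, hIn, nBytes, cudaMemcpyHostToDevice);

    unsigned int blocks = (unsigned int)((nWords + THREADS - 1) / THREADS);
    bytesToFloatStaged<<<blocks, THREADS>>>((const unsigned int *)dIn, dOut, nWords);
    cudaMemcpy(hOut, dOut, nBytes * sizeof(float), cudaMemcpyDeviceToHost);

    size_t bad = 0;
    for (size_t i = 0; i < nBytes; ++i)
        if (hOut[i] != (float)hIn[i]) ++bad;
    printf("%s\n", bad == 0 ? "conversion OK" : "conversion FAILED");

    cudaFree(dIn);
    cudaFree(dOut);
    free(hIn);
    free(hOut);
    return bad == 0 ? 0 : 1;
}
```

The staging through shared memory is what lets the reads be one word per thread while the writes stay one float per thread in contiguous order.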

Coalescing is very important - for a program like this, which is doing mostly memory reads/writes, you’ll see about a 10x speedup with coalesced vs. uncoalesced accesses.

Let me know how this works out, or if you run into trouble with the code. I’m very interested to see what kind of speed you achieve.

Paulius

Ok, I’ve got an initial second cut at this finally. I’ve managed to make a program that will convert N samples from byte (unsigned char) to float, thanks to some help in a different thread. Now I just need to work on the coalesce stuff.

Currently I’m grabbing the bytes, using a constant lut (which is only 256 floats in size) and then writing the floats back out to global memory.
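
For anyone following along, a lookup-table version along these lines might look like the sketch below. The names kLut and bytesToFloatLut are made up, and the identity table (entry v holds the float v) is an assumption; the real table could just as well fold in a scale or offset. Error checking is left out.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// 256-entry table in constant memory (hypothetical name).
__constant__ float kLut[256];

// One thread per byte: look the byte up in the table and write the float.
__global__ void bytesToFloatLut(const unsigned char *in, float *out, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = kLut[in[i]];
}

int main()
{
    const size_t n = 1 << 20;
    const int threads = 256;

    // Assumed table contents: plain byte-to-float identity mapping.
    float hLut[256];
    for (int v = 0; v < 256; ++v) hLut[v] = (float)v;
    cudaMemcpyToSymbol(kLut, hLut, sizeof(hLut));

    unsigned char *hIn = (unsigned char *)malloc(n);
    float *hOut = (float *)malloc(n * sizeof(float));
    for (size_t i = 0; i < n; ++i) hIn[i] = (unsigned char)(i * 13 + 5);

    unsigned char *dIn;
    float *dOut;
    cudaMalloc((void **)&dIn, n);
    cudaMalloc((void **)&dOut, n * sizeof(float));
    cudaMemcpy(dIn, hIn, n, cudaMemcpyHostToDevice);

    unsigned int blocks = (unsigned int)((n + threads - 1) / threads);
    bytesToFloatLut<<<blocks, threads>>>(dIn, dOut, n);
    cudaMemcpy(hOut, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    size_t bad = 0;
    for (size_t i = 0; i < n; ++i)
        if (hOut[i] != hLut[hIn[i]]) ++bad;
    printf("%s\n", bad == 0 ? "LUT conversion OK" : "LUT conversion FAILED");

    cudaFree(dIn);
    cudaFree(dOut);
    free(hIn);
    free(hOut);
    return bad == 0 ? 0 : 1;
}
```

One thing to keep in mind with this layout: constant memory is at its best when all threads in a warp read the same address, and here the index depends on the data, so a plain cast may be worth comparing against the table.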

For 16MB (in byte format) my timings are:
8.4ms (before I converted from using a “lut” in memory to a constant static lut)

7.3ms (after), which is pretty good, but it’d be nice to rein it in a little more.

I need to get this integrated into a bigger program, then I’ll mess with the “speedup” (it still represents a 28.5ms gain over sending floats to the board)

Ok, did a quick overhaul.
Did a quick cast from uchar to uchar4 and float to float4, and now it takes ~1.9 ms to do the conversion on 16MB. According to my calculations, if there were no processing involved, only reads and writes to memory, the best case would be on the order of 1.1ms or so. Given that it’s not perfectly straightforward like that, I think 1.9 is excellent.
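
A rough sketch of what a uchar4/float4 version could look like, with event-based timing around it. The kernel name bytesToFloatVec4 and the block size are assumptions of mine, the byte count is assumed to be a multiple of 4, and error checking is left out. For reference, converting 16MB of bytes moves 16MB in and 64MB out, so 80MB total; a 1.1ms floor would therefore correspond to somewhere around 73 to 76 GB/s depending on how you count a megabyte.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread reads 4 packed bytes as one uchar4 and writes one float4.
__global__ void bytesToFloatVec4(const uchar4 *in, float4 *out, size_t nVec)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nVec) {
        uchar4 b = in[i];
        out[i] = make_float4((float)b.x, (float)b.y, (float)b.z, (float)b.w);
    }
}

int main()
{
    const size_t nBytes = 16u << 20;   // 16 MB of input; must be a multiple of 4 here
    const size_t nVec = nBytes / 4;
    const int threads = 256;           // assumed block size

    unsigned char *hIn = (unsigned char *)malloc(nBytes);
    float *hOut = (float *)malloc(nBytes * sizeof(float));
    for (size_t i = 0; i < nBytes; ++i) hIn[i] = (unsigned char)(i * 31 + 1);

    unsigned char *dIn;
    float *dOut;
    cudaMalloc((void **)&dIn, nBytes);   // cudaMalloc memory is aligned well enough for uchar4/float4
    cudaMalloc((void **)&dOut, nBytes * sizeof(float));
    cudaMemcpy(dIn, hIn, nBytes, cudaMemcpyHostToDevice);

    unsigned int blocks = (unsigned int)((nVec + threads - 1) / threads);

    // Warm-up launch so the timed run does not include one-time setup cost.
    bytesToFloatVec4<<<blocks, threads>>>((const uchar4 *)dIn, (float4 *)dOut, nVec);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    bytesToFloatVec4<<<blocks, threads>>>((const uchar4 *)dIn, (float4 *)dOut, nVec);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaMemcpy(hOut, dOut, nBytes * sizeof(float), cudaMemcpyDeviceToHost);
    size_t bad = 0;
    for (size_t i = 0; i < nBytes; ++i)
        if (hOut[i] != (float)hIn[i]) ++bad;
    printf("%s\n", bad == 0 ? "conversion OK" : "conversion FAILED");

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dIn);
    cudaFree(dOut);
    free(hIn);
    free(hOut);
    return bad == 0 ? 0 : 1;
}
```

The timing you get will of course depend on the card; the numbers quoted in this thread are from the poster’s own hardware.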

Sounds good. If I had to guess, I’d say that the final speedup came from casting to uchar4 (float and float4 should give pretty much the same performance when reading/writing global memory). If you get a chance, can you verify the timings by using uchar4 with scalar float instead of float4?

Paulius

The issue with that is that it creates some nastiness in addressing for the kernel, especially since I’m doing a kernel “wrapper” trick to allow for N > 65535*256 for our stuff. I’ll see how this goes this week, I’ve got to figure out how to knock down our total timing for this project or it’s going to lose out to other solutions. :( If I get some extra time I’ll throw together a test and let you know how it goes.
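
The actual wrapper isn’t shown in this thread, so the following is only a guess at one way such a trick could work: a host function that launches the kernel in chunks of at most 65535 blocks and passes each launch a starting offset. The names convertLarge and bytesToFloatChunk are hypothetical, and error checking is left out.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define THREADS 256        // threads per block
#define MAX_BLOCKS 65535   // per-launch block limit mentioned in the thread

// Converts elements starting at 'offset'; 'n' is the total element count.
__global__ void bytesToFloatChunk(const unsigned char *in, float *out,
                                  size_t offset, size_t n)
{
    size_t i = offset + (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (float)in[i];
}

// Hypothetical wrapper: repeats the launch until all n elements are covered.
void convertLarge(const unsigned char *dIn, float *dOut, size_t n)
{
    const size_t perLaunch = (size_t)MAX_BLOCKS * THREADS;
    for (size_t offset = 0; offset < n; offset += perLaunch) {
        size_t remaining = n - offset;
        size_t count = remaining < perLaunch ? remaining : perLaunch;
        unsigned int blocks = (unsigned int)((count + THREADS - 1) / THREADS);
        bytesToFloatChunk<<<blocks, THREADS>>>(dIn, dOut, offset, n);
    }
}

int main()
{
    // Slightly more than one launch can cover, so the loop runs twice.
    const size_t n = (size_t)MAX_BLOCKS * THREADS + 1000;

    unsigned char *hIn = (unsigned char *)malloc(n);
    float *hOut = (float *)malloc(n * sizeof(float));
    for (size_t i = 0; i < n; ++i) hIn[i] = (unsigned char)(i * 5 + 2);

    unsigned char *dIn;
    float *dOut;
    cudaMalloc((void **)&dIn, n);
    cudaMalloc((void **)&dOut, n * sizeof(float));
    cudaMemcpy(dIn, hIn, n, cudaMemcpyHostToDevice);

    convertLarge(dIn, dOut, n);
    cudaMemcpy(hOut, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    size_t bad = 0;
    for (size_t i = 0; i < n; ++i)
        if (hOut[i] != (float)hIn[i]) ++bad;
    printf("%s\n", bad == 0 ? "large conversion OK" : "large conversion FAILED");

    cudaFree(dIn);
    cudaFree(dOut);
    free(hIn);
    free(hOut);
    return bad == 0 ? 0 : 1;
}
```

Because the offset is added once at the top of the kernel, the rest of the indexing stays the same as in a single-launch version, which may take some of the nastiness out of mixing uchar4 reads with scalar float writes.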