Fast way (on device) to convert from byte to float

SrJsignal · August 13, 2007, 3:15pm

Does anyone have any experience converting data types on the card?

I’m attempting to implement something and the transfer time to the board is killing us. We have byte data, but need float for the FFT on the board.

What I’d like to do is send byte data to the board, then convert to float on the board, then FFT the data.

Any suggestions on what might be a fast way to do this on the device?

-Jordan

seibert · August 13, 2007, 7:20pm

Such a conversion function, which will have to read and write to global memory, is probably going to be I/O bound anyway, so the most important thing will be to ensure your memory access pattern is optimal. You’ll want to pack the bytes into 32-bit words, and ensure that the conversion kernel has each thread read them in contiguous order so that the reads can be coalesced.
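A minimal sketch of the packed-read idea described above, not code from the thread. The kernel name, the host wrapper convertPacked, and the block size of 256 are assumptions made for illustration, and the byte count is assumed to be a nonzero multiple of 4.

```cuda
#include <cuda_runtime.h>
#include <stdint.h>
#include <stddef.h>

// Each thread loads one aligned 32-bit word (four packed bytes), so consecutive
// threads read consecutive words and the loads can coalesce.
__global__ void bytesToFloatPacked(const uint32_t* in, float* out, size_t nWords)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nWords) return;

    uint32_t w = in[i];
    // NVIDIA GPUs are little-endian: byte 0 sits in the low 8 bits of the word.
    out[4 * i + 0] = (float)( w        & 0xFFu);
    out[4 * i + 1] = (float)((w >>  8) & 0xFFu);
    out[4 * i + 2] = (float)((w >> 16) & 0xFFu);
    out[4 * i + 3] = (float)( w >> 24);
}

// Hypothetical host wrapper: copies nBytes to the device, converts, copies back.
// Assumes nBytes is a nonzero multiple of 4. Returns 0 on success, -1 on error.
int convertPacked(const unsigned char* hIn, float* hOut, size_t nBytes)
{
    uint32_t* dIn = NULL;
    float* dOut = NULL;
    size_t nWords = nBytes / 4;
    if (cudaMalloc(&dIn, nBytes) != cudaSuccess) return -1;
    if (cudaMalloc(&dOut, nBytes * sizeof(float)) != cudaSuccess) {
        cudaFree(dIn);
        return -1;
    }
    cudaMemcpy(dIn, hIn, nBytes, cudaMemcpyHostToDevice);

    unsigned int threads = 256;
    unsigned int blocks = (unsigned int)((nWords + threads - 1) / threads);
    bytesToFloatPacked<<<blocks, threads>>>(dIn, dOut, nWords);

    cudaError_t err = cudaMemcpy(hOut, dOut, nBytes * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return err == cudaSuccess ? 0 : -1;
}
```

Note that only the reads are contiguous across threads here. Each thread writes four adjacent floats, so neighbouring threads write 16 bytes apart, which may not coalesce on older hardware.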

paulius · August 16, 2007, 8:47pm

You could try the following:

1) load the byte data to the device
2) run a conversion program:
   2a) copy subblocks of your byte data into shared memory
       - make sure to coalesce: each thread should read multiples of 4 bytes, every 16 threads should read a contiguous aligned memory region.
   2b) convert bytes to floats (with a cast) and write to the output memory location
       - again, make sure that your writes to global memory are coalesced

Coalescing is very important - for a program such as this, which is doing mostly memory reads/writes, you’ll see about a 10x speedup with coalesced vs uncoalesced accesses.
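The steps above could be sketched roughly as follows. This is an illustration under assumptions, not Paulius's code: the names bytesToFloatShared and convertShared, the fixed block size THREADS, and the requirement that the byte count be a nonzero multiple of 4 are all made up for the example.

```cuda
#include <cuda_runtime.h>
#include <stdint.h>
#include <stddef.h>

#define THREADS 256  // assumed block size; the kernel must be launched with it

// Step 2a: each thread reads one 32-bit word into shared memory (coalesced read).
// Step 2b: the block then writes floats so that consecutive threads write
// consecutive output elements (coalesced write), four passes per block.
__global__ void bytesToFloatShared(const uint32_t* in, float* out, size_t nWords)
{
    __shared__ uint32_t tile[THREADS];

    size_t blockWord = (size_t)blockIdx.x * THREADS;  // first word of this block
    size_t w = blockWord + threadIdx.x;
    if (w < nWords) tile[threadIdx.x] = in[w];
    __syncthreads();

    // View the staged words as individual bytes.
    const unsigned char* bytes = (const unsigned char*)tile;
    size_t nBytes = nWords * 4;
    for (int k = 0; k < 4; ++k) {
        unsigned int j = k * THREADS + threadIdx.x;  // byte index within the tile
        size_t g = blockWord * 4 + j;                // global element index
        if (g < nBytes) out[g] = (float)bytes[j];
    }
}

// Hypothetical host wrapper. Assumes nBytes is a nonzero multiple of 4.
// Returns 0 on success, -1 on error.
int convertShared(const unsigned char* hIn, float* hOut, size_t nBytes)
{
    uint32_t* dIn = NULL;
    float* dOut = NULL;
    size_t nWords = nBytes / 4;
    if (cudaMalloc(&dIn, nBytes) != cudaSuccess) return -1;
    if (cudaMalloc(&dOut, nBytes * sizeof(float)) != cudaSuccess) {
        cudaFree(dIn);
        return -1;
    }
    cudaMemcpy(dIn, hIn, nBytes, cudaMemcpyHostToDevice);

    unsigned int blocks = (unsigned int)((nWords + THREADS - 1) / THREADS);
    bytesToFloatShared<<<blocks, THREADS>>>(dIn, dOut, nWords);

    cudaError_t err = cudaMemcpy(hOut, dOut, nBytes * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return err == cudaSuccess ? 0 : -1;
}
```

The byte-wise reads from shared memory may cause bank conflicts on some hardware; for a kernel dominated by global memory traffic that is probably a secondary cost, but it is worth measuring.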

Let me know how this works out, or if you run into trouble with the code. I’m very interested to see what kind of speed you achieve.

Paulius

SrJsignal · August 17, 2007, 1:42pm

Ok, I’ve got an initial second cut at this finally. I’ve managed to make a program that will convert N samples from byte (unsigned char) to float, thanks to some help in a different thread. Now I just need to work on the coalesce stuff.

Currently I’m grabbing the bytes, using a constant lut (which is only 256 floats in size) and then writing the floats back out to global memory.
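A rough sketch of what a constant-memory lookup table version might look like. The actual table contents and kernel are not shown in the thread, so the identity mapping (byte value v maps to float v) and the names kLut, bytesToFloatLut and convertLut are assumptions for illustration only.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

// 256-entry table in constant memory, one float per possible byte value.
__constant__ float kLut[256];

// One byte in, one float out, via the table. Reads are one byte per thread,
// so this version does nothing yet to improve coalescing.
__global__ void bytesToFloatLut(const unsigned char* in, float* out, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = kLut[in[i]];
}

// Hypothetical host wrapper. Assumes n > 0. Returns 0 on success, -1 on error.
int convertLut(const unsigned char* hIn, float* hOut, size_t n)
{
    // Assumed mapping: identity. A real table could fold in scaling or offset.
    float hLut[256];
    for (int v = 0; v < 256; ++v) hLut[v] = (float)v;
    if (cudaMemcpyToSymbol(kLut, hLut, sizeof(hLut)) != cudaSuccess) return -1;

    unsigned char* dIn = NULL;
    float* dOut = NULL;
    if (cudaMalloc(&dIn, n) != cudaSuccess) return -1;
    if (cudaMalloc(&dOut, n * sizeof(float)) != cudaSuccess) {
        cudaFree(dIn);
        return -1;
    }
    cudaMemcpy(dIn, hIn, n, cudaMemcpyHostToDevice);

    unsigned int threads = 256;
    unsigned int blocks = (unsigned int)((n + threads - 1) / threads);
    bytesToFloatLut<<<blocks, threads>>>(dIn, dOut, n);

    cudaError_t err = cudaMemcpy(hOut, dOut, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return err == cudaSuccess ? 0 : -1;
}
```

Constant memory is at its best when all threads of a warp read the same address. A data-dependent index like this one can serialize the lookups, so when the mapping is a plain numeric conversion a direct cast is likely to be at least as fast.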

For 16MB (in byte format) my timings are:
8.4ms (before I converted from using a “lut” in memory to a constant static lut)
7.3ms (after), which is pretty good, but it’d be nice to rein it in a little more.

I need to get this integrated into a bigger program, then I’ll mess with the “speedup” (it still represents a 28.5ms gain over sending floats to the board)

SrJsignal · August 17, 2007, 4:32pm

Ok, did a quick overhaul.
Did a quick cast from uchar to uchar4 and float to float4 and now it takes ~1.9 ms to do the conversion on 16MB. According to my calculations, if there were no processing involved, only reads and writes to memory, the best possible time would be on the order of 1.1ms or so. Given that it’s not perfectly straightforward like that, I think 1.9 is excellent.
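A minimal sketch of what the uchar4/float4 version might look like. The poster's actual kernel is not shown, so the names bytesToFloatVec4 and convertVec4 and the block size are assumptions, and the byte count is assumed to be a nonzero multiple of 4.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

// Each thread reads one uchar4 (a single 4-byte load) and writes one float4
// (a single 16-byte store). Consecutive threads touch consecutive vectors.
__global__ void bytesToFloatVec4(const uchar4* in, float4* out, size_t nVec)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nVec) return;

    uchar4 b = in[i];
    out[i] = make_float4((float)b.x, (float)b.y, (float)b.z, (float)b.w);
}

// Hypothetical host wrapper. Assumes nBytes is a nonzero multiple of 4.
// Returns 0 on success, -1 on error.
int convertVec4(const unsigned char* hIn, float* hOut, size_t nBytes)
{
    uchar4* dIn = NULL;
    float4* dOut = NULL;
    size_t nVec = nBytes / 4;
    if (cudaMalloc(&dIn, nBytes) != cudaSuccess) return -1;
    if (cudaMalloc(&dOut, nBytes * sizeof(float)) != cudaSuccess) {
        cudaFree(dIn);
        return -1;
    }
    cudaMemcpy(dIn, hIn, nBytes, cudaMemcpyHostToDevice);

    unsigned int threads = 256;
    unsigned int blocks = (unsigned int)((nVec + threads - 1) / threads);
    bytesToFloatVec4<<<blocks, threads>>>(dIn, dOut, nVec);

    cudaError_t err = cudaMemcpy(hOut, dOut, nBytes * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return err == cudaSuccess ? 0 : -1;
}
```

As a sanity check on the timings: converting 16MB of bytes reads 16MB and writes 64MB, 80MB of traffic in total. Taking MB as 2^20 bytes, 1.9ms corresponds to roughly 44 GB/s of effective bandwidth, and the 1.1ms estimate to roughly 76 GB/s.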

paulius · August 17, 2007, 10:02pm

Sounds good. If I had to guess, I’d say that the final speedup came from casting to uchar4 (float and float4 should give pretty much the same performance when reading/writing global memory). If you get a chance, can you verify the timings by using uchar4 but scalar float instead of float4?

Paulius

SrJsignal · August 20, 2007, 6:50pm

The issue with that is that it creates some nastiness in addressing for the kernel, especially since I’m doing a kernel “wrapper” trick to allow for N > 65535*256 for our stuff. I’ll see how this goes this week, I’ve got to figure out how to knock down our total timing for this project or it’s going to lose out to other solutions. :( If I get some extra time I’ll throw together a test and let you know how it goes.
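The “wrapper” trick itself is not shown in the thread. One common way to handle more elements than a single launch can index is a grid-stride loop, sketched below as an assumption rather than as the poster's method. The names bytesToFloatStride and convertStrided and the maxBlocks parameter are hypothetical; the byte count is assumed to be a nonzero multiple of 4 and maxBlocks is assumed to be at least 1.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

// Grid-stride loop: each thread handles element i, then i + (total threads),
// and so on, so a capped grid can still cover an arbitrarily large input.
__global__ void bytesToFloatStride(const uchar4* in, float4* out, size_t nVec)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < nVec; i += stride) {
        uchar4 b = in[i];
        out[i] = make_float4((float)b.x, (float)b.y, (float)b.z, (float)b.w);
    }
}

// Hypothetical host wrapper. maxBlocks caps the grid size (65535 would match
// the limit mentioned in the post). Returns 0 on success, -1 on error.
int convertStrided(const unsigned char* hIn, float* hOut, size_t nBytes,
                   unsigned int maxBlocks)
{
    uchar4* dIn = NULL;
    float4* dOut = NULL;
    size_t nVec = nBytes / 4;
    if (cudaMalloc(&dIn, nBytes) != cudaSuccess) return -1;
    if (cudaMalloc(&dOut, nBytes * sizeof(float)) != cudaSuccess) {
        cudaFree(dIn);
        return -1;
    }
    cudaMemcpy(dIn, hIn, nBytes, cudaMemcpyHostToDevice);

    unsigned int threads = 256;
    size_t needed = (nVec + threads - 1) / threads;
    unsigned int blocks = (unsigned int)(needed < maxBlocks ? needed : maxBlocks);
    bytesToFloatStride<<<blocks, threads>>>(dIn, dOut, nVec);

    cudaError_t err = cudaMemcpy(hOut, dOut, nBytes * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return err == cudaSuccess ? 0 : -1;
}
```

Within each pass of the loop, consecutive threads still access consecutive vectors, so the access pattern stays contiguous regardless of how many passes are needed.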
