Tested the time to convert an original 32-bit float 1248x960x496 matrix (just under 2.4 GB) into a 16-bit float matrix of the same size, and then convert back to measure the loss of accuracy.
Wrote my own naive kernels to perform the conversions in both directions, timed them, and ran them from MATLAB via a MEX file.
Overall the times seem very good, but it takes slightly longer to convert from 16-bit float back to 32-bit float:
time to convert 1248x960x496 from 32 bit float to 16 bit float =0.013000
time to convert back 1248x960x496 from 16 bit float to 32 bit float =0.014000
mean absolute error =
max absolute error =
Times are in seconds; across runs the 32-to-16 conversion takes 10-13 ms and the 16-to-32 conversion 13-17 ms.
Using CUDA 7.5 with the latest driver, Windows 7, and a Titan X in TCC mode.
So pretty good results in both performance and in error after two conversions.
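For anyone curious, naive conversion kernels like these amount to little more than one intrinsic per element. A minimal sketch (not the actual code; it assumes the CUDA 7.5 half type from cuda_fp16.h, and note that in older toolkits __float2half_rn returned an unsigned short rather than a half):

#include <cuda_fp16.h>

// One thread per element with a grid-stride loop, so any launch
// configuration covers the full 1248x960x496 buffer.
__global__ void f32_to_f16(half *dst, const float *src, size_t n)
{
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += (size_t)gridDim.x * blockDim.x)
        dst[i] = __float2half_rn(src[i]);   // round to nearest even
}

__global__ void f16_to_f32(float *dst, const half *src, size_t n)
{
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += (size_t)gridDim.x * blockDim.x)
        dst[i] = __half2float(src[i]);
}

At these sizes both kernels are purely bandwidth bound, which is consistent with the 16-to-32 direction being slightly slower: it writes twice as many bytes as it reads.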
If anyone is interested here are the histograms of the values, with the first being the histogram of the original buffer in 32 bit form, and the second histogram being the result of the conversion done in CUDA from 32 bit to 16 bit and back to 32 bit;
The type conversion instructions are quarter throughput on Maxwell, which is plenty fast enough to keep up with device bandwidth. I suspect your speed tests would show no difference from a basic copy test with the same amount of I/O. I think __float2half_rn(float) and __half2float(unsigned short) and the underlying PTX cvt instructions have been around for quite a while now. I think the main thing CUDA 7.5 added was the half and half2 datatypes to replace the old unsigned short format. They also added fp16 support in cuBLAS and cuDNN.
What I’d especially like them to add support for is exposing the fp16x2 atomics that exist on sm_52 hardware. I can’t really see a good reason for not having done that in the 7.5 release. It’s the only reason I’m writing fp16 pooling kernels in assembly. The code would be just as fast in cuda c and easier to maintain.
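Lacking the native fp16x2 atomics, the usual CUDA C fallback is a 32-bit atomicCAS retry loop over the packed pair. A sketch only (atomicAddHalf2 is a hypothetical helper name, and the address must be 4-byte aligned):

#include <cuda_fp16.h>

// Software fallback for a packed fp16x2 atomic add. The native
// sm_52 instruction would avoid this retry loop entirely.
__device__ half2 atomicAddHalf2(half2 *addr, half2 val)
{
    unsigned int *p = reinterpret_cast<unsigned int *>(addr);
    unsigned int old = *p, assumed;
    do {
        assumed = old;
        // type-pun the 32-bit word back to half2, add in fp32
        half2  cur = *reinterpret_cast<half2 *>(&assumed);
        float2 f   = __half22float2(cur);
        float2 v   = __half22float2(val);
        half2  sum = __floats2half2_rn(f.x + v.x, f.y + v.y);
        old = atomicCAS(p, assumed,
                        *reinterpret_cast<unsigned int *>(&sum));
    } while (old != assumed);
    return *reinterpret_cast<half2 *>(&old);   // value before the add
}

Under contention the loop can retry many times, which is exactly why the single hardware instruction would be preferable.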
Since this is a “new” type (unsigned short behind the scenes) will __ldg() work correctly without casting to a supported type?
I tried using it as-is with __ldg(&halfArray[idx]); it compiled and returned the correct result, but when I instead qualified the pointers (including the half pointer) with const __restrict__ and removed the __ldg() call, that version seemed to be faster.
The texture cache really shines when your memory access patterns are strided or are non-uniform in some way. For your benchmark code I imagine you just had simple complete transaction access patterns. So having LDG.E.CI.U16 vs LDG.E.U16 instructions in your sass probably wouldn’t matter too much. Were you averaging your times over many iterations?
I still just use unsigned short in my cuda code and __ldg() so I’m not sure where all half types have been added.
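The two approaches can be sketched like this (hypothetical kernels; whether __ldg() is overloaded for the half type directly depends on the toolkit version, but casting to unsigned short, which half wraps behind the scenes, always works):

#include <cuda_fp16.h>

// (1) Let the compiler prove read-only access with const __restrict__,
// which typically yields LDG.E.CI loads on its own:
__global__ void scale_a(float *out, const half * __restrict__ in,
                        float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = s * __half2float(in[i]);
}

// (2) Force the read-only path explicitly via the unsigned short
// overload of __ldg():
__global__ void scale_b(float *out, const half *in, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned short bits =
            __ldg(reinterpret_cast<const unsigned short *>(in + i));
        out[i] = s * __half2float(*reinterpret_cast<half *>(&bits));
    }
}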
These loads are in curved diagonal patterns, with the slope and curve determined by input projection matrices as well as by current 3D position(x,y,z) in the launch grid.
In my algorithm the writes are coalesced but the reads are giving me fits.
In general using the texture cache helps, but I am still trying to find some magic heuristic I can use to bring the values I need into shared memory for re-use by the threads in a block.
To further complicate the problem the loads are done in a bi-linear interpolation pattern, i.e. (x,y), (x,y+1), (x+1,y), (x+1,y+1).
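That four-tap pattern, with the fp16 pixels widened to fp32 before blending, can be sketched as follows (bilerp16, view, and W are hypothetical names; the real code derives u and v from the projection matrices and 3D position):

#include <cuda_fp16.h>

// Gather the (x,y), (x+1,y), (x,y+1), (x+1,y+1) neighbours from an
// fp16 view buffer with pitch W pixels, widen to fp32, and blend
// with the fractional weights. Bounds checks omitted for brevity.
__device__ float bilerp16(const half * __restrict__ view,
                          int W, float u, float v)
{
    int   x  = (int)floorf(u), y = (int)floorf(v);
    float fx = u - x,          fy = v - y;

    float p00 = __half2float(view[ y      * W + x    ]);
    float p01 = __half2float(view[ y      * W + x + 1]);
    float p10 = __half2float(view[(y + 1) * W + x    ]);
    float p11 = __half2float(view[(y + 1) * W + x + 1]);

    float top = p00 + fx * (p01 - p00);
    float bot = p10 + fx * (p11 - p10);
    return top + fy * (bot - top);
}

Unlike the texture unit's 9-bit interpolation weights, the fx/fy weights here are full fp32, which matters for the iterative reconstruction issue mentioned later in the thread.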
This is for the RabbitCT benchmark, and (through my company) I am submitting my 256^3, 512^3 and 1024^3 implementations for the scorecard. At this point I have a 300 ms time for the 256^3 division using a single GPU with a mean squared error of 0.0009. That puts my implementation in second place, behind the #1 entry, which used a multi-GPU implementation with a mean squared error of 0.16.
I am in 4th place for 512^3 and 3rd place for 1024^3, using one GPU with the same mean error.
I think the top places are using some type of approximation and avoiding work. This would explain the huge difference in error between my implementations and their posted implementations.
Aside from optimizing shared memory use, you could also try to maximize L2 cache hits. You can try something as simple as remapping your block indexes. Say for every odd y you could reverse x (W-x-1) to give more of a zigzag pattern of access. Or there may be other peculiarities of the data you can exploit. Just try to avoid hitting DDR as much as possible.
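That remapping suggestion can be sketched as a kernel preamble (hypothetical kernel; gridDim.x stands in for W):

// Serpentine remap of the block x-index: on odd block rows, walk x
// in reverse so consecutively scheduled blocks touch neighbouring
// data and are more likely to hit in L2.
__global__ void backproject(/* ... */)
{
    int bx = blockIdx.x, by = blockIdx.y;
    if (by & 1)
        bx = gridDim.x - 1 - bx;   // reverse x on odd y (W - x - 1)
    // derive this block's tile origin from (bx, by) as usual ...
}

The remap costs one predicated subtract per block, so it is essentially free to try.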
Hey, I am trying to use half2, but I run into an error, namely,
error: class "__half2" has no member "y"
The section of code where the error occurs is as follows:
uint8_t V_[D]; // some elements (uint8), to save space
float V_C[D]; // storing the diffs to use later
half2 *C_ = C.elements; // D halfs stored as half2, to be read
Cvalue = 0.0f;
for (d = 0; d < D; d += 2) {
    // __half2 has no float .x/.y members; unpack with the intrinsics
    V_C[d]   = V_[d]   - __low2float(C_[d/2]);
    V_C[d+1] = V_[d+1] - __high2float(C_[d/2]);
    Cvalue  += V_C[d] * V_C[d] + V_C[d+1] * V_C[d+1];
}
I’m curious did you end up using 16-bit floats in your RabbitCT implementation? The memory bandwidth to texturing throughput seems to be steadily declining (2 bytes/texel on GTX 670, 1.75 bytes/texel on the GTX TITAN X and now just 1.25 bytes/texel on the GTX 1080) so I’m seriously considering 16-bit floats as a quick fix.
Yes I use the 16-bit float type for the filtered input view set, and I do not use texture interpolation or directly use texture memory at all.
My performance times for RabbitCT include compressing/re-ordering the input set down to a 16-bit representation. When used in the kernels I cast the view pixel data back up to 32 bit and perform the bi-linear interpolations in 32 bit. The reconstruction volumes are always represented as 32-bit.
I did everything in that implementation the ‘hard way’, because in real-world practice we have a corresponding forward projector and found that using 9-bit texture interpolation caused too much ‘fuzz’ in the process of iterative reconstruction.
I’m really impressed with the performance you’re getting without using the hardware bilinear interpolation but you’re not using the texture cache either? You’re not using textures or surfaces for your input data? Yikes!
I’ve been wondering about bypassing the texture unit myself - not so much for the precision but because the throughput doesn’t seem to be increasing as quickly as the raw TFLOPS. I figured that with the increased register pressure and the loss of the texture cache it probably wasn’t worth it yet though.
It’s very interesting that you’ve seen the 9-bit texture interpolation affecting iterative reconstruction. I’m not aware of anyone reporting that in the literature.
Thanks, but you still hold two of the three RabbitCT records using older GPUs, which is more impressive.
I use the const __restrict__ pointer qualifiers for the view loads instead of directly using textures. I did have a texture-based approach as well, but my texture-based implementation was not significantly faster than the posted approach.
I’ve posted a result for the 1024^3 reconstruction on a single GTX 1080. It’s already exceeding the theoretical texturing throughput of the GTX TITAN X and I’m hoping that use of 16-bit floats might yield a further 30% improvement.