I’m trying to do some computations in FP16 on a TX2 board, but they are running slower than in FP32. The values are already converted to half2 before being passed to the kernel.
Option 1 takes 10.7 ms for all of the input data, and option 2 takes 18.19 ms. Replacing the data with float2 and adding (same format as option 1) only takes 6.8 ms. There are about 2M data points in all, if that makes a difference. I used the Visual Profiler in Nsight for the kernel timing.
Any ideas why FP16 is taking longer than FP32? It also takes longer for multiplication operations.
In Nsight 9.0, the CUDA settings only go up to compute capability 6.1 (building for that helped a bit, but it was still slower than expected). 6.2 is not listed in the Properties/Settings/CUDA tab, and Nsight 9.1 only lists up to 6.0. I’ve been using 9.0 because that is what is available for the TX2 through JetPack, and the versions need to match so I can profile the application. Is there somewhere else I can find the 6.2 setting?
I don’t have a makefile outside of Nsight, but I was able to get it to compile in Nsight with sm_62. That required re-creating the project in the IDE from scratch: 6.2 was available in the initial project options, but not in the properties after I had chosen the 6.0 maximum the first time.
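(For reference, outside of Nsight the equivalent would presumably be passing -gencode arch=compute_62,code=sm_62 to nvcc.)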
I reran the profiling after compiling with 6.2, and the compute time did not change (for FP16 or FP32).
OK. Looking over your kernel, you may want to start by simplifying the indexing and the block/grid dimensions to one dimension, to prove that the FP16 optimizations are actually having an impact for you.
// idx = blockIdx.x*blockDim.x+threadIdx.x+blockIdx.y*N; //coalesced memory access
const int idx = blockIdx.x*blockDim.x+threadIdx.x; // 1D indexing for testing
What block and grid dimensions are you launching with?
Then I also recommend saving the global memory values to intermediate registers, like so:
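Something along these lines (just a sketch to show the pattern; the array names and the __hadd2 add are placeholders for whatever your kernel actually does):

#include <cuda_fp16.h>

__global__ void addHalf2(const __half2 *in1, const __half2 *in2, __half2 *out, const int n)
{
    const int idx = blockIdx.x*blockDim.x+threadIdx.x; // 1D indexing for testing
    if (idx < n) {
        // load the operands from global memory into registers first
        const __half2 a = in1[idx];
        const __half2 b = in2[idx];
        // then operate on the registers and write the result back once
        out[idx] = __hadd2(a, b);
    }
}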
I wrote the following in a separate project to test just the half vs. float add time (so it’s definitely independent of the rest of my program), and half still takes longer (134 µs vs. 50 µs).
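In outline, the comparison looked roughly like this (a simplified sketch, not the exact code; the names are placeholders):

#include <cuda_fp16.h>

// half version: adds the low and high halves of each half2 with separate __hadd calls
__global__ void addHalf(const __half2 *a, const __half2 *b, __half2 *c, const int n)
{
    const int idx = blockIdx.x*blockDim.x+threadIdx.x;
    if (idx < n) {
        const __half lo = __hadd(__low2half(a[idx]), __low2half(b[idx]));
        const __half hi = __hadd(__high2half(a[idx]), __high2half(b[idx]));
        c[idx] = __halves2half2(lo, hi);
    }
}

// float version used for the timing comparison
__global__ void addFloat(const float *a, const float *b, float *c, const int n)
{
    const int idx = blockIdx.x*blockDim.x+threadIdx.x;
    if (idx < n) c[idx] = a[idx] + b[idx];
}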
I ran the same function with the two __hadd calls replaced with a single __hadd2, and it took 0.0139777 ms. That is shorter than the separate __hadd lines, but still longer than adding floats.
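In other words, the body became something like this (same placeholder names as the sketch above):

if (idx < n) c[idx] = __hadd2(a[idx], b[idx]); // one half2 vector add instead of two scalar __hadd calls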
I found this thread interesting because I have recently accelerated my own application on the TX2 using half floats, and I observed a run-time reduction close to the expected 2x speedup.
I ran the code posted earlier in this thread and observed the same results as reported: the half-precision runtime was either the same as or slower than single precision. However, if I use half2 and __hadd2(), give the kernel launch more blocks to run, and give the kernel more work to do with an internal loop, I observe close to the expected 2x speedup with half floats.
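That is, something along these lines (a sketch; the launch configuration and names are just illustrative):

#include <cuda_fp16.h>

// grid-stride loop: each thread processes many half2 elements, so the launch can use
// plenty of blocks and each kernel invocation has enough work to hide memory latency
__global__ void addHalf2Strided(const __half2 *a, const __half2 *b, __half2 *c, const int n)
{
    for (int i = blockIdx.x*blockDim.x+threadIdx.x; i < n; i += gridDim.x*blockDim.x)
        c[i] = __hadd2(a[i], b[i]); // two FP16 adds per instruction
}

// launched with enough blocks to fill the GPU, e.g.
// addHalf2Strided<<<1024, 256>>>(d_a, d_b, d_c, n);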