Nice to see your project improving rapidly…
Just a few questions and remarks:
Back when you were working on the Cell backend, you generated SIMD instructions and handled branch divergence in software, right? Do you plan to do so with this translator?
I think LLVM supports vector instructions and registers.
Since most CUDA codes are data-parallel programs already optimized for SIMD execution, and the hardware industry is heading for general-purpose cores with wide SIMD extensions, I believe that makes an interesting research direction (and just figuring out the best way to implement branches and predication should keep a few PhD students busy for some time ;)).
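To make the question concrete, here is a rough sketch in C of what I mean by handling divergence in software. It is not taken from your translator or from the Cell backend: the function name, the LANES constant and the toy kernel (negate negative inputs, double the others) are all made up for illustration. Each lane gets a predicate bit, both sides of the branch run under that mask, and a side is skipped entirely when no lane needs it.

```c
#define LANES 4  /* hypothetical SIMD width, chosen only for this sketch */

/* Toy kernel body: if (x < 0) y = -x; else y = 2 * x;
   executed for LANES threads at once with a software predicate mask. */
void abs_or_double_lanes(const float *x, float *y)
{
    int mask[LANES];
    int any_then = 0, any_else = 0;

    /* Evaluate the branch condition for every lane. */
    for (int l = 0; l < LANES; ++l) {
        mask[l] = x[l] < 0.0f;
        any_then |= mask[l];
        any_else |= !mask[l];
    }

    /* "Then" side: runs only if at least one lane took the branch. */
    if (any_then)
        for (int l = 0; l < LANES; ++l)
            if (mask[l])
                y[l] = -x[l];

    /* "Else" side: runs only if at least one lane did not. */
    if (any_else)
        for (int l = 0; l < LANES; ++l)
            if (!mask[l])
                y[l] = 2.0f * x[l];
}
```

With real vector instructions the inner loops would become a compare, two vector operations and a select or masked store. The open question is when the "any lane active" test pays for itself.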
You observe that strided accesses are much slower than sequential accesses on the CPU. Do you think it would be possible to detect at least some coalesced memory accesses in the PTX code through static analysis, and then translate them into sequential/vector loads and stores on the CPU side?
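As a sketch of the idea, assume the static analysis can prove that a load address is affine in the thread index, i.e. element index = base + stride * tid. The affine_access struct and load_for_block function below are hypothetical names of mine, not anything from your translator. When the stride is 1, the loads of a whole block collapse into one contiguous copy; otherwise the translator falls back to a per-thread gather.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result of the static analysis:
   thread tid reads element (base + stride * tid). */
typedef struct {
    size_t base;
    size_t stride;
} affine_access;

/* Load one float per thread into dst[0..nthreads-1]. */
void load_for_block(float *dst, const float *src,
                    affine_access a, size_t nthreads)
{
    if (a.stride == 1) {
        /* Coalesced on the GPU: contiguous on the CPU, so a single
           sequential (or vector) copy is enough. */
        memcpy(dst, src + a.base, nthreads * sizeof(float));
    } else {
        /* Strided access: fall back to a scalar gather loop. */
        for (size_t tid = 0; tid < nthreads; ++tid)
            dst[tid] = src[a.base + a.stride * tid];
    }
}
```

The hard part is of course the analysis itself, since the thread index usually reaches the address through several PTX arithmetic instructions.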
I don’t think your implementation of rounding works as it stands. Think of what happens at a midpoint between two integers. Also, cvt.rni.f32.f32 needs to work with big numbers too. My suggestion is to implement all conversions as library functions based on nearbyint() and lrint(), as you already do in the emulator, and list that among the “supported in hardware on the GPU but not the CPU” stuff.
What was the range of the random inputs for the special function throughput benchmarks?
Interesting that rsqrt ends up being faster than sqrt even for scalar code…
In my opinion, the ultimate CUDA->CPU translator should:
take advantage of SIMD instructions when possible and efficient and select the appropriate SIMD width,
figure out memory access patterns to emit the most efficient memory instructions,
provide a target-dependent library of data-parallel functions such as reduction and scan, math functions and such. I think the lack of any way to do this is the most significant limitation of PTX at the moment.
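To illustrate the last point with a sketch in C (both function names are hypothetical): a block-wide sum written for the GPU typically reaches the translator as a shared-memory tree with a barrier between steps, whereas a CPU-specific library version of the same operation could be a plain loop that the compiler is free to vectorize.

```c
#include <stddef.h>

/* What a CPU-specific library entry point could look like:
   sum of one value per thread, done as a simple sequential loop. */
float block_reduce_add_cpu(const float *per_thread, size_t nthreads)
{
    float acc = 0.0f;
    for (size_t tid = 0; tid < nthreads; ++tid)
        acc += per_thread[tid];
    return acc;
}

/* The GPU-style tree reduction, serialized thread by thread.
   Assumes nthreads is a power of two; overwrites scratch. */
float block_reduce_add_tree(float *scratch, size_t nthreads)
{
    for (size_t stride = nthreads / 2; stride > 0; stride /= 2) {
        for (size_t tid = 0; tid < stride; ++tid)
            scratch[tid] += scratch[tid + stride];
        /* On the GPU a barrier (bar.sync) separates the steps here. */
    }
    return scratch[0];
}
```

The two versions add the values in a different order, so their floating-point results can differ slightly. That is one more reason to expose the operation as a library call with defined semantics rather than have the translator try to recognize the tree.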