Translating CUDA Programs to Architectures Other than GPUs: Tech Report

For anyone who is interested, I wrote a back-end compiler/translator last semester that retargets CUDA programs to the IBM Cell processor. If you are building back-end CUDA tools (anything that deals with PTX), the following tech report describing our implementation may be useful.

[url="http://www.cercs.gatech.edu/tech-reports/tr2009/abstracts/01.html"]http://www.cercs.gatech.edu/tech-reports/tr2009/abstracts/01.html[/url]

Very nice. I'm attempting to write my own compiler (targeting PTX) right now... it would be neat to see the PTX JIT engine get implemented on lots of different architectures so that kernels become more portable.

As another idea, I wonder if PTX could even scale to distributed computing if the block sizes were made very large to overcome the network latency (or perhaps that might not even matter). The thread barrier (__syncthreads()) could have the cluster master wait until all nodes have returned data, then distribute the work back out again, etc.
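
Something like the following is what I have in mind, just a rough sketch (using MPI for the node-to-node transport purely for illustration; none of this exists in CUDA or PTX, and the per-node kernel launch is left out). The gather at the cluster master plays the role of the barrier: it blocks until every node has returned its data, and then the next scatter hands the work back out.

[code]
// Sketch only: master/worker scatter-gather as a stand-in for a
// cluster-wide barrier. Chunk sizes and round count are made up.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, nodes;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);

    const int chunk = 1 << 20;  // elements per node (arbitrary)
    std::vector<float> all(rank == 0 ? chunk * nodes : 0);
    std::vector<float> mine(chunk);

    for (int round = 0; round < 4; ++round) {
        // Master hands each node the data for its share of blocks.
        MPI_Scatter(all.data(), chunk, MPI_FLOAT,
                    mine.data(), chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

        // ... launch this node's share of the kernel's blocks on its GPU ...

        // The gather acts as the cluster-wide barrier: the master blocks
        // here until every node has returned, then loops to redistribute.
        MPI_Gather(mine.data(), chunk, MPI_FLOAT,
                   all.data(), chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
[/code]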

We considered the idea of splitting PTX kernels across multiple devices (GPUs rather than distributed nodes, though the problem is similar), but ended up abandoning it because of difficulties with the shared memory model used by PTX. Because any ordering of CTAs is valid in the programming model, it is possible to broadcast all host-to-device cudaMemcpy calls to every device and then split the total number of CTAs across the different devices. The problem is then in combining the results copied back via cudaMemcpy from device to host. In the worst case, CTAs on different cards will interleave writes into the data section being copied back. Implementing a merge on the data would involve copying that section of memory from all devices, then comparing each byte with an unmodified copy and selectively replacing only the bytes that differ. Up until sm_11 I think it was doable, just incredibly annoying.
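
To make that merge step concrete, here is roughly what I mean, as a host-side sketch (this is not the code from the report; the buffer layout and the "reference" snapshot of the region are assumptions for illustration):

[code]
#include <cuda_runtime.h>
#include <vector>
#include <cstring>

// Merge the output region from `numDevices` GPUs into `hostResult`.
// `reference` holds the region's contents as they were before the kernel
// launch, so any byte that differs from it must have been written by some
// device (assuming CTAs on different devices do not write the same byte).
void mergeDeviceCopies(void* const* deviceBuffers,   // one pointer per GPU
                       unsigned char* hostResult,    // final merged output
                       const unsigned char* reference,
                       size_t bytes, int numDevices)
{
    std::memcpy(hostResult, reference, bytes);
    std::vector<unsigned char> staging(bytes);

    for (int dev = 0; dev < numDevices; ++dev) {
        cudaSetDevice(dev);
        cudaMemcpy(staging.data(), deviceBuffers[dev], bytes,
                   cudaMemcpyDeviceToHost);

        // Keep only the bytes this device actually modified.
        for (size_t i = 0; i < bytes; ++i) {
            if (staging[i] != reference[i]) {
                hostResult[i] = staging[i];
            }
        }
    }
}
[/code]

This only works if CTAs on different cards never write the same byte and never write a byte back to its original value; otherwise the diff against the reference copy cannot tell which device's value should win.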

Atomic operations further complicate the process and, from what I can tell, make it impossible to do in the general case on GPUs. This is because there is no way (without instrumenting the code) to determine which global addresses are updated via atomic instructions. Implementing this correctly across multiple GPUs would require performing a reduction over only the bytes that were modified via an atomic operation, across all of the copies from all GPUs, in addition to the merge operation described above.
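
As a trivial illustration (not from the report): if the CTAs of a kernel like this are split across two GPUs, each device ends up holding only its own partial count, and the byte-wise merge above would just keep one of the partial counts instead of summing them. The correct combining operation depends on knowing that the address was updated by atomicAdd.

[code]
// Illustrative only: a counter updated with atomicAdd from every CTA.
// With the CTAs split across two GPUs, device 0 and device 1 each hold
// only their own partial count in *counter. A byte-wise merge would
// overwrite one partial count with the other; the correct result is
// their sum, which requires knowing this address was written atomically.
__global__ void countMatches(const int* data, int n, int target, int* counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] == target) {
        atomicAdd(counter, 1);
    }
}
[/code]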

Some of the other work I am doing, which I think is more promising ( http://www.gdiamos.net/classes/harmony/hpd…5hot-diamos.pdf ), looks at CUDA programs with multiple kernels and tries to determine which kernels can be launched in parallel across multiple GPUs, or possibly multiple nodes.

Very impressive! Great work.