Translating CUDA Programs to Architectures Other than GPUs: Tech Report

For anyone who is interested, I wrote a back-end compiler/translator last semester that retargets CUDA programs to the IBM Cell processor. If you are building back-end CUDA tools (anything that deals with PTX), the following tech report describing our implementation may be useful.

[url="http://www.cercs.gatech.edu/tech-reports/tr2009/abstracts/01.html"]http://www.cercs.gatech.edu/tech-reports/tr2009/abstracts/01.html[/url]

Very nice. I'm attempting to write my own compiler (targeting PTX) right now... it would be neat to see the PTX JIT engine get implemented on lots of different architectures so that kernels become more portable.

As another idea, I wonder if PTX could even scale to distributed computing if the block sizes were made very large to overcome the network latency (or perhaps that might not even matter). The thread barrier (__syncthreads()) could have the cluster master wait until all nodes have returned data, then distribute the work back out again, etc.
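
Something like the following is what I have in mind, just a rough sketch (using MPI for the node-to-node transport purely for illustration; none of this exists in CUDA or PTX, and the per-node kernel launch is left out). The gather at the cluster master plays the role of the barrier: it blocks until every node has returned its data, and then the next scatter hands the work back out.

[code]
// Sketch only: master/worker scatter-gather as a stand-in for a
// cluster-wide barrier. Chunk sizes and round count are made up.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, nodes;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);

    const int chunk = 1 << 20;  // elements per node (arbitrary)
    std::vector<float> all(rank == 0 ? chunk * nodes : 0);
    std::vector<float> mine(chunk);

    for (int round = 0; round < 4; ++round) {
        // Master hands each node the data for its share of blocks.
        MPI_Scatter(all.data(), chunk, MPI_FLOAT,
                    mine.data(), chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

        // ... launch this node's share of the kernel's blocks on its GPU ...

        // The gather acts as the cluster-wide barrier: the master blocks
        // here until every node has returned, then loops to redistribute.
        MPI_Gather(mine.data(), chunk, MPI_FLOAT,
                   all.data(), chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
[/code]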

We considered the idea of splitting PTX kernels across multiple devices (GPUs rather than distributed nodes, though the problem is similar), but ended up abandoning it because of difficulties with the shared memory model used by PTX. Because any ordering of CTAs is valid in the programming model, it is possible to broadcast all host-to-device cudaMemcpy calls to every device and then split the total number of CTAs across the different devices. The problem is then in combining the results copied back via cudaMemcpy from device to host. In the worst case, CTAs on different cards will interleave writes into the data section being copied back. Implementing a merge on the data would involve copying that section of memory from all devices, then comparing each byte with an unmodified copy and selectively replacing only the bytes that differ. Up until sm_11 I think it was doable, just incredibly annoying.
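
To make that merge step concrete, here is roughly what I mean, as a host-side sketch (this is not the code from the report; the buffer layout and the "reference" snapshot of the region are assumptions for illustration):

[code]
#include <cuda_runtime.h>
#include <vector>
#include <cstring>

// Merge the output region from `numDevices` GPUs into `hostResult`.
// `reference` holds the region's contents as they were before the kernel
// launch, so any byte that differs from it must have been written by some
// device (assuming CTAs on different devices do not write the same byte).
void mergeDeviceCopies(void* const* deviceBuffers,   // one pointer per GPU
                       unsigned char* hostResult,    // final merged output
                       const unsigned char* reference,
                       size_t bytes, int numDevices)
{
    std::memcpy(hostResult, reference, bytes);
    std::vector<unsigned char> staging(bytes);

    for (int dev = 0; dev < numDevices; ++dev) {
        cudaSetDevice(dev);
        cudaMemcpy(staging.data(), deviceBuffers[dev], bytes,
                   cudaMemcpyDeviceToHost);

        // Keep only the bytes this device actually modified.
        for (size_t i = 0; i < bytes; ++i) {
            if (staging[i] != reference[i]) {
                hostResult[i] = staging[i];
            }
        }
    }
}
[/code]

This only works if CTAs on different cards never write the same byte and never write a byte back to its original value; otherwise the diff against the reference copy cannot tell which device's value should win.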

Atomic operations further complicate the process and, from what I can tell, make it impossible to do in the general case on GPUs. This is because there is no way (without instrumenting the code) to determine which global addresses are updated via atomic instructions. Implementing this correctly across multiple GPUs would require performing a reduction over only the bytes that were modified via an atomic operation, across all of the copies from all GPUs, in addition to the merge operation described above.
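
As a trivial illustration (not from the report): if the CTAs of a kernel like this are split across two GPUs, each device ends up holding only its own partial count, and the byte-wise merge above would just keep one of the partial counts instead of summing them. The correct combining operation depends on knowing that the address was updated by atomicAdd.

[code]
// Illustrative only: a counter updated with atomicAdd from every CTA.
// With the CTAs split across two GPUs, device 0 and device 1 each hold
// only their own partial count in *counter. A byte-wise merge would
// overwrite one partial count with the other; the correct result is
// their sum, which requires knowing this address was written atomically.
__global__ void countMatches(const int* data, int n, int target, int* counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] == target) {
        atomicAdd(counter, 1);
    }
}
[/code]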

Some of the other work I am doing, which I think is more promising ( http://www.gdiamos.net/classes/harmony/hpd…5hot-diamos.pdf ), looks at CUDA programs with multiple kernels and tries to determine which kernels can be launched in parallel across multiple GPUs, or possibly multiple nodes.

Very impressive! Great work.