How Do You Run a CUDA Program on Multiple Systems?

So, I have written a program that uses CUDA and everything is great. It loads a predetermined file, performs calculations on the data from the file, and then saves the results to another file. It runs much faster than the equivalent CPU program and I am very happy. I use Visual Studio 2008 and the CUDA 4.0 toolkit. The code is compatible with compute capability 1.0, but I have a card with compute capability 2.0 (not that I think this matters).

Now here is the thing: how do I make this same executable run on other Windows systems that have CUDA-capable cards? I have been very unsuccessful in finding any information on this, and my attempts have failed (I have a few Win7 x64 machines with different hardware at my disposal). Is there a way to compile the code once so that it includes all of the information it would need to run on any other system? I cannot install Visual Studio and the CUDA toolkit on every machine that my program will end up on; it just isn’t practical. I would ideally like to have it work on all compute capabilities, for both 32-bit and 64-bit systems, and for XP/Vista/7. I don’t actually mind if I have to have a different executable for each of those parameters, so long as I can be sure that there is an executable that will work. I found some references to making a device code repository, but I am not sure that is what I need and I couldn’t figure it out.

Your help is much appreciated. Thanks!

Your CUDA code should run on any GPU that has the same compute capability as the one you compiled for. If you also need it to run on GPUs with other compute capabilities, you can compile to PTX for some minimum compute capability and then use the driver API to load and run the PTX versions of your kernels. The PTX will then be JIT-compiled at run time, and will run on any GPU that has a compute capability at least as high as the minimum you compiled for.
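To make the two compatibility rules above concrete, here is a minimal C++ sketch. The names ComputeCapability, ptxCanRunOn and cubinCanRunOn are made up for illustration and are not part of any CUDA API. The cubin rule assumes the usual binary-compatibility policy (same major revision, equal or higher minor revision), which is a bit more permissive than "exactly the same compute capability".

```cpp
// Compute capability split into its two parts, e.g. {2, 0} for a 2.0 device.
// (Hypothetical helper type, not a CUDA type.)
struct ComputeCapability {
    int ccMajor;
    int ccMinor;
};

// PTX is JIT-compiled by the driver at load time, so it can run on any
// device whose compute capability is at least the one the PTX targets.
constexpr bool ptxCanRunOn(ComputeCapability ptx, ComputeCapability device) {
    return device.ccMajor > ptx.ccMajor ||
           (device.ccMajor == ptx.ccMajor && device.ccMinor >= ptx.ccMinor);
}

// A cubin is precompiled machine code. Assumed rule: it only runs on devices
// with the same major revision and an equal or higher minor revision.
constexpr bool cubinCanRunOn(ComputeCapability cubin, ComputeCapability device) {
    return device.ccMajor == cubin.ccMajor && device.ccMinor >= cubin.ccMinor;
}
```

Under these rules, PTX built for 1.0 is accepted by a 2.0 or 2.1 card, while a cubin built only for 1.0 is not.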

Thanks so much!! That gave me a hint of where to look (just-in-time compilation). After reading another post on the forums, I created the environment variable CUDA_FORCE_PTX_JIT and set it to 1. I then reloaded Visual Studio and compiled for compute capability 1.0. Now my executables work on 2 of my 3 computers! So that is a lot of progress.

But I am still not quite done. I want to know why it isn’t working on the third computer. I think it has to do with either the driver version or the fact that I am dealing with a Tesla. Here are the stats of the computers (all 3 run 64-bit Windows 7 Enterprise but have very different hardware):

Computer 1: “GeForce GTX 465”
CUDA Driver Version / Runtime Version: 4.0 / 4.0
CUDA Capability Major/Minor version number: 2.0
Graphics driver: 275.33

Computer 2: “GeForce GTS 450”
CUDA Driver Version / Runtime Version: 4.0 / 4.0
CUDA Capability Major/Minor version number: 2.1
Graphics driver: 275.33

Computer 3: “Tesla C2050”
CUDA Driver Version / Runtime Version: 3.20 / 3.20
CUDA Capability Major/Minor version number: 2.0
Graphics driver: 267.24

So, the code will compile individually on all 3, but when I compile it on #3, it won’t run on #1 or #2. The converse is also true: code compiled on either #1 or #2 won’t run on #3. But if the code is compiled on #1, it will run on #2 and vice versa. So, why is this happening? What can I do to ensure compatibility on all systems with CUDA-capable cards?
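The most visible difference in the listings above is the driver/runtime version (4.0 versus 3.20). As a rough sketch of why that could matter, here is a small C++ check. It assumes the common rule that the installed driver must be at least as new as the CUDA runtime the executable was built against, and it uses the integer encoding CUDA reports for versions (1000 * major + 10 * minor). The function names are hypothetical; in a real program the two numbers would come from cudaDriverGetVersion and cudaRuntimeGetVersion.

```cpp
// CUDA reports versions as 1000 * major + 10 * minor,
// so 4.0 becomes 4000 and 3.2 becomes 3020.
constexpr int encodeCudaVersion(int majorVer, int minorVer) {
    return 1000 * majorVer + 10 * minorVer;
}

// Assumed rule: a driver can serve any runtime that is not newer than itself.
constexpr bool driverSupportsRuntime(int driverVersion, int runtimeVersion) {
    return driverVersion >= runtimeVersion;
}
```

Under that assumption, an executable built against the 4.0 runtime would be rejected by the 3.20 driver on computer 3. The opposite direction passes this check, so a build from computer 3 failing on computers 1 and 2 would need a different explanation, for example the 3.2 runtime library not being present on those machines.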

shameless self bump…

Someone please help me out!

Please, someone help me out.

PLEASE! What is going on with my code?

Well, I waited a few days this time before self-bumping but it seemed to not pay off. Almost 800 views but still no answer. Does nobody have any ideas? I will gladly accept vague hints if that is all you have.

My guess is you should update the CUDA library files to the same version, e.g. 4.x.

Here is an example of what I get on my Linux machine when I ask what a program compiled 3 months ago with CUDA 3.2 depends on:

zkoza@mormi:~$ ldd zkcopy
	linux-vdso.so.1 => (0x00007ffffa7ff000)
	libcudart.so.3 => not found
	libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007fae6d9ea000)
	libm.so.6 => /lib/libm.so.6 (0x00007fae6d767000)
	libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007fae6d550000)
	libc.so.6 => /lib/libc.so.6 (0x00007fae6d1cc000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fae6dd25000)

As you can see, the loader cannot find the library file libcudart.so.3. The trailing number, 3, is the major version used by CUDA 3.x.

But since I installed CUDA 4.0, my CUDA library files were upgraded to version 4:

zkoza@mormi:~$ find /usr -name "libcudart.so.*"
/usr/local/cuda/lib64/libcudart.so.4.0.17
/usr/local/cuda/lib64/libcudart.so.4
/usr/local/cuda/lib/libcudart.so.4.0.17
/usr/local/cuda/lib/libcudart.so.4
/usr/local/cuda3.2/lib64/libcudart.so.3.2.16
/usr/local/cuda3.2/lib64/libcudart.so.3
/usr/local/cuda3.2/lib/libcudart.so.3.2.16
/usr/local/cuda3.2/lib/libcudart.so.3

As you can see, I have both CUDA 3.x and 4.x installed on the system, in 32-bit and 64-bit versions (even though I’m on 64-bit). However, my loader, to stay sane, sees neither the files from version 3.x nor the 32-bit CUDA libraries:

zkoza@mormi:~$ echo $LD_LIBRARY_PATH
/usr/local/cuda/lib64:/usr/lib64:/lib/:/home/zkoza/lib

The loader looks at /usr/local/cuda/lib64 and not at /usr/local/cuda3.2/lib64 where the 64-bit libcudart.so.3 actually resides.

I could easily make my program run by changing the value of LD_LIBRARY_PATH environment variable.
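As a simplified illustration of that lookup, here is a C++ sketch that walks a colon-separated search path and returns the first directory containing the requested library. The function name findOnSearchPath and the fileExists callback are hypothetical, and the real Linux loader also consults other locations (such as its cache and default directories), so treat this only as a model of the LD_LIBRARY_PATH part.

```cpp
#include <functional>
#include <sstream>
#include <string>

// Returns the first "<dir>/<name>" along a colon-separated search path for
// which fileExists reports true, or an empty string if no directory has it.
std::string findOnSearchPath(
        const std::string& searchPath,
        const std::string& name,
        const std::function<bool(const std::string&)>& fileExists) {
    std::istringstream dirs(searchPath);
    std::string dir;
    while (std::getline(dirs, dir, ':')) {
        if (dir.empty()) continue;            // skip empty path entries
        if (dir.back() != '/') dir += '/';    // normalise "/lib" and "/lib/"
        const std::string candidate = dir + name;
        if (fileExists(candidate)) return candidate;
    }
    return "";
}
```

With the search path shown above, a lookup for libcudart.so.3 comes back empty; putting /usr/local/cuda3.2/lib64 on the path makes the same lookup succeed.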

So, even if you’re on Windows, I guess the same mechanism works.

Your PTX/CUDA code is actually portable among all machines, but the CPU code is not: it depends on some special library files which must be (a) present on the host system and (b) visible to the loader.

cudart should be shipped with your application, usually in your program folder (on Linux, you may have to add . to LD_LIBRARY_PATH). Then there will be no versioning problems, provided that the driver is at least at the same major/minor version level as your bundled cudart library and that your binary embeds the PTX code (or has cubins built for all compute capability versions that you expect to run on).
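For the "embed PTX or build cubins for every target" part, nvcc accepts repeated -gencode options. The following C++ sketch only assembles such an option string; the helper name gencodeFlags is made up, and which architectures you list depends on what your toolkit version supports.

```cpp
#include <string>
#include <vector>

// Builds an nvcc option string with one cubin per entry in cubinArchs
// (code=sm_XX) plus embedded PTX for ptxArch (code=compute_XX), which the
// driver can JIT-compile for devices that have no matching cubin.
// Architectures are given as two-digit numbers, e.g. 10 for 1.0, 20 for 2.0.
std::string gencodeFlags(const std::vector<int>& cubinArchs, int ptxArch) {
    std::string flags;
    for (int arch : cubinArchs) {
        const std::string a = std::to_string(arch);
        flags += "-gencode arch=compute_" + a + ",code=sm_" + a + " ";
    }
    const std::string p = std::to_string(ptxArch);
    flags += "-gencode arch=compute_" + p + ",code=compute_" + p;
    return flags;
}
```

For example, gencodeFlags({10, 20}, 10) yields options for cubins targeting 1.0 and 2.0 plus PTX for 1.0, which could then be passed to nvcc on the command line or in the project's build settings.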

I find that my code built with CUDA SDK 2.3 runs just fine even when deployed on Fermi architectures. This is enough for what we do and makes us quite happy.

Christian