How to make OpenCL work on a cluster with NVIDIA Tesla GPUs?

Hi all!

I have been using OpenCL on my personal computer for a while, and I was recently given access to a GPU cluster for a student project.
According to the owner, it runs CentOS and has 7 nodes with 7 NVIDIA Tesla GPUs, with CUDA 4.0 installed. That version is probably outdated, but the cluster doesn’t seem to be used frequently.
Anyway, I don’t have root access, so I am trying to build and run some OpenCL programs with what is already installed.

I am not used to working on a cluster at all, so I looked around on the internet, and something called “VirtualCL” may be a way to make my programs run across the nodes.
Still, before attempting to use it, I would expect a basic program to work on a single node. Am I right?
So I logged in to a node and found libOpenCL.so in /usr/lib64/nvidia and CL.h in /usr/include/cuda. lspci told me:

0c:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev a1)
11:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev a1)
12:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev a1)
83:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev a1)
84:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev a1)
87:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev a1)
88:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev a1)

Then I managed to build a simple program, and nothing unexpected happened until I tried to launch it: the first call to an OpenCL function failed, with “clGetPlatformIDs” returning -1001.
There are many threads about that error on the internet, but none of them seem to deal with a cluster, and it seems rather odd to me that it would stem from the CUDA setup, since someone has presumably used it successfully before me.

In fact, I am starting to think that I misunderstand how OpenCL is used on a cluster. Is there at least a way to check whether CUDA is working on it?

CUDA 4.0 is pretty old, but it did contain both CUDA SDK sample codes and OpenCL SDK sample codes. I would try to see if you can locate those sample codes. They should have Makefiles with them, so you should be able to compile them, although it may require root access. To do it without root access, you will need to see if you can copy the sample codes to your own user space, and fix up file/path references to compile from there.

If you can’t find the SDK anywhere, you can download the SDK from the NVIDIA archives:

https://developer.nvidia.com/cuda-toolkit-40

The SDK itself is not distro specific, so here is the direct download link:

http://developer.download.nvidia.com/compute/cuda/4_0/sdk/gpucomputingsdk_4.0.17_linux.run

You can download that file to your own user space, make it executable (chmod +x …) and run it.

After the install operation is complete, you will have to make the samples. Just change into the directory that they were unzipped to and type make (or even better, make -k).

After that you can browse around and try running some of the samples (both CUDA and OpenCL) to see if things are working. I would start with the deviceQuery sample.
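If the samples are hard to get building, a much smaller probe can reproduce the failing call on its own. The sketch below is only an illustration: it assumes Python 3 is available on the node, it uses ctypes to call clGetPlatformIDs directly, and the helper names count_platforms and describe_cl_error are made up for this example. The value -1001 is CL_PLATFORM_NOT_FOUND_KHR, the code the ICD loader returns when it finds no usable vendor platform.

```python
import ctypes
import ctypes.util

# A few OpenCL status codes relevant here (not a complete table).
CL_ERRORS = {
    0: "CL_SUCCESS",
    -30: "CL_INVALID_VALUE",
    -1001: "CL_PLATFORM_NOT_FOUND_KHR",
}


def describe_cl_error(code):
    """Translate a numeric OpenCL status into its symbolic name."""
    return CL_ERRORS.get(code, f"unknown OpenCL error {code}")


def count_platforms(lib_path=None):
    """Call clGetPlatformIDs and return (status, num_platforms).

    Returns None if the OpenCL loader library cannot be found or loaded.
    lib_path is optional; by default the system search path is used.
    """
    name = lib_path or ctypes.util.find_library("OpenCL")
    if name is None:
        return None
    try:
        cl = ctypes.CDLL(name)
    except OSError:
        return None
    # cl_int clGetPlatformIDs(cl_uint, cl_platform_id *, cl_uint *)
    cl.clGetPlatformIDs.argtypes = [
        ctypes.c_uint,
        ctypes.c_void_p,
        ctypes.POINTER(ctypes.c_uint),
    ]
    cl.clGetPlatformIDs.restype = ctypes.c_int
    num = ctypes.c_uint(0)
    # num_entries=0 with platforms=NULL only asks for the platform count.
    status = cl.clGetPlatformIDs(0, None, ctypes.byref(num))
    return status, num.value
```

On the node, count_platforms("/usr/lib64/nvidia/libOpenCL.so") would point the probe at the library mentioned in the question; passing the status it returns to describe_cl_error gives the symbolic name of the error.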

Before doing any of the above, you might try running:

nvidia-smi -a

If you get any sort of error codes, you’re not going to have any luck on that cluster until the driver issue is sorted out, which might require root access. If the cluster was not set up properly, it may be necessary to run the nvidia-smi command as root, before the GPUs become “available” for ordinary users. The details of this are covered in a variety of places, including the driver release notes/FAQ.
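If you want to script that check, for example to run it on each node in turn, something like the following might do. It is a rough sketch that assumes Python 3.7 or later; run_check is a made-up helper name, and it only looks at the exit status and output of nvidia-smi, nothing more.

```python
import subprocess


def run_check(cmd=("nvidia-smi", "-a"), timeout=30):
    """Run a diagnostic command and return (ok, text).

    ok is True only if the command could be started and exited with 0.
    text is the command's output, or a short description of the failure.
    """
    try:
        proc = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout
        )
    except OSError as exc:
        # Covers "command not found" and permission problems.
        return False, f"{cmd[0]}: could not be started ({exc})"
    except subprocess.TimeoutExpired:
        return False, f"{cmd[0]}: timed out after {timeout}s"
    if proc.returncode != 0:
        return False, (proc.stderr or proc.stdout).strip()
    return True, proc.stdout.strip()


if __name__ == "__main__":
    ok, text = run_check()
    print("driver responds" if ok else "driver problem:")
    print(text[:500])
```

A failure here points at the driver rather than at OpenCL or at your own program, which is worth knowing before spending time on the samples.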

I believe the reason for the error is that your distribution is missing the ICD file (normally found under /etc/OpenCL/vendors) that tells the OpenCL loader where the vendor's OpenCL library lives… see:

https://github.com/MrMEEE/ironhide/issues/128
http://stackoverflow.com/questions/4959621/error-1001-in-clgetplatformids-call
https://bugs.launchpad.net/darktable/+bug/1039684
http://stackoverflow.com/questions/10776230/getplatformids-returns-1001-even-though-nvidia-icd-exists-and-contains-libcu

The fact that none of those solutions deal with a cluster is irrelevant in this case… The solution should be to point the loader to the right OpenCL library.
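To see what the loader has to work with, you can list the registered ICD files and check whether the libraries they name can actually be loaded. This is a sketch under a few assumptions: Python 3 on the node, the standard Linux location /etc/OpenCL/vendors, and two made-up helper names (list_icd_entries and can_load). Each .icd file is a small text file containing the name or path of a vendor library; for NVIDIA the file is usually called nvidia.icd.

```python
import ctypes
import glob
import os


def list_icd_entries(vendors_dir="/etc/OpenCL/vendors"):
    """Return a list of (icd_file, library_name) pairs.

    An empty list means the ICD loader has no vendor to hand out,
    which is one way to end up with error -1001.
    """
    entries = []
    for path in sorted(glob.glob(os.path.join(vendors_dir, "*.icd"))):
        with open(path) as f:
            lib = f.read().strip()
        if lib:
            entries.append((path, lib))
    return entries


def can_load(lib):
    """Return True if the dynamic linker can load the named library."""
    try:
        ctypes.CDLL(lib)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    entries = list_icd_entries()
    if not entries:
        print("no ICD files found")
    for icd_file, lib in entries:
        state = "loadable" if can_load(lib) else "NOT loadable"
        print(f"{icd_file}: {lib} ({state})")
```

If the file is missing, creating it needs root. If the file is there but the library it names is not loadable, that suggests a broken or incomplete driver install rather than a missing ICD entry.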

OK,

I couldn’t locate the sample files, but the CUDA version was updated a few days ago. However, I was still getting the same error!

I went through the threads dealing with the same error, but in the end it looks like the NVIDIA drivers were not running or not installed properly, even though I was using a node with 7 NVIDIA Tesla GPUs that was supposed to have been tested.

So I asked for the drivers to be reinstalled, and apparently that fixed the issue.

Thank you txbob and vacaloca for your answers!