How to make OpenCL work on a cluster with Nvidia Tesla?

antoinexp · March 1, 2014, 8:58pm

Hi all!

I have been using OpenCL on my personal computer for a while, and I was recently given access to a GPU cluster for a student project.
According to the owner, it runs CentOS and has 7 nodes with 7 NVIDIA Tesla GPUs, with CUDA 4.0 installed. That version is probably outdated, but the cluster doesn’t seem to be used frequently.
Anyway, I don’t have root access, so I am currently trying to build some OpenCL programs with what is installed.

I am not used to working on a cluster at all, so I searched the internet, and something called “VirtualCL” may well be a way to make my programs run on it.
Still, before trying that, I would expect a basic program to work on a single node, am I right?
So I logged in to a node and found libOpenCL.so in /usr/lib64/nvidia and CL.h in /usr/include/cuda. lspci reports:

0c:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev a1)
11:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev a1)
12:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev a1)
83:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev a1)
84:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev a1)
87:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev a1)
88:00.0 3D controller: NVIDIA Corporation Tesla M2090 (rev a1)

Then I managed to build a simple program, and nothing unexpected happened until I tried to launch it: the first call to an OpenCL function failed, with “clGetPlatformIDs” returning -1001.
There are many threads about that error on the internet, but none of them seem to deal with a cluster, and it seems rather odd to me that it would stem from the CUDA setup, since someone has presumably used it successfully before me.
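For reference, -1001 is CL_PLATFORM_NOT_FOUND_KHR, the code the OpenCL ICD loader returns when it finds no usable vendor platform. The call can be reproduced without compiling anything. The following is only a minimal sketch: the helper names `describe` and `count_platforms` are made up for illustration, and it assumes Python with ctypes is available on the node and that libOpenCL can be located (either by the system search path or by passing a path such as the /usr/lib64/nvidia/libOpenCL.so mentioned above).

```python
import ctypes
import ctypes.util

# Status codes: 0 is CL_SUCCESS, -1001 comes from the cl_khr_icd extension.
KNOWN_CODES = {0: "CL_SUCCESS", -1001: "CL_PLATFORM_NOT_FOUND_KHR"}


def describe(code):
    """Translate a status code into a readable name (hypothetical helper)."""
    return KNOWN_CODES.get(code, "unrecognized status %d" % code)


def count_platforms(lib_path=None):
    """Call clGetPlatformIDs(0, NULL, &n).

    Returns (status, platform_count), or None when libOpenCL cannot be
    loaded at all (hypothetical helper, not part of any SDK).
    """
    path = lib_path or ctypes.util.find_library("OpenCL")
    if path is None:
        return None
    try:
        cl = ctypes.CDLL(path)
        func = cl.clGetPlatformIDs
    except (OSError, AttributeError):
        return None
    func.restype = ctypes.c_int32
    func.argtypes = [ctypes.c_uint32, ctypes.c_void_p,
                     ctypes.POINTER(ctypes.c_uint32)]
    count = ctypes.c_uint32(0)
    status = func(0, None, ctypes.byref(count))
    return status, count.value


if __name__ == "__main__":
    result = count_platforms()
    if result is None:
        print("libOpenCL could not be loaded")
    else:
        print("status: %s, platforms: %d" % (describe(result[0]), result[1]))
```

If this also reports CL_PLATFORM_NOT_FOUND_KHR, the problem lies in the driver or loader setup on the node rather than in the program being built.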

In fact, I am starting to think that I misunderstand how OpenCL is used on a cluster. Is there at least a way to check whether CUDA is working on it?

Robert_Crovella · March 1, 2014, 11:17pm

CUDA 4.0 is pretty old, but it did contain both CUDA SDK sample codes and OpenCL SDK sample codes. I would try to see if you can locate those sample codes. They should have Makefiles with them, so you should be able to compile them, although it may require root access. To do it without root access, you will need to see if you can copy the sample codes to your own user space, and fix up file/path references to compile from there.

If you can’t find the SDK anywhere, you can download the SDK from the NVIDIA archives:

https://developer.nvidia.com/cuda-toolkit-40

The SDK itself is not distro specific, so here is the direct download link:

http://developer.download.nvidia.com/compute/cuda/4_0/sdk/gpucomputingsdk_4.0.17_linux.run

You can download that file to your own user space, make it executable (chmod +x …) and run it.

After the install operation is complete, you will have to make the samples. Just change into the directory that they were unzipped to and type make (or even better, make -k).

After that you can browse around and try running some of the samples (both CUDA and OpenCL) to see if things are working. I would start with the deviceQuery sample.

Before doing any of the above, you might try running:

nvidia-smi -a

If you get any sort of error codes, you’re not going to have any luck on that cluster until the driver issue is sorted out, which might require root access. If the cluster was not set up properly, it may be necessary to run the nvidia-smi command as root, before the GPUs become “available” for ordinary users. The details of this are covered in a variety of places, including the driver release notes/FAQ.
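The nvidia-smi check can be scripted so that the exit code and output are captured in one place, for example to send to the cluster administrator. This is a rough sketch only: `run_check` is a hypothetical helper, and it assumes nvidia-smi is on the PATH of the node you are logged in to.

```python
import subprocess
import sys


def run_check(cmd):
    """Run a command and return (exit_code, combined_output).

    exit_code is None when the program cannot be started at all,
    for instance because it is not installed (hypothetical helper).
    """
    try:
        proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT,
                              universal_newlines=True)
    except OSError as exc:
        return None, "could not start %s: %s" % (cmd[0], exc)
    return proc.returncode, proc.stdout


if __name__ == "__main__":
    code, output = run_check(["nvidia-smi", "-a"])
    print(output)
    if code != 0:
        # A missing binary or a non-zero exit both point at the driver setup.
        print("nvidia-smi did not succeed (exit code: %s)" % code)
        sys.exit(1)
```

A non-zero exit code or a missing binary here suggests the driver problem has to be fixed before any CUDA or OpenCL sample will run.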

vacaloca · March 2, 2014, 5:36pm

I believe the error occurs because your distribution is missing the file that indicates where the OpenCL library lives. See:

https://github.com/MrMEEE/ironhide/issues/128
opencl - Error -1001 in clGetPlatformIDs Call ! - Stack Overflow
https://bugs.launchpad.net/darktable/+bug/1039684
c++ - getPlatformIDs() returns -1001 even though nvidia.icd exists and contains 'libcuda.so' - Stack Overflow

The fact that none of the solutions deal with a cluster is irrelevant in this case. The solution should be to point to the right OpenCL library.
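On Linux, the Khronos ICD loader conventionally looks for these files in /etc/OpenCL/vendors, where each .icd file holds the name or path of a vendor library. A quick way to see what is registered, and whether each library can actually be loaded, might look like the sketch below. The helper names `read_icd_entries` and `can_load` are made up for illustration, and the directory is an assumption that may differ on a customized install.

```python
import ctypes
import glob
import os


def read_icd_entries(vendor_dir="/etc/OpenCL/vendors"):
    """Map each .icd file in vendor_dir to the library it names."""
    entries = {}
    for path in sorted(glob.glob(os.path.join(vendor_dir, "*.icd"))):
        with open(path) as handle:
            entries[path] = handle.read().strip()
    return entries


def can_load(library):
    """Return True if the dynamic loader can open the named library."""
    try:
        ctypes.CDLL(library)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    found = read_icd_entries()
    if not found:
        print("no .icd files found; the loader has no vendor library to use")
    for icd_file, library in found.items():
        state = "loadable" if can_load(library) else "NOT loadable"
        print("%s -> %s (%s)" % (icd_file, library, state))
```

An empty result, or an entry whose library cannot be loaded, would both be consistent with clGetPlatformIDs returning -1001.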

antoinexp · March 23, 2014, 3:57pm

OK,

I couldn’t locate the sample files, but the CUDA version was updated a few days ago. However, I was still getting the same error!

I tried what I could from the threads dealing with the same error, but in the end it appeared that the NVIDIA drivers were not running or not installed properly, even though I was using a node with 7 NVIDIA Tesla GPUs that was supposed to have been tested.

So I asked for the drivers to be reinstalled, and apparently that fixed the issue.

Thank you txbob and vacaloca for your answers!
