How to use my four GPU nodes

azzulrd · April 3, 2018, 5:50am

Hi there.

My cluster administrator told me that we have a cluster with one CPU node and 4 GPU nodes (Tesla K20m).

When I call acc_get_num_devices, I get 2 devices available per node.

I have been reading about multi-GPU programming, MPI+OpenACC, and selecting the device with OpenACC.

I have a question: how can I select one particular GPU if I set the device only through OpenACC, without MPI? (I have 8 devices available in the whole cluster.)

Is there any specific cluster configuration that would make all 8 devices available instead of 2 per node?

How should I exploit the 4 GPU nodes?
I need some advice.

Thanks in advance for your help.

============== source code to get num devices=======

#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>

int main()
{
#ifdef _OPENACC
    /* Ask the OpenACC runtime how many NVIDIA devices this node exposes. */
    int num_devices;
    acc_device_t my_device_type = acc_device_nvidia;

    acc_set_device_type(my_device_type);
    num_devices = acc_get_num_devices(my_device_type);
    fprintf(stderr, "number of device available: %d \n", num_devices);
#endif
    return 0;
}

===================
[azzulrd@jmercurio02 eicctest]$ ./a.out
number of device available: 2

(and on)

pgaccelinfo output:

CUDA Driver Version: 9000
NVRM version: NVIDIA UNIX x86_64 Kernel Module 384.81 Sat Sep 2 02:43:11 PDT 2017

Device Number: 0
Device Name: Tesla K20m
Device Revision Number: 3.5
Global Memory Size: 4972937216
Number of Multiprocessors: 13
Number of SP Cores: 2496
Number of DP Cores: 832
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 705 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 2600 MHz
Memory Bus Width: 320 bits
L2 Cache Size: 1310720 bytes
Max Threads Per SMP: 2048
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
PGI Compiler Option: -ta=tesla:cc35

Device Number: 1
Device Name: Tesla K20m
Device Revision Number: 3.5
Global Memory Size: 4972937216
Number of Multiprocessors: 13
Number of SP Cores: 2496
Number of DP Cores: 832
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 705 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 2600 MHz
Memory Bus Width: 320 bits
L2 Cache Size: 1310720 bytes
Max Threads Per SMP: 2048
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
PGI Compiler Option: -ta=tesla:cc35


pgcc --version

pgcc 17.10-0 64-bit target on x86-64 Linux -tp sandybridge
PGI Compilers and Tools
Copyright (c) 2017, NVIDIA CORPORATION. All rights reserved.

===========

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17

MatColgrove · April 3, 2018, 3:24pm

Hi azzulrd,

I have been reading about multi-GPU programming, MPI+OpenACC, and selecting the device with OpenACC.

MPI+OpenACC is the recommended (and easiest) method for multi-GPU programming. There are several online tutorials and talks if you need guidance on how to do this.

For example: http://on-demand.gputechconf.com/gtc/2017/presentation/S7546-jeff-larkin-multi-gpu-programming-with-openacc.pdf

Class #2: https://developer.nvidia.com/openacc-advanced-course

Alternatively, you can use OpenMP to create multiple CPU threads and then assign devices to each thread.

I have a question: how can I select one particular GPU if I set the device only through OpenACC, without MPI? (I have 8 devices available in the whole cluster.)

You would call the OpenACC API’s “acc_set_device_num” routine.

Note you would want to use “acc_set_device_num” in MPI as well so that you can have different ranks use different devices.

Alternatively, you can use the environment variables “ACC_DEVICE_NUM” or “CUDA_VISIBLE_DEVICES” to set the device each rank uses.
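
As an illustration of the rank-to-device assignment described above, here is a minimal sketch. It assumes each process knows its node-local rank and that devices are handed out round-robin. The helper names `device_for_rank` and `select_device_for_rank` are hypothetical; `acc_get_num_devices` and `acc_set_device_num` are the OpenACC runtime routines. The `#ifdef _OPENACC` guards only let the file build with a compiler that has no OpenACC support.

```c
#include <assert.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Hypothetical helper: map a node-local rank onto one of the devices
   visible on that node, round-robin. Returns -1 if no device is visible. */
int device_for_rank(int local_rank, int num_devices)
{
    assert(local_rank >= 0);
    if (num_devices <= 0)
        return -1;
    return local_rank % num_devices;
}

/* Hypothetical helper: bind the calling process to its device.
   Returns the device number chosen, or -1 when no GPU is available. */
int select_device_for_rank(int local_rank)
{
    int num_devices = 0;
#ifdef _OPENACC
    num_devices = acc_get_num_devices(acc_device_nvidia);
#endif
    int dev = device_for_rank(local_rank, num_devices);
#ifdef _OPENACC
    if (dev >= 0)
        acc_set_device_num(dev, acc_device_nvidia);
#endif
    return dev;
}
```

With two K20m devices per node, local ranks 0, 1, 2, 3 would land on devices 0, 1, 0, 1. How the local rank is obtained depends on the MPI launcher (Open MPI, for example, exports `OMPI_COMM_WORLD_LOCAL_RANK`), so treat that part as site-specific.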

Is there any specific cluster configuration that would make all 8 devices available instead of 2 per node?

Each cluster management tool has its own way of partitioning nodes, so you’ll need to ask your system admin.

How should I exploit the 4 GPU nodes?

I highly recommend you use MPI+OpenACC. It’s the easiest way to program multiple GPUs and allows your code to scale across multiple nodes.

-Mat

azzulrd · April 4, 2018, 5:30am

Hi Mat,

Thanks for answering so quickly.

I have my OpenACC executable.

But I am a little confused:
how should I run it, and where: on my CPU node or on a GPU node?

If I run it on the CPU node, I get this error:

[azzulrd@mercurio eicctest]$ pgcc -ta=tesla:cc35,time -acc -Minfo=accel -fast -Msafeptr -o dos dos.c -c99

[azzulrd@mercurio eicctest]$ ./dos
Current file: /home/azzulrd/eicctest/dos.c
function: ccoordinate
line: 172
Current region was compiled for:
NVIDIA Tesla GPU sm30 sm35
Available accelerators:
device[1]: Native X86
The accelerator does not match the profile for which this program was compiled

Also, when I call acc_set_device_num on the CPU node, I get 0:

number of device available: 0

That is the reason for my question about the cluster configuration.

I should say that I am fairly new to OpenACC.

Thanks in advance for your help.

MatColgrove · April 4, 2018, 3:16pm

In this case, you’re trying to run a binary built to target a cc35 device on a CPU node. This won’t work, since a GPU binary can’t run on a CPU.

What you can do is build a unified binary targeting both the CPU and the GPU by compiling with “-ta=multicore,tesla:cc35”. When run on a system with a cc35 device, the code will run on the GPU. When run on a system without a GPU, the code will run in parallel across the system’s CPU cores.

You can also use “-ta=host,tesla:cc35”, in which case the CPU version will run sequentially.

Note that the “time” option is no longer needed. Instead, set the environment variable “PGI_ACC_TIME=1” when you want the simple profile.
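
To check which target a unified binary actually picked at run time, something like the following sketch could be used. It is an illustration under assumptions, not part of the original reply: the helper names `sum_to` and `report_target` are hypothetical, and the non-OpenACC branch exists only so the file builds with any C compiler.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Hypothetical helper: sum 0..n-1 in an OpenACC parallel loop.
   The same source runs on the GPU or on the CPU, depending on the
   target the binary selects at run time. */
long sum_to(int n)
{
    long total = 0;
    #pragma acc parallel loop reduction(+:total)
    for (int i = 0; i < n; ++i)
        total += i;
    return total;
}

/* Hypothetical helper: run the loop, report the selected target on
   stderr, and sanity-check the result. Returns 1 on success. */
int report_target(void)
{
    long got = sum_to(1000);
#ifdef _OPENACC
    /* Query after the first compute region, once a device is selected. */
    acc_device_t t = acc_get_device_type();
    fprintf(stderr, "running on %s\n",
            t == acc_device_nvidia ? "an NVIDIA GPU"
                                   : "the host or another target");
#else
    fprintf(stderr, "compiled without OpenACC\n");
#endif
    assert(got == 499500L);   /* 0 + 1 + ... + 999 */
    return got == 499500L;
}
```

Built with the unified-binary flags mentioned above, the same executable should report the GPU on a GPU node and fall back to the CPU elsewhere; the exact device type reported for the CPU target may vary by compiler version.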

Hope this helps,
Mat

azzulrd · April 5, 2018, 7:20am

Hi Mat,

Thanks for your fast answer.

I followed your recommendations.

What you can do is build a unified binary targeting both the CPU and the GPU by compiling with “-ta=multicore,tesla:cc35”. When run on a system with a cc35 device, the code will run on the GPU. When run on a system without a GPU, the code will run in parallel across the system’s CPU cores (and “PGI_ACC_TIME=1”).

Finally I could run my code on a GPU, though only on one device. I also ran my code on a CPU.

On the other hand, maybe this is a misunderstanding on my part. I want to see 8 devices available when I call acc_get_num_devices.

Is that possible?

For example, I want to assign a loop to specific devices (3, 4, 7, 6). How can I do that?

Until now, I have just connected to the GPU node and run there.
I guess that if I run on the CPU node, it should send the data to the GPU. How does this work?

I know that I am doing or understanding something wrong, but I cannot see what it is.

Thank you.

MatColgrove · April 5, 2018, 3:50pm

Maybe this is a misunderstanding on my part. I want to see 8 devices available when I call acc_get_num_devices.
Is that possible?

Yes, assuming the devices are visible. It sounds like either you’re mistaken about the number of devices on the system, or your cluster management tool is partitioning the system so that you only have 1 or 2 devices available. Again, you’ll need to talk with your system admin to see which is true. Running “pgaccelinfo” or “nvidia-smi” will show you how many devices are available.

For example, I want to assign a loop to specific devices (3, 4, 7, 6). How can I do that?

Same answer as above. Use “acc_set_device_num” within the program to set the device, or use the environment variable “ACC_DEVICE_NUM” per MPI process to assign different ranks to different devices.
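
A minimal sketch of the in-program variant, assuming a single process that hands each device visible on its own node one contiguous chunk of a loop. The helper names `chunk_begin` and `scale_on_all_devices` are hypothetical. Note that `acc_set_device_num` can only address devices visible to the calling process, so devices on other nodes still require one process per node (for example via MPI).

```c
#include <assert.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Hypothetical helper: first index of the chunk owned by device `dev`
   when n iterations are split as evenly as possible over ndev devices.
   Passing dev == ndev yields n, the end of the last chunk. */
int chunk_begin(int n, int ndev, int dev)
{
    assert(n >= 0 && ndev > 0 && dev >= 0 && dev <= ndev);
    int base = n / ndev;
    int extra = n % ndev;
    return dev * base + (dev < extra ? dev : extra);
}

/* Double every element of x, giving each visible device one chunk.
   Without OpenACC, or without a GPU, one "device" does all the work. */
void scale_on_all_devices(double *x, int n)
{
    int ndev = 1;
#ifdef _OPENACC
    int gpus = acc_get_num_devices(acc_device_nvidia);
    if (gpus > 0)
        ndev = gpus;
#endif
    for (int dev = 0; dev < ndev; ++dev) {
        int lo = chunk_begin(n, ndev, dev);
        int hi = chunk_begin(n, ndev, dev + 1);
        if (hi == lo)
            continue;                 /* nothing for this device */
#ifdef _OPENACC
        if (gpus > 0)
            acc_set_device_num(dev, acc_device_nvidia);
#endif
        #pragma acc parallel loop copy(x[lo:hi-lo])
        for (int i = lo; i < hi; ++i)
            x[i] *= 2.0;
    }
}
```

As written, the devices are used one after another, because each compute region completes before the next starts. Overlapping them would need asynchronous regions or one CPU thread per device, which this sketch leaves out.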

I guess that if I run on the CPU node, it should send the data to the GPU. How does this work?

With OpenACC data directives.
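
As a small illustration of what a data directive looks like (a sketch, not taken from the original reply; the function name `saxpy` is just an example), the region below copies `x` to the device on entry and copies `y` in on entry and back out on exit:

```c
#include <assert.h>

/* Example: y = a*x + y with explicit data movement.
   copyin(x) : host -> device when the data region starts
   copy(y)   : host -> device at the start, device -> host at the end */
void saxpy(int n, float a, const float *x, float *y)
{
    assert(n >= 0);
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }
}
```

The host program always starts on a CPU of the node it is launched on; the directives move the arrays to a GPU attached to that same node, not to a GPU on another node.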

I know that I am doing or understanding something wrong, but I cannot see what it is.

I’m thinking that you would benefit from reviewing some of the OpenACC tutorials and/or online classes. Data management is a bit too big of a topic to cover in a UF post. After watching the training, if you have a specific question, I’ll be happy to help.

See: https://www.openacc.org/resources

-Mat

azzulrd · April 21, 2018, 9:46pm

Hi Mat,

I finally found the problem: there is no GPU cluster, just four GPU nodes working independently and connected in a strange way.

I really appreciate your help.
