Hi there.
My cluster administrator told me that we have a cluster with one CPU node and 4 GPU nodes (Tesla K20m).
When I query the number of OpenACC devices, I get 2 devices available per node.
I have been reading about multi-GPU programming: either MPI+OpenACC, or selecting the device with OpenACC alone.
My first question: how can I select one particular GPU if I set the device only with OpenACC, without MPI? (There are 8 devices available in the whole cluster.) A sketch of what I mean is below.
Is there any specific cluster configuration that would make all 8 devices available, instead of 2 per node?
And how should I exploit the 4 GPU nodes?
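For reference, this is the kind of selection I mean by "set the device only in OpenACC" (just a sketch using acc_get_num_devices / acc_set_device_num; the device index and the loop are placeholders, not code from my real application):

============== OpenACC-only device selection (sketch) ==============
#include <stdio.h>
#include <openacc.h>

int main(void)
{
    int ndev = acc_get_num_devices(acc_device_nvidia);

    /* pick one of the GPUs visible on this node (0 .. ndev-1) */
    int gpu = 1;                      /* placeholder choice */
    if (ndev > 0 && gpu >= ndev)
        gpu = 0;                      /* fall back if that device does not exist */
    acc_set_device_num(gpu, acc_device_nvidia);

    /* any following compute region should now run on that device */
    double sum = 0.0;
    #pragma acc parallel loop reduction(+:sum)
    for (int i = 0; i < 1000000; ++i)
        sum += (double)i;

    printf("running on device %d of %d, sum = %f\n",
           acc_get_device_num(acc_device_nvidia), ndev, sum);
    return 0;
}
=====================================================================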
I need some advice.
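For the MPI+OpenACC route, this is roughly how I understand the device binding would be done: one rank per GPU, each rank picking a GPU on its own node. Again only a sketch; the node-local rank via MPI_Comm_split_type and the rank-modulo-device binding are my own assumptions, not something I have tested:

============== MPI+OpenACC rank-to-GPU binding (sketch) ==============
#include <stdio.h>
#include <mpi.h>
#include <openacc.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* ranks that share a node get consecutive node-local ranks */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);

    /* bind this rank to one of the GPUs on its node */
    int ndev = acc_get_num_devices(acc_device_nvidia);
    if (ndev > 0)
        acc_set_device_num(local_rank % ndev, acc_device_nvidia);

    printf("world rank %d -> local GPU %d of %d on this node\n",
           world_rank, (ndev > 0) ? local_rank % ndev : -1, ndev);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
=====================================================================
This would then be launched with 2 ranks per node across the 4 GPU nodes (something like mpirun -np 8), if I understand correctly.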
Thanks in advance for your help.
============== source code to get num devices=======
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>

int main(void)
{
#ifdef _OPENACC
    int num_devices;
    acc_device_t my_device_type = acc_device_nvidia;

    acc_set_device_type(my_device_type);
    num_devices = acc_get_num_devices(my_device_type);
    fprintf(stderr, "number of device available: %d \n", num_devices);
#endif
    return 0;
}
===================
[azzulrd@jmercurio02 eicctest]$ ./a.out
number of device available: 2
(and the same on the other nodes)
pgaccelinfo output:
CUDA Driver Version: 9000
NVRM version: NVIDIA UNIX x86_64 Kernel Module 384.81 Sat Sep 2 02:43:11 PDT 2017
Device Number: 0
Device Name: Tesla K20m
Device Revision Number: 3.5
Global Memory Size: 4972937216
Number of Multiprocessors: 13
Number of SP Cores: 2496
Number of DP Cores: 832
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 705 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 2600 MHz
Memory Bus Width: 320 bits
L2 Cache Size: 1310720 bytes
Max Threads Per SMP: 2048
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
PGI Compiler Option: -ta=tesla:cc35
Device Number: 1
Device Name: Tesla K20m
Device Revision Number: 3.5
Global Memory Size: 4972937216
Number of Multiprocessors: 13
Number of SP Cores: 2496
Number of DP Cores: 832
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 705 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 2600 MHz
Memory Bus Width: 320 bits
L2 Cache Size: 1310720 bytes
Max Threads Per SMP: 2048
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
PGI Compiler Option: -ta=tesla:cc35
===========
pgcc --version
pgcc 17.10-0 64-bit target on x86-64 Linux -tp sandybridge
PGI Compilers and Tools
Copyright (c) 2017, NVIDIA CORPORATION. All rights reserved.
===========
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17