OpenMP, OpenACC and acc_set_device_num

Hi!

I’m doing some tests with OpenACC combined with OpenMP in order to use multiple GPU devices, but I get a problem during the execution of the program:

        #pragma omp parallel num_threads(2)
        {
                int th = omp_get_thread_num();
#ifdef _OPENACC
                /* Bind each OpenMP thread to its own GPU. */
                acc_set_device_num(th, acc_device_nvidia);
#endif
                fprintf(stdout, "THREAD(%d) - Launched thread.\n", th);
                fprintf(stdout, "THREAD(%d) - Device selected: %d\n",
                        th, acc_get_device_num(acc_device_nvidia));
        }

And the result is:

THREAD(0) - Launched thread.
THREAD(0) - Device selected: 0
THREAD(1) - Launched thread.
THREAD(1) - Device selected: 0

It seems to me that ‘acc_set_device_num’ is not working; the program always runs on the device given by my ACC_DEVICE_NUM environment variable.

Hi Neldan,

I’m not sure since your code seems to work fine for me:

THREAD(0) - Launched thread.
THREAD(1) - Launched thread.
THREAD(0) - Device selected: 0
THREAD(1) - Device selected: 1

What’s the output from the command “pgaccelinfo”? What compiler version are you using?

  • Mat

PGI Release 12.4-0

CUDA Driver Version: 5000
NVRM version: NVIDIA UNIX x86_64 Kernel Module 304.54 Sat Sep 29 00:05:49 PDT 2012

Device Number: 0
Device Name: GeForce GTX 580
Device Revision Number: 2.0
Global Memory Size: 1610285056
Number of Multiprocessors: 16
Number of Cores: 512
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 32768
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 65535 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1544 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 2004 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 786432 bytes
Max Threads Per SMP: 1536
Async Engines: 1
Unified Addressing: Yes
Initialization time: 310428 microseconds
Current free memory: 1542184960
Upload time (4MB): 2385 microseconds ( 732 ms pinned)
Download time: 1530 microseconds ( 694 ms pinned)
Upload bandwidth: 1758 MB/sec (5729 MB/sec pinned)
Download bandwidth: 2741 MB/sec (6043 MB/sec pinned)

Device Number: 1
Device Name: Tesla C2075
Device Revision Number: 2.0
Global Memory Size: 5636554752
Number of Multiprocessors: 14
Number of Cores: 448
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 32768
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 65535 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1147 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 1566 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 786432 bytes
Max Threads Per SMP: 1536
Async Engines: 2
Unified Addressing: Yes
Initialization time: 310428 microseconds
Current free memory: 5570056192
Upload time (4MB): 2278 microseconds ( 713 ms pinned)
Download time: 1428 microseconds ( 697 ms pinned)
Upload bandwidth: 1841 MB/sec (5882 MB/sec pinned)
Download bandwidth: 2937 MB/sec (6017 MB/sec pinned)

Device Number: 2
Device Name: Tesla C2075
Device Revision Number: 2.0
Global Memory Size: 5636554752
Number of Multiprocessors: 14
Number of Cores: 448
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 32768
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 65535 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1147 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 1566 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 786432 bytes
Max Threads Per SMP: 1536
Async Engines: 2
Unified Addressing: Yes
Initialization time: 310428 microseconds
Current free memory: 5570027520
Upload time (4MB): 1860 microseconds ( 899 ms pinned)
Download time: 1323 microseconds (1040 ms pinned)
Upload bandwidth: 2255 MB/sec (4665 MB/sec pinned)
Download bandwidth: 3170 MB/sec (4032 MB/sec pinned)

Device Number: 3
Device Name: GeForce GTX 460
Device Revision Number: 2.1
Global Memory Size: 1073414144
Number of Multiprocessors: 7
Number of Cores: 224
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 32768
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 65535 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1350 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 1800 MHz
Memory Bus Width: 256 bits
L2 Cache Size: 524288 bytes
Max Threads Per SMP: 1536
Async Engines: 1
Unified Addressing: Yes
Initialization time: 310428 microseconds
Current free memory: 1039273984
Upload time (4MB): 1500 microseconds ( 722 ms pinned)
Download time: 1294 microseconds ( 695 ms pinned)
Upload bandwidth: 2796 MB/sec (5809 MB/sec pinned)
Download bandwidth: 3241 MB/sec (6034 MB/sec pinned)

Any ideas about what I could be doing wrong?

I’m not sure. Can you post a complete example, or send one to PGI Customer Service (trs@pgroup.com)?

Thanks,
Mat

I just updated to the newest version of pgcc, and now it seems to work, but I’m still having a problem with the execution.

During execution the program prints an “Invalid handle” error.

My code is this:

        int sizeR = numRows1*numRows2;

        #pragma omp parallel num_threads(2) private(result)
        {
                int th = omp_get_thread_num();
#ifdef _OPENACC
                acc_init(acc_device_nvidia);
                acc_set_device_num(th+1, acc_device_nvidia);
#endif
                fprintf(stdout, "THREAD(%d) - Launched thread.\n", th);
                fprintf(stdout, "THREAD(%d) - Selected device: %d\n",
                        th, acc_get_device_num(acc_device_nvidia));

                /* Split the row range in half: thread 0 gets [0, numRows1/2),
                   thread 1 gets [numRows1/2, numRows1). */
                int bI = th*(numRows1/2);
                int eI = numRows1/((!th)+1);
                fprintf(stdout, "THREAD(%d) - begin I: %d, end I: %d\n", th, bI, eI);

                /* The same split for the result buffer. */
                int bR = th*(sizeR/2);
                int eR = sizeR/((!th)+1);
                fprintf(stdout, "THREAD(%d) - size R: %d, begin R: %d, end R: %d\n",
                        th, sizeR, bR, eR);

                /* Shift the (private) result pointer to this thread's half. */
                result = &result[bR];

                #pragma acc kernels copyin(m1[0:numRows1*numColumns1], m2[0:numRows2*numColumns2]) copyout(result[0:eR-bR])
                {
                        int i;
                        #pragma acc loop gang vector(256) independent
                        for (i = bI; i < eI; i++)    /* this thread's rows only */
                        {
                                int j;
                                #pragma acc loop gang vector(2) independent
                                for (j = 0; j < numRows2; j++)
                                {
                                        real_t acum = 0;
                                        int k;
                                        for (k = 0; k < numColumns1; k++) {
                                                acum += m1[i+k*numColumns1] * m2[j*numColumns2+k];
                                        }
                                        result[(i-bI)*numRows1+j] = acum;
                                }
                        }
                }
        }

I use 5000x5000 matrices, and the output is this:

THREAD(0) - Launched thread.
THREAD(0) - Selected device: 1
THREAD(0) - begin I: 0, end I: 50
THREAD(0) - size R: 10000, begin R: 0, end R: 5000
THREAD(1) - Launched thread.
THREAD(1) - Selected device: 2
THREAD(1) - begin I: 50, end I: 100
THREAD(1) - size R: 10000, begin R: 5000, end R: 10000
call to cuLaunchKernel returned error 400: Invalid handle
call to cuMemFree returned error 700: Launch failed

Hi Neldan,

Unfortunately, all this tells me is that the kernel failed for some reason. To narrow down the issue, can you try running with a single OpenMP thread? Also, try removing the schedule clauses, i.e., the gang and vector clauses, and let the compiler schedule the loops.
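
That is, the same region with the schedule clauses dropped, along these lines:

        #pragma acc kernels copyin(m1[0:numRows1*numColumns1], m2[0:numRows2*numColumns2]) copyout(result[0:eR-bR])
        {
                int i;
                #pragma acc loop independent
                for (i = bI; i < eI; i++)
                {
                        int j;
                        #pragma acc loop independent
                        for (j = 0; j < numRows2; j++)
                        {
                                real_t acum = 0;
                                int k;
                                for (k = 0; k < numColumns1; k++) {
                                        acum += m1[i+k*numColumns1] * m2[j*numColumns2+k];
                                }
                                result[(i-bI)*numRows1+j] = acum;
                        }
                }
        }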

  • Mat

With a single OpenMP thread the kernel works fine.

I have been doing some tests using ‘fork’ instead of OpenMP, and it works fine, so I think the problem is in the kernel launch from OpenMP.
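
Roughly, the fork version looks like this (a simplified sketch; each process binds its device and then runs its half of the work where the comment is):

        #include <stdio.h>
        #include <unistd.h>
        #include <sys/wait.h>
        #include <openacc.h>

        static void run_half(int th)
        {
                /* Same 1-based device numbering as in the OpenMP version. */
                acc_set_device_num(th + 1, acc_device_nvidia);
                fprintf(stdout, "PROC(%d) - Selected device: %d\n",
                        th, acc_get_device_num(acc_device_nvidia));
                /* ... run the same kernels region as above on this half ... */
        }

        int main(void)
        {
                pid_t pid = fork();
                if (pid == 0) {
                        run_half(0);   /* child takes the first half */
                        return 0;
                }
                run_half(1);           /* parent takes the second half */
                wait(NULL);
                return 0;
        }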

OK. Can you send a reproducible example to PGI Customer Service (trs@pgroup.com) so we can determine the issue?

Thanks,
Mat

With the 13.3 release I don’t get the same error, but now I get another:

call to cuMemcpyDtoHAsync returned error 1: Invalid value
call to cuMemcpyHtoDAsync returned error 4: Deinitialized

Hi Neldan,

Try setting the environment variable “PGI_ACC_SYNCHRONOUS=1”.

There’s still a problem with coordinating asynchronous streams in a parallel context even if the “async” clause is not specifically used.
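
(For example, export PGI_ACC_SYNCHRONOUS=1 in the shell before the run; that should force the runtime to perform the transfers and launches synchronously as a workaround.)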

  • Mat

It works! Thank you! xD