Run-time error for multi-gpu programming with openmp (pgfort

Praveen_Kulkarni · May 15, 2013, 2:27pm

Hi All,
We are using pgfortran compiler version 12.9(64 bit) on windows system.

We are spawning 4 openmp threads and each openmp thread is supposed to run on its respective GPU. We are calling this omp parallel region over NSTEPS steps.

With larger NSTEPS value (NSTEPS =100000) the probability of program failing at run-time increases.

With smaller NSTEPS value (NSTEPS =100) the probability of program failing at run-time decreases and most of the times it successfully finishes execution.

The compiled exe keeps on giving different run-time errors as follows
0: DEV_MKDESC: copyin Memcpy FAILED:11(invalid argument)
0: DEV_MKDESC: allocate FAILED:30(unknown error)
0: DEV_MKDESC: allocate FAILED:30(unknown error)
0: ALLOCATE: 4000 bytes requested; status = 30(unknown error)
0: DEV_MKDESC: copyin Memcpy FAILED:11(invalid argument)

PFA the code,
We used following script for compile and execute.
set OMP_NUM_THREADS=4
pgfortran -Mcuda -mp test.CUF
test.exe

Thanks and regards,
Praveen

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!11
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!11  KERNEL
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!11

module m_Kernels
use cudafor
	real, device, allocatable :: VAR_1_D(:)
        
contains

ATTRIBUTES(GLOBAL) SUBROUTINE ProcessArray_Kernel_1(VAR_1_D)
        real , device :: VAR_1_D(:)
       
END SUBROUTINE ProcessArray_Kernel_1

end module m_Kernels

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!11
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!11  main
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!11

program main
        use omp_lib
        use cudafor
        use m_Kernels
        
        integer :: threadsPerBlock, numOfThreads,numOfBlocks, ierr, array_size, num_of_omp_threads, NSTEPS, omp_thread_id
        
        
        real , allocatable :: VAR_1(:)
        !$OMP THREADPRIVATE(VAR_1)
        
        ! openmp shared VAR_1iables
        num_of_omp_threads = 4
        NSTEPS = 100000
        array_size=1000
        
        
        ! setting number of threads
        CALL omp_set_num_threads(num_of_omp_threads)


        !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
        !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
        !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
        !running omp parallel region NSTEPS times
        DO NSTEP = 1, NSTEPS
		!omp parallel region that will be called NSTEPS times
		!$OMP Parallel Do &
		!$OMP SHARED( num_of_omp_threads, NSTEPS,array_size) &
		!$OMP PRIVATE( VAR_1_D, VAR_2_D, threadsPerBlock, numOfThreads,numOfBlocks) 
		DO omp_thread_id = 1, num_of_omp_threads
		
                   !setting device
                   ierr =  CUDASETDEVICE((omp_thread_id-1))
                   if ( ierr /= cudaSuccess ) then 
                   write (* ,*) cudaGetErrorString ( ierr )
                   else
                   write (* ,*) 'device was set to:', (omp_thread_id-1), 'nstep is:',nstep
                   end if 		    
                   
                   !allocating arrays
                   allocate(VAR_1_D(array_size))
                   allocate(VAR_1(array_size))
                   
                   !CPU to GPU copy
                   VAR_1=1.0
                   VAR_1_D = VAR_1
                   
                   !Kernel Call
                   threadsPerBlock = 512
                   numOfThreads = array_size 
                   numOfBlocks = CEILING(real(numOfThreads) / threadsPerBlock)
                   call ProcessArray_Kernel_1<<<numOfBlocks, threadsPerBlock>>>(VAR_1_D)
                   ierr=cudaThreadSynchronize()
                   if ( ierr /= cudaSuccess ) write (* ,*) cudaGetErrorString ( ierr )
                  
                   !deallocating arrays
                   deallocate(VAR_1_D)
                   deallocate(VAR_1)
                   
		END DO
		!$OMP END Parallel Do         
        END DO
        
end program

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!11
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!11
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!11

MatColgrove · May 15, 2013, 10:20pm

Hi Praveen,

I was able to recreate the errors here but unfortunately wasn’t able to determine the cause. I’ll need to pass this on to a compiler engineer to since what ever is wrong seems to occur in the run time libraries, either PGI or NVIDIA.

I filed this as TPR#19033.

Thanks,
Mat

Praveen_Kulkarni · May 16, 2013, 5:27am

Thanks Mat

Praveen_Kulkarni · May 16, 2013, 5:52am

Dear Mat,
Where can I check the bug you filed with id TPR#19033.
Sorry I am not aware where to check this.

MatColgrove · May 16, 2013, 5:12pm

Where can I check the bug you filed with id TPR#19033.

We don’t have a external view into our issue tracker (pgroup.com is on it’s own network that can’t access our internal network), so for updates, please either post on the UF or send a note to customer service. Customer service typically contacts end users once issues have been resolved and any fixed TPRs appear in the release notes.

Best Regards,
Mat

Praveen_Kulkarni · May 17, 2013, 6:34am

Dear Mat,
I have sent a note to customer service for follow-up.
Since this is bit critical for us, can you please let us know if there is a workaround or quick fix for this problem.
Keep us posted on this thread.

MatColgrove · May 17, 2013, 5:24pm

Hi Praveen,

I added a note and increased the priority. I’ll try and keep an eye on it for any updates. Note, I made a typo above. I logged this as TPR#19346, not 19033.

Mat

MatColgrove · May 17, 2013, 7:43pm

I’ve had some exchanges with engineering and we found a workaround. Put your kernel launch in a critical section.

!$OMP CRITICAL
                   call ProcessArray_Kernel_1<<<numOfBlocks, threadsPerBlock>>>(VAR_1_D)
!$OMP END CRITICAL
                   ierr=cudaThreadSynchronize()

Hope this helps,
Mat

Praveen_Kulkarni · May 20, 2013, 5:31am

Thanks Mat,
Workaround works. But as you know it slows down our performance. Will be waiting for the fix.

Praveen_Kulkarni · July 30, 2013, 1:42pm

Dear Mat,
Any updates on bug id TPR#19346

MatColgrove · August 1, 2013, 5:01pm

Hi Praveen,

Sorry for the delay. I sent a note to engineering to see where they are at on this. Last note I see in the TPR system was from June 20th where they sent Girish the following:

An update on the TPR from engineering

we made a couple of changes and got the code to run to completion.

First, we moved the declaration of VAR_1_D out of the module and into the main program. The user is passing it in, so we don’t know that it needs to be in a module.

It may be that we do some things at init time that are incompatible with multi-threaded, multi-device programming.

Second, we added ierr to the OMP PRIVATE list.

Let your customer know we are making progress.

I’ll let you know once I know more.

Mat

tull · January 25, 2014, 12:54am

Hello,

We thought we fixed this before, and had to reopen TPR 19346.
We believe the problem is now fixed in the 14.1 release.

thanks,
dave

Topic		Replies	Views
CUDA + OpenMP oddity - looks like a compiler bug. Legacy PGI Compilers	6	12217	April 12, 2010
Problem using CUDA Visual Profiler for CUDA Fortran Legacy PGI Compilers	1	5113	August 6, 2012
Problems on OpenMP and multi-GPU Legacy PGI Compilers	5	4556	August 15, 2012
OpenMP Multi-GPU, not getting speedup expected CUDA Programming and Performance	5	5905	July 15, 2011
optimization errors in cuda fortran Legacy PGI Compilers	2	3187	November 19, 2011
Problem declaring device types inside OpenMP parallel loop. Legacy PGI Compilers	4	8962	April 12, 2010
PGI ACC release 11.0: Multiple GPUs using openmp Legacy PGI Compilers	2	4111	January 4, 2011
OpenMP + CUDA Fortran issue Legacy PGI Compilers	1	3203	February 8, 2011
OpenMP compile error Legacy PGI Compilers	3	2760	December 30, 2010
Is it possible to use both OpenMP + CUDA in PGI fortran ? Legacy PGI Compilers	4	7738	December 18, 2010

Run-time error for multi-gpu programming with openmp (pgfort

An update on the TPR from engineering

Second, we added ierr to the OMP PRIVATE list.

Related topics