Compiler hanging with CUDA Fortran

I’m developing a CUDA Fortran version of a computationally expensive portion of a research code. It’s a large MPI code that uses several external libraries.
I’ve been successful with the first kernel (it compiles and runs). However, after adding a second kernel, the compiler hangs on the file containing the CUDA Fortran code (kernels, declaration and allocation of device arrays, and wrapper routines to call the kernels). If I remove the code for the second kernel, the compiler completes and generates code as expected.
The code is arranged as follows:

module CudaStuff

  ! declarations of device variables

contains

  ! kernel 1

  ! kernel 2

end module CudaStuff

In the same file, but outside the module, are routines that allocate the device variables, call CUDA initialization, and wrappers that invoke the kernels.

Has anyone had similar issues with the compiler hanging, and is there a workaround?
Thanks,
Karen

Hi Karen,

Can you please send an example of the code to PGI Customer Service (trs@pgroup.com)? Multiple kernels can be included in a single module so there is something specific about your code that’s causing the problem. Does it still occur if you remove kernel 1?

Thanks,
Mat

Thanks Mat,
The compiler does not hang when I have just kernel 2 (a simplified version). I fixed a few minor things and will try with both kernels again. Meanwhile, I’m checking with my research collaborator, the code’s developer, before sending the code to customer service.
Karen

Hi Mat,
I’ve created a toy code based on matrix multiplication that exhibits the problem I’m seeing, and I’ve sent it to trs. The problem seems to be related to using blockdim%x in the array declaration of a kernel parameter, e.g. something like D below.

attributes(global) subroutine mmul_kernel2( A, B, C, D, N, M, L )
  real :: A(N,M), B(M,L), C(N,L)
  real :: D(blockdim%x,blockdim%x)   ! blockdim%x used in a dummy-array bound
  integer, value :: N, M, L
  ...
end subroutine mmul_kernel2
Is the use of blockdim or other predefined variables in this way illegal?

Karen

Hi Karen,

Is the use of blockdim or other predefined variables in this way illegal?

It is legal and shouldn’t cause the compiler to hang. I’ll ask Customer Support to send a report to engineering.

Though I’m wondering if this is really what you want. The code as written will write beyond the end of the D array when you have more than one block.

In 11.6 we added support for dynamic shared memory. Is this what you’re looking for: a way to dynamically size Asub and Bsub to match your block dimensions?

Mat
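
[Editor’s note: for reference, dynamic shared memory in CUDA Fortran (11.6 and later) is declared as an assumed-size shared array inside the kernel, with the byte count supplied as the optional third argument in the launch chevrons. A minimal sketch, with illustrative names; tile_kernel, A_d, grid, and tBlock are not from the thread:]

attributes(global) subroutine tile_kernel( A, N )
  real :: A(N)
  integer, value :: N
  real, shared :: tile(*)          ! extent set by the launch configuration
  integer :: i
  i = (blockidx%x-1)*blockdim%x + threadidx%x
  if (i <= N) tile(threadidx%x) = A(i)
  call syncthreads()               ! outside the guard so all threads reach it
  if (i <= N) A(i) = 2.0 * tile(threadidx%x)
end subroutine tile_kernel

! host side: reserve blockdim%x default reals (4 bytes each) per block
call tile_kernel<<<grid, tBlock, 4*tBlock%x>>>( A_d, N )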

Hi Mat,
I know this code has some logical errors. I put it together quickly to provide a toy example I could share that exhibits the compiler bug I was seeing.
So no, this is not the program I am trying to compile; it is just to reproduce the problem for the compiler folks.

Thanks for the information regarding 11.6. I’ll contact our system administrators about whether we can upgrade to a newer version. Dynamic allocation within a kernel might be helpful, but most of the arrays I’m dimensioning this way will reside in global memory and be passed as parameters to a kernel, since they may be generated in one kernel and used in another.
Karen
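
[Editor’s note: a common way to express that last pattern in CUDA Fortran is to declare the arrays with the device attribute at module scope, so that one kernel can fill them and another can read them without passing them as arguments. A rough sketch with hypothetical names (work_d, produce, consume are not from the thread):]

module CudaStuff
  use cudafor
  real, device, allocatable :: work_d(:)   ! visible to all kernels in the module
contains
  attributes(global) subroutine produce( n )
    integer, value :: n
    integer :: i
    i = (blockidx%x-1)*blockdim%x + threadidx%x
    if (i <= n) work_d(i) = real(i)        ! generated in one kernel...
  end subroutine produce

  attributes(global) subroutine consume( out, n )
    real :: out(n)
    integer, value :: n
    integer :: i
    i = (blockidx%x-1)*blockdim%x + threadidx%x
    if (i <= n) out(i) = work_d(i) + 1.0   ! ...used in another
  end subroutine consume
end module CudaStuff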