cudaSetDevice seems completely broken

I’ve been programming in CUDA C for several years, and I’ve just been given the task of porting a large Fortran program to CUDA, so I decided to give CUDA Fortran a go. However, I have pretty much instantly hit a rather large roadblock.

As far as I can tell, cudaSetDevice becomes inoperable as soon as I link a module which uses device memory. I don’t think I’m doing anything particularly strange - just linking a module which never even gets used. I’m really struggling to understand how anybody could have successfully done this.

test.f90:

program test
      use cudafor
      implicit none

      integer istat
      istat = cudaSetDevice(0)
      print *,cudaGetErrorString(iStat)
end program test

module.cuf:

module breaker
      integer, device :: itWillNotWork

contains
      attributes(global) subroutine EMPTY

      end subroutine EMPTY
end module breaker

And the output:

$ pgfortran -o test -Mcuda test.f90 module.cuf
$ ./test
 setting the device when a process is active is not allowed
$ pgfortran -o test -Mcuda test.f90
$ ./test
 no error

On a slightly unrelated note, I’m finding the memory management and scoping to be really quite restrictive. In CUDA C it is possible to allocate device memory at any point in the program and use it at any other point. It seems in Fortran I have to have all my CUDA memory and all my CUDA kernels contained within one module, which clashes badly with how the code is currently structured. I should point out it’s inherited and I’m not a native Fortran programmer, but the current structure seems sane… except that it is completely incompatible with how CUDA Fortran wants my modules to be organised.

Hi Jeremy,

Thank you for bringing this problem to our attention. The problem is that in order to support data initialization of module device variables, allocation of all module device variables needs to occur during the program’s initialization phase at start-up. As you have discovered, this prevents a user from being able to change the device. I have sent a report to our engineers (TPR#16895) and they agree that this is a serious problem. We will give it our highest priority.


In CUDA C it is possible to allocate device memory at any point in the program and use it at any other point.

I don’t believe this is entirely correct. It’s my understanding that CUDA C only allows you to do this within the same file scope. Granted, you can include all your source files into a single file. Fortran doesn’t have file scoping. The closest approximation to file scope is module scope.

Unfortunately, there isn’t a linker for device code. Hence, there is no way to associate external device symbols. This is why you cannot call device functions contained in different modules or directly access device data from other modules.

Device data managed by the host code can be passed to any CUDA Fortran or CUDA C global routine. Can you rearrange the code so that the shared device data is managed by the host?
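
A minimal sketch of what such a rearrangement might look like, assuming a single .cuf file compiled with pgfortran -Mcuda. The names kernels, addOne and hostManaged are hypothetical and only for illustration. The device array is declared and allocated in host code, then passed to the kernel as an ordinary argument, so the data does not have to live in the same module as the kernel:

```fortran
! Hypothetical sketch: module, kernel and program names are illustrative only.
module kernels
      use cudafor
contains
      ! The kernel receives the device array as a dummy argument, so the
      ! data itself does not have to be declared in this module.
      attributes(global) subroutine addOne(a, n)
            integer, value :: n
            real :: a(n)
            integer :: i
            ! One thread per element; thread and block indices are 1-based.
            i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
            if (i <= n) a(i) = a(i) + 1.0
      end subroutine addOne
end module kernels

program hostManaged
      use cudafor
      use kernels
      implicit none
      integer, parameter :: n = 256
      real :: h(n)
      real, device, allocatable :: d(:)   ! device data owned by host code

      h = 0.0
      allocate(d(n))
      d = h                               ! host-to-device copy
      call addOne<<<(n + 63) / 64, 64>>>(d, n)
      h = d                               ! device-to-host copy
      print *, h(1), h(n)
      deallocate(d)
end program hostManaged
```

The same device array could be handed to kernels in other modules in the same way, since each kernel only sees it as an argument.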

  • Mat

Thanks. I think I must have got confused somewhere with respect to moving device data around. I’m not quite sure why I thought it worked like I thought it worked. I think I understand now though!

Another thing that is currently annoying me is that there seems to be an odd glitch when passing scalar variables by reference from a device subprogram with attributes global to one with attributes device. For some reason the variable on the global side needs to be declared with a double colon, otherwise you get a type mismatch error.

module test
contains
        attributes(device) subroutine dTest(a)
                real a !or real :: a
                a = a + 1
        end subroutine


        attributes(global) subroutine gTEST
                ! Comment as required
                !real :: a ! Works
                real    a ! Fails
                call dTEST(a)
        end subroutine
end module



$ pgfortran -c refdevpass.cuf
PGF90-S-0188-Argument number 1 to dtest: type mismatch (refdevpass.cuf: 12)
  0 inform,   0 warnings,   1 severes, 0 fatal for gtest

As previously mentioned I’m not as fluent in Fortran as I perhaps could be but I was under the impression that the double colon wasn’t necessary in such a case. I have some host code that compiles fine without the double colon.

Hi Jeremy,

Thanks, you found another one. Yes, we should still be supporting the F77 syntax. I’ve sent a report to our engineers (TPR#16898). We’ve missed the cut-off to get fixes into the May release (10.5), but hopefully we can have it fixed in 10.6 (June).

We definitely appreciate the feedback, so if you have any other things that don’t seem quite right, please let us know.

  • Mat

Just another small one - the -C compiler flag seems to cause:

PGF90-F-0000-Internal compiler error. unsupported procedure

when compiling a file containing CUDA kernels. I tested it on the matrix multiply example.

If I notice any more should I just tack them on to this thread or should I make new threads?

Hi Jeremy,

I added TPR#16922 for the bounds-checking ICE.

If I notice any more should I just tack them on to this thread or should I make new threads?

I would prefer a new post per issue to make it easier for other users to follow and find similar issues. However, if you prefer a single thread to make it easier for you to follow, then this is fine as well.

  • Mat

Hi,

We just purchased a machine with an NVIDIA Quadro NVS 295 and two Tesla C2050 cards. Does TPR#16895 mean that I can only use the Quadro GPU and that I cannot address the other two GPUs?

Claude Knaus
Computer Graphics Group
University of Bern

Hi cknaus

Does TPR#16895 mean that I can only use the Quadro GPU and that I cannot address the other two GPUs?

Unfortunately, in CUDA Fortran you can currently only use the first GPU (note the PGI Accelerator Model can use all devices). This is a high-priority issue for our engineers and they are actively working on the problem. However, due to the complexity of the problem, we don’t expect a fix to be available until August or September of 2010.

With apologies,
Mat

Hi Mat,

Thanks for the prompt reply. Can you clarify “first GPU”: if I remove the Quadro and replace it with a non-CUDA capable graphics card, will I then be able to use one of the installed Tesla cards with CUDA Fortran?

Cheers,

Claude Knaus
Computer Graphics Group
University of Bern

This may not be true, otherwise my multi-GPU CUDA Fortran program and my unit test with device selection would be failing to choose devices in my Tesla S. I think you can “fix” this by issuing a cudaThreadExit() call.

Namely, in re the original program in this thread, if I run it, I get:

> pgfortran -o test -Mcuda test.f90 module.cuf
test.f90:
module.cuf:
> ./test
setting the device when a process is active is not allowed

Now we add a cudaThreadExit:
testwithexit.f90:

program test
      use cudafor
      implicit none

      integer istat
      
      ! Add this command
      istat = cudaThreadExit()
      print *,cudaGetErrorString(iStat)

      istat = cudaSetDevice(0)
      print *,cudaGetErrorString(iStat)
end program test

And:

> pgfortran -o testwithexit -Mcuda testwithexit.f90 module.cuf
testwithexit.f90:
module.cuf:
> ./testwithexit 
 no error

 no error

My guess is that the cudaThreadExit releases the context that occurs when…well…I don’t know. Maybe when PGI first does an initialization of the GPUs? I tried a whole bunch of stuff–most of which resulted in spectacular crashes–when I first encountered this.

Whatever, though, a single cudaThreadExit right before you do anything seems to clear up the lock and then you can device select to your heart’s content.

ETA: Hmm. I just realized that maybe this won’t work if the machine has an actual video card. The machine I work on is just a boring ol’ server with integrated graphics. Perhaps having a real video card does something…

FYI, I just verified that TPR#16895 will be fixed in the 10.8 release. This will allow users to again select at runtime the device to use.

  • Mat

Jeremy,

We closed this as of release 10.8.

regards,
dave

Jeremy,

TPR 16898 - CUDA: Type mismatch error when declaring device variables without “::”
has been corrected in the current 11.0 release.

thanks again for the report.

dave