Change L1 cache size in Fermi

Hi,

I am using PGI workstation on a Fermi (C2050) box. In Fermi, the default memory configuration is 16 KB L1/48 KB shared memory, but it is also possible to change the configuration to 48 KB L1/16KB shared mem. How can it be done using PGI Fortran?

Hi,

Is there a support of Fermi architecture for PGI Fortran 10.9 ?

Is there a support of Fermi architecture for PGI Fortran 10.9 ?

Yes. Fermi support was first added to 10.6.

  • Mat

There is a function cudaFuncSetCacheConfig()
for change of L1 cache size for Fermi architecture.

Will this function support for Cuda Fortran ?

Hi Faustus,

You should be able to access any CUDA C function from CUDA Fortran. For CUDA C functions without a built-in CUDA Fortran interface, such as cudaFuncSetCacheConfig, you simply need to write an explicit interface to the function before calling it.

  • Mat

Dredging up an old thread…

Mat, et al,

I’m trying to try and use cudaFuncSetCacheConfig to see if it affects my code at all or not (on my way to maybe trying cudaThreadSetCacheConfig). So, I first tried whipping up an interface:

module extracuda

interface
   integer function cudafuncsetcacheconfig(func, cacheconfig) bind(c,name='cudaFuncSetCacheConfig')
      use iso_c_binding

      character(len=*) :: func
      integer :: cacheconfig
   end function cudafuncsetcacheconfig
end interface

end module extracuda

And then in my code I did:

      use extracuda
...
      write (*,*) "Got here!"

      istat = cudaFuncSetCacheConfig('soradcuf',2)

      call soradcuf<<<dimGrid, dimBlock>>>(...args...)
      istat = cudaGetLastError()
      if (istat /= 0) then
         write (*,*) "Kernel Call failed: ", cudaGetErrorString(istat)
         stop
      end if

where I’m pretty sure cudaFuncCachePreferL1 = 2 in CUDA enum speak.

I linked everything up with the usual -lcudart and -lcuda and I get:

> ./runsorad-cudafor-flxy-lessconstants-funccache-DPvDPorig.exe 
 blocksize:           256
 Current Device:             0

Device:Tesla M2070, 1147.0 MHz clock, 4096.0 MB memory.

 Iteration:             1
 eps:    1.0000000E-06
 Current Device:             0
 Got here!
 Kernel Call failed: 
 invalid device function                                                                                                         
Warning: ieee_underflow is signaling
Warning: ieee_inexact is signaling
FORTRAN STOP

Now, if I comment out my cudaFuncSetCacheConfig call, the code works just fine, so I broke something with the interface (as per usual when I try to do C interfacing).

Any help from the gurus out there of what I might have screwed up?

Hi Matt,

In 11.4 we added an interface to cudaFuncSetCacheConfig in the cudafor module. You can now call the routine directly.

Though, the error suggests that cudaFuncSetCacheConfig is getting called but is using an unknown kernel. While I don’t know details, in looking at our wrapper function it seems we manipulate the Fortran string into a integer array which is then passed to the CUDA C cudaFuncSetCacheConfig function.

Can you try using the CUDA Fortran module’s interface and see if that works around the issue?

Thanks,
Mat

I’ll let you know soon (the Fermi system is getting 11.5 installed on it as soon as the sysadmins have the time)!

Thanks,
Matt

Okay, the compiler recognizes the call, which is good, but it crashes in varying ways trying to use it. First off, the only difference between the working code and the non-working code with the cudaFuncSetCacheConfig call is:

write (*,*) "Got here!"

istat = cudaFuncSetCacheConfig(soradcuf,cudaFuncCachePreferL1)

Following the cudaFuncSetCacheConfig call is my kernel call to soradcuf:

call soradcuf<<<dimGrid, dimBlock>>> (...args...)
istat = cudaGetLastError()
if (istat /= 0) then
   write (*,*) "Kernel Call failed: ", cudaGetErrorString(istat)
   stop
end if

When I compile that (with no errors) and run it I get:

> ./runsorad-cudafor-flxy-lessconstants-funccache-DPvDPorig.exe 
 blocksize:           256
 Current Device:             0

Device:Tesla M2070, 1147.0 MHz clock, 4096.0 MB memory.

 Iteration:             1
 eps:    1.0000000E-06
 Current Device:             0
 Got here!
Segmentation fault (core dumped)

Hmm. This cored out before even getting to the error write under the kernel call. Maybe it needs strings and integers? Let’s try:

istat = cudaFuncSetCacheConfig('soradcuf',2)

which leads to:

> ./runsorad-cudafor-flxy-lessconstants-funccache-DPvDPorig.exe
 blocksize:           256
 Current Device:             0

Device:Tesla M2070, 1147.0 MHz clock, 4096.0 MB memory.

 Iteration:             1
 eps:    1.0000000E-06
 Current Device:             0
 Got here!
 Kernel Call failed: 
 invalid device function
Warning: ieee_underflow is signaling
Warning: ieee_inexact is signaling
FORTRAN STOP

Well, that died out trying to call the function and got to the error write and stop.

Any hints?

Matt

Hi Matt,

I’m not sure what’s wrong. I haven’t used this function much but my little test cases seem fine. Instead of me trying to figure out how to recreate the error, can you send me your tests?

Thanks,
Mat

A tarball is whizzing its way to you!

Thanks for the help,
Matt

Hi Matt,

For the seg fault, you’re suppose to be able to pass in a function pointer and the CUDA runtime makes the association, but I don’t think we’re do this correctly. I add TPR#17878 and sent it to our engineers.

For the ‘invalid device function’ error, this is because the function name is actually ‘soradmod_soradcuf’, not ‘soradcuf’ since it’s in a module. It seems to be happy if I use:

istat = cudaFuncSetCacheConfig(‘soradmod_soradcuf’,cudaFuncCachePreferL1)

Can you give this a try and see if it works for you?

  • Mat