Using -i8 flag with CUDA Fortran code

Hi there,

I have developed a CUDA Fortran code and I’m now trying to use it with very large datasets.

The code has been working swimmingly up until now, but to test it with large datasets I need integer(8) variables.

To do this I tried compiling with -i8 but my CUDA Fortran module doesn’t like this and fails to compile with the error:

PGF90-F-0000-Internal compiler error. Unexpected runtime function call       0 (scf.f: 3317)
PGF90/x86-64 Linux 12.10-0: compilation aborted

The module scf.F, which is failing to compile, consists of host code followed by several global routines. The line number it is failing on, 3317, is the end of the very first global (device) subroutine. The routine uses only supported intrinsics and variables that have been explicitly transferred to the device.

This is definitely a problem with -i8 as it compiles perfectly without it.

Do I need to modify anything in my CUDA code to be able to use -i8?

Any help/suggestions more than welcome.

Cheers,
Crip_crop

Hi Crip_crop,

The only time I’ve seen this error is when an automatic was in the device code and the compiler was trying to insert an implicit allocate/deallocate. While I doubt that’s the case here, there’s probably some other compiler runtime call being generated. Other than the allocate issue, I don’t see any technical problem reports for this error, so I don’t know which runtime routine is involved and will need a reproducer in order to track it down.

My best guess is that by promoting the default integer kind to 8, one of the intrinsics you’re using (maybe atomicadd?) doesn’t have a GPU version available. It’s possible that later versions have this corrected. Are you able to try 13.6?

Another possible solution is to not use “-i8” and instead use INTEGER(8) explicitly where needed.
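As a rough illustration of that second suggestion, here is a minimal sketch of a kernel that uses INTEGER(8) only for the quantities that can actually exceed the 4-byte range, while leaving everything else at the default kind. The module, kernel, and variable names are hypothetical and are not taken from scf.F, and this has not been checked against 12.10 or 13.6.

```fortran
module big_index_sketch          ! hypothetical module name
  use cudafor
  implicit none
contains
  attributes(global) subroutine scale_by_two(a, n)
    integer(8), value :: n       ! element count may exceed 2**31-1
    real(8) :: a(n)              ! kernel dummy arrays reside on the device
    integer(8) :: idx            ! 64-bit global index, declared explicitly
    integer :: ii                ! small counter stays a default integer

    ! Promote to 64 bits before multiplying so the product cannot overflow.
    idx = int(blockIdx%x - 1, 8) * int(blockDim%x, 8) + int(threadIdx%x, 8)
    ii = 3
    if (idx <= n) a(idx) = 2.0d0 * a(idx)
  end subroutine scale_by_two
end module big_index_sketch
```

Compiled without -i8, only n and idx are 8-byte integers here, so any intrinsic called with a default integer argument still sees a 4-byte value.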

- Mat

Hi Mat,

I managed to get access to 13.6, and compiling with it gives me the same error, although with a slightly better description:

PGF90-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unexpected runtime function call (scf.f: 1)
PGF90/x86-64 Linux 13.6-0: compilation aborted

So as suggested I compiled with -Minfo but I’m still none the wiser. Here is the output for the problem routine:

data_trans:
    241, Memory copy idiom, loop replaced by call to __c_mcopy8
    242, Memory copy idiom, loop replaced by call to __c_mcopy8
    247, Memory copy idiom, loop replaced by call to __c_mcopy8
    248, Memory copy idiom, loop replaced by call to __c_mcopy8
    249, Memory copy idiom, loop replaced by call to __c_mcopy8
    250, Memory copy idiom, loop replaced by call to __c_mcopy8
    251, Memory copy idiom, loop replaced by call to __c_mcopy8
    252, Memory copy idiom, loop replaced by call to __c_mcopy8
    253, Memory copy idiom, loop replaced by call to __c_mcopy8
    254, Memory copy idiom, loop replaced by call to __c_mcopy8
    255, Memory copy idiom, loop replaced by call to __c_mcopy8
    256, Memory copy idiom, loop replaced by call to __c_mcopy8
    257, Memory copy idiom, loop replaced by call to __c_mcopy8
    258, Memory copy idiom, loop replaced by call to __c_mcopy8
    259, Memory copy idiom, loop replaced by call to __c_mcopy8
    260, Memory copy idiom, loop replaced by call to __c_mcopy8
    261, Memory copy idiom, loop replaced by call to __c_mcopy8
    262, Memory copy idiom, loop replaced by call to __c_mcopy8
    263, Memory copy idiom, loop replaced by call to __c_mcopy8
    264, Memory copy idiom, loop replaced by call to __c_mcopy8
    265, Memory copy idiom, loop replaced by call to __c_mcopy8
    266, Memory copy idiom, loop replaced by call to __c_mcopy8
    267, Memory copy idiom, loop replaced by call to __c_mcopy8
    268, Memory copy idiom, loop replaced by call to __c_mcopy8
    269, Memory copy idiom, loop replaced by call to __c_mcopy8
    270, Memory copy idiom, loop replaced by call to __c_mcopy8
    271, Memory copy idiom, loop replaced by call to __c_mcopy8
    272, Memory copy idiom, loop replaced by call to __c_mcopy8
    273, Memory copy idiom, loop replaced by call to __c_mcopy8
    274, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
    275, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
    276, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
    277, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
    278, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
    279, Memory copy idiom, loop replaced by call to __c_mcopy8
    280, Memory copy idiom, loop replaced by call to __c_mcopy8
    281, Memory copy idiom, loop replaced by call to __c_mcopy8
    282, Memory copy idiom, loop replaced by call to __c_mcopy8
    283, Memory copy idiom, loop replaced by call to __c_mcopy8
    284, Memory copy idiom, loop replaced by call to __c_mcopy8
    285, Memory copy idiom, loop replaced by call to __c_mcopy8
    286, Memory copy idiom, loop replaced by call to __c_mcopy8
    287, Memory copy idiom, loop replaced by call to __c_mcopy8
    288, Memory copy idiom, loop replaced by call to __c_mcopy8
    289, Memory copy idiom, loop replaced by call to __c_mcopy8
    290, Memory copy idiom, loop replaced by call to __c_mcopy8
    291, Memory copy idiom, loop replaced by call to __c_mcopy8
    292, Memory copy idiom, loop replaced by call to __c_mcopy8
    293, Memory copy idiom, loop replaced by call to __c_mcopy8
    294, Memory copy idiom, loop replaced by call to __c_mcopy8
    295, Memory copy idiom, loop replaced by call to __c_mcopy8
    296, Memory copy idiom, loop replaced by call to __c_mcopy8
    297, Memory copy idiom, loop replaced by call to __c_mcopy8
    298, Memory copy idiom, loop replaced by call to __c_mcopy8
    299, Memory copy idiom, loop replaced by call to __c_mcopy8
    300, Memory copy idiom, loop replaced by call to __c_mcopy8
    301, Memory copy idiom, loop replaced by call to __c_mcopy8
    302, Memory copy idiom, loop replaced by call to __c_mcopy8
    303, Memory copy idiom, loop replaced by call to __c_mcopy8
    304, Memory copy idiom, loop replaced by call to __c_mcopy8
    305, Memory copy idiom, loop replaced by call to __c_mcopy8
    306, Memory copy idiom, loop replaced by call to __c_mcopy8
    307, Memory copy idiom, loop replaced by call to __c_mcopy8
    308, Memory copy idiom, loop replaced by call to __c_mcopy8
    313, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
    314, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
    315, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
    316, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
    317, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
    318, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
    319, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
    320, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
    321, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
    322, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
    323, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
iter_cuda:
    830, Loop not fused: function call before adjacent loop
    Generated vector sse code for the loop
    832, Loop unrolled 3 times (completely unrolled)
    890, Loop not fused: function call before adjacent loop
    Loop not vectorized: may not be beneficial
    Generated 2 alternate versions of the loop
    Unrolled inner loop 4 times
    907, Memory zero idiom, array assignment replaced by call to pgf90_mzero8
    908, Memory zero idiom, array assignment replaced by call to pgf90_mzero8
    909, Memory zero idiom, loop replaced by call to __c_mzero8
    910, Memory zero idiom, loop replaced by call to __c_mzero8
    911, Memory zero idiom, loop replaced by call to __c_mzero8
    912, Memory zero idiom, loop replaced by call to __c_mzero8
    918, maxval reduction inlined
    Loop not fused: function call before adjacent loop
    Unrolled inner loop 4 times
    Generated a prefetch instruction for the loop
    924, Memory zero idiom, loop replaced by call to __c_mzero8
    925, Memory copy idiom, loop replaced by call to __c_mcopy8
    927, maxval reduction inlined
    Loop not fused: function call before adjacent loop
    Unrolled inner loop 4 times
    Generated a prefetch instruction for the loop
    958, Loop not fused: function call before adjacent loop
    959, Loop unrolled 4 times
    978, Memory copy idiom, loop replaced by call to __c_mcopy8
    1094, Memory copy idiom, loop replaced by call to __c_mcopy8
    1098, Memory copy idiom, loop replaced by call to __c_mcopy8
    1200, Loop not fused: function call before adjacent loop
    Loop unrolled 8 times
    1315, Memory zero idiom, loop replaced by call to __c_mzero8
    1329, Memory zero idiom, array assignment replaced by call to pgf90_mzero8
    1372, Loop not fused: different loop trip count
    Loop unrolled 2 times
    1379, Loop not fused: complex flow graph
    1449, Memory zero idiom, loop replaced by call to __c_mzero8
    1465, Loop not fused: complex flow graph
    Generated vector sse code for the loop
    Generated a prefetch instruction for the loop
    1514, Loop not vectorized/parallelized: contains call
    1628, Loop not fused: different loop trip count
    Generated vector sse code for the loop
    Generated a prefetch instruction for the loop
    1632, Loop not fused: function call before adjacent loop
    Generated vector sse code for the loop
    Generated a prefetch instruction for the loop
    1883, Loop not fused: no successor loop
    1884, Unrolled inner loop 4 times
    Generated 4 prefetch instructions for the loop
    1896, Loop not fused: complex flow graph
    1897, Generated 3 alternate versions of the loop
    Generated vector sse code for the loop
    Generated 2 prefetch instructions for the loop
    1997, Loop not fused: no successor loop
    1998, Generated 5 alternate versions of the loop
    Generated vector sse code for the loop
    Generated 2 prefetch instructions for the loop
    2019, Loop not fused: function call before adjacent loop
    2020, Generated 4 alternate versions of the loop
    Generated vector sse code for the loop
    Generated a prefetch instruction for the loop
    2080, Loop not fused: function call before adjacent loop
    Generated 3 alternate versions of the loop
    Generated vector sse code for the loop
    2263, Loop not fused: no successor loop
    2268, Loop not fused: function call before adjacent loop
    2310, Loop not fused: complex flow graph
    Generated 4 alternate versions of the loop
    Generated vector sse code for the loop
    Generated 2 prefetch instructions for the loop
    2322, Loop not fused: no successor loop
    Generated 3 alternate versions of the loop
    Generated vector sse code for the loop
    Generated 3 prefetch instructions for the loop
    2328, Loop not fused: complex flow graph
    Generated 3 alternate versions of the loop
    Generated vector sse code for the loop
    Generated 2 prefetch instructions for the loop
    2339, Loop not fused: no successor loop
    Generated 3 alternate versions of the loop
    Generated vector sse code for the loop
    Generated 2 prefetch instructions for the loop
    2344, Loop not fused: function call before adjacent loop
    Generated 2 alternate versions of the loop
    Generated vector sse code for the loop
    Generated a prefetch instruction for the loop
host_cart_routine:
    2462, Loop not fused: function call before adjacent loop
    2483, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
    2487, Memory zero idiom, loop replaced by call to __c_mzero8
    2488, Memory zero idiom, loop replaced by call to __c_mzero8
    2490, Loop not fused: function call before adjacent loop
    Generated vector sse code for the loop
    2504, Loop not fused: function call before adjacent loop
    Generated vector sse code for the loop
    2517, Loop not fused: different loop trip count
    Generated vector sse code for the loop
    2521, Loop not vectorized: data dependency
host_dener:
    2671, Memory zero idiom, loop replaced by call to __c_mzero8
    2857, Memory zero idiom, loop replaced by call to __c_mzero8
    2883, Loop not vectorized/parallelized: contains call

I will send the code to tech support today.

Cheers,
Crip_crop

Hi Crip_crop,

Thanks for the example. I was able to reduce this down to the following test case:

% cat test.cuf 

module foo

  contains

  attributes(global) subroutine bar ()

     integer :: ii
     double precision :: mone
     ii = 3
     mone=(-1.00)**ii

   end subroutine bar

end module foo
% pgf90 -c test.cuf -i8
PGF90-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unexpected runtime function call (test.cuf: 1)

Looks like our underlying pow function doesn’t like integer*8. The workaround is to explicitly declare “ii” as integer*4. I added a problem report (TPR#19462) to see if we can add a new GPU routine to handle this case.
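For reference, the workaround applied to the reduced test case would look roughly like the sketch below. The only change is the explicit kind on “ii”, which keeps the exponent a 4-byte integer even when the file is built with -i8. This follows the description above and has not been verified here against 12.10 or 13.6.

```fortran
module foo

  contains

  attributes(global) subroutine bar ()

     integer(4) :: ii            ! explicit kind, not promoted by -i8
     double precision :: mone
     ii = 3
     mone=(-1.00)**ii            ! exponent is now a 4-byte integer

   end subroutine bar

end module foo
```

Other default integers in the module are still promoted by -i8, so any of them used as an exponent in device code would presumably need the same treatment.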

Best regards,
Mat

19462 - CUDA Fortran: Using “-i8” with pow gets “Unexpected runtime function call” from calling pow with an i8 argument.

thanks,
dave