I have developed a CUDA Fortran code and I’m now trying to use it with very large datasets.
The code has been working well until now, but to handle large datasets I need integer(8) variables.
To do this I tried compiling with -i8 but my CUDA Fortran module doesn’t like this and fails to compile with the error:
PGF90-F-0000-Internal compiler error. Unexpected runtime function call 0 (scf.f: 3317)
PGF90/x86-64 Linux 12.10-0: compilation aborted
The module scf.F, which is failing to compile, consists of host code followed by several global routines. The line number it is failing on, 3317, is the end of the very first global (device) subroutine. The routine uses only supported intrinsics and variables that have been explicitly transferred to the device.
This is definitely a problem with -i8 as it compiles perfectly without it.
Do I need to modify anything in my CUDA code to be able to use -i8?
The only time I’ve seen this error is when an automatic array was in the device code and the compiler was trying to insert an implicit allocate/deallocate. While I doubt that’s the case here, there’s probably some other compiler runtime call being generated. Other than the allocate issue, I don’t see any technical problem reports for this, so I don’t know which routine it is and will need to ask for a reproducer in order to track it down.
My best guess is that by promoting the default integer kind to 8, one of the intrinsics you’re using (maybe atomicadd?) doesn’t have a GPU version available. It’s possible that later versions have this corrected. Are you able to try 13.6?
Another possible solution is to not use “-i8” and instead use INTEGER(8) explicitly where needed.
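As a rough sketch of that second suggestion, the host-side fragment below declares only the size counter as INTEGER(8) and leaves everything else at the default kind, so no -i8 flag is needed. The module, function, and variable names are hypothetical and not taken from scf.F, and the kind numbers 4 and 8 assume a compiler (such as PGI or gfortran) where the kind value equals the size in bytes.

```fortran
module big_sizes
  implicit none
contains
  ! Total element count of an nx*ny*nz grid, computed in 64-bit arithmetic.
  ! (Hypothetical helper, for illustration only.)
  function total_elements(nx, ny, nz) result(n)
    integer(4), intent(in) :: nx, ny, nz   ! per-dimension extents fit in 32 bits
    integer(8) :: n                        ! the product may not
    ! Promote each factor first so the multiplication is done in 64 bits.
    n = int(nx, 8) * int(ny, 8) * int(nz, 8)
  end function total_elements
end module big_sizes

program demo
  use big_sizes
  implicit none
  integer(8) :: n
  n = total_elements(2048, 2048, 1024)     ! 2**32 elements
  print *, 'elements:', n
  print *, 'exceeds default integer range:', n > int(huge(1), 8)
end program demo
```

Only the quantities that can actually exceed the 32-bit range are widened, so loop counters and exponents inside device routines keep the default 4-byte kind.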
I managed to get access to 13.6, and compiling with it gives me the same error, although with a slightly better description:
PGF90-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unexpected runtime function call (scf.f: 1)
PGF90/x86-64 Linux 13.6-0: compilation aborted
So as suggested I compiled with -Minfo but I’m still none the wiser. Here is the output for the problem routine:
data_trans:
241, Memory copy idiom, loop replaced by call to __c_mcopy8
242, Memory copy idiom, loop replaced by call to __c_mcopy8
247, Memory copy idiom, loop replaced by call to __c_mcopy8
248, Memory copy idiom, loop replaced by call to __c_mcopy8
249, Memory copy idiom, loop replaced by call to __c_mcopy8
250, Memory copy idiom, loop replaced by call to __c_mcopy8
251, Memory copy idiom, loop replaced by call to __c_mcopy8
252, Memory copy idiom, loop replaced by call to __c_mcopy8
253, Memory copy idiom, loop replaced by call to __c_mcopy8
254, Memory copy idiom, loop replaced by call to __c_mcopy8
255, Memory copy idiom, loop replaced by call to __c_mcopy8
256, Memory copy idiom, loop replaced by call to __c_mcopy8
257, Memory copy idiom, loop replaced by call to __c_mcopy8
258, Memory copy idiom, loop replaced by call to __c_mcopy8
259, Memory copy idiom, loop replaced by call to __c_mcopy8
260, Memory copy idiom, loop replaced by call to __c_mcopy8
261, Memory copy idiom, loop replaced by call to __c_mcopy8
262, Memory copy idiom, loop replaced by call to __c_mcopy8
263, Memory copy idiom, loop replaced by call to __c_mcopy8
264, Memory copy idiom, loop replaced by call to __c_mcopy8
265, Memory copy idiom, loop replaced by call to __c_mcopy8
266, Memory copy idiom, loop replaced by call to __c_mcopy8
267, Memory copy idiom, loop replaced by call to __c_mcopy8
268, Memory copy idiom, loop replaced by call to __c_mcopy8
269, Memory copy idiom, loop replaced by call to __c_mcopy8
270, Memory copy idiom, loop replaced by call to __c_mcopy8
271, Memory copy idiom, loop replaced by call to __c_mcopy8
272, Memory copy idiom, loop replaced by call to __c_mcopy8
273, Memory copy idiom, loop replaced by call to __c_mcopy8
274, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
275, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
276, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
277, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
278, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
279, Memory copy idiom, loop replaced by call to __c_mcopy8
280, Memory copy idiom, loop replaced by call to __c_mcopy8
281, Memory copy idiom, loop replaced by call to __c_mcopy8
282, Memory copy idiom, loop replaced by call to __c_mcopy8
283, Memory copy idiom, loop replaced by call to __c_mcopy8
284, Memory copy idiom, loop replaced by call to __c_mcopy8
285, Memory copy idiom, loop replaced by call to __c_mcopy8
286, Memory copy idiom, loop replaced by call to __c_mcopy8
287, Memory copy idiom, loop replaced by call to __c_mcopy8
288, Memory copy idiom, loop replaced by call to __c_mcopy8
289, Memory copy idiom, loop replaced by call to __c_mcopy8
290, Memory copy idiom, loop replaced by call to __c_mcopy8
291, Memory copy idiom, loop replaced by call to __c_mcopy8
292, Memory copy idiom, loop replaced by call to __c_mcopy8
293, Memory copy idiom, loop replaced by call to __c_mcopy8
294, Memory copy idiom, loop replaced by call to __c_mcopy8
295, Memory copy idiom, loop replaced by call to __c_mcopy8
296, Memory copy idiom, loop replaced by call to __c_mcopy8
297, Memory copy idiom, loop replaced by call to __c_mcopy8
298, Memory copy idiom, loop replaced by call to __c_mcopy8
299, Memory copy idiom, loop replaced by call to __c_mcopy8
300, Memory copy idiom, loop replaced by call to __c_mcopy8
301, Memory copy idiom, loop replaced by call to __c_mcopy8
302, Memory copy idiom, loop replaced by call to __c_mcopy8
303, Memory copy idiom, loop replaced by call to __c_mcopy8
304, Memory copy idiom, loop replaced by call to __c_mcopy8
305, Memory copy idiom, loop replaced by call to __c_mcopy8
306, Memory copy idiom, loop replaced by call to __c_mcopy8
307, Memory copy idiom, loop replaced by call to __c_mcopy8
308, Memory copy idiom, loop replaced by call to __c_mcopy8
313, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
314, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
315, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
316, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
317, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
318, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
319, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
320, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
321, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
322, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
323, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
iter_cuda:
830, Loop not fused: function call before adjacent loop
Generated vector sse code for the loop
832, Loop unrolled 3 times (completely unrolled)
890, Loop not fused: function call before adjacent loop
Loop not vectorized: may not be beneficial
Generated 2 alternate versions of the loop
Unrolled inner loop 4 times
907, Memory zero idiom, array assignment replaced by call to pgf90_mzero8
908, Memory zero idiom, array assignment replaced by call to pgf90_mzero8
909, Memory zero idiom, loop replaced by call to __c_mzero8
910, Memory zero idiom, loop replaced by call to __c_mzero8
911, Memory zero idiom, loop replaced by call to __c_mzero8
912, Memory zero idiom, loop replaced by call to __c_mzero8
918, maxval reduction inlined
Loop not fused: function call before adjacent loop
Unrolled inner loop 4 times
Generated a prefetch instruction for the loop
924, Memory zero idiom, loop replaced by call to __c_mzero8
925, Memory copy idiom, loop replaced by call to __c_mcopy8
927, maxval reduction inlined
Loop not fused: function call before adjacent loop
Unrolled inner loop 4 times
Generated a prefetch instruction for the loop
958, Loop not fused: function call before adjacent loop
959, Loop unrolled 4 times
978, Memory copy idiom, loop replaced by call to __c_mcopy8
1094, Memory copy idiom, loop replaced by call to __c_mcopy8
1098, Memory copy idiom, loop replaced by call to __c_mcopy8
1200, Loop not fused: function call before adjacent loop
Loop unrolled 8 times
1315, Memory zero idiom, loop replaced by call to __c_mzero8
1329, Memory zero idiom, array assignment replaced by call to pgf90_mzero8
1372, Loop not fused: different loop trip count
Loop unrolled 2 times
1379, Loop not fused: complex flow graph
1449, Memory zero idiom, loop replaced by call to __c_mzero8
1465, Loop not fused: complex flow graph
Generated vector sse code for the loop
Generated a prefetch instruction for the loop
1514, Loop not vectorized/parallelized: contains call
1628, Loop not fused: different loop trip count
Generated vector sse code for the loop
Generated a prefetch instruction for the loop
1632, Loop not fused: function call before adjacent loop
Generated vector sse code for the loop
Generated a prefetch instruction for the loop
1883, Loop not fused: no successor loop
1884, Unrolled inner loop 4 times
Generated 4 prefetch instructions for the loop
1896, Loop not fused: complex flow graph
1897, Generated 3 alternate versions of the loop
Generated vector sse code for the loop
Generated 2 prefetch instructions for the loop
1997, Loop not fused: no successor loop
1998, Generated 5 alternate versions of the loop
Generated vector sse code for the loop
Generated 2 prefetch instructions for the loop
2019, Loop not fused: function call before adjacent loop
2020, Generated 4 alternate versions of the loop
Generated vector sse code for the loop
Generated a prefetch instruction for the loop
2080, Loop not fused: function call before adjacent loop
Generated 3 alternate versions of the loop
Generated vector sse code for the loop
2263, Loop not fused: no successor loop
2268, Loop not fused: function call before adjacent loop
2310, Loop not fused: complex flow graph
Generated 4 alternate versions of the loop
Generated vector sse code for the loop
Generated 2 prefetch instructions for the loop
2322, Loop not fused: no successor loop
Generated 3 alternate versions of the loop
Generated vector sse code for the loop
Generated 3 prefetch instructions for the loop
2328, Loop not fused: complex flow graph
Generated 3 alternate versions of the loop
Generated vector sse code for the loop
Generated 2 prefetch instructions for the loop
2339, Loop not fused: no successor loop
Generated 3 alternate versions of the loop
Generated vector sse code for the loop
Generated 2 prefetch instructions for the loop
2344, Loop not fused: function call before adjacent loop
Generated 2 alternate versions of the loop
Generated vector sse code for the loop
Generated a prefetch instruction for the loop
host_cart_routine:
2462, Loop not fused: function call before adjacent loop
2483, Memory copy idiom, array assignment replaced by call to pgf90_mcopy8
2487, Memory zero idiom, loop replaced by call to __c_mzero8
2488, Memory zero idiom, loop replaced by call to __c_mzero8
2490, Loop not fused: function call before adjacent loop
Generated vector sse code for the loop
2504, Loop not fused: function call before adjacent loop
Generated vector sse code for the loop
2517, Loop not fused: different loop trip count
Generated vector sse code for the loop
2521, Loop not vectorized: data dependency
host_dener:
2671, Memory zero idiom, loop replaced by call to __c_mzero8
2857, Memory zero idiom, loop replaced by call to __c_mzero8
2883, Loop not vectorized/parallelized: contains call
Thanks for the example. I was able to reduce this down to the following test case:
% cat test.cuf
module foo
contains
  attributes(global) subroutine bar ()
    integer :: ii
    double precision :: mone
    ii = 3
    mone=(-1.00)**ii
  end subroutine bar
end module foo
% pgf90 -c test.cuf -i8
PGF90-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unexpected runtime function call (test.cuf: 1)
Looks like our underlying pow function doesn’t like integer(8). The workaround is to explicitly declare “ii” as integer(4). I added a problem report (TPR#19462) to see if we can add a new GPU routine to handle this case.
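Applied to the test case, the workaround might look like the sketch below. Only the declaration of ii changes: it is pinned to a 4-byte kind so that -i8 no longer promotes it. This has not been verified against 12.10 or 13.6, and it assumes the integer(8) exponent is the only thing triggering the runtime call.

```fortran
module foo
contains
  attributes(global) subroutine bar ()
    integer(4) :: ii            ! explicit kind, so -i8 does not promote it
    double precision :: mone
    ii = 3
    mone = (-1.00)**ii          ! exponent is now a 4-byte integer
  end subroutine bar
end module foo
```

It would be compiled the same way as before, with pgf90 -c test.cuf -i8. In a larger code such as scf.F, the same change would apply to every integer used as an exponent inside a device routine.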