Used of shared memory in device function


I’m trying to use shared memory in a device function but I get the message:

SHARED attribute ignored on dummy argument totals_shared

and then:
Error 3 Internal compiler error. Unexpected runtime function call 0

The shared memory is declared properly in the kernel (size known at compile time) and passed as reference to the function, I have tried all different ways to declare the shared memory, but I don’t seem to be able to get around this problem.

 attributes(global) subroutine multiScanTestBox()
    integer,volatile, shared		:: reduction_shared(3256)
    integer,volatile, shared		:: totals_shared(48)
 totalsSum = Multiscan Multiscan(tid, x, total, reduction_shared, totals_shared)


!The function multiscan goes:

attributes(device) integer function Multiscan(tid, x, total, reduction_shared, totals_shared)
  integer,volatile, shared	:: reduction_shared(3256)!(SCANSIZE)!(*)!(:)!(3256)

!...etc etc

Hi David,

Dummy arguments can’t have the shared attribute, so you’ll need to remove it from the declarations in the Multigrid function. You’ll still be using shared memory since you pass in the addresses of the shared arrays so the code should function as expected.

Hope this helps,

Brilliant! Thanks :-)

By the way, do you know what the
“: error F0000 : Internal compiler error. Unexpected runtime function call 0”

really means?, I get that from time to time…

Hi David,

In this case, the message is coming from the NVIDIA back end compiler. Most likely we’re presenting it with some intermediate code it doesn’t like. Though, I’d need a reproducing example to understand what “function call 0” means.

  • Mat

Well this si the function giving all the trouble… I guess is still the same issue. I’m programming Sean Baxter’s scan and radix sort subroutines in CUDA Fortran, I guess is still the problem with shared memory declaration?

type integer2
  integer :: x
  integer :: y
end type

 attributes(device) type(integer2) function Multiscan(tid, x, reduction_shared, totals_shared)
   integer :: tid
   integer :: x
   integer :: warp, lane, i, sum, offset
   integer :: total, totalsSum
   type(integer2) :: result

   integer,volatile, dimension(:)	:: reduction_shared!(SCANSIZE)!(*)!(:)!(3256)!(ScanSize)
   integer,volatile, dimension(:)		:: totals_shared!((*)! (48)!(NUM_WARPS + NUM_WARPS/2)

   integer, volatile :: s, s2 !we have a problem here in the translation of
   warp = tid / WARP_SIZE ! check this one for fortran charac.
   lane = IAND((WARP_SIZE - 1), tid) + 1 !in fortran so we are starting in 1; in c: (WARP_SIZE - 1) & tid 
   s = SCANSTRIDE * warp + lane + WARP_SIZE / 2 !index/pointer 
   reduction_shared(s - 16) = 0 !The first 32 position will be filled with zeros
   reduction_shared(s) = x		!And now only the first 16 will...

   !! Run inclusive scan on each warp's data.
    sum = x
	!CUDA Fortran compiler is suppoused to unroll the loop for us...
	do i = 1, LOG_WARP_SIZE
		offset = ISHFT(1, i-1)!1 << (i - 1)
		sum = sum + reduction_shared(s-offset)
		reduction_shared(s) = 0
	end do

	!! Synchronize to make all totals available to the reduction code
	call syncthreads()

	if(tid < NUM_WARPS)then
		!! Grab the block total for the tid'th block. This is the last element
		!! in the block's scanned sequence. This operation avoids bank
		!! conflicts.
		total = reduction_shared(ScanStride* tid + WARP_SIZE/2 + WARP_SIZE ) !- 1) !this -1 may be eliminated
		totals_shared(tid) = 0
		s2 = NUM_WARPS / 2 + tid
		totalsSum = total
		totals_shared(s2) = total

		!! Compiler shoud unroll this one
		do i = 1, LOG_NUM_WARPS
			offset = ISHFT(1, i-1)!1 << (i - 1)
			totalsSum = totalsSum + totals_shared(s2-offset)
			totals_shared(s2) = totalsSum
		end do

		!! Subtract total from totalsSum for an exclusive scan.
		totals_shared(tid) = totalsSum - total
	end if

	!! Synchronize to make the block scan available to all warps
	call syncthreads()
	sum = sum + totals_shared(warp)
	total = totals_shared(NUM_WARPS + NUM_WARPS / 2) !)- 1) !el - 1
	result%x = sum
	result%y = total
	!!!!!!!!!!!!!!!!!!!!! and return...
	Multiscan = result

 end function Multiscan

Hi David,

I can’t really tell much from this. Can you send a reproducing example to PGI Customer Service ( and ask them to send it to me?

Also, which error are you getting with this code? The function 0 ICE or the Shared dummy as an argument?


it is the 0 ICE problem.
Th shared memory dummy was solved with your tip.

I’ll try to send the full code laetr today.


Ok the problem seem to be using the ISHFT bit intrinsic, any reason for that?
According to the CUDA Fortran reference is a perfectly valid call…
integer ishft(integer, integer)…

It is relatevely easy to sustitue as I use to calculate multiples of 2 … but bit intrinsics are fast…I would like to use them…

Yep, it’s the ISHFT. CUDA Fortran does support ISHFT, but currently only if the “shift” argument is a constant. In this case, ISHFT is inlined but when it’s a variable, a call is emitted.

I asked engineering and they do have these on their TODO list but it was pushed to a lower priority (you’re the first to ask for these). I added a report (TPR#18883) to help track this and the other missing elemental functions.


Any other bit intrinsics not supported?.. I kind of need to know because I’m programming a Radix Sort on CUDA Fortran and I will make heavy use of all of them.

Best Regards,

It will be the same for all of them (btest, iand, ibclr, ibits, ibset, ieor, ior, ishft, ishftc, not) since we don’t have the GPU run time device routine. If the intrinsic gets inlined, then it’s not a problem.

  • Mat

mmm can I use the “inline” keyword to force it to be inlined?
the ibits one is critical for the radix sort…

mmm can I use the “inline” keyword to force it to be inlined?
the ibits one is critical for the radix sort…

When they can be inlined, they are inlined. But the user can’t effect this.

I wish I knew off hand the exact circumstance of when each are inlined or not. Though, I’ll do some digging on ibits and get back to you.

  • Mat

Word from engineering is that they’ll have “btest”, “iand”, “ibclr”, “ibset”, “ieor”, and “iand” into 12.9. Granted, I haven’t verified it myself, but that’s the word.

  • Mat

Thanks for the heads up Mat.
A very important one is ibits, just in case priorities are been asigned.

Best Regards,

Late notice.

TPR 18883 - CUDA Fortran: Full implementation of Fortran elemental functions

was corrected as of the 12.9 release.


I will like to know if its possible to copy data from one shared array to another? In my code I passed two fixed size shared arrays to a device subroutine in which I copy some of potions of the array from one to another, something like:

sXconf(j1,:,:) = spXconf(j1,:,:)

where j1 is the threadid, the code compiles well but at runtime I get a memcpy error which using cuda-memcheck I get the following:

========= Invalid __global__ write of size 8
=========     at 0x00001098 in /home/godfred/bio/Godfred/papers/2014/qmc/Codes/CUDA4-5/host_subs_m.CUF:798:host_subs_m_d_pmetropolis_
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x3f4d77f200000001 is misaligned
=========     Device Frame:/home/godfred/bio/Godfred/papers/2014/qmc/Codes/CUDA4-5/host_subs_m.CUF:1080:host_subs_m_kernel1_ (host_subs_m_kernel1_ : 0x2ce0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/ (cuLaunchKernel + 0x2c5) [0x14ad95]
=========     Host Frame:/state/partition1/pgitest/linux86-64/2015/cuda/6.5/lib64/ [0xf5d8]
=========     Host Frame:/state/partition1/pgitest/linux86-64/2015/cuda/6.5/lib64/ (cudaLaunch + 0x143) [0x36833]
=========     Host Frame:./gpuqmc [0xb5bc]
=========     Host Frame:./gpuqmc [0xf239]
=========     Host Frame:./gpuqmc [0x4a14]
=========     Host Frame:/lib64/ (__libc_start_main + 0xf4) [0x1d994]
=========     Host Frame:./gpuqmc (malloc + 0x1ed) [0x4925]
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaMemcpy. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib64/ [0x2ef613]
=========     Host Frame:/state/partition1/pgitest/linux86-64/2015/cuda/6.5/lib64/ (cudaMemcpy + 0x1df) [0x3370f]
=========     Host Frame:/state/partition1/pgitest/linux86-64/15.7/lib/ [0x1a0d8]
=========     Host Frame:/state/partition1/pgitest/linux86-64/15.7/lib/ (pgf90_dev_copyout + 0x58) [0x1a194]
=========     Host Frame:./gpuqmc [0xf26b]
0: copyout Memcpy (host=0x200d00200, dev=0x1302080400, size=8) FAILED: 4(unspecified launch failure)
=========     Host Frame:./gpuqmc [0x4a14]
=========     Host Frame:/lib64/ (__libc_start_main + 0xf4) [0x1d994]
=========     Host Frame:./gpuqmc (malloc + 0x1ed) [0x4925]
========= ERROR SUMMARY: 2 errors

where host_subs_m.CUF:798 refers to the line with the code above.

It should be ok to do this.

PGI Customer Support sent me your code and once I can get it built (I sent you the compile error I have), I’ll try to determine where the error is.

  • Mat