How to use atomicadd

Hi,
I am getting the following error while using the atomicadd function in CUDA Fortran.
Error 1 unresolved external symbol _ATOMICADDF@8 referenced in function _PI_CUDA_PI_KERNEL__ENTRY@20 picuda.obj

I am calling atomicadd as shown below:
istat=atomicadd(y,x) (for y = y + x)
Can anyone please give me an example of how to use it? I saw the above syntax in one of the topics on this forum.
Regards
Kaustubh

Hi Kaustubh,

Can you please post a reproducing example? It looks like y and x might be floats, whereas we only support the integer versions of the atomics.
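
For the integer case, the call looks like this inside a kernel. The subroutine and names below are just an illustrative sketch, not a complete program:

    ! Minimal sketch: every thread atomically increments one device counter.
    attributes(global) subroutine count_hits(counter)
      integer, device :: counter(1)
      integer :: old
      ! atomicadd returns the value counter(1) held before the addition
      old = atomicadd(counter(1), 1)
    end subroutine count_hits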

  • Mat

Hi Mat,
Yes, I was using real variables.
Thanks

CUDA supports atomic add for floats on devices with compute capability 2.x.
Does PGI CUDA Fortran support this yet?
Karen

Hi Karen,

Does PGI CUDA Fortran support this yet?

Yes.
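
For example, here is a minimal sketch with illustrative names (not a tested program); it has to be built for a 2.x target, e.g. with -Mcuda=cc20:

    ! Each thread atomically adds its element of x into a single real(4) total.
    attributes(global) subroutine sum_kernel(x, total, n)
      real(4), device :: x(n), total(1)
      integer, value  :: n
      integer :: i
      real(4) :: old
      i = (blockidx%x-1)*blockdim%x + threadidx%x
      if (i <= n) old = atomicadd(total(1), x(i))
    end subroutine sum_kernel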

  • Mat

Do you know in which compiler version that support was added? I get the following error message when I try to use atomicadd() with reals.

PGF90-S-0155-Could not resolve generic procedure atomicadd

My code looks something like this:
tmpz = A(k,n,j)*C(l,m)
tmpf = atomicadd(sB_real(i,j),REAL(tmpz))
tmpf = atomicadd(sB_img(i,j),AIMAG(tmpz))
where tmpz, A, and C are complex(kind=8) and sB_real, sB_img, and tmpf are real(kind=8). sB_real and sB_img are in shared memory; tmpz and tmpf are thread-local.

You might notice that what I’d really like is atomicadd for complex, but handling the real and imaginary parts separately should produce a correct result for addition. Multiple threads will contribute to the same sB array location, hence the need for atomics.

Karen

Hi Karen,

I’ve forgotten the exact release, but it was an early 11.x. The problem here is that atomicadd only supports single precision. Unfortunately, I don’t know if/when NVIDIA will add hardware support for double precision atomics.
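
One common workaround in the meantime is to avoid the atomic entirely: have each block reduce its threads' contributions in shared memory and write one partial sum per block to global memory, then finish the sum on the host (or in a second kernel). A rough sketch with illustrative names, assuming a power-of-two block size of at most 256 threads:

    attributes(global) subroutine partial_sums(x, partial, n)
      real(8), device :: x(n), partial(*)
      integer, value  :: n
      real(8), shared :: s(256)          ! assumes blockdim%x <= 256
      integer :: i, tx, stride
      tx = threadidx%x
      i  = (blockidx%x-1)*blockdim%x + tx
      s(tx) = 0.0d0
      if (i <= n) s(tx) = x(i)
      call syncthreads()
      ! tree reduction in shared memory (power-of-two block size assumed)
      stride = blockdim%x/2
      do while (stride >= 1)
        if (tx <= stride) s(tx) = s(tx) + s(tx+stride)
        call syncthreads()
        stride = stride/2
      end do
      ! one partial sum per block; the host then adds up partial(1:griddim%x)
      if (tx == 1) partial(blockidx%x) = s(1)
    end subroutine partial_sums

The same idea carries over to the complex case by keeping two partial arrays, one for the real parts and one for the imaginary parts.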

  • Mat

Thanks, I’ll try another approach.
-Karen

I have a simple code that should compute a distribution function, but unfortunately it doesn’t work. I suspect the problem is in my use of atomicadd.

    attributes(global) subroutine stat_kernel(x,dist,Na,Nbin,dx,nTr)
    integer(4), device :: dist(Nbin)
    real(4), device :: x(Na)
    real(4), value :: dx
    integer, value :: Na, nTr, Nbin
    integer :: i, j, tx, ij
    call syncthreads()
    if (blockidx%x.eq.1) then
        tx=threadidx%x+1
        do i=Na*(tx-1)/nTr+1,Na*tx/nTr
            ij=int(X(i)/dx)+1
            ic=atomicadd(dist(ij),1)
        enddo
    endif
    end subroutine stat_kernel

The compilation proceeds fine, but when I execute it, the following error appears:
copyout MemCpy (host=0x4011a1e0, dev=0x8c00000, size=40) Failed :30(unknown error)

What am I doing wrong?

Can you send/post the main program that calls this CUDA Fortran kernel and produces the crash you showed above? Without being able to reproduce the issue, it’s hard to say what the problem is.

Thanks.

The main program is:

	program example27b
	use GPU_mod
	implicit real(4)(a-h,o-z)
	include "omp_lib.h"
	integer,parameter:: Nbin=10, Na=134217728
	integer i,j
	integer:: n,Ndev,dev,dist3(Nbin)
	real(4):: x(Na),dist(Nbin),dist2(Nbin),xBin(Nbin)
	integer, allocatable:: OffSet(:),StrSz(:)

    real(4), device, allocatable:: xD(:),xBinD(:)
	integer(4), device, allocatable:: dist3D(:)
    type(dim3) :: dimGrid, dimBlock

	allocate(dist3D(Nbin),xD(Na),xBinD(Nbin))

	count = 0
	Ndev=4
	call omp_set_num_threads(Ndev)

	nTr=16
    dimGrid = dim3( 16, 1, 1 )
    dimBlock = dim3( 16, 1, 1 )
	
	
	do i=1,Na
		call random_number(f)
		X(i)=f
	enddo
	
	dist=0.E0
	xBin=0.E0
	dx=1.E0/real(Nbin)
	do i=1,Nbin
		xBin(i)=(real(i)-0.5E0)*dx
	enddo
	
	start_time=omp_get_wtime()
	do i=1,Na
		ij=int(X(i)/dx)+1
		if (ij.eq.Nbin+1) print *,X(i)
		dist(ij)=dist(ij)+1.E0
	enddo

	end_time=omp_get_wtime()
	print *,"Sequential time= ",(end_time-start_time)
	
	start_time=omp_get_wtime()
 
    !$omp parallel private(i,ij,ii) reduction (+:dist2)
    
	dev=omp_get_thread_num()
    do ii=1,Na,Ndev
		i=ii+dev
		ij=int(X(i)/dx)+1
		dist2(ij)=dist2(ij)+1.E0
	enddo
	!$omp end parallel
	end_time=omp_get_wtime()
	print *,"OpenMP time= ",(end_time-start_time)	

	kol=0

	print *,' '

	start_time=omp_get_wtime()
	istat=cudaMemset(xD,0.0E0,Na)
	istat=cudaMemset(dist3D,0,Nbin)
	istat=cudaMemcpy(xD,x,Na,cudaMemcpyHostToDevice)
	end_time=omp_get_wtime()
	print *,"host to device copy time= ",(end_time-start_time)

	start_time=omp_get_wtime()
	call stat_kernel<<<dimGrid,dimBlock>>>(xD,dist3D,Na,Nbin,dx,nTr)
	end_time=omp_get_wtime()
	print *,"GPU time= ",(end_time-start_time)

	istat=cudaMemcpy(dist3,dist3D,Nbin,cudaMemcpyDeviceToHost)


	print *,' '
	deallocate(dist3D,xD,xBinD)

	end

I’m using a GeForce GT 635M and PGI Fortran 11.7 on Win 7 x64.

Thanks! I will give it a try.

I found a couple of issues with the global subroutine stat_kernel. First, setting tx=threadidx%x+1 is incorrect: adding one is not necessary in CUDA Fortran, because the threadidx values are one-based, not zero-based as they are in CUDA C. The second issue is how the loop bounds are computed. It’s possible to overflow a 32-bit integer when computing Na*tx/nTr, given certain values of Na. There are a number of ways to fix this; for example, create a variable that is a 64-bit integer:

integer(8) :: ie

Then compute the value for the loop exit as follows:

   ie = Na
   ie = (ie * tx)/nTr

Then just use this variable in the do statement as follows:

    do i=Na*(tx-1)/nTr+1,ie
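
To make the overflow concrete with the values used in this thread: Na = 134217728 and tx can reach 16 (17 with the extra +1), so Na*tx is at least 2147483648, one more than huge(0) = 2147483647 for default 32-bit integers. The product wraps around before the division by nTr is ever applied, so the loop bound becomes garbage. A tiny illustration:

    integer :: Na, tx, nTr
    Na = 134217728; tx = 16; nTr = 16
    ! Na*tx = 2**31 overflows a 32-bit integer (typically wrapping to
    ! -2147483648), so the divided result is wrong:
    print *, Na*tx/nTr        ! likely prints -134217728, not 134217728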

Dear toepfer, thank you very much for your reply, but could you please tell me more about the nature of this overflow? Why does it not affect the results on the CPU? How can I diagnose this kind of problem in the future?