I basically need to use atomic memory functions to prevent race conditions in my device code, but I haven't been able to because NVIDIA only supports these functions in single precision.
However, I've recently become aware that there is a way of converting double precision data so that it is stored in two integer/single precision memory slots, hence solving that issue. The problem now is that I don't have a clue how to go about doing the 1-double-to-2-single exchange.
A while back I was porting some C code to the GPU, and I remember that to use texture memory I had to store the double precision data as singles: I declared the texture with the int2 type and used __hiloint2double when fetching from it.
Does anyone have any idea how to do this when not using texture memory? Or has anyone ever done anything similar to what I’m trying to achieve?
Sorry, but I've not tried this before. Have you seen anything on doing this in CUDA C? We might then be able to translate it to CUDA Fortran.
Note that we’re really close to getting textures into CUDA Fortran. The push to get OpenACC fully implemented did delay textures a bit, but once OpenACC 1.0 is out, we should be able to get back and finish it up.
Well, the only example I can give at the moment is the bit of CUDA C I wrote a couple of years back, but it involves textures. It may be useful though…
[b]File scope[/b]
/*Declaring texture references for arrays a and b*/
texture<int2,1> texRefa;
texture<int2,1> texRefb;
[b]Host code[/b]
/*Binding device array Ad to texture reference texRefa*/
cudaBindTexture(0, texRefa, Ad, size);
/*Binding device array Bd to texture reference texRefb*/
cudaBindTexture(0, texRefb, Bd, size);
[b]Device code[/b]
/*Fetching a and b values from texture memory, accumulating the result in sum*/
int2 sha = tex1Dfetch(texRefa, ii+k);
int2 shb = tex1Dfetch(texRefb, jj+k);
sum = sum + __hiloint2double(sha.y, sha.x) * __hiloint2double(shb.y, shb.x);
} /* closes the surrounding loop (not shown) */
Although this is quite different from what I need, it might give you an idea of what I'm trying to achieve (minus the texture and plus an atomic function).
NVIDIA’s forums are down for maintenance at the moment so I’m struggling to find any more suitable C examples.
[quote]I basically need to use atomic memory functions to prevent race conditions in my device code but I haven't been able to because NVIDIA only supports these functions in single precision.[/quote]
Sorry that I didn't point this out earlier, but NVIDIA does support 64-bit atomic operations on newer GPUs (compute capability 2.x). Maybe the easiest thing to do would be to upgrade your card?
Well, I have a Fermi C2050, so presumably it would be supported there.
Does this mean that this statement in the PGI CUDA Fortran User's Guide is incorrect, then?
Arithmetic and Bitwise Atomic Functions
These atomic functions read and return the value of the first argument. They also combine that value with
the value of the second argument, depending on the function, and store the combined value back to the first
argument location. Both arguments must be of type integer(kind=4).
Okay, I tried using the atomicadd function with double precision mem and val variables and the result was…
PGF90-S-0155-Could not resolve generic procedure atomicadd (acceler_formd.f: 380)
0 inform, 0 warnings, 1 severes, 0 fatal for formd_cuda
so, I changed these variables to integer and it compiled fine.
I'm a bit confused, as clearly PGI Fortran does not support 64-bit atomic functions. Do you mean that I'd need to create a wrapper for the function and call it directly from the C for CUDA library?
Are you using a constant literal for val? The default kind for constant literals is 4, but val is expecting a kind of 8. Hence, you need to add an "_8" to the end of the literal in order to set the kind to 8.
Note that I did some checking, and while we've had 64-bit atomics in since we added CUDA 4.0 support, it looks like there was an issue with the 64-bit atomicadd where you'll get an undefined identifier error (TPR#18767). This issue will be resolved in the next release.
Hopefully soon. It's already been delayed by several weeks due to unforeseen problems. I was told this morning that, barring more issues in our final QA testing, it should be out later this week or early next.
Can I just check with you that this code will work in the new release? I just don't want to delay my work for a week if it ends up not having the right functionality. Here I go…
! Device code
      integer :: istat
      tx = threadIdx%x
      bx = blockIdx%x
      i = ((bx-1)*blocksize) + tx
      x = 0
      y = 0
      do j = 1, catoms_d(i)
        katm = coresubatoms(i,j)
        do k = 1, catoms_d(i)
          latm = coresubatoms(i,k)
          llk = lowlim(katm)
          lll = lowlim(latm)
          norbk = natorb(ian(katm))
          norbl = natorb(ian(latm))
          do iii = 1, norbk
            ik = llk + iii - 1
            ix = x + iii
            do jjj = 1, norbl
              jk = lll + jjj - 1
              jy = y + jjj
              if (ik .ge. jk) then
                ij = (ik*(ik-1)/2) + jk
                istat = atomicadd(globdens_d(ij),
     &                            subdens(i,ix,jy))
c$$$            globdens_d(ij) = globdens_d(ij) + subdens(i,ix,jy)
              end if
            end do
            jy = y
          end do
          y = y + norbl
        end do
        x = x + norbk
        y = 0
      end do
This code compiles fine when globdens_d and subdens are declared as integer, so presumably in the new release it will work when they are declared as above?
Sorry, but it looks like we have some miscommunication here. Since you were originally working with the "int2 intrinsic", I assumed you were asking about integer atomics, not floating point. This is my fault, for which I apologise.
While NVIDIA did add a few 64-bit integer atomic routines, no such atomic routines exist for 64-bit floating point, nor do I believe NVIDIA has plans to add them. Hence, that puts us back to your original question, for which I don't have an answer.
I've only just seen your post. I can't seem to find any documentation about it. I'm a bit skeptical because, as far as I'm aware, C for CUDA hasn't yet included a double precision atomicadd function, and CUDA Fortran usually has the same functionality.
Could you point me in the right direction please? That would be much appreciated.
module atomictests
contains
  attributes(global) subroutine testatomicdadd( a )
    real*8, device :: a
    real*8 r
    r = dble(threadIdx%x)
    istat = atomicadd(a, r)
    return
  end subroutine testatomicdadd
end module atomictests
brentl> cat atomd1.cuf
module atomictests
contains
  attributes(global) subroutine testatomicdadd( a )
    real*8, device :: a
    real*8 r
    r = dble(threadIdx%x)
    istat = atomicadd(a, r)
    return
  end subroutine testatomicdadd
end module atomictests

program t
  use cudafor
  use atomictests
  real*8, allocatable, device :: r
  real*8 x
  allocate(r)
  r = 0
  n = 32
  call testatomicdadd<<<1,n>>> (r)
  x = r
  print *, x, dble(n*(n+1)/2)
  if (x .eq. dble(n*(n+1)/2)) then
    print *, 'TEST PASSED'
  else
    print *, 'TEST FAILED'
  endif
end
end