Using atomic memory functions

Hi there,

I basically need to use atomic memory functions to prevent race condition in my device code but I haven’t been able to because NVIDIA only support these functions in single precision.

However, I’ve recently become aware that there is a way of converting double precision data so that it is stored in two integer/single precision memory slots…hence solving that issue. The problem now is that I don’t have a clue how to go about doing the 1double-to-2single exchange.

A while back I was porting some C code to gpu and I remember that to use texture memory I had to save the double precision data as single and I used the int2 intrinsic to declare the texture followed by __hiloint2double when fetching the texture.

Does anyone have any idea how to do this when not using texture memory? Or has anyone ever done anything similar to what I’m trying to achieve?

Cheers,
Crip_crop

Hi Crip Crop,

Sorry, but I’ve not tried this before. Have you seen anything on doing this in CUDA C? We might be able to then translate it to CUDA Fortran.

Note that we’re really close to getting textures into CUDA Fortran. The push to get OpenACC fully implemented did delay textures a bit, but once OpenACC 1.0 is out, we should be able to get back and finish it up.

  • Mat

Well the only example I can give at the moment is the bit of CUDA C I did a couple of years back but it involves textures. It may be useful though…

[b] 
File scope[/b]

*Declaring texture references for arrays a and b*/
texture<int2,1> texRefa;
texture<int2,1> texRefb;

[b]
Host code[/b]

/*Declaring texture reference object for a*/
cudaBindTexture(0, texRefa, Ad, size);

/*Declaring texture reference object for b*/
cudaBindTexture(0, texRefb, Bd, size);

[b]
Device code[/b]

/*Fetching a and b values from texture memory, accumulating result in sum*/
int2 sha = tex1Dfetch(texRefa, ii+k);
int2 shb = tex1Dfetch(texRefb, jj+k);
(sum)= (sum) + (__hiloint2double(sha.y,sha.x)) * (__hiloint2double(shb.y,shb.x));
}

Although this is quite different from what I need it might give you an idea of what I’m trying to achieve (but minus the texture and plus an atomic function).

NVIDIA’s forums are down for maintenance at the moment so I’m struggling to find any more suitable C examples.

Please let me know if this helps at all.

Cheers,
Crip_crop

Hi Crip Crop,

I basically need to use atomic memory functions to prevent race condition in my device code but I haven’t been able to because NVIDIA only support these functions in single precision.

Sorry that I didn’t point this out earlier, but NVIDIA does support 64-bit atomic operations on newer GPUs (CC 2.x). Maybe the easiest thing to do would be to upgrade your card?

  • Mat

Well I have a Fermi c2050 so presumably it would be supported on there.

Does this mean that this statement in the pgi cudafor u.g. is incorrect then?

Arithmetic and Bitwise Atomic Functions
These atomic functions read and return the value of the first argument. They also combine that value with
the value of the second argument, depending on the function, and store the combined value back to the first
argument location. Both arguments must be of type integer(kind=4).

Cheers,
Crip_crop

Okay, I tried using the atomicadd function with double precision mem and val variable and the result was…

PGF90-S-0155-Could not resolve generic procedure atomicadd (acceler_formd.f: 380)
0 inform, 0 warnings, 1 severes, 0 fatal for formd_cuda

so, I changed these variables to integer and it compiled fine.

I’m a bit confused as clearly pgi fortran does not support 64-bit atomic functions.Do you mean that I’d need to create a wrapper for the function and cal it directly from the C for CUDA library?

Cheers,
Crip_crop

Hi Crip_crop,

Are you using a constant literal for the val? The default kinds for constant literals is 4, but val is expecting a kind of 8. Hence, you need to an “_8” to the end of the literal in order set the kind as 8.

Note that it did some checking and while we’ve had 64-bit atomics in since we added CUDA 4.0, it looks there was a issue with the 64-bit atomicadd where you’ll get a undefined identifier error (TPR#18767). This issue will be resolved in the next release.

  • Mat

When is the next release?

Crip_crop

Hopefully soon. It’s already be delayed by several weeks due to unforeseen problems. I was told this morning, barring more issues in our final QA testing, it should be out later this week or early next.

  • Mat

Can I just check with you that this code will work in the new release… I just don’t want to delay my work for a week if it ends up not having the right functionality. Here I go…

!   Module scope

double precision,allocatable,device,dimension(:)globdens_d
double precision,allocatable,device,dimension(:,:,:)::subdens



!   Host code

              allocate
     &       (subdens(subsystems,maxbasfun,maxbasfun),
     &        globdens_d(lowt)



!    Device code
integer::istat

      tx=threadIdx%x
      bx=blockIdx%x
      i=((bx-1)*blocksize)+tx

            x=0
            y=0

            do j=1,catoms_d(i)
               katm=coresubatoms(i,j)
               do k=1,catoms_d(i)
                  latm=coresubatoms(i,k)

                  llk=lowlim(katm)
                  lll=lowlim(latm)
                  
                  norbk=natorb(ian(katm))
                  norbl=natorb(ian(latm))


                  do iii=1,norbk
                     ik=llk+iii-1
                     ix=x+iii
                     do jjj=1,norbl
                        jk=lll+jjj-1
                        jy=y+jjj
                        
                        if(ik.ge.jk)then
                           ij=(ik*(ik-1)/2)+jk
                           istat=atomicadd(globdens_d(ij),
     &                          subdens(i,ix,jy))
c$$$                           globdens_d(ij)=globdens_d(ij)+subdens(i,ix,jy)
                           
                        end if
                     end do
                     jy=y
                  end do
                  
                  y=y+norbl
               end do
               x=x+norbk
               
               y=0
            end do

This code compiles fine when globdens_d and subdens are declared as integer, so presumably in the new release they will work when declared as above?

Cheers,
Crip_crop

Hi Crip_crop,

Sorry but it looks like we have some miscommunication here. Since you were originally working the “int2 intrinsic”, I assumed you were asking about integer not floating point atomics. This is my fault for which I apologise.

While NVIDIA did add a few 64-bit integer atomic routines, no such atomic routines exist for 64-bit floating point nor do I believe NVIDIA has plans to add them. Hence, that puts us back to your original question, for which I don’t have an answer.

  • Mat

Double precision (real*8) atomic adds were implemented in CUDA Fortran starting in release 12.8.

Hi Brentl,

I’ve only just seen your post. I can’t seem to find any documentation about it. I’m a bit skeptical because as far as I’m aware C for CUDA haven’t yet included a double precision atomicadd function and CUDA Fortran usually has the same functionality.

Could you point me in the right direction please? That would be much appreciated.

Cheers,
Crip_crop

It’s just another overloaded atomic add function

module atomictests
contains
attributes(global) subroutine testatomicdadd( a )
real8, device :: a
real
8 r
r = dble(threadIdx%x)
istat = atomicadd(a, r)
return
end subroutine testatomicdadd
end module atomictests

Wowzers this is exciting news for me. You have no idea how much I desperately need this functionality…now I just need to get my hands on v12.8.

Cheers Brentl, I shall report back once I’ve tested it.

Crip_crop

Hi Brentl,

I can’t seem to get it working with real*8… it works for integers fine though.

I get the following compiler error:

Could not resolve generic procedure atomicadd

And that’s compiling with version 12.8.

Have you actually tried this and got it to work?

Cheers,
Crip_crop

Also could any of you PGI guys please verify what Brentl is saying please?

Cheers,
Crip_crop

Are you sure you are using 12.8?

brentl> pgf90 -V12.8 atomd1.cuf
brentl> ./a.out
528.0000000000000 528.0000000000000
TEST PASSED

brentl> pgf90 -V12.6 atomd1.cuf
PGF90-S-0155-Could not resolve generic procedure atomicadd (atomd1.cuf: 7)
0 inform, 0 warnings, 1 severes, 0 fatal for testatomicdadd

brentl> cat atomd1.cuf
module atomictests
contains
attributes(global) subroutine testatomicdadd( a )
real8, device :: a
real
8 r
r = dble(threadIdx%x)
istat = atomicadd(a, r)
return
end subroutine testatomicdadd
end module atomictests

program t
use cudafor
use atomictests
real8, allocatable, device :: r
real
8 x

allocate®
r = 0
n = 32
call testatomicdadd<<<1,n>>> ®
x = r
print ,x,dble(n(n+1)/2)
if (x .eq. dble(n*(n+1)/2)) then
print *,“TEST PASSED”
else
print *,“TEST FAILED”
endif
end

Hazzaaaaar! It works… all I can think of is that I wasn’t correctly linking to the 12.8 version.

Thank you so much brentl, you really have made my week!

Cheers,
Crip_crop

atomicadd is working just fine with double precision variables but I can’t seem to get atomicsub to follow suit.

Do you know if this has been extended to handle double precision or not?

I also tried a work around, so instead of:

 istat=atomicsub(htot(ij,1),temp)

I attempted to make temp negative and use atomicadd

istat=atomicadd(htot(ij,1),-temp)

but this didn’t work and gave an incorrect result.

Any ideas for an alternative work around if atomicsub has in fact not been extended to support double precision would be much appreciated.

Cheers,
Crip_crop