Data corruption after device kernel call

Hi,

I wrote a test code in which I declare an array FX in shared memory and fill it with values inside a global kernel. I also store it in a global variable TEST. The code then calls a device kernel that declares FX with INTENT(IN), specifically so that FX cannot be modified. Yet on the second pass through the loop, the data gets corrupted.
I expect to get something like:
The first iteration
1.000000000000000 2.000000000000000 -1.000000000000000
2.000000000000000 1.000000000000000 2.000000000000000
-1.000000000000000 2.000000000000000 1.000000000000000
The second iteration
1.000000000000000 2.000000000000000 -1.000000000000000
2.000000000000000 1.000000000000000 2.000000000000000
-1.000000000000000 2.000000000000000 1.000000000000000
Instead, it gives:
The first iteration
1.000000000000000 2.000000000000000 -1.000000000000000
2.000000000000000 1.000000000000000 2.000000000000000
-1.000000000000000 2.000000000000000 1.000000000000000
The second iteration
-3.000000000000000 14.00000000000000 -50.00000000000000
-4.000000000000000 2.000000000000000 100.0000000000000
5.000000000000000 10.00000000000000 50.00000000000000

I could not figure out why it is corrupted. Can you help? I have attached the code below.

Thank you,
Yunus

Makefile (799 Bytes)
m_gpu.f90 (2.3 KB)
main.f90 (1.0 KB)

Hi Yunus,

What I think is going on is that “ADJF” is pointing to the same memory as “FX”.

With dynamic shared memory, you’re basically creating one memory block, with each shared array being an offset into this block. Hence when you use “ADJF” by itself in the device routine, its offset is the same as that of “FX”.

Adding the other shared arrays to D_INV seems to work around the issue:

        ATTRIBUTES(DEVICE) SUBROUTINE D_INV(SD)

                IMPLICIT NONE

                INTEGER, INTENT(IN) :: SD

                DOUBLE PRECISION, SHARED :: INVFXT(SD,SD)
                DOUBLE PRECISION, SHARED :: INVFX(SD,SD)
                DOUBLE PRECISION, SHARED :: FX(SD,SD)
                DOUBLE PRECISION, SHARED :: ADJF(SD,SD)

                DOUBLE PRECISION, shared :: DETF

Also, shouldn’t “DETF” be shared as well? It’s only getting set by one thread but used by all of them. If not, then you should move it out of the if block so all threads set it.

-Mat

Once I did as you said, it solved the problem I mentioned. Yet since some of the variables are not in the argument list of “D_INV”, I believe they are not carried from the “global kernel scope” to the “device kernel scope”, or vice versa. To show this, I read “DETF” in the global kernel scope after a value is assigned to it in the device kernel, and it gave 0 even though DETF is not zero.

main.f90 (1.1 KB)
Makefile (799 Bytes)
m_gpu.f90 (2.4 KB)

I tried the following:

	ATTRIBUTES(DEVICE) SUBROUTINE D_INV(SD,FX,INVFX,DETF)
	
	IMPLICIT NONE
	
	INTEGER, INTENT(IN) :: SD
	DOUBLE PRECISION, SHARED, INTENT(IN) :: FX(SD,SD)
	DOUBLE PRECISION, SHARED, INTENT(OUT) :: INVFX(SD,SD)
	DOUBLE PRECISION, SHARED, INTENT(OUT) :: DETF
	
	DOUBLE PRECISION, SHARED :: INVFXT(SD,SD)
	
	DOUBLE PRECISION, SHARED :: ADJF(SD,SD)

Yet it did not work either, because the compiler ignores the SHARED attributes:

NVFORTRAN-W-0526-SHARED attribute ignored on dummy argument fx (m_gpu.f90: 69)
NVFORTRAN-W-0526-SHARED attribute ignored on dummy argument invfx (m_gpu.f90: 70)
NVFORTRAN-W-0526-SHARED attribute ignored on dummy argument detf (m_gpu.f90: 71)

m_gpu.f90 (2.5 KB)
This is the “m_gpu.f90” that gives the warnings above.

How can I carry data between these kernels with the shared attribute?

-Yunus

For the automatics, these point into the dynamic shared memory block. So while the variables themselves are different between the global and device routines, the memory that they point to is the same, so they effectively “carry” the results across the calls. You just need to keep the same declaration order so they point to the same place in the dynamic shared memory in both routines.
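To illustrate the ordering point, here is a minimal sketch. It is not taken from the attached files, and the names m_order, k_fill and d_scale are made up for the example. It assumes the compiler lays out automatic shared arrays in declaration order, as described above, so declaring the same arrays in the same order in both routines should make them refer to the same dynamic shared memory.

```fortran
module m_order
  use cudafor
  implicit none
contains

  ! Hypothetical device routine: declares the SAME shared automatics,
  ! in the SAME order, as the kernel that calls it.
  attributes(device) subroutine d_scale(sd)
    integer, intent(in) :: sd
    double precision, shared :: fx(sd,sd)    ! first array in the block
    double precision, shared :: adjf(sd,sd)  ! second array in the block
    integer :: i, j
    i = threadidx%x
    j = threadidx%y
    ! Each thread reads one element of fx and writes one element of adjf.
    adjf(i,j) = 2.0d0*fx(i,j)
    call syncthreads()
  end subroutine d_scale

  ! Hypothetical kernel: fills fx, calls the device routine, copies adjf out.
  attributes(global) subroutine k_fill(sd, res)
    integer, value :: sd
    double precision :: res(sd,sd)
    double precision, shared :: fx(sd,sd)    ! same order as in d_scale
    double precision, shared :: adjf(sd,sd)
    integer :: i, j
    i = threadidx%x
    j = threadidx%y
    fx(i,j) = dble(i + sd*(j-1))
    call syncthreads()
    call d_scale(sd)
    res(i,j) = adjf(i,j)
  end subroutine k_fill

end module m_order

program main
  use cudafor
  use m_order
  implicit none
  integer, parameter :: sd = 3
  double precision :: res_h(sd,sd)
  double precision, device :: res_d(sd,sd)
  integer :: istat

  ! Third chevron argument: bytes of dynamic shared memory per block,
  ! here two sd x sd double precision arrays.
  call k_fill<<<1, dim3(sd,sd,1), 2*sd*sd*8>>>(sd, res_d)
  istat = cudaDeviceSynchronize()
  res_h = res_d
  print *, res_h
end program main
```

If the offsets line up as assumed, res_h should hold twice the values written to fx. Dropping fx from d_scale, or swapping the two declarations there, would be expected to reproduce the kind of overlap discussed earlier in this thread.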

The problem here is with “DETF”. This is a scalar, so it doesn’t point into the dynamic shared memory. Instead you have two different shared DETFs, one declared in each routine. In the first version, you only had DETF declared in the device routine. I only suggested making it shared because its assignment is guarded by the if block, which only one thread executes. Hence when local, it would be uninitialized for the other threads.

For this code, the minimal change is to keep DETF shared in the global routine, but pass it as an argument to D_INV, without “SHARED” on its declaration.

% make
nvfortran -O3 -cuda -Minfo=all -Mpreprocess -acc -g -c -mp m_gpu.f90
nvfortran -O3 -cuda -acc m_gpu.o main.o -o main -L/cuda/lib64 -lnvToolsExt
% ./main
 Error code:
 no error
   -16.00000000000000        -32.00000000000000         16.00000000000000
   -32.00000000000000        -16.00000000000000        -32.00000000000000
    16.00000000000000        -32.00000000000000        -16.00000000000000
   -16.00000000000000        -32.00000000000000         16.00000000000000
   -32.00000000000000        -16.00000000000000        -32.00000000000000
    16.00000000000000        -32.00000000000000        -16.00000000000000

m_gpu.f90 (2.4 KB)
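The DETF change described above can be sketched as follows. This is a stand-alone illustration rather than the attached m_gpu.f90: the names m_detf, k_det and d_det are hypothetical, the determinant formula is hard-coded for a 3x3 matrix, and the array is passed as a plain argument as well, to keep the example independent of the shared-memory ordering question.

```fortran
module m_detf
  use cudafor
  implicit none
contains

  ! DETF is an ordinary dummy argument here (no SHARED), so it refers to
  ! the caller's shared scalar rather than to a second, separate one.
  attributes(device) subroutine d_det(sd, fx, detf)
    integer, intent(in) :: sd
    double precision, intent(in) :: fx(sd,sd)
    double precision, intent(inout) :: detf
    ! Only one thread computes the value (sketch assumes sd == 3).
    if (threadidx%x == 1 .and. threadidx%y == 1) then
      detf = fx(1,1)*(fx(2,2)*fx(3,3) - fx(2,3)*fx(3,2)) &
           - fx(1,2)*(fx(2,1)*fx(3,3) - fx(2,3)*fx(3,1)) &
           + fx(1,3)*(fx(2,1)*fx(3,2) - fx(2,2)*fx(3,1))
    end if
    ! After the barrier, every thread in the block can read detf.
    call syncthreads()
  end subroutine d_det

  attributes(global) subroutine k_det(sd, a, det_out)
    integer, value :: sd
    double precision, intent(in) :: a(sd,sd)
    double precision :: det_out
    double precision, shared :: fx(sd,sd)  ! automatic: dynamic shared memory
    double precision, shared :: detf       ! shared scalar, declared only here
    integer :: i, j
    i = threadidx%x
    j = threadidx%y
    fx(i,j) = a(i,j)
    call syncthreads()
    call d_det(sd, fx, detf)
    if (i == 1 .and. j == 1) det_out = detf
  end subroutine k_det

end module m_detf

program main
  use cudafor
  use m_detf
  implicit none
  integer, parameter :: sd = 3
  double precision :: a(sd,sd), det_h
  double precision, device :: a_d(sd,sd), det_d
  integer :: istat

  ! Same matrix as in the first post (symmetric, so storage order is moot).
  a = reshape([ 1d0, 2d0, -1d0,  2d0, 1d0, 2d0,  -1d0, 2d0, 1d0 ], [sd, sd])
  a_d = a
  ! One sd x sd double precision array of dynamic shared memory.
  call k_det<<<1, dim3(sd,sd,1), sd*sd*8>>>(sd, a_d, det_d)
  istat = cudaDeviceSynchronize()
  det_h = det_d
  ! Hand expansion of this matrix gives a determinant of -16.
  print *, det_h
end program main
```

Because only the global routine declares DETF as shared, there is a single copy per thread block, and the device routine writes to that copy through its argument.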
