How to Use Shared Memory in CUDA Fortran for Efficient Array Assignment

Hi all,

I’m currently learning CUDA Fortran and have a specific challenge related to optimizing a kernel function with shared memory. The kernel I’m working with involves copying data from a (31, 64, 5) array to a (64, 64, 5) array.

Here is the kernel code that I’m working with:

attributes(global) SUBROUTINE copyx(ABDENS, ABXMOM, ABYMOM, ABENER, &
    AADENS, AAXMOM, AAYMOM, AAENER)
    REAL*8:: ABDENS(31, 64, 5), ABXMOM(31, 64, 5), ABYMOM(31, 64, 5), ABENER(31, 64, 5)
    REAL*8:: AADENS(64, 64, 5), AAXMOM(64, 64, 5), AAYMOM(64, 64, 5), AAENER(64, 64, 5)
    INTEGER:: IX, IY, IZ, XI

    IX = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    IY = (blockIdx%y - 1) * blockDim%y + threadIdx%y

    IF (IY >= 3 .AND. IY <= 62) THEN
        IF (IX >= 2 .AND. IX <= 62) THEN
            XI = IX / 2
            DO IZ = 1, 5
                AADENS(IX, IY, IZ) = ABDENS(XI, IY, IZ)
                AAXMOM(IX, IY, IZ) = ABXMOM(XI, IY, IZ)
                AAYMOM(IX, IY, IZ) = ABYMOM(XI, IY, IZ)
                AAENER(IX, IY, IZ) = ABENER(XI, IY, IZ)
            END DO
        END IF
    END IF
    ! syncthreads must be reached by every thread in the block,
    ! so it cannot sit inside the divergent IF blocks
    CALL syncthreads()
END SUBROUTINE copyx

The problem is, I’m trying to optimize this kernel by using shared memory to avoid redundant global memory accesses. I’ve learned how to use shared memory, but I’m not sure how to implement it in this kernel. Specifically, I want to load blocks of data into shared memory and then perform the assignment to the output arrays from there.

Could anyone help me with how to modify this kernel to use shared memory effectively? Any guidance on using shared memory for the input and output arrays would be much appreciated!

Thanks in advance!



I don’t think shared memory is going to help here. Shared memory is essentially a software-managed cache, used mainly to coalesce non-contiguous accesses to data that is reused multiple times within a kernel. Here, your arrays are accessed along the stride-1 dimension, so the accesses are already contiguous, and there is little data reuse.
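For completeness, here is what the mechanics would look like if you did stage the copy through shared memory — a minimal sketch for just one of the four array pairs, assuming a 16x16 thread block. It mostly illustrates why the tile buys you nothing in this pattern: each thread reads back only the element it wrote itself.

attributes(global) SUBROUTINE copyx_shared(ABDENS, AADENS)
    REAL*8 :: ABDENS(31, 64, 5), AADENS(64, 64, 5)
    REAL*8, shared :: tile(16, 16)   ! sized to the assumed 16x16 block
    INTEGER :: IX, IY, IZ
    LOGICAL :: active

    IX = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    IY = (blockIdx%y - 1) * blockDim%y + threadIdx%y
    active = (IX >= 2 .AND. IX <= 62 .AND. IY >= 3 .AND. IY <= 62)

    DO IZ = 1, 5
        ! Stage this thread's input element into the shared tile
        IF (active) tile(threadIdx%x, threadIdx%y) = ABDENS(IX / 2, IY, IZ)
        ! Barriers sit outside the IF so every thread reaches them
        CALL syncthreads()
        IF (active) AADENS(IX, IY, IZ) = tile(threadIdx%x, threadIdx%y)
        CALL syncthreads()
    END DO
END SUBROUTINE copyx_shared

Since no thread ever consumes a tile element written by a different thread, the round trip through shared memory only adds two barriers and some latency.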

Hardware caching has been quite good for many generations of devices, lessening the need for shared memory, and here it should be able to cache the AB elements that are read by two different threads (since XI = IX / 2, each input element is shared by a pair of adjacent IX values).
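As an aside, the indexing in your kernel implies a 2D launch covering the 64x64 output plane, something along these lines (the `_d` device array names are made up for illustration; `dim3` comes from the cudafor module):

! Hypothetical host-side launch with 16x16 blocks over the 64x64 plane
use cudafor
type(dim3) :: grid, tBlock
tBlock = dim3(16, 16, 1)
grid   = dim3((64 + tBlock%x - 1) / tBlock%x, &
              (64 + tBlock%y - 1) / tBlock%y, 1)
call copyx<<<grid, tBlock>>>(ABDENS_d, ABXMOM_d, ABYMOM_d, ABENER_d, &
                             AADENS_d, AAXMOM_d, AAYMOM_d, AAENER_d)

With that configuration every (IX, IY) pair in the guarded range is covered by exactly one thread, which is why the plain global-to-global copy is already about as fast as this operation can go.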