Allocatable arrays for each thread inside global routines?

Is there any way to allocate arrays for each thread inside global routines? The dimensions of the allocatable arrays are data-dependent.

Any suggestions are highly appreciated. Thanks!


Yes, you can allocate data within a global kernel. However, dynamically allocating data on the device is not recommended. Such allocations are quite slow, and the default device heap is quite small (~8MB). Hence you should only use them for very small arrays (or a few threads), and only if you are willing to slow down your code.
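For illustration, here is a minimal sketch of per-thread allocation inside a kernel, with the device heap enlarged first through `cudaDeviceSetLimit`. The module, kernel, and variable names (`alloc_sketch`, `per_thread_work`, `lens`, `out`) and the 64MB heap size are made up for this example, and it assumes a compiler version that accepts `allocate` in device code:

```fortran
module alloc_sketch
  use cudafor
  implicit none
contains
  ! Each thread allocates a private work array whose length is data-dependent.
  attributes(global) subroutine per_thread_work(lens, out, n)
    integer, value :: n
    integer :: lens(n)            ! requested length for each thread
    real    :: out(n)             ! one result per thread
    real, allocatable :: work(:)  ! lives on the device heap
    integer :: i, k
    real :: acc

    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i > n) return

    allocate(work(lens(i)))       ! slow: goes through the device heap
    acc = 0.0
    do k = 1, lens(i)
      work(k) = real(k)
      acc = acc + work(k)
    end do
    out(i) = acc
    deallocate(work)
  end subroutine per_thread_work
end module alloc_sketch

program main
  use cudafor
  use alloc_sketch
  implicit none
  integer, parameter :: n = 64
  integer :: lens(n), istat, i
  real    :: out(n)
  integer, device :: lens_d(n)
  real, device    :: out_d(n)
  integer(kind=cuda_count_kind) :: heapBytes

  ! Raise the device heap limit (assumed 64MB here) before the first launch.
  heapBytes = int(64*1024*1024, kind=cuda_count_kind)
  istat = cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapBytes)
  if (istat /= cudaSuccess) print *, 'cudaDeviceSetLimit failed: ', istat

  lens = [(mod(i, 8) + 1, i = 1, n)]   ! made-up data-dependent sizes
  lens_d = lens
  call per_thread_work<<<1, n>>>(lens_d, out_d, n)
  out = out_d
  print *, out(1:4)
end program main
```

The heap limit has to be set before any kernel that allocates is launched, and the total requested by all live threads must fit in it.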

If the threads can share an array, one alternative is to use shared automatic arrays within the kernel where the size of the automatic arrays (in bytes) is determined by the third argument of the launch configuration. For example:

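        ! Copy the data-dependent sizes into constant memory before the launch
        n1_c = n1
        n2_c = n2
        ! Third launch argument: total bytes of shared memory per block
        call kernel<<<dimGrid,dimBlock,sharedByteSize>>>(a_d,b_d)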
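The byte count has to cover every shared automatic array in the kernel, here `s1(n1_c)` and `s2(n2_c,n2_c+1)` from the module below. A minimal host-side sketch of that computation, assuming default `real` and `integer` kinds and no padding between the two arrays (the sizes 128 and 16 are made-up values):

```fortran
program shared_size_sketch
  implicit none
  integer :: n1, n2, sharedByteSize
  real    :: rdummy   ! same kind as the shared array s1
  integer :: idummy   ! same kind as the shared array s2

  n1 = 128            ! hypothetical data-dependent sizes
  n2 = 16

  ! storage_size returns bits, so divide by 8 to get bytes per element
  sharedByteSize = n1 * (storage_size(rdummy) / 8) &
                 + n2 * (n2 + 1) * (storage_size(idummy) / 8)

  print *, 'shared bytes per block: ', sharedByteSize
end program shared_size_sketch
```

The result must stay within the per-block shared memory limit of the device, otherwise the launch fails.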

module Kernel1
        implicit none
        integer,constant :: n1_c,n2_c

contains

        attributes(global) subroutine kernel(a,b)
                implicit none
                real :: a(:)
                integer :: b(:,:)

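                ! Shared automatic arrays: bounds come from constant memory,
                ! total size from the third launch configuration argument
                real,shared :: s1(n1_c)
                integer,shared :: s2(n2_c,n2_c+1)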