Is it possible to launch a kernel from inside a kernel and modify an array in the child kernel?

My pseudocode is shown below. When I run it, I get this error message:
Error : an illegal memory access was encountered.

There are 2 key points of this pseudocode:

  1. malloc a tmp array in kernel1, then pass the pointer to kernel2
  2. modify the tmp array in kernel2

Is it possible? How do I fix this error?

__global__
void kernel2(int* array, int array_size){
    int idx = threadIdx.x;
    if (idx < array_size){
        int maxv = 0;
        for(int i=0;i<array_size;i++){
            if (array[i]>maxv){
                maxv = array[i];
            }
        }
        array[idx] = maxv;
    }
}

__global__
void kernel1(int* array, int array_size){
    int idx = threadIdx.x;
    if (idx < array_size){
        int* array_tmp = (int*)malloc(array_size*sizeof(int));
        memcpy(array_tmp, array, array_size*sizeof(int));
        kernel2<<<blockNum, threadNum>>>(array_tmp, array_size);
        ...
    }
}

void main(){
    int array_size = 10;
    cudaError_t err;
    int *array; err = cudaMallocManaged(&array, array_size*sizeof(int));
    kernel1<<<blockNum, threadNum>>>(array, array_size);
    ...
}

Yes, it’s possible. However, there are a few things to be aware of:

  1. A child kernel cannot make use of variables that point to the local space of the parent kernel.
  2. In-kernel malloc allocates from a separate device heap. This device heap starts out with a relatively small limit, but the limit is adjustable. Make sure the total of all allocations made across the whole kernel (every thread that calls malloc) does not exceed the device heap. A best practice here is to check for a NULL pointer returned by in-kernel malloc, as this is how the API signals an allocation failure.
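To make those two points concrete, here is a minimal sketch. The kernel names, the helper set_device_heap_bytes, and the 64 MB figure are my own assumptions for illustration, not anything from your code. It passes a device-heap pointer (not a local array) to the child, checks the in-kernel malloc result for NULL, and raises the heap limit from the host with cudaDeviceSetLimit before launching anything. Compile it the same way as any CDP code (-rdc=true -lcudadevrt).

```cuda
#include <cstdio>
#include <cstdlib>

// Child kernel: doubles every element of a buffer that lives in global memory.
__global__ void child(int *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2;
}

__global__ void parent(int n) {
    // int local_buf[8];  // local memory of this thread: must NOT be passed to a child kernel
    int *heap_buf = (int *)malloc(n * sizeof(int));  // device heap: OK to pass to a child
    if (heap_buf == NULL) {  // in-kernel malloc signals failure by returning NULL
        printf("device malloc of %d ints failed\n", n);
        return;
    }
    for (int i = 0; i < n; i++) heap_buf[i] = i;
    child<<<1, n>>>(heap_buf, n);
    // heap_buf is deliberately not freed here: the child may still be using it.
}

// Hypothetical helper: request a device heap size and return the limit actually in effect.
size_t set_device_heap_bytes(size_t bytes) {
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, bytes);
    size_t actual = 0;
    cudaDeviceGetLimit(&actual, cudaLimitMallocHeapSize);
    return actual;
}

int main() {
    // Set the limit before launching any kernel that uses in-kernel malloc.
    size_t heap = set_device_heap_bytes((size_t)64 * 1024 * 1024);  // 64 MB is an assumed value
    printf("device heap limit: %zu bytes\n", heap);
    parent<<<1, 1>>>(8);
    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```

The sketch never frees heap_buf, because the parent thread cannot safely free memory the child may still be reading. A real code needs some strategy for releasing it once the child has finished.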

From what I can tell, your pseudocode doesn’t obviously violate either of those. You might also be exceeding another CUDA Dynamic Parallelism (CDP) limit, such as the pending launch limit. That is impossible to determine because you haven’t shown your actual grid sizes (blockNum and threadNum). If those are large enough, you could also be exceeding the device heap limit.
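As a rough way to judge whether grid size matters, the sketch below compares what a launch like yours would ask for against the limits the runtime reports. The grid dimensions are made-up values, and heap_demand_bytes and child_launches are hypothetical helpers of mine. They assume the structure of your pseudocode: every parent thread with threadIdx.x < array_size does one malloc of array_size ints and one child launch. The heap figure is only a lower bound, since the allocator adds its own overhead.

```cuda
#include <cstdio>
#include <cstddef>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

// Hypothetical helper: lower bound on device-heap bytes requested kernel-wide.
// Only threads with threadIdx.x < array_size allocate, as in the pseudocode.
size_t heap_demand_bytes(size_t blocks, size_t threads, size_t array_size) {
    size_t allocating_threads = blocks * min_sz(threads, array_size);
    return allocating_threads * array_size * sizeof(int);
}

// Hypothetical helper: number of child launches issued kernel-wide.
size_t child_launches(size_t blocks, size_t threads, size_t array_size) {
    return blocks * min_sz(threads, array_size);
}

int main() {
    size_t blocks = 64, threads = 256, array_size = 1000;  // assumed grid, not from the question
    size_t heap_limit = 0, pending_limit = 0;
    cudaError_t e1 = cudaDeviceGetLimit(&heap_limit, cudaLimitMallocHeapSize);
    cudaError_t e2 = cudaDeviceGetLimit(&pending_limit, cudaLimitDevRuntimePendingLaunchCount);
    if (e1 != cudaSuccess || e2 != cudaSuccess) {
        printf("could not query device limits\n");
        return 1;
    }
    size_t demand = heap_demand_bytes(blocks, threads, array_size);
    size_t launches = child_launches(blocks, threads, array_size);
    printf("heap:     need at least %zu bytes, limit %zu bytes%s\n",
           demand, heap_limit, demand > heap_limit ? "  <-- too small" : "");
    printf("launches: %zu child launches, pending launch limit %zu%s\n",
           launches, pending_limit, launches > pending_limit ? "  <-- may be exceeded" : "");
    return 0;
}
```

Either limit can be raised with cudaDeviceSetLimit using the same enum values, at the cost of reserving more device memory.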

I’m guessing you are on Windows because of the void main(). If that’s the case, another thing to be aware of is a possible kernel time-out.
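If you want to know whether a run-time limit applies to kernels on your GPU, you can ask the runtime. This is a small sketch, with watchdog_enabled being a hypothetical helper name of mine. It queries the cudaDevAttrKernelExecTimeout device attribute, which is nonzero when kernels on that device are subject to a time limit.

```cuda
#include <cstdio>

// Hypothetical helper: returns 1 if kernels on `device` are subject to a
// run-time limit, 0 if not, and -1 if the query itself fails.
int watchdog_enabled(int device) {
    int value = 0;
    cudaError_t err = cudaDeviceGetAttribute(&value, cudaDevAttrKernelExecTimeout, device);
    if (err != cudaSuccess) return -1;
    return value != 0 ? 1 : 0;
}

int main() {
    int state = watchdog_enabled(0);  // device 0 is an assumption
    if (state < 0)
        printf("could not query device 0\n");
    else
        printf("kernel run-time limit on device 0: %s\n", state ? "enabled" : "disabled");
    return 0;
}
```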

Finally, any time you are having trouble with a CUDA code, I always recommend proper CUDA error checking. When using CDP, you should apply similar error checking to device API usage and device (child) kernel launches.
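For reference, here is one way that error checking might look. It is a sketch, not the only way to do it: the CUDA_CHECK macro name and the kernels are my own, and the device-side check simply calls cudaGetLastError() from the parent right after the child launch. Compile with -rdc=true -lcudadevrt.

```cuda
#include <cstdio>
#include <cstdlib>

// Host-side check: print the error with file/line and exit on any failure.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e_));                          \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Child kernel: one thread per element, increments it.
__global__ void child(int *data) {
    data[threadIdx.x] += 1;
}

__global__ void parent(int *data, int n) {
    child<<<1, n>>>(data);
    // Device-side check of the child launch, via the device runtime API.
    cudaError_t e = cudaGetLastError();
    if (e != cudaSuccess) printf("child launch failed: error %d\n", (int)e);
}

int main() {
    const int n = 4;
    int *data = NULL;
    CUDA_CHECK(cudaMallocManaged(&data, n * sizeof(int)));
    for (int i = 0; i < n; i++) data[i] = i;

    parent<<<1, 1>>>(data, n);
    CUDA_CHECK(cudaGetLastError());       // catches launch configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised while the kernels ran

    for (int i = 0; i < n; i++) printf("data[%d] = %d\n", i, data[i]);
    CUDA_CHECK(cudaFree(data));
    return 0;
}
```

Checking both right after the launch and after cudaDeviceSynchronize() matters, because an illegal memory access inside a kernel is only reported by a later synchronizing call.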

Here’s a complete example based on what you have shown:

$ cat t2108.cu
#include <cstdio>

__global__
void kernel2(int* array, int array_size){
    int idx = threadIdx.x;
    if (idx < array_size){
        int maxv = 0;
        for(int i=0;i<array_size;i++){
            if (array[i]>maxv){
                maxv = array[i];
            }
        }
        array[idx] = maxv;
        printf("maxv: %d\n", maxv);
    }
}

__global__
void kernel1(int* array, int array_size){
    int idx = threadIdx.x;
    if (idx < array_size){
        int* array_tmp = (int*)malloc(array_size*sizeof(int));
        if (array_tmp != NULL) {
          memcpy(array_tmp, array, array_size*sizeof(int));
          kernel2<<<1,1>>>(array_tmp, array_size);}
        else
          printf("oops!\n");
    }
}

int main(){
    int array_size = 10;
    cudaError_t err;
    int *array; err = cudaMallocManaged(&array, array_size*sizeof(int));
    for (int i = 0; i < array_size; i++) array[i] = i;
    kernel1<<<1,1>>>(array, array_size);
    cudaDeviceSynchronize();
}
$ nvcc -o t2108 t2108.cu -rdc=true -lcudadevrt
t2108.cu(33): warning: variable "err" was set but never used

$ compute-sanitizer ./t2108
========= COMPUTE-SANITIZER
maxv: 9
========= ERROR SUMMARY: 0 errors
$

Therefore I conclude the problem lies in something you haven’t shown. My usual suggestion in these cases if you still need help is to provide a short, complete example (just as I have) that demonstrates the problem, rather than “pseudocode”.