malloc and other routines inside of a kernels directive?

I have tried the following test program to determine whether it was safe to use things like malloc(3) inside of a kernels directive.

#include <stdio.h>
#include <stdlib.h>

#define SIZE 100000000

int main(int argc, char *argv[])
{
	double result = 0;
	
	int *restrict iota = malloc(SIZE*sizeof(int));
	
#pragma acc kernels
{
	for (int i = 0; i < SIZE; i++) {
		iota[i] = i;
	}
	
	int *restrict mask = malloc(SIZE*sizeof(int));
	
	for (int i = 0; i < SIZE; i++) {
		if ((0==iota[i]%3)||(0==iota[i]%5)) {
			mask[i] = 1;
		} else {
			mask[i] = 0;
		}
	}
	
	for (int i = 0; i < SIZE; i++) {
		if (mask[i]) result += iota[i];
	}
}

	printf("%lf\n", result);
	
	return 0;
}

It compiles fine, but when I execute the file, I get the following error:

call to cuMemcpyDtoHAsync returned error 715: Illegal instruction

Are things like this not possible? Are they discouraged? I have code that I want to adapt to take advantage of kernels, but it does various bookkeeping between loops, such as calls to malloc, and I’d like the compiler to be smart about how it handles those things.

Are things like this not possible?

It’s possible to call malloc from device code.

Are they discouraged?

I generally discourage using malloc in device code since it has a high execution cost, and the heap space on the device is limited to 8MB by default (unless you call cudaDeviceSetLimit to raise it). See:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#heap-memory-allocation
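
If you really do need a larger device heap, raising the limit looks roughly like the sketch below. This is only an illustration: it assumes the CUDA runtime headers and libraries are available to your OpenACC build (with PGI, something like linking with -Mcuda), and the 64MB figure is just an example value. The limit must be set before the first kernel that uses device-side malloc is launched.

#include <stdio.h>
#include <cuda_runtime_api.h>

int main(void)
{
   /* Raise the device heap from the 8MB default to 64MB (illustrative value).
      This must happen before the first kernel that calls device-side malloc. */
   if (cudaDeviceSetLimit(cudaLimitMallocHeapSize, (size_t)64*1024*1024) != cudaSuccess) {
      printf("failed to set device heap size\n");
   }

   /* ... OpenACC compute regions that call malloc on the device go here ... */

   return 0;
}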

For your code, there are a couple of issues.

First, you’re using “kernels”. With kernels, the compiler will attempt to create a separate compute kernel from each of the for loops. Sequential code, such as your call to malloc, gets launched as its own sequential kernel. Since the malloc’d memory is not global, it can’t be used across multiple kernel invocations. Hence the memory allocated for “mask” can’t be used in the second and third “for” loops.

To fix this, use the “parallel” construct instead. This will create a single kernel launch. The caveat is that you can only launch a single gang, which limits the amount of parallelism, and hence the performance, you can achieve.

The second issue is that you’re allocating more than 8MB, so you’re going beyond the device heap size. You’ll need to reduce “SIZE”.

Note that in this case it really doesn’t make sense to malloc “mask” on the device; you should make it global instead. Limit your use of device malloc to small arrays that are private to a gang or vector.
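
For example, here’s a sketch of that “global” (host-allocated) approach, keeping your original kernels construct. The data clauses are just one reasonable way to write it, with “create” used for mask since its contents never need to exist on the host; because mask lives in a data region spanning the whole kernels construct, its values persist across the kernels generated for each loop.

#include <stdio.h>
#include <stdlib.h>

#define SIZE 100000000

int main(int argc, char *argv[])
{
   double result = 0;

   int *restrict iota = malloc(SIZE*sizeof(int));
   int *restrict mask = malloc(SIZE*sizeof(int));   /* allocated on the host */

/* "create" puts mask on the device without copying it in either direction */
#pragma acc kernels copyout(iota[0:SIZE]) create(mask[0:SIZE])
{
   for (int i = 0; i < SIZE; i++) {
      iota[i] = i;
   }

   for (int i = 0; i < SIZE; i++) {
      mask[i] = ((0==iota[i]%3)||(0==iota[i]%5)) ? 1 : 0;
   }

   for (int i = 0; i < SIZE; i++) {
      if (mask[i]) result += iota[i];
   }
}

   free(mask);
   free(iota);
   printf("%lf\n", result);

   return 0;
}

Alternatively, if you want the single-kernel behavior, here’s your code using “parallel”, a smaller SIZE, and device-side malloc: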

% cat test.c
#include <stdio.h>
#include <stdlib.h>

//#define SIZE 100000000
#define SIZE   1000000

int main(int argc, char *argv[])
{
   double result = 0;

   int *restrict iota = malloc(SIZE*sizeof(int));

#pragma acc parallel num_gangs(1)
{
  #pragma acc loop vector
   for (long i = 0; i < SIZE; i++) {
      iota[i] = i;
   }

  int * restrict mask = malloc(SIZE*sizeof(int));

  #pragma acc loop vector
   for (long i = 0; i < SIZE; i++) {
      if ((0==iota[i]%3)||(0==iota[i]%5)) {
         mask[i] = 1;
      } else {
         mask[i] = 0;
      }
   }

  #pragma acc loop vector
   for (long i = 0; i < SIZE; i++) {
      if (mask[i]) result += iota[i];
   }
   free(mask);
}

   free(iota);
   printf("%lf\n", result);

   return 0;
}
% pgcc -acc -Minfo=accel test.c ; a.out
main:
     13, Accelerator kernel generated
         Generating Tesla code
         16, #pragma acc loop vector(128) /* threadIdx.x */
         23, #pragma acc loop vector(128) /* threadIdx.x */
         32, #pragma acc loop vector(128) /* threadIdx.x */
         33, Sum reduction generated for result
     13, Generating copyout(iota[:1000000])
     16, Loop is parallelizable
     23, Loop is parallelizable
     32, Loop is parallelizable
233333166668.000000

Hope this helps,
Mat