malloc and other routines inside of a kernels directive?

I have tried the following test program to determine whether it was safe to use things like malloc(3) inside of a kernels directive.

#include <stdio.h>
#include <stdlib.h>

#define SIZE 100000000

int main(int argc, char *argv[])
{
	double result = 0;
	
	int *restrict iota = malloc(SIZE*sizeof(int));
	
#pragma acc kernels
{
	for (int i = 0; i < SIZE; i++) {
		iota[i] = i;
	}
	
	int *restrict mask = malloc(SIZE*sizeof(int));
	
	for (int i = 0; i < SIZE; i++) {
		if ((0==iota[i]%3)||(0==iota[i]%5)) {
			mask[i] = 1;
		} else {
			mask[i] = 0;
		}
	}
	
	for (int i = 0; i < SIZE; i++) {
		if (mask[i]) result += iota[i];
	}
}

	printf("%lf\n", result);
	
	return 0;
}

It compiles fine, but when I execute the file, I get the following error:

call to cuMemcpyDtoHAsync returned error 715: Illegal instruction

Are things like this not possible? Are they discouraged? I have code that I want to adapt to take advantage of kernels, but it does various bookkeeping between loops, such as calls to malloc, and I’d like the compiler to be smart about how it handles those things.

Are things like this not possible?

It’s possible to call malloc from device code.

Are they discouraged?

I generally discourage using malloc in device code since it has a high execution cost, and the heap space on the device is limited to 8MB by default (unless you call cudaDeviceSetLimit to raise it). See:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#heap-memory-allocation
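
If you really do need a larger device heap, raising the limit looks roughly like the sketch below. This is only an illustration: it assumes the CUDA runtime headers and libraries are available to your OpenACC build (with PGI, something like linking with -Mcuda), and the 64MB figure is just an example value. The limit must be set before the first kernel that uses device-side malloc is launched.

#include <stdio.h>
#include <cuda_runtime_api.h>

int main(void)
{
   /* Raise the device heap from the 8MB default to 64MB (illustrative value).
      This must happen before the first kernel that calls device-side malloc. */
   if (cudaDeviceSetLimit(cudaLimitMallocHeapSize, (size_t)64*1024*1024) != cudaSuccess) {
      printf("failed to set device heap size\n");
   }

   /* ... OpenACC compute regions that call malloc on the device go here ... */

   return 0;
}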

For your code, there are a couple of issues.

First, you’re using “kernels”. With kernels, the compiler will attempt to create a separate compute kernel from each of the for loops. Sequential code, such as your call to malloc, gets launched as its own sequential kernel. Since the malloc’d memory is not global, it can’t be used across multiple kernel invocations. Hence the memory allocated for “mask” can’t be used in the second and third “for” loops.

To fix this, use the “parallel” construct instead. This will create a single kernel launch. The caveat is that you can only launch a single gang, which limits the amount of parallelism, and hence the performance, you can achieve.

The second issue is that you’re allocating more than 8MB, so you’re going beyond the device heap size. You’ll need to reduce “SIZE”.

Note that in this case it really doesn’t make sense to malloc “mask” on the device; you should make it global instead. Limit your use of device malloc to small arrays that are private to a gang or vector.
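
For example, here’s a sketch of that “global” (host-allocated) approach, keeping your original kernels construct. The data clauses are just one reasonable way to write it, with “create” used for mask since its contents never need to exist on the host; because mask lives in a data region spanning the whole kernels construct, its values persist across the kernels generated for each loop.

#include <stdio.h>
#include <stdlib.h>

#define SIZE 100000000

int main(int argc, char *argv[])
{
   double result = 0;

   int *restrict iota = malloc(SIZE*sizeof(int));
   int *restrict mask = malloc(SIZE*sizeof(int));   /* allocated on the host */

/* "create" puts mask on the device without copying it in either direction */
#pragma acc kernels copyout(iota[0:SIZE]) create(mask[0:SIZE])
{
   for (int i = 0; i < SIZE; i++) {
      iota[i] = i;
   }

   for (int i = 0; i < SIZE; i++) {
      mask[i] = ((0==iota[i]%3)||(0==iota[i]%5)) ? 1 : 0;
   }

   for (int i = 0; i < SIZE; i++) {
      if (mask[i]) result += iota[i];
   }
}

   free(mask);
   free(iota);
   printf("%lf\n", result);

   return 0;
}

Alternatively, if you want the single-kernel behavior, here’s your code using “parallel”, a smaller SIZE, and device-side malloc: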

% cat test.c
#include <stdio.h>
#include <stdlib.h>

//#define SIZE 100000000
#define SIZE   1000000

int main(int argc, char *argv[])
{
   double result = 0;

   int *restrict iota = malloc(SIZE*sizeof(int));

#pragma acc parallel num_gangs(1)
{
  #pragma acc loop vector
   for (long i = 0; i < SIZE; i++) {
      iota[i] = i;
   }

  int * restrict mask = malloc(SIZE*sizeof(int));

  #pragma acc loop vector
   for (long i = 0; i < SIZE; i++) {
      if ((0==iota[i]%3)||(0==iota[i]%5)) {
         mask[i] = 1;
      } else {
         mask[i] = 0;
      }
   }

  #pragma acc loop vector
   for (long i = 0; i < SIZE; i++) {
      if (mask[i]) result += iota[i];
   }
   free(mask);
}

   free(iota);
   printf("%lf\n", result);

   return 0;
}
% pgcc -acc -Minfo=accel test.c ; a.out
main:
     13, Accelerator kernel generated
         Generating Tesla code
         16, #pragma acc loop vector(128) /* threadIdx.x */
         23, #pragma acc loop vector(128) /* threadIdx.x */
         32, #pragma acc loop vector(128) /* threadIdx.x */
         33, Sum reduction generated for result
     13, Generating copyout(iota[:1000000])
     16, Loop is parallelizable
     23, Loop is parallelizable
     32, Loop is parallelizable
233333166668.000000

Hope this helps,
Mat