Are things like this not possible?
It’s possible to call malloc from device code.
Are they discouraged?
I generally discourage using malloc since there is a high execution cost plus the amount of heap space on the device is limited to 8MB (unless you call cudaDeviceSetLimit to raise this size). See:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#heap-memory-allocation
For your code, there are a couple issues.
First, you’re using “kernels”. With kernels, the compiler will attempt to create multiple compute kernels from each of the for loops. Sequential code, such as your call to malloc, get called from sequential kernels. Since the malloc’d memory is not global, it can’t be used across multiple kernel invocations. Hence the allocated memory from “mask” can’t be used in the second and third “for” loop.
To fix, use the “parallel” construct instead. This will create a single kernel launch. The caveat being you can only launch a single gang and thus will limit the amount of parallelism, and hence performance, you can achieve.
The second issue, is that you’re allocating more than 8MB so going beyond the device heap size. You’ll need to reduce “SIZE”.
Note that in this case, it really doesn’t make sense to malloc “mask” on the device and you should make it global. Limit your use of device malloc to small arrays private to a gang or vector.
% cat test.c
#include <stdio.h>
#include <stdlib.h>
//#define SIZE 100000000
#define SIZE 1000000
int main(int argc, char *argv[])
{
double result = 0;
int *restrict iota = malloc(SIZE*sizeof(int));
#pragma acc parallel num_gangs(1)
{
#pragma acc loop vector
for (long i = 0; i < SIZE; i++) {
iota[i] = i;
}
int * restrict mask = malloc(SIZE*sizeof(int));
#pragma acc loop vector
for (long i = 0; i < SIZE; i++) {
if ((0==iota[i]%3)||(0==iota[i]%5)) {
mask[i] = 1;
} else {
mask[i] = 0;
}
}
#pragma acc loop vector
for (long i = 0; i < SIZE; i++) {
if (mask[i]) result += iota[i];
}
free(mask);
}
free(iota);
printf("%lf\n", result);
return 0;
}
% pgcc -acc -Minfo=accel test.c ; a.out
main:
13, Accelerator kernel generated
Generating Tesla code
16, #pragma acc loop vector(128) /* threadIdx.x */
23, #pragma acc loop vector(128) /* threadIdx.x */
32, #pragma acc loop vector(128) /* threadIdx.x */
33, Sum reduction generated for result
13, Generating copyout(iota[:1000000])
16, Loop is parallelizable
23, Loop is parallelizable
32, Loop is parallelizable
233333166668.000000
Hope this helps,
Mat