Linker Error collect2: ld returned 1 exit status

I have a question. When I compile my program I get these errors:

/tmp/tmpxft_00001d76_00000000-5.o: In function `calculate_parametric_sets':
tmpxft_00001d76_00000000-4.i:(.text+0x7576): undefined reference to `__vla_alloc'
tmpxft_00001d76_00000000-4.i:(.text+0x75b3): undefined reference to `__vla_alloc'
tmpxft_00001d76_00000000-4.i:(.text+0x75f0): undefined reference to `__vla_alloc'
tmpxft_00001d76_00000000-4.i:(.text+0x7a82): undefined reference to `__vla_dealloc'
tmpxft_00001d76_00000000-4.i:(.text+0x7a8e): undefined reference to `__vla_dealloc'
tmpxft_00001d76_00000000-4.i:(.text+0x7a9a): undefined reference to `__vla_dealloc'
collect2: ld returned 1 exit status

I think it has something to do with memory allocation, but I don’t know exactly what. I hope someone can help me with this problem.

Here is the code of calculate_parametric_sets:

#include <stdio.h>

__global__ void calculate_parametric_sets(
          float *alpha_x, float *alpha_y, float *alpha_z,
          const float *delta_x, const float *delta_y, const float *delta_z,
          const float *i_min, const float *j_min, const float *k_min,
          const float *i_max, const float *j_max, const float *k_max,
          const int *n_total_summed,
          int gridY,
          unsigned int blockSize_x, unsigned int blockSize_y, unsigned int offset_x, unsigned int offset_y
          ) {
  // set blockindex
  int bx = blockIdx.x;
  int by = blockIdx.y;

  // set threadindex
  int tx = threadIdx.x;
  int ty = threadIdx.y;

  // determine x, y and z pending on threadindex and blocksize
  int i = (tx+offset_x) + (bx % gridY) * blockSize_x; // offset is used to tell which subblock will be calculated
  int j = (ty+offset_y) + by * blockSize_y;
  int k = floorf(bx / gridY);

  // determine index of voxel
  int index = i + (nx-1)*(j + k *(ny-1));

  __shared__ float dx_inv_deltax;
  __shared__ float dy_inv_deltay;
  __shared__ float dz_inv_deltaz;
  __shared__ float l_delta_x;
  __shared__ float l_delta_y;
  __shared__ float l_delta_z;
  __shared__ float l_alpha_x[(int)(i_max[index]-i_min[index]+1)];
  __shared__ float l_alpha_y[(int)(j_max[index]-j_min[index]+1)];
  __shared__ float l_alpha_z[(int)(k_max[index]-k_min[index]+1)];

  l_delta_x = delta_x[index];
  l_delta_y = delta_y[index];
  l_delta_z = delta_z[index];
  dx_inv_deltax = dx / l_delta_x;
  dy_inv_deltay = dy / l_delta_y;
  dz_inv_deltaz = dz / l_delta_z;

  if(l_delta_x > 0){
    alpha_x[0] = (x_plane + (i_min[index] - 1)*dx-xsource) / l_delta_x;
    for(int i=1; i<(i_max[index]-i_min[index]); i++) {
      alpha_x[i] = alpha_x[i - 1] + dx_inv_deltax;
    }
  } else {
    alpha_x[0] = (x_plane + (i_max[index] - 1)*dx-xsource) / l_delta_x;
    for(int i=1; i<i_max[index]-i_min[index]; i++) {
      alpha_x[i] = alpha_x[i - 1] + dx_inv_deltax;
    }
  }

  if(l_delta_y > 0){
    alpha_y[0] = (y_plane + (j_min[index] - 1)*dy-ysource) / l_delta_y;
    for(int i=1; i<(j_max[index]-j_min[index]); i++) {
      alpha_y[i] = alpha_y[i - 1] + dy_inv_deltay;
    }
  } else {
    alpha_y[0] = (y_plane + (j_max[index] - 1)*dy-ysource) / l_delta_y;
    for(int i=1; i<j_max[index]-j_min[index]; i++) {
      alpha_y[i] = alpha_y[i - 1] + dy_inv_deltay;
    }
  }

  if(l_delta_z > 0){
    alpha_z[0] = (z_plane + (k_min[index] - 1)*dz-zsource) / l_delta_z;
    for(int i=1; i<(k_max[index]-k_min[index]); i++) {
      alpha_z[i] = alpha_z[i - 1] + dz_inv_deltaz;
    }
  } else {
    alpha_z[0] = (z_plane + (k_max[index] - 1)*dz-zsource) / l_delta_z;
    for(int i=1; i<k_max[index]-k_min[index]; i++) {
      alpha_z[i] = alpha_z[i - 1] + dz_inv_deltaz;
    }
  }

//  printf("%f\n",alpha_x[0]);

}

You need to remove lines such as the following…

l_delta_x = delta_x[index];
l_delta_y = delta_y[index];
l_delta_z = delta_z[index];

Sorry :(

You are using a C99 feature, VLA (= variable length array), that is not supported by CUDA for device code, especially not with shared memory:

__shared__ float l_alpha_x[(int)(i_max[index]-i_min[index]+1)];
__shared__ float l_alpha_y[(int)(j_max[index]-j_min[index]+1)];
__shared__ float l_alpha_z[(int)(k_max[index]-k_min[index]+1)];

A VLA works a bit like alloca(): it allocates stack memory (presumably via the __vla_alloc and __vla_dealloc routines that the linker can’t find). No such dynamic allocation works on the device. The one exception is that the size of a shared memory array can be specified at kernel launch time; otherwise, all shared memory allocations must use compile-time constants.
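For illustration, here is a minimal sketch of the launch-time sizing mentioned above. The kernel `fill_alpha` and the host helper `run_fill_alpha` are hypothetical names invented for this example and are not part of the original code. The shared array is declared `extern __shared__` without a size, and its size in bytes is passed as the third kernel launch parameter.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: builds alpha[i] = alpha0 + i * step in shared memory,
// then copies the result to global memory. The length n is only known at run time.
__global__ void fill_alpha(float *alpha, int n, float alpha0, float step)
{
    // Unsized declaration: the real size comes from the launch configuration.
    extern __shared__ float s_alpha[];

    // Each thread fills a strided subset of the shared array.
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        s_alpha[i] = alpha0 + i * step;
    }
    __syncthreads();

    // Copy the shared array out to global memory.
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        alpha[i] = s_alpha[i];
    }
}

// Hypothetical host helper: launches a single block and returns the values.
// Returns an empty vector if any CUDA call fails.
std::vector<float> run_fill_alpha(int n, float alpha0, float step)
{
    std::vector<float> host(n, 0.0f);
    float *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) {
        return {};
    }

    // Third launch parameter: bytes of dynamic shared memory per block.
    size_t shared_bytes = n * sizeof(float);
    fill_alpha<<<1, 64, shared_bytes>>>(dev, n, alpha0, step);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaMemcpy(host.data(), dev, n * sizeof(float),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(dev);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return {};
    }
    return host;
}
```

With this sketch, `run_fill_alpha(10, 0.5f, 0.25f)` should return ten values running from 0.5 to 2.75 in steps of 0.25. Two caveats apply: the requested size has to fit in the per-block shared memory limit of the device, and a kernel gets only one `extern __shared__` array, so three arrays such as l_alpha_x, l_alpha_y and l_alpha_z would have to be carved out of that single buffer with manual offsets.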

Why do I need to remove those lines? I don’t understand; in my opinion there is nothing wrong with them.

I think it’s better to make a local copy than to read the device variable several times.