Using conditionals inside kernel

Hi all,

      I actually posted this question on the General GPU computing forum but then felt that it was more appropriate in the programming forum. Apologies for cross-posting.

I am trying to compile and run a small program which will reverse an array. The kernel is supposed to transfer the array onto the shared memory and again onto the global memory in reversed order.

My code compiles. When I run it, the program executes and the reversed array is printed out, but I also get the following error message.

“This application has requested the Runtime to terminate it in an unusual way. Please contact the application’s support team for more information” and also a pop-up message box saying “An exception unknown software exception has occurred in the application at location 0x004…” The kernel code for this is:[codebox]

__global__ void reverse_array_shared(int* dev_arr1, int* dev_arr2, int array_size)

{

extern __shared__ int sh_array[];

int in_index = blockDim.x*blockIdx.x + threadIdx.x;

int out_index = array_size - 1 - in_index;

if(in_index < array_size)

{

/** Transferring data from Global memory into shared memory**/

sh_array[threadIdx.x] = dev_arr1[in_index];

/** Transfer it back into global memory but in a different order **/

dev_arr2[out_index] = sh_array[threadIdx.x];

}

}[/codebox]

I then tried randomly changing my code and figured out what was causing the problem: the if statement inside the kernel. I put that if statement in to deal with the case where array_size is not a multiple of the block size, and to prevent extra threads from executing. So I cannot understand why taking out the ‘if statement’ solved the problem…

[codebox]__global__ void reverse_array_shared (int* dev_arr2, int* dev_arr1)

{

extern __shared__ int sh_array[];

int in_index = blockDim.x*blockIdx.x + threadIdx.x;

int out_index = gridDim.x*blockDim.x - 1 - in_index;

sh_array[threadIdx.x] = dev_arr1[in_index];

dev_arr2[out_index] = sh_array[threadIdx.x];

}[/codebox]

I did read that using conditionals inside your kernel can slow down execution if the threads diverge, but nothing about them causing the kernel to crash. Is there some other risk to using conditionals inside kernels that I should be aware of? :unsure: I want to understand this problem properly so that I don't get stuck with it again… I would really appreciate it if someone could help me with this.

best,

Avinash

My guess is that you are overwriting memory somewhere, and taking out the if condition jostled the code just enough to let you “get away with it” for now.

Oh… How do I find out if I am overwriting memory somewhere?

Another problem that I had with the same code was that I need to explicitly specify the shared memory size during kernel launch for it to launch properly. If I do not explicitly specify the shared memory size during kernel launch, the code works fine as long as the number of threads per block is less than 20. When I hit 20 threads per block, the kernel fails to launch with “unknown error”. It works fine for any block size if I explicitly specify the shared memory size. Are the two problems related?

I am posting my code here. If possible could you just look through it and see where the problem is…?

[codebox]#include <stdio.h>

#include <stdlib.h>

#include <conio.h>

#include <cuda.h>

__global__ void reverse_array_shared (int* dev_arr2, int* dev_arr1)

{

 extern __shared__ int sh_array[];

 int in_index = blockDim.x*blockIdx.x +threadIdx.x;

 int out_index = gridDim.x*blockDim.x -1 - in_index;

sh_array[threadIdx.x] = dev_arr1[in_index];

dev_arr2[out_index] = sh_array[threadIdx.x];

}

void checkCUDAerror (const char* msg)

{

cudaError_t err;

err= cudaGetLastError();

if (err != cudaSuccess)

{

fprintf(stderr,"Cuda Error: %s : %s .\n",msg,cudaGetErrorString(err));

printf("Cuda Error: %s : %s .\n",msg,cudaGetErrorString(err));

exit(EXIT_FAILURE);

}

}

int main()

{

int i;

int *host_arr, *dev_arr1, *dev_arr2;

int array_size = 1024*128;

int threads_per_block = 20; /** Number of threads per block **/

int no_blocks = array_size/threads_per_block; /** Total no of blocks to be launched **/

/** Allocating memory for the host array **/

host_arr = (int*) malloc(sizeof(int)*array_size);

int sharedMemSize = threads_per_block * sizeof(int);

printf("Shared memory = %d\n",sharedMemSize);

/** Allocating memory for the device arrays **/

cudaMalloc((void**)&dev_arr1, sizeof(int)*array_size);

checkCUDAerror("Memory allocation on Device 1");

cudaMalloc((void**)&dev_arr2, sizeof(int)*array_size);

checkCUDAerror("Memory allocation on Device 2");

/** Initializing the array on the host **/

for (i=0; i<array_size; i++)

{

 host_arr[i] = i;

}

/** Transferring array to the device **/

cudaMemcpy(dev_arr1, host_arr, sizeof(int)*array_size,cudaMemcpyHostToDevice);

checkCUDAerror("Transfer to device");

/** Defining the execution configuration **/

dim3 dimBlock = threads_per_block;

dim3 dimGrid = no_blocks;

/**Launching the kernel **/

reverse_array_shared <<<dimGrid, dimBlock, sharedMemSize>>> (dev_arr2, dev_arr1);

checkCUDAerror("Kernel invocation");

/**Transferring reversed array back to host **/

cudaMemcpy(host_arr, dev_arr2, sizeof(int)*array_size,cudaMemcpyDeviceToHost);

checkCUDAerror("Transfer back to host");

/** Printing completion message **/

printf ("Program successfully executed\n");

/**Freeing allocated memory **/

free(host_arr);

cudaFree(dev_arr1);

cudaFree(dev_arr2);

return 0;

}

[/codebox]

Thanks and regards,

Avinash

  1. By checking carefully for obvious issues (like not specifying the shared memory amount: see below)

  2. By compiling with --device-emulation and hoping that the executable segfaults on the bad memory access

  3. By compiling with --device-emulation and running the executable through valgrind on Linux.

Yes, these issues are probably related.

You declare your shared array as extern __shared__ int sh_array[];, which means you must specify the amount of shared memory in the kernel launch parameters (see the programming guide); otherwise the launch will have NO memory allocated and who knows where your writes to that array are going.
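To spell that out for anyone hitting the same wall, here is a minimal sketch of the two ways to declare shared memory (the kernel and variable names are made up for illustration): a statically sized array needs no launch parameter, while an extern declaration only gets memory through the third launch configuration argument.

```cuda
#define BLOCK_SIZE 256   // illustrative compile-time block size

// Static shared memory: the size is known at compile time, so no
// extra launch parameter is needed.
__global__ void reverse_static(int* out, const int* in)
{
    __shared__ int sh_array[BLOCK_SIZE];
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    sh_array[threadIdx.x] = in[i];
    out[gridDim.x * blockDim.x - 1 - i] = sh_array[threadIdx.x];
}

// Dynamic shared memory: the size is only known at run time, so it MUST
// be passed as the third <<<...>>> parameter, or no memory is allocated.
__global__ void reverse_dynamic(int* out, const int* in)
{
    extern __shared__ int sh_array[];
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    sh_array[threadIdx.x] = in[i];
    out[gridDim.x * blockDim.x - 1 - i] = sh_array[threadIdx.x];
}

// Launches (grid and block sizes are illustrative):
// reverse_static <<<numBlocks, BLOCK_SIZE>>> (dev_out, dev_in);
// reverse_dynamic<<<numBlocks, threadsPerBlock,
//                   threadsPerBlock * sizeof(int)>>> (dev_out, dev_in);
```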

Hi,

 Thanks for the help.

I am already compiling with -deviceemu, but I do not get any errors except the “unknown software exception” I posted about. I didn't know about Valgrind before; I will try that out. I have been running these programs on Windows, though. Do you know if I can use the compiled executable directly on Linux, or should I compile it again on Linux with gcc?

Oh… I am sorry, I missed that. I just kept thinking that specifying the shared memory size was optional. Thanks a lot for pointing that out.

best,

Avinash

Avinash, are you still getting the exception if you add in the “if” statement now that you are allocating the shared memory properly?
-Roger

Hi Roger,

      No, the 'if' statement doesn't cause any problems now that I have fixed the shared memory problem. I was changing the code at random in the beginning, so I didn't think to check by adding the 'if' back in after I put in the shared memory allocation. Furthermore, when I kept testing with different block sizes and array sizes, the program crashed again even with the 'if' statement removed if I didn't allocate my shared memory.
      It was like you said: removing the 'if' statement merely jostled the program just enough to let me get away with it for the moment. The problem had been the shared memory allocation all along.
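For future readers, a sketch of the kernel with both fixes applied: the bounds check kept in, and the launch supplying the dynamic shared memory size. The rounded-up block count is my addition (not in the original code above), so that a partial final block is still covered when array_size is not a multiple of the block size.

```cuda
__global__ void reverse_array_shared(int* dev_arr2, const int* dev_arr1,
                                     int array_size)
{
    extern __shared__ int sh_array[];

    int in_index  = blockDim.x * blockIdx.x + threadIdx.x;
    int out_index = array_size - 1 - in_index;

    if (in_index < array_size)   // guard threads past the end of the array
    {
        sh_array[threadIdx.x] = dev_arr1[in_index];
        dev_arr2[out_index]   = sh_array[threadIdx.x];
    }
}

// Round the block count up so the last partial block is launched, and pass
// the dynamic shared memory size as the third launch parameter:
// int no_blocks = (array_size + threads_per_block - 1) / threads_per_block;
// reverse_array_shared<<<no_blocks, threads_per_block,
//                        threads_per_block * sizeof(int)>>>
//                     (dev_arr2, dev_arr1, array_size);
```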

       Thanks a lot for the help. I am just getting started with this, and CUDA is now much less intimidating because of you guys. ^_^

cheers,

Avinash