Kernel Launch Failure: Simple Kernel with 2D array fails init

I have a very simple kernel, reduced from a larger program to isolate an error I am seeing (I have an 8800 GTS card). When I run this kernel, my system times out and, after about 4 seconds, I get a failure-to-launch error. In the “real” kernel, I am trying to load different segments of global memory into shared memory, and the blocks with blockIdx.x and blockIdx.y equal to 0 have to be handled slightly differently.

#ifndef _TEMPLATE_KERNEL_H_
#define _TEMPLATE_KERNEL_H_

#define dBLOCK_SIZE 16

__global__ void testKernel()
{
  int tidx  = threadIdx.x;
  int tidy  = threadIdx.y;
  int bidx  = blockIdx.x;
  int bidy  = blockIdx.y;

  __shared__  int test[dBLOCK_SIZE][dBLOCK_SIZE];

  if (bidy==0)
  {
    if (bidx==0)
      test[tidx][tidy]=1;
    else
      test[tidx][tidy]=2;
  }
  else
    test[tidx][tidy]=3;

  __syncthreads();
}

#endif // #ifndef _TEMPLATE_KERNEL_H_

#include <stdio.h>

// includes, project
#include <cutil.h>

// includes, kernels
#include <template_kernel.cu>

////////////////////////////////////////////////////////////////////////////////
// declaration, forward
void runTest( int argc, char** argv);

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int
main( int argc, char** argv)
{
    runTest( argc, argv);
    CUT_EXIT(argc, argv);
}

////////////////////////////////////////////////////////////////////////////////
//! Run a simple test for CUDA
////////////////////////////////////////////////////////////////////////////////
void
runTest( int argc, char** argv)
{
    CUT_DEVICE_INIT();

    // set block and grid dimensions
    dim3  grid(1, 1, 1);
    dim3  threads(16, 16, 1);

    // execute the kernel
    testKernel<<< grid, threads>>>();

    // check if kernel execution generated an error
    CUT_CHECK_ERROR("Kernel execution failed");
}

Any insight would be greatly appreciated. I do not get an error in debug mode and I have reduced the code down as far as possible to try and troubleshoot. At the moment, I am stuck. (I really hope this isn’t something dumb on my part).

I have never used shared memory like you do here and I think it’s the wrong way.

The size of shared memory is defined when calling the kernel (the third parameter in the fancy triple-angle-brackets).

So I would try removing the “__shared__ int test[dBLOCK_SIZE][dBLOCK_SIZE];” and putting it right before “__global__ void testKernel()” like this: “__shared__ int test;”.
Then the “#define dBLOCK_SIZE 16” should go into your main file and your kernel call should look like this: “testKernel<<< grid, threads, dBLOCK_SIZE*dBLOCK_SIZE*sizeof(int)>>>();”.

I think it should work… not sure though.
Another note: you realize that you won’t get any results from the kernel, right? So it does not make a whole lot of sense - maybe the compiler realizes this and your kernel won’t do anything. It should launch though.
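To make the kernel's effect observable (and to give the compiler no grounds for discarding the shared-memory stores), the values can be copied out to global memory and read back on the host. The following is a minimal sketch, not the original poster's code: the kernel name `testKernelOut`, the `out` buffer, the host helper `runBranchTest`, and the choice of grid size are assumptions made purely for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define dBLOCK_SIZE 16

// Hypothetical variant of testKernel: same branching as in the thread, but
// each thread copies its shared-memory value to global memory afterwards.
__global__ void testKernelOut(int *out)
{
    int tidx = threadIdx.x;
    int tidy = threadIdx.y;
    int bidx = blockIdx.x;
    int bidy = blockIdx.y;

    __shared__ int test[dBLOCK_SIZE][dBLOCK_SIZE];

    if (bidy == 0) {
        if (bidx == 0)
            test[tidx][tidy] = 1;
        else
            test[tidx][tidy] = 2;
    } else {
        test[tidx][tidy] = 3;
    }
    __syncthreads();

    // One output slot per thread; blocks are laid out row by row.
    int block         = bidy * gridDim.x + bidx;
    int threadInBlock = tidy * dBLOCK_SIZE + tidx;
    out[block * dBLOCK_SIZE * dBLOCK_SIZE + threadInBlock] = test[tidx][tidy];
}

// Hypothetical host helper: launches a gridX x gridY grid and returns the
// value written by thread (0,0) of every block. Returns an empty vector if
// any CUDA call reports an error.
std::vector<int> runBranchTest(int gridX, int gridY)
{
    const int perBlock = dBLOCK_SIZE * dBLOCK_SIZE;
    const int total    = gridX * gridY * perBlock;

    int *dOut = NULL;
    if (cudaMalloc((void **)&dOut, total * sizeof(int)) != cudaSuccess)
        return std::vector<int>();

    dim3 grid(gridX, gridY, 1);
    dim3 threads(dBLOCK_SIZE, dBLOCK_SIZE, 1);
    testKernelOut<<<grid, threads>>>(dOut);

    // A blocking cudaMemcpy waits for the kernel to finish, so an execution
    // failure in the kernel is reported here.
    std::vector<int> all(total);
    cudaError_t err = cudaMemcpy(all.data(), dOut, total * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    if (err != cudaSuccess)
        return std::vector<int>();

    std::vector<int> firstPerBlock;
    for (int b = 0; b < gridX * gridY; ++b)
        firstPerBlock.push_back(all[b * perBlock]);
    return firstPerBlock;
}
```

On a working setup, `runBranchTest(2, 2)` should return one value per block: 1 for block (0,0), 2 for block (1,0), and 3 for the two blocks with blockIdx.y equal to 1. An empty result means a CUDA call failed somewhere along the way.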

The shared memory usage is OK. You can declare shared arrays within kernels as long as size is determined at compile time. If you want to determine the size at run-time, then you have to go with the allocation at the launch.
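The two options described above can be sketched side by side. This is an illustrative sketch rather than code from the thread; the kernel names `staticShared` and `dynamicShared`, the host helper `launchBoth`, and the `out` buffer are hypothetical.

```cuda
#include <cuda_runtime.h>

#define dBLOCK_SIZE 16

// Static allocation: the size is a compile-time constant, so the array is
// declared inside the kernel and no third launch parameter is needed.
__global__ void staticShared(int *out)
{
    __shared__ int tile[dBLOCK_SIZE][dBLOCK_SIZE];
    tile[threadIdx.y][threadIdx.x] = threadIdx.y * dBLOCK_SIZE + threadIdx.x;
    __syncthreads();
    out[threadIdx.y * dBLOCK_SIZE + threadIdx.x] = tile[threadIdx.y][threadIdx.x];
}

// Dynamic allocation: the array is declared extern and unsized; its size in
// bytes is supplied as the third launch parameter.
__global__ void dynamicShared(int *out, int width)
{
    extern __shared__ int tile[];   // 1D only; 2D indexing is done by hand
    int i = threadIdx.y * width + threadIdx.x;
    tile[i] = i;
    __syncthreads();
    out[i] = tile[i];
}

// Hypothetical host helper: runs both kernels with one 16x16 block and copies
// each result into a caller-provided array of 256 ints. Returns the first
// error encountered, or cudaSuccess.
cudaError_t launchBoth(int *hStatic, int *hDynamic)
{
    const int n = dBLOCK_SIZE * dBLOCK_SIZE;
    int *dOut = NULL;
    cudaError_t err = cudaMalloc((void **)&dOut, n * sizeof(int));
    if (err != cudaSuccess)
        return err;

    dim3 threads(dBLOCK_SIZE, dBLOCK_SIZE, 1);

    staticShared<<<1, threads>>>(dOut);
    err = cudaMemcpy(hStatic, dOut, n * sizeof(int), cudaMemcpyDeviceToHost);

    if (err == cudaSuccess) {
        // Third launch parameter: bytes of dynamic shared memory per block.
        dynamicShared<<<1, threads, n * sizeof(int)>>>(dOut, dBLOCK_SIZE);
        err = cudaMemcpy(hDynamic, dOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    }

    cudaFree(dOut);
    return err;
}
```

Both kernels write the same pattern (element i holds the value i), which makes it easy to confirm that the two declaration styles behave identically for a fixed 16x16 block.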

I could not reproduce your crash or time-out. I did consolidate your code into one file though (provided below). Try this on your system and let me know if you’re still getting a crash/time-out. You can generate an executable by simply saving the code in a file (say, test.cu) and then type “nvcc test.cu” at the command line.

Paulius

#include <stdio.h>

#define dBLOCK_SIZE 16

__global__ void testKernel()
{
    int tidx  = threadIdx.x;
    int tidy  = threadIdx.y;
    int bidx  = blockIdx.x;
    int bidy  = blockIdx.y;

    __shared__  int test[dBLOCK_SIZE][dBLOCK_SIZE];

    if (bidy==0)
    {
        if (bidx==0)
            test[tidx][tidy]=1;
        else
            test[tidx][tidy]=2;
    }
    else
        test[tidx][tidy]=3;

    __syncthreads();
}

int main( int argc, char** argv)
{
    // set block and grid dimensions
    dim3  grid(1, 1, 1);
    dim3  threads(16, 16, 1);

    // execute the kernel
    testKernel<<< grid, threads>>>();

    printf("%s\n", cudaGetErrorString(cudaGetLastError()));

    return 0;
}
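
One caveat about the check in the listing above: kernel launches are asynchronous, so `cudaGetLastError()` called immediately after the launch mainly reports launch-configuration problems, while a failure during execution (such as a watchdog time-out) may only surface at the next synchronizing call. A hedged sketch of checking both stages follows. It uses `cudaDeviceSynchronize()` from current CUDA toolkits, which is an assumption about the toolkit in use (the CUDA 1.0-era equivalent was `cudaThreadSynchronize()`), and the helper name `checkedLaunch` is hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define dBLOCK_SIZE 16

// Same kernel as in the thread.
__global__ void testKernel()
{
    int tidx = threadIdx.x;
    int tidy = threadIdx.y;
    int bidx = blockIdx.x;
    int bidy = blockIdx.y;

    __shared__ int test[dBLOCK_SIZE][dBLOCK_SIZE];

    if (bidy == 0) {
        if (bidx == 0)
            test[tidx][tidy] = 1;
        else
            test[tidx][tidy] = 2;
    } else {
        test[tidx][tidy] = 3;
    }
    __syncthreads();
}

// Hypothetical helper: launches the kernel and reports launch-time and
// execution-time errors separately. Returns the first error seen.
cudaError_t checkedLaunch()
{
    dim3 grid(1, 1, 1);
    dim3 threads(dBLOCK_SIZE, dBLOCK_SIZE, 1);

    testKernel<<<grid, threads>>>();

    // Stage 1: errors detected when the launch was issued
    // (bad configuration, too many resources requested, ...).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch error: %s\n", cudaGetErrorString(err));
        return err;
    }

    // Stage 2: wait for the kernel to finish; errors that occur while it
    // runs are returned by the synchronizing call.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "execution error: %s\n", cudaGetErrorString(err));
    return err;
}
```

Knowing which of the two messages appears helps separate a bad launch configuration from a kernel that starts and then fails or times out.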

Paulius,

 Thank you for the reply.  I will try that first thing Monday, as that is the place that I have the 8800 GTS.  I hope it works--I can't figure out what is wrong and I definitely don't want it to be the card itself, but I made sure that I can run several other kernels with no problem.  Just out of curiosity, what type of card did you run the kernel on?

I tried both the original and the new versions on an 8800GTX and they both worked fine. Are you sure you have the CUDA drivers installed? I’m not sure what would happen if you don’t but just a thought since others have been able to run your code…
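
A quick way to check the driver question from code is to ask the runtime what it can see. The sketch below is an illustration, not something posted in the thread: the helper name `reportCudaSetup` is hypothetical, and `cudaDriverGetVersion()` and `cudaRuntimeGetVersion()` come from current CUDA runtime releases, so they are assumed to be available in the toolkit being used.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints driver/runtime versions and the visible
// devices. Returns the number of devices, or -1 if the runtime reports an
// error (for example, when no usable driver is installed).
int reportCudaSetup()
{
    int driverVersion = 0;
    int runtimeVersion = 0;
    cudaDriverGetVersion(&driverVersion);    // stays 0 if no driver is installed
    cudaRuntimeGetVersion(&runtimeVersion);
    printf("driver version: %d, runtime version: %d\n",
           driverVersion, runtimeVersion);

    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return -1;
    }

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) == cudaSuccess)
            printf("device %d: %s (compute capability %d.%d)\n",
                   d, prop.name, prop.major, prop.minor);
    }
    return count;
}
```

If this reports the expected card and a non-zero driver version, a missing driver is unlikely to be the cause of the launch failure.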

Thank you to everyone for the replies. I hate when people post, solve the problem and then don’t share their results, so here is what I found. The following is the code I was able to run without the error I had been getting (unspecified launch error); a short discussion follows it.

#include <stdio.h>

#define dBLOCK_SIZE 16

__global__ void testKernel()
{
    int tidx  = threadIdx.x;
    int tidy  = threadIdx.y;
    int bidx  = blockIdx.x;
    int bidy  = blockIdx.y;

    __shared__  int test[dBLOCK_SIZE][dBLOCK_SIZE];

    if (bidy==0)
    {
        if (bidx==0)
            test[tidx][tidy]=1*bidx;
        else
            test[tidx][tidy]=2*bidx;
    }
    else
        test[tidx][tidy]=3*bidy;

    __syncthreads();
}

int main( int argc, char** argv)
{
    // set block and grid dimensions
    dim3  grid(1, 1, 1);
    dim3  threads(16, 16, 1);

    // execute the kernel
    testKernel<<< grid, threads>>>();

    printf("%s\n", cudaGetErrorString(cudaGetLastError()));

    return 0;
}

I have an 8800 GTS board and I am able to run the CUDA SDK projects, so I am sure that everything is installed correctly. Although I have not fully characterized the issue, the following is what I observed. If an operation is conditioned on the block ID (blockIdx.x or blockIdx.y), then the operation that follows needs to use that same block ID (x or y, respectively) in its calculation. I understand that others did not observe this issue, but I have tried every combination of things I can think of and this is the only modification I have made that launches the kernel without error.

One interesting thing that I found is as follows.

    if (bidy==0)
    {
        if (bidx==0)
            test[tidx][tidy]=1*bidx;
        else
            test[tidx][tidy]=2*bidx;
    }

In this section, I can turn one “bidx” into “bidy”, but if both are turned into “bidy”, the kernel will not launch.

If anyone has any thoughts as to why this does or does not make sense, I would be very interested to hear them.

For what it may be worth… I tried all permutations of bidx and bidy in the body of your if-statements and never encountered an error. I am using CUDA v1.0 on OpenSuse 10.2 x64 and an 8800GTX.

I have tried this code on my machine, and I get “unspecified launch failure”. I am running Fedora Core 6 with an 8800GTX. I also have a 7900 GTX, and I have tried with the display connected to each card in turn.

I found this thread and tried it out due to a mysterious launch failure problem I’m having with a much longer, more complex program. That program runs fine in emulation but fails running in hardware mode. It only launches a single thread block with a single thread, so synchronization is not an issue.

Any ideas?

Jon