Unspecified launch failure on GTX480 with CUDA 3.0, but not with CUDA 2.3 or 3.0 on GTX285

Hi,

I have a kernel that works perfectly on a GTX285 but returns “unspecified launch failure” when compiled with nvcc 3.0 or nvcc 3.1 beta and run on a GTX480. I found a workaround, but I don’t understand why it fixes the problem, given that the original code works perfectly on the GTX285.

The kernel fails when I copy a matrix that has negative strides to a new matrix that is C-contiguous. If the strides are positive, it works correctly.

Here is the code of the kernel.

__global__ void k_elemwise_unary_rowmajor_copy2 (unsigned int numEls,
		unsigned int nd,
		const int * dim,
		const float * a_data, const int * a_str,
		float * z_data, const int * z_str)
{
	const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
	const unsigned int numThreads = blockDim.x * gridDim.x;
	for (unsigned int i = idx; i < numEls; i += numThreads)
	{
		unsigned int ii = i;
		const float * a_i = a_data;
		float * z_i = z_data;
		for (unsigned int _d = 0; _d < nd; ++_d)
		{
			unsigned int d = nd - _d-1;
			unsigned int i_d = ii % dim[d]; /* BUG REMOVE THE unsigned TO FIX THE BUG   */
			ii = ii / dim[d];
			a_i += i_d * a_str[d]; /* increment our a and z pointers by i_d elements */
			z_i += i_d * z_str[d];
		}
		z_i[0] = a_i[0];
	}
}

N.B. This version of the code is written to be generic (it works for any number of dimensions in the matrix), not fast.

If I change the line

unsigned int i_d = ii % dim[d]; /* BUG REMOVE THE unsigned TO FIX THE BUG */

to

int i_d = ii % dim[d]; /* BUG REMOVE THE unsigned TO FIX THE BUG */

it fixes my bug.
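For illustration only, here is how the two declarations differ in plain C arithmetic. When `i_d` is `unsigned int`, the product `i_d * a_str[d]` is computed as unsigned, so a negative stride wraps modulo 2^32. The sketch below models what happens when that offset is added to a 32-bit address versus a 64-bit address. The helper names are hypothetical, the addresses are counted in elements rather than bytes, and `unsigned int` is assumed to be 32 bits wide. Whether the GTX480 build really uses 64-bit device pointers where the GTX285 build uses 32-bit ones, and whether this is what triggers the launch failure, is an assumption that has not been verified here.

```c
#include <stdint.h>

/* Hypothetical helpers, not part of the attached cuda_error.cu. */

/* Offset as the original line computes it: unsigned * int converts the
   int operand to unsigned, so a negative stride wraps modulo 2^32. */
static uint32_t offset_unsigned(unsigned int i_d, int stride)
{
    return i_d * stride;
}

/* Offset as the workaround computes it: int * int stays signed. */
static int32_t offset_signed(int i_d, int stride)
{
    return i_d * stride;
}

/* Model of base + offset with 32-bit addresses: the sum wraps modulo
   2^32, so the wrapped unsigned offset still lands on the right element. */
static uint32_t add32(uint32_t base, uint32_t off)
{
    return base + off;
}

/* Model with 64-bit addresses and an unsigned 32-bit offset: the offset
   is zero-extended, so the sum ends up about 2^32 elements too high. */
static uint64_t add64_from_unsigned(uint64_t base, uint32_t off)
{
    return base + (uint64_t)off;
}

/* Model with 64-bit addresses and a signed 32-bit offset: the offset is
   sign-extended, so a negative stride moves backwards as intended. */
static uint64_t add64_from_signed(uint64_t base, int32_t off)
{
    return base + (uint64_t)(int64_t)off;
}
```

With `i_d = 2`, a stride of `-1` and a base address of `1000`, `offset_unsigned` gives 4294967294 and `offset_signed` gives -2. `add32` then yields 998, `add64_from_signed` also yields 998, but `add64_from_unsigned` yields 4294968294, far outside the array.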

It is important that I understand this bug, as I have another bug that is also related to negative strides, for which I have not been able to find a workaround.

I attach a full example that exhibits the problem on the GTX480.

Does anyone have any clue why this fails?

Thanks.

Edit note: made the post clearer.
cuda_error.cu (3.37 KB)

Interesting. I think I’m seeing something similar and I also have no idea how to fix it…