Launch failures after CUDA upgrade? 2.0 -> 2.3 = unspecified launch failures

Hello,

Our project recently upgraded from CUDA 2.0 to CUDA 2.3.

The project consists of many kernels, and we have found that about half of them now produce “unspecified launch failures” since the upgrade. Before the upgrade, all kernels were functioning properly and producing correct output.

This problem appears on all development systems involved and is also occurring on our Tesla platform.

The code has not changed at all, and yet the upgrade has apparently left around half of the kernels instantly “bugged” under 2.3.

Has anybody else experienced this after upgrading CUDA?

Any input would be greatly appreciated.

Compile with --ptxas-options=-v and compare the number of registers used by the two versions of nvcc.
You can try to force the new version to use the old register count with the --maxrregcount flag.
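For example, something along these lines (kernel.cu is a placeholder for one of your files):

nvcc --ptxas-options=-v -c kernel.cu                     # print register/smem usage per kernel
nvcc --maxrregcount=16 --ptxas-options=-v -c kernel.cu   # cap registers (here at 16)

Run the first line with both toolkits and diff the ptxas output.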

Hello mfatica,

Thank you for taking the time to help me with this issue.

For a given problematic kernel, NVCC 2.3 with no register limit specified generates the following:

ptxas info : Compiling entry function '_Z15kernel_gradientPf4dim3S0_jf'
ptxas info : Used 13 registers, 48+16 bytes smem, 12 bytes cmem[1]

Attempting to limit register utilization via --maxrregcount (4, 8, 12, etc.) does not remedy the issue. I still encounter “unspecified launch failures” regardless of the register limit specified.

However, I have found that running the problematic kernels under emulation mode delivers “cannot fetch from a texture that is not bound.”
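(For anyone reproducing: by “emulation mode” I mean a -deviceemu build, roughly:

nvcc -deviceemu -c kernel.cu

with kernel.cu standing in for our actual sources.)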

We have been binding texture references to cudaMalloc()ed global memory via cudaBindTexture(offset, tex_ref, dev_ptrs, size), which has worked in the past. Removing texture unit utilization and just using global memory reads remedies the issue.
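In outline, the pattern looks like this (a trimmed sketch, not our real code; all names are placeholders):

texture<float, 1, cudaReadModeElementType> tex_ref;   // file-scope texture reference

__global__ void kernel(float* out, int n)
{
	int i = (blockIdx.x * blockDim.x) + threadIdx.x;
	if (i < n)
		out[i] = tex1Dfetch(tex_ref, i);   // fetch through the bound reference
}

void host_side(float* dev_ptr, float* dev_out, int n)
{
	// Bind a cudaMalloc()ed linear buffer to the texture reference,
	// launch, then unbind.
	cudaBindTexture(0, tex_ref, dev_ptr, n * sizeof(float));
	kernel<<<(n + 127) / 128, 128>>>(dev_out, n);
	cudaUnbindTexture(tex_ref);
}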

Has texture behavior changed between versions?

I have the same problem, across several tens of kernels. While debugging I found that textures defined in this way produce the problems you described. However, accessing global memory instead of textures has had no positive effect…

I stumbled across this thread today…

http://forums.nvidia.com/index.php?showtopic=102830

It seems that we are not the only ones with this problem.

Could somebody from NVIDIA please address or acknowledge this issue? :thumbup:

Can you post a repro case?

Hello tmurray,

Thank you for taking the time to look at this issue with me.

Below I have posted an example that is similar in structure to one of our problematic kernels:

textures.c

#include <stdio.h>
#include <stdlib.h>

#include "tex_stubs.h"

int main()
{
	int elements = 100;
	float* test_data = (float*)malloc(elements * sizeof(float));

	// Generate some work
	int i;
	for (i = 0; i < elements; i++)
		test_data[i] = (float)i;

	// Invoke the texture kernel
	CUDA_texture_test(test_data, elements);

	// Print results
	for (i = 0; i < elements; i++)
		printf("%f\n", test_data[i]);

	free(test_data);
	return 0;
}

tex_stubs.cu

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

#include "tex_stubs.h"
#include "tex_kernels.h"

extern "C" void CUDA_texture_test(float* test_data, int elements)
{
	float* dev_test_data;
	float* dev_return;
	size_t test_size = elements * sizeof(float);

	// Allocate some global memory on the GPU
	cudaMalloc((void**)&dev_test_data, test_size);
	checkCUDAError("cudaMalloc(): dev_test_data");
	cudaMalloc((void**)&dev_return, test_size);
	checkCUDAError("cudaMalloc(): dev_return");

	// Copy test data to GPU global memory
	cudaMemcpy(dev_test_data, test_data, test_size, cudaMemcpyHostToDevice);
	checkCUDAError("cudaMemcpy(): test_data -> dev_test_data");
	cudaMemset(dev_return, 0, test_size);
	checkCUDAError("cudaMemset(): dev_return");
	memset(test_data, 0, test_size);

	// Bind allocated global memory to texture reference
	cudaBindTexture(0, tex_test, dev_test_data, test_size);
	checkCUDAError("cudaBindTexture(): dev_test_data -> tex_test");

	// Define the execution configuration
	int threads_per_block = 128;
	int num_threads = elements;
	int num_blocks = (int)ceil(num_threads / (float)threads_per_block);
	dim3 dimGrid(num_blocks, 1, 1);
	dim3 dimBlock(threads_per_block, 1, 1);

	// Invoke the kernel; synchronize so the error check below also
	// catches failures that occur during kernel execution.
	kernel_texture<<<dimGrid, dimBlock>>>(dev_return, test_size);
	cudaThreadSynchronize();
	checkCUDAError("Kernel Panic!");

	// Copy results back
	cudaMemcpy(test_data, dev_return, test_size, cudaMemcpyDeviceToHost);
	checkCUDAError("cudaMemcpy(): dev_return -> test_data");

	// Cleanup
	cudaUnbindTexture(tex_test);
	cudaFree(dev_test_data);
	cudaFree(dev_return);
}

extern "C" void checkCUDAError(const char *msg)
{
	cudaError_t err = cudaGetLastError();
	if (cudaSuccess != err)
	{
		fprintf(stderr, "\n\nCUDA ERROR: %s (%s).\n", msg, cudaGetErrorString(err));
		exit(EXIT_FAILURE);
	}
}

tex_kernels.cu

#include "tex_kernels.h"

__global__ void kernel_texture(float* dev_return, int test_size)

{

	// -- Setup Thread Attributes -----------------------------

	int blockIdxInGrid  = (gridDim.x * blockIdx.y) + blockIdx.x;

	int threadsPerBlock  = (blockDim.x * blockDim.y * blockDim.z);

	int threadIdxInBlock = (blockDim.x * blockDim.y * threadIdx.z) + (blockDim.x * threadIdx.y) + threadIdx.x;

	int threadIdxInGrid = (blockIdxInGrid * threadsPerBlock) + threadIdxInBlock;

	// --------------------------------------------------------

	// Return excess threads

	if ( threadIdxInGrid > (test_size/sizeof(float)) )

		return;

	// Read element from texture, increment it, and then

	// place it into the return array.

	dev_return[threadIdxInGrid] = tex1Dfetch(tex_test, threadIdxInGrid) + 1.0;

	

}

__global__ void kernel_no_texture(float* dev_test_data, float* dev_return, int test_size)

{

	// -- Setup Thread Attributes -----------------------------

	int blockIdxInGrid  = (gridDim.x * blockIdx.y) + blockIdx.x;

	int threadsPerBlock  = (blockDim.x * blockDim.y * blockDim.z);

	int threadIdxInBlock = (blockDim.x * blockDim.y * threadIdx.z) + (blockDim.x * threadIdx.y) + threadIdx.x;

	int threadIdxInGrid = (blockIdxInGrid * threadsPerBlock) + threadIdxInBlock;

	// --------------------------------------------------------

	// Return excess threads

	if ( threadIdxInGrid > (test_size/sizeof(float)) )

		return;

	// Read element from texture, increment it, and then

	// place it into the return array.

	dev_return[threadIdxInGrid] = dev_test_data[threadIdxInGrid] + 1.0;

	

}

tex_stubs.h

#ifndef _tex_stubs_h_
#define _tex_stubs_h_

#if defined __cplusplus
extern "C" {
#endif

void CUDA_texture_test(float* test_data, int elements);
void checkCUDAError(const char *msg);

#if defined __cplusplus
}
#endif

#endif

tex_kernels.h

#ifndef _tex_kernels_h_
#define _tex_kernels_h_

__global__ void kernel_texture(float* dev_return, int test_size);
__global__ void kernel_no_texture(float* dev_test_data, float* dev_return, int test_size);

texture<float, 1, cudaReadModeElementType> tex_test;

#endif

kernel_texture() fails to launch for me under CUDA 2.3 with an “unspecified launch failure.”

Under CUDA 2.0 this method of utilizing texture references seemed to be okay.

(We have not tried CUDA 2.1 or CUDA 2.2. We made the jump directly from CUDA 2.0 to CUDA 2.3)
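For what it's worth, our current workaround is the global-memory variant: in tex_stubs.cu we drop the cudaBindTexture()/cudaUnbindTexture() calls and replace the launch (a sketch of the substitution; everything else stays as posted above):

	// Workaround: pass the device pointer directly instead of
	// fetching through the texture reference.
	kernel_no_texture<<<dimGrid, dimBlock>>>(dev_test_data, dev_return, test_size);
	checkCUDAError("kernel_no_texture launch");

With that change the example runs and produces the expected output.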

Any insight you could provide is greatly appreciated. :thumbup:

Please let me know if I can provide any further information that may be helpful in resolving this issue.

UPDATE:

Just rolled a machine with a GTX 285 back to CUDA 2.2 and attempted to run the above example code. The program executed and produced the proper output as expected. It would seem that the issue probably lies somewhere within CUDA 2.3.