Possible nvcc bug?


I’ve been banging my head against a bug in a CUDA kernel for a couple of days now. It only occurs under very specific compile-time circumstances. I have a kernel that performs convolution. It works in most cases, but a few cause errors. For my test case, I set both arrays to all 1’s with leading and/or trailing 0’s so that no bounds-checking if statements are needed. This means that in the result array, index 0 should contain 1, index 1 should contain 2, …, and then halfway through it reverses, so index n-2 should contain 2 and index n-1 should contain 1.

This works for the majority of test cases, but with the two input arrays set to sizes of 44100, with 512 threads per block and 256 blocks per grid, launched as 1 grid, some data in the middle is scrambled. I can’t tell exactly what is happening, but blocks 54-56 contain smaller numbers than they should, though still increasing. The values reach the halfway index (though the value there is NOT correct) and reverse, decreasing. But then at some point (always the same point), and not aligned on a block boundary, 3 blocks’ worth of indexes (512x3 = 1536) repeat the same number (41284), after which the values continue decreasing and end at 1.

I have tested this many ways. First, I added a statement in the kernel as follows:

if(index == 27649) //where index 27649 is one of the affected indexes
    result[0] = sum;
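For anyone who wants to reproduce the expected triangular output on the host, a minimal CPU reference convolution looks something like this (a sketch; the names and the small sizes are just for illustration, not the code from my kernel):

```c
#include <assert.h>

/* Reference full convolution: result must hold n1 + n2 - 1 elements.
   With both inputs all 1's, result ramps 1, 2, ..., min(n1, n2), ..., 2, 1. */
void conv_ref(const float *a, int n1, const float *b, int n2, float *result)
{
    for (int i = 0; i < n1 + n2 - 1; i++) {
        float sum = 0.0f;
        for (int j = 0; j < n2; j++) {
            int k = i - j;                 /* index into a */
            if (k >= 0 && k < n1)          /* bounds check instead of zero-padding */
                sum += b[j] * a[k];
        }
        result[i] = sum;
    }
}
```

Comparing the GPU result against this on the host is how I check which indexes are wrong.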


And sure enough, the result array contains the correct result at index 0, with 0’s as all the other values. I have also tested writing the sum to consecutive indexes of the result array as it is being totaled, such as:

for(j = 0; j < size; j++)
{
    sum += arr1[j] * arr2[index - j];

    if(index == 27649) //where index 27649 is one of the affected indexes
        result[tempIndex++] = sum;
}



After this runs, the result array contains the values: 1,2,3,4,5,6,…,27650 like it should. This means that the correct value is being computed. But the last thing I tested is where I really can’t figure out what’s going on. In the kernel, after the sum has been computed, I have an if statement:

if(sum < 44101) //this includes all the numbers that should be calculated (since there are 44100 numbers, 44100 should be the largest number for an input of all 1's)
    result[index] = sum;


When this runs, the result array contains the pattern I first described above, with the repeating numbers and duplicates. Now, if I comment that out and in its place put:

if(sum == 44000) //this number should be in the output (it is less than 44100 and greater than 0)
    result[index] = sum;


The result contains all 0’s except for the two indexes that should contain 44000 (as it should). However, using the first if statement, the number 44000 is nowhere to be found (since the index that should produce it is one of the affected indexes and contains a different number). I have verified all these findings by writing the result array to a file and scanning it by hand and by using the find functionality of Visual Studio 2010 (since it contains 88199 float values, each on a separate line). Keep in mind that if I change the block size to 512, it works properly. Or 128. Or the number of threads to 256. Or change both. Or change the size of the arrays so that the first one is 44100*3 values, and the second is 16382 values.

I don’t understand how this could be where one run of the program shows that there is a result of 44000, and the other doesn’t. Everything is determined at compile time (even the threads per block and number of blocks), so there is no chance of run-time “contamination” to change the results. Does this sound like it could be an nvcc/cuda bug? If so, I can try to provide more of the kernel if it is needed.

My system information:

CUDA and the SDK are the newest versions (just downloaded today after this happened on 3.2 RC). I am using Visual Studio 2010 with Visual Studio 2008’s compiler. Intel Core i7 920 at 2.66GHz, 6GB DDR3 RAM, PC built by myself (not Dell, HP, etc…), running Windows 7 64-bit, with a single GTX260 (factory (EVGA) overclocked to 626MHz). If this does sound like a bug with nvcc/cuda and you need more information, please let me know.

Thank you for your time,


I suggest trying cuda-memcheck, trying it on another system, etc. Also try emulation mode. Overclocking could be the cause, as could a compiler bug. Does the program return any errors after the kernel launch?

You may also try CUDA 3.0 and other architecture switches.

Unfortunately, I do not have access to a Linux machine with a CUDA GPU, so I cannot use cuda-gdb. I should be able to try it on another system (though Windows XP, Visual Studio 2008) later today with the same graphics card (factory overclocked and everything too) but with CUDA 3.0 (is there a way to get other old versions of CUDA to try? The only reason I still have 3.0 is that I haven’t used that computer for GPU development in a while). I thought emulation mode was deprecated? My program checks for errors after the kernel call, and none are returned.

Also, it might be worth mentioning that if I change the if statement to:

if(sum == 41284) //the number that repeats 1536 times
    result[index] = index;


The output contains all 0’s except for two locations which have the values 41283 and 46915, like it should. I really have no idea how that number can be repeated 1536 times, let alone while not aligned by block (the index is not perfectly divisible by 512, the number of threads per block).
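As a sanity check on those two locations: for all-1 inputs of length n, the full result has 2n-1 elements and forms a triangle, so any value v with 0 < v < n should appear exactly twice. A quick host-side helper (a sketch, using the sizes from this thread) predicts where:

```c
#include <assert.h>

/* For a full convolution of two all-1 arrays of length n, the result is
   triangular: result[i] = i + 1 on the way up and (2n - 1) - i on the way
   down.  A value v (0 < v < n) therefore appears at exactly two indexes. */
void indexes_of_value(int n, int v, int *first, int *second)
{
    *first  = v - 1;            /* on the rising edge */
    *second = (2 * n - 1) - v;  /* on the falling edge */
}
```

With n = 44100 and v = 41284, this gives indexes 41283 and 46915, exactly the two locations the if statement above finds.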

Just from behavior, your problem sounds like a coding problem, likely a race with writing into the same memory location. This is why a single test works (no race) but a range doesn’t (because there’s now multiple writers which can collide.)

But that can’t be a firm conclusion since you haven’t posted enough code. In particular, the definition of “index” is not clearly described.

If you can reproduce the problem with a small test case you can post, instead of 2-3 line snippets, it’d be easier to spot the issue.

Is index set to the thread number?

I have just tested my kernel on a different machine with (I believe) CUDA 3.0 on Windows XP and Visual Studio 2008 and it runs fine without errors.

Index is set to blockIdx.x * blockDim.x + threadIdx.x;

I will try to get a small test case to show. I thought it was a writing race as well; that’s why I’ve spent so long trying to figure it out. But everything I’ve tried, including looking at all the indexes by hand, show no two that overlap at any time during execution (even if they did, that would leave some indexes blank, which doesn’t happen). Plus, with me now getting a correct run with CUDA 3.0, I really don’t know what it could be if it’s a coding problem on my end.

What about emulation mode?

I thought emulation mode was deprecated and not usable on 3.0+?

You can use it with CUDA 3.0.

Oh, is it 3.1 in which you can’t use it? Do you just compile with -deviceemu, and when you run it, it runs on the CPU? Thanks for the info.

Yes; you may also need to link against the emulation lib files.

You are very likely being misled by results from previous runs left in GPU memory (unlike CPU memory, GPU memory is not initialized to 0).

Make sure you always write to each index, i.e. change your if clauses to something like

if(sum < 44101) //this includes all the numbers that should be calculated (since there are 44100 numbers, 44100 should be the largest number for an input of all 1's)
    result[index] = sum;
else
    result[index] = 0;


Also, it would be a lot easier to help you if you could post complete code to look at.

Ok, I have stripped the kernel down and gotten rid of as many optimizations as I could. Here is the entire kernel code:


in1 is 44100 elements long before padding, containing all 1's; after padding there are 44099 0's, 44100 1's, and 44099 0's, for a total size of 132298 elements

in2 is 44100 elements long, containing all 1's


__global__ void zeroPaddedConvKernelShared(float* result)
{
	float sum;
	int indexPadded, j, in2_size, accessIndex, resultIndex, indexSpacing;

	in2_size = in2_size_d[0]; //size of in2, located in constant memory
	resultIndex = blockIdx.x * blockDim.x + threadIdx.x; //get index of result
	indexPadded = resultIndex + (in2_size - 1); //don't forget that it's zero-padded
	indexSpacing = blockDim.x * gridDim.x;

	//kernelStop_d[0] is located in constant memory and holds the ending value of the kernel, which is:
	//((((in1_size + (2 * (in2_size - 1))) / THREADS_PER_BLOCK) + 1) * THREADS_PER_BLOCK) + (in2_size - 1)
	//which is the next multiple of THREADS_PER_BLOCK of the zero-padded in1 added to the number of padded zeros on each side
	while(indexPadded < kernelStop_d[0])
	{
		sum = 0.0f;
		accessIndex = indexPadded;

		for(j = 0; j < in2_size; j++)
			sum += tex1Dfetch(in2Texture, j) * tex1Dfetch(in1Texture, accessIndex--); //in1 is zero padded on each side so out-of-bounds check can be eliminated

		result[resultIndex] = sum;
		indexPadded += indexSpacing;
		resultIndex += indexSpacing;
	}
}

Also, I forgot to mention that I tested this with full optimizations and with no optimizations and the result was the same.
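For reference, the kernelStop value described in the comment inside the kernel is computed on the host roughly like this (a sketch; THREADS_PER_BLOCK and the size names are taken from that comment, not my actual host code):

```c
#include <assert.h>

/* kernelStop per the comment in the kernel: round the zero-padded in1
   length up to the next multiple of THREADS_PER_BLOCK, then add the
   padding on one side (in2_size - 1). */
int kernel_stop(int in1_size, int in2_size, int threads_per_block)
{
    int padded  = in1_size + 2 * (in2_size - 1);
    int rounded = ((padded / threads_per_block) + 1) * threads_per_block;
    return rounded + (in2_size - 1);
}
```

With in1_size = in2_size = 44100 and 512 threads per block, this works out to 176707.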

Thanks for your time and effort looking into this, everyone. I appreciate it.

So, despite the common mantra of fixing your bugs before moving on, I moved on. I ran into another bug where CUDA returned “unknown error” from the kernel launch. I eventually narrowed it down and fixed it, and it turns out that fixed this nasty bug too :confused: It was one of those typos/mixed-up thoughts (like when you want to multiply an integer by 8 and you accidentally type (num << 8) or even (num >> 3), and it’s hard to spot when you’re looking over the code). When I was calculating kernelStop on the host, I typed dry (which is what I called in1 above for simplicity) instead of wet (which is what I called result above). I still don’t fully understand how it scrambled the output the way that it did (other than kernelStop being too large, so that with 512 threads per block and 256 blocks per grid, some blocks compute an extra sum that they wouldn’t for other sizes of in1 and in2), but at least it’s fixed now.
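For what it’s worth, the effect of an oversized kernelStop can be seen by simulating the kernel’s grid-stride write pattern on the host (a minimal sketch; the sizes in the test are illustrative, not the ones from this thread). Once kernelStop extends past the last intended index, the lowest-numbered threads take an extra trip through the while loop and write one full grid-stride (blockDim.x * gridDim.x) past the end of the intended result range:

```c
#include <assert.h>

/* Simulate the kernel's grid-stride loop and count writes that land at or
   beyond result_size.  pad plays the role of (in2_size - 1), and
   total_threads plays the role of blockDim.x * gridDim.x. */
long out_of_range_writes(long total_threads, long pad,
                         long kernel_stop, long result_size)
{
    long bad = 0;
    for (long tid = 0; tid < total_threads; tid++) {
        long resultIndex = tid;
        long indexPadded = resultIndex + pad;
        while (indexPadded < kernel_stop) {
            if (resultIndex >= result_size)
                bad++;                      /* out-of-bounds write on device */
            indexPadded += total_threads;
            resultIndex += total_threads;
        }
    }
    return bad;
}
```

On the device those stray writes land in whatever happens to sit past the end of the result allocation, which would explain corruption at a fixed, non-block-aligned spot.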

Thanks for your time, everyone.