CUDA: Debug and profiling succeed but Release hangs

What should you make of a program that passes the CUDA debug phase and profiles successfully with more than 50% occupancy everywhere, but whose Release version hangs? It's a weird situation. I have isolated the problem and I will send it to NVIDIA if I finally give up… By the way, you should profile a complex problem; in my case I am seeing 100% occupancy on some kernels I made. Nice feeling, my code under the microscope…

I have isolated the problem to the point where it involves only my home-made kernels, so the problem is mine. There is a conceptual mistake that needs to be found, so there is no need to send it to NVIDIA; I will find it myself.

First of all: can one of the forum administrators do something about the troll in this forum? Possibly press charges? Thanks.

The second thing is that there is a serious problem now that I have broken my code into bits, and I need to get in contact with NVIDIA. In Debug everything works perfectly, but in Release the problem appears.

The most common source of errors that only occur in release mode are race conditions, followed by uninitialized variables and out-of-bounds accesses. I would suggest running the code under cuda-memcheck, on the highest compute capability GPU you have available (as recent GPUs offer improved hardware “hooks” for cuda-memcheck to utilize). Note that cuda-memcheck has separate modes for finding race conditions vs out-of-bounds accesses. As far as I know, cuda-memcheck can find many, but not all, race conditions, so depending on what your code does a manual review of that aspect is advisable.
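
For reference, a minimal sketch of the two modes mentioned above (./myapp is just a placeholder for the application binary):

cuda-memcheck --tool memcheck ./myapp     # out-of-bounds and misaligned accesses
cuda-memcheck --tool racecheck ./myapp    # shared-memory race conditions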

Also, review the error checking of CUDA API calls and kernel launches to make sure they are checked 100% of the time. Similarly, it may be advisable to check the host code with valgrind. Depending on the valgrind version and the CUDA version, and the complexity of the application code, there could be occasional misdiagnoses as valgrind seems to be unable to track memory modified by DMA.
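
A minimal sketch of what checking 100% of the time can look like, assuming a simple macro approach (CUDA_CHECK, d_nodes, numBytes, myKernel, grid and block are placeholder names):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call) do { \
    cudaError_t err_ = (call); \
    if (err_ != cudaSuccess) { \
        fprintf(stderr, "CUDA error %s at %s:%d\n", \
                cudaGetErrorString(err_), __FILE__, __LINE__); \
        exit(EXIT_FAILURE); \
    } \
} while (0)

// Every API call gets checked:
CUDA_CHECK(cudaMalloc(&d_nodes, numBytes));

// Kernel launches return nothing, so check the launch and the execution separately:
myKernel<<<grid, block>>>(/* args */);
CUDA_CHECK(cudaGetLastError());        // reports launch failures
CUDA_CHECK(cudaDeviceSynchronize());   // reports errors during kernel execution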

One addition to @njuffa’s suggestion: if you’re trying to locate an exception in an optimized non-debug kernel, include the -lineinfo option when compiling and then run cuda-memcheck. It has saved me a couple of times! :)
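
For example, assuming a plain nvcc command-line build (file and binary names are placeholders; sm_35 is just taken from the discussion below):

nvcc -O3 -arch=sm_35 -lineinfo -o myapp myapp.cu
cuda-memcheck ./myapp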

I am pretty sure it's not a race condition. The kernel refuses to execute and leaves me with dirty data in a vector; then the other kernel, which depends on it, hangs. It's as simple as that. I have sent the code to NVIDIA. I am using Visual Studio Nsight with memcheck enabled, so if there were a race condition it would tell me; the program is clean. I am not going to use valgrind because the host code is very simple and it's not possible to have a mistake there… I am close to 100% certain that there is a problem with CUDA. Let's see what the NVIDIA engineers will tell me… And it was very simple to tell that the kernel does not execute in Release mode: if (!thread) printf("INSIDE\n");
If the kernel fails to execute, it won't print anything. The reading and writing in the dirty kernel is done by one thread in a specific memory place, not by many…
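
A hedged addition to the printf sentinel: checking the launch itself distinguishes “the kernel never launched” from “the kernel launched and misbehaved”. A minimal sketch around the launch of the kernel posted below (blocks and threadsPerBlock are placeholder names for whatever configuration is actually used):

kernelAddAdditionalNodes<<<blocks, threadsPerBlock>>>(nodes, numbNodes, StartIndex,
                                                      StartIndexNew, leafBits, childrenAddress);
cudaError_t launchErr = cudaGetLastError();      // e.g. invalid configuration or
                                                 // too many resources requested for launch
cudaError_t execErr   = cudaDeviceSynchronize(); // errors hit while the kernel runs
printf("launch: %s, execution: %s\n",
       cudaGetErrorString(launchErr), cudaGetErrorString(execErr));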

I am posting the kernel here. When I comment out "nodes[o+StartIndex] = node;" it runs perfectly, and when I uncomment it the kernel refuses to execute. I am telling you right now, how interesting… Advice, guys:

__global__ void kernelAddAdditionalNodes(dev_octree_node* nodes, int numbNodes, int StartIndex, int StartIndexNew, uint* leafBits, uint* childrenAddress){
	int o = blockIdx.x*blockDim.x + threadIdx.x;
	char signx[8] = {-1, -1, -1, -1, 1, 1, 1, 1};
	char signy[8] = {-1, -1, 1, 1, -1, -1, 1, 1};
	char signz[8] = {-1, 1, -1, 1, -1, 1, -1, 1};
	if (!o) printf("Inside\n ");
	if (o < numbNodes){
		dev_octree_node node = nodes[o+StartIndex];
		//printf("%d ", node.firstChildIdx);
		if (node.firstChildIdx < 0){
			if (leafBits[o]){
				node.firstChildIdx = StartIndexNew + childrenAddress[o] - 8;
				//printf("%d ", node.firstChildIdx);
				for (int i = 0; i < 8; i++){
					dev_octree_node childNode;
					childNode.center.x = node.center.x + signx[i]*node.W/4.0;
					childNode.center.y = node.center.y + signy[i]*node.W/4.0;
					childNode.center.z = node.center.z + signz[i]*node.W/4.0;
					childNode.key = (node.key << 3) + i;
					childNode.firstChildIdx = -1;
					childNode.parentIdx = o+StartIndex;
					childNode.tindx = -1;
					childNode.tnumb = 0;
					childNode.W = node.W/2.0;
					nodes[node.firstChildIdx + i] = childNode;
				}
			}
		}
		//nodes[o+StartIndex] = node;
	}
}

So it's not a race condition; the same thread reads from nodes and writes…

Also, when I change it to:

__global__ void kernelAddAdditionalNodes(dev_octree_node* nodes, int numbNodes, int StartIndex, int StartIndexNew, uint* leafBits, uint* childrenAddress){
	int o = blockIdx.x*blockDim.x + threadIdx.x;
	char signx[8] = {-1, -1, -1, -1, 1, 1, 1, 1};
	char signy[8] = {-1, -1, 1, 1, -1, -1, 1, 1};
	char signz[8] = {-1, 1, -1, 1, -1, 1, -1, 1};
	if (!o) printf("Inside\n ");
	if (o < numbNodes){
		dev_octree_node node = nodes[o+StartIndex];
		//printf("%d ", node.firstChildIdx);
		if (node.firstChildIdx < 0){
			if (leafBits[o]){
				nodes[o+StartIndex].firstChildIdx = StartIndexNew + childrenAddress[o] - 8;
				node.firstChildIdx = StartIndexNew + childrenAddress[o] - 8;
				//printf("%d ", node.firstChildIdx);
				for (int i = 0; i < 8; i++){
					dev_octree_node childNode;
					childNode.center.x = node.center.x + signx[i]*node.W/4.0;
					childNode.center.y = node.center.y + signy[i]*node.W/4.0;
					childNode.center.z = node.center.z + signz[i]*node.W/4.0;
					childNode.key = (node.key << 3) + i;
					childNode.firstChildIdx = -1;
					childNode.parentIdx = o+StartIndex;
					childNode.tindx = -1;
					childNode.tnumb = 0;
					childNode.W = node.W/2.0;
					nodes[node.firstChildIdx + i] = childNode;
				}
			}
		}
	}
}

The program runs perfectly fine. So there is a problem with CUDA. I have sent the code to NVIDIA and they should explain to me why this happens…

One random idea: the dev_octree_node looks like it’s a large structure. Are you using any vector types (uint4, etc.) inside? Are you copying these structures from the host to the device? You might want to check to see what the sizeof(dev_octree_node) is on the device and whether it matches up with what you think it should be as well as what sizeof(dev_octree_node) is on the host.

They should be the same size, but I will check… No vector types… Yes, it's a large structure…
For the sake of completeness I am posting the structure:

struct point{
	float x, y, z;
};

struct dev_octree_node{
	unsigned int key;
	point center;
	float W;
	int tindx;
	int tnumb;
	int parentIdx;
	int firstChildIdx;
	int neighs[27];
	int vertex[8];
	int edge[12];
};
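
Since every member here is a 4-byte type (56 of them in total), one would expect sizeof(dev_octree_node) to be 224 bytes with no padding, on both host and device. A minimal sketch of the size check suggested above, assuming the struct definition just posted (the kernel name printNodeSize is made up for illustration):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void printNodeSize(){
	// Reports the size the device compiler assigns to the struct.
	printf("device sizeof(dev_octree_node) = %u\n", (unsigned)sizeof(dev_octree_node));
}

// Host side, e.g. in main():
printf("host   sizeof(dev_octree_node) = %u\n", (unsigned)sizeof(dev_octree_node));
printNodeSize<<<1, 1>>>();
cudaDeviceSynchronize();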

So I do not see a problem here other than CUDA not being able to handle the transfer on SM 3.5… Mind that it can handle it on SM < 3.5. There is a serious problem on 3.5. This is a bit frustrating and slows down my production line… It's simply not natural for everything to run perfectly on 3.0 and then hang on 3.5. Something is going terribly wrong…