division and modulo operations on indices

I’m trying to use logic similar to the following in a kernel:

int dim1 = threadIdx.x / some_int;
int dim2 = threadIdx.x % some_int;

Where dim1 and dim2 then have meaningful index values.

However when I debug and watch dim1 and dim2 they do not appear to be calculating correctly. (And I have to make them ‘volatile’ for them to be viewable in the debugger in the first place.) Just wondering if floored division and modulo operations are known to be glitchy / to be avoided?

Glitchy? No. To be avoided? Possibly, due to performance reasons; but this doesn’t seem to be such an instance.

Do you have a minimal, buildable and runnable example that demonstrates the issue you observe?

Here it is. I’m getting all 0’s in the displayed output.

#include "cuda_runtime.h"
#include <iostream>

__global__ void division_test_kernel(const int denominator, int * d1_out, int * d2_out)
{
	int d1 = (int)threadIdx.x / denominator;
	int d2 = (int)threadIdx.x % denominator;
	int tid = (int)blockIdx.x * (int)blockDim.x + (int)threadIdx.x;

	d1_out[tid] = d1;
	d2_out[tid] = d2;
}

void division_test()
{
	int blocks = 1;
	int threads_per_block = 256;
	int denominator = 10;

	int * d1_out, *d2_out;

	// allocate device memory
	cudaMalloc((void**)&d1_out, blocks * threads_per_block * sizeof(int));
	cudaMalloc((void**)&d2_out, blocks * threads_per_block * sizeof(int));

	// call kernel
	division_test_kernel <<< blocks, threads_per_block >>> (denominator, d1_out, d2_out);

	// allocate host memory and copy results
	int *d1_out_host, *d2_out_host;
	cudaMallocHost((void**)&d1_out_host, blocks * threads_per_block  * sizeof(int));
	cudaMallocHost((void**)&d2_out_host, blocks * threads_per_block * sizeof(int));

	cudaMemcpy(d1_out, d1_out_host, blocks * threads_per_block * sizeof(int), cudaMemcpyDeviceToHost);
	cudaMemcpy(d2_out, d2_out_host, blocks * threads_per_block * sizeof(int), cudaMemcpyDeviceToHost);

	// display results
	std::cout << "d1" << std::endl;
	for (int i = 0; i < blocks * threads_per_block; ++i) std::cout << d1_out_host[i] << std::endl;
	std::cout << std::endl << "d2" << std::endl;
	for (int i = 0; i < blocks * threads_per_block; ++i) std::cout << d2_out_host[i] << std::endl;

	// cleanup device memory
	cudaFree(d1_out);
	cudaFree(d2_out);

	// cleanup host memory
	cudaFreeHost(d1_out_host);
	cudaFreeHost(d2_out_host);
}

int main()
{
	division_test();
	return 0;
}

Wrong order of arguments in the calls to cudaMemcpy(). You want:

cudaMemcpy(d1_out_host, d1_out, blocks * threads_per_block * sizeof(int), cudaMemcpyDeviceToHost);
cudaMemcpy(d2_out_host, d2_out, blocks * threads_per_block * sizeof(int), cudaMemcpyDeviceToHost);

Well that’s embarrassing :p That said, I still have a problem in my actual code though, and I’m having some trouble reproducing it in a smaller example. I’ll take another hack at it an post again shortly.

One challenge is that the VS Nsight debugger is giving me “has no value at the target location” which is preventing me from seeing what’s going on. I can’t find any recent posts explaining how to work around that. Any suggestions?

For use with the debugger, you need to use debug builds. Are you doing that?

For debug builds, the CUDA compiler turns off all optimizations. This forces source code variables to reside in memory locations where the debugger knows where to grab them. In optimized builds, a given variable from the source code may not actually exist anywhere, or could be moved through n different registers in the course of a single loop iteration, so that the debugger has no clue where to grab the data.

I often debug by inserting printf() calls, and logging the results. That is a habit I developed when programming embedded systems for which no dedicated debugger existed. This is how I established in ten seconds that your kernel was working correctly.

I am doing a debug build with optimization explicitly turned off, and still I am getting “has no value at the target location”. Declaring the variable volatile makes it show up as a 0 in the watch, except that’s clearly not right in all cases.

I will try printf.

Not sure what this means. Can you show the exact nvcc commandline invocation? Do you have -g -G as part of the nvcc commandline? Normally that should take care of it. Do you have multiple, contradictory flags, by any chance?

Copied from the VS console:

C:\C++\Projects\test-projects\mnist\mnist\mnist>“C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin\nvcc.exe” -ccbin “C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\x86_amd64” -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include" -G --keep-dir x64\Debug -maxrregcount=0 --machine 64 --compile -gencode=arch=compute_50,code=“sm_50,compute_50” -g -DWIN32 -DWIN64 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /FS /Zi /RTC1 /MDd " -o x64\Debug\kernels.cu.obj “C:\C++\Projects\test-projects\mnist\mnist\mnist\kernels.cu” -clean

The only manual modification I’ve made here is -gencode=arch=compute_50,code=“sm_50,compute_50”, to avoid the “The ‘compute_20’, ‘sm_20’, and ‘sm_21’ architectures are deprecated” warning. Everything else is VS generated.

Yup, looks like a debug build to me. Is the variable you can’t grab with the debugger possibly something like

const float pi = 3.14159265358979323f;

If I recall correctly, under C/C++ rules no storage need to be allocated for such a variable unless its address is taken, meaning it essentially turn into a literal floating-point constant, even before any optimization happens.

It is also possible that the compiler doesn’t sufficiently pessimize the code. If you think here is a bug, you could always file a bug report. Use of the volatile attribute is a valid workaround.