division and modulo operations on indices

Tuee22 · September 5, 2017, 4:09pm

I’m trying to use logic similar to the following in a kernel:

int dim1 = threadIdx.x / some_int;
int dim2 = threadIdx.x % some_int;

Where dim1 and dim2 then have meaningful index values.

However when I debug and watch dim1 and dim2 they do not appear to be calculating correctly. (And I have to make them ‘volatile’ for them to be viewable in the debugger in the first place.) Just wondering if floored division and modulo operations are known to be glitchy / to be avoided?
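For reference, here is a minimal host-side sketch of the same index split, written in plain C++ rather than as a kernel so it can be checked without a GPU. The names `Index2D`, `split_index`, and `join_index` are hypothetical and do not come from the thread. Note that C++ integer division truncates toward zero rather than flooring; the two agree for non-negative values such as a thread index.

```cpp
#include <cassert>

// Hypothetical helper type: the two indices derived from one linear index.
struct Index2D {
    int d1;  // quotient:  linear / width
    int d2;  // remainder: linear % width
};

// Split a non-negative linear index into (quotient, remainder).
// Assumption: width > 0 and linear >= 0, as for a thread index.
inline Index2D split_index(int linear, int width) {
    assert(linear >= 0 && width > 0);
    return Index2D{linear / width, linear % width};
}

// Inverse mapping: rebuild the linear index from the pair.
inline int join_index(Index2D idx, int width) {
    return idx.d1 * width + idx.d2;
}
```

With a width of 10, index 37 maps to d1 = 3 and d2 = 7, and d1 * 10 + d2 recovers 37. If a kernel's values break that round trip, the arithmetic is unlikely to be the cause; the surrounding code is the more probable suspect.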

njuffa · September 5, 2017, 4:50pm

Glitchy? No. To be avoided? Possibly, due to performance reasons; but this doesn’t seem to be such an instance.

Do you have a minimal, buildable and runnable example that demonstrates the issue you observe?

Tuee22 · September 5, 2017, 6:40pm

Here it is. I’m getting all 0’s in the displayed output.

#include "cuda_runtime.h"
#include <iostream>

__global__ void division_test_kernel(const int denominator, int * d1_out, int * d2_out)
{
	int d1 = (int)threadIdx.x / denominator;
	int d2 = (int)threadIdx.x % denominator;
	int tid = (int)blockIdx.x * (int)blockDim.x + (int)threadIdx.x;

	d1_out[tid] = d1;
	d2_out[tid] = d2;
}

void division_test()
{
	int blocks = 1;
	int threads_per_block = 256;
	int denominator = 10;

	int * d1_out, *d2_out;

	// allocate device memory
	cudaMalloc((void**)&d1_out, blocks * threads_per_block * sizeof(int));
	cudaMalloc((void**)&d2_out, blocks * threads_per_block * sizeof(int));

	// call kernel
	division_test_kernel <<< blocks, threads_per_block >>> (denominator, d1_out, d2_out);

	// allocate host memory and copy results
	int *d1_out_host, *d2_out_host;
	cudaMallocHost((void**)&d1_out_host, blocks * threads_per_block  * sizeof(int));
	cudaMallocHost((void**)&d2_out_host, blocks * threads_per_block * sizeof(int));

	cudaMemcpy(d1_out, d1_out_host, blocks * threads_per_block * sizeof(int), cudaMemcpyDeviceToHost);
	cudaMemcpy(d2_out, d2_out_host, blocks * threads_per_block * sizeof(int), cudaMemcpyDeviceToHost);

	// display results
	std::cout << "d1" << std::endl;
	for (int i = 0; i < blocks * threads_per_block; ++i) std::cout << d1_out_host[i] << std::endl;
	std::cout << std::endl << "d2" << std::endl;
	for (int i = 0; i < blocks * threads_per_block; ++i) std::cout << d2_out_host[i] << std::endl;

	// cleanup device memory
	cudaFree(d1_out);
	cudaFree(d2_out);

	// cleanup host memory
	cudaFreeHost(d1_out_host);
	cudaFreeHost(d2_out_host);
}

int main()
{
	division_test();
	return 0;
}

njuffa · September 5, 2017, 6:54pm

Wrong order of arguments in the calls to cudaMemcpy(). You want:

cudaMemcpy(d1_out_host, d1_out, blocks * threads_per_block * sizeof(int), cudaMemcpyDeviceToHost);
cudaMemcpy(d2_out_host, d2_out, blocks * threads_per_block * sizeof(int), cudaMemcpyDeviceToHost);
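One way to tell a wrong kernel from a wrong copy is to compare the copied-back arrays against values computed on the host. The following is a minimal sketch in plain C++ with no CUDA calls; `first_mismatch` is a hypothetical helper name, and it assumes a single-block launch like the one in the test program, so that the global thread id equals threadIdx.x.

```cpp
#include <cassert>

// Hypothetical helper: returns the first position where the copied-back
// results differ from the host-computed reference, or -1 if all match.
// Assumption: single-block launch, so element i was written by thread i.
inline int first_mismatch(const int* d1_host, const int* d2_host,
                          int count, int denominator) {
    assert(denominator > 0);
    for (int i = 0; i < count; ++i) {
        if (d1_host[i] != i / denominator || d2_host[i] != i % denominator) {
            return i;
        }
    }
    return -1;
}
```

For an all-zero buffer such as the output reported above, with a denominator of 10, this returns 1, because element 1 should hold a remainder of 1. A correctly copied result returns -1.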

Tuee22 · September 5, 2017, 7:46pm

Well that’s embarrassing :p That said, I still have a problem in my actual code, and I’m having some trouble reproducing it in a smaller example. I’ll take another hack at it and post again shortly.

One challenge is that the VS Nsight debugger is giving me “has no value at the target location” which is preventing me from seeing what’s going on. I can’t find any recent posts explaining how to work around that. Any suggestions?

njuffa · September 5, 2017, 7:56pm

For use with the debugger, you need to use debug builds. Are you doing that?

For debug builds, the CUDA compiler turns off all optimizations. This forces source code variables to reside in memory locations where the debugger knows where to grab them. In optimized builds, a given variable from the source code may not actually exist anywhere, or could be moved through n different registers in the course of a single loop iteration, so that the debugger has no clue where to grab the data.

I often debug by inserting printf() calls, and logging the results. That is a habit I developed when programming embedded systems for which no dedicated debugger existed. This is how I established in ten seconds that your kernel was working correctly.

Tuee22 · September 5, 2017, 8:04pm

I am doing a debug build with optimization explicitly turned off, and still I am getting “has no value at the target location”. Declaring the variable volatile makes it show up as a 0 in the watch, except that’s clearly not right in all cases.

I will try printf.

njuffa · September 5, 2017, 8:11pm

Not sure what this means. Can you show the exact nvcc commandline invocation? Do you have -g -G as part of the nvcc commandline? Normally that should take care of it. Do you have multiple, contradictory flags, by any chance?

Tuee22 · September 6, 2017, 2:53am

Copied from the VS console:

C:\C++\Projects\test-projects\mnist\mnist\mnist>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin\nvcc.exe" -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\x86_amd64" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include" -G --keep-dir x64\Debug -maxrregcount=0 --machine 64 --compile -gencode=arch=compute_50,code="sm_50,compute_50" -g -DWIN32 -DWIN64 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /FS /Zi /RTC1 /MDd " -o x64\Debug\kernels.cu.obj "C:\C++\Projects\test-projects\mnist\mnist\mnist\kernels.cu" -clean

The only manual modification I’ve made here is -gencode=arch=compute_50,code="sm_50,compute_50", to avoid the “The ‘compute_20’, ‘sm_20’, and ‘sm_21’ architectures are deprecated” warning. Everything else is VS generated.

njuffa · September 6, 2017, 3:39am

Yup, looks like a debug build to me. Is the variable you can’t grab with the debugger possibly something like

const float pi = 3.14159265358979323f;

If I recall correctly, under C/C++ rules no storage needs to be allocated for such a variable unless its address is taken, meaning it essentially turns into a literal floating-point constant, even before any optimization happens.

It is also possible that the compiler doesn’t sufficiently pessimize the code. If you think there is a bug, you could always file a bug report. Use of the volatile attribute is a valid workaround.
