Using BlockIdx As An Index

Newbie here, so please be gentle.

I am using CUDA 7.5 with a GTX 760 programming in C++.

I am launching a kernel like this:

kernel<<<2,1024>>>(parameters);

Based on this, I would expect that two blocks of 1024 threads each should be launched.
Further, within each block, the threads should be numbered 0-1023.

Thus, for the call above, I should have:
blockIdx.x = 0, threadIdx.x = 0;
blockIdx.x = 1, threadIdx.x = 0;

However, when I run this with the kernel call as above, I never see a blockIdx.x that is anything other than 0.

Why is that?

Because you’ve made a mistake of some sort. It’s notoriously difficult to debug code that you can’t see.

As a relatively simple starting point, make sure you have no CUDA runtime errors (google “proper cuda error checking” and take the first hit), and also run your code with cuda-memcheck.
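
The pattern from that search result boils down to a macro along these lines (a rough sketch; the macro name and error messages are just illustrative):

#include <stdio.h>
#include <stdlib.h>

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            exit(1); \
        } \
    } while (0)

You would then check after the launch, and again after synchronizing:

kernel<<<2,1024>>>(parameters);
cudaCheckErrors("kernel launch failed");
cudaDeviceSynchronize();
cudaCheckErrors("kernel execution failed");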

After that, provide a simple reproducer. If the problem can be deduced from the amount of information you have provided, then a simple reproducer should be trivial to create: a kernel with a single printf statement in it, and nearly nothing in terms of host code except the kernel call and a cudaDeviceSynchronize call.

I claim this test case matches the information you have provided, but it does not reproduce the issue you describe:

$ cat t97.cu
#include <stdio.h>

__global__ void kernel(){

  if (threadIdx.x == 0) printf("I am thread 0 in block %d\n", blockIdx.x);
}

int main(){

  kernel<<<2,1024>>>();
  cudaDeviceSynchronize();
}

$ nvcc -arch=sm_30 -o t97 t97.cu
$ cuda-memcheck ./t97
========= CUDA-MEMCHECK
I am thread 0 in block 0
I am thread 0 in block 1
========= ERROR SUMMARY: 0 errors
$

Feel free to run this code on your system. Feel free to modify it if necessary to produce the observation that you are confused about. Feel free to post a full test case like I have above, with a description of the observation that you don’t understand.

I suppose I deserved to be thrashed for that.
In my defense, I was trying to ask a very narrow question about the kernel call, which you kindly answered.

That said, I am not seeing the behavior you are.

The kernel call looks like:

bool GPUGenerate_segmentsFlat(int maxpoints, int numContours, int *onumElements, int *ocumElements, int *rnumElements, float *oContPerim, bool *oFinishedStatus, GPURec *GlobalMem)
{
	if(maxpoints <= DCE_THREADS_PER_BLOCK)
	{
		gpuDCESegsFlatShared<<<numContours,DCE_THREADS_PER_BLOCK>>>(numContours, onumElements, ocumElements, rnumElements, oContPerim, oFinishedStatus, GlobalMem);
		cudaDeviceSynchronize();

		return(true);
	}

	return(true);
}

The kernel itself, looks like:

__global__ void gpuDCESegsFlatShared(int numContours, int *onumElements, int *ocumElements, int *rnumElements, float *oContPerim, bool *oFinishedStatus, GPURec *GlobalMem)
{
//	__shared__ GPURec sdata[1024];
//	__shared__ bool fdata[1024];
//	__shared__ int ndata[1024];
//	__shared__ bool globalend;

	int seg1x, seg1y, seg2x, seg2y;
	int tidx = threadIdx.x;
	int bidx = blockIdx.x;
	int gidx = gridDim.x;

	int ttest = 0;

	// debug scaffolding: a breakpoint target for the gidx test
	if(gidx != 0)
		ttest++;

// various bits of initialization code are here


	if(tidx == 0)
	{
		if(bidx != 0)
		{
			fdata[bidx] = false;
			ndata[bidx] = rnumElements[bidx];
		}
		else
		{
			fdata[bidx] = false;
			ndata[bidx] = rnumElements[bidx];
		}
	}

// real function code is in here

	__syncthreads();

}

Here, I know/checked that numContours = 2 and DCE_THREADS_PER_BLOCK = 1024.

When I put breakpoints on either the gidx test or the bidx != 0 branch, I never reach them.
As far as I can tell, I never see a blockIdx.x that is non-zero.
For testing purposes, I commented out the __shared__ declarations to see if they might have something to do with things. Still, no effect.

Sorry, didn’t realize it was thrashing.

What you’ve provided is a lot less useful than providing a minimal, complete test case.

Did you do anything at all that I suggested?

  1. run your code with cuda-memcheck
  2. try running my code to see if it breaks
  3. provide a complete test case

I see no evidence of any of that in your response.

The first one is really not that difficult.

have you verified that maxpoints is less than or equal to DCE_THREADS_PER_BLOCK? Equivalently, have you run a profiler to confirm that gpuDCESegsFlatShared is actually getting launched at all?
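
For example, running the app under nvprof from a command prompt will list every kernel that actually launched (the executable name here is just a placeholder):

$ nvprof my_app.exe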

I am using VS2012 and NSight 5.2.

The code was run from within VS2012 using the NSight debugger with “Enable CUDA Memory Checker” turned on.

I know/checked that maxpoints is < 1024.
Nevertheless, I commented out this condition and just called the kernel to eliminate this as a potential issue. Further, I hardcoded the kernel call with “2, 1024” to eliminate any variable as an issue. No effect.

Further, breakpoints set at the __syncthreads() call or the “else” clause when bidx==0 are hit.
Thus, I know that the kernel is being called.

As my own code is so minimal (just what you see), it appeared to be a good test case.

Nevertheless, I also tried this:

__global__ void gpuKernel(void)
{
	bool temp = false;

	if(threadIdx.x == 0)
	{
		if(blockIdx.x == 0)
			temp = false;
		else
			temp = true;
	}
}

bool minimalKernelCall(void)
{
		gpuKernel<<<2,1024>>>();
		return(true);
}

A breakpoint set for when blockIdx.x == 0 is hit. A breakpoint set at “temp = true” is not hit. Indeed, NSight says that it will never be hit.

I was thinking that I should be using the grid dimensions, but gridIdx.x is also always 0.

In the code that you have now shown, add this line to the very end of your kernel code, right before the closing curly-braces:

if (threadIdx.x == 0) printf("blockIdx.x = %d\n", blockIdx.x);

and modify your host code to add:

cudaDeviceSynchronize();

immediately after the kernel call. Then recompile/run and indicate what is printed out.
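
Put together with the minimal kernel you posted, it would look something like this:

#include <stdio.h>

__global__ void gpuKernel(void)
{
	bool temp = false;

	if(threadIdx.x == 0)
	{
		if(blockIdx.x == 0)
			temp = false;
		else
			temp = true;
	}
	// report the block index from thread 0 of each block
	if (threadIdx.x == 0) printf("blockIdx.x = %d\n", blockIdx.x);
}

bool minimalKernelCall(void)
{
	gpuKernel<<<2,1024>>>();
	cudaDeviceSynchronize();
	return(true);
}

With a <<<2,1024>>> launch, that should print blockIdx.x = 0 and blockIdx.x = 1, in either order.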

(There is no variable gridIdx.x, btw. You probably meant gridDim.x, but that variable can never be zero.)

Your claim that gridDim.x can never be zero got me investigating things.
I was actually seeing gridDim.x == 0.

Searches led me to this:
https://devtalk.nvidia.com/default/topic/468337/problem-with-threadidx-not-being-set-or-always-zero-/

My code used threadIdx, blockIdx, and gridDim to initialize the variables in their declarations.
Simply declaring the variables first, then separately initializing them, resulted in correct blockIdx.x values.
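
Concretely, using the variables from my kernel above, this form showed blockIdx.x stuck at 0 in the debugger:

	int tidx = threadIdx.x;
	int bidx = blockIdx.x;
	int gidx = gridDim.x;

while this form showed the correct values:

	int tidx, bidx, gidx;

	tidx = threadIdx.x;
	bidx = blockIdx.x;
	gidx = gridDim.x;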

Your code was not seeing this because you were reading the values directly. I had seen material saying that repeatedly reading the dimension variables slows things down, so I did it the way I did.

If NVidia is listening, there should be a GIANT RED FLAG about this (or, really, the bug should be eliminated altogether).

The way I declared the variables is the most natural way to do so, and what you would expect any programmer (at least one new to CUDA) to do.

I don’t seem to have any trouble with it:

$ cat t1329.cu
#include <stdio.h>

__global__ void kernel(){

  int tid = threadIdx.x;
  int bid = blockIdx.x;

  printf("tid: %d, bid: %d\n", tid, bid);
}

int main(){

  kernel<<<5,2>>>();
  cudaDeviceSynchronize();
}
$ nvcc -arch=sm_35 -o t1329 t1329.cu
$ cuda-memcheck ./t1329
========= CUDA-MEMCHECK
tid: 0, bid: 0
tid: 1, bid: 0
tid: 0, bid: 4
tid: 1, bid: 4
tid: 0, bid: 1
tid: 1, bid: 1
tid: 0, bid: 3
tid: 1, bid: 3
tid: 0, bid: 2
tid: 1, bid: 2
========= ERROR SUMMARY: 0 errors
$

It also wouldn’t explain your observations in comment #5, since you are not declaring variables that way in that example. You don’t ever seem to do any of the things I ask, so I’m going to give up now.