Using BlockIdx As An Index

Newbie here, so please be gentle.

I am using CUDA 7.5 with a GTX 760 programming in C++.

I am launching a kernel like this:

kernel<<<2,1024>>>(parameters);

Based on this, I would expect that two blocks of 1024 threads each should be launched.
Further, within each block, the threads should be numbered 0-1023.

Thus, for the call above, I should have:
blockIdx.x = 0, threadIdx.x = 0;
blockIdx.x = 1, threadIdx.x = 0;

However, when I run this with the kernel call as above, I never see a blockIdx.x that is anything other than 0.

Why is that?

Because you’ve made a mistake of some sort. It’s notoriously difficult to debug code that you can’t see.

As a relatively simple starting point, make sure you have no CUDA runtime errors (google “proper cuda error checking” and take the first hit), and also run your code with cuda-memcheck.
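
The pattern from that search result boils down to a macro along these lines (a rough sketch; the macro name and error messages are just illustrative):

#include <stdio.h>
#include <stdlib.h>

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            exit(1); \
        } \
    } while (0)

You would then check after the launch, and again after synchronizing:

kernel<<<2,1024>>>(parameters);
cudaCheckErrors("kernel launch failed");
cudaDeviceSynchronize();
cudaCheckErrors("kernel execution failed");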

After that, provide a simple reproducer. If the problem can be deduced from the amount of information you have provided, then a simple reproducer should be trivial to create: a kernel with a single printf statement in it, and nearly nothing in terms of host code except the kernel call and a cudaDeviceSynchronize call.

I claim this test case matches the information you have provided, but it does not reproduce the issue you describe:

$ cat t97.cu
#include <stdio.h>

__global__ void kernel(){

  if (threadIdx.x == 0) printf("I am thread 0 in block %d\n", blockIdx.x);
}

int main(){

  kernel<<<2,1024>>>();
  cudaDeviceSynchronize();
}

$ nvcc -arch=sm_30 -o t97 t97.cu
$ cuda-memcheck ./t97
========= CUDA-MEMCHECK
I am thread 0 in block 0
I am thread 0 in block 1
========= ERROR SUMMARY: 0 errors
$

Feel free to run this code on your system. Feel free to modify it if necessary to produce the observation that you are confused about. Feel free to post a full test case like I have above, with a description of the observation that you don’t understand.

I suppose I deserved to be thrashed for that.
In my defense, I was trying to ask a very narrow question about the kernel call, which you kindly answered.

That said, I am not seeing the behavior you are.

The kernel call looks like:

bool GPUGenerate_segmentsFlat(int maxpoints, int numContours, int *onumElements, int *ocumElements, int *rnumElements, float *oContPerim, bool *oFinishedStatus, GPURec *GlobalMem)
{
	if(maxpoints <= DCE_THREADS_PER_BLOCK)
	{
		gpuDCESegsFlatShared<<<numContours,DCE_THREADS_PER_BLOCK>>>(numContours, onumElements, ocumElements, rnumElements, oContPerim, oFinishedStatus, GlobalMem);
		cudaDeviceSynchronize();

		return(true);
	}

	return(true);
}

The kernel itself, looks like:

__global__ void gpuDCESegsFlatShared(int numContours, int *onumElements, int *ocumElements, int *rnumElements, float *oContPerim, bool *oFinishedStatus, GPURec *GlobalMem)
{
//	__shared__ GPURec sdata[1024];
//	__shared__ bool fdata[1024];
//	__shared__ int ndata[1024];
//	__shared__ bool globalend;

	int seg1x, seg1y, seg2x, seg2y;
	int tidx = threadIdx.x;
	int bidx = blockIdx.x;
	int gidx = gridDim.x;

	int ttest = 0;

	// debug scaffolding: a breakpoint target for the gidx test
	if(gidx != 0)
		ttest++;

// various bits of initialization code are here


	if(tidx == 0)
	{
		if(bidx != 0)
		{
			fdata[bidx] = false;
			ndata[bidx] = rnumElements[bidx];
		}
		else
		{
			fdata[bidx] = false;
			ndata[bidx] = rnumElements[bidx];
		}
	}

// real function code is in here

	__syncthreads();

}

Here, I know/checked that numContours = 2 and DCE_THREADS_PER_BLOCK = 1024.

When I put breakpoints on either the gidx test or the bidx != 0 branch, I never reach them.
As far as I can tell, I never see a blockIdx.x that is non-zero.
For testing purposes, I commented out the __shared__ declarations to see if they might have something to do with things. Still, no effect.

Sorry, didn’t realize it was thrashing.

What you’ve provided is a lot less useful than providing a minimal, complete test case.

Did you do anything at all that I suggested?

  1. run your code with cuda-memcheck
  2. try running my code to see if it breaks
  3. provide a complete test case

I see no evidence of any of that in your response.

The first one is really not that difficult.

have you verified that maxpoints is less than or equal to DCE_THREADS_PER_BLOCK? Equivalently, have you run a profiler to confirm that gpuDCESegsFlatShared is actually getting launched at all?
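
For example, running the app under nvprof from a command prompt will list every kernel that actually launched (the executable name here is just a placeholder):

$ nvprof my_app.exe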

I am using VS2012 and NSight 5.2.

The code was run from within VS2012 using the NSight debugger with “Enable CUDA Memory Checker” turned on.

I know/checked that maxpoints is < 1024.
Nevertheless, I commented out this condition and just called the kernel to eliminate this as a potential issue. Further, I hardcoded the kernel call with “2, 1024” to eliminate any variable as an issue. No effect.

Further, breakpoints set at the __syncthreads() call or the “else” clause when bidx==0 are hit.
Thus, I know that the kernel is being called.

As my own code is so minimal (just what you see), it appeared to be a good test case.

Nevertheless, I also tried this:

__global__ void gpuKernel(void)
{
	bool temp = false;

	if(threadIdx.x == 0)
	{
		if(blockIdx.x == 0)
			temp = false;
		else
			temp = true;
	}
}

bool minimalKernelCall(void)
{
		gpuKernel<<<2,1024>>>();
		return(true);
}

A breakpoint set for when blockIdx.x == 0 is hit. A breakpoint set at “temp = true” is not hit. Indeed, NSight says that it will never be hit.

I was thinking that I should be using the grid dimensions, but gridIdx.x is also always 0.

In the code that you have now shown, add this line to the very end of your kernel code, right before the closing curly-braces:

if (threadIdx.x == 0) printf("blockIdx.x = %d\n", blockIdx.x);

and modify your host code to add:

cudaDeviceSynchronize();

immediately after the kernel call. Then recompile/run and indicate what is printed out.
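
Put together with the minimal kernel you posted, it would look something like this:

#include <stdio.h>

__global__ void gpuKernel(void)
{
	bool temp = false;

	if(threadIdx.x == 0)
	{
		if(blockIdx.x == 0)
			temp = false;
		else
			temp = true;
	}
	// report the block index from thread 0 of each block
	if (threadIdx.x == 0) printf("blockIdx.x = %d\n", blockIdx.x);
}

bool minimalKernelCall(void)
{
	gpuKernel<<<2,1024>>>();
	cudaDeviceSynchronize();
	return(true);
}

With a <<<2,1024>>> launch, that should print blockIdx.x = 0 and blockIdx.x = 1, in either order.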

(There is no variable gridIdx.x, btw. You probably meant gridDim.x, but that variable can never be zero.)

Your claim that gridDim.x can never be zero got me investigating things.
I was actually seeing gridDim.x == 0.

Searches led me to this:
https://devtalk.nvidia.com/default/topic/468337/problem-with-threadidx-not-being-set-or-always-zero-/

My code used threadIdx, blockIdx, and gridDim to initialize the variables in their declarations.
Simply declaring the variables first, then separately initializing them, resulted in correct blockIdx.x values.
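
Concretely, using the variables from my kernel above, this form showed blockIdx.x stuck at 0 in the debugger:

	int tidx = threadIdx.x;
	int bidx = blockIdx.x;
	int gidx = gridDim.x;

while this form showed the correct values:

	int tidx, bidx, gidx;

	tidx = threadIdx.x;
	bidx = blockIdx.x;
	gidx = gridDim.x;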

Your code was not seeing this because you were reading the values directly. I had seen material saying that repeatedly reading the dimension variables slows things down, so I did it the way I did.

If NVidia is listening, there should be a GIANT RED FLAG about this (or, really, the bug should be eliminated altogether).

The way I declared the variables is the most natural way to do so, and what you would expect any programmer (at least one new to CUDA) to do.

I don’t seem to have any trouble with it:

$ cat t1329.cu
#include <stdio.h>

__global__ void kernel(){

  int tid = threadIdx.x;
  int bid = blockIdx.x;

  printf("tid: %d, bid: %d\n", tid, bid);
}

int main(){

  kernel<<<5,2>>>();
  cudaDeviceSynchronize();
}
$ nvcc -arch=sm_35 -o t1329 t1329.cu
$ cuda-memcheck ./t1329
========= CUDA-MEMCHECK
tid: 0, bid: 0
tid: 1, bid: 0
tid: 0, bid: 4
tid: 1, bid: 4
tid: 0, bid: 1
tid: 1, bid: 1
tid: 0, bid: 3
tid: 1, bid: 3
tid: 0, bid: 2
tid: 1, bid: 2
========= ERROR SUMMARY: 0 errors
$

It also wouldn’t explain your observations in comment #5, since you are not declaring variables that way in that example. You don’t ever seem to do any of the things I ask, so I’m going to give up now.