Nsight Visual Studio Edition runs extremely slowly in debug mode

Hi, I’m using the newest Nsight (5.6) with VS2017.

I am running a simple matrix multiply example, and everything is fine when I run the program normally.

However, when I run it under the Nsight debugger, it takes extremely long, so long that I have never seen it finish.

I am using legacy mode with my GeForce 1060. A normal run takes about 600 ms, but the program has never managed to finish under debug mode.

Hello, do breakpoints work for you? It looks like it just hangs. Can you tell me your OS and driver version? I suspect that performing a clean installation of your driver may solve this issue.

Hey, I’m on Windows 10, and my driver is 398.11, which was installed by GeForce Experience.

Breakpoints work fine, and so do stepping and the other commands. But if I clear the breakpoints or start without any, it just keeps running and running.

However, I can press pause, and it does stop at the corresponding line of code.

Thank you.

Which line of code does it stop at? Actually, it’s impossible to always pause at the same CUDA code line. I think this may be the root cause. Just a guess: does your app stop at “__syncthreads();”?

The __global__ function looks like this:

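__global__ void CuMatrixMult(float *A, float *B, float *C)
{
	// One tile of A (N) and one tile of B (M) per thread block.
	__shared__ float N[BLOCKDIM][BLOCKDIM];
	__shared__ float M[BLOCKDIM][BLOCKDIM];
	int Tx = threadIdx.x;
	int Ty = threadIdx.y;
	int Bx = blockIdx.x;
	int By = blockIdx.y;
	int i, j;
	// Prefetch this thread's element of the first tile into registers.
	float rN = A[Ty * WIDTH + Bx * BLOCKDIM + Tx];
	float rM = B[(By*BLOCKDIM + Ty)*WIDTH + Tx];
	float sum = 0;

	for (i = 0; i < WIDTH / BLOCKDIM; i++) {
		N[Ty][Tx] = rN;
		M[Ty][Tx] = rM;
		// The post mentions two __syncthreads() calls; their exact
		// placement was lost and is assumed here.
		__syncthreads();	// tiles fully loaded before use
		for (j = 0; j < BLOCKDIM; j++) {
			rN = N[j][Tx];
			rM = M[Ty][j];
			sum += N[j][Tx] * M[Ty][j];
		}
		__syncthreads();	// tiles fully consumed before overwrite
		// Prefetch the next tile. On the last iteration these reads index
		// past WIDTH x WIDTH matrices; the values are never used.
		rN = A[((i + 1)*BLOCKDIM + Ty)*WIDTH + Bx * BLOCKDIM + Tx];
		rM = B[(By*BLOCKDIM + Ty)*WIDTH + (i + 1) * BLOCKDIM + Tx];
	}

	C[(By*BLOCKDIM + Ty)*WIDTH + Bx * BLOCKDIM + Tx] = sum;
}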

So I tried four times: twice it stopped at “rN = N[j][Tx];”, once at “sum += N[j][Tx] * M[Ty][j];”, and once at the second “__syncthreads();”.

So I do not really know whether there is a pattern or not…

Well, I actually waited for it to finish this time. It took 4 minutes, but the result is not right…
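One way to pin down "the result is not right" is to compare the GPU output against a plain CPU product. The following is only a minimal sketch in host-side C++ (it should also compile as part of a .cu file); cpuMatMul and maxAbsDiff are hypothetical helper names, not part of the posted code, and square row-major matrices are assumed.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: CPU reference product C = X * Y for n x n
// row-major matrices (an assumption about the data layout).
std::vector<float> cpuMatMul(const std::vector<float>& X,
                             const std::vector<float>& Y,
                             std::size_t n) {
    assert(X.size() == n * n && Y.size() == n * n);
    std::vector<float> C(n * n, 0.0f);
    for (std::size_t r = 0; r < n; ++r) {
        for (std::size_t k = 0; k < n; ++k) {
            const float x = X[r * n + k];
            for (std::size_t c = 0; c < n; ++c) {
                C[r * n + c] += x * Y[k * n + c];
            }
        }
    }
    return C;
}

// Hypothetical helper: largest absolute element-wise difference between
// two results; returns infinity if the sizes do not match.
double maxAbsDiff(const std::vector<float>& a, const std::vector<float>& b) {
    if (a.size() != b.size()) {
        return std::numeric_limits<double>::infinity();
    }
    double worst = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = std::fabs(static_cast<double>(a[i]) -
                                   static_cast<double>(b[i]));
        worst = std::max(worst, d);
    }
    return worst;
}
```

After copying C back to the host, the value of maxAbsDiff(hostC, reference) can be checked against a small tolerance, where hostC is a hypothetical name for the host copy. Note that, reading the posted indexing as row-major, the kernel appears to accumulate B[r][k] * A[k][c], so the matching reference may be cpuMatMul(hostB, hostA, WIDTH) rather than cpuMatMul(hostA, hostB, WIDTH); it is worth trying both orders before concluding the debugger changed the result.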

Can you try the default matrixMul sample from the CUDA samples?

Hi, sorry it took me some time. The results remain similar…

I don’t really know where the problem is…

Let me run some more tests when I am free…