Data is corrupted when I use more than ~200 threads

I wrote this minimal version of my code, which reproduces the same error:

#include <cmath>
#include <iostream>
#include <ctime>
#include <stdio.h>

#define NoOfthread 800
#define NoOfBlock 3

__global__ void simulate(double* q_global, double* u_global, double* qf_global, double* uf_global) {
	// Each thread owns 19 consecutive q values and 18 consecutive u values.
	int tid = blockDim.x * blockIdx.x + threadIdx.x;
	double q[19], u[18];

	// Copy this thread's slice of q from global memory into a local array.
	for (int i = 0; i < 19; i++)
		q[i] = q_global[tid * 19 + i];

	q[6] += threadIdx.x * 5.0 + 10;

	// Copy this thread's slice of u from global memory into a local array.
	for (int i = 0; i < 18; i++)
		u[i] = u_global[tid * 18 + i];

	// Write both local arrays back to the output buffers.
	for (int i = 0; i < 19; i++)
		qf_global[tid * 19 + i] = q[i];
	for (int i = 0; i < 18; i++)
		uf_global[tid * 18 + i] = u[i];
}

int main() {

	double *a, *b, *c, *d;
	double *d_a, *d_b;
	double *d_c, *d_d;

	a = (double *) malloc(19 * NoOfthread * NoOfBlock * sizeof(double));
	b = (double *) malloc(18 * NoOfthread * NoOfBlock * sizeof(double));
	c = (double *) malloc(19 * NoOfthread * NoOfBlock * sizeof(double));
	d = (double *) malloc(18 * NoOfthread * NoOfBlock * sizeof(double));

	for (int i = 0; i < 19 * NoOfthread * NoOfBlock; i++)
		a[i] = 0;

	for (int i = 0; i < 18 * NoOfthread * NoOfBlock; i++)
		b[i] = 0;

	cudaMalloc(&d_a, 19 * NoOfthread * NoOfBlock * sizeof(double));
	cudaMalloc(&d_b, 18 * NoOfthread * NoOfBlock * sizeof(double));
	cudaMalloc(&d_c, 19 * NoOfthread * NoOfBlock * sizeof(double));
	cudaMalloc(&d_d, 18 * NoOfthread * NoOfBlock * sizeof(double));

	cudaMemcpy(d_a, a, 19 * NoOfthread * NoOfBlock * sizeof(double), cudaMemcpyHostToDevice);
	cudaMemcpy(d_b, b, 18 * NoOfthread * NoOfBlock * sizeof(double), cudaMemcpyHostToDevice);

	simulate<<<NoOfBlock, NoOfthread>>>(d_a, d_b, d_c, d_d);
	cudaMemcpy(c, d_c, 19 * NoOfthread * NoOfBlock * sizeof(double), cudaMemcpyDeviceToHost);
	cudaMemcpy(d, d_d, 18 * NoOfthread * NoOfBlock * sizeof(double), cudaMemcpyDeviceToHost);

	for (int i = 0; i < 19; i++)
		printf("q0[%d]=%f ", i, c[i]);

	printf("\n");

	for (int i = 19; i < 38; i++)
		printf("q1[%d]=%f ", i, c[i]);


	free(a);
	free(b);
	free(c);
	free(d);

	cudaFree(d_a);
	cudaFree(d_b);
	cudaFree(d_c);
	cudaFree(d_d);


	return 0;
}

Basically, I copy two arrays to global memory, modify them in a kernel, and copy the results back to host memory. It works perfectly with fewer than about 200 threads per block, but when I use more than that, I get corrupted data. I am using one Titan X and Nsight Eclipse Edition.

Any time you are having trouble with CUDA code, you should use proper CUDA error checking. If you don’t know what that is, google “proper cuda error checking” and take the first hit.
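As a rough illustration of what such error checking usually looks like, here is a minimal sketch. The macro name CUDA_CHECK and the kernel fill are hypothetical names chosen for this example, not something taken from your code; the runtime calls themselves (cudaGetLastError, cudaDeviceSynchronize, cudaGetErrorString) are standard.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (the name CUDA_CHECK is an assumption):
// prints file/line information and exits if a runtime API call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Illustrative kernel: writes 1.0 to every element in range.
__global__ void fill(double* out, int n) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < n) out[idx] = 1.0;
}

int main() {
    const int n = 1024;
    double* d_out = nullptr;
    CUDA_CHECK(cudaMalloc(&d_out, n * sizeof(double)));

    fill<<<4, 256>>>(d_out, n);
    CUDA_CHECK(cudaGetLastError());       // reports launch failures
    CUDA_CHECK(cudaDeviceSynchronize());  // reports errors during execution

    CUDA_CHECK(cudaFree(d_out));
    printf("done\n");
    return 0;
}
```

The point of the two checks after the launch is that a kernel launch returns no status of its own, so a failed launch goes unnoticed unless you ask for the last error explicitly.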

You can also run CUDA programs under cuda-memcheck for a quick check.

If you use either of the above methods, you will find that your posted code reports the error “too many resources requested for launch”.

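This generally means you have a registers-per-thread issue: the kernel needs more registers per thread than the GPU can supply for a block of that size, so the launch fails and the kernel never runs. The output buffers are then copied back without ever having been written, which looks like corrupted data. If you google that error, you will find many explanations of what it is and how to address it.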
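If you want to see the numbers involved, the runtime can report them. The following is only a sketch: exampleKernel is a stand-in for whatever kernel you want to inspect, and regsPerThreadBudget is a hypothetical helper that gives a rough upper bound (it ignores register allocation granularity, so the real limit can be somewhat lower).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel for illustration only; substitute the kernel to inspect.
__global__ void exampleKernel(double* data) {
    data[blockDim.x * blockIdx.x + threadIdx.x] += 1.0;
}

// Hypothetical helper: rough upper bound on registers per thread for a
// given block size. Ignores allocation granularity.
int regsPerThreadBudget(int regsPerBlock, int threadsPerBlock) {
    return regsPerBlock / threadsPerBlock;
}

int main() {
    const int threadsPerBlock = 800;  // block size from the question

    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, exampleKernel) != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed\n");
        return 1;
    }

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed\n");
        return 1;
    }

    printf("registers per thread used by kernel:   %d\n", attr.numRegs);
    printf("registers available per block:         %d\n", prop.regsPerBlock);
    printf("rough per-thread budget at %d threads: %d\n", threadsPerBlock,
           regsPerThreadBudget(prop.regsPerBlock, threadsPerBlock));
    printf("max threads per block for this kernel: %d\n",
           attr.maxThreadsPerBlock);
    return 0;
}
```

As a worked example of the arithmetic, a device with 65536 registers per block leaves at most 81 registers per thread at 800 threads per block, but 327 at 200 threads. If attr.maxThreadsPerBlock comes out below the block size you launch with, the launch will fail.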

As a simple proof point (I’m not suggesting this should be the final fix), compile your code with the following additional compiler switch (there should be a project setting in Nsight Eclipse to enable or select it):

-maxrregcount 32

Then re-run it (with cuda-memcheck, if you wish).
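A per-kernel alternative to a global register cap is the __launch_bounds__ qualifier, which tells the compiler the largest block size a kernel will be launched with so that it limits register use accordingly. This is a sketch under assumptions, not a drop-in fix: copyBounded and launchSucceeds are hypothetical names, and the kernel body is a simplified stand-in for yours.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 800  // block size from the question
#define VALUES_PER_THREAD 19

// __launch_bounds__ declares the maximum block size for this kernel, so the
// compiler caps register use to keep a launch of that size possible.
__global__ void __launch_bounds__(THREADS_PER_BLOCK)
copyBounded(const double* in, double* out) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = 0; i < VALUES_PER_THREAD; i++)
        out[tid * VALUES_PER_THREAD + i] = in[tid * VALUES_PER_THREAD + i];
}

// Hypothetical helper: launches the kernel and reports whether both the
// launch and the execution completed without error.
bool launchSucceeds() {
    const int blocks = 3;
    const size_t bytes =
        sizeof(double) * VALUES_PER_THREAD * THREADS_PER_BLOCK * blocks;
    double* d_in = nullptr;
    double* d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemset(d_in, 0, bytes);

    copyBounded<<<blocks, THREADS_PER_BLOCK>>>(d_in, d_out);
    cudaError_t err = cudaGetLastError();            // launch failures
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // runtime failures
    if (err != cudaSuccess) fprintf(stderr, "%s\n", cudaGetErrorString(err));

    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess;
}

int main() {
    printf("%s\n", launchSucceeds() ? "launch ok" : "launch failed");
    return 0;
}
```

Unlike a compiler-wide switch, this only constrains the kernel it is attached to, which may matter if other kernels in the same file benefit from using more registers.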

Thank you for your reply. I am new to CUDA (this is my first CUDA code), so I didn’t know that these tools were available. I ran cuda-memcheck and got zero errors. I also tried -maxrregcount 32, but it didn’t change anything.

Then I rebooted and realized that the NVIDIA graphics driver was somehow damaged: I couldn’t log in. I reinstalled the NVIDIA driver and ran the same code without any changes, and it now seems to work fine.