Understanding threads in CUDA: help me find the exact number of threads for my code


I am having trouble understanding threads, so I am providing the code I am working on. I have a card with compute capability 1.1 (a 9200M GS), which I understand means 768 threads.


In the code I have 2D arrays of size NxN, which I copy from host to device and then back to host.

The point is this: when I set N > 768, there are unpredictable values in the last row of my results. And most important of all, my GNOME session crashes.

I am providing working code below. With N = 768, blocks = 2, and 384 threads per block, it works. With N = 768, blocks = 3, and 256 threads per block, it also works.

But if I put N = 1024, blocks = 4, and 256 threads per block, my GNOME session crashes and the last rows contain garbage.

I understand that I cannot spawn more than 768 threads on a compute capability 1.1 card, but is there some other way to make the code work with N > 768? Do I have to change the kernel so that some threads do the same work twice, e.g. if thread 1 operates on array[1], it also handles array[1 + 768]?

I have checked: it works with N = 800, but at N = 1000 it gives garbage, and now I can barely type because my GNOME display is corrupted. (I am also using Compiz.)

Q1.1. When I put numberThreads = 500 and blocks = 2, what do you expect this statement in the kernel to return, on a compute capability 1.1 card?

int idx = blockIdx.x * blockDim.x + threadIdx.x;

Any help is appreciated.

Q2. Is the kernel<<<…>>>() call a blocking statement, or does my code on the host run independently?

#include <stdio.h>

/* Each thread computes one row of M3 = M1 * M2. The matrices were
   allocated with cudaMallocPitch, so rows are addressed through the
   pitch values p_M1, p_M2, p_M3 (given in bytes). */
__global__ void multi(int *M1, int *M2, int *M3,
                      size_t p_M1, size_t p_M2, size_t p_M3, int N)
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	int myrow = idx;

	if (idx < N) {
		int *row_M3 = (int*)((char*)M3 + myrow * p_M3);
		int *row_M1 = (int*)((char*)M1 + myrow * p_M1);

		for (int j = 0; j < N; j++) {
			row_M3[j] = 0;
			for (int k = 0; k < N; k++) {
				int *row_M2 = (int*)((char*)M2 + k * p_M2);
				row_M3[j] += row_M1[k] * row_M2[j];
			}
		}
	}
}

const int N = 768;

int main(){

	/* pointers to host memory */
	int *Host_M1, *Host_M2, *Host_M3;

	/* pointers to device memory */
	int *GPU_M1, *GPU_M2, *GPU_M3;

	size_t pitch_M1, pitch_M2, pitch_M3;

	int i;

	/* Allocate 2D arrays on the host */
	Host_M1 = (int*) malloc(N * N * sizeof(int));
	Host_M2 = (int*) malloc(N * N * sizeof(int));

	printf("OK mem 2d host\n");

	/* Allocate pitched 2D arrays on the device */
	size_t width  = N * sizeof(int);  /* row width in bytes */
	size_t height = N;                /* number of rows */

	cudaMallocPitch((void**)&GPU_M1, &pitch_M1, width, height);
	cudaMallocPitch((void**)&GPU_M2, &pitch_M2, width, height);
	cudaMallocPitch((void**)&GPU_M3, &pitch_M3, width, height);

	printf("OK mem2d cuda\n");

	/* Initialize arrays M1 and M2 */
	for (i = 0; i < N * N; i++){
		Host_M1[i] = 1;
		Host_M2[i] = 1;
	}

	printf("OK initialize\n");

	/* Copy data from host memory to device memory */
	cudaMemcpy2D(GPU_M1, pitch_M1, Host_M1, width, width, height, cudaMemcpyHostToDevice);
	cudaMemcpy2D(GPU_M2, pitch_M2, Host_M2, width, width, height, cudaMemcpyHostToDevice);

	printf("OK memcpy H to D\n");

	/* Invoke kernel: the threads and blocks are structured in a
	   linear (1D) way, one thread per row of the result */
	int threadsPerBlock = 256;
	int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

	multi<<<blocksPerGrid, threadsPerBlock>>>(GPU_M1, GPU_M2, GPU_M3,
	                                          pitch_M1, pitch_M2, pitch_M3, N);

	cudaError_t erro = cudaGetLastError();
	if (erro != cudaSuccess)
		printf("ERROR PREP launch FAIL!\n");

	printf("error is %s\n", cudaGetErrorString(erro));
	printf("OK Kernel\n");

	/* Copy the result back to the host */
	Host_M3 = (int*) malloc(N * N * sizeof(int));
	cudaMemcpy2D(Host_M3, width, GPU_M3, pitch_M3, width, height, cudaMemcpyDeviceToHost);

	printf("OK memcp D to H\n");
	printf("OK done\n");

	/* Print the last few rows */
	for (i = N * 764; i < N * N; i++){
		printf("%d(%d)  ", Host_M3[i], i % N);
		if (i % N == N - 1)
			printf("\n");
	}

	/* Time to free the memory */
	free(Host_M1);
	free(Host_M2);
	free(Host_M3);
	printf("OK freeHost\n");

	cudaFree(GPU_M1);
	cudaFree(GPU_M2);
	cudaFree(GPU_M3);
	printf("OK freeDevice\n");

	return 0;
}


Please guide me.

Hi, I'm having a very similar problem. It's a simple code that adds two matrices and prints the result. If I set an NxN matrix with N = 3040, everything seems fine, but if I raise it to N = 3200, the .cu file compiles, yet when I run the executable my screen flickers and the program doesn't work: it prints the same values I initialized the result matrix with. In other words, it is as if everything works except the kernel I defined!

I'm working on an 8600M GT GPU, compute capability 1.1. I defined dimBlock( 16, 16 ) and dimGrid( N/dimBlock.x, N/dimBlock.y ), so if N = 3040 my grid is 190x190 with each block 16x16 = 256 threads, and the program prints "4.5" every time, which is the correct answer. But if I set N = 3200 (grid 200x200, blocks still 16x16), everything goes wrong: the program prints "10" every time, which is the value the result matrix has at the beginning. Any advice?

This is the code I'm using. It's very simple because I'm just learning CUDA and need to check that everything works. To compile I just use "nvcc AddMatrix.cu" and then "./a.out".


#include <cstdio>

using namespace std;

const int N = 3040;

const int blocksize = 16;

/* Element-wise matrix addition: one thread per element */
__global__ void add_matrix_gpu( float *a, float *b, float *c, int N )
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	int j = blockIdx.y * blockDim.y + threadIdx.y;
	int index = i + j * N;

	if ( i < N && j < N )
		c[index] = a[index] + b[index];
}

int main() {

	float *a = new float[N * N];
	float *b = new float[N * N];
	float *c = new float[N * N];

	for ( int i = 0; i < N * N; ++i ) {
		a[i] = 1.0f;
		b[i] = 3.5f;
		c[i] = 10.0f;  /* initial value; printed if the kernel never ran */
	}

	float *ad, *bd, *cd;

	const int size = N * N * sizeof(float);

	cudaMalloc( (void**)&ad, size );
	cudaMalloc( (void**)&bd, size );
	cudaMalloc( (void**)&cd, size );

	cudaMemcpy( ad, a, size, cudaMemcpyHostToDevice );
	cudaMemcpy( bd, b, size, cudaMemcpyHostToDevice );

	dim3 dimBlock( blocksize, blocksize );
	dim3 dimGrid( N/blocksize, N/blocksize );

	add_matrix_gpu<<<dimGrid, dimBlock>>>( ad, bd, cd, N );

	cudaMemcpy( c, cd, size, cudaMemcpyDeviceToHost );

	cudaFree( ad ); cudaFree( bd ); cudaFree( cd );

	for ( int i = 0; i < N; ++i )
		for ( int j = 0; j < N; ++j ) {
			int index = i + j * N;
			printf( "%f ", c[index] );
		}

	delete[] a; delete[] b; delete[] c;

	return 0;
}

Before my kernel calls I use

int blockSize = 512;
int nBlocks = cnt/blockSize + (cnt%blockSize == 0?0:1);

kernel_function <<< nBlocks, blockSize >>> (cnt);

where cnt is the number of items (threads) I want to process.

Then, in the kernel, it is important to have an "if (idx < cnt)" guard, as below:

__global__ void kernel_function(int cnt)
{
	const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

	if (idx < cnt) {
		// all your kernel code in here
	}
}

For my 9600 GTX card, 512 is the maximum block size I can use; any alteration to the setup above causes me problems, some similar to what you describe.
If you change your code examples to use variables in the above way, maybe it will help.

I ran your code; it works fine with N = 3040 and N = 3200, but as you said it gives zeros for larger N. With N = 4096 it gave me all zeros.

Like MichaelC said, I changed the code a bit, but it still gives the same errors:

dim3 dimBlock( blocksize, blocksize );

dim3 dimGrid( N/dimBlock.x +  (N%dimBlock.x == 0?0:1) , N/dimBlock.y +  (N % dimBlock.y == 0?0:1));

add_matrix_gpu<<<dimGrid, dimBlock>>>( ad, bd, cd, N );

Hi, I had the same issue. You're receiving the 0's because (I think) your grid size is too big. I played with my dimGrid and dimBlock.

For me, I needed 3,000,000 threads, and since the block size is fixed at 512, that gives 3,000,000 / 512 ≈ 5859.4, rounded up to 5860 blocks. So I have kernel<<<5860, 512>>>(args).