Understanding Threads in CUDA: help me find the exact number of threads for my code

biebo · July 9, 2009, 10:03am

Hi,

I am having trouble understanding threads, so I am providing the code I am working on. My card is a 9200M GS with compute capability 1.1, which I take to mean 768 threads.

Q1.

In the code I have 2D arrays of size NxN, which I copy from host to device and then back to host.

The problem is that when I set N > 768, there are unexpected values in the last row of my results. Most important of all, my GNOME desktop crashes.

I am providing working code. With N = 768, blocks = 2, threads per block = 384, it works. With N = 768, blocks = 3, threads per block = 256, it also works.

But if I set N = 1024, blocks = 4, threads per block = 256, GNOME crashes and the last rows contain garbage.

I understand that I cannot spawn more than 768 threads on a compute capability 1.1 card, but is there another way to make the code work with N > 768? Do I have to change the kernel so that some threads do the work twice, e.g. if thread 1 operates on array[1], it also has to handle array[1+768]?

I have checked: it works with N = 800, but at N = 1000 it gives garbage, and now I can barely type because GNOME is showing a badly corrupted display (I am also using Compiz).

Q1.1. When I set threads per block = 500 and blocks = 2, what do you expect this statement in the kernel to return on a compute capability 1.1 card?

int idx = blockIdx.x * blockDim.x + threadIdx.x;

I would appreciate any help.

Q2. Is the kernel<<<…>>>() call a blocking statement, or does my host code continue to run independently?

#include <stdio.h>
#include <stdlib.h>

/* Each thread computes one full row of M3 = M1 * M2 (N x N, pitched storage). */
__global__ void multi( int *M1, int *M2, int *M3, size_t p_M1, size_t p_M2, size_t p_M3, int N)
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	int myrow = idx;

	/* Debug variant: fill every element of this thread's row with 20.
	if (idx < N) {
		int* row_M3 = (int*)((char*)M3 + myrow * p_M3);
		for (int i = 0; i < N; i++)
			row_M3[i] = 20;
	}*/

	/* Guard: threads whose index is past the last row must not touch memory. */
	if (idx < N) {
		/* Pitched rows: step by the pitch in bytes, then cast back to int*. */
		int* row_M3 = (int*)((char*)M3 + myrow * p_M3);
		int* row_M1 = (int*)((char*)M1 + myrow * p_M1);

		for (int j = 0; j < N; j++) {
			row_M3[j] = 0;
			for (int k = 0; k < N; k++) {
				int* row_M2 = (int*)((char*)M2 + k * p_M2);
				row_M3[j] += row_M1[k] * row_M2[j];
			}
		}
	}
	//__syncthreads();
}

int const N = 768;

int main(){
	/* pointers to host memory */
	int *Host_M1, *Host_M2, *Host_M3;
	/* pointers to device memory */
	int *GPU_M1, *GPU_M2, *GPU_M3;
	size_t pitch_M1, pitch_M2, pitch_M3;
	int i;

	/* Allocate 2D arrays on host */
	Host_M1 = (int*) malloc(N*N*sizeof(int));
	Host_M2 = (int*) malloc(N*N*sizeof(int));
	printf("OK mem 2d host\n ");

	/* Allocate 2D arrays on device; width is in bytes, height in rows */
	size_t width = N * sizeof(int);
	size_t height = N;
	cudaMallocPitch((void**)&GPU_M1, &pitch_M1, width, height);
	cudaMallocPitch((void**)&GPU_M2, &pitch_M2, width, height);
	cudaMallocPitch((void**)&GPU_M3, &pitch_M3, width, height);
	printf("OK mem2d cuda\n ");

	/* Initialize arrays M1 and M2 */
	for (i = 0; i < N*N; i++)
	{
		Host_M1[i] = 1;
		Host_M2[i] = 1;
	}
	printf("OK initialize\n\n\n\n\n ");

	/* Copy data from host memory to device memory */
	cudaMemcpy2D(GPU_M1, pitch_M1, Host_M1, width, width, height, cudaMemcpyHostToDevice);
	cudaMemcpy2D(GPU_M2, pitch_M2, Host_M2, width, width, height, cudaMemcpyHostToDevice);
	printf("OK  memcpy H to D\n ");

	// Invoke kernel
	// here the threads and blocks are structured in a linear way
	int threadsPerBlock = 256;
	// Round up so that blocksPerGrid * threadsPerBlock >= N (3 blocks for N = 768)
	int blocksPerGrid = (N + threadsPerBlock - 1)/threadsPerBlock;
	multi<<<blocksPerGrid,threadsPerBlock>>>(GPU_M1, GPU_M2, GPU_M3, pitch_M1, pitch_M2, pitch_M3, N);

	cudaError_t erro = cudaGetLastError();
	if (erro != cudaSuccess)
	{
		printf("ERROR PREP launch FAIL!\n");
	}
	printf("error is %s \n", cudaGetErrorString(erro));
	printf("OK Kernel\n ");

	Host_M3 = (int*) malloc(N*N*sizeof(int));
	cudaMemcpy2D(Host_M3, width, GPU_M3, pitch_M3, width, height, cudaMemcpyDeviceToHost);
	printf("OK memcp D to H\n ");
	printf("OK done\n");

	/* Print the last four rows of the result (rows 764..767 when N = 768) */
	for (i = N*(N-4); i < N*N; i++) {
		printf("%d(%d)  ", Host_M3[i], i%N);
		if (i%N == N-1)
			printf("\n");
	}

	// Time to free the memories
	free(Host_M1);
	free(Host_M2);
	free(Host_M3);
	printf("OK freeHost\n ");

	/* Free each device buffer once (the original freed GPU_M1 three times) */
	cudaFree(GPU_M1);
	cudaFree(GPU_M2);
	cudaFree(GPU_M3);
	printf("OK freeDevice\n ");

	return 0;
}

Please guide me.

buster100 · July 10, 2009, 2:58am

Hi, I’m having a very similar problem. It’s a simple code that adds two matrices and prints the result. If I use an NxN matrix with N=3040, everything seems fine, but if I try to raise it to N=3200, the .cu file compiles but when I run the executable my screen flickers and the program doesn’t work: it prints the same values I initialized the result matrix with. In other words, it is as if everything works except the kernel I defined!

I’m working on an 8600M GT GPU, compute capability 1.1. I defined dimBlock( 16, 16 ) and dimGrid( N/dimBlock.x, N/dimBlock.y ), so if N=3040 my grid is 190x190 with 16x16=256 threads per block, and the program prints “4.5” each time, which is the correct answer. But if I set N=3200 (grid 200x200, blocks still 16x16), everything goes wrong: the program prints “10” each time, which is the value the answer matrix has at the beginning. Any advice?

This is the code I’m using. It’s very simple because I’m just learning CUDA and I need to check that everything works. To compile I just use “nvcc AddMatrix.cu” and then run “./a.out”.

[codebox]#include <iostream>
#include <cstdlib>

using namespace std;

const int N = 3040;
const int blocksize = 16;

// Each thread adds one element; the guard skips threads outside the N x N matrix.
__global__
void add_matrix_gpu( float *a, float *b, float *c, int N )
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	int j = blockIdx.y * blockDim.y + threadIdx.y;
	int index = i + j*N;
	if ( i < N && j < N )
		c[index] = a[index] + b[index];
}

int main() {
	float *a = new float[N*N];
	float *b = new float[N*N];
	float *c = new float[N*N];

	for ( int i = 0; i < N*N; ++i )
	{
		a[i] = 1.0f; b[i] = 3.5f;
	}

	float *ad, *bd, *cd;
	const int size = N*N*sizeof(float);

	cudaMalloc( (void**)&ad, size );
	cudaMalloc( (void**)&bd, size );
	cudaMalloc( (void**)&cd, size );

	cudaMemcpy( ad, a, size, cudaMemcpyHostToDevice );
	cudaMemcpy( bd, b, size, cudaMemcpyHostToDevice );

	dim3 dimBlock( blocksize, blocksize );
	dim3 dimGrid( N/blocksize, N/blocksize );

	add_matrix_gpu<<<dimGrid, dimBlock>>>( ad, bd, cd, N );

	cudaMemcpy( c, cd, size, cudaMemcpyDeviceToHost );

	cudaFree( ad ); cudaFree( bd ); cudaFree( cd );

	// Print every 1000th element of the result
	for ( int i = 0; i < N; ++i )
	{
		for ( int j = 0; j < N; ++j )
		{
			int index = i + j*N;
			if ( index%1000 == 0 )
				cout << c[index] << endl;
		}
	}

	// Arrays allocated with new[] must be released with delete[]
	delete[] a; delete[] b; delete[] c;

	return EXIT_SUCCESS;
}[/codebox]

Thanks

MichaelC · July 10, 2009, 9:08am

Before my kernel calls I use

int blockSize = 512;
int nBlocks = cnt/blockSize + (cnt%blockSize == 0?0:1);

kernel_function <<< nBlocks, blockSize >>> (cnt);

where cnt is the number of items (threads) I want to process.

Then, in the kernel, it is important to add an “if (idx < cnt)” guard, as below:

__global__ void kernel_function(int cnt)
{
	const unsigned int idx = blockIdx.x*blockDim.x + threadIdx.x;

	if (idx < cnt)
	{
		// all your kernel code in here
	}
}

For my 9600GTX card, 512 is the maximum block size I can use. Any alteration to the setup above causes me problems, some similar to what you describe.
If you change your code examples to use variables in this way, maybe it will help.
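
As a rough host-side illustration of the pattern above (plain C++, no GPU needed), the sketch below computes the block count by rounding up and then emulates `blockIdx.x * blockDim.x + threadIdx.x` to count how many threads pass the `idx < cnt` guard. The names `blocksFor` and `countActiveThreads` are hypothetical helpers introduced for this example; they are not part of CUDA.

```cpp
#include <cassert>

// Hypothetical helper: smallest number of blocks such that
// blocks * blockSize >= cnt (same result as the expression in the post).
int blocksFor(int cnt, int blockSize) {
    assert(cnt >= 0 && blockSize > 0);
    return cnt / blockSize + (cnt % blockSize == 0 ? 0 : 1);
}

// Hypothetical helper: emulate a 1D launch on the host and count how many
// threads would pass the "if (idx < cnt)" guard.
int countActiveThreads(int nBlocks, int blockSize, int cnt) {
    int active = 0;
    for (int block = 0; block < nBlocks; ++block) {           // plays the role of blockIdx.x
        for (int thread = 0; thread < blockSize; ++thread) {  // plays the role of threadIdx.x
            int idx = block * blockSize + thread;             // blockDim.x == blockSize
            if (idx < cnt) {
                ++active;
            }
        }
    }
    return active;
}
```

With the numbers from this thread, `blocksFor(768, 256)` is 3 and `blocksFor(1024, 256)` is 4. A launch of 2 blocks of 500 threads produces `idx` values 0 through 999, so with `cnt = 768` only 768 of the 1000 threads do any work; the guard is what keeps the remaining 232 from writing out of bounds.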

biebo · July 10, 2009, 9:44am

I ran your code. It works fine with 3040 and 3200, but, as you said, it gives zeros if I put N > 3040; with N = 4096 it gave me all zeros.

Following MichaelC’s suggestion, I changed the code a bit, but it still gives the same errors:

dim3 dimBlock( blocksize, blocksize );

dim3 dimGrid( N/dimBlock.x +  (N%dimBlock.x == 0?0:1) , N/dimBlock.y +  (N % dimBlock.y == 0?0:1));

add_matrix_gpu<<<dimGrid, dimBlock>>>( ad, bd, cd, N );
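
To see what grid this rounding produces, here is a small host-side sketch in plain C++. It assumes the launch limits usually quoted for compute capability 1.x devices (512 threads per block, 65535 blocks per grid dimension); those constants are assumptions of this example, and on real hardware they are better read from `cudaGetDeviceProperties`. `Grid2D`, `gridFor` and `fitsLimits` are hypothetical names.

```cpp
#include <cassert>

// Launch limits assumed here for compute capability 1.x devices; on real
// hardware, query them with cudaGetDeviceProperties instead of hard-coding.
const int kMaxThreadsPerBlock = 512;
const int kMaxGridDim = 65535;

struct Grid2D {
    int x;
    int y;
};

// Round-up division, same result as n/b + (n % b == 0 ? 0 : 1).
int ceilDiv(int n, int b) {
    assert(n >= 0 && b > 0);
    return (n + b - 1) / b;
}

// Hypothetical helper: grid needed to cover an N x N matrix with square blocks.
Grid2D gridFor(int N, int blocksize) {
    Grid2D g = { ceilDiv(N, blocksize), ceilDiv(N, blocksize) };
    return g;
}

// True if the configuration stays within the assumed limits.
bool fitsLimits(Grid2D g, int blocksize) {
    return blocksize * blocksize <= kMaxThreadsPerBlock
        && g.x <= kMaxGridDim
        && g.y <= kMaxGridDim;
}
```

With a block size of 16, N = 3040 gives a 190x190 grid, N = 3200 gives 200x200 and N = 4096 gives 256x256. If the assumed limits hold for the card in question, all three configurations fit, so it may be worth checking the return value of every `cudaMalloc`, `cudaMemcpy` and the launch itself to see which call reports an error.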

uw_mike · July 13, 2009, 6:36pm

Hi, I had the same issue. You’re receiving the 0’s because (I think) your grid size is too big. I played with my dimGrid and dimBlock.

For me, I needed 3,000,000 threads, and since the block size is fixed at 512, that gives me 3,000,000/512 = 5859.4, rounded up to 5860 blocks. So I have kernel <<<5860, 512>>> (args).

mike
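
The first post asks whether a thread that handles array[1] could also handle array[1+768]. That pattern is commonly called a grid-stride loop, and it lets a fixed launch such as <<<3, 256>>> or <<<5860, 512>>> cover any number of items. The sketch below only emulates the indexing on the host in plain C++; `emulateGridStride` and `coveredExactlyOnce` are hypothetical names, and whether this approach resolves the crashes reported in this thread is not established here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of a grid-stride loop: a fixed launch of
// nBlocks * blockSize threads covers n items, each thread stepping by the
// total thread count. Returns how many times each item was processed.
std::vector<int> emulateGridStride(int nBlocks, int blockSize, int n) {
    assert(nBlocks > 0 && blockSize > 0 && n >= 0);
    std::vector<int> visits(n, 0);
    const int stride = nBlocks * blockSize;  // gridDim.x * blockDim.x in a kernel
    for (int block = 0; block < nBlocks; ++block) {
        for (int thread = 0; thread < blockSize; ++thread) {
            // In a kernel, i would start at blockIdx.x * blockDim.x + threadIdx.x.
            for (int i = block * blockSize + thread; i < n; i += stride) {
                ++visits[i];
            }
        }
    }
    return visits;
}

// True if every item was processed exactly once.
bool coveredExactlyOnce(const std::vector<int>& visits) {
    for (std::size_t k = 0; k < visits.size(); ++k) {
        if (visits[k] != 1) {
            return false;
        }
    }
    return true;
}
```

For example, `emulateGridStride(3, 256, 1024)` models 768 threads covering 1024 rows: the threads with indices 0 through 255 each handle two rows (`i` and `i + 768`), and the rest handle one. In a real kernel the two outer loops disappear and only the innermost `for` remains, with the `i < n` loop condition doing the job of the `if (idx < cnt)` guard.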
