Thread block size and data size problem

derekhe · February 5, 2010, 11:39am

Hi all. I’m new to CUDA and have a question about thread block size versus data size.

Below is my code.

And my question is:

If I set my data size to 3 by 3, the thread block to 4 by 4 and the grid to 1 by 1, how many threads will actually run on my device? 4 by 4 or 3 by 3?

And if the number of threads is 4 by 4, does that mean I will get an out-of-range error when running on the GPU, because the data size is only 3 by 3?

I did a little testing in emulation mode, and the CPU gave me the error.

So if my data size is smaller than the number of threads, what should I do to avoid the out-of-range error? Thanks!

#include <iostream>
#include <cuda_runtime.h>

using namespace std;

__global__ void assignNumber(float* d_out, int dim)
{
	// global 2D coordinates of this thread
	int x=blockIdx.x*blockDim.x+threadIdx.x;
	int y=blockIdx.y*blockDim.y+threadIdx.y;
	int pos=y*dim+x;
	//printf("%d\t%d\t%d\t%d\t%d\t%d\t%d\n",gridDim.x, gridDim.y, blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y, pos);
	// note: no check that x and y are inside dim
	d_out[pos]=pos;
}

int main()
{
	int dim=3;
	float* d_out;
	cudaMalloc((void**)&d_out, dim*dim*sizeof(float));

	dim3 thread(4,4);
	dim3 grid( (dim-1)/thread.x+1, (dim-1)/thread.y+1);
	assignNumber<<<grid, thread>>>(d_out,dim);

	float* h_out=new float[dim*dim];
	cudaMemcpy(h_out, d_out, dim*dim*sizeof(float), cudaMemcpyDeviceToHost);

	//Check if pos equals the value
	bool rst=true;
	for(int i=0;i<dim;i++)
	{
		for(int j=0;j<dim;j++)
		{
			int pos=i*dim+j;
			if(h_out[pos]!=pos)
			{
				rst=false;
				break;
			}
		}
		if(!rst) break;
	}

	if(rst)
	{
		cout<<"Success"<<endl;
	}
	else
	{
		cout<<"Failed"<<endl;
	}

	delete[] h_out;
	cudaFree(d_out);
	return 0;
}

avidday · February 5, 2010, 12:07pm

CUDA doesn’t know anything about the size of your data, only the block and grid sizes you specify. In this case 16 threads in a 4x4 arrangement will run.

Quite probably yes.

One way is to pass the size and have threads which would operate outside the valid size do nothing (put a big if statement around the kernel operations). The other way would be to make sure you always match data and block/grid dimensions, which would imply padding your input and just ignoring the results that are not needed. Both should work.
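For illustration, here is a minimal sketch of both suggestions, reusing the 4x4 block and the grid calculation from the original post. The names `assignGuarded`, `assignPadded` and `runBoth` are made up for this example, and error checking is left out to keep it short.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Option 1: pass the size and let out-of-range threads do nothing.
__global__ void assignGuarded(float* d_out, int dim)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < dim && y < dim) {
        int pos = y * dim + x;
        d_out[pos] = (float)pos;
    }
}

// Option 2: no guard. The buffer is padded to pitch x rows elements so that
// every launched thread owns a valid slot; the padding is ignored afterwards.
__global__ void assignPadded(float* d_out, int dim, int pitch)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    d_out[y * pitch + x] = (float)(y * dim + x);
}

// Hypothetical helper: runs both variants for a dim x dim problem and
// returns true if both give out[y][x] == y*dim + x inside the valid region.
bool runBoth(int dim)
{
    dim3 thread(4, 4);
    dim3 grid((dim - 1) / thread.x + 1, (dim - 1) / thread.y + 1);

    // Guarded: allocate exactly dim*dim elements.
    std::vector<float> h_a(dim * dim);
    float* d_a = nullptr;
    cudaMalloc((void**)&d_a, h_a.size() * sizeof(float));
    assignGuarded<<<grid, thread>>>(d_a, dim);
    cudaMemcpy(h_a.data(), d_a, h_a.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_a);

    // Padded: round each dimension up to the number of threads launched.
    int pitch = grid.x * thread.x;
    int rows  = grid.y * thread.y;
    std::vector<float> h_b(pitch * rows);
    float* d_b = nullptr;
    cudaMalloc((void**)&d_b, h_b.size() * sizeof(float));
    assignPadded<<<grid, thread>>>(d_b, dim, pitch);
    cudaMemcpy(h_b.data(), d_b, h_b.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_b);

    // Compare only the dim x dim region that holds real data.
    bool ok = true;
    for (int y = 0; y < dim; ++y) {
        for (int x = 0; x < dim; ++x) {
            float expected = (float)(y * dim + x);
            if (h_a[y * dim + x] != expected || h_b[y * pitch + x] != expected) {
                ok = false;
            }
        }
    }
    return ok;
}
```

Calling `runBoth(3)` from `main` launches 16 threads in both cases. With the guard, only 9 of them write; with padding, all 16 write into a 4x4 buffer and the host reads back only the 3x3 part it cares about.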

derekhe · February 5, 2010, 1:16pm

Thanks!

CapJo · February 5, 2010, 1:43pm

What is the better way in general? If statements are more flexible, but how big is the performance impact of an if statement?

klaunglu · February 5, 2010, 11:31pm

I also have a problem with data size. I think I understand the out of range error, however, and do not think that is the cause.

I am transferring a linear array of size 9*3*3 into the device to be algebraically manipulated in matrix form (by using the thread IDs). The input array is transferred to shared memory. Two major operations (each one accessing the memory of the whole array) occur in the device and then the output is transferred out to the host via the original linear array.

The problem I am having is that I am trying to increase the size of the array beyond 9*3*3 to something like 9*7*7 (within the 512 thread limit per block). I am only using single-precision numbers, so I think I am currently using only 9*3*3*4=324 bytes, which is much less than the maximum shared memory of 16 KB. I am currently working within only one block (dimBlock(4,4,9)) in a single grid.

The problem is that when the array is increased to 9*4*4, the CUDA function does not operate and the values come back exactly as they were input.

I am relatively new to CUDA and do not have a proper computer science background, so any help would be greatly appreciated. Thanks

[codebox]
__global__ void Cudacomp(float *vek, int *i, int *j)
{
    int x = threadIdx.x;
    int y = threadIdx.y;
    int z = threadIdx.z;
    int Dx = blockDim.x;
    int Dy = blockDim.y;

    __shared__ float ra[4], ra2[4];
    __shared__ float rua[4],rua2[4];
    __shared__ float rva[4],rva2[4];
    __shared__ float rea[4],rea2[4];
    __shared__ float pa[4], pa2[4];
    __shared__ float qs[4], qs2[4];
    __shared__ float fs1[4],fs12[4];
    __shared__ float fs2[4],fs22[4];
    __shared__ float fs3[4],fs32[4];
    __shared__ float fs4[4],fs42[4];
    __shared__ float cvek[144];//

    int ya = 0;
    int xa = 0;
    int xb = 4;
    int yb = 4;

    // copy input vector into shared memory for manipulation
    cvek[x+Dx*y+z*Dx*Dy]= vek[x+Dx*y+z*Dx*Dy];
    __syncthreads();

    // erase input vector to be used as output vector
    vek[x+Dx*y+z*Dx*Dy] = 0.0;
    __syncthreads();

    for (y=ya+1;y<yb;++y){
        ///…generous algebra…//
    }
    __syncthreads();

    for (x=xa+1;x<xb;++x){
        ///…generous algebra…///
    }
    __syncthreads();
}

extern "C" void cudaf_(int *i,int *j, float *h_vek)
{
    int ie = *i;
    int je = *j;
    int size1 = 9*ie*je*sizeof(float);
    int size2 = 4*ie*je*sizeof(float);

    float *d_vek;
    cudaMalloc((void **) &d_vek , size1);
    cudaMemset(d_vek,0,size1);
    cudaMemcpy(d_vek, h_vek, size1, cudaMemcpyHostToDevice);

    dim3 dimBlock(4,4,9);
    Cudacomp<<<1,dimBlock>>> (d_vek, i, j);

    cudaMemcpy(h_vek, d_vek, size2, cudaMemcpyDeviceToHost);
    cudaFree(d_vek);
}
[/codebox]
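The size arithmetic in this post can be checked mechanically. The sketch below reads the array sizes as 9·n·n elements (81 for n = 3, 144 for n = 4, 441 for n = 7) with one thread per element, and compares them against per-block limits. The helper names `fitsInOneBlock` and `fitsOnDevice` are invented for this example. It counts only the shared copy of the array, not the other small `__shared__` arrays in the kernel.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: does one block of 9*n*n threads, each staging one
// float in shared memory, stay within the given per-block limits?
bool fitsInOneBlock(int n, int maxThreadsPerBlock, size_t sharedBytesPerBlock)
{
    int threads = 9 * n * n;                          // one thread per array element
    size_t bytes = (size_t)threads * sizeof(float);   // shared copy of the array only
    return threads <= maxThreadsPerBlock && bytes <= sharedBytesPerBlock;
}

// Same check, using the limits reported by device 0 instead of fixed numbers.
bool fitsOnDevice(int n)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        return false;
    }
    return fitsInOneBlock(n, prop.maxThreadsPerBlock, prop.sharedMemPerBlock);
}
```

With the limits quoted in the post (512 threads and 16 KB per block), `fitsInOneBlock(3, 512, 16384)` and `fitsInOneBlock(7, 512, 16384)` both hold, while n = 8 would need 576 threads and does not fit. Passing this check only shows that the sizes are within budget; it does not by itself explain why the larger array fails.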
