Thread block size and data size problem

Hi, all. I’m new to CUDA, and I have a problem with thread block size versus data size.

Below is my code.

And my question is:

If I set my data size to 3 by 3, the thread block to 4 by 4, and the grid to 1 by 1, how many threads will actually run on my device? Is it 4 by 4 or 3 by 3?

And if the number of threads is 4 by 4, does that mean I will get an out of range error when running on the GPU, because the data size is only 3 by 3?

I did a little testing in emulation mode, and the CPU gave me the error.

So if my data size is smaller than the number of threads, what should I do to avoid the out of range error? Thanks!

#include <cstdio>
#include <cuda_runtime.h>

__global__ void assignNumber(float* d_out, int dim)
{
	int x=blockIdx.x*blockDim.x+threadIdx.x;
	int y=blockIdx.y*blockDim.y+threadIdx.y;
	int pos=y*dim+x;
	//printf("%d\t%d\t%d\t%d\t%d\t%d\t%d\n",gridDim.x, gridDim.y, blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y, pos);
	// Every thread stores its linear index; x and y are not checked against dim
	d_out[pos]=pos;
}

int main()
{
	int dim=3;
	float* d_out;
	cudaMalloc((void**)&d_out, dim*dim*sizeof(float));
	dim3 thread(4,4);
	dim3 grid( (dim-1)/thread.x+1, (dim-1)/thread.y+1);
	assignNumber<<<grid, thread>>>(d_out,dim);
	float* h_out=new float[dim*dim];
	cudaMemcpy(h_out, d_out, dim*dim*sizeof(float), cudaMemcpyDeviceToHost);

	//Check if pos equals the value
	bool rst=true;
	for(int i=0;i<dim;i++)
	{
		for(int j=0;j<dim;j++)
		{
			int pos=i*dim+j;
			if(h_out[pos]!=pos) { rst=false; break; }
		}
		if(!rst) break;
	}
	delete[] h_out;
	cudaFree(d_out);
	return 0;
}


CUDA doesn’t know anything about the size of your data, only the block and grid sizes you specify. In this case, 16 threads in a 4x4 arrangement will run.
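To see this for yourself, here is a minimal sketch (not from the original post) that counts threads with two atomic counters. The names countThreads and runCount are made up for illustration, and it assumes a device that supports atomicAdd on global memory (compute capability 1.1 or later).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every launched thread bumps counters[0]; threads whose (x, y) lies inside
// the dim x dim data also bump counters[1].
__global__ void countThreads(int* counters, int dim)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    atomicAdd(&counters[0], 1);
    if (x < dim && y < dim)
        atomicAdd(&counters[1], 1);
}

// Hypothetical helper: out[0] = threads launched, out[1] = threads inside the data.
void runCount(int dim, int out[2])
{
    dim3 thread(4, 4);
    dim3 grid((dim - 1) / thread.x + 1, (dim - 1) / thread.y + 1);

    int* d_counters;
    cudaMalloc((void**)&d_counters, 2 * sizeof(int));
    cudaMemset(d_counters, 0, 2 * sizeof(int));

    countThreads<<<grid, thread>>>(d_counters, dim);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(out, d_counters, 2 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_counters);
}

int main()
{
    int out[2];
    runCount(3, out);
    printf("launched: %d, inside the data: %d\n", out[0], out[1]);
    return 0;
}
```

With dim set to 3 and a single 4x4 block, this should report 16 launched threads, of which 9 fall inside the 3 by 3 data.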

Quite probably yes.

One way is to pass the size and have threads which would operate outside the valid size do nothing (put a big if statement around the kernel operations). The other way would be to make sure you always match data and block/grid dimensions, which would imply padding your input and just ignoring the results that are not needed. Both should work.
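As a rough sketch of the first approach (passing the size and guarding the work), something like the following should do. It is an illustration only, assuming the same 4x4 block and rounded-up grid as in the question; assignNumberGuarded and runGuarded are names invented here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void assignNumberGuarded(float* d_out, int dim)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Threads that fall outside the dim x dim data do nothing.
    if (x < dim && y < dim)
    {
        int pos = y * dim + x;
        d_out[pos] = (float)pos;
    }
}

// Hypothetical helper: runs the kernel and checks that element pos holds pos.
bool runGuarded(int dim)
{
    float* d_out;
    cudaMalloc((void**)&d_out, dim * dim * sizeof(float));

    dim3 thread(4, 4);
    dim3 grid((dim - 1) / thread.x + 1, (dim - 1) / thread.y + 1); // round up
    assignNumberGuarded<<<grid, thread>>>(d_out, dim);

    float* h_out = new float[dim * dim];
    cudaMemcpy(h_out, d_out, dim * dim * sizeof(float), cudaMemcpyDeviceToHost);

    bool ok = true;
    for (int pos = 0; pos < dim * dim; ++pos)
        if (h_out[pos] != (float)pos)
            ok = false;

    delete[] h_out;
    cudaFree(d_out);
    return ok;
}

int main()
{
    printf("3 x 3 data:   %s\n", runGuarded(3) ? "ok" : "mismatch");
    printf("10 x 10 data: %s\n", runGuarded(10) ? "ok" : "mismatch");
    return 0;
}
```

The padding approach would instead allocate the buffer rounded up to a multiple of the block size in each dimension, so the kernel needs no test, and the host simply ignores the extra elements.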


Which way is better in general? If statements are more flexible, but how big is the performance impact of an if statement?

I also have a problem with data size. I think I understand the out of range error, however, and do not think that is the cause.

I am transferring a linear array of size 9*3*3 to the device, where it is manipulated algebraically in matrix form (using the thread IDs as indices). The input array is copied into shared memory. Two major operations, each accessing the whole array, run on the device, and the output is then transferred back to the host through the original linear array.

The problem is that I am trying to increase the size of the array beyond 9*3*3 to something like 9*7*7 (within the limit of 512 threads per block). I am only using single-precision values, so I think I am currently using only 9*3*3*4 = 324 bytes, which is much less than the 16 KB maximum of shared memory. For now I am working within a single block (dimBlock(4,4,9)) in a single grid.
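The arithmetic above can be checked against what the device actually reports. The sketch below is only an illustration: threadsIn, sharedBytesFor and fitsInOneBlock are made-up helper names, the block shapes (3,3,9) and (7,7,9) are my assumption for the 9*3*3 and 9*7*7 cases, and the byte count covers one float per thread only, so any other shared arrays in the kernel come on top of it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Number of threads in one block of the given shape.
size_t threadsIn(dim3 block)
{
    return (size_t)block.x * block.y * block.z;
}

// Bytes of shared memory needed for n single-precision values.
size_t sharedBytesFor(size_t n)
{
    return n * sizeof(float);
}

// Hypothetical helper: does one block of this shape, with this much shared
// memory, stay inside the limits reported for device 0?
bool fitsInOneBlock(dim3 block, size_t sharedBytes)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return false;
    return threadsIn(block) <= (size_t)prop.maxThreadsPerBlock
        && block.x <= (unsigned)prop.maxThreadsDim[0]
        && block.y <= (unsigned)prop.maxThreadsDim[1]
        && block.z <= (unsigned)prop.maxThreadsDim[2]
        && sharedBytes <= prop.sharedMemPerBlock;
}

int main()
{
    dim3 shapes[] = { dim3(3, 3, 9), dim3(4, 4, 9), dim3(7, 7, 9) };
    for (int k = 0; k < 3; ++k)
    {
        size_t n = threadsIn(shapes[k]);
        size_t bytes = sharedBytesFor(n);
        printf("%u x %u x %u: %zu threads, %zu bytes of shared memory, fits: %s\n",
               shapes[k].x, shapes[k].y, shapes[k].z, n, bytes,
               fitsInOneBlock(shapes[k], bytes) ? "yes" : "no");
    }
    return 0;
}
```

For the three shapes this works out to 81, 144 and 441 threads, and 324, 576 and 1764 bytes, so none of these sizes comes close to a 512-thread or 16 KB limit by itself.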

When I increase the array to 9*4*4, the CUDA functions do nothing, and the values I passed in are returned unchanged.
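One possible explanation for getting the input back unchanged is that the kernel never ran: a failed launch leaves device memory untouched and is silent unless the error is queried. Below is a small sketch of checking for that, assuming a toolkit recent enough to provide cudaDeviceSynchronize. The kernel touch and the helpers launchSucceeded and tryLaunch are stand-ins invented for this example, not part of the code in question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel: each thread adds 1 to its own element.
__global__ void touch(float* v)
{
    int idx = threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);
    v[idx] += 1.0f;
}

// Hypothetical helper: reports a launch error or, failing that, an execution error.
bool launchSucceeded(const char* label)
{
    cudaError_t err = cudaGetLastError();      // bad configuration, etc.
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();         // errors while the kernel ran
    if (err != cudaSuccess)
    {
        fprintf(stderr, "%s failed: %s\n", label, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Hypothetical helper: launches one block of the given shape and checks it.
bool tryLaunch(dim3 block)
{
    size_t n = (size_t)block.x * block.y * block.z;
    float* d_v;
    if (cudaMalloc((void**)&d_v, n * sizeof(float)) != cudaSuccess)
        return false;
    cudaMemset(d_v, 0, n * sizeof(float));

    touch<<<1, block>>>(d_v);
    bool ok = launchSucceeded("touch");

    cudaFree(d_v);
    return ok;
}

int main()
{
    printf("4 x 4 x 9 block:   %s\n", tryLaunch(dim3(4, 4, 9)) ? "ran" : "did not run");
    // 4096 threads in one block is more than any current device allows.
    printf("64 x 64 x 1 block: %s\n", tryLaunch(dim3(64, 64, 1)) ? "ran" : "did not run");
    return 0;
}
```

Putting the same check right after the real kernel launch would show whether the launch is being rejected or whether the kernel runs and fails later.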

I am relatively new to CUDA and do not have a proper computer science background, so any help would be greatly appreciated. Thanks.

__global__ void Cudacomp(float *vek, int *i, int *j)
{
	int x = threadIdx.x;
	int y = threadIdx.y;
	int z = threadIdx.z;
	int Dx = blockDim.x;
	int Dy = blockDim.y;

	__shared__ float ra[4], ra2[4];
	__shared__ float rua[4],rua2[4];
	__shared__ float rva[4],rva2[4];
	__shared__ float rea[4],rea2[4];
	__shared__ float pa[4], pa2[4];
	__shared__ float qs[4], qs2[4];
	__shared__ float fs1[4],fs12[4];
	__shared__ float fs2[4],fs22[4];
	__shared__ float fs3[4],fs32[4];
	__shared__ float fs4[4],fs42[4];
	__shared__ float cvek[144];//

	int ya = 0;
	int xa = 0;
	int xb = 4;
	int yb = 4;

	// copy input vector into shared memory for manipulation
	cvek[x+Dx*y+z*Dx*Dy]= vek[x+Dx*y+z*Dx*Dy];

	// erase input vector to be used as output vector
	vek[x+Dx*y+z*Dx*Dy] = 0.0;

	for (y=ya+1;y<yb;++y){
		///…generous algebra…//
	}

	for (x=xa+1;x<xb;++x){
		///…generous algebra…///
	}
}

extern "C" void cudaf_(int *i,int *j, float *h_vek)
{
	int ie = *i;
	int je = *j;
	int size1 = 9*ie*je*sizeof(float);
	int size2 = 4*ie*je*sizeof(float);
	float *d_vek;
	cudaMalloc((void **) &d_vek , size1);

	cudaMemcpy(d_vek, h_vek, size1, cudaMemcpyHostToDevice);
	dim3 dimBlock(4,4,9);
	Cudacomp<<<1,dimBlock>>> (d_vek, i, j);
	cudaMemcpy(h_vek, d_vek, size2, cudaMemcpyDeviceToHost);