Big grid size crash on GTX480

With dim3 block(32,12), running cuda_filter2D with a 4000x3000 float matrix took 23ms on my GTX480.

In the device query results I found: Maximum sizes of each dimension of a block: 1024 x 1024 x 64.

So I changed the block to dim3(320,320), and this time it took 0.4 ms. But all of the output is zero, which means the CUDA code did not execute at all!

Why does this happen?

#define BLOCKDIMX 32
#define BLOCKDIMY 12

#include "cutil_inline.h"

__constant__ float d_kn[25];

// Copies the source image into a padded buffer (row stride width + 5), offset by a 2-pixel border.
__global__ void pad_kernel(float *dst, float *src, int width, int height)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;

    dst[(iy + 2) * (width + 5) + ix + 2] = src[iy * width + ix];
}

// 5x5 convolution over the padded source, writing the unpadded result.
__global__ void filter_kernel(float *dst, float *src, int width, int height)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;

    if (ix < width && iy < height)
    {
        float sum = 0;
        for (int i = 0; i < 5; i++)
        {
            for (int j = 0; j < 5; j++)
            {
                sum += d_kn[i * 5 + j] * src[(iy + i) * (width + 5) + ix + j];
            }
        }
        sum *= 0.04f;
        dst[iy * width + ix] = sum;
    }
}

inline int iDivUp(int a, int b){
    return (a % b != 0) ? (a / b + 1) : (a / b);
}

extern "C" void cuda_filter2D(float *d_src, float *d_mod, float *d_dst, float *kn, const int width, const int height)
{
    cutilSafeCall(cudaMemcpyToSymbol(d_kn, kn, sizeof(float) * 25));
    cutilSafeCall(cudaMemset(d_mod, 0, sizeof(float) * (width + 5) * (height + 5)));

    dim3 block(BLOCKDIMX, BLOCKDIMY, 1);
    dim3 gridf(iDivUp(width, BLOCKDIMX), iDivUp(height, BLOCKDIMY), 1);

    pad_kernel<<<gridf, block>>>(d_mod, d_src, width, height);
    cutilSafeCall(cudaThreadSynchronize());
    filter_kernel<<<gridf, block>>>(d_dst, d_mod, width, height);
    cutilSafeCall(cudaThreadSynchronize());
}

Thanks

superZZ

Hello zlf,

As you noticed, the maximum size of each dimension of a block on Fermi is 1024 x 1024 x 64, but the maximum number of threads per block is 1024.

A block of size 320x320 is 102,400 threads, which exceeds that limit, so the kernels never launch. You can try other configurations such as 32x32, 64x16, or 1024x1 and check the results and performance.
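For what it's worth, here is a minimal sketch (an addition, not from your original code) of checking the launch itself; launchErr is just an illustrative variable name, and printf assumes the usual stdio include. An oversized block such as 320x320 makes the launch fail immediately with a configuration error, but that error stays silent unless you query it after the <<< >>> call:

    filter_kernel<<<gridf, block>>>(d_dst, d_mod, width, height);
    cudaError_t launchErr = cudaGetLastError();  // reports an invalid launch configuration right away
    if (launchErr != cudaSuccess)
        printf("filter_kernel launch failed: %s\n", cudaGetErrorString(launchErr));
    cutilSafeCall(cudaThreadSynchronize());      // reports errors that happen while the kernel runs

With a check like this, the 0.4 ms run would have printed the error instead of silently leaving the output at zero.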

Best regards!

Thank you for your reply.

I tried a 1024x1 block; the performance is the same (24 ms).

But 32x32 and 64x16 fail with the error “e:/GPU/GpuLab/gpu/Filter2D/filter_kernel.cu(59) : cudaSafeCall() Runtime API error : unknown error.”

Compute Visual Profiler gives me this report for the 64x16 block. I am wondering whether I can get better performance. My real-time video processing application needs the algorithm to finish in 5 ms!

Best regards,

zlf

The Visual Profiler report shows a block size of 20x20x1, not 32x32x1 or 64x16x1.

How to arrange the threads within a block depends on how the data is going to be accessed. Also, you are not using shared memory, which could give you better performance.
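For illustration only, here is a minimal sketch of a shared-memory version of the filter kernel. It is not from your code: the kernel name filter_kernel_smem and the TILE_X/TILE_Y constants are my own, and it assumes the same padded source layout (row stride width + 5), the existing d_kn constant array, and a launch with a TILE_X x TILE_Y block. Each block stages its input window in shared memory once, then reads the 5x5 neighbourhood from there:

#define TILE_X 32
#define TILE_Y 12

__global__ void filter_kernel_smem(float *dst, float *src, int width, int height)
{
    // Output region of this block plus a 4-wide halo for the 5x5 window.
    __shared__ float tile[TILE_Y + 4][TILE_X + 4];

    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;

    // Cooperative load of the padded tile; some threads load more than one element.
    for (int ty = threadIdx.y; ty < TILE_Y + 4; ty += blockDim.y)
    {
        for (int tx = threadIdx.x; tx < TILE_X + 4; tx += blockDim.x)
        {
            int sx = blockDim.x * blockIdx.x + tx;
            int sy = blockDim.y * blockIdx.y + ty;
            // The padded source is (width + 5) x (height + 5) floats, so stay inside it.
            tile[ty][tx] = (sx < width + 5 && sy < height + 5) ? src[sy * (width + 5) + sx] : 0.0f;
        }
    }
    __syncthreads();

    if (ix < width && iy < height)
    {
        float sum = 0.0f;
        for (int i = 0; i < 5; i++)
            for (int j = 0; j < 5; j++)
                sum += d_kn[i * 5 + j] * tile[threadIdx.y + i][threadIdx.x + j];
        dst[iy * width + ix] = sum * 0.04f;
    }
}

This way each input value is read from global memory once per block instead of up to 25 times, and the row-wise loads stay coalesced.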

Regards!

Thank you. I will try shared memory.

I am still wondering why 32x32 and 64x16 failed.

Regards,

zlf

Your input data (4000x3000) is not a multiple of your block size, nor of a 1024x1x1 block. You are probably writing outside the global memory space allocated on the GPU.

The next question is: why does it work with 1024x1 if 1024 is not a multiple of 4000 or 3000? In some cases, storing data outside the allocated space does not crash your application, even though it is an access violation. And when you copy the data back to the CPU you copy exactly the bytes you reserved, so you never notice that you wrote out of bounds.

As good practice, if your data size is not a multiple of your block size, check the bounds before loading or storing data. For example:

__global__ void pad_kernel(float *dst, float *src, int width, int height)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;

    if (ix < width && iy < height) // guard both dimensions so the load and store stay in bounds
    {
        dst[(iy + 2) * (width + 5) + ix + 2] = src[iy * width + ix];
    }
}

Hope this helps, and best regards.