As you noticed, the maximum size of each dimension of a block on Fermi is 1024 x 1024 x 64, but the maximum number of threads per block is 1024.
A block of size 320x320 exceeds the maximum number of threads per block. You can try other configurations such as 32x32, 64x16, or 1024x1 and compare the results and performance.
I tried a 1024x1 block; the performance is the same (24 ms).
But 32x32 and 64x16 fail with the error “e:/GPU/GpuLab/gpu/Filter2D/filter_kernel.cu(59) : cudaSafeCall() Runtime API error : unknown error.”
Compute Visual Profiler gives me this report with a 64x16 block. I am wondering, can I get better performance? My real-time video processing application needs the algorithm to finish in 5 ms!
The Visual Profiler report shows a block size of 20x20x1, not 32x32x1 or 64x16x1.
How you arrange threads within a block depends on how the data is going to be accessed. On the other hand, you are not using shared memory, which could give you better performance.
Your input data (4000x3000) is not a multiple of your block size, not even for a block of 1024x1x1. You are probably writing outside the global memory space allocated on the GPU.
The next question is, Q: why does it work with 1024 if it’s not a multiple of 4000 or 3000? A: In some cases, storing data outside the allocated space does not crash your application, although it is still an access violation. When you copy the data back to the CPU you copy exactly the bytes you reserved, so you don’t notice that you wrote out of bounds.
As good practice, if your data size is not a multiple of your block size, check the indices before loading / storing data. For example:
__global__ void pad_kernel(float *dst, float *src, int width, int height)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;

    if (ix < width && iy < height) // guard each dimension: a flat check like
    {                              // idx < width*height would still let threads
                                   // with ix >= width write to the wrong element
        dst[(iy + 2) * (width + 5) + ix + 2] = src[iy * width + ix];
    }
}