Shared Memory Limitation

papag62 · January 26, 2009, 1:09pm

All, I have been using CUDA for sometime now but everything I have done has utilized the texture fetching heavily. My new application does not, so I guess its time for me to learn how to utilize the shared memory. I have a few questions regarding my specific application.

It is really a image template matching problem. I capture a 21x21 pixel (float) (1.73 KB) image and need to find its closest match in a reference set. My current reference set is on the order of 21,000 21x21 pixel (float) images. I want to parallelize the database matching by assigning partitions of the reference set to each thread. What would be my best way of doing this to ensure performance? Does the size of my images effectively restrict the number of threads I can have per thread block? Since each thread would need access to the captured image and a image from the database, would I be limited to
(16KB - 1.73KB)/1.73KB ~= 8 threads per block? Or can you assign more threads with the thought that the scheduler knows how to handle the threads in order to maintain the required shared memory for the life of the thread?

I will be using C1060 hardware for this application.

Thanks

cbuchner1 · January 26, 2009, 2:20pm

Having 8 threads per block is not a workable solution in CUDA. The hardware scheduler in the GPU relies on having several warps (of 32 threads each) pending in the scheduler queue to hide latencies. If much less than 192 threads are present, you will start to see the performance drop.

Wouldn’t a thread assignment of 21 x 21 threads (per block) be better suited for your problem? So each thread computes a “yes / no” metric for exactly one pixel and the final result is determined with a parallel reduction step to get exactly one metric. Each thread block could evaluate your sample against one of the 21000 reference images.

I think a hierarchical approach might be faster than doing a brute-force comparison against 21000 reference images, for example by comparing a subsampled version first (3x3 pixels), then a 7x7 version, leaving a greatly reduced candidate set for the final 21 x 21 comparison. Your application will be memory bandwidth limited, no matter how well thought out your use of shared memory is. So the key is to get the total number of comparisons at the full 21x21 resolution down to a minimum.

Christian

papag62 · January 26, 2009, 2:39pm

hmmm… I don’t think I would see any performance gains using 21 x 21 threads per block since I’m not matching identical images, but calculating a similarity measure between images. Each thread calculation per pixel would need to be accumulated in order to calculate the similarity measure. Unfortunately, the values are floats so atomic adds are out of the question. I think some example code would help. Here is a serial C++ version of the code

[codebox]float calcSimMeas( Model &REF, Model &SCENE, int index)

{

float A = 0, B = 0, C = 0, D = 0, E = 0;

float N = 0;

float *REF_SI = &REF.spinImage[index*REF.width2si];

float *SCENE_SI = SCENE.spinImage;

for( int ii = 0; ii < REF.width2si; ii++)

{

	if(REF_SI[ii] != 0)

	{

		C += SCENE_SI[ii];

		E += pow(SCENE_SI[ii],2);

	}

	if(SCENE_SI[ii] != 0)

	{

		B += REF_SI[ii];

		D += pow(REF_SI[ii],2);

	}

	if(REF_SI[ii] != 0  && SCENE_SI[ii] != 0)

	{

		A += REF_SI[ii]* SCENE_SI[ii];

		N++;

	}	

}

float R = ( N * A - B * C)/sqrtf((N * D - B * B ) * (N * E - C * C));

float Cval = powf( g_atanhÂ®, 2) - REF.lambda * 1.5 * 1. / (N - 3.);



return Cval;

}[/codebox]

Where REF_SI is a pointer to the reference set image and SCENE_SI is point to the image to be found. In this case REF.width2SI = 21*21 = 441. This function resides in a for loop that increments the index into the reference set.

Instead of using a hierarchical approach, I eventually plan on using the PCA algorithm to compress the images into a linear combination of images. For this particular case PCA would allow me to reduce the images from 21x21 to 2x2 and be able to reconstruct 95% of the database. But I’m not always that lucky!

_Big_Mac · January 26, 2009, 5:57pm

That’s not necessarily true, I’ve seen applications where 64 threads per block was the optimum.

But below 32 threads = not using a whole warp, which is underutilising the resources.

papag62 · January 26, 2009, 6:28pm

I understand that 8 threads is not good. The question is whether I am really limited to that because of my shared memory requirements. I would even think that using more threads and working directly from global memory would be faster then only 8 from shared. Obviously that is not the solution I’m looking for.

With the way I have stated my current implementation of the problem am I limited to 8 threads b/c of the size of my images? Will assigning more then 8 work? Or do I need to rethink the problem to find a way to utilize more threads?

Thanks

_Big_Mac · January 26, 2009, 8:31pm

How about storing the per-pixel difference in shared memory?

As cbuchner1 suggested - have each thread read one pixel and compare it against a pixel in a single image from the reference set. Store the difference in shared memory. This way you use 1’764 bytes per block. Then perform a parallel reduction within this shared memory to sum the differences of all pixels.

E.D_Riedijk · January 26, 2009, 8:43pm

It is 192 threads per multiprocessor. So if you have 3 blocks of 64 threads running concurrently on a MP, you have those 192 threads running per MP.

papag62 · January 27, 2009, 8:04pm

First off, is my assessment that I would be limited to 8 threads correct?

If I were to have each block assigned to a image and then each thread assigned to a pixel, each block would require a database image, the image of reference and 5 accumulation parameters (each having the same size as an image).

If you look at my code, you see that thread i would only be doing the following, where REF is the database image and SCENE is the image of interest:

[codebox]if(REF[i] != 0 && SCENE[i] != 0)

{

C[i] = SCENE[i];

E[i] = C[i]^2;

B[i] = REF[i];

D[i] = B[i]^2;

A[i] = E[i]*D[i];

N++;

}[/codebox]

First off there is very little computation on the pixel level. Second, this would require 21x21 floats for A,B,C,D,E and REF and SCENE (~ 13KB total) per block. Third reduction would have to be performed on A,B,C,D,E and then the similarity measure would still need calculated:

[codebox]float R = ( N * A - B * C)/sqrtf((N * D - B * B ) * (N * E - C * C));

Result = powf( atanh( R ), 2) - lambda * 1.5 * 1. / (N - 3.);[/codebox]

Wouldn’t those 3 things limit what I can do by assigning each pixel a thread?

papag62 · February 4, 2009, 12:57am

OK I have tried using the blocks at an image level and threads at a pixel level, but in order to avoid the memory constraint, as mentioned in the previous post, I am using a temporary holder for the arrary (VAL). I have also reduced the image size from21x21 to 16x16 so it plays better with CUDA. Here is my implementation. It is not pretty and it provides zero performance improvement over just using global memory. Does any body see a better way to do this?

__global__ void cu_compareGPU3( float* REF_si, float* SCENE_si, float* Result,int width, int width2, float lambda, int numsi)

{

	__shared__ float SCENE_Share[256];

	__shared__ float REF_Share[256];

	__shared__ float VAL[256];

	int tid = threadIdx.x;

	int bid = blockIdx.x;

	int x = __mul24(bid, blockDim.x) + tid;	

	SCENE_Share[tid] = SCENE_si[tid];

	__syncthreads();

	REF_Share[tid] = REF_si[x];

	__syncthreads();

	

	bool check = REF_Share[tid] != 0 && SCENE_Share[tid] !=0;

	if(check)

	{

		VAL[tid] = SCENE_Share[tid];

	}

	__syncthreads();

	float Csum = reduce(VAL);

	VAL[tid] = 0;

	__syncthreads();

	if(check)

	{

		VAL[tid] = SCENE_Share[tid]*SCENE_Share[tid];

	}

	__syncthreads();

	float Esum = reduce(VAL);

	VAL[tid] = 0;

	__syncthreads();

	if(check)

	{

		VAL[tid] = REF_Share[tid];

	}

	__syncthreads();

	float Bsum = reduce(VAL);

	VAL[tid] = 0;

	__syncthreads();

	if(check)

	{

		VAL[tid] = REF_Share[tid]*REF_Share[tid];

	}

	__syncthreads();

	float Dsum = reduce(VAL);

	VAL[tid] = 0;

	__syncthreads();

	if(check)

	{

		VAL[tid] = SCENE_Share[tid]*REF_Share[tid];

	}

	__syncthreads();

	float Asum = reduce(VAL);

	VAL[tid] = 0;

	__syncthreads();

	if(check)

	{

		VAL[tid] = 1;

	}

	__syncthreads();

	float Nsum = reduce(VAL);

	VAL[tid] = 0;

	__syncthreads();

	if(tid == 0)

	{

		float denom1 = sqrtf((Nsum * Dsum - Bsum * Bsum) * (Nsum * Esum - Csum * Csum));

		float denom2 = (Nsum - 3.);

		if ( denom1 > .001 && denom2 > 0 )

		{

			float R = ( Nsum * Asum - Bsum * Csum)/denom1;

			Result[bid] = pow( atanh(R), 2) - lambda * 1.5 / denom2;

			//Result[bid] = Asum;

		} else {

			Result[bid] = -999.;

		}

	}

}

__device__ float reduce(float* x)

{

	unsigned int i = threadIdx.x;

	unsigned int n = blockDim.x;

	for( unsigned int offset = n/2; offset > 0; offset /= 2)

	{

		if(i < offset) x[i] += x[i + offset];

		__syncthreads();

	}

	return x[i];

}

Topic		Replies	Views
blocks vs threads CUDA Programming and Performance	14	18425	February 27, 2007
Loading from texture mem to shared mem Trouble with the threads CUDA Programming and Performance	2	1460	March 11, 2010
Device Query GeForce 210 CUDA Programming and Performance	6	11258	May 27, 2010
image convolution question using large bloks of shared memory CUDA Programming and Performance	0	2740	March 14, 2010
Need help to better understand CUDA structure CUDA Programming and Performance	7	1222	May 17, 2011
Lots of Threads vs. Shared Memory CUDA Programming and Performance	9	8488	February 12, 2008
Threads per block Using shared memory CUDA Programming and Performance	11	2591	October 20, 2010
Reduction done in shared memory CUDA Programming and Performance	4	4086	September 25, 2009
Not enough shared mem CUDA Programming and Performance	5	5893	November 3, 2009
Shared memory and global memory containg different values CUDA Programming and Performance	0	537	February 22, 2011

Shared Memory Limitation

Related topics