All, I have been using CUDA for sometime now but everything I have done has utilized the texture fetching heavily. My new application does not, so I guess its time for me to learn how to utilize the shared memory. I have a few questions regarding my specific application.
It is really a image template matching problem. I capture a 21x21 pixel (float) (1.73 KB) image and need to find its closest match in a reference set. My current reference set is on the order of 21,000 21x21 pixel (float) images. I want to parallelize the database matching by assigning partitions of the reference set to each thread. What would be my best way of doing this to ensure performance? Does the size of my images effectively restrict the number of threads I can have per thread block? Since each thread would need access to the captured image and a image from the database, would I be limited to
(16KB - 1.73KB)/1.73KB ~= 8 threads per block? Or can you assign more threads with the thought that the scheduler knows how to handle the threads in order to maintain the required shared memory for the life of the thread?
I will be using C1060 hardware for this application.
Having 8 threads per block is not a workable solution in CUDA. The hardware scheduler in the GPU relies on having several warps (of 32 threads each) pending in the scheduler queue to hide latencies. If much less than 192 threads are present, you will start to see the performance drop.
Wouldn’t a thread assignment of 21 x 21 threads (per block) be better suited for your problem? So each thread computes a “yes / no” metric for exactly one pixel and the final result is determined with a parallel reduction step to get exactly one metric. Each thread block could evaluate your sample against one of the 21000 reference images.
I think a hierarchical approach might be faster than doing a brute-force comparison against 21000 reference images, for example by comparing a subsampled version first (3x3 pixels), then a 7x7 version, leaving a greatly reduced candidate set for the final 21 x 21 comparison. Your application will be memory bandwidth limited, no matter how well thought out your use of shared memory is. So the key is to get the total number of comparisons at the full 21x21 resolution down to a minimum.
hmmm… I don’t think I would see any performance gains using 21 x 21 threads per block since I’m not matching identical images, but calculating a similarity measure between images. Each thread calculation per pixel would need to be accumulated in order to calculate the similarity measure. Unfortunately, the values are floats so atomic adds are out of the question. I think some example code would help. Here is a serial C++ version of the code
[codebox]float calcSimMeas( Model &REF, Model &SCENE, int index)
{
float A = 0, B = 0, C = 0, D = 0, E = 0;
float N = 0;
float *REF_SI = &REF.spinImage[index*REF.width2si];
float *SCENE_SI = SCENE.spinImage;
for( int ii = 0; ii < REF.width2si; ii++)
{
if(REF_SI[ii] != 0)
{
C += SCENE_SI[ii];
E += pow(SCENE_SI[ii],2);
}
if(SCENE_SI[ii] != 0)
{
B += REF_SI[ii];
D += pow(REF_SI[ii],2);
}
if(REF_SI[ii] != 0 && SCENE_SI[ii] != 0)
{
A += REF_SI[ii]* SCENE_SI[ii];
N++;
}
}
float R = ( N * A - B * C)/sqrtf((N * D - B * B ) * (N * E - C * C));
float Cval = powf( g_atanh®, 2) - REF.lambda * 1.5 * 1. / (N - 3.);
return Cval;
}[/codebox]
Where REF_SI is a pointer to the reference set image and SCENE_SI is point to the image to be found. In this case REF.width2SI = 21*21 = 441. This function resides in a for loop that increments the index into the reference set.
Instead of using a hierarchical approach, I eventually plan on using the PCA algorithm to compress the images into a linear combination of images. For this particular case PCA would allow me to reduce the images from 21x21 to 2x2 and be able to reconstruct 95% of the database. But I’m not always that lucky!
I understand that 8 threads is not good. The question is whether I am really limited to that because of my shared memory requirements. I would even think that using more threads and working directly from global memory would be faster then only 8 from shared. Obviously that is not the solution I’m looking for.
With the way I have stated my current implementation of the problem am I limited to 8 threads b/c of the size of my images? Will assigning more then 8 work? Or do I need to rethink the problem to find a way to utilize more threads?
How about storing the per-pixel difference in shared memory?
As cbuchner1 suggested - have each thread read one pixel and compare it against a pixel in a single image from the reference set. Store the difference in shared memory. This way you use 1’764 bytes per block. Then perform a parallel reduction within this shared memory to sum the differences of all pixels.
First off, is my assessment that I would be limited to 8 threads correct?
If I were to have each block assigned to a image and then each thread assigned to a pixel, each block would require a database image, the image of reference and 5 accumulation parameters (each having the same size as an image).
If you look at my code, you see that thread i would only be doing the following, where REF is the database image and SCENE is the image of interest:
First off there is very little computation on the pixel level. Second, this would require 21x21 floats for A,B,C,D,E and REF and SCENE (~ 13KB total) per block. Third reduction would have to be performed on A,B,C,D,E and then the similarity measure would still need calculated:
[codebox]float R = ( N * A - B * C)/sqrtf((N * D - B * B ) * (N * E - C * C));
Result = powf( atanh( R ), 2) - lambda * 1.5 * 1. / (N - 3.);[/codebox]
Wouldn’t those 3 things limit what I can do by assigning each pixel a thread?
OK I have tried using the blocks at an image level and threads at a pixel level, but in order to avoid the memory constraint, as mentioned in the previous post, I am using a temporary holder for the arrary (VAL). I have also reduced the image size from21x21 to 16x16 so it plays better with CUDA. Here is my implementation. It is not pretty and it provides zero performance improvement over just using global memory. Does any body see a better way to do this?