Hi all,
My name is Ido, and I’m new to CUDA (or other GPU programming).
I have a simple problem
Large (about 200K) vectors of 256 floating point (4bytes) elements each as my data.
I sample 1 vector (same size), and I need to find the closest match to any vector in my data. I use a simple merit function that do the scalar sum of ((I-R)^2), for each vector in the data, and search for the minimum. This is my match.
I did it on the CPU, but even when I’ve tried to use Intel’s IPP, I still can’t get the performance I need. Then I’ve started to think about using the GPU and got to CUDA.
After reading CUDA’s programming guide, I realized that I need help in finding the right way of transferring/storing/comparing the data on the device.
Here is what I’ve thought about. Please comment, and help me get started.
-allocate memory on the device using cudaMemAlloc()
-copy the data to the device memory using a cudaMemcpy() (this data should stay on the device during application lifetime)
-copy the sampled signal to the device (how?)
-start compare the signal with all other signal’s (how?)
Any help will be much appreciated.
Regards,
Ido Ben-Ami
Sounds pretty much like a task to be done in CUDA, because you can do all the comparisons in parallel.
allocate using cudaMalloc (remember to use cudaMallocHost for allocation of pinned host memory for maximum transfer rate host to device)
copy the data to the device (is as easy as it sounds)
copy the sampled signal to the device => here I would suggest constant memory (if you stay under 64k of data, which is the case in your statement) because it is cached and you have no latencies while fetching the data (look at some examples to see how to do this, static allocation, cudaMemCpyToSymbol)
comparison
Each thread block should compare your sampled data with one of your initialization vectors, thus each thread just calculates the distance for one sample.
This would give you a blockDim of (256,1,1) and a GridDim of (50k,4) (for you have a limit at about 65k for a grid dimension)
After calculating each single distance you should use something like the reduction example to sum up all the single distances. If you do it correctly all your reads should be coallesced and you could get high performance. With 200k blocks you also have enough to hide memory latencies.
Can you please expain this part. I’ve done the easy first steps :). I don’t fully understand how to define the block dim and grid dim. maybe you can point me to a relevant sample (as with the reduction)
here is what I’ve done. This should dec the sample vector from all the data.
define NUM_VECTORS 8192 //start with something easy :)
looks good until the sub function. You do have to calculate your working adresses including block and grid dimensions.
you don’t need to loop, because you created blocks that run in parallel. what you want to do is:
__global__ void vecSub(float *data, float *sample, float* res)
// numVectors and vectorSize are not needed, because they equal blockDim and gridDim
{
 int samplepointer = threadIdx.x; // each thread does one calc
 int datapointer = threadIdx.x + blockIdx.x*blockDim.x //+ blockIdx.y*blockDim.x*gridDim.x only needes if more than 65k vectors
 res[datapointer] = data[datapointer]-sample[samplepointer];
}
I’ve not tested this but that’s how it should work.
Edit: An error on the memcopy often shows that something in your kernel went wrong. It is in your case access of unallocated data (or massively parallel access, so you get race conditions because all your blocks are calculating the same and try to write to the same address).
You copy the sample to “SampleOnDevice”. This array is known to the device (also the adress only exists on the device), so:
__global__ void vecSub(float *data, float* res)
// numVectors and vectorSize are not needed, because they equal blockDim and gridDim
{
int samplepointer = threadIdx.x; // each thread does one calc
int datapointer = threadIdx.x + blockIdx.x*blockDim.x //+ blockIdx.y*blockDim.x*gridDim.x only needes if more than 65k vectors
res[datapointer] = data[datapointer]-sampleOnDevice[samplepointer];
}
I was about to write that I’ve figured this out… you are too fast for me
Ok, now to phase 2, multiply the result with itself, than the reduction.
Don’t go to far away…I’ll probably keep nag you External Image
Read the doc to the reduction example. It is really explanative and shows how to get maximum performance even for other problems. In your case you could almost completely copy the most optimized version and merge it with your distance and square calculation.
And if there are any problems, just keep nagging ;)
The first part went well, but I can’t figure out how to continue.
After subtracting the sample from each vector, multiply the result by itself, I need to sum each vector.
I can’t use the reduction sample as is, because I don’t want to sum all my data to get one scalar, but I need to get a new vector which is the scalar sum of each data-sample vector.
Also, I can’t declare a shared memory that is big enough to hold my sums vector.
__global__ void vecSub(float *data, float *sample, float* res)
// numVectors and vectorSize are not needed, because they equal blockDim and gridDim
{
int samplepointer = threadIdx.x; // each thread does one calc
int datapointer = threadIdx.x + blockIdx.x*blockDim.x //+ blockIdx.y*blockDim.x*gridDim.x only needes if more than 65k vectors
res[datapointer] = data[datapointer]-sample[samplepointer];
}
Datapointer contains the index of the data array. What does samplepointer (threadidx.x) contains. I could not get it. As per understanding Here we are trying to find distance between a sample and the data. If the sample is constant vector, why should be index threadidx.x.
The reduction example is working on a dataset (for your case each comparison vector) in one multiprozessor. So you are calculating the sum of those 256 squared differences in one block and each block calculates one element of the result vector.
That for: each block needs 256 ints (or floats or whatever you are using) of shared memory (should be no problem).
Each thread loads the result of the calculation on “it’s” element (you should be able to merge the difference calc and the reduction) to the shared mem. After that just do the normal summing up. As you can see, the storing location is only depending on the blockIds so that each block gives one result.
I would suggest something like that:
__global__ void calc(float *g_idata, float *g_odata)
{
__shared__ float sdata[256];
// perform first level of reduction,
// reading from global memory, writing to shared memory
int tid = threadIdx.x;
int datapointer = threadIdx.x + blockIdx.x*blockDim.x + blockIdx.y*blockDim.x*gridDim.x;
sdata[tid] = sample[tid] - g_idata[datapointer];
sdata[tid] *= sdata[tid];
__syncthreads();
// do reduction in shared mem
if (tid < 128) { sdata[tid] += sdata[tid + 128]; }__syncthreads();
if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads();
if (tid < 32)
{
sdata[tid] += sdata[tid + 32];
sdata[tid] += sdata[tid + 16];
sdata[tid] += sdata[tid + 8];
sdata[tid] += sdata[tid + 4];
sdata[tid] += sdata[tid + 2];
sdata[tid] += sdata[tid + 1];
}
// write result for this block to global mem
if (tid == 0) { g_odata[blockIdx.x + gridDim.x*blockIdx.y] = sdata[0]; }
}
We are trying to find the distances between one sample vector (of length 256, so fitting in one block, so samplepointer = threadIdx.x) and several data vectors of the same length (each block comparing the whole sample vector to one data vector). So the blockIdx’s are only relevant for the datapointer.
About your solution:
If I understand correctly, the output should be a NUM_VECTORS size vector, that should hold the sum of every diff.
I need 256 threads, and NUM_VECTORS blocks to do that. In other words I should call the calc function with <<<NUM_VECTORS,VECTOR_LENGTH>>>.
I’ll give this a try. I just wanted to check if I should go with the GPU solution, and find myself working on this for a week. Without you I would probably still trying to figure out how to call a kernel function…
When I try to run this with num_vectors < 1024 (without the blockIdx.y…) I get the right results in the output. once the num_vectors reach 1024 and up, the output is currpted.
Ok - I have some results:
I use GeForce8800GT.
Running 50K vectors result in about 5.5msec. This time include copying the sample to the GPU, and getting the results back (full loop). This is x2.5 faster than my cpu ( 2.5Ghz * 2 cores, running both cores ).
I’m thinking of using the new GX280 cards to see how fast this would run.
There is still a bug once passing 64K. The calc returns without doing anything. I have no idea why.
IMO you should see some more speedup with the GX280. Maybe you can get anyone using a GX280 to test the performance of you program.
64k:
if you use
Dg.x = NUM_VECTORS
this should result in an invalid parameter error, for you have a limit of AFAIR 65535 blocks per grid dimension. If your NUM_VECTORS is (maybe padded to) 2^N I would write as follows:
Dg.x = NUM_VECTORS;
Dg.y = 1;
while (Dg.x > MAX_GRID_DIM) //should be 65535
{
Dg.x /=2;
Dg.y *=2;
}