beginner question Checking if GPU is the answer to me

Hi all,
My name is Ido, and I’m new to CUDA (or other GPU programming).
I have a simple problem
Large (about 200K) vectors of 256 floating point (4bytes) elements each as my data.
I sample 1 vector (same size), and I need to find the closest match to any vector in my data. I use a simple merit function that do the scalar sum of ((I-R)^2), for each vector in the data, and search for the minimum. This is my match.

I did it on the CPU, but even when I’ve tried to use Intel’s IPP, I still can’t get the performance I need. Then I’ve started to think about using the GPU and got to CUDA.

After reading CUDA’s programming guide, I realized that I need help in finding the right way of transferring/storing/comparing the data on the device.
Here is what I’ve thought about. Please comment, and help me get started.

-allocate memory on the device using cudaMemAlloc()
-copy the data to the device memory using a cudaMemcpy() (this data should stay on the device during application lifetime)
-copy the sampled signal to the device (how?)
-start compare the signal with all other signal’s (how?)
Any help will be much appreciated.
Ido Ben-Ami

Sounds pretty much like a task to be done in CUDA, because you can do all the comparisons in parallel.

  • allocate using cudaMalloc (remember to use cudaMallocHost for allocation of pinned host memory for maximum transfer rate host to device)

  • copy the data to the device (is as easy as it sounds)

  • copy the sampled signal to the device => here I would suggest constant memory (if you stay under 64k of data, which is the case in your statement) because it is cached and you have no latencies while fetching the data (look at some examples to see how to do this, static allocation, cudaMemCpyToSymbol)

  • comparison

Each thread block should compare your sampled data with one of your initialization vectors, thus each thread just calculates the distance for one sample.

This would give you a blockDim of (256,1,1) and a GridDim of (50k,4) (for you have a limit at about 65k for a grid dimension)

After calculating each single distance you should use something like the reduction example to sum up all the single distances. If you do it correctly all your reads should be coallesced and you could get high performance. With 200k blocks you also have enough to hide memory latencies.

Give it a try!


Hi Vrah,

Thanks for the reply.

Can you please expain this part. I’ve done the easy first steps :). I don’t fully understand how to define the block dim and grid dim. maybe you can point me to a relevant sample (as with the reduction)

here is what I’ve done. This should dec the sample vector from all the data.

#define NUM_VECTORS 8192 //start with something easy :)

#define VECTOR_LENGTH 256

constant float sampleOnDevice[VECTOR_LENGTH];


main( int argc, char** argv)


CUT_DEVICE_INIT(argc, argv);

float *dataOnHost = new float [NUM_VECTORS*VECTOR_LENGTH]; //library file

float *sampleOnHost = new float[VECTOR_LENGTH];		//sample

//init with some data

             for ( int i=0 ; i< NUM_VECTORS*VECTOR_LENGTH ; i++) 


	dataOnHost[i] = i;

	if ( i < VECTOR_LENGTH )

		sampleOnHost[i] = i;


int size = NUM_VECTORS*VECTOR_LENGTH*sizeof(float); //size in bytes

cudaError_t err;

float *dataOnDevice;

err = cudaMalloc((void**)&dataOnDevice,size);

err = cudaMemcpy(dataOnDevice,dataOnHost,size,cudaMemcpyHostToDevice);

err = cudaMemcpyToSymbol(sampleOnDevice,sampleOnHost,sizeof(sampleOnDevice));

dim3 Dg;


Dg.y = sizeof(float);

dim3 Db;


Db.y = 1;

Db.z = 1;

float *outOnDevice;

err = cudaMalloc((void**)&outOnDevice,size);


float *outOnHost = new float[NUM_VECTORS*VECTOR_LENGTH];

err = cudaMemcpy(outOnHost,outOnDevice,size,cudaMemcpyDeviceToHost);

//err here is always errUnknown !!!

delete []dataOnHost;

delete []sampleOnHost;

delete []outOnHost;



CUT_EXIT(argc, argv);


and the vecSub function:

global void vecSub(float *data, float sample, float res,int numVectors,int vectorSize)


int threadId = threadIdx.x;

for ( int i=0 ; i< numVectors ; i++)


	res[i*vectorSize+threadId] = data[i*vectorSize+threadId]-sample[i*vectorSize+threadId];



what am I doing wrong ??

Why the memcpy back to the host doesn’t work ?



looks good until the sub function. You do have to calculate your working adresses including block and grid dimensions.

you don’t need to loop, because you created blocks that run in parallel. what you want to do is:

__global__ void vecSub(float *data, float *sample, float* res)

// numVectors and vectorSize are not needed, because they equal blockDim and gridDim


 Â int samplepointer = threadIdx.x; // each thread does one calc

 Â int datapointer = threadIdx.x + blockIdx.x*blockDim.x //+ blockIdx.y*blockDim.x*gridDim.x only needes if more than 65k vectors

 Â res[datapointer] = data[datapointer]-sample[samplepointer];


I’ve not tested this but that’s how it should work.

Edit: An error on the memcopy often shows that something in your kernel went wrong. It is in your case access of unallocated data (or massively parallel access, so you get race conditions because all your blocks are calculating the same and try to write to the same address).

Hi VrahoK

I’ve got the same result with your code.
any advise ?


Do a


after the kernel and see what it says.

Edit, now I see it:

You copy the sample to “SampleOnDevice”. This array is known to the device (also the adress only exists on the device), so:

__global__ void vecSub(float *data, float* res)

// numVectors and vectorSize are not needed, because they equal blockDim and gridDim


 int samplepointer = threadIdx.x; // each thread does one calc

 int datapointer = threadIdx.x + blockIdx.x*blockDim.x //+ blockIdx.y*blockDim.x*gridDim.x only needes if more than 65k vectors

 res[datapointer] = data[datapointer]-sampleOnDevice[samplepointer];


I was about to write that I’ve figured this out… you are too fast for me
Ok, now to phase 2, multiply the result with itself, than the reduction.
Don’t go to far away…I’ll probably keep nag you O:)


Read the doc to the reduction example. It is really explanative and shows how to get maximum performance even for other problems. In your case you could almost completely copy the most optimized version and merge it with your distance and square calculation.

And if there are any problems, just keep nagging ;)


Thanks Vrah,
I’ll continue this Sunday morning.

I probably need to create another array that will keep the sum of each (I-R)^2, then find the min of that array, and by the index, located my match.

Now I just need to figure how to do this :)


I’m >.<

The first part went well, but I can’t figure out how to continue.
After subtracting the sample from each vector, multiply the result by itself, I need to sum each vector.
I can’t use the reduction sample as is, because I don’t want to sum all my data to get one scalar, but I need to get a new vector which is the scalar sum of each data-sample vector.

Also, I can’t declare a shared memory that is big enough to hold my sums vector.

Can anyone give me a push here ?

This post is really useful.

__global__ void vecSub(float *data, float *sample, float* res)

// numVectors and vectorSize are not needed, because they equal blockDim and gridDim


 int samplepointer = threadIdx.x; // each thread does one calc

 int datapointer = threadIdx.x + blockIdx.x*blockDim.x //+ blockIdx.y*blockDim.x*gridDim.x only needes if more than 65k vectors

 res[datapointer] = data[datapointer]-sample[samplepointer];


Datapointer contains the index of the data array. What does samplepointer (threadidx.x) contains. I could not get it. As per understanding Here we are trying to find distance between a sample and the data. If the sample is constant vector, why should be index threadidx.x.

Please help me to understand.

Sorry for replying this late, I had 2 days off.

The reduction example is working on a dataset (for your case each comparison vector) in one multiprozessor. So you are calculating the sum of those 256 squared differences in one block and each block calculates one element of the result vector.

That for: each block needs 256 ints (or floats or whatever you are using) of shared memory (should be no problem).

Each thread loads the result of the calculation on “it’s” element (you should be able to merge the difference calc and the reduction) to the shared mem. After that just do the normal summing up. As you can see, the storing location is only depending on the blockIds so that each block gives one result.

I would suggest something like that:

__global__ void calc(float *g_idata, float *g_odata)


  __shared__ float sdata[256];

 // perform first level of reduction,

  // reading from global memory, writing to shared memory

  int tid = threadIdx.x;

  int datapointer = threadIdx.x + blockIdx.x*blockDim.x + blockIdx.y*blockDim.x*gridDim.x;

  sdata[tid] = sample[tid] - g_idata[datapointer];

  sdata[tid] *= sdata[tid];


 // do reduction in shared mem

  if (tid < 128) { sdata[tid] += sdata[tid + 128]; }__syncthreads();

  if (tid <  64) { sdata[tid] += sdata[tid +  64]; } __syncthreads();


  if (tid < 32)


    sdata[tid] += sdata[tid + 32];

    sdata[tid] += sdata[tid + 16];

    sdata[tid] += sdata[tid +  8];

    sdata[tid] += sdata[tid +  4];

    sdata[tid] += sdata[tid +  2];

    sdata[tid] += sdata[tid +  1]; 



    // write result for this block to global mem 

    if (tid == 0) { g_odata[blockIdx.x + gridDim.x*blockIdx.y] = sdata[0]; }


We are trying to find the distances between one sample vector (of length 256, so fitting in one block, so samplepointer = threadIdx.x) and several data vectors of the same length (each block comparing the whole sample vector to one data vector). So the blockIdx’s are only relevant for the datapointer.

I hope that helps.

Thanks man.
I was about to give up on this.

About your solution:
If I understand correctly, the output should be a NUM_VECTORS size vector, that should hold the sum of every diff.
I need 256 threads, and NUM_VECTORS blocks to do that. In other words I should call the calc function with <<<NUM_VECTORS,VECTOR_LENGTH>>>.

I’ll give this a try. I just wanted to check if I should go with the GPU solution, and find myself working on this for a week. Without you I would probably still trying to figure out how to call a kernel function…

I fill so dumb.

Thanks again,

You are my god.

a question about the your solution:

When I try to run this with num_vectors < 1024 (without the blockIdx.y…) I get the right results in the output. once the num_vectors reach 1024 and up, the output is currpted.

Any idea ?


With num_vectors > 1024 you should still be able to go without blockIdx.y. How do you calculate block and grid dimensions? Are you still doing this?

dim3 Dg;


Dg.y = sizeof(float);

dim3 Db;


Db.y = 1;

Db.z = 1;

Because the Dg.y = sizeof(float) doesn’t make any sense. Could you show your wrapper code?


My bad. You are right.
…and now for the real life. I’ll integrate the code into my application, and check the overall solution.

Again - Thank you very much for your help.

Ok - I have some results:
I use GeForce8800GT.
Running 50K vectors result in about 5.5msec. This time include copying the sample to the GPU, and getting the results back (full loop). This is x2.5 faster than my cpu ( 2.5Ghz * 2 cores, running both cores ).

I’m thinking of using the new GX280 cards to see how fast this would run.

There is still a bug once passing 64K. The calc returns without doing anything. I have no idea why.


IMO you should see some more speedup with the GX280. Maybe you can get anyone using a GX280 to test the performance of you program.


if you use


this should result in an invalid parameter error, for you have a limit of AFAIR 65535 blocks per grid dimension. If your NUM_VECTORS is (maybe padded to) 2^N I would write as follows:


Dg.y = 1;

while (Dg.x > MAX_GRID_DIM) //should be 65535


  Dg.x /=2;

  Dg.y *=2;


this way you can do up to 65535^2 vectors.



I think I’ve got it.
The max number for the dg.x is 65K. I need to “move” to .y once I using more than that.