Asking for advice: in need of advice from more experienced fellows

Howdy guys, (as you probably guessed already) I’m quite new to CUDA computing. So far I’ve managed by myself, and even now I am not asking for a solution or code samples, but for advice on my problem:

There’s a project that requires parallelization.
Input is acquired from other hardware and sent to the computer’s memory as a 1024x1024 matrix (it’s an image).
A CUDA kernel will do some image processing on it; nothing extremely advanced, but not trivial either.
It has to be calculated at 50 fps…
The GPU is compatible with CUDA 2.1.

And now the question: how would you start doing it? Just sketches, more-or-less accurate pseudocode, anything that could help me find the best way to implement it.

Some questions I already have:
Which is better: copying the matrix into the kernel as-is, or flattening it into a 1D array? Keep in mind I would have to flatten the matrix 50 times per second, so I am not sure there would be any gain in that…
Is it even possible to launch a kernel that fast, not to mention do some calculations inside?
What memory should I use? I guess registers/shared memory, but maybe there are some tricks I should know about?

I know this is not a small thing to ask, especially as I am a new forum member with a post count of 1, but it would be awesome to receive any feedback from you guys. Thanks in advance :)

Check the CUDA SDK; there are examples that do just this.

As much as I appreciate your answer, I’m afraid it does not help me at all.

All of the matrices in the SDK examples are flattened before being passed into the kernel, and the images are very small… but that’s irrelevant at the moment.

My main problem right now is getting the input image into the kernel. As I said before, I will get the image in matrix[x][y] form from a hardware device. I could flatten it the good old regular way, but keep in mind I’d have to flatten 50 matrices per second, and that would take a lot of time… I’ve been trying cudaMemcpy2D, hoping it copies my 2D host matrix into a 1D flattened matrix on the device, but it doesn’t want to work; the program crashes at cudaMemcpy2D.

cudaMallocPitch((void**)&imageDevice, &pitch, _imageW * sizeof(float), _imageH);

cudaMemcpy2D(imageDevice, pitch, _imageHost, _imageW * sizeof(float), _imageW * sizeof(float), _imageH, cudaMemcpyHostToDevice);

kernelGPU<<<numBlocks, numThreadsPerBlock>>>(imageDevice, pitch, _imageSize, _imageH);

Where imageDevice is a float* of length _imageW*_imageH, and _imageHost is a float** of size _imageW x _imageH.


Do you mean you get an array of pointers to each line from the device?

Does your device really deliver a float**? That would be highly unusual. Maybe the data is contiguous in memory and the code just sets up the indirection pointers because someone has confused float[1024][1024] with float**?

You’ll be far better off with a flat array, so that you have to do just one contiguous copy.

Thanks for the answers guys :)

I’m still waiting for confirmation on what exactly the hardware delivers, but it’s most probably float**, though I would simply love to get the input as a flat array.

Back to my main problem, because I still haven’t solved it: how do I copy a 2D matrix or array of pointers into the kernel? Is flattening the only possible way?

As long as the storage on the host side is contiguous, you can copy it with a simple cudaMemcpy(). Any structure inside that single contiguous allocation is just a matter of interpretation that cudaMemcpy() can happily ignore. You could of course also use cudaMemcpy2D() for the copy but that would be overkill. cudaMemcpy2D() is useful for cases where the data is piece-wise contiguous, and the individual pieces are offset by a fixed distance (in bytes, or elements), commonly referred to as either pitch or stride depending on context. A typical example is copying a 2D tile of a larger 2D matrix. Another example would be copying a 1D vector with non-unit stride.

If your host data however is stored in non-contiguous form with variable spacing (as would typically be the case when a 2D matrix is represented as a vector of pointers, each of which points to a data vector representing one row or column of the matrix), then neither of these two transfer functions would work. Instead you would have to copy each of the (contiguous) data vectors individually using cudaMemcpy(), which will be less efficient: more copy calls and smaller amount of data per copy -> lower copy throughput.

As others have pointed out in this thread, it’s usually most efficient to deal with matrices represented by one contiguous chunk of storage (i.e. a single allocation), where the programmer can impose any particular multi-dimensional view, row-major or column-major, via access functions or macros.

Even though a code snippet would be great, I think I can doodle it out by myself now.
Thank you guys, your tips have been very useful so far :)