Iterating over a two-dimensional array and memory allocation

Hello,
I'm new to CUDA and to C++ as well, so I'm sorry if my questions sound like beginner questions :) (they are :) )

I want to iterate over an array of pixels to find and replace a particular color. It is horribly slow when I compute it the normal way (Java plus threads on the main processor).
Could anybody please tell me how I should allocate memory and run the main loop over a two-dimensional array so that it uses as many threads as my graphics card can run, and how to get the modified array back in my C++ code?
This is the single-thread Java code I wrote for the operation mentioned above; I am now trying to rewrite it in C++ with CUDA. Each thread was given a range of rows to iterate over.

public void run(IplImage frame, int yStart, int yStop, IplImage back) {
    for (int i = 0; i < frame.width(); i++) {
        for (int j = yStart; j < yStop; j++) {
            CvScalar currPix = cvGet2D(frame, j, i);
            if (currPix.val(0) > blueMin) {
                CvScalar currBgPix = cvGet2D(back, j, i);
                cvSet2D(frame, j, i, currBgPix);
            }
        }
    }
}

Is there any way to wait for this multi-threaded operation on the device to finish? How would I know that the operation is done and that I can move on to the next operation on the host?

Best Regards
Jan

You can use cudaThreadSynchronize(), or, if you need the data on the host, you can copy it with cudaMemcpy, which automatically synchronizes the GPU computation with the host (i.e. the program waits until the copy has finished before proceeding).

@JohnnyGor:

You could create a two-dimensional grid of threads, so that each (i,j) index of the for loop corresponds to a different thread. I would recommend reading J. Sanders, E. Kandrot, “CUDA By Example”, Chapter 5, pp. 64-65, which show how to set up a two-dimensional grid of threads.
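As a rough illustration of the two-dimensional grid idea applied to the Java loop in the question, here is a minimal sketch. It assumes the image is 8-bit, 3-channel interleaved BGR data with no row padding (an IplImage may pad its rows, so this is an assumption). The names replaceBlue and replaceBlueOnGpu and the 16x16 block size are made up for this example and are not taken from the book.

```cuda
#include <cuda_runtime.h>

// One thread per pixel. Assumes 8-bit, 3-channel interleaved BGR data
// with tightly packed rows (an assumption; real IplImage rows may be padded).
__global__ void replaceBlue(unsigned char* frame, const unsigned char* back,
                            int width, int height, unsigned char blueMin)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column, "i" in the Java loop
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row, "j" in the Java loop
    if (x >= width || y >= height) return;          // threads past the image edge do nothing

    int idx = (y * width + x) * 3;                  // offset of the blue channel of pixel (x, y)
    if (frame[idx] > blueMin) {
        frame[idx]     = back[idx];
        frame[idx + 1] = back[idx + 1];
        frame[idx + 2] = back[idx + 2];
    }
}

// Hypothetical host wrapper: allocate device memory, copy the images in,
// launch the kernel on a 2D grid, and copy the modified frame back.
cudaError_t replaceBlueOnGpu(unsigned char* hostFrame, const unsigned char* hostBack,
                             int width, int height, unsigned char blueMin)
{
    size_t bytes = (size_t)width * height * 3;
    unsigned char *dFrame = nullptr, *dBack = nullptr;

    cudaError_t err = cudaMalloc(&dFrame, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dBack, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dFrame, hostFrame, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dBack, hostBack, bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        dim3 block(16, 16);                         // 256 threads per block (arbitrary choice)
        dim3 grid((width + block.x - 1) / block.x,  // round up so the whole image is covered
                  (height + block.y - 1) / block.y);
        replaceBlue<<<grid, block>>>(dFrame, dBack, width, height, blueMin);
        err = cudaGetLastError();                   // reports launch errors
    }

    // This blocking copy also waits for the kernel to finish.
    if (err == cudaSuccess) err = cudaMemcpy(hostFrame, dFrame, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dFrame);                               // cudaFree(nullptr) is a no-op
    cudaFree(dBack);
    return err;
}
```

Calling replaceBlueOnGpu(frameData, backData, width, height, blueMin) from the host would then replace the per-thread row ranges of the Java version; the GPU schedules the blocks itself, so there is no need to split the rows by hand.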

@pasoleatis:

cudaThreadSynchronize() has been deprecated in favour of cudaDeviceSynchronize(), see

cudaThreadSynchronize vs. cudaDeviceSynchronize what is the difference? - CUDA Programming and Performance - NVIDIA Developer Forums
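To make the waiting part concrete, here is a small sketch of launching a kernel and blocking the host until it is done. The kernel addOne and the wrapper addOneAndWait are hypothetical names used only so there is something to wait for; the pointer dData is assumed to be device memory already allocated with cudaMalloc.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel, present only so there is work to wait for.
__global__ void addOne(int* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Launches the kernel and blocks the host until it has finished.
// dData is assumed to point to n ints in device memory.
cudaError_t addOneAndWait(int* dData, int n)
{
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    addOne<<<blocks, threads>>>(dData, n);   // returns immediately; the kernel runs asynchronously

    cudaError_t err = cudaGetLastError();    // catches launch errors such as a bad configuration
    if (err != cudaSuccess) return err;

    return cudaDeviceSynchronize();          // waits for completion and reports execution errors
}
```

If the next host step is a plain cudaMemcpy back to the host, the explicit cudaDeviceSynchronize() is not strictly needed, since that copy waits for the kernel anyway; the explicit call is mostly useful for timing and for catching errors close to the launch.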

I also strongly recommend the book. I read it in 3 days. It is not just about CUDA C, but also about how to think about your algorithm in a parallel manner. I do not use that function much, so I was not aware of the deprecation. Thanks for pointing it out.
For the OP: if you are really at the beginning, I recommend spending some time playing around with simple codes such as c[i]=a[i]+b[i] to get familiar with the basics of CUDA C.
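For reference, a c[i]=a[i]+b[i] exercise might look roughly like the sketch below. The names vecAdd and vecAddOnGpu and the block size of 256 are arbitrary choices for this example, not something prescribed by CUDA.

```cuda
#include <cuda_runtime.h>

// Each thread adds one pair of elements.
__global__ void vecAdd(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];           // guard: the last block may have spare threads
}

// Hypothetical host wrapper: a, b and c are host arrays of n floats.
cudaError_t vecAddOnGpu(const float* a, const float* b, float* c, int n)
{
    size_t bytes = n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;

    cudaError_t err = cudaMalloc(&dA, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dB, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dC, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dA, a, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dB, b, bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;   // round up to cover all n elements
        vecAdd<<<blocks, threads>>>(dA, dB, dC, n);
        err = cudaGetLastError();
    }

    // Blocking copy: waits for the kernel, then brings the result to the host.
    if (err == cudaSuccess) err = cudaMemcpy(c, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);                                   // cudaFree(nullptr) is a no-op
    cudaFree(dB);
    cudaFree(dC);
    return err;
}
```

The pattern is the same one the pixel problem needs: allocate on the device, copy in, launch with enough threads to cover the data, copy the result back.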

Thank you, guys. Yes, this is not an easy task, as I have noticed. I think I will start with some tutorials first, and that book of course. It is not 100% clear to me how to write this code ;) as I do not yet properly understand how threads work in CUDA …