Help With Programming Model

First of all I would like to thank NVIDIA for releasing CUDA. I have been waiting for an easy way to integrate general purpose applications with GPUs and CUDA has taken a huge step in that direction.

As to my specific question, I am developing a data processing application that takes a very large input data set consisting of independent data points, performs operations on each data point, and updates an output data set as the results are generated. The application currently uses a streaming programming model. Data points are read from disk into a fifo in memory by a host thread. A number of worker threads pull data points from the fifo, compute a result from the point, and store the result in another fifo in host memory. Finally, another thread pulls results from the output fifo and writes them to disk.

I would like to add a kernel running with CUDA to the pool of worker threads. If I could get a substantial performance improvement by doing the computation on the device compared to on the host CPU, I think it would make a very strong case for building single CPU systems with multiple NVIDIA cards instead of traditional multiprocessor systems. However, I do not know if it would be a good idea to extend the streaming programming model to the kernel on the device, or if it is even possible.

Using a regular programming model, there would be a single CUDA worker thread that would pull a large number of data points from the data fifo, copy them to cuda with cudaMemcpy(), call a global kernel that would compute the results and copy them back to host memory, the worker thread would then add them to the results fifo and start over.

The problem with this approach is that the copy of data points to the device, the computation of results, and the copy back to the host will be serialized.

What I would like to be able to do is spawn two worker threads for CUDA. The first thread would allocate device memory, call a kernel to run on the device, and block. The second thread would then pull data points from the input fifo, group them into packets corresponding to the number of threads running on the device, and copy them to the device. Thread 1 running on the device would continually read from global memory until it receives an okay from the host. Then it would update a shared memory location allowing all device threads to start.

The code for this would look something like :

__global__ void runCUDA(char* basePointerToFifo,int fifoLength,int fifoTag, char* basePointerToResult, bool* finished  )


	// The lock

    	__shared__ bool lock;	

	//An index indicating where in the memory allocated for data points the next packet will be located. 

	__shared__ int fifoIndex;


	//A pointer to the next location to read a packet from

	__shared__ char* pointerToFifo;

	//whether we should continue waiting for new packets or return

	__shared__ bool done;

	// Thread index

	int tx = threadIdx.x;

	int ty = threadIdx.y;


	//We start out not done

	if(tx==0 && ty==0)


  done = false;

  fifoIndex = 0;





  //Lock the lock

  if(tx==0 && ty==0)


  	lock = true;



 //wait until thread1 opens the lock



  	//have thread1 open the lock if the host has updated tag

  	if(tx==0 && ty==0)



    if(* pointerToFifo == fifoTag )



     lock = false;




       pointerToFifo+=pointerToFifo + packetLength;








    else if(*finished)


    	done = true;

     lock = false;




  }    //while lock




  //	Load 1 packet starting from (pointerToFifo)+1 .  Each thread loads a different value.

  //	Perform computation of the result from the input.

  //	Save the results to (basePointerToResult+1) + fifoIndex*(packetLength) + (tx+ty*X_THREADS)*(size of a result)



  //tag the result packet

  if(tx==0 && ty==0)





	}//if !done

	}//while !done


Forgive any syntax errors as I haven’t tried compiling that. The intended function is as follows:

1.A packet of data being processed has N data points and the first byte is a tag.

2.A packet of results has N results and the first byte is a tag.

3.There is 1 group of N threads in the device kernel.

4.The kernel is passed the following parameters:

    1.basePointerToFifo : a pointer to the global memory location containing the start of the first data packet

    2.fifoLength : the number of data packets that can be held in global memory

    3.fifoTag : the starting tag number

    4.basePointerToResult : a pointer to the global memory location that the host will expect the first result packet

    5.finished : a pointer to the global memory address that the host will set when it wants the kernel to return

5.The execution of the kernel is as follow:

    1.All threads record their thread ids.

    2.The first thread sets a lock and all other threads wait until it is done.

    3.The first thread reads the tag from the first packet. When it matches the current 

        value of fifoTag it unlocks the lock.  The lock will also be unlocked if the first 

        thread reads the value of the global variable finished to be true.  In that case, 

        immediately return.

    4.All of the threads load one data point and store one result.

    5.The first thread tags the result with fifoTag and increments fifoTag.

    6.Go to 2.

6.The execution of the host thread communicating with the kernel is as follows

    1.Clear global memory used for data and result packets.

    2.If there is room in global memory, copy data points from host memory into a   


    3.Tag that packet with the current tag number and increment the tag number.

    4.Check if the next result packet in global memory has a current tag.

    5.If it does, copy the results into host memory.

Will this work?

Can I make CUDA calls from a separate thread while a kernel is running? If I call cudaMemcpy() for the packet data and then again for the tag from the same host thread will I be guaranteed that all of the packet data will be written before the tag is written?

Does anyone have any suggestions?

Another question that just came up is whether there is a mechanism for the kernel to post an interrupt to the host without returning. This would be nice since the thread on the host would not have to poll global memory to determine when there is a new packet.

I see the __trap() function in the programming guide but there is not much description. The file device_functions.h that contains the function header doesn’t offer any more information.