CUDA vs. OpenGL textures: read-only vs. read-write

Hi all,

I’m trying to port some of our OpenGL + Cg code to CUDA and was wondering what the best strategy would be. In OpenGL all the data was in textures, and the Cg kernels could both read and write them. In CUDA, texture memory is read-only, so the result of a kernel can only go to global memory. If a kernel is applied repeatedly, so that the output of the first pass becomes the input of the second, it seems that a cudaMemcpyToArray call has to be made after every pass to copy from global to texture memory on the device. Won’t this kill the performance gain of using textures? Or is there a better strategy for what I’m trying to do?

Assuming the memcpy runs at ~streaming bandwidth (I don’t see why it wouldn’t), the worst-case performance drop would be 2x, and that’s only if your kernels take no longer than simply copying the data. I imagine they take longer than that :)
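To spell out the arithmetic behind that bound (the symbols are just for illustration): let t_k be the time of one kernel pass and t_c the time of the copy that follows it.

```latex
\[
\text{slowdown} \;=\; \frac{t_k + t_c}{t_k} \;=\; 1 + \frac{t_c}{t_k} \;\le\; 2
\qquad \text{whenever } t_k \ge t_c .
\]
```

So the factor of 2 is reached only when the kernel is as cheap as the copy itself; a heavier kernel pushes the ratio toward 1.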

If the memory copy is slower than streaming bandwidth, that would be both interesting and unfortunate.

I agree it would be nice to be able to write to textures, if only because of this problem.

Well, you can use a single CUDA buffer and read/write to it. But CUDA buffers are not cached, whereas “old” shader implementations often get their entire speedup from the 2D texture cache.
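A minimal sketch of how the per-pass copy could be avoided with plain global memory. Note that it swaps in two buffers ("ping-pong") rather than one, so that no thread reads a value another thread has already overwritten. The operator (a 3-point average), the names smooth_step and run_passes, and the launch configuration are all made up for illustration, and error checking is left out:

```cuda
#include <cuda_runtime.h>
#include <utility>
#include <vector>

// Hypothetical local operator: 3-point average with clamped boundaries.
__global__ void smooth_step(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int left  = (i > 0)     ? i - 1 : i;
    int right = (i < n - 1) ? i + 1 : i;
    out[i] = (in[left] + in[i] + in[right]) / 3.0f;
}

// Applies smooth_step `passes` times. The two device pointers are swapped
// between passes, so no device-to-device copy is needed.
// Assumes host_in is non-empty; CUDA error checking omitted for brevity.
std::vector<float> run_passes(const std::vector<float>& host_in, int passes)
{
    int n = static_cast<int>(host_in.size());
    size_t bytes = n * sizeof(float);

    float* d_a = nullptr;
    float* d_b = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMemcpy(d_a, host_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    for (int p = 0; p < passes; ++p) {
        smooth_step<<<blocks, threads>>>(d_a, d_b, n);
        std::swap(d_a, d_b);  // output of this pass is the input of the next
    }

    // After the final swap, d_a holds the most recent result.
    std::vector<float> host_out(n);
    cudaMemcpy(host_out.data(), d_a, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    return host_out;
}
```

This gives up the texture cache entirely, so whether it beats texture reads plus a copy depends on how much the kernel relies on 2D locality.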

On the other hand, you now have shared memory. Given that you applied the Cg shaders repeatedly, as you said, they probably operate only locally on the data. I would try to reformulate the program so that all or several of the kernel applications are collapsed into a single CUDA kernel that reads a local area into shared memory once, works on it, and writes it back once.
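As a rough sketch of what such a collapsed kernel might look like: the example below fuses STEPS applications of a made-up 3-point average into one launch. Each block loads its tile plus a halo of STEPS elements per side into shared memory once, iterates there, and writes the tile back once. The operator, the constants TILE and STEPS, and the names fused_smooth and fused_smooth_host are assumptions for illustration only; error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE  = 128;               // outputs produced per block
constexpr int STEPS = 4;                 // operator applications fused together
constexpr int WIDTH = TILE + 2 * STEPS;  // tile plus halo on both sides

__global__ void fused_smooth(const float* in, float* out, int n)
{
    __shared__ float buf[2][WIDTH];
    int base = blockIdx.x * TILE - STEPS;  // global index of buf[.][0]

    // Read the local area (tile + halo) from global memory exactly once.
    for (int j = threadIdx.x; j < WIDTH; j += blockDim.x) {
        int g = base + j;
        buf[0][j] = (g >= 0 && g < n) ? in[g] : 0.0f;
    }
    __syncthreads();

    int cur = 0;
    for (int s = 0; s < STEPS; ++s) {
        // Each step shrinks the valid region by one element per side.
        for (int j = threadIdx.x + s + 1; j < WIDTH - s - 1; j += blockDim.x) {
            int g = base + j;
            if (g >= 0 && g < n) {
                int l = (g > 0)     ? j - 1 : j;  // clamp at the domain edges
                int r = (g < n - 1) ? j + 1 : j;
                buf[1 - cur][j] =
                    (buf[cur][l] + buf[cur][j] + buf[cur][r]) / 3.0f;
            }
        }
        __syncthreads();
        cur = 1 - cur;
    }

    // Write the tile (without halo) back exactly once.
    for (int j = threadIdx.x; j < TILE; j += blockDim.x) {
        int g = blockIdx.x * TILE + j;
        if (g < n) out[g] = buf[cur][j + STEPS];
    }
}

// Host wrapper; assumes host_in is non-empty.
std::vector<float> fused_smooth_host(const std::vector<float>& host_in)
{
    int n = static_cast<int>(host_in.size());
    size_t bytes = n * sizeof(float);
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, host_in.data(), bytes, cudaMemcpyHostToDevice);

    int blocks = (n + TILE - 1) / TILE;
    fused_smooth<<<blocks, TILE>>>(d_in, d_out, n);

    std::vector<float> host_out(n);
    cudaMemcpy(host_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return host_out;
}
```

The price of fusing is the redundant halo work: every block recomputes up to 2 * STEPS extra elements, so this only pays off while STEPS stays small relative to TILE.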

If you have multiple Cg shader passes, you probably have an iterative algorithm. I have had good experiences writing the termination criterion as a device routine and doing the iteration completely in a single kernel call. The termination criterion usually needs global information, so take a look at the prefix-sum examples in the SDK.

Peter

Thanks to both of you for the answers :)

Peter, things are pretty much as you said. A differential operator was discretized to get a sparse matrix and multiplication by this sparse matrix was implemented as a Cg shader. Indeed, the matrix only operates locally on a vector.

The iterative algorithm is actually conjugate gradient, but I can’t write it as a single kernel call, for two reasons. First, a kernel call, as far as I know, can only run for 5 seconds, which is way too short for the kind of matrices we have. Second, the stopping criterion requires the length of a vector, and the sum over all components has to be done with 64-bit precision.
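For reference, a sketch of how the squared length could be accumulated in 64-bit precision on the device. This assumes a GPU and toolkit with double-precision support, which cannot be taken for granted on the hardware discussed in this thread. The names partial_sum_squares and norm_squared, the block size, and the cap of 64 blocks are arbitrary choices for illustration; error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

constexpr int BLOCK = 256;  // threads per block; must be a power of two

// Each block accumulates squares of float inputs in double precision and
// writes one partial sum. Must be launched with blockDim.x == BLOCK.
__global__ void partial_sum_squares(const float* x, int n, double* partial)
{
    __shared__ double s[BLOCK];

    double acc = 0.0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        double v = x[i];
        acc += v * v;
    }
    s[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = s[0];
}

// Returns the sum of x[i]^2, accumulated in double throughout.
double norm_squared(const std::vector<float>& host_x)
{
    int n = static_cast<int>(host_x.size());
    int blocks = std::max(1, std::min(64, (n + BLOCK - 1) / BLOCK));

    float* d_x = nullptr;
    double* d_partial = nullptr;
    cudaMalloc(&d_x, std::max<size_t>(1, n) * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(double));
    cudaMemcpy(d_x, host_x.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    partial_sum_squares<<<blocks, BLOCK>>>(d_x, n, d_partial);

    // The few per-block partial sums are added on the host.
    std::vector<double> partial(blocks);
    cudaMemcpy(partial.data(), d_partial, blocks * sizeof(double),
               cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_partial);

    double total = 0.0;
    for (double p : partial) total += p;
    return total;
}
```

Only the handful of partial sums travels back to the host, so the stopping test costs one small transfer per iteration rather than a full vector download.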

I also have another problem with stuffing too many things into a single CUDA kernel call: if the operation is too complicated, the kernel needs too many registers, and occupancy will be very low simply because the 8192 registers of one multiprocessor are used up by only a few threads. I asked about this issue here: http://forums.nvidia.com/index.php?showtopic=31771 but so far there have been no answers :( Am I misunderstanding something here? Because of the (maybe nonexistent?) problem in the above post, I tend to do only very basic things in one kernel call and prefer a series of calls to one big complicated kernel.
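To put rough numbers on that concern: with r registers per thread, the register file alone caps the number of resident threads per multiprocessor. The example value r = 32 and the limit of 768 resident threads per multiprocessor are assumptions for illustration, and register allocation granularity is ignored.

```latex
\[
T_{\max} = \left\lfloor \frac{8192}{r} \right\rfloor ,
\qquad
r = 32 \;\Rightarrow\; T_{\max} = 256 ,
\qquad
\text{occupancy} \approx \frac{256}{768} \approx 33\% .
\]
```

Under the same assumptions, full occupancy would need r to be at most 10, since 8192 / 768 is about 10.7.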

I’m just beginning with CUDA, but so far the OpenGL + Cg code performs something like 4-5 times better than my best CUDA implementation.

cudaMemcpyDeviceToDevice copies have a performance bug in the current beta release (0.8). This is fixed and the fix will be in the next release. At that point, these copies should be very fast, but currently you may only see 3-4 GB/second.

Mark

Thanks Mark, this explains why I didn’t really see any performance gain from using textures, as I had this memcpy after every kernel call.