Has anyone else found CUDA to be slower than DirectX shader programming? I implemented a bilateral filter both as a DirectX shader and as a CUDA kernel, and the CUDA version is much slower (about twice the running time). The image size is 640x480.
Post your CUDA kernel, then we may be able to find your missing GFLOPS ;)
Actually, I also tried the box filter sample that ships with the CUDA SDK. It only reaches about 60 fps, while my DirectX implementation is about 10x faster. So I believe CUDA is very bad at memory caching.
What exactly are you comparing? If you are just reading the FPS count from the boxfilter example, I think you should remember:
- the boxfilter example is using OpenGL
- it performs N iterations
Which memory caching are you referring to? The boxfilter example only uses a texture for one dimension. The other dimension uses coalesced global memory access.
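As a rough illustration of why one dimension can be coalesced and the other cannot, the plain C++ sketch below computes how far apart the addresses read by two neighbouring threads are. It assumes a row-major image, one thread per row in the horizontal pass, and one thread per column in the vertical pass. That thread mapping is an assumption for this sketch and the function names are invented, so check the sample source for the real layout.

```cpp
#include <cstddef>

// Linear index of pixel (x, y) in a row-major image of the given width.
std::size_t pixel_index(std::size_t x, std::size_t y, std::size_t width) {
    return y * width + x;
}

// Horizontal pass, assuming one thread per row: at a given step, thread t
// reads pixel (step, t). Returns the index distance between threads 0 and 1.
std::size_t row_pass_stride(std::size_t width, std::size_t step) {
    return pixel_index(step, 1, width) - pixel_index(step, 0, width);
}

// Vertical pass, assuming one thread per column: at a given step, thread t
// reads pixel (t, step). Returns the index distance between threads 0 and 1.
std::size_t column_pass_stride(std::size_t width, std::size_t step) {
    return pixel_index(1, step, width) - pixel_index(0, step, width);
}
```

For a 640-pixel-wide image the horizontal pass gives a stride of 640 elements between neighbouring threads, which cannot be coalesced and is where a texture fetch helps. The vertical pass gives a stride of 1, so neighbouring threads read neighbouring addresses and the loads can coalesce.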