Hi all,
I’m very new to CUDA development, and I’m afraid I need a nudge in the right direction to get started. I started out by writing a simple kernel to convert UYVY 4:2:2 video images to YUYV (simple byte swapping). I did something like the following:
- Create the UYVY and YUYV images in system memory.
- Allocate space on the GPU using cudaMallocPitch() to hold the source UYVY and destination YUYV images.
- DMA the UYVY image to the GPU using cudaMemcpy2D().
- Run the kernel to convert the UYVY to YUYV.
- DMA the result back to system memory using cudaMemcpy2D().
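To make the question concrete, here is roughly what my code looks like (simplified, with error checking stripped out; the kernel and variable names are just what I used). Each thread handles one 4-byte UYVY macropixel and swaps it to YUYV:

```cuda
#include <cuda_runtime.h>

// One thread per 4-byte macropixel: U Y0 V Y1 -> Y0 U Y1 V.
__global__ void uyvyToYuyv(const unsigned char *src, unsigned char *dst,
                           size_t pitch, int widthPixels, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // macropixel index in row
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    if (x < widthPixels / 2 && y < height) {
        const unsigned char *s = src + y * pitch + 4 * x;
        unsigned char       *d = dst + y * pitch + 4 * x;
        d[0] = s[1];  // Y0
        d[1] = s[0];  // U
        d[2] = s[3];  // Y1
        d[3] = s[2];  // V
    }
}

void convert(const unsigned char *hostSrc, unsigned char *hostDst,
             int width, int height)
{
    unsigned char *dSrc, *dDst;
    size_t pitch;  // same pitch for both, since both rows are width * 2 bytes

    // UYVY/YUYV are 2 bytes per pixel.
    cudaMallocPitch((void **)&dSrc, &pitch, width * 2, height);
    cudaMallocPitch((void **)&dDst, &pitch, width * 2, height);

    cudaMemcpy2D(dSrc, pitch, hostSrc, width * 2, width * 2, height,
                 cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((width / 2 + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    uyvyToYuyv<<<grid, block>>>(dSrc, dDst, pitch, width, height);

    cudaMemcpy2D(hostDst, width * 2, dDst, pitch, width * 2, height,
                 cudaMemcpyDeviceToHost);

    cudaFree(dSrc);
    cudaFree(dDst);
}
```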
This produced the correct result, but it ran incredibly slowly: slower than an IPP routine using a single thread on the host. From there I started simplifying the problem, and ended up just trying to write a kernel that initializes a block of memory. Again, the app produced the correct result, but memset() on the host ran faster (note that I’m not including the DMA transfer times in my timings).
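My memory-initialization test kernel is essentially the following (again simplified; I used one thread per byte, which I suspect may be part of the problem):

```cuda
// One thread per byte; bounds check since n need not be a multiple
// of the block size.
__global__ void fillBytes(unsigned char *buf, unsigned char value, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        buf[i] = value;
}

// Launched as:
//   fillBytes<<<(n + 255) / 256, 256>>>(dBuf, 0, n);
```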
So … what is the correct (and fast) way to do something simple like initializing memory? Should I create a thread per byte, per word, or per block of bytes?
I’m running these tests on a dual quad-core Xeon PC running 64-bit XP with a Quadro FX 1700. The application has been compiled as a 32-bit application (not 64-bit).
If someone could show me a simple kernel and the call to it, I’d really appreciate it.
Thanks,
Peter