Hi Mark,
After lots of trial and error and head scratching, I got my kernel working last night. :magic: It ultimately came down to device vs. host pointers. Thanks to Cyril for his suggestions. I will make the code available later, once I have had a chance to do some performance tuning and write some more kernels. As the x264 codec is GPL’ed code, there are no IP issues on my part. I just want to have more done before I show something.
My biggest frustration has been the difficulty of debugging. I did figure out that I was getting a launch failure (RC=3) because my kernel aborted. After that it was a matter of searching for the failure point by inserting RETURN statements; the culprit turned out to be an attempt to dereference a host pointer.
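For anyone hitting the same thing, here is a minimal sketch of the host vs. device pointer distinction. It is not my x264 code: the `scale` kernel and the `run_scale_example` helper are hypothetical names made up for illustration, and it assumes a reasonably current CUDA runtime (older toolkits may lack `cudaDeviceSynchronize`). The point is only that the kernel receives the pointer returned by `cudaMalloc`, never the address of the host array.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles each element in place.
// 'data' must point to device memory; a host address here is what
// produces the kind of launch failure described above.
__global__ void scale(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2;
}

// Hypothetical helper: returns true if the round trip through the GPU
// produced the expected values.
bool run_scale_example()
{
    const int n = 256;
    int host[n];
    for (int i = 0; i < n; ++i)
        host[i] = i;

    // Allocate device memory and copy the input over.
    int *dev = NULL;
    if (cudaMalloc((void **)&dev, n * sizeof(int)) != cudaSuccess)
        return false;
    if (cudaMemcpy(dev, host, n * sizeof(int),
                   cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(dev);
        return false;
    }

    // Pass the device pointer 'dev', not the host array 'host'.
    scale<<<(n + 127) / 128, 128>>>(dev, n);

    // Launch errors show up here; errors raised while the kernel runs
    // show up once the host waits for it to finish.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return false;
    }

    // Verify the result on the host.
    for (int i = 0; i < n; ++i)
        if (host[i] != 2 * i)
            return false;
    return true;
}

int main()
{
    return run_scale_example() ? 0 : 1;
}
```

Replacing `dev` with `host` in the launch line is the mistake to avoid; the compiler accepts it because both are plain `int *`.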
Would it be possible to document all of these return codes? What do 3 and 10201 mean, for example?
Is there a way to have the CUDA driver return some context information on failure (file + offset) in the future? Having to sprinkle RETURN statements all over the kernel to figure out where things failed is positively baroque. I’d rather use a logic analyzer. :-)
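As a partial stopgap on the host side, something like the sketch below at least reports which runtime call failed and where. The `check_cuda` helper, the `CUDA_CHECK` macro and the `noop` kernel are hypothetical names for illustration, and it assumes a current CUDA runtime. It only gives the file and line of the host call plus the runtime's own error string; it does not give the failing location inside the kernel, which is what I am really asking for.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints where a runtime call failed and the
// runtime's description of the error code. Returns true on success.
static bool check_cuda(cudaError_t err, const char *file, int line)
{
    if (err == cudaSuccess)
        return true;
    fprintf(stderr, "%s:%d: CUDA error %d (%s)\n",
            file, line, (int)err, cudaGetErrorString(err));
    return false;
}

// Wraps a runtime call so the report carries the call site.
#define CUDA_CHECK(call) check_cuda((call), __FILE__, __LINE__)

// Placeholder kernel so the example has something to launch.
__global__ void noop() {}

int main()
{
    noop<<<1, 1>>>();

    // Errors detected at launch time (bad configuration and the like).
    bool ok = CUDA_CHECK(cudaGetLastError());

    // Errors raised while the kernel was running are reported when the
    // host waits for it to finish.
    ok = CUDA_CHECK(cudaDeviceSynchronize()) && ok;

    return ok ? 0 : 1;
}
```

Wrapping every runtime call this way narrows a failure down to one launch, which cuts down on the RETURN-statement hunting, but the search inside the kernel itself is still manual.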
As for emulator inaccuracies: what the emulator did show me was that my basic algorithm mapping was correct, which was crucial because it let me move on to other potential problems. What I would really like to be able to use is a cycle-accurate simulator. The Intel network processor SDK has one. It runs really slowly, but it works.
Spencer