benchmark data / error handling

Hi there,

Edit: I deleted the benchmark-related question because it was not a very intelligent one.

Another question: what is the preferred way to handle errors inside a CUDA kernel? I have a loop that launches the kernel n times, and I want to end it if an error occurs inside the kernel. Is there a faster way to notify the host thread than reading an error flag from global device memory?

Thanks for any answers,
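For reference, the flag-in-global-memory approach mentioned in the question could look roughly like the sketch below. The kernel `step_kernel`, the helper `run_until_error`, and the simulated failure at `fail_at` are all hypothetical names made up for illustration; they are not from the original post. The host reads the flag back with a blocking `cudaMemcpy` after each launch, which also waits for the kernel to finish.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: performs one step and raises a flag on failure.
// The failure is simulated here: it happens when step == fail_at.
__global__ void step_kernel(int step, int fail_at, int *error_flag)
{
    if (step == fail_at && threadIdx.x == 0 && blockIdx.x == 0) {
        *error_flag = 1;  // report the error through global memory
    }
}

// Launches the kernel up to n times and stops after the first launch that
// sets the flag. Returns the index of the failing step, n if every step
// succeeded, or -1 if a CUDA runtime call itself failed.
int run_until_error(int n, int fail_at)
{
    int *d_flag = NULL;
    int h_flag = 0;

    if (cudaMalloc((void **)&d_flag, sizeof(int)) != cudaSuccess) return -1;
    if (cudaMemset(d_flag, 0, sizeof(int)) != cudaSuccess) {
        cudaFree(d_flag);
        return -1;
    }

    int step = 0;
    for (; step < n; ++step) {
        step_kernel<<<1, 32>>>(step, fail_at, d_flag);

        // Blocking copy: waits for the kernel, then fetches the 4-byte flag.
        if (cudaMemcpy(&h_flag, d_flag, sizeof(int),
                       cudaMemcpyDeviceToHost) != cudaSuccess) {
            cudaFree(d_flag);
            return -1;
        }
        if (h_flag != 0) break;  // the kernel reported an error: leave the loop
    }

    cudaFree(d_flag);
    return step;
}
```

Since the host has to wait for each kernel before deciding whether to launch the next one anyway, the extra cost in this sketch is one small device-to-host copy per iteration.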

What sort of performance are you getting?

Thank you for the link.

My current approach decomposes a 10928x10928 matrix in about 15 s on a GeForce 9600 GT (CPU: 2 x 2.1 GHz). (I know, there are faster implementations…)

This is actually the largest size I can use; for any bigger matrix, cudaMallocPitch returns 'out of memory'. In theory, 512 MB should have room for more than 11500^2 elements…?
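The arithmetic behind that estimate can be checked with a small sketch. It assumes single-precision `float` elements, which the post does not state. Under that assumption, 11500^2 elements need 529,000,000 bytes, which is below 512 MiB (536,870,912 bytes). However, not all of the card's memory is necessarily available to one allocation, and `cudaMallocPitch` may pad each row. The helpers `dense_bytes` and `report_memory` are hypothetical names; `cudaMemGetInfo` and `cudaMallocPitch` are the runtime calls used to see what the device actually reports.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Bytes needed for an n x n matrix without any row padding.
// Assumption: elements are single-precision floats.
size_t dense_bytes(size_t n)
{
    return n * n * sizeof(float);
}

// Prints free/total device memory and the pitch that cudaMallocPitch picks
// for an n x n float matrix. Returns 0 on success, -1 on any CUDA error.
int report_memory(size_t n)
{
    size_t free_b = 0, total_b = 0;
    if (cudaMemGetInfo(&free_b, &total_b) != cudaSuccess) return -1;
    printf("free: %zu bytes, total: %zu bytes, unpadded need: %zu bytes\n",
           free_b, total_b, dense_bytes(n));

    float *d_a = NULL;
    size_t pitch = 0;
    cudaError_t err = cudaMallocPitch((void **)&d_a, &pitch,
                                      n * sizeof(float), n);
    if (err != cudaSuccess) {
        printf("cudaMallocPitch failed: %s\n", cudaGetErrorString(err));
        return -1;
    }

    // pitch >= n * sizeof(float); the difference is per-row padding.
    printf("row: %zu bytes, pitch: %zu bytes, padded total: %zu bytes\n",
           n * sizeof(float), pitch, pitch * n);
    cudaFree(d_a);
    return 0;
}
```

Calling `report_memory(10928)` and then `report_memory(11500)` should show how much memory is actually free and how much the padded allocation needs on a given card.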

I get it done in about 11 s now. And I have more questions:

  • Is there a way to stream cublas calls?
  • When I use a block size larger than 20x20, I get wrong results. I reduced the registers per thread to 8 via --maxrregcount, so I should be able to use 32x32, since maxRegsPerBlock/(32*32) = 8. Am I missing something? What else could be the problem? (I have enough shared memory and I get no error code.)
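One thing worth checking for the block-size question: a kernel launch does not return a status by itself, so a rejected launch configuration only shows up through `cudaGetLastError()`. Registers are also not the only per-block limit. The device reports a `maxThreadsPerBlock` value as well, and on compute capability 1.x hardware that limit is 512 threads, which a 32x32 block (1024 threads) would exceed. Whether that is the cause here is an assumption. The sketch below uses hypothetical helper names (`dummy_kernel`, `block_fits`, `try_launch`, `print_limits`) and `cudaDeviceSynchronize`, which on older toolkits was called `cudaThreadSynchronize`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel; writes only if out is non-NULL.
__global__ void dummy_kernel(int *out)
{
    if (out != NULL && threadIdx.x == 0 && threadIdx.y == 0) *out = 1;
}

// True if a bx x by block stays within the given threads-per-block limit.
bool block_fits(int bx, int by, int max_threads_per_block)
{
    return bx * by <= max_threads_per_block;
}

// Prints the limits of device 0 and returns maxThreadsPerBlock (-1 on error).
int print_limits(void)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;
    printf("maxThreadsPerBlock = %d, regsPerBlock = %d\n",
           prop.maxThreadsPerBlock, prop.regsPerBlock);
    return prop.maxThreadsPerBlock;
}

// Launches dummy_kernel with a bx x by block and returns the resulting status.
cudaError_t try_launch(int bx, int by)
{
    cudaGetLastError();                    // clear any earlier error
    dim3 block(bx, by);
    dummy_kernel<<<1, block>>>(NULL);

    cudaError_t err = cudaGetLastError();  // reports invalid launch configurations
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();     // reports errors during execution
    }
    return err;
}
```

If `try_launch(32, 32)` returns something other than `cudaSuccess` on the card in question, the kernel never ran for that block size, and the "wrong results" would just be whatever was in the output buffer beforehand.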