Extremely unexpected behaviour of worker threads – help needed

Here is my application setup:

  1. WinXP, CUDA 2.1, two identical GPUs (gtx280) installed

  2. Once the application starts, I call cuInit(0) from the main application thread

  3. When the user presses the “Start Task” button on the main dialog, I start one more secondary thread that should actually do the job (it interacts with the main application thread via windows messages)

Neither the main application thread nor the secondary thread “knows” anything about CUDA: no CUDA contexts are created for them, no cudaSetDevice is called, no cudaGetDeviceCount or cudaGetDeviceProperties is called … absolutely nothing CUDA-related (except cuInit(0)) is done in those threads.

In the secondary thread I prepare a piece of data to be crunched on the GPU (the data structure is simple: float* pData = new float[…]). Then, from the secondary thread, I start two worker threads (one per GPU; I create a CUDA context inside each of them and call cuCtxDetach when the thread finishes) that run kernels processing the pre-calculated pData (pData is copied into GPU memory via cudaMemcpy).
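For clarity, each GPU worker thread does roughly the following (a simplified sketch of my setup; WorkerArgs and RunKernel are stand-ins for the real code, and all error checking is omitted):

```cpp
#include <windows.h>
#include <cuda.h>          // driver API: cuDeviceGet / cuCtxCreate / cuCtxDetach
#include <cuda_runtime.h>  // runtime API: cudaMalloc / cudaMemcpy / cudaFree

void RunKernel(float* d_data, size_t count);  // wraps the actual kernel launch

// Hypothetical argument block handed to each worker thread.
struct WorkerArgs {
    int    deviceIndex;  // 0 or 1
    float* pData;        // host buffer prepared in the secondary thread
    size_t count;        // number of floats in pData
};

DWORD WINAPI GpuWorker(LPVOID arg)
{
    WorkerArgs* a = (WorkerArgs*)arg;

    // One context per thread, bound to this thread's GPU.
    CUdevice  dev;
    CUcontext ctx;
    cuDeviceGet(&dev, a->deviceIndex);
    cuCtxCreate(&ctx, 0, dev);  // the CU_CTX_SCHED_ flag makes no difference here

    // Copy the pre-calculated host data to this GPU and crunch it.
    float* d_pData = NULL;
    cudaMalloc((void**)&d_pData, a->count * sizeof(float));
    cudaMemcpy(d_pData, a->pData, a->count * sizeof(float),
               cudaMemcpyHostToDevice);
    RunKernel(d_pData, a->count);

    cudaFree(d_pData);
    cuCtxDetach(ctx);  // release the context before the thread exits
    return 0;
}
```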

And this is what I see: when I work with the pData pre-calculated in the secondary thread, the application becomes completely unstable (cudaFree may fail with an ‘unexpected launch failure’ error, cudaBindTexture/cudaUnbindTexture may fail, the whole system may hang, etc.).

BUT when I move the pre-calculation of pData into those two GPU worker threads (recalculating it twice independently, once in each worker thread), everything works fine.

pData is just a piece of host heap in both cases, and its contents are identical in both cases! How can pData depend on the thread it was calculated in, when it is just a fragment of the host heap? Also, no matter how many worker threads are started (1 or 2), the behaviour is the same in both cases. Any advice is appreciated.

Thanks in advance!

So let me get this straight

  • primary thread
    – secondary thread
    — GPU worker thread 1
    — GPU worker thread 2

When you compute pData in the secondary thread, everything blows up, but when you do it in the GPU worker threads, it works fine. Is that right?

Absolutely.

This thing does not work:

  • primary thread (cuInit(0))
    – secondary thread (pData = new float[…]; compute pData)
    — GPU worker thread 1 (CreateContext(any CU_CTX_SCHED_ mode; it makes no difference); cudaMemcpy(d_pData1, pData); RunKernel; CtxDetach)
    — GPU worker thread 2 (CreateContext(any CU_CTX_SCHED_ mode; it makes no difference); cudaMemcpy(d_pData2, pData); RunKernel; CtxDetach)

This thing does work:

  • primary thread (cuInit(0))
    – secondary thread
    — GPU worker thread 1 (pData = new float[…]; compute pData; CreateContext(); cudaMemcpy(d_pData1, pData); RunKernel; CtxDetach)
    — GPU worker thread 2 (pData = new float[…]; compute pData; CreateContext(); cudaMemcpy(d_pData2, pData); RunKernel; CtxDetach)
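In code terms the only difference between the two variants is where pData is filled in. A minimal sketch (ComputeData, StartGpuWorker and N are placeholders for the real code):

```cpp
#include <cstddef>

const size_t N = 1 << 20;                             // placeholder element count
void ComputeData(float* p, size_t n);                 // fills p with the input series
void StartGpuWorker(int device, float* p, size_t n);  // spawns one GPU worker thread

// Variant that fails: the secondary thread computes pData,
// the workers only consume it.
void SecondaryThreadBody_Failing()
{
    float* pData = new float[N];
    ComputeData(pData, N);        // pre-calculation in the secondary thread
    StartGpuWorker(0, pData, N);  // worker: cuCtxCreate -> cudaMemcpy ->
    StartGpuWorker(1, pData, N);  //         RunKernel -> cuCtxDetach
}

// Variant that works: each worker computes its own private copy.
void GpuWorkerBody_Working(int deviceIndex)
{
    float* pData = new float[N];
    ComputeData(pData, N);        // same calculation, now inside the worker
    // cuCtxCreate / cudaMemcpy / RunKernel / cuCtxDetach as before ...
    delete[] pData;
}
```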

Can you post/email me the source? Are you 100% sure that you’re not doing something very bad in the secondary thread and wrecking everything, which could cause the memcpy to fail?

Not sure I can post the source … it’s a massive system, about 1 MB of source code, and I can hardly extract a self-contained fragment.

In general: is it possible to cause such a problem by creating/destroying CUDA contexts incorrectly? Could this issue be related to CUDA usage even theoretically?

I will double- and triple-check the code in the secondary thread and let you know.

However, just to clarify things: it is not only cudaMemcpy that fails. In the debug configuration, cudaMemcpy, cudaFree and cudaBindTexture/cudaUnbindTexture fail at random, and even the kernel launch may fail or hang (sometimes the whole system hangs and I am forced to reset it). And this does not happen only on the first attempt to launch a kernel from the worker thread - 3-4 attempts may succeed and the fifth may fail (once again, everything works perfectly when I calculate the source data in the worker thread).

In the release configuration the situation is even stranger: the CUDA functions do not report errors, but each kernel launch takes longer and longer, until finally the system just hangs.

It doesn’t sound like it’s directly a CUDA issue (I don’t know how context behavior would cause this), but I am curious to take a look.

I’ve checked the code out, the facts are:

  1. pData (let’s call the source data calculated in the secondary thread pData1) is always calculated, even when I duplicate its calculation in the worker threads - so the secondary thread works identically in both cases.
  2. pData1, calculated in the secondary thread, contains exactly the same values as pData2 (let’s call the source data recalculated in the worker threads pData2) - I have checked this; yet the kernels fail to work with pData1 from the secondary thread.

I can make (2) even stronger: to process pData1 or pData2 in the worker thread, I call an Evaluate(pData1 (or pData2)) routine that evaluates it (launches a kernel on it). Evaluate() may also accept NULL as a parameter - in that case it recalculates pData inside itself (let’s call this locally recalculated data pData3). Everything works fine only with pData3, the data recalculated inside Evaluate()! Neither pData1 from the secondary thread nor pData2 calculated in the worker thread immediately before the call to Evaluate() succeeds; only when all calculations are done inside a single routine does everything work.

Once again: the contents of the pData1, pData2 and pData3 arrays are exactly the same.
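To make that concrete, Evaluate() is structured roughly like this (a sketch; ComputeData, N and the kernel-launch part are placeholders for the real code):

```cpp
#include <cstddef>

const size_t N = 1 << 20;              // placeholder element count
void ComputeData(float* p, size_t n);  // fills p with the input series

// A NULL argument means "recalculate the data yourself"; that internal
// copy is what I called pData3 above.
void Evaluate(float* pExternal /* pData1, pData2, or NULL */)
{
    float* pLocal = NULL;
    float* pData  = pExternal;
    if (pData == NULL) {
        pLocal = new float[N];
        ComputeData(pLocal, N);  // pData3: computed inside Evaluate()
        pData  = pLocal;
    }

    // cudaMalloc / cudaMemcpy / kernel launch on pData ...
    // Only the pData3 path (pExternal == NULL) behaves correctly.

    delete[] pLocal;
}
```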

Found the bug - CUDA is not implicated at all.
I did my best explaining the problem to you, and in the process I finally understood the cause myself …

Thank you for your patience!

Was it some kind of heap corruption from an out-of-bounds memory access? Those can be really hard to track down and cause weird errors.

Not exactly. The input data (pData) was calculated correctly in all cases (secondary thread, worker thread, or inside Evaluate()), but the Evaluate() routine handled external data incorrectly. In general, the length of the pData array can be derived from pData’s contents - Evaluate() did this wrong for external data series (the length it computed was shorter than the actual length of the pData array passed in). Consequently, the subsequent cudaMalloc was undersized, the cudaMemcpy was truncated, the data that reached the GPU was cut off, and the kernel could not handle it properly, which usually led to infinite loops.
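In code terms the failure mode was roughly this (a sketch; LengthFromContents stands in for the real length-derivation logic, and RunKernel wraps the kernel launch):

```cpp
#include <cuda_runtime.h>

size_t LengthFromContents(const float* p);  // derives the element count from the data
void   RunKernel(float* d_data, size_t n);  // wraps the actual kernel launch

void EvaluateBuggy(float* pData)
{
    // Bug: for externally supplied arrays (pData1/pData2) the derived
    // element count came out shorter than the actual array length.
    size_t n = LengthFromContents(pData);

    float* d_pData = NULL;
    cudaMalloc((void**)&d_pData, n * sizeof(float));  // undersized allocation
    cudaMemcpy(d_pData, pData, n * sizeof(float),     // truncated copy: the tail of
               cudaMemcpyHostToDevice);               // pData never reaches the GPU

    RunKernel(d_pData, n);  // the kernel sees cut-off data and typically loops forever
    cudaFree(d_pData);
}
```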