Multi-threaded issue with --default-stream per-thread

Hi,
I’m using the --default-stream per-thread compilation flag with multiple host threads, and calling cudaStreamSynchronize(cudaStreamPerThread) in each thread to synchronize that thread’s stream.
The code hangs. Under gdb, all of the threads appear to be blocked inside cudaStreamSynchronize.
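
For context, each worker thread does roughly the following (a simplified sketch; the real kernel and launch configuration are omitted):

#include <cuda_runtime.h>

__global__ void work() { /* ... real work ... */ }

// Per-thread pattern (built with --default-stream per-thread):
void* worker(void*)
{
  work<<<1, 1>>>();                            // lands in this thread's default stream
  cudaStreamSynchronize(cudaStreamPerThread);  // wait only on this thread's stream
  return NULL;
}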

Does anyone have an idea? Is this a bug in this configuration?

thanks
Eyal

No one uses this feature??

In my experience you’re more likely to get help if you provide a short, complete test case demonstrating the issue.

I created a simple test case and it seems to work fine for me.

$ cat t325.cu
#include <iostream>
#include <pthread.h>

const size_t dt = 1000000000ULL;  // device clock ticks each kernel busy-waits
const size_t nt = 4;              // number of host threads

// Kernel that spins for roughly dt device clock ticks
__global__ void k(){
  size_t start = clock64();
  while (clock64() < start+dt);
}

// Placeholder for per-thread arguments (unused here)
typedef struct {
} ptArgs;

// Thread routine: launch a kernel into this thread's default stream,
// then synchronize only that stream
static void* rt(void* args)
{
  k<<<1,1>>>();
  cudaStreamSynchronize(cudaStreamPerThread);
  std::cout << "thread exiting" << std::endl;
  return NULL;
}

int main(int argc, char* argv[])
{
  pthread_t pt[nt];
  ptArgs args[nt];
  for (size_t t = 0; t < nt; ++t) {
    pthread_create(pt + t, NULL, &rt, (void*)(args + t));
  }
  std::cout << "threads created" << std::endl;
  for (size_t t = 0; t < nt; ++t) {
    pthread_join(pt[t], NULL);
  }
  return 0;
}
$ nvcc -o t325 t325.cu -lpthread --default-stream per-thread
$ cuda-memcheck ./t325
========= CUDA-MEMCHECK
threads created
thread exiting
thread exiting
thread exiting
thread exiting
========= ERROR SUMMARY: 0 errors
$

CUDA 10.0, CentOS 7, Tesla P100

Hi Robert,
Thanks for the answer. Indeed, I was unable to reproduce the issue with the code you sent.
However, I think I’ve found the root cause. Inside the thread function I was calling the nvtx* functions that NVIDIA provides, using them as described in the documentation.

It seems the issue was that the nvtxEventAttributes_t parameter, passed by reference to nvtxRangePushEx, went out of scope too soon.
I have not yet been able to reproduce the hang with the code you sent; I’ll try later this week. However, creating the nvtxEventAttributes_t variable passed to nvtxRangePushEx on the stack, and making sure it does not go out of scope until the matching pop operation and the cudaStreamSynchronize(cudaStreamPerThread) call have completed, cleared the deadlock.
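
For reference, the corrected pattern looks roughly like this (a sketch assuming the standard NVTX v2 attributes API; the range name and kernel are illustrative):

#include <nvToolsExt.h>    // link with -lnvToolsExt
#include <cuda_runtime.h>

__global__ void k();       // e.g. the kernel from the test case above

void* worker(void*)
{
  // The attributes struct lives on this thread's stack and stays in
  // scope until after nvtxRangePop and the stream synchronize complete.
  nvtxEventAttributes_t attrib = {0};
  attrib.version = NVTX_VERSION;
  attrib.size = NVTX_EVENT_ATTRIB_STRUCT_SIZE;
  attrib.messageType = NVTX_MESSAGE_TYPE_ASCII;
  attrib.message.ascii = "worker range";  // illustrative name

  nvtxRangePushEx(&attrib);               // takes a pointer to the attributes
  k<<<1,1>>>();
  cudaStreamSynchronize(cudaStreamPerThread);
  nvtxRangePop();
  return NULL;
}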

So I guess I’m now wondering about two things:

  • Could corruption of the nvtxEventAttributes_t parameter cause cudaStreamSynchronize to hang? Why?
  • Why do all the profiling functions take their parameters (for example, the nvtxEventAttributes_t passed to nvtxRangePushEx) by reference rather than by value?

thanks
Eyal