Device code cannot run when its parent host thread is sleeping?

Do I understand it correctly that the device code cannot run when its parent host thread is sleeping, e.g. waiting on a condition variable?

I’m launching async kernels from a host thread, and then having that host thread wait, but it seems that this makes the device code stall too, even though I don’t call cudaDeviceSynchronize until much later.

It’s not obvious to me that should be the case. What threading model are you using? What is the “waiting on a condition variable” exactly, in terms of threading functions (if any)? The CUDA runtime spins up its own host threads to manage various kinds of activity.

I’m not sure what calling cudaDeviceSynchronize() would have to do with this. It doesn’t make device code stall or un-stall; it only blocks the calling host thread until previously issued device work has finished. In fact, I would expect even fewer issues if you called cudaDeviceSynchronize() from that host thread before “waiting” on the condition variable. That would pretty much guarantee that the device code has completed by the time you begin waiting.
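To make that ordering concrete, here is a minimal sketch (not the original poster's code) of launching a kernel, calling cudaDeviceSynchronize(), and only then waiting on a condition variable. The names set_flag, signaller and launch_sync_then_wait are made up for illustration, and I'm assuming pthreads is the threading model in question.

```cuda
#include <cstdio>
#include <pthread.h>
#include <cuda_runtime.h>

// Hypothetical kernel: writes 1 to a flag in device memory.
__global__ void set_flag(int *flag) { *flag = 1; }

// Hypothetical shared state for the condition-variable wait.
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static bool ready = false;

// Helper thread that wakes the waiting thread.
static void *signaller(void *)
{
    pthread_mutex_lock(&mtx);
    ready = true;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&mtx);
    return NULL;
}

// Returns the flag value read back from the device, or -1 on a CUDA error.
int launch_sync_then_wait()
{
    int *d_flag = NULL;
    int h_flag = 0;
    if (cudaMalloc(&d_flag, sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(d_flag, 0, sizeof(int));

    set_flag<<<1, 1>>>(d_flag);                 // asynchronous launch
    cudaError_t err = cudaDeviceSynchronize();  // host blocks until the kernel is done

    // Only now does the host thread go to sleep on the condition variable.
    pthread_t t;
    pthread_create(&t, NULL, signaller, NULL);
    pthread_mutex_lock(&mtx);
    while (!ready) pthread_cond_wait(&cv, &mtx);
    pthread_mutex_unlock(&mtx);
    pthread_join(t, NULL);

    if (err == cudaSuccess)
        err = cudaMemcpy(&h_flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_flag);
    return (err == cudaSuccess) ? h_flag : -1;
}

int main(void)
{
    printf("flag read back: %d\n", launch_sync_then_wait());
    return 0;
}
```

Because the synchronize call comes before the wait, the kernel cannot still be pending while the host thread sleeps, which takes the "sleeping host thread" question out of the picture entirely.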

Questions like these are usually easier to answer if you provide a short, complete (i.e. ready-to-be-compiled) code that demonstrates the observation.

Here’s a simple code that I wrote to prove to myself that a sleeping thread does not inhibit device execution:

$ cat
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>

#define NUM_THR 2

__device__ int comm = 0;

typedef struct {
   int thr_num;
} thread_st;

__global__ void kernel_fc()
{
  // set the device-side flag so a host thread can see that the kernel ran
  comm = 1;
}

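void *thread_func(void* struc)
{
    thread_st *data = (thread_st*)struc;
    if (!data->thr_num) {
      // thread 0: launch the kernel, then go to sleep
      // (the launch and sleep durations are assumed; they are missing from the post as shown)
      kernel_fc<<<1,1>>>();
      sleep(10);
    }
    else {
      // other threads: wait a bit, then read the flag while thread 0 is still asleep
      sleep(5);
      int my_comm = 0;
      cudaMemcpyFromSymbol(&my_comm, comm, sizeof(int));
      if (my_comm) printf("success!\n");
    }
    printf("thread %d func exit\n", data->thr_num);
    return NULL;
}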

int main(void)
{
    // Make thread data objects
    thread_st thread_data[NUM_THR];

    // Make threads
    pthread_t pthread[NUM_THR];

    // assign thread numbers
    for (int j=0; j<NUM_THR; j++) {
      thread_data[j].thr_num = j;
    }

    // Create and execute pthread
    for (int j=0; j<NUM_THR; j++) {
      pthread_create(&pthread[j], NULL, thread_func, (void*)&(thread_data[j]));
    }

    // Join pthread
    for(int i=0; i<NUM_THR; i++) {
        pthread_join(pthread[i], NULL);
    }

    return 0;
}

$ nvcc -arch=sm_20 -o t619
$ ./t619
thread 1 func exit
thread 0 func exit

I had one host thread waiting to launch kernels after another host thread had launched its own kernels and registered the appropriate events. However, that other thread had a needless call to cudaDeviceSynchronize, stalling the first thread. (I thought I had fixed that, but I must have inadvertently profiled the buggy version.) Thanks for the input!
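For anyone who lands here with the same setup, this is roughly the pattern I mean, as a minimal sketch rather than my actual code: one host thread launches a kernel and records an event, and the other host thread makes its own stream wait on that event with cudaStreamWaitEvent instead of relying on cudaDeviceSynchronize. All names (produce, consume, Shared, run_event_handoff) are hypothetical, and the condition variable is only there so the second thread does not call cudaStreamWaitEvent before the event has been recorded.

```cuda
#include <cstdio>
#include <pthread.h>
#include <cuda_runtime.h>

// Hypothetical kernels: the second depends on the result of the first.
__global__ void produce(int *buf) { *buf = 41; }
__global__ void consume(int *buf) { *buf += 1; }

// Hypothetical state shared between the two host threads.
struct Shared {
    int *d_buf;
    cudaEvent_t produced;
    pthread_mutex_t mtx;
    pthread_cond_t cv;
    bool recorded;
};

static void *producer(void *arg)
{
    Shared *s = (Shared *)arg;
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    produce<<<1, 1, 0, stream>>>(s->d_buf);
    cudaEventRecord(s->produced, stream);
    // No cudaDeviceSynchronize() here: just tell the other thread the event is recorded.
    pthread_mutex_lock(&s->mtx);
    s->recorded = true;
    pthread_cond_signal(&s->cv);
    pthread_mutex_unlock(&s->mtx);
    cudaStreamDestroy(stream);  // resources are released once pending work completes
    return NULL;
}

static void *consumer(void *arg)
{
    Shared *s = (Shared *)arg;
    // Host-side wait until the producer has recorded the event.
    pthread_mutex_lock(&s->mtx);
    while (!s->recorded) pthread_cond_wait(&s->cv, &s->mtx);
    pthread_mutex_unlock(&s->mtx);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaStreamWaitEvent(stream, s->produced, 0);  // device-side dependency only
    consume<<<1, 1, 0, stream>>>(s->d_buf);
    cudaStreamSynchronize(stream);                // wait for this stream only
    cudaStreamDestroy(stream);
    return NULL;
}

// Returns the final buffer value (42 if both kernels ran in order).
int run_event_handoff()
{
    Shared s;
    s.recorded = false;
    pthread_mutex_init(&s.mtx, NULL);
    pthread_cond_init(&s.cv, NULL);
    cudaMalloc(&s.d_buf, sizeof(int));
    cudaMemset(s.d_buf, 0, sizeof(int));
    cudaEventCreateWithFlags(&s.produced, cudaEventDisableTiming);

    pthread_t p, c;
    pthread_create(&c, NULL, consumer, &s);
    pthread_create(&p, NULL, producer, &s);
    pthread_join(p, NULL);
    pthread_join(c, NULL);

    int h_buf = 0;
    cudaMemcpy(&h_buf, s.d_buf, sizeof(int), cudaMemcpyDeviceToHost);
    cudaEventDestroy(s.produced);
    cudaFree(s.d_buf);
    pthread_cond_destroy(&s.cv);
    pthread_mutex_destroy(&s.mtx);
    return h_buf;
}

int main(void)
{
    printf("final value: %d\n", run_event_handoff());
    return 0;
}
```

The point of the sketch is that the producer thread never blocks on the whole device; the ordering between the two kernels is expressed through the event, so neither host thread stalls the other longer than the short condition-variable handoff.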