Calling a CUDA kernel with buffers multiple times causes "Invalid argument error ID 1"

I am implementing an MPI-based master-worker program. The master has a stream of tasks and assigns one to a worker as soon as that worker becomes available. Each worker process runs on one node with six GPUs, so it launches kernels on these six GPUs with some buffers as arguments. The first time a worker is assigned a task, it completes the task correctly. But when it is assigned a task for the second time, I get “Invalid argument error ID 1”. If I change the kernel to work only with scalar values (say, int num_of_values) instead of a buffer (say, int* data), there are no issues. Can someone suggest what the issue might be?

Before launching the kernel, I create the device buffer using cudaMalloc(), and I free it using cudaFree().
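The allocate, launch, free cycle described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual program: the body of gpuErrorCheck, the layout of MyData, the kernel body, the run_task helper, and the launch configuration are all hypothetical placeholders, and the GPUs are handled one after another for simplicity. It selects each device with cudaSetDevice() before allocating, passes the device pointer for that GPU to the kernel, and checks for errors after the launch.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-check macro; the original gpuErrorCheck is not shown.
#define gpuErrorCheck(call)                                                \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "CUDA error %d (%s) at %s:%d\n",          \
                         static_cast<int>(err_), cudaGetErrorString(err_), \
                         __FILE__, __LINE__);                              \
            std::exit(EXIT_FAILURE);                                       \
        }                                                                  \
    } while (0)

// Placeholder layout; the real MyData is not shown in the question.
struct MyData { int value; };

// Placeholder kernel: each thread writes one element of the buffer.
__global__ void myKernel(MyData *data, size_t n) {
    size_t idx = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (idx < n) data[idx].value = 0;
}

// Hypothetical helper: handles one task on every GPU of the node, in turn.
void run_task(int ngpus, size_t data_size) {
    const int num_of_threads = 256;  // assumed block size
    const int num_blocks =
        static_cast<int>((data_size + num_of_threads - 1) / num_of_threads);

    for (int i = 0; i < ngpus; i++) {
        MyData *buf_d = nullptr;
        gpuErrorCheck(cudaSetDevice(i));  // make GPU i the current device
        gpuErrorCheck(cudaMalloc(&buf_d, sizeof(MyData) * data_size));

        // Pass the device pointer itself, not a host-side array of pointers.
        myKernel<<<num_blocks, num_of_threads>>>(buf_d, data_size);
        gpuErrorCheck(cudaGetLastError());       // launch-time errors
        gpuErrorCheck(cudaDeviceSynchronize());  // execution-time errors

        gpuErrorCheck(cudaFree(buf_d));  // free only after the kernel is done
    }
}

int main() {
    int ngpus = 0;
    gpuErrorCheck(cudaGetDeviceCount(&ngpus));
    const size_t data_size = 10000ull * 10000ull / 2;

    // Run two tasks back to back to mimic a worker being reused.
    for (int task = 0; task < 2; task++) run_task(ngpus, data_size);
    return 0;
}
```

Running the cycle twice in main() mirrors the situation where a worker receives a second task; in a real MPI worker, run_task would be called once per received task instead.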

At a high level:

while (there are more tasks)
    send_task_to(worker w)


    int ngpus = 6;
    long data_size = 10000 * 10000 / 2;
    MyData *data = new MyData[ngpus];
    for (int i = 0; i < ngpus; i++) {
        // allocate the device buffer for GPU i
        gpuErrorCheck( cudaMalloc( &comb_d[i], sizeof(MyData) * data_size ) );
        myKernel<<<num_blocks, num_of_threads>>>(comb_d);
    }