I modified the simpleMultiGPU SDK sample so that, as in the original example, a kernel is launched on two different devices. Unlike the SDK sample, however, each host thread allocates its own device memory with cudaMalloc. When I execute a shutdown routine for the GPU inside the thread using CUDA_SAFE_CALL(cudaFree(device->pointer_to_device_data)), I get this error at run time (not a compilation error): Cuda error in file ‘2GPU.cu’ in line ### : unspecified launch failure.
Is there any restriction on where memory allocated inside the programme can be freed? I haven’t tried having the main function free the device memory from outside the thread, since that would only be a patch around the design I’m trying to achieve: each thread should prepare its own device (memory allocation and data copying between host and device), execute the kernel, and then clean up that device afterwards.
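For reference, here is a minimal sketch of the per-thread prepare/execute/cleanup pattern I mean (illustrative names such as threadFunc and scaleKernel, not taken from the SDK sample; error checking abbreviated; assumes the same one-host-thread-per-device model as simpleMultiGPU):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scaleKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

// Routine executed by one host thread per GPU.
void threadFunc(int dev, const float *h_in, float *h_out, int n)
{
    // Bind this host thread to its device first; every later runtime
    // call (malloc, launch, free) then targets the same context.
    cudaSetDevice(dev);

    float *d_data = NULL;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, n);

    // Synchronize BEFORE freeing: if the kernel itself crashed, the
    // "unspecified launch failure" is otherwise reported by the next
    // API call, which here would be cudaFree, making the free look
    // like the culprit even though the fault is in the kernel.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel on device %d failed: %s\n",
                dev, cudaGetErrorString(err));

    cudaMemcpy(h_out, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);   // freeing here, in the allocating thread/context
}
```

The cudaFree at the end is issued from the same thread and context that did the cudaMalloc, which is the arrangement my question is about.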