cudaMalloc from device not working

Any idea why cudaMalloc returns 30 (unknown error)? I am compiling for compute_61,sm_61.

#include <stdio.h>
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <curand.h>
#include "device_launch_parameters.h"

__global__ void Test()
{
	void* ptr;
	cudaError_t err = cudaMalloc(&ptr, 100);
}


int main()
{
	Test<<<1, 1>>>();
}

Make sure you are linking properly against the CUDA device runtime (-lcudadevrt).

If that doesn’t fix it, add an extra call to another CUDA runtime API function in the kernel, e.g. cudaDeviceSynchronize().

I’ve already filed a bug for this issue. You’re welcome to file your own bug, of course.

Thanks. I tried your suggestion; the modified source is below. Now it won’t debug, failing with the error “CUDA Dynamic Parallelism debugging is not supported in preemption mode. Breakpoints will be disabled.” This is on Windows 10 under CUDA 8.0 on both Pascal and Maxwell GPUs. I don’t have a headless GPU to test with, but it is unclear to me why calling cudaDeviceSynchronize would require dynamic parallelism. It happens on two different machines, so it is probably not specific to my setup.

#include <stdio.h>
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <curand.h>
#include "device_launch_parameters.h"

__global__ void Test()
{
	cudaError_t err = cudaDeviceSynchronize();
	void* ptr;
	err = cudaMalloc(&ptr, 100);
}


int main()
{
	Test<<<1, 1>>>();
}

Well, you don’t need to debug that code to figure out if err is 30 or 0. Just print it out.
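For reference, printing the value could look roughly like the sketch below. This is only an illustration, not the original poster's exact code: the file name and the nvcc command line in the comment are assumptions (adjust the architecture for your GPU), and the cudaFree call and the host-side error checks are additions for tidiness.

```cuda
// Assumed build line (hypothetical file name; relocatable device code is
// needed so the kernel can call the device runtime):
//   nvcc -arch=sm_61 -rdc=true print_err.cu -o print_err -lcudadevrt
#include <cstdio>
#include <cuda_runtime.h>

__global__ void Test()
{
    void* ptr = nullptr;

    // Device-side allocation through the CUDA device runtime.
    cudaError_t err = cudaMalloc(&ptr, 100);

    // Print the raw error code from the kernel instead of using a debugger.
    printf("device cudaMalloc returned %d\n", (int)err);

    // Memory allocated from device code has to be freed from device code.
    if (err == cudaSuccess) {
        cudaFree(ptr);
    }
}

int main()
{
    Test<<<1, 1>>>();

    // Check the launch itself, then wait for the kernel so that the
    // device-side printf output is flushed before the program exits.
    cudaError_t launchErr = cudaGetLastError();
    cudaError_t syncErr = cudaDeviceSynchronize();

    if (launchErr != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(launchErr));
        return 1;
    }
    if (syncErr != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(syncErr));
        return 1;
    }
    return 0;
}
```

The host-side cudaDeviceSynchronize() matters here: without it the program can exit before the kernel has run and before its printf buffer is flushed, so nothing would be printed.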

Regarding the message from Nsight: it uses some heuristic to decide whether your code is using dynamic parallelism (DP) or not. I suggested filing a bug for that here:

https://devtalk.nvidia.com/default/topic/1008799/cuda-programming-and-performance/dynamic-parallelism-detected-inappropriately/

Yes, it seems to work outside of debug mode. I really am starting to hate the Windows implementation of CUDA - so many things just don’t work…!