In several examples in the CUDA Programming Guide 2.3.1, they first allocate a buffer on the host and a matching buffer on the device, and then later call cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice).
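The pattern in question looks roughly like this (a minimal sketch; the buffer size, loop, and cleanup are my additions, with h_A/d_A named as in the guide):

```cuda
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
    int n = 1024;
    size_t size = n * sizeof(float);

    // Allocate and fill the buffer on the host.
    float* h_A = (float*)malloc(size);
    for (int i = 0; i < n; ++i) h_A[i] = (float)i;

    // Allocate a matching buffer on the device.
    float* d_A;
    cudaMalloc((void**)&d_A, size);

    // Copy the host data to the device, where a kernel can use it.
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

    // ... launch the matrix multiplication kernel on d_A ...

    cudaFree(d_A);
    free(h_A);
    return 0;
}
```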
In other words, they allocate the memory on the host and then copy it to the device, where it's used (in this case by a matrix multiplication kernel). Why bother with allocating on the host at all? Why not just allocate on the device and skip the host-to-device transfer? It seems a lot easier. There may not be a command that lets you call cudaMalloc and fill the memory directly on the device, and that could be a valid point. However, the question is still valid: maybe there should be such a command if there currently isn't one.
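For what it's worth, when the data can be computed on the GPU rather than read from somewhere on the host, you can skip the transfer today by initializing the buffer with a small kernel. This is my own sketch, not something from the guide; the point is that data which originates on the host (files, user input) still has to be copied over at some stage, which is presumably why the guide's examples do it:

```cuda
#include <cuda_runtime.h>

// Fill the array in place on the device: no host buffer, no cudaMemcpy.
__global__ void init(float* a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = (float)i;
}

int main(void) {
    int n = 1024;
    float* d_A;
    cudaMalloc((void**)&d_A, n * sizeof(float));

    // Initialize directly on the GPU instead of copying from the host.
    init<<<(n + 255) / 256, 256>>>(d_A, n);
    cudaDeviceSynchronize();

    // ... launch the kernel that consumes d_A ...

    cudaFree(d_A);
    return 0;
}
```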