In several examples of the CUDA Programming Guide 2.3.1 they call
cudaMalloc((void**)&d_A, size);
and then later they call
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
In other words, they allocate the memory on the host and then copy it to the device, where it's used (in this case by a matrix multiplication kernel). Why bother with allocating on the host system? Why not just allocate on the device and skip the host-to-device transfer? It seems a lot easier. There may not be a command to allow cudaMalloc on the device, and that could be a valid point. However, the question is still valid. Maybe there should be such a command if there currently is not one.
Err no. cudaMalloc allocates memory on the device, not the host. The cudaMemcpy is copying from another piece of host memory into the memory allocated by cudaMalloc. This is all covered in Ch 3 of the programming guide…
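To make the flow concrete, here is a minimal sketch of the pattern those examples follow (the names h_A and d_A match the snippets above; the array size and initialization are placeholders). The point is that the data originates on the CPU side, so it has to sit in host memory first; cudaMalloc then carves out a device buffer, and cudaMemcpy moves the host data into it so a kernel can read it:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    const int N = 1024;
    size_t size = N * sizeof(float);

    /* Host allocation: the input data starts out in CPU memory. */
    float* h_A = (float*)malloc(size);
    for (int i = 0; i < N; ++i)
        h_A[i] = (float)i;                  /* placeholder initialization */

    /* Device allocation: cudaMalloc returns GPU memory, not host memory. */
    float* d_A = NULL;
    cudaMalloc((void**)&d_A, size);

    /* Copy the host data into the device buffer so a kernel can use it. */
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

    /* ... launch a kernel that reads d_A here ... */

    cudaFree(d_A);
    free(h_A);
    return 0;
}

So there is no "allocate on the host then copy" step to skip: the malloc and the cudaMalloc are two different buffers in two different memories, and the copy is how the device gets the host's data.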