cudaMalloc() vs malloc() in pure C

Hi,

I tried to look up the specs for cudaMalloc() in the docs (CUDA Runtime API :: CUDA Toolkit Documentation) but couldn’t seem to find my answer.

Is cudaMalloc (and its associated routines) designed to run/work even if the user does not have a CUDA-capable device installed? Or do we need two separate sets of routines (i.e. one that uses cudaMalloc and one that uses malloc, depending on the hardware)?

*Obviously avoiding the case of doing something dumb, like trying to make a transfer when there is no device there.

cudaMalloc allocates memory on the device. What would you expect if no device is connected? An error as the return value?
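In fact every runtime call, including cudaMalloc, returns a cudaError_t, so the "no device" case shows up as an error code you can check. A minimal sketch (the exact code reported on a machine without a GPU may vary, e.g. cudaErrorNoDevice or cudaErrorInsufficientDriver):

```
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);

    if (err != cudaSuccess || count == 0) {
        /* No usable CUDA device (or no driver): fall back to host-only code. */
        printf("No CUDA device: %s\n", cudaGetErrorString(err));
        return 0;
    }

    float *d_buf = NULL;
    err = cudaMalloc((void **)&d_buf, 1024 * sizeof(float));
    if (err != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaFree(d_buf);
    return 0;
}
```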

@Curefab Well, I just didn’t know if it was ‘smart enough’ to say, ‘oh, there is no device to do that, so let’s just implement it with a malloc-type function on the host’. I mean, there may be many use cases where you want to write a program that supports both CUDA and non-CUDA enabled machines. I didn’t know if that meant writing double routines in most cases or…

Also, apologies -- I only finished the ‘Getting Started’ course the other week, so I am still feeling my way through the details here.

It is possible (and left to you) to write an abstraction layer (e.g. C++ classes that use either CUDA or host memory), or there are existing (higher-level) libraries that can run on different hardware.

cudaMalloc itself is specifically for allocating (Nvidia GPU) device memory.
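As a rough sketch of such an abstraction layer in plain C (my_alloc/my_free and the use_cuda flag are made-up names here, not an existing API), one could pick the allocator at runtime and keep the rest of the code agnostic:

```
#include <stdlib.h>
#include <cuda_runtime.h>

/* Is there a usable device at all? */
int have_cuda_device(void)
{
    int count = 0;
    return cudaGetDeviceCount(&count) == cudaSuccess && count > 0;
}

/* Allocate on the device if requested, otherwise on the host. */
void *my_alloc(size_t bytes, int use_cuda)
{
    void *p = NULL;
    if (use_cuda) {
        if (cudaMalloc(&p, bytes) != cudaSuccess)
            return NULL;
    } else {
        p = malloc(bytes);
    }
    return p;
}

void my_free(void *p, int use_cuda)
{
    if (use_cuda) cudaFree(p);
    else          free(p);
}
```

The compute routines themselves would still need a CPU counterpart; a wrapper like this only hides where the memory lives.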

You do not have to apologize.

There are several forum posts about reusing functions between host and device code, and to a certain degree that is possible.

For the moment I would recommend the mindset of writing an optimized CUDA function and perhaps one slow CPU version as a reference for comparing results.

Even on the CPU alone, if one tries to accelerate as much as possible, one could write SSE or AVX code in addition to a more compatible implementation. Compared to those, Cuda is mostly quite high-level.

@Curefab thank you for the consideration.

I mean, these days, for ‘basic’ questions I could probably pull up some variant of GPT, but that is not my ‘cup of tea’, especially when the subject is new to you. I still write in Code::Blocks :D nothing fancy here. The only time I might use it is when there is something I knew but forgot, and can then quickly confirm in my head that the answer is right.

If I might ask, in the same thread of thought: one thing I haven’t decided yet, at least for the original implementation stage, is whether it is ‘more or less okay’ to set up my computation elements on the CPU (well, this is C, not C++, but they are basically like vectors) and then, just before the calculation is performed, transfer them from host to device (obviously in the case where that is warranted)?

I’m not doing anything ‘fancy or streaming’ with my data yet, and I’m basically looking to perform standard tensor operations. I also know NVIDIA has libraries for all this, but I am avoiding them for the moment because clarity, rather than pure speed, is the goal.

Or, from the very get-go, ought I to create everything I need on the device first (i.e. allocation, ‘loading’ the tensor, etc.)? I am trying to make my CUDA implementation mirror the CPU one as much as possible, at least from a code perspective, so one can learn something.
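Concretely, what I have in mind is something like this (just a sketch with made-up sizes and names):

```
#include <stdlib.h>
#include <cuda_runtime.h>

#define N (1024 * 1024)

int main(void)
{
    /* Set up the data on the host, as in the plain-C version... */
    float *h_tensor = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; ++i)
        h_tensor[i] = (float)i;

    /* ...then allocate on the device and copy just before the computation. */
    float *d_tensor = NULL;
    cudaMalloc((void **)&d_tensor, N * sizeof(float));
    cudaMemcpy(d_tensor, h_tensor, N * sizeof(float), cudaMemcpyHostToDevice);

    /* <launch kernel(s) on d_tensor here> */

    /* Copy results back and clean up. */
    cudaMemcpy(h_tensor, d_tensor, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_tensor);
    free(h_tensor);
    return 0;
}
```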

(For very slow handling you can use managed memory or zero-copy memory, neither of which needs explicit memory transfers. But I avoid the first and use the second only in special circumstances.)
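(A tiny sketch of those two options, in case it helps; note that on older setups zero-copy may additionally require cudaSetDeviceFlags(cudaDeviceMapHost) before the first allocation:)

```
#include <cuda_runtime.h>

int main(void)
{
    /* Managed ("unified") memory: one pointer usable on host and device;
       the driver migrates the data as needed. */
    float *managed = NULL;
    cudaMallocManaged((void **)&managed, 1024 * sizeof(float), cudaMemAttachGlobal);
    managed[0] = 1.0f;   /* host write, no explicit copy */
    cudaFree(managed);

    /* Zero-copy: mapped, pinned host memory that the GPU accesses over the bus. */
    float *h_mapped = NULL, *d_mapped = NULL;
    cudaHostAlloc((void **)&h_mapped, 1024 * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_mapped, h_mapped, 0);
    /* d_mapped can now be passed to a kernel; every access pays the PCIe latency. */
    cudaFreeHost(h_mapped);
    return 0;
}
```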

Otherwise I would put as much as possible on the GPU and directly in global device memory.

If you have simple functions to set up data (e.g. coefficients), why not run those on the GPU too, even if it is done by one or just a few threads? They can even do double-precision calculations if speed is no issue (for the setup or in general).
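For example, a setup kernel run by a single thread (a toy sketch, assuming a small coefficient array where setup time is irrelevant) keeps the data on the device from the start:

```
__global__ void setup_coefficients(double *coeff, int n)
{
    /* Deliberately single-threaded: speed does not matter for setup,
       and the data never has to leave device memory. */
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        for (int i = 0; i < n; ++i)
            coeff[i] = 1.0 / (double)(i + 1);
    }
}

/* Host side:
     double *d_coeff;
     cudaMalloc((void **)&d_coeff, n * sizeof(double));
     setup_coefficients<<<1, 1>>>(d_coeff, n);
*/
```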

The actual compute functions you can declare as __device__ __host__, so they run on both the CPU and the GPU from a single codebase.

E.g. you can have such a function take an iteration parameter.
On the CPU it is called in a for loop or from several CPU threads and receives the loop variable or the thread id; on the GPU it is called from a __global__ function with blockIdx.x and/or threadIdx.x as the parameter.
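A small sketch of that pattern (the function and kernel names are just examples):

```
#include <cuda_runtime.h>

/* One implementation, usable from both sides. */
__host__ __device__ float process_element(const float *in, int i)
{
    return 2.0f * in[i] + 1.0f;
}

/* GPU: one thread per element, the index comes from the launch configuration. */
__global__ void process_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = process_element(in, i);
}

/* CPU reference: the same function driven by a plain for loop. */
void process_cpu(const float *in, float *out, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = process_element(in, i);
}
```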

I once wrote a small ‘CUDA simulator’ on the CPU. I created templated classes that store a value 32 times (once for each of the 32 threads of a warp), plus functions for shuffling, warp matrix multiplication and synchronization. It worked well enough as a debugging aid to have the same general code structure as CUDA. Some blocks I implemented with lambdas as parameters, e.g. for an if block that has to be executed by only some of the 32 threads of a warp. Whenever an operation was done, e.g. two variables were multiplied (operator*), the C++ classes did it for the currently active ones of the 32 ‘threads’ (basically a [32] array with a for loop) in the background. So you can get quite far in making CUDA and CPU code look similar.
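Very much simplified, the core of that idea looks something like this (a sketch of the approach, not the actual code):

```
#include <array>
#include <cstdint>

/* Simulate one warp: every 'variable' holds 32 lanes, and an active mask
   decides which lanes an operation applies to (emulating divergence). */
template <typename T>
struct WarpVar {
    std::array<T, 32> lane{};
};

struct WarpContext {
    std::uint32_t active = 0xffffffffu;  /* all 32 'threads' active */
};

template <typename T>
WarpVar<T> mul(const WarpContext &ctx, const WarpVar<T> &a, const WarpVar<T> &b)
{
    WarpVar<T> r;
    for (int i = 0; i < 32; ++i)
        if (ctx.active & (1u << i))      /* only the active lanes compute */
            r.lane[i] = a.lane[i] * b.lane[i];
    return r;
}

/* An 'if block' executed by a subset of lanes, passed in as a lambda. */
template <typename Body>
void if_active(WarpContext &ctx, std::uint32_t mask, Body body)
{
    WarpContext inner = ctx;
    inner.active = ctx.active & mask;
    if (inner.active) body(inner);
}
```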
