Memory allocation: when to allocate

I just wanted to verify this before working on it in my CUDA kernel. Using matrix operations in a CUDA kernel requires that memory be allocated on the host and then transferred to the device. Likewise, the resultant matrix must be transferred back from device memory to host memory. I guess the bottom line here is that one must allocate whenever doing matrix operations in a CUDA kernel. I know the commands; I just do not always know when to use them.


No. It requires that memory is allocated on the device. Optionally you can copy initializing data from the host if it is necessary, and that requires a source buffer on the host, which doesn’t necessarily have to be dynamically allocated. If you compute the matrix product B = A*X on the device, then A, X and B all need memory allocations, but only A and X need to hold data for the operation to be valid. If you compute the classic gemm operation B = alpha*A*X + beta*B, then B should have valid values, which might require copying data to the device from the host. But it equally might not, if B were the result of a prior operation in GPU memory.


Again, only if you need the results on the device back in host memory. That isn’t always the case. For example, if you solve the matrix equation AX=B on the device and then compute the residual as R=AX-B, as you would do in a typical iterative scheme, then you might choose only to periodically copy R back to the host for convergence checking, and then copy the final X back to the host.
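To make the iterative pattern above concrete, here is a rough sketch. It is not a real solver: `relaxStep` is a made-up update that simply moves x halfway toward b on each step, and `solveOnDevice`, `checkEvery` and the other names are hypothetical. The only point is that x, b and r stay in device memory for the whole loop, r is copied to the host only every few iterations, and x is copied back once at the end. Error checking of the CUDA calls is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical update step: moves x halfway toward b and stores r = x - b.
__global__ void relaxStep(float *x, const float *b, float *r, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] = 0.5f * (x[i] + b[i]);
        r[i] = x[i] - b[i];
    }
}

// Iterates on the device; returns the number of steps taken and writes x to x_h.
int solveOnDevice(const std::vector<float> &b_h, std::vector<float> &x_h,
                  float tol, int maxIter, int checkEvery)
{
    int n = (int)b_h.size();
    size_t bytes = n * sizeof(float);
    float *x_d, *b_d, *r_d;
    cudaMalloc(&x_d, bytes);
    cudaMalloc(&b_d, bytes);
    cudaMalloc(&r_d, bytes);                 // r never needs host data copied in

    cudaMemcpy(b_d, b_h.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(x_d, 0, bytes);               // initial guess x = 0, set on the device

    std::vector<float> r_h(n);
    int block = 256, grid = (n + block - 1) / block;
    int it = 0;
    bool converged = false;
    while (it < maxIter && !converged) {
        relaxStep<<<grid, block>>>(x_d, b_d, r_d, n);
        ++it;
        if (it % checkEvery == 0) {          // only now does r cross to the host
            cudaMemcpy(r_h.data(), r_d, bytes, cudaMemcpyDeviceToHost);
            float maxAbs = 0.0f;
            for (float v : r_h) maxAbs = std::max(maxAbs, std::fabs(v));
            converged = maxAbs < tol;
        }
    }

    x_h.resize(n);
    cudaMemcpy(x_h.data(), x_d, bytes, cudaMemcpyDeviceToHost);  // final result only
    cudaFree(x_d); cudaFree(b_d); cudaFree(r_d);
    return it;
}
```

Calling something like `solveOnDevice(b, x, 1e-3f, 100, 5)` would transfer b once, r once every five steps, and x once. A larger `checkEvery` trades fewer transfers for possibly running a few extra iterations after convergence.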

Yes, but the memory allocation is device memory, not host memory as you described in your first sentence.

That is reassuring.

The following code is described as a simple CUDA matrix multiplication program. It allocates two arrays, a_h and b_h, on the host and two arrays, a_d and b_d, on the device.

All the code is doing is multiplying every element of A by 2 and putting the result in B.

When it allocates memory on the device the code itself is on the host.

Can A be allocated and populated on the device? Why allocate A on the host, populate it, move it to the device, multiply each element by 2 and put the result in B on the device? It seems that allocating A on the host as a_h and moving it to a_d is an unneeded step; just start with a_d. I can see having b_h and b_d, but there is no reason for a_h.

Is it possible this was done for demonstration purposes, or maybe the programmer was unaware of a shorter way to do it? In a lot of example code, people allocate on the host, move the data to the device, use it there and never use the host copy again. So why not just start out on the device and save a step?

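#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <sys/time.h>

// Device kernel: each thread doubles one element of A into B.
__global__ void vecMult_d(int *A, int *B, int N)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if(i<N) { B[i] = A[i]*2; }
}

// Host version of the same operation, used for the timing comparison.
void vecMult_h(int *A, int *B, int N)
{
   for(int i=0;i<N;i++) { B[i] = A[i]*2; }
}

int main() {
   int *a_h, *b_h; // pointers to host memory; a.k.a. CPU
   int *a_d, *b_d; // pointers to device memory; a.k.a. GPU
   int blocksize=512, n=501;
   struct timeval t1_start,t1_end,t2_start,t2_end;
   double time_d, time_h;

   // allocate arrays on host
   a_h = (int *)malloc(sizeof(int)*n);
   b_h = (int *)malloc(sizeof(int)*n);

   // allocate arrays on device
   cudaMalloc((void **)&a_d,n*sizeof(int));
   cudaMalloc((void **)&b_d,n*sizeof(int));

   dim3 dimBlock( blocksize );
   dim3 dimGrid( (n + blocksize - 1)/blocksize ); // round up so every element gets a thread

   // populate the input on the host
   for(int j=0;j<n;j++) a_h[j]=j;

   // GPU: copy the input to the device, run the kernel, copy the result back
   gettimeofday(&t1_start,0);
   cudaMemcpy(a_d,a_h,n*sizeof(int),cudaMemcpyHostToDevice);
   vecMult_d<<<dimGrid,dimBlock>>>(a_d,b_d,n);
   cudaMemcpy(b_h,b_d,n*sizeof(int),cudaMemcpyDeviceToHost); // waits for the kernel to finish
   gettimeofday(&t1_end,0);

   // CPU
   gettimeofday(&t2_start,0);
   vecMult_h(a_h,b_h,n);
   gettimeofday(&t2_end,0);

   // elapsed times in microseconds
   time_d = (t1_end.tv_sec-t1_start.tv_sec)*1000000 + t1_end.tv_usec - t1_start.tv_usec;
   time_h = (t2_end.tv_sec-t2_start.tv_sec)*1000000 + t2_end.tv_usec - t2_start.tv_usec;
   printf("%d %lf %lf\n",n,time_d,time_h);

   free(a_h); free(b_h);
   cudaFree(a_d); cudaFree(b_d);
   return 0;
}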

Yes, that is how the CUDA API works.

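Allocated, no; populated, yes. The cudaMalloc call still has to be issued from host code, but a kernel can fill a_d directly, so a_h and the host-to-device copy are not required.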
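As a rough illustration of populating on the device, the sketch below drops a_h entirely. `fillIndex_d` and `doubleIndicesOnDevice` are hypothetical names that are not in the posted program; `vecMult_d` is the kernel from the code above. The host still calls cudaMalloc for both arrays, but the input values are generated by a kernel, and only the result crosses back to the host. Error checking is omitted to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: writes A[i] = i straight into device memory.
__global__ void fillIndex_d(int *A, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) { A[i] = i; }
}

// Same kernel as in the posted program: B[i] = 2 * A[i].
__global__ void vecMult_d(int *A, int *B, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) { B[i] = A[i] * 2; }
}

// Hypothetical host driver: no a_h, no host-to-device copy.
std::vector<int> doubleIndicesOnDevice(int n)
{
    int *a_d, *b_d;
    cudaMalloc(&a_d, n * sizeof(int));   // allocation is still requested by the host
    cudaMalloc(&b_d, n * sizeof(int));

    int block = 512, grid = (n + block - 1) / block;
    fillIndex_d<<<grid, block>>>(a_d, n);      // populate A on the device
    vecMult_d<<<grid, block>>>(a_d, b_d, n);   // same stream, so it runs after the fill

    std::vector<int> b_h(n);
    cudaMemcpy(b_h.data(), b_d, n * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(a_d);
    cudaFree(b_d);
    return b_h;
}
```

This only works when the input can be computed on the device. If A comes from a file or from earlier host code, the a_h plus cudaMemcpy step in the original program is still needed.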