Impact of cudaMalloc() on CPU LLC

Hello!
I am working on a machine with 2 Quadro P4000 GPUs, CUDA Version: 11.2.
I am monitoring LLC accesses on CPU using MSRs.
Here are two pieces of code-

  1. On initializing and accessing the result array before cudaMalloc(), the number of LLC references and LLC misses for the monitored code is equal to the number of cache blocks required to store the result array.
        float *result = (float *) malloc (sizeof(float)* size);
        for(int i=0; i< size; i++)  result[i] = i;
        int temp0=0;
        for(int i=0; i< size; i++)  temp0 += result[i];
        printf("temp0 = %d\n", temp0);
        float *gpubuf;
        cudaMalloc( &gpubuf, sizeof(float) * size);
        int j, temp=0;

        //Read MSRs
        //monitored code starts here
        for(j=0; j< size; j++)  temp += result[j];
        //monitored code ends here
        //Read MSRs
        printf("temp = %d\n", temp);
  2. On initializing and accessing the result array after cudaMalloc(), the number of LLC references and LLC misses for the monitored code falls to nearly zero.
        float *gpubuf;
        cudaMalloc( &gpubuf, sizeof(float) * size);
        float *result = (float *) malloc (sizeof(float)* size);
        for(int i=0; i< size; i++)  result[i] = i;
        int temp0=0;
        for(int i=0; i< size; i++)  temp0 += result[i];
        printf("temp0 = %d\n", temp0);
        int j, temp=0;

        //Read MSRs 
        //monitored code starts here
        for(j=0; j< size; j++)  temp += result[j];
        //monitored code ends here
        //Read MSRs
        printf("temp = %d\n", temp);

It looks like cudaMalloc() is causing eviction of the result array from the LLC. Can someone help me understand the reason behind the two different LLC statistics caused by the cudaMalloc() call?

  • You seem to have posted the same code snippet twice.
  • I’m not intimately familiar with MSR monitoring of caches, but if the LLC misses fall to zero, does that mean it is hitting in the LLC?
  • cudaMalloc is an opaque library call whose detailed behavior is not documented anywhere that I know of. I would assume that any library call, in any library, could cause the LLC contents to be modified, depending on what it is doing. If for some reason cudaMalloc needs to load a lot of data, it might happen. I imagine there could be other reasons.
  • I’m not suggesting that I know that cudaMalloc does this, just that it seems “theoretically possible”.
  • General advice for use of cudaMalloc is to get it out of performance-sensitive areas of code. For example, if you have a work-processing loop in your code, it’s not advisable to perform a cudaMalloc at each iteration. Instead, seek to do your cudaMalloc operations prior to entering the loop, perhaps by allocating everything that is needed up front, and/or reusing allocations (see the sketch after this list).
  • If cudaMalloc is doing this, and you don’t like that behavior, you’re welcome to file a bug requesting a change in behavior. Be advised that you’re likely to be asked for a complete code that demonstrates the issue, displaying measurements, etc.
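
As a rough sketch of that pre-allocation pattern (the kernel, buffer names, and sizes below are placeholders, not anything from your code):

#include <cstdlib>
#include <cuda_runtime.h>

__global__ void process(float *buf, int n) {   // placeholder kernel
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buf[i] *= 2.0f;
}

int main() {
  const int n = 1 << 20;
  const int num_batches = 100;

  // allocate once, up front, outside the performance-sensitive loop
  float *d_buf;
  cudaMalloc(&d_buf, n * sizeof(float));
  float *h_buf = (float *)malloc(n * sizeof(float));

  for (int b = 0; b < num_batches; b++) {
    // ... fill h_buf with the next batch of work ...
    cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);
    process<<<(n + 255) / 256, 256>>>(d_buf, n);
    cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
    // no cudaMalloc/cudaFree inside the loop; d_buf is reused every iteration
  }

  cudaFree(d_buf);
  free(h_buf);
  return 0;
}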

Hi!
Thanks for the quick response. I’ve edited the second piece of code.

  • I am accessing the same result array twice on the CPU, one after the other.
  • cudaMalloc() allocates memory on the device, so it should not affect LLC allocation on the host (or so I thought).
  • The first access to the result array makes sure the array is resident in the LLC, and the second access checks whether cudaMalloc() affects that allocation. The expected behavior is that the second access should not miss in the LLC as long as the array does not exceed the LLC size. Yet even for a few hundred floats (which fit easily in the L1 and L2 caches), it misses in the LLC if I call cudaMalloc() as in the first piece of code. (A sketch of how I read the counters around the monitored code is at the end of this post.)

Since I could not find any documentation, I was curious whether someone has observed similar behavior or knows how cudaMalloc() works internally.
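
For completeness, here is roughly what the //Read MSRs markers stand for. My actual harness reads the raw MSRs directly; the sketch below uses Linux perf_event_open to count the same LLC reference/miss events instead, since the MSR programming is model-specific:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

// open a hardware cache counter for this process on any CPU
static int open_counter(unsigned long long config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;              // LLC references or LLC misses
    attr.disabled = 1;                 // start disabled; enable around the monitored code
    attr.exclude_kernel = 1;
    int fd = (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); exit(1); }
    return fd;
}

int main(void)
{
    const int size = 32768;
    float *result = (float *)malloc(sizeof(float) * size);
    for (int i = 0; i < size; i++) result[i] = i;

    int fd_ref  = open_counter(PERF_COUNT_HW_CACHE_REFERENCES);
    int fd_miss = open_counter(PERF_COUNT_HW_CACHE_MISSES);

    int temp = 0;
    ioctl(fd_ref,  PERF_EVENT_IOC_ENABLE, 0);
    ioctl(fd_miss, PERF_EVENT_IOC_ENABLE, 0);
    for (int j = 0; j < size; j++) temp += result[j];   // monitored code
    ioctl(fd_ref,  PERF_EVENT_IOC_DISABLE, 0);
    ioctl(fd_miss, PERF_EVENT_IOC_DISABLE, 0);

    long long refs = 0, misses = 0;
    read(fd_ref,  &refs,   sizeof(refs));
    read(fd_miss, &misses, sizeof(misses));
    printf("temp = %d, LLC refs = %lld, LLC misses = %lld\n", temp, refs, misses);
    return 0;
}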

My simple test case doesn’t seem to suggest a caching problem after cudaMalloc:

$ cat t1855.cu
#include <iostream>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){
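  // returns microseconds elapsed since 'start' (pass 0 to get an absolute timestamp)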

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

const int size = 32768;

int main(){
  float a = 0;
  float *result = (float *)malloc(sizeof(float)*size);
  for (int i = 0; i < size; i++) result[i] = 1;
#ifdef USE_CM
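  // optionally allocate a device buffer before the timed loop (enabled by compiling with -DUSE_CM)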
  float *gpubuf;
  cudaMalloc(&gpubuf, sizeof(float)*size);
#endif
  unsigned long long dt = dtime_usec(0);
  for (int i = 0; i < size; i++) a += result[i];
  dt = dtime_usec(dt);
  std::cout << "elapsed time: " << dt << "us" << std::endl;
  std::cout << "sum: " << a << std::endl;
}

$ nvcc -O3 -o t1855 t1855.cu
$ ./t1855
elapsed time: 83us
sum: 32768
$ ./t1855
elapsed time: 96us
sum: 32768
$ ./t1855
elapsed time: 76us
sum: 32768
$ ./t1855
elapsed time: 82us
sum: 32768
$ ./t1855
elapsed time: 82us
sum: 32768
$ nvcc -O3 -o t1855 t1855.cu -DUSE_CM
$ ./t1855
elapsed time: 39us
sum: 32768
$ ./t1855
elapsed time: 56us
sum: 32768
$ ./t1855
elapsed time: 48us
sum: 32768
$ ./t1855
elapsed time: 56us
sum: 32768
$ ./t1855
elapsed time: 74us
sum: 32768
$

CentOS 7, g++ 7.3.1, CUDA 11.2.67, Tesla V100