Slow cudaMalloc (~1.5s) and slow mem access there, allocating nearly whole memory, with WDDM

Hi all,

we have found a very strange behavior of cudaMalloc on Windows with WDDM driver. There is a memory amount point, which crossed, slows down cudaMalloc very much - about 1.1s opposite to 8ms when not crossing the point. We haven’t seen any similar problem on any Linux system or with TCC driver on Windows.

What’s more painful for us is that working with the “badly” allocated memory from kernels is always much slower (less than 2/3 of usual behavior, see the end of the post).

Let’s see the profiler output for Titan:

  • Here we allocated 1/2 of the free memory and then 33550000B less than the remaining free mem:
  • External Media

  • But allocating 1/2 of the free memory and then 33600000B less than the remaining free mem:
  • External Media

    This is our simple test code to reproduce:

    #include "cuda_runtime.h"
    
    #include <stdio.h>
    #include <stdlib.h>
    
    int main(int argc, char * argv[]){
        if (argc != 3)
            fprintf(stderr, "usage: memory_test G N\n where G is a device number, N is a size in bytes to be reserved from allocation\n");
        // Choose which GPU to run on, change this on a multi-GPU system.
        cudaError_t cudaStatus = cudaSetDevice(atoi(argv[1]));
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?\n");
            return cudaStatus;
        }
    
        // Initialize context and filter out first cuda call latency (see https://devtalk.nvidia.com/default/topic/401294/cuda-programming-and-performance/help-regarding-slow-cudamalloc/post/2831547/#2831547)
        cudaStatus = cudaFree(0);
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "cudaFree(0) failed!\n");
            return cudaStatus;
        }
    
        void * ptr, * ptr2;
        size_t freeMem, totalMem;
        
        // Get memory info
        cudaStatus = cudaMemGetInfo(&freeMem, &totalMem);
        printf("free %lu, total %lu\n", freeMem, totalMem);
    
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "Cannot get info about the device!\n");
            return cudaStatus;
        }
    
        // Allocate a half of the memory - the WDDM won't allow us much more per 1 allocation
        cudaStatus = cudaMalloc(&ptr, freeMem/2);
        printf("malloc %lu\n", freeMem/2);
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "Cannot allocate!\n");
            return cudaStatus;
        }
    
        cudaStatus = cudaMemGetInfo(&freeMem, &totalMem);
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "Cannot get info about the device!\n");
            return cudaStatus;
        }
    
        // Allocate the other half with specified bytes reserved - let's see which number causes the slow allocation
        // on a 740M: 33603584 ok; 33603583 not ok 
        printf("malloc %lu\n", freeMem - atoi(argv[2]));
        cudaStatus = cudaMalloc(&ptr2, freeMem - atoi(argv[2]));
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "Cannot allocate!\n");
            return cudaStatus;
        }
    
        cudaFree(ptr);
        cudaFree(ptr2);
    
        // Enable profiling
        cudaStatus = cudaDeviceReset();
        if (cudaStatus != cudaSuccess) {
            fprintf(stderr, "cudaDeviceReset failed!\n");
            return cudaStatus;
        }
    
        return 0;
    }
    

    Here the app can be launched like this:

    memory_test.exe 0 33603584
    

    where 0 is the cuda device number and 33603584 is an amount of memory to be reserved from allocation.

    There is always an exact amount which allocated is ok, and allocating 1 more byte causes the described problem. E.g. on a laptop GF 740M the “ok” number was 33603584 less than the whole free mem, allocating 33603583 less than the whole free mem caused the problem. There were similar numbers for Titan, Quadro K5000 (with WDDM, no such problem with TCC), but for GF 660Ti the limit is more than (free mem - 250MB) (we couldn’t measure exactly due some allocation errors when allocating 300-500MB less than the free mem).

    In our application the more data we can get to the memory the more effective our kernels can run. So we’d like to allocate as much as we can. That’s why we need to know (at least approximately, but never to overdraw) that memory limit. Does anyone know if it was documented somewhere?

    Working with the “badly” allocated memory from a kernel is much slower. See these metrics from profiler (1st kernel worked on “well” allocated mem, the second one with the same data on “badly” allocated mem):

    Metric 	 	 				On good mem	On bad mem
    --------------------------------------------------------------------------
    Duration(ns)					72652507	131700681
    GridX						236		236
    GridY						1		1
    GridZ						1		1
    BlockX						256		256
    BlockY						1		1
    BlockZ						1		1
    Registers/Thread				59		59
    Static Shared Memory				15232		15232
    Dynamic Shared Memory				0		0
    Requested Global Store Throughput(bytes/sec)	1741384461	1164707884
    L1/Shared Memory Utilization			Low (2)		Low (1)
    L2 Cache Utilization				Low (2)		Low (1)
    System Memory Utilization			Low (1)		Low (1)
    Device Memory Utilization			Low (3)		Low (2)
    L2 Throughput (L1 Reads)(bytes/sec)		21857689560	14632399743
    Atomic Transactions Per Request			0		0
    Load/Store Function Unit Utilization		Low (2)		Low (2)
    L1 Global Hit Rate(%)				0		0
    L2 Hit Rate (Texture Reads)(%)			0		0
    Atomic Transactions				0		0
    Arithmetic Function Unit Utilization		Low (3)		Low (2)
    L1 Local Hit Rate(%)				32,434		32,565
    Control-Flow Function Unit Utilization		Low (1)		Low (1)
    Local Memory Overhead(%)			90,851		90,847
    Texture Function Unit Utilization		Idle		Idle
    L2 Throughput (Texture Reads)(bytes/sec)	0		0
    Global Memory Load Efficiency(%)		30,29		30,29
    Multiprocessor Activity(%)			98,573		99,028
    Shared Memory Efficiency(%)			33,269		32,92
    Requested Global Load Throughput(bytes/sec)	126093183	84336186
    L2 Hit Rate (L1 Reads)(%)			0,611		0,787
    Achieved Occupancy				0,34		0,351
    L2 Throughput (Atomic requests)(bytes/sec)	0		0
    Global Memory Store Efficiency(%)		50		50
    Global Store Throughput(bytes/sec)		3482768922	2329415769
    Device Memory Write Throughput(bytes/sec)	20400175692	13831453058
    Global Load Throughput(bytes/sec)		416287642	278429898
    Device Memory Read Throughput(bytes/sec)	22120492614	14620500074
    Global Store Transactions			9508392		9508392
    System Memory Read Transactions			0		0
    Shared Store Transactions			12430239	12588523
    Global Load Transactions			936553		936553
    Local Store Transactions			15024188	15263690
    Shared Load Transactions			42553335	42977970
    Global Store Transactions Per Request		30,671		30,671
    Local Load Transactions				25608537	23835218
    Shared Memory Store Transactions Per Request	1,129		1,143
    Global Load Transactions Per Request		2,576		2,576
    Local Memory Store Transactions Per Request	3,206		3,257
    Shared Memory Load Transactions Per Request	1,273		1,286
    Local Memory Load Transactions Per Request	2,618		2,437
    System Memory Write Throughput(bytes/sec)	732		1224
    System Memory Read Throughput(bytes/sec)	0		0
    L2 Throughput (Writes)(bytes/sec)		20760228532	13850871324
    L2 Throughput (Reads)(bytes/sec)		22626715722	14979626399
    Shared Memory Store Throughput(bytes/sec)	36423950617	24672019400
    Shared Memory Load Throughput(bytes/sec)	124692741036	84231749000
    Local Memory Store Throughput(bytes/sec)	12619884097	8434329568
    Local Memory Load Throughput(bytes/sec)		25198882143	16854024698
    L2 Write Transactions				56678004	56537573
    L2 Read Transactions				61773746	61145014
    Device Memory Write Transactions		55695015	56458310
    Device Memory Read Transactions			60391694	59679104
    Texture Cache Transactions			0		0
    System Memory Write Transactions		2		5
    

    Edit: we’ve tested with the latest stable driver (337.88) and some older ones. No differences between drivers.