Slow cudaMalloc (~1.5s) and slow mem access there, allocating nearly whole memory, with WDDM

jan_brothanek · June 18, 2014, 12:52pm

Hi all,

we have found a very strange behavior of cudaMalloc on Windows with WDDM driver. There is a memory amount point, which crossed, slows down cudaMalloc very much - about 1.1s opposite to 8ms when not crossing the point. We haven’t seen any similar problem on any Linux system or with TCC driver on Windows.

What’s more painful for us is that working with the “badly” allocated memory from kernels is always much slower (less than 2/3 of usual behavior, see the end of the post).

Let’s see the profiler output for Titan:

Here we allocated 1/2 of the free memory and then 33550000B less than the remaining free mem:

External Media

But allocating 1/2 of the free memory and then 33600000B less than the remaining free mem:

External Media

This is our simple test code to reproduce:

#include "cuda_runtime.h"

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char * argv[]){
    if (argc != 3)
        fprintf(stderr, "usage: memory_test G N\n where G is a device number, N is a size in bytes to be reserved from allocation\n");
    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaError_t cudaStatus = cudaSetDevice(atoi(argv[1]));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?\n");
        return cudaStatus;
    }

    // Initialize context and filter out first cuda call latency (see https://devtalk.nvidia.com/default/topic/401294/cuda-programming-and-performance/help-regarding-slow-cudamalloc/post/2831547/#2831547)
    cudaStatus = cudaFree(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaFree(0) failed!\n");
        return cudaStatus;
    }

    void * ptr, * ptr2;
    size_t freeMem, totalMem;
    
    // Get memory info
    cudaStatus = cudaMemGetInfo(&freeMem, &totalMem);
    printf("free %lu, total %lu\n", freeMem, totalMem);

    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "Cannot get info about the device!\n");
        return cudaStatus;
    }

    // Allocate a half of the memory - the WDDM won't allow us much more per 1 allocation
    cudaStatus = cudaMalloc(&ptr, freeMem/2);
    printf("malloc %lu\n", freeMem/2);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "Cannot allocate!\n");
        return cudaStatus;
    }

    cudaStatus = cudaMemGetInfo(&freeMem, &totalMem);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "Cannot get info about the device!\n");
        return cudaStatus;
    }

    // Allocate the other half with specified bytes reserved - let's see which number causes the slow allocation
    // on a 740M: 33603584 ok; 33603583 not ok 
    printf("malloc %lu\n", freeMem - atoi(argv[2]));
    cudaStatus = cudaMalloc(&ptr2, freeMem - atoi(argv[2]));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "Cannot allocate!\n");
        return cudaStatus;
    }

    cudaFree(ptr);
    cudaFree(ptr2);

    // Enable profiling
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!\n");
        return cudaStatus;
    }

    return 0;
}

Here the app can be launched like this:

memory_test.exe 0 33603584

where 0 is the cuda device number and 33603584 is an amount of memory to be reserved from allocation.

There is always an exact amount which allocated is ok, and allocating 1 more byte causes the described problem. E.g. on a laptop GF 740M the “ok” number was 33603584 less than the whole free mem, allocating 33603583 less than the whole free mem caused the problem. There were similar numbers for Titan, Quadro K5000 (with WDDM, no such problem with TCC), but for GF 660Ti the limit is more than (free mem - 250MB) (we couldn’t measure exactly due some allocation errors when allocating 300-500MB less than the free mem).

In our application the more data we can get to the memory the more effective our kernels can run. So we’d like to allocate as much as we can. That’s why we need to know (at least approximately, but never to overdraw) that memory limit. Does anyone know if it was documented somewhere?

Working with the “badly” allocated memory from a kernel is much slower. See these metrics from profiler (1st kernel worked on “well” allocated mem, the second one with the same data on “badly” allocated mem):

Metric 	 	 				On good mem	On bad mem
--------------------------------------------------------------------------
Duration(ns)					72652507	131700681
GridX						236		236
GridY						1		1
GridZ						1		1
BlockX						256		256
BlockY						1		1
BlockZ						1		1
Registers/Thread				59		59
Static Shared Memory				15232		15232
Dynamic Shared Memory				0		0
Requested Global Store Throughput(bytes/sec)	1741384461	1164707884
L1/Shared Memory Utilization			Low (2)		Low (1)
L2 Cache Utilization				Low (2)		Low (1)
System Memory Utilization			Low (1)		Low (1)
Device Memory Utilization			Low (3)		Low (2)
L2 Throughput (L1 Reads)(bytes/sec)		21857689560	14632399743
Atomic Transactions Per Request			0		0
Load/Store Function Unit Utilization		Low (2)		Low (2)
L1 Global Hit Rate(%)				0		0
L2 Hit Rate (Texture Reads)(%)			0		0
Atomic Transactions				0		0
Arithmetic Function Unit Utilization		Low (3)		Low (2)
L1 Local Hit Rate(%)				32,434		32,565
Control-Flow Function Unit Utilization		Low (1)		Low (1)
Local Memory Overhead(%)			90,851		90,847
Texture Function Unit Utilization		Idle		Idle
L2 Throughput (Texture Reads)(bytes/sec)	0		0
Global Memory Load Efficiency(%)		30,29		30,29
Multiprocessor Activity(%)			98,573		99,028
Shared Memory Efficiency(%)			33,269		32,92
Requested Global Load Throughput(bytes/sec)	126093183	84336186
L2 Hit Rate (L1 Reads)(%)			0,611		0,787
Achieved Occupancy				0,34		0,351
L2 Throughput (Atomic requests)(bytes/sec)	0		0
Global Memory Store Efficiency(%)		50		50
Global Store Throughput(bytes/sec)		3482768922	2329415769
Device Memory Write Throughput(bytes/sec)	20400175692	13831453058
Global Load Throughput(bytes/sec)		416287642	278429898
Device Memory Read Throughput(bytes/sec)	22120492614	14620500074
Global Store Transactions			9508392		9508392
System Memory Read Transactions			0		0
Shared Store Transactions			12430239	12588523
Global Load Transactions			936553		936553
Local Store Transactions			15024188	15263690
Shared Load Transactions			42553335	42977970
Global Store Transactions Per Request		30,671		30,671
Local Load Transactions				25608537	23835218
Shared Memory Store Transactions Per Request	1,129		1,143
Global Load Transactions Per Request		2,576		2,576
Local Memory Store Transactions Per Request	3,206		3,257
Shared Memory Load Transactions Per Request	1,273		1,286
Local Memory Load Transactions Per Request	2,618		2,437
System Memory Write Throughput(bytes/sec)	732		1224
System Memory Read Throughput(bytes/sec)	0		0
L2 Throughput (Writes)(bytes/sec)		20760228532	13850871324
L2 Throughput (Reads)(bytes/sec)		22626715722	14979626399
Shared Memory Store Throughput(bytes/sec)	36423950617	24672019400
Shared Memory Load Throughput(bytes/sec)	124692741036	84231749000
Local Memory Store Throughput(bytes/sec)	12619884097	8434329568
Local Memory Load Throughput(bytes/sec)		25198882143	16854024698
L2 Write Transactions				56678004	56537573
L2 Read Transactions				61773746	61145014
Device Memory Write Transactions		55695015	56458310
Device Memory Read Transactions			60391694	59679104
Texture Cache Transactions			0		0
System Memory Write Transactions		2		5

Edit: we’ve tested with the latest stable driver (337.88) and some older ones. No differences between drivers.

Topic		Replies	Views
How much GPU memory can cudaMalloc get? CUDA Programming and Performance	17	15408	April 2, 2022
cuMemAlloc limited to 1/4 total GPU memory? CUDA Programming and Performance	10	12930	April 1, 2010
Does cudaMalloc increases the private bytes used on host? CUDA Programming and Performance	9	1547	July 24, 2023
cudaMalloc fails after CUDA Programming and Performance	6	7771	June 18, 2012
CUDA Memory allocation issue on GTX 850M on Windows 7 (64) CUDA Programming and Performance	5	1493	October 29, 2015
Cudamalloc attempting to allocate more memory than it is supposed to CUDA Programming and Performance cuda	15	841	January 13, 2025
[Multiple GPUs / Processes] CUDA Memory De/Allocation Slow CUDA Programming and Performance	25	10001	December 4, 2017
cudamalloc not allocating memeory CUDA Programming and Performance	0	1308	May 1, 2012
cudaMalloc with sysmem fallback CUDA NVCC Compiler	1	91	October 15, 2025
Does CUDA automatically allocate more GPU memory during the initialization of the application? CUDA Programming and Performance cuda	2	293	November 26, 2024

Slow cudaMalloc (~1.5s) and slow mem access there, allocating nearly whole memory, with WDDM

Related topics