Hi all,
we have found a very strange behavior of cudaMalloc on Windows with WDDM driver. There is a memory amount point, which crossed, slows down cudaMalloc very much - about 1.1s opposite to 8ms when not crossing the point. We haven’t seen any similar problem on any Linux system or with TCC driver on Windows.
What’s more painful for us is that working with the “badly” allocated memory from kernels is always much slower (less than 2/3 of usual behavior, see the end of the post).
Let’s see the profiler output for Titan:
This is our simple test code to reproduce:
#include "cuda_runtime.h"
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char * argv[]){
if (argc != 3)
fprintf(stderr, "usage: memory_test G N\n where G is a device number, N is a size in bytes to be reserved from allocation\n");
// Choose which GPU to run on, change this on a multi-GPU system.
cudaError_t cudaStatus = cudaSetDevice(atoi(argv[1]));
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?\n");
return cudaStatus;
}
// Initialize context and filter out first cuda call latency (see https://devtalk.nvidia.com/default/topic/401294/cuda-programming-and-performance/help-regarding-slow-cudamalloc/post/2831547/#2831547)
cudaStatus = cudaFree(0);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaFree(0) failed!\n");
return cudaStatus;
}
void * ptr, * ptr2;
size_t freeMem, totalMem;
// Get memory info
cudaStatus = cudaMemGetInfo(&freeMem, &totalMem);
printf("free %lu, total %lu\n", freeMem, totalMem);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "Cannot get info about the device!\n");
return cudaStatus;
}
// Allocate a half of the memory - the WDDM won't allow us much more per 1 allocation
cudaStatus = cudaMalloc(&ptr, freeMem/2);
printf("malloc %lu\n", freeMem/2);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "Cannot allocate!\n");
return cudaStatus;
}
cudaStatus = cudaMemGetInfo(&freeMem, &totalMem);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "Cannot get info about the device!\n");
return cudaStatus;
}
// Allocate the other half with specified bytes reserved - let's see which number causes the slow allocation
// on a 740M: 33603584 ok; 33603583 not ok
printf("malloc %lu\n", freeMem - atoi(argv[2]));
cudaStatus = cudaMalloc(&ptr2, freeMem - atoi(argv[2]));
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "Cannot allocate!\n");
return cudaStatus;
}
cudaFree(ptr);
cudaFree(ptr2);
// Enable profiling
cudaStatus = cudaDeviceReset();
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaDeviceReset failed!\n");
return cudaStatus;
}
return 0;
}
Here the app can be launched like this:
memory_test.exe 0 33603584
where 0 is the cuda device number and 33603584 is an amount of memory to be reserved from allocation.
There is always an exact amount which allocated is ok, and allocating 1 more byte causes the described problem. E.g. on a laptop GF 740M the “ok” number was 33603584 less than the whole free mem, allocating 33603583 less than the whole free mem caused the problem. There were similar numbers for Titan, Quadro K5000 (with WDDM, no such problem with TCC), but for GF 660Ti the limit is more than (free mem - 250MB) (we couldn’t measure exactly due some allocation errors when allocating 300-500MB less than the free mem).
In our application the more data we can get to the memory the more effective our kernels can run. So we’d like to allocate as much as we can. That’s why we need to know (at least approximately, but never to overdraw) that memory limit. Does anyone know if it was documented somewhere?
Working with the “badly” allocated memory from a kernel is much slower. See these metrics from profiler (1st kernel worked on “well” allocated mem, the second one with the same data on “badly” allocated mem):
Metric On good mem On bad mem
--------------------------------------------------------------------------
Duration(ns) 72652507 131700681
GridX 236 236
GridY 1 1
GridZ 1 1
BlockX 256 256
BlockY 1 1
BlockZ 1 1
Registers/Thread 59 59
Static Shared Memory 15232 15232
Dynamic Shared Memory 0 0
Requested Global Store Throughput(bytes/sec) 1741384461 1164707884
L1/Shared Memory Utilization Low (2) Low (1)
L2 Cache Utilization Low (2) Low (1)
System Memory Utilization Low (1) Low (1)
Device Memory Utilization Low (3) Low (2)
L2 Throughput (L1 Reads)(bytes/sec) 21857689560 14632399743
Atomic Transactions Per Request 0 0
Load/Store Function Unit Utilization Low (2) Low (2)
L1 Global Hit Rate(%) 0 0
L2 Hit Rate (Texture Reads)(%) 0 0
Atomic Transactions 0 0
Arithmetic Function Unit Utilization Low (3) Low (2)
L1 Local Hit Rate(%) 32,434 32,565
Control-Flow Function Unit Utilization Low (1) Low (1)
Local Memory Overhead(%) 90,851 90,847
Texture Function Unit Utilization Idle Idle
L2 Throughput (Texture Reads)(bytes/sec) 0 0
Global Memory Load Efficiency(%) 30,29 30,29
Multiprocessor Activity(%) 98,573 99,028
Shared Memory Efficiency(%) 33,269 32,92
Requested Global Load Throughput(bytes/sec) 126093183 84336186
L2 Hit Rate (L1 Reads)(%) 0,611 0,787
Achieved Occupancy 0,34 0,351
L2 Throughput (Atomic requests)(bytes/sec) 0 0
Global Memory Store Efficiency(%) 50 50
Global Store Throughput(bytes/sec) 3482768922 2329415769
Device Memory Write Throughput(bytes/sec) 20400175692 13831453058
Global Load Throughput(bytes/sec) 416287642 278429898
Device Memory Read Throughput(bytes/sec) 22120492614 14620500074
Global Store Transactions 9508392 9508392
System Memory Read Transactions 0 0
Shared Store Transactions 12430239 12588523
Global Load Transactions 936553 936553
Local Store Transactions 15024188 15263690
Shared Load Transactions 42553335 42977970
Global Store Transactions Per Request 30,671 30,671
Local Load Transactions 25608537 23835218
Shared Memory Store Transactions Per Request 1,129 1,143
Global Load Transactions Per Request 2,576 2,576
Local Memory Store Transactions Per Request 3,206 3,257
Shared Memory Load Transactions Per Request 1,273 1,286
Local Memory Load Transactions Per Request 2,618 2,437
System Memory Write Throughput(bytes/sec) 732 1224
System Memory Read Throughput(bytes/sec) 0 0
L2 Throughput (Writes)(bytes/sec) 20760228532 13850871324
L2 Throughput (Reads)(bytes/sec) 22626715722 14979626399
Shared Memory Store Throughput(bytes/sec) 36423950617 24672019400
Shared Memory Load Throughput(bytes/sec) 124692741036 84231749000
Local Memory Store Throughput(bytes/sec) 12619884097 8434329568
Local Memory Load Throughput(bytes/sec) 25198882143 16854024698
L2 Write Transactions 56678004 56537573
L2 Read Transactions 61773746 61145014
Device Memory Write Transactions 55695015 56458310
Device Memory Read Transactions 60391694 59679104
Texture Cache Transactions 0 0
System Memory Write Transactions 2 5
Edit: we’ve tested with the latest stable driver (337.88) and some older ones. No differences between drivers.