GTX 1050 Ti does not have Compute Preemption?

I’ve just been testing out our new GTX 1050 Ti, and I’ve found that, according to cudaDeviceGetAttribute(), this device does not support Compute Preemption. I was under the impression that all Pascal GPUs had this feature. Is it not enabled on the 1050 Ti, or is this a bug in the CUDA runtime?

Is this on Windows or Linux?

Windows 10 64-bit.

Hmm, too bad that deviceQuery does not show this property. If it did, it would be easy to check support from the various deviceQuery results posted online, like the following for a GTX 1080:

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: “GeForce GTX 1080”
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 8112 MBytes (8506179584 bytes)
(20) Multiprocessors, (128) CUDA Cores/MP: 2560 CUDA Cores
GPU Max Clock rate: 1835 MHz (1.84 GHz)
Memory Clock rate: 5005 MHz
Memory Bus Width: 256-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

I have a Pascal Titan X running on Ubuntu 14.04 Linux with CUDA 8, and it supports this property:

$ cat t36.cpp
// Query whether a device supports compute preemption via
// cudaDevAttrComputePreemptionSupported (new in CUDA 8).
#include <stdio.h>
#include <cuda_runtime_api.h>
#include <assert.h>
#include <stdlib.h>

int main(int argc, char *argv[]){

  int support, device = 0;
  if (argc > 1) device = atoi(argv[1]);  // optional: device index as first argument
  cudaDeviceProp my_prop;
  cudaError_t err = cudaDeviceGetAttribute(&support, cudaDevAttrComputePreemptionSupported, device);
  assert(err == cudaSuccess);
  err = cudaGetDeviceProperties(&my_prop, device);  // fetch the device name for the report
  assert(err == cudaSuccess);
  if (support) printf("%s device %d supports compute preemption\n", my_prop.name, device);
  else printf("%s device %d does not support compute preemption\n", my_prop.name, device);
  return 0;
}
$ g++ -I/usr/local/cuda/include  t36.cpp -o t36 -L/usr/local/cuda/lib64 -lcudart
$ ./t36
TITAN X (Pascal) device 0 supports compute preemption
$

That’s great information, thanks! Are you running your Titan X in TCC mode, or under the regular display driver?

I thought the WDDM/TCC mode distinction was a Windows thing only?

Ah, you might be right. I’ve only toyed around with CUDA on Linux (most of the time I’ve worked on Windows), so I’m less familiar with the details.
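For what it’s worth, you can check which driver model a device is using from the tccDriver field of cudaDeviceProp, in the same style as the t36.cpp example above. This is just a sketch: on Linux (where the WDDM/TCC distinction doesn’t exist) and on any Windows device running the regular WDDM display driver, the field should simply report 0.

```cuda
// Sketch: report the driver model (TCC vs. WDDM) for each CUDA device
// via cudaDeviceProp::tccDriver. On Linux this field is always 0.
#include <stdio.h>
#include <cuda_runtime_api.h>

int main(void){
  int count = 0;
  if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0){
    printf("no CUDA devices found\n");
    return 1;
  }
  for (int d = 0; d < count; d++){
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, d) != cudaSuccess) continue;
    printf("device %d (%s): %s\n", d, prop.name,
           prop.tccDriver ? "TCC driver" : "WDDM driver (or non-Windows)");
  }
  return 0;
}
```

Build the same way as t36.cpp: g++ -I/usr/local/cuda/include file.cpp -L/usr/local/cuda/lib64 -lcudart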

Aha, just found this in the Pascal Tuning Guide (which seems to have appeared in the past few weeks):

“Compute Preemption is a new feature specific to GP100.”
https://docs.nvidia.com/cuda/pascal-tuning-guide/index.html#preemption

That explains why it isn’t available on the 1050 Ti.

The GTX 1080 whitepaper says that Pascal has preemption of both the graphics and compute pipelines at the instruction level. However, the CUDA 8 toolchain has not exposed this yet as far as I know, even for GP100. I had not noticed that line in the tuning guide, which contradicts the Pascal whitepaper.

Additionally, Pascal’s driver-level automatic support for silent preemption of compute tasks (which would eliminate the longstanding kernel time-limit watchdog killer) has not been implemented (yet?) either.
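In the meantime, you can at least query whether the watchdog applies to a given device via cudaDevAttrKernelExecTimeout, again in the style of t36.cpp. A sketch (a nonzero result means kernels on that device are subject to the run-time limit, as in the deviceQuery output above):

```cuda
// Sketch: check whether the kernel run-time watchdog is active on a device,
// via cudaDevAttrKernelExecTimeout (1 = timeout enabled, 0 = no limit).
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime_api.h>

int main(int argc, char *argv[]){
  int device = 0, timeout = 0;
  if (argc > 1) device = atoi(argv[1]);  // optional: device index as first argument
  cudaError_t err = cudaDeviceGetAttribute(&timeout, cudaDevAttrKernelExecTimeout, device);
  if (err != cudaSuccess){
    printf("query failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  printf("device %d: watchdog %s\n", device,
         timeout ? "enabled (kernels have a run time limit)" : "disabled");
  return 0;
}
```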