nvidia-smi and exclusive compute mode

Does anyone have an idea how to implement the following on the Jetson TX2:

export CUDA_VISIBLE_DEVICES="0"
nvidia-smi -i 2 -c EXCLUSIVE_PROCESS
nvidia-cuda-mps-control -d

source

I’m not positive, but I think nvidia-smi requires a PCI interface. The direct wiring from the memory controller to the GPU could rule this out.

@linuxdev, thank you for your response.

The issue I investigated was how to execute multiple processes concurrently on a single GPU.

For the CPU, I know we can use taskset -c [cpu number] program.sh.

In my opinion, there should be a similar way to execute work in parallel on the GPU. I thought the MPS example was a way to do that.

Hi Andrey1984, nvidia-smi and NVML don’t support Jetson’s iGPU, so those tools aren’t available for JetPack-L4T (including the MPS example).

Thank you for letting me know.
What is the right way to run a batch of 100 parallel executions of a CUDA binary on the Jetson GPU?
However, perhaps HPC-class GPU devices are better suited for that purpose.
It appears that without MPS, all the processes that are started will serialize.

While GPUs are highly data parallel, they are not typically highly code parallel.
As long as your different data sets all run the same code, you can pack them all into wider input/output arrays and run them in parallel that way.
If you need to run different shaders on the 100 different pieces of data, and each shader only uses a small amount of the GPU compute capabilities, then you’re maybe not a great match for what the GPU is designed to do.
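
To illustrate the "pack them into wider arrays" approach, here is a minimal CUDA sketch. The kernel name, the toy scaling operation, and the sizes are placeholders for illustration, not anything from this thread: 100 independent input vectors are laid out side by side and covered by a single kernel launch instead of 100 separate processes.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void scaleAll(const float *in, float *out, int totalElems)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < totalElems)
    out[i] = 2.0f * in[i];  // the same code runs over every data set
}

int main()
{
  const int numSets = 100;      // e.g. 100 independent inputs
  const int elemsPerSet = 1024; // placeholder size
  const int total = numSets * elemsPerSet;

  float *in, *out;
  cudaMallocManaged(&in, total * sizeof(float));
  cudaMallocManaged(&out, total * sizeof(float));
  for (int i = 0; i < total; ++i)
    in[i] = (float)i;

  // One launch covers all 100 data sets laid out side by side.
  int threads = 256;
  int blocks = (total + threads - 1) / threads;
  scaleAll<<<blocks, threads>>>(in, out, total);
  cudaDeviceSynchronize();

  printf("out[0] = %f, out[%d] = %f\n", out[0], total - 1, out[total - 1]);

  cudaFree(in);
  cudaFree(out);
  return 0;
}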

It appears that images are processed best on GPUs.

However, it seems to me that diverse methods for processing regular code as image-like data are also presented.

Moreover, several other sophisticated solutions are described in the article.

Running the following device-query program:

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
  /*
   * The device ID is required first to query the device.
   */

  int deviceId;
  cudaGetDevice(&deviceId);

  cudaDeviceProp props;
  cudaGetDeviceProperties(&props, deviceId);

  /*
   * `props` now contains several properties about the current device.
   */

  int computeCapabilityMajor = props.major;
  int computeCapabilityMinor = props.minor;
  int multiProcessorCount = props.multiProcessorCount;
  int warpSize = props.warpSize;

  printf("Device ID: %d\n", deviceId);
  printf("Number of SMs: %d\n", multiProcessorCount);
  printf("Compute Capability Major: %d\n", computeCapabilityMajor);
  printf("Compute Capability Minor: %d\n", computeCapabilityMinor);
  printf("Warp Size: %d\n", warpSize);
  return 0;
}
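
(This compiles and runs on the TX2 with something like nvcc query.cu -o query && ./query; the file name here is just an example.)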

It turned out that the TX2 has 2 SMs:

Device ID: 0
Number of SMs: 2
Compute Capability Major: 6
Compute Capability Minor: 2
Warp Size: 32

I am just wondering if I can somehow assign the execution of a particular file to SM1 or SM2, or assign execution of file1 to SM1 and file2 to SM2 concurrently.
Thanks.

Hi Andrey, you can’t explicitly assign it to a particular SM, but TX2 supports concurrent kernels, so you can launch multiple CUDA kernels within an application simultaneously and the GPU will handle the scheduling. The kernels will need to be launched on independent CUDA streams.
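
A minimal sketch of that pattern, assuming a toy kernel and four streams (the names, sizes, and the busy-loop workload are illustrative, not from this thread):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void busyKernel(float *data, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float v = data[i];
    for (int k = 0; k < 1000; ++k)
      v = v * 1.0001f + 0.0001f;
    data[i] = v;
  }
}

int main()
{
  const int numStreams = 4;
  const int n = 1 << 16;

  cudaStream_t streams[numStreams];
  float *buffers[numStreams];

  for (int s = 0; s < numStreams; ++s) {
    cudaStreamCreate(&streams[s]);
    cudaMalloc(&buffers[s], n * sizeof(float));
    cudaMemset(buffers[s], 0, n * sizeof(float));
  }

  // One launch per stream; independent streams are allowed to overlap,
  // and the hardware scheduler decides which SMs run which blocks.
  for (int s = 0; s < numStreams; ++s)
    busyKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(buffers[s], n);

  cudaDeviceSynchronize();

  for (int s = 0; s < numStreams; ++s) {
    cudaFree(buffers[s]);
    cudaStreamDestroy(streams[s]);
  }
  return 0;
}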

[Jetson TX2] cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

Any pointers on how to solve the following problem:
I am trying to run TensorFlow with GPU support on my Jetson TX2. The code is shown below:
python3

import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))
An attempt to execute this code generates the following error:
2018-11-16 12:13:43.983272: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:864] ARM64 does not support NUMA - returning NUMA node zero
2018-11-16 12:13:43.983397: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 5.32GiB
2018-11-16 12:13:43.983517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1471] Adding visible gpu devices: 0
2018-11-16 12:13:43.983919: E tensorflow/core/common_runtime/direct_session.cc:158] Internal: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1563, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 633, in __init__
self._session = tf_session.TF_NewSession(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

Compiler driver version:
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Sun_Nov_19_03:16:56_CST_2017
Cuda compilation tools, release 9.0, V9.0.252
nvidia@tegra-ubuntu:~/DeepLearning/rnn-based-af-detection$
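
For reference, the mismatch can be checked directly on the board with the standard cudaDriverGetVersion()/cudaRuntimeGetVersion() runtime calls. This small sketch (not from the original thread) prints both values; the error above appears when the driver value is lower than the runtime value:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
  int driverVersion = 0, runtimeVersion = 0;
  cudaDriverGetVersion(&driverVersion);
  cudaRuntimeGetVersion(&runtimeVersion);

  // Versions are encoded as 1000*major + 10*minor, e.g. 9000 for CUDA 9.0.
  printf("Driver supports up to CUDA %d.%d\n",
         driverVersion / 1000, (driverVersion % 100) / 10);
  printf("Runtime was built for CUDA %d.%d\n",
         runtimeVersion / 1000, (runtimeVersion % 100) / 10);
  return 0;
}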