|| programming, basic question

A friend of mine has asked me
whether his code, which takes 300 minutes to execute as a single process on 1 core, 150 minutes on 2 cores, 75 minutes on 4 cores, and so forth,
can be executed in parallel on a GPU device, and what the time cost would be. Since CUDA cores are more like ALUs, I cannot give him an answer myself, and I hope you could explain whether the code can be accelerated with a GPU from that point of view.
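That pattern is ideal linear scaling: the total work is a fixed 300 core-minutes, so the wall-clock time is just 300 divided by the core count. A minimal sketch of the arithmetic (the 300-minute figure is from the question; real code rarely scales this perfectly once serial fractions and communication enter, per Amdahl's law):

```shell
# ideal linear scaling: fixed 300 core-minutes of work,
# wall-clock time = total work / number of cores
for cores in 1 2 4; do
  echo "$cores core(s): $((300 / cores)) minutes"
done
```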

I have found an instruction : https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
that somehow describes multiprocessing. But compare that with the CPU's structure: it has explicit cores that are displayed as separate CPUs, and we can assign tasks such as executables to them, for example with

./taskset -c 4 program.sh

The GPU is a single device, and as I understand it I cannot address a separate single core. Does the parallelism occur when the GPU is somehow sliced by kernels? Do I need to slice the GPU into 10 kernels if I want to run 10 ./program.sh processes simultaneously?
Should I split it into 100 kernels if I need 100 executables running simultaneously?

I think I have found some illustration here https://stackoverflow.com/questions/34709749/how-do-i-use-nvidia-multi-process-service-mps-to-run-multiple-non-mpi-cuda-app/34711344#34711344
and some explanation:
https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
https://docs.nvidia.com/deploy/mps/index.html

but when I execute

sudo ./start_as_root.bash

it returns: No devices found
However, according to nvidia-smi I have a device at bus id "00000000:01:00.0 On"
Should I somehow specify the bus id within the start_as_root.bash file?

cat start_as_root.bash 
#!/bin/bash
# the following must be performed with root privilege
export CUDA_VISIBLE_DEVICES="0"
nvidia-smi -i 2 -c EXCLUSIVE_PROCESS
nvidia-cuda-mps-control -d

If you read point 1 in the answer you linked:

https://stackoverflow.com/questions/34709749/how-do-i-use-nvidia-multi-process-service-mps-to-run-multiple-non-mpi-cuda-app/34711344#34711344

it says:

" The machine I am using for test is a CentOS 6.2 node using a K40c (cc3.5/Kepler) GPU, with CUDA 7.0. There are other GPUs in the node. In my case, the CUDA enumeration order places my K40c at device 0, but the nvidia-smi enumeration order happens to place it as id 2 in the order. All of these details matter in a system with multiple GPUs, impacting the scripts given below."

If you have only 1 GPU in your system (or really, any other configuration) you will need to modify the scripts. If you don’t understand what the scripts are doing, that would probably require a lengthy explanation. However, if you only have a single device, then simply changing

nvidia-smi -i 2 …

to

nvidia-smi -i 0 …

in all the scripts should fix that issue.
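For reference, with a single GPU the fixed script would look like the sketch below (assuming, as is normal with one GPU, that the device is id 0 in both the CUDA and the nvidia-smi enumeration orders):

```shell
#!/bin/bash
# must be run with root privilege
export CUDA_VISIBLE_DEVICES="0"       # CUDA enumeration order
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS  # nvidia-smi enumeration order
nvidia-cuda-mps-control -d            # start the MPS control daemon
```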

If that doesn’t fix it, please provide the full output from running nvidia-smi on your system.

Got it.

Another challenge is to get the script running on the embedded Jetson TX2 platform, which has a GPU
[update: the iGPU doesn’t support MPS];
however, it is said to have direct memory access to the GPU.
I am just trying to figure out methods and applications of writing parallel code using the GPU.
Thank you for pointing out the cause!

./stop_as_root.bash 
Set compute mode to DEFAULT for GPU 00000000:01:00.0.
All done.
exit
./mps_run 
kernel duration: 2.968831s
kernel duration: 2.978857s
sudo su
 ./start_as_root.bash 
Set compute mode to EXCLUSIVE_PROCESS for GPU 00000000:01:00.0.
All done.
exit
./mps_run 
kernel duration: 3.366548s
kernel duration: 3.371385s

The error vanished after I adjusted the GPU id to 0 [as per your explanation],
but I cannot get a feeling of any performance optimization from it;
I was looking for acceleration of parallel execution of files like it can be achieved with

./taskset -c 2,4 program.sh

or like
./taskset -c 2 program.sh &
./taskset -c 4 program.sh &
./taskset -c 0 program.sh
That, in my understanding, will run the code on different CPUs, though I am not sure whether it avoids running sequentially.
Ideally, I am looking for a parallelization method that lets me run a number of program executions in parallel in reduced time.
However, I am a novice in parallel applications.
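The backgrounding pattern above generalizes: start each job with `&` and use the shell built-in `wait` to block until all of them exit. A minimal sketch, with echo/sleep standing in for the hypothetical program.sh (on Linux you could prefix each job with `taskset -c <core>` to pin it to a core):

```shell
# launch several jobs concurrently; '&' backgrounds each one,
# 'wait' blocks until every background job has exited
for i in 1 2 3 4; do
  ( echo "job $i started"; sleep 1; echo "job $i done" ) &
done
wait
echo "all jobs finished"
```

Because the four jobs overlap, the whole loop finishes in about 1 second rather than 4.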

May I somehow apply a parallel method to the execution of concurrent programs on an embedded device?

Now I added a few dozen instances of

./t1034.sh &

and it returns:

./mps_run 
Fatal error: kernel fail (out of memory at t1034.cu:46)
*** FAILED - ABORTING
Fatal error: kernel fail (out of memory at t1034.cu:46)
*** FAILED - ABORTING
Fatal error: kernel fail (out of memory at t1034.cu:46)
*** FAILED - ABORTING
[... the same two lines repeated for each of the remaining failed processes ...]
kernel duration: 2.964777s
kernel duration: 2.983556s
kernel duration: 3.082982s
kernel duration: 3.091910s
kernel duration: 3.107616s
kernel duration: 3.130660s
kernel duration: 3.143603s
kernel duration: 3.159239s
kernel duration: 3.164827s
kernel duration: 3.176685s
kernel duration: 3.197945s
kernel duration: 3.193859s
kernel duration: 3.190122s
kernel duration: 3.194038s
kernel duration: 3.200547s
kernel duration: 3.182228s

It seems to run out of memory.
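That is expected: each client process needs its own device allocations (and, when not sharing through MPS, its own CUDA context as well), so GPU memory caps how many processes can run concurrently. A rough feasibility check; both numbers below are made-up placeholders, you would substitute your card's memory size and the per-process footprint that nvidia-smi reports:

```shell
# rough upper bound on concurrent GPU processes, by memory alone
GPU_MEM_MB=8192    # hypothetical: total GPU memory (8 GB card)
PER_PROC_MB=300    # hypothetical: context + buffers per process
echo "max concurrent processes: $((GPU_MEM_MB / PER_PROC_MB))"
```

Anything beyond that bound will fail its allocations exactly the way the log above shows.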

The question is whether I can run a pack of, say, 256 processes concurrently, and what method I can apply to get acceleration with it compared with single sequential processing.
Thanks

And what model of GPU device would be a good fit for that sort of parallel computation,
presuming the computations generate a 24/7 CPU load and require ten times more hours?

It appears that I have found a generic article on parallel processing of general-purpose code on GPUs.

It appears to me that the GPU has SMs.
I am querying how many SMs the GPU device has with:

nvcc sol.cu -arch=sm_61 -o get-device-properties  --run
#include <stdio.h>

int main()
{
  /*
   * Device ID is required first to query the device.
   */

  int deviceId;
  cudaGetDevice(&deviceId);

  cudaDeviceProp props;
  cudaGetDeviceProperties(&props, deviceId);

  /*
   * `props` now contains several properties about the current device.
   */

  int computeCapabilityMajor = props.major;
  int computeCapabilityMinor = props.minor;
  int multiProcessorCount = props.multiProcessorCount;
  int warpSize = props.warpSize;


  printf("Device ID: %d\nNumber of SMs: %d\nCompute Capability Major: %d\nCompute Capability Minor: %d\nWarp Size: %d\n", deviceId, multiProcessorCount, computeCapabilityMajor, computeCapabilityMinor, warpSize);
}

and it returns:

Device ID: 0
Number of SMs: 30
Compute Capability Major: 6
Compute Capability Minor: 1
Warp Size: 32

Does that mean that the device has 30 SMs, not 60? And each of them can execute 64 threads, as I understand.
Thanks

Each sm_61 SM has 128 cores and can execute 2048 threads. Even if you don’t take the tail effect into account, this means that you need 60K threads to fill your GPU. With the tail effect, it’s recommended to split the workload into 1M+ threads.

Overall, the GPU isn’t just a CPU with an exotic name. It has much more horsepower, but this comes at a price. First, SIMT: in short, each warp (a group of 32 threads) executes the same code, so if threads in this code diverge, then every thread in the warp steps through the code for each branch, just without writing results in the irrelevant branches.

Second, with 60K threads versus at most ~60 threads on a CPU, and pretty small caches, you have ~1000x less cache per thread. Essentially, it’s on the order of 100 bytes per thread and serves more as a scatter/gather buffer than a long-lived store. How much slower would your algorithm become with a 100-byte cache?
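Plugging the 30 SMs reported by the device query into these per-SM figures gives a feel for the scale; the 4 MB L2 size below is an illustrative assumption, not a queried value:

```shell
# back-of-the-envelope capacity for a 30-SM sm_61 GPU
SMS=30
CORES_PER_SM=128               # CUDA cores per SM on compute capability 6.1
THREADS_PER_SM=2048            # max resident threads per SM
L2_BYTES=$((4 * 1024 * 1024))  # assumed 4 MB L2 cache (illustrative)
THREADS=$((SMS * THREADS_PER_SM))
echo "CUDA cores:       $((SMS * CORES_PER_SM))"
echo "resident threads: $THREADS"
echo "L2 per thread:    $((L2_BYTES / THREADS)) bytes"
```

That works out to 3840 cores and 61440 resident threads, which is where the "you need ~60K threads to fill the GPU" figure comes from.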

Thank you for the explanation!
The code is so far hypothetical and the research rather theoretical, but I am looking for practical use and for getting more practice with the technology.

As per the Wikipedia:
"Most modern desktop and server CPUs have at least three independent caches: an instruction cache to speed up executable instruction fetch, a data cache to speed up data fetch and store, and a translation lookaside buffer (TLB) "

As the cache becomes a limit, reads fall through to main memory, as I understand after reading the article. And the latter is much slower than reading from cache.
References:
http://www.cs.cmu.edu/afs/cs/academic/class/15869-f11/www/lectures/08_mem_hierarchy.pdf
http://web.mit.edu/vex/www/Parallel.pdf
https://statistics.berkeley.edu/computing/parallel
https://statistics.berkeley.edu/computing/gpu
https://www.youtube.com/watch?v=98Xis1W1mMk

I am not sure where to ask, so I will ask in the current thread.
The new course https://courses.nvidia.com/courses/course-v1:DLI+C-AC-01+V1/
has an nvvp part that uses a noVNC xfce session.
The question is how to locate the compiled program [I compile in some in-browser notepads]; I need to find the output file within the noVNC session to open it in nvvp. But I cannot do that, because it appears complicated to locate the files required for the exercise unless they are in some default/intuitive location such as a /Desktop or /work folder, in my opinion.
Update: I have found Deep Learning Institute mail contact form and redirected the question to them.
Thanks