|| programming, basic question

A friend of mine has asked me
whether his code, which takes 300 minutes to execute as a single process on 1 core, 150 minutes on 2 cores, 75 minutes on 4 cores, and so forth,
can be executed in parallel on a GPU device, and what the time cost would be. Since CUDA cores are more like ALUs, I cannot give him an answer myself, and I hope you could explain whether the code can be accelerated with a GPU from that point of view.
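That pattern is ideal linear scaling: the total work is a fixed 300 core-minutes, so the wall-clock time is just 300 divided by the core count. A minimal sketch of the arithmetic (the 300-minute figure is from the question; real code rarely scales this perfectly once serial fractions and communication enter, per Amdahl's law):

```shell
# ideal linear scaling: fixed 300 core-minutes of work,
# wall-clock time = total work / number of cores
for cores in 1 2 4; do
  echo "$cores core(s): $((300 / cores)) minutes"
done
```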

I have found an instruction : https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
that somehow describes multiprocessing. But compare that with the CPU's structure: it has explicit cores that are displayed as separate CPUs, and we can assign tasks such as executables to them, for example with

./taskset -c 4 program.sh

The GPU is a single device, and as I understand it I cannot address a separate single core. Does the parallelism occur when the GPU is somehow sliced by kernels? Do I need to slice the GPU into 10 kernels if I want to run 10 ./program.sh processes simultaneously?
Should I split it into 100 kernels if I need 100 executables running simultaneously?

I think I have found some illustration here https://stackoverflow.com/questions/34709749/how-do-i-use-nvidia-multi-process-service-mps-to-run-multiple-non-mpi-cuda-app/34711344#34711344
and some explanation:
https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
https://docs.nvidia.com/deploy/mps/index.html

but when I execute

sudo ./start_as_root.bash

it returns: No devices found
However, according to nvidia-smi I have a device at bus id "00000000:01:00.0 On"
Should I somehow specify the bus id within the start_as_root.bash file?

cat start_as_root.bash 
#!/bin/bash
# the following must be performed with root privilege
export CUDA_VISIBLE_DEVICES="0"
nvidia-smi -i 2 -c EXCLUSIVE_PROCESS
nvidia-cuda-mps-control -d

If you read point 1 in the answer you linked:

https://stackoverflow.com/questions/34709749/how-do-i-use-nvidia-multi-process-service-mps-to-run-multiple-non-mpi-cuda-app/34711344#34711344

it says:

" The machine I am using for test is a CentOS 6.2 node using a K40c (cc3.5/Kepler) GPU, with CUDA 7.0. There are other GPUs in the node. In my case, the CUDA enumeration order places my K40c at device 0, but the nvidia-smi enumeration order happens to place it as id 2 in the order. All of these details matter in a system with multiple GPUs, impacting the scripts given below."

If you have only 1 GPU in your system (or really, any other configuration) you will need to modify the scripts. If you don’t understand what the scripts are doing, that would probably require a lengthy explanation. However, if you only have a single device, then simply changing

nvidia-smi -i 2 …

to

nvidia-smi -i 0 …

in all the scripts should fix that issue.
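For reference, with a single GPU the fixed script would look like the sketch below (assuming, as is normal with one GPU, that the device is id 0 in both the CUDA and the nvidia-smi enumeration orders):

```shell
#!/bin/bash
# must be run with root privilege
export CUDA_VISIBLE_DEVICES="0"       # CUDA enumeration order
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS  # nvidia-smi enumeration order
nvidia-cuda-mps-control -d            # start the MPS control daemon
```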

If that doesn’t fix it, please provide the full output from running nvidia-smi on your system.

Got it.

Another challenge is to get the script running on the embedded Jetson TX2 platform, which has a GPU
[update: the iGPU doesn’t support MPS];
however, it is said to have direct memory access to the GPU.
I am just trying to figure out methods and applications of writing parallel code using the GPU.
Thank you for pointing out the cause!

./stop_as_root.bash 
Set compute mode to DEFAULT for GPU 00000000:01:00.0.
All done.
exit
./mps_run 
kernel duration: 2.968831s
kernel duration: 2.978857s
sudo su
 ./start_as_root.bash 
Set compute mode to EXCLUSIVE_PROCESS for GPU 00000000:01:00.0.
All done.
exit
./mps_run 
kernel duration: 3.366548s
kernel duration: 3.371385s

The error vanished after I adjusted the GPU id to 0 [as per your explanation],
but I cannot get a feeling of any performance optimization from it;
I was looking for acceleration of parallel execution of files like it can be achieved with

./taskset -c 2,4 program.sh

or like
./taskset -c 2 program.sh &
./taskset -c 4 program.sh &
./taskset -c 0 program.sh
That, in my understanding, will run the code on different CPUs, though I am not sure whether it avoids running sequentially.
Ideally, I am looking for a parallelization method that lets me run a number of program executions in parallel in reduced time.
However, I am a novice in parallel applications.
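The backgrounding pattern above generalizes: start each job with `&` and use the shell built-in `wait` to block until all of them exit. A minimal sketch, with echo/sleep standing in for the hypothetical program.sh (on Linux you could prefix each job with `taskset -c <core>` to pin it to a core):

```shell
# launch several jobs concurrently; '&' backgrounds each one,
# 'wait' blocks until every background job has exited
for i in 1 2 3 4; do
  ( echo "job $i started"; sleep 1; echo "job $i done" ) &
done
wait
echo "all jobs finished"
```

Because the four jobs overlap, the whole loop finishes in about 1 second rather than 4.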

May I somehow apply a parallel method to the execution of concurrent programs on an embedded device?

Now I added a few dozen instances of

./t1034.sh &

and it returns:

./mps_run 
Fatal error: kernel fail (out of memory at t1034.cu:46)
*** FAILED - ABORTING
Fatal error: kernel fail (out of memory at t1034.cu:46)
*** FAILED - ABORTING
Fatal error: kernel fail (out of memory at t1034.cu:46)
*** FAILED - ABORTING
[... the same two lines repeated for each of the remaining failed processes ...]
kernel duration: 2.964777s
kernel duration: 2.983556s
kernel duration: 3.082982s
kernel duration: 3.091910s
kernel duration: 3.107616s
kernel duration: 3.130660s
kernel duration: 3.143603s
kernel duration: 3.159239s
kernel duration: 3.164827s
kernel duration: 3.176685s
kernel duration: 3.197945s
kernel duration: 3.193859s
kernel duration: 3.190122s
kernel duration: 3.194038s
kernel duration: 3.200547s
kernel duration: 3.182228s

It seems to run out of memory.
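That is expected: each client process needs its own device allocations (and, when not sharing through MPS, its own CUDA context as well), so GPU memory caps how many processes can run concurrently. A rough feasibility check; both numbers below are made-up placeholders, you would substitute your card's memory size and the per-process footprint that nvidia-smi reports:

```shell
# rough upper bound on concurrent GPU processes, by memory alone
GPU_MEM_MB=8192    # hypothetical: total GPU memory (8 GB card)
PER_PROC_MB=300    # hypothetical: context + buffers per process
echo "max concurrent processes: $((GPU_MEM_MB / PER_PROC_MB))"
```

Anything beyond that bound will fail its allocations exactly the way the log above shows.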

The question is whether I can run a pack of, say, 256 processes concurrently, and what method I can apply to get acceleration with it compared with single sequential processing.
Thanks

And what model of GPU device would be a good fit for that sort of parallel computation,
presuming the computations generate a 24/7 CPU load and require ten times more hours?

It appears that I have found a generic article on parallel processing of general-purpose code on GPUs.

It appears to me that the GPU has SMs.
I am querying how many SMs the GPU device has with:

nvcc sol.cu -arch=sm_61 -o get-device-properties  --run
#include <stdio.h>

int main()
{
  /*
   * Device ID is required first to query the device.
   */

  int deviceId;
  cudaGetDevice(&deviceId);

  cudaDeviceProp props;
  cudaGetDeviceProperties(&props, deviceId);

  /*
   * `props` now contains several properties about the current device.
   */

  int computeCapabilityMajor = props.major;
  int computeCapabilityMinor = props.minor;
  int multiProcessorCount = props.multiProcessorCount;
  int warpSize = props.warpSize;


  printf("Device ID: %d\nNumber of SMs: %d\nCompute Capability Major: %d\nCompute Capability Minor: %d\nWarp Size: %d\n", deviceId, multiProcessorCount, computeCapabilityMajor, computeCapabilityMinor, warpSize);
}

and it returns:

Device ID: 0
Number of SMs: 30
Compute Capability Major: 6
Compute Capability Minor: 1
Warp Size: 32

Does that mean that the device has 30 SMs, not 60? And each of them can execute 64 threads, as I understand.
Thanks

Each sm_61 SM has 128 cores and can execute 2048 threads. Even if you don’t take the tail effect into account, this means that you need 60K threads to fill your GPU. With the tail effect, it’s recommended to split the workload into 1M+ threads.

Overall, the GPU isn’t just a CPU with an exotic name. It has much more horsepower, but this comes at a price. First, SIMT: in short, each warp (a group of 32 threads) executes the same code, so if threads in this code diverge, then every thread in the warp steps through the code for each branch, just without writing results in the irrelevant branches.

Second, with 60K threads versus at most ~60 threads on a CPU, and pretty small caches, you have ~1000x less cache per thread. Essentially, it’s on the order of 100 bytes per thread and serves more as a scatter/gather buffer than a long-lived store. How much slower would your algorithm become with a 100-byte cache?
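Plugging the 30 SMs reported by the device query into these per-SM figures gives a feel for the scale; the 4 MB L2 size below is an illustrative assumption, not a queried value:

```shell
# back-of-the-envelope capacity for a 30-SM sm_61 GPU
SMS=30
CORES_PER_SM=128               # CUDA cores per SM on compute capability 6.1
THREADS_PER_SM=2048            # max resident threads per SM
L2_BYTES=$((4 * 1024 * 1024))  # assumed 4 MB L2 cache (illustrative)
THREADS=$((SMS * THREADS_PER_SM))
echo "CUDA cores:       $((SMS * CORES_PER_SM))"
echo "resident threads: $THREADS"
echo "L2 per thread:    $((L2_BYTES / THREADS)) bytes"
```

That works out to 3840 cores and 61440 resident threads, which is where the "you need ~60K threads to fill the GPU" figure comes from.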

Thank you for the explanation!
The code is so far hypothetical and the research rather theoretical, but I am looking for practical use and for getting more practice with the technology.

As per the Wikipedia:
"Most modern desktop and server CPUs have at least three independent caches: an instruction cache to speed up executable instruction fetch, a data cache to speed up data fetch and store, and a translation lookaside buffer (TLB) "

As the cache becomes a limit, reads fall through to main memory, as I understand after reading the article. And the latter is much slower than reading from cache.
References:
http://www.cs.cmu.edu/afs/cs/academic/class/15869-f11/www/lectures/08_mem_hierarchy.pdf
http://web.mit.edu/vex/www/Parallel.pdf
https://statistics.berkeley.edu/computing/parallel
https://statistics.berkeley.edu/computing/gpu
https://www.youtube.com/watch?v=98Xis1W1mMk

I am not sure where to ask, so I will ask in the current thread.
The new course https://courses.nvidia.com/courses/course-v1:DLI+C-AC-01+V1/
has an nvvp part that uses a noVNC xfce session.
The question is how to locate the compiled program [I compile in some in-browser notepads]; I need to find the output file within the noVNC session to open it in nvvp. But I cannot do that, because it appears complicated to locate the files required for the exercise unless they are in some default/intuitive location such as a /Desktop or /work folder, in my opinion.
Update: I have found Deep Learning Institute mail contact form and redirected the question to them.
Thanks