How many kernels you can launch in WSL with respect to Windows

I first approached CUDA from the Python side because of Sage. I am doing integrable models:
we solve systems of non-linear PDEs using different methods, algebraic, geometric, topological.
I first tried the Mathematica CUDA package, but it was written more than 10 years ago
and hasn't evolved much since then. BIG mistake of WRI.

So, I played with the Python stack in order "to couple" with Sage and Cython.
Numba has limited CUDA functionality. CuPy is more array-oriented (essentially NumPy on the GPU). They are nice
within their field of application = they make life easier.

PyCUDA and PyOpenCL were a different story for me: having a Ryzen 7, it is attractive
to write one piece of code and run it simultaneously on the built-in AMD graphics and on the NVIDIA 1660 Ti
(boosted to 1850 MHz).

That said, here is my CUDA code, under Windows and WSL. The code is the same,
but Windows can go a little bit further.

The idea is easy:
1- Create a big zero array
2- send it to the GPU
3- the CUDA kernel goes over it all at once and changes the zeros to ones
4- sum it on the GPU and on the CPU
5- time everything: kernel launch - copy - sum

I thought creating a zero matrix would use zero RAM. But it seems that CUDA
doesn't like dynamical allocation. That is fair enough: a matrix is a matrix,
whether zero or not.

(BUT AT LEAST YOU SHOULD ALLOW SYMMETRIC or SKEW-SYMMETRIC)

PyCUDA creates a big cache. If I am creating a 5(+) GB array, from the task manager I see
PyCUDA and PyOpenCL using more than 15 GB of RAM, and sometimes they even fail.
I have 16 GB of RAM and, having a fast SSD, I increased my pagefile to 8 GB, then 16 GB.
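Part of that blow-up is on the host side: numpy.zeros() defaults to float64, so the .astype(numpy.int16) call in the code below makes a second copy before anything even reaches the GPU. A minimal workaround sketch (my own aside, not part of the main code): create the host array as int16 from the start, or skip the host array and zero-fill directly on the device with gpuarray.zeros.

import numpy
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

nn = 188

# one int16 host copy instead of a float64 array plus an int16 copy
x_gpu = gpuarray.to_gpu(numpy.zeros(nn * nn * nn * nn, dtype=numpy.int16))

# or allocate and zero-fill on the device, with no big host array at all
x_gpu = gpuarray.zeros(nn * nn * nn * nn, dtype=numpy.int16)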

Here is the first code.

I shall explain the code in steps so beginners like me can follow (I am teaching = my job).

  1. First I initialize

import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
from time import time as _time
import numpy

Then I shall use a tensor of rank 4 (because most of the time I work in 4d or above):

nn=188
x_gpu = gpuarray.to_gpu(numpy.zeros(nn * nn * nn * nn).astype(numpy.int16))

Clearly the dimension is nn^4, and I worked with int16 in order to push the dimension:
essentially, a double is 64 bits = 8 bytes, a float32 is 32 bits = 4 bytes, and an int16 is 16 bits = 2 bytes.

So use your calculator to work out how much RAM you reserve:
8 nn^4 bytes for double, 4 nn^4 bytes for float32, and 2 nn^4 bytes for anything 16-bit.
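For example, with nn = 188 (a quick back-of-the-envelope check in Python, numbers rounded):

nn = 188
for name, nbytes in [("double", 8), ("float32", 4), ("int16", 2)]:
    print(name, round(nbytes * nn**4 / 1e9, 1), "GB")
# double 10.0 GB, float32 5.0 GB, int16 2.5 GB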

Theoretically, how many can CUDA launch?
Compile deviceQuery from the CUDA samples and run:
./deviceQuery

Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)

The important one is the restriction of 1024 max threads per block.

Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
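If you'd rather not compile anything, the same limits can also be read from PyCUDA itself; a small sketch (attribute names as exposed by pycuda.driver.device_attribute):

import pycuda.driver as cuda
import pycuda.autoinit

dev = cuda.Device(0)
attr = cuda.device_attribute
print("threads / SM    :", dev.get_attribute(attr.MAX_THREADS_PER_MULTIPROCESSOR))
print("threads / block :", dev.get_attribute(attr.MAX_THREADS_PER_BLOCK))
print("block dim x,y,z :", dev.get_attribute(attr.MAX_BLOCK_DIM_X),
      dev.get_attribute(attr.MAX_BLOCK_DIM_Y), dev.get_attribute(attr.MAX_BLOCK_DIM_Z))
print("grid dim  x,y,z :", dev.get_attribute(attr.MAX_GRID_DIM_X),
      dev.get_attribute(attr.MAX_GRID_DIM_Y), dev.get_attribute(attr.MAX_GRID_DIM_Z))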

Theoretically: 1024 * 2147483647 * 65535 * 65535 ≈ 2^10 * (2^31 - 1) * (2^8)^2 * (2^8)^2 ≈ 2^73 threads in a single launch.
Do the math: (x^n)^m = (x^m)^n = x^(n*m), while x^n * x^m = x^(n+m).
CUDA can launch a lot, "theoretically."
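To put a number on that product (the same bound evaluated in Python):

max_threads = 1024 * (2**31 - 1) * 65535 * 65535
print(f"{max_threads:e}")   # ~9.4e+21, i.e. roughly 2^73 threads in one launch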

Now the main kernel

mod = SourceModule("""
__global__ void zt4(int16_t *a)
{
  //int id = blockIdx.x * blockDim.x + threadIdx.x;      // (1B-1G)

  int blockId = blockIdx.x + blockIdx.y * gridDim.x
                + gridDim.x * gridDim.y * blockIdx.z;   // (1B-3G)
  int id = blockId * blockDim.x + threadIdx.x;
  //int t = threadIdx.x +1;
  //int x = blockIdx.x  +1;
  //int y = blockIdx.y  +1;
  //int z = blockIdx.z  +1;
  a[id] = 1;
}
""")

Here is what the kernel does:
1- You know // is for comments.
2- The first commented line is for when I need 2d problems (for easy testing).
3- A 4d system: 1d blocks of threads and a 3d grid; with block=(nn,1,1) and grid=(nn,nn,nn) below, this gives id = threadIdx.x + nn*blockIdx.x + nn^2*blockIdx.y + nn^3*blockIdx.z:

  int blockId = blockIdx.x + blockIdx.y * gridDim.x
                + gridDim.x * gridDim.y * blockIdx.z;   // (1B-3G)
  int id = blockId * blockDim.x + threadIdx.x;

4- If needed, the explicit 4d coordinates:

  //int t = threadIdx.x +1;
  //int x = blockIdx.x  +1;
  //int y = blockIdx.y  +1;
  //int z = blockIdx.z  +1;

5- Here, essentially, is the active part of the glorious kernel:

a[id] = 1;

After the kernel, we just call it and time the call:

tt1 = _time()
func = mod.get_function("zt4")
func(x_gpu, block=(nn, 1, 1), grid=(nn, nn, nn), shared=0)
tt2 = _time()
print("for nn= ", nn, "func gpu time is =", tt2 - tt1)

Nothing fancy: we register the time, get the function, call the kernel, and register the time again,
then print our nn and the kernel-launch time. You can modify it and print how many kernels
you actually launched.
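Note that the kernel call returns immediately (launches are asynchronous), so tt2 - tt1 measures the launch overhead, not the execution time. A minimal sketch of the remaining steps from the idea list above (my own completion of steps 4-5, reusing x_gpu, nn and the modules already imported): synchronize first, then time the GPU sum, the copy back, and the CPU sum.

cuda.Context.synchronize()               # wait until the kernel has really finished

tt1 = _time()
s_gpu = gpuarray.sum(x_gpu, dtype=numpy.float64).get()   # sum on the GPU (float64 so int16 cannot overflow)
tt2 = _time()
print("gpu sum =", s_gpu, " gpu sum time =", tt2 - tt1)

tt1 = _time()
a = x_gpu.get()                          # copy the whole array back to the host
tt2 = _time()
print("copy time =", tt2 - tt1)

tt1 = _time()
s_cpu = a.sum(dtype=numpy.int64)         # CPU sum, int64 accumulator to be safe
tt2 = _time()
print("cpu sum =", s_cpu, " cpu sum time =", tt2 - tt1)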

We shall continue later. Don't forget to try it under Windows and WSL.
Now we can collect it all in one .py file and run it under Windows and WSL.
Feel free to change nn=188 to see how far you can go and when it breaks down.
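One hint about where it can break, independently of RAM (my own observation): the flat index id in the kernel is a 32-bit int, so it cannot address more than 2^31 - 1 elements; nn = 215 still fits, nn = 216 no longer does.

print(215**4, "<", 2**31 - 1, "<", 216**4)
# 2136750625 < 2147483647 < 2176782336

Switching id (and blockId) to long long inside the kernel would lift that particular limit.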

**** Be sure that nvcc is callable; try first:
nvcc -V

Of course we are bounded by 1024 threads per block,
but the CUDA magic is that it launches a lot of them and coordinates and communicates between
them for you. It is magic: if you have ever programmed multi-threading on a CPU, it is a
headache, you follow your threads one by one, a true pain in the ... neck!
CUDA launches the threads for you and makes your life easier.
In the old way, the CUDA kernel is equal to 4 nested loops (in Python, with the same flat index as the kernel):

for t in range(nn):
    for x in range(nn):
        for y in range(nn):
            for z in range(nn):
                a[t + nn*x + nn*nn*y + nn*nn*nn*z] = 1

Now CUDA does it all at once for you.
Happy CUDA coding. We shall continue later, to see how WSL and Windows
compare.
c u

I shall continue, but for now just give me some time to focus on something else.