When is WSL faster than Windows?!

WSL 2 is a true Linux system with a full Linux kernel. That said, some memory problems have been reported with WSL: it can take a lot of memory, because Linux does not like free memory and tends to use it as cache. Microsoft limits WSL's memory usage to half of the host RAM by default, and you can change the limit as you like with a .wslconfig file in your user profile.
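
For example, a minimal .wslconfig in your Windows user profile folder could look like this (the values below are just placeholders I made up; pick whatever fits your machine), and then run wsl --shutdown so the new limits take effect:

[wsl2]
memory=12GB
swap=8GB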

Here are some cases where I have noticed that WSL is faster than Windows on CPU/GPU.
First and foremost, I haven't tried different WSL distros; I am using Ubuntu 18.04 and that's all,
on Windows 11 (W11).
Maybe in the future I shall try something else, and if you already know
a fast WSL distro then please let me/us know.

1- Numba under WSL is faster
Here in Ubuntu 18.04 I have

numba.__version__
'0.53.1'

while in W11

numba.__version__
'0.55.2'

My CPU calculations (parallel/prange) are faster under WSL. Strange but true.
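
To be concrete, the kind of thing I mean is a parallel prange loop like the one below (a minimal made-up example, not my actual code); run the same script under WSL and W11 and compare the timings:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def par_sum_sq(x):
    # parallel reduction over all cores
    s = 0.0
    for i in prange(x.shape[0]):
        s += x[i] * x[i]
    return s

x = np.random.rand(10_000_000)
par_sum_sq(x)          # first call triggers compilation
# then time it, e.g. with %timeit par_sum_sq(x)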

2- True Linux stuff: POSIX threads (pthreads)
They are terrible in W11 under a Cygwin shell, where they are slower than a single thread.
But in WSL they work and rock, really fast. In general gcc is at home and works great;
for example, PARI compiled under WSL is 7x faster.

3- nvc++/OpenACC (CPU/GPU), unified memory programming, and many other HPC tools
They are great tools and they simply don't exist for W11.

4- OpenMP works with gcc. On Windows, the MS C compiler has only limited OpenMP support.

5- For GPU work it depends on what you are doing. But I found it great to have PyCUDA/
PyOpenCL and CuPy/Numba in two different worlds at the same time on one single
system. Memory copying is a few milliseconds slower under WSL, but it is a global, i.e.
fixed, slowdown.
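
If you want to measure that fixed copy overhead yourself, here is a quick sketch with PyCUDA events (my own toy example, buffer size chosen arbitrarily); run it under both WSL and W11 and the difference should show up as a roughly constant number of milliseconds:

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv

a = np.ones(64 * 1024 * 1024, dtype=np.float32)   # ~256 MB
b = np.empty_like(a)
a_gpu = drv.mem_alloc(a.nbytes)
start, end = drv.Event(), drv.Event()

start.record()                      # time host -> device
drv.memcpy_htod(a_gpu, a)
end.record(); end.synchronize()
print("HtoD: %.3f ms" % start.time_till(end))

start.record()                      # time device -> host
drv.memcpy_dtoh(b, a_gpu)
end.record(); end.synchronize()
print("DtoH: %.3f ms" % start.time_till(end))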

I am working on more GPU calculations and shall see.
But usually the most important factor is the math: no multithreading compiler
so far uses algebraic mathematical methods (group theory, combinatorics, …).
Also, it would be nice to see Julia under WSL/W11.

Good news: I have installed HPC SDK 22.5 and CUDA Toolkit 11.7,
and now nvprof works!!!

NVIDIA guys, you should keep us informed about news like this.
By the way, when downloading these big files I had no
"session expired" problems. Thank you for listening and responding.

So, we can see some numbers. In general, I have noticed that Python stuff
works faster in WSL than under W11. Strange but true:
PyCUDA, PyOpenCL, Numba, CuPy and even Reikna.
cuda-python from NVIDIA hasn't worked for me under WSL so far, neither via
pip install cuda-python nor by compiling from source.
It works under W11, but I haven't tried anything serious with it so far.

It would also be nice to test Julia. Why does Python work better under WSL?!

Let us see. Here is a nice little program in PyCUDA; I have borrowed it from somewhere
and just pumped up the values to make it a little harder and longer-running.

I shall give the code, then run it, then profile it, then look at the differences.
It runs three tests of the same computation: a raw GPU kernel (the fastest),
then Elementwise (a nice high-level interface), then GPUArray (higher level still).
It will become clear that the PyCUDA GPUArray version needs further tuning.

Of course, in principle it should also be compared with Numba and CuPy.

Nothing fancy; many thanks to the original author.


#!python 
# SimpleSpeedTest.py
# Very simple speed testing code
# Shows you how to run a loop over sin() using different methods
# with a note of the time each method takes
# For the GPU this uses SourceModule, ElementwiseKernel, GPUArray
# For the CPU this uses numpy
# Ian@IanOzsvald.com
# Using a WinXP Intel Core2 Duo 2.66GHz CPU (1 CPU used)
# with a 9800GT GPU I get the following timings (smaller is better):
# Using nbr_values == 8192
# Calculating 100000 iterations
# SourceModule time and first three results:
# 0.166590s, [ 0.005477  0.005477  0.005477]
# Elementwise time and first three results:
# 0.171657s, [ 0.005477  0.005477  0.005477]
# Elementwise Python looping time and first three results:
# 1.487470s, [ 0.005477  0.005477  0.005477]
# GPUArray time and first three results:
# 4.740007s, [ 0.005477  0.005477  0.005477]
# CPU time and first three results:
# 32.933660s, [ 0.005477  0.005477  0.005477]
 
# Using Win 7 x64, GTX 470 GPU, X5650 Xeon,
# Driver v301.42, CUDA 4.2, Python 2.7 x64,
# PyCuda 2012.1 gave the following results:
# Using nbr_values == 8192
# Calculating 100000 iterations
# SourceModule time and first three results:
# 0.058321s, [ 0.005477  0.005477  0.005477]
# Elementwise time and first three results:
# 0.102110s, [ 0.005477  0.005477  0.005477]
# Elementwise Python looping time and first three results:
# 2.428810s, [ 0.005477  0.005477  0.005477]
# GPUArray time and first three results:
# 8.421861s, [ 0.005477  0.005477  0.005477]
# CPU time measured using :
# 5.905661s, [ 0.005477  0.005477  0.005477]
import pycuda.driver as drv
import pycuda.tools
import pycuda.autoinit
import numpy
from pycuda.compiler import SourceModule
import pycuda.gpuarray as gpuarray
import pycuda.cumath
from pycuda.elementwise import ElementwiseKernel

blocks = 64
block_size = 1024    #128 orig
nbr_values = blocks * block_size
print("Using nbr_values ==", nbr_values)
# Number of iterations for the calculations,
# 100 is very quick, 2000000 will take a while
n_iter = 500000  #100000
print("Calculating %d iterations" % (n_iter))

# create two timers so we can speed-test each approach
start = drv.Event()
end = drv.Event()

######################
# SourceModule SECTION
# We write the C code and the indexing and we have lots of control

mod = SourceModule("""
__global__ void gpusin(float *dest, float *a, int n_iter)
{
  const int i = blockDim.x*blockIdx.x + threadIdx.x;
  for(int n = 0; n < n_iter; n++) {
    a[i] = sin(a[i]);
  }
  dest[i] = a[i];
}
""")

gpusin = mod.get_function("gpusin")

# create an array of 1s
a = numpy.ones(nbr_values).astype(numpy.float32)
# create a destination array that will receive the result
dest = numpy.zeros_like(a)

start.record() # start timing
gpusin(drv.Out(dest), drv.In(a), numpy.int32(n_iter), grid=(blocks,1), block=(block_size,1,1) )
end.record() # end timing
# calculate the run length
end.synchronize()
secs = start.time_till(end)*1e-3
print("SourceModule time and first three results:")
print("%fs, %s" % (secs, str(dest[:3])))


#####################
# Elementwise SECTION
# use an ElementwiseKernel with sin in a for loop all in C call from Python
kernel = ElementwiseKernel(
   "float *a, int n_iter",
   "for(int n = 0; n < n_iter; n++) { a[i] = sin(a[i]);}",
   "gpusin")

a = numpy.ones(nbr_values).astype(numpy.float32)
a_gpu = gpuarray.to_gpu(a)
start.record() # start timing
#kernel(a_gpu, numpy.int(n_iter))
kernel(a_gpu, int(n_iter))
end.record() # end timing
# calculate the run length
end.synchronize()
secs = start.time_till(end)*1e-3
print("Elementwise time and first three results:")
print("%fs, %s" % (secs, str(a_gpu.get()[:3])))


####################################
# Elementwise Python looping SECTION
# as Elementwise but the for loop is in Python, not in C
kernel = ElementwiseKernel(
   "float *a",
   "a[i] = sin(a[i]);",
   "gpusin")

a = numpy.ones(nbr_values).astype(numpy.float32)
a_gpu = gpuarray.to_gpu(a)
start.record() # start timing
for i in range(n_iter):
    kernel(a_gpu)
end.record() # end timing
# calculate the run length
end.synchronize()
secs = start.time_till(end)*1e-3
print("Elementwise Python looping time and first three results:")
print("%fs, %s" % (secs, str(a_gpu.get()[:3])))


##################
# GPUArray SECTION
# The result is copied back to main memory on each iteration, this is a bottleneck

a = numpy.ones(nbr_values).astype(numpy.float32)
a_gpu = gpuarray.to_gpu(a)
start.record() # start timing
for i in range(n_iter):
    a_gpu = pycuda.cumath.sin(a_gpu)
end.record() # end timing
# calculate the run length
end.synchronize()
secs = start.time_till(end)*1e-3
print("GPUArray time and first three results:")
print("%fs, %s" % (secs, str(a_gpu.get()[:3])))


#############
# CPU SECTION
# use numpy to calculate the result on the CPU for reference

a = numpy.ones(nbr_values).astype(numpy.float32)
start.record() # start timing
start.synchronize()

for i in range(n_iter):
    a = numpy.sin(a)

end.record() # end timing
# calculate the run length
end.synchronize()
secs = start.time_till(end)*1e-3
print("CPU time and first three results:")
print("%fs, %s" % (secs, str(a[:3])))



I forgot to mention what I am using. Under W11:

(base) D:\CPyCUDA>pip show pycuda
Name: pycuda
Version: 2022.1
Summary: Python wrapper for Nvidia CUDA
Home-page: PyCUDA
Author: Andreas Kloeckner
Author-email: inform@tiker.net
License: MIT
Location: c:\programdata\anaconda3\lib\site-packages
Requires: appdirs, mako, pytools
Required-by:

and under WSL

mabd@LAPTOP-T8DQ9UK0:~/pymine$ pip show pycuda
Name: pycuda
Version: 2021.1
Summary: Python wrapper for Nvidia CUDA
Home-page: PyCUDA
Author: Andreas Kloeckner
Author-email: inform@tiker.net
License: MIT
Location: /home/mabd/.local/lib/python3.6/site-packages
Requires: appdirs, mako, pytools
Required-by:

Nearly the same PyCUDA, but the Python versions differ.
Under W11:

(base) D:\CPyCUDA>python -V
Python 3.8.10
and under my WSL Ubuntu 18.04:

mabd@LAPTOP-T8DQ9UK0:~/pymine$ python3 -V
Python 3.6.9

OK.

Below is a sample of what you should get profiling the code under WSL and under W11;
the full nvprof outputs are given further down.

It is amazing: WSL is fantastic, and with nvprof it is magic.

The first difference between nvprof on W11 and on WSL is a warning, which I shall omit from my later output.
So let me explain it once and for all.

mabd@LAPTOP-T8DQ9UK0:~/pymine$ nvprof python3 simple_speed_test.py


==211== NVPROF is profiling process 211, command: python3 simple_speed_test.py
==211== Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: Programming Guide :: CUDA Toolkit Documentation


I am working on a Ryzen 7: it has an integrated Renoir AMD GPU,
and then there is my NVIDIA GPU. I like PyCUDA because I can easily change my code
to PyOpenCL and let it work simultaneously on the AMD GPU and the NVIDIA GPU.
That is nice; I run the pieces separately, then collect and mix the results together.
The profiler is complaining about unified memory and about
accessing RAM between the two different GPUs.
That is exactly my setup: I use W11 for OpenCL and WSL for PyCUDA,
and fire up both GPUs.
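
A quick way to see both GPUs from the Windows side is to list what PyOpenCL finds (a trivial sketch; on my machine I would expect both the Renoir iGPU and the GTX 1660 Ti to show up):

import pyopencl as cl

for platform in cl.get_platforms():
    print("Platform:", platform.name)
    for device in platform.get_devices():
        print("  Device:", device.name)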

OK, done; forget the warning.
It doesn't appear under nvprof on W11 because unified memory is not applicable there.

Now, if you run the PyCUDA simple_speed_test code
under W11 and WSL:

1- Raw kernel, 500000 iterations: the time is almost the same.
2- 500000 iterations with GPUArray in PyCUDA is the worst result on both
W11 and WSL, but still faster than the CPU (single thread).
Of course, changing the single-threaded CPU code to multithreaded CPU code
would make the CPU version faster than the (PyCUDA) GPUArray version.
3- Elementwise PyCUDA is the big winner under WSL: at 500000 iterations
it is almost as fast as the raw kernel.

So, the most important lessons if you are going to use PyCUDA for your
work:
1- The raw kernel performs the same whether on WSL or W11.
2- Elementwise is better and cool under WSL; you can use it instead of
a raw kernel.
3- (PyCUDA) GPUArray in a Python loop is poor; if you have to use that pattern, then a multithreaded CPU
version with Numba (njit, prange) may well be faster, as in the sketch below. Reconsider.
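
For point 3, here is the kind of multithreaded CPU alternative I mean: the same repeated-sin workload as above, but with Numba njit + prange (a sketch I haven't tuned; time it against the GPUArray section yourself):

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def cpu_sin_loop(a, n_iter):
    # apply sin() n_iter times to every element, spread across all cores
    for i in prange(a.shape[0]):
        v = a[i]
        for n in range(n_iter):
            v = np.sin(v)
        a[i] = v
    return a

a = np.ones(64 * 1024, dtype=np.float32)
cpu_sin_loop(a, 10)                 # warm-up call triggers compilation
# then time cpu_sin_loop(np.ones(64 * 1024, dtype=np.float32), 500000)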

Now, profiling can explain WHY?!!
Luckily computers have profilers while, in general, life does not.

Now, for those who might be interested in reading about Numba,

NVIDIA has some tutorials:

Numba: High-Performance Python with CUDA Acceleration | NVIDIA Technical Blog

GitHub - ContinuumIO/gtc2017-numba: Numba tutorial for GTC 2017 conference

Seven Things You Might Not Know about Numba | NVIDIA Technical Blog

and the Numba reference manual is great.

By the way, Mark Harris has a great collection of tutorials, the greatest on the net:

Author: Mark Harris | NVIDIA Technical Blog

Just start from the bottom up.

It would be nice to have a monthly CUDA periodical from NVIDIA as a PDF:
every month a selection of topics, something new, and some tutorials at basic, intermediate
and advanced levels.

You know, like Circuit Cellar in electronics or MSDN Magazine from Microsoft.

Sorry guys, this is a poor man's guide to NVIDIA and GPU home supercomputers.

Here is the output of nvprof under WSL:

mabd@LAPTOP-T8DQ9UK0:~/pymine$ nvprof python3 simple_speed_test.py
==211== NVPROF is profiling process 211, command: python3 simple_speed_test.py
==211== Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-managed-memory
Using nbr_values == 65536
Calculating 500000 iterations
SourceModule time and first three results:
0.365042s, [0.00244915 0.00244915 0.00244915]
Elementwise time and first three results:
0.352039s, [0.00244915 0.00244915 0.00244915]
Elementwise Python looping time and first three results:
8.084059s, [0.00244915 0.00244915 0.00244915]
GPUArray time and first three results:
33.873863s, [0.00244915 0.00244915 0.00244915]
CPU time and first three results:
70.477805s, [0.00244915 0.00244915 0.00244915]
==211== Profiling application: python3 simple_speed_test.py
==211== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   71.77%  6.40808s    500000  12.816us  4.0000us  43.359us  sinf_kernel
                   28.22%  2.51965s    500002  5.0390us  2.6240us  360.88ms  gpusin
                    0.00%  273.21us         4  68.301us  39.967us  153.21us  [CUDA memcpy DtoH]
                    0.00%  166.43us         4  41.606us  41.086us  42.239us  [CUDA memcpy HtoD]
      API calls:   60.04%  18.1712s    500005  36.342us  1.3330us  778.77us  cuMemFree
                   31.34%  9.48508s   1000002  9.4850us  3.2660us  323.73ms  cuLaunchKernel
                    3.01%  910.15ms    500005  1.8200us  1.1220us  1.1851ms  cuMemAlloc
                    1.84%  557.04ms         1  557.04ms  557.04ms  557.04ms  cuCtxCreate
                    1.19%  360.98ms         1  360.98ms  360.98ms  360.98ms  cuCtxSynchronize
                    1.02%  308.12ms         6  51.353ms  23.554us  298.85ms  cuEventSynchronize
                    0.96%  289.53ms   1000002     289ns     190ns  615.23us  cuFuncSetBlockShape
                    0.34%  103.85ms    500008     207ns     150ns  133.77us  cuCtxGetDevice
                    0.22%  65.311ms         1  65.311ms  65.311ms  65.311ms  cuCtxDetach
                    0.01%  3.8047ms        10  380.47us  5.4710us  3.6753ms  cuEventRecord
                    0.01%  2.5364ms         4  634.09us  16.892us  2.4436ms  cuModuleUnload
                    0.01%  2.0628ms         4  515.71us  335.29us  1.0251ms  cuModuleLoadDataEx
                    0.01%  2.0581ms         1  2.0581ms  2.0581ms  2.0581ms  cuDeviceGetPCIBusId
                    0.00%  1.1918ms         4  297.96us  186.22us  402.62us  cuMemcpyDtoH
                    0.00%  1.1432ms         4  285.80us  50.725us  955.65us  cuMemcpyHtoD
                    0.00%  19.797us         9  2.1990us     321ns  6.6520us  cuCtxPopCurrent
                    0.00%  16.751us         5  3.3500us  1.6940us  6.6120us  cuEventElapsedTime
                    0.00%  16.671us         2  8.3350us  1.3020us  15.369us  cuEventCreate
                    0.00%  6.3530us         8     794ns     150ns  1.8530us  cuDeviceGetAttribute
                    0.00%  5.0580us         9     562ns     160ns  1.7230us  cuCtxPushCurrent
                    0.00%  4.0080us         7     572ns     140ns  1.0830us  cuDeviceComputeCapability
                    0.00%  3.7380us         4     934ns     842ns  1.0020us  cuModuleGetFunction
                    0.00%  3.5370us         3  1.1790us     701ns  1.7940us  cuDeviceGetCount
                    0.00%  3.4060us         2  1.7030us     912ns  2.4940us  cuEventDestroy
                    0.00%  2.6950us         2  1.3470us     842ns  1.8530us  cuDeviceGet

and here is the output of nvprof under W11:

(base) D:\CPyCUDA>nvprof python simple_speed_test.py
==744== NVPROF is profiling process 744, command: python simple_speed_test.py
Using nbr_values == 65536
Calculating 500000 iterations
SourceModule time and first three results:
0.340462s, [0.00244915 0.00244915 0.00244915]
Elementwise time and first three results:
27.758891s, [0.00244915 0.00244915 0.00244915]
Elementwise Python looping time and first three results:
32.454152s, [0.00244915 0.00244915 0.00244915]
GPUArray time and first three results:
46.710250s, [0.00244915 0.00244915 0.00244915]
CPU time and first three results:
94.244711s, [0.00244914 0.00244914 0.00244914]
==744== Profiling application: python simple_speed_test.py
==744== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   63.60%  5.07620s    500000  10.152us  7.0720us  48.958us  sinf_kernel
                   36.40%  2.90534s    500002  5.8100us  2.9120us  625.90ms  gpusin
                    0.00%  196.79us         4  49.198us  39.583us  77.757us  [CUDA memcpy DtoH]
                    0.00%  165.88us         4  41.470us  40.862us  42.270us  [CUDA memcpy HtoD]
      API calls:   50.10%  9.86267s   1000002  9.8620us  4.8000us  11.157ms  cuLaunchKernel
                   37.42%  7.36676s    500005  14.733us  1.3000us  546.00us  cuMemFree
                    4.69%  923.75ms    500005  1.8470us  1.3000us  370.70us  cuMemAlloc
                    3.18%  626.29ms         6  104.38ms  1.7000us  625.95ms  cuEventSynchronize
                    1.72%  338.85ms         1  338.85ms  338.85ms  338.85ms  cuCtxSynchronize
                    1.16%  227.65ms   1000002     227ns     100ns  127.70us  cuFuncSetBlockShape
                    1.02%  200.15ms         1  200.15ms  200.15ms  200.15ms  cuCtxCreate
                    0.44%  86.538ms    500008     173ns     100ns  95.300us  cuCtxGetDevice
                    0.24%  47.417ms         1  47.417ms  47.417ms  47.417ms  cuCtxDetach
                    0.02%  3.8980ms         4  974.50us  809.10us  1.0985ms  cuModuleLoadDataEx
                    0.01%  1.0450ms         4  261.25us  15.500us  963.80us  cuModuleUnload
                    0.00%  880.70us         4  220.18us  161.50us  312.80us  cuMemcpyDtoH
                    0.00%  605.80us        10  60.580us  8.3000us  413.10us  cuEventRecord
                    0.00%  326.60us         4  81.650us  56.500us  96.500us  cuMemcpyHtoD
                    0.00%  18.700us        12  1.5580us     200ns  7.8000us  cuCtxPopCurrent
                    0.00%  14.000us         2  7.0000us  1.0000us  13.000us  cuEventCreate
                    0.00%  9.9000us         5  1.9800us  1.0000us  4.8000us  cuEventElapsedTime
                    0.00%  7.6000us         4  1.9000us     700ns  4.3000us  cuModuleGetFunction
                    0.00%  5.8000us        12     483ns     100ns  1.6000us  cuCtxPushCurrent
                    0.00%  5.3000us         7     757ns     300ns  2.3000us  cuDeviceComputeCapability
                    0.00%  5.1000us         3  1.7000us     200ns  3.8000us  cuDeviceGetCount
                    0.00%  3.4000us         8     425ns     200ns     700ns  cuDeviceGetAttribute
                    0.00%  2.6000us         2  1.3000us     500ns  2.1000us  cuEventDestroy
                    0.00%  1.9000us         2     950ns     300ns  1.6000us  cuDeviceGet

The first thing to notice is that they are not the same.

Compare the GPU activities part first.

WSL:

 GPU activities:   71.77%  6.40808s    500000  12.816us  4.0000us  43.359us  sinf_kernel
                   28.22%  2.51965s    500002  5.0390us  2.6240us  360.88ms  gpusin
                    0.00%  273.21us         4  68.301us  39.967us  153.21us  [CUDA memcpy DtoH]
                    0.00%  166.43us         4  41.606us  41.086us  42.239us  [CUDA memcpy HtoD]

and W11:

 GPU activities:   63.60%  5.07620s    500000  10.152us  7.0720us  48.958us  sinf_kernel
                   36.40%  2.90534s    500002  5.8100us  2.9120us  625.90ms  gpusin
                    0.00%  196.79us         4  49.198us  39.583us  77.757us  [CUDA memcpy DtoH]
                    0.00%  165.88us         4  41.470us  40.862us  42.270us  [CUDA memcpy HtoD]

I launched the kernel 500000 times; on average the kernel launch time is similar. Still, it is interesting that Max minus Min is different for gpusin.

CUDA memcpy HtoD is similar, BUT DtoH is different; I have seen this in many different cases. This is a difference between WSL and W11, but it is not that big
or crucial.

Now for the second part, the API calls; it is interesting how the top entries are swapped around.



      API calls:   60.04%  18.1712s    500005  36.342us  1.3330us  778.77us  cuMemFree
                   31.34%  9.48508s   1000002  9.4850us  3.2660us  323.73ms  cuLaunchKernel
                    3.01%  910.15ms    500005  1.8200us  1.1220us  1.1851ms  cuMemAlloc
                    1.84%  557.04ms         1  557.04ms  557.04ms  557.04ms  cuCtxCreate
                    1.19%  360.98ms         1  360.98ms  360.98ms  360.98ms  cuCtxSynchronize
                    1.02%  308.12ms         6  51.353ms  23.554us  298.85ms  cuEventSynchronize
                    0.96%  289.53ms   1000002     289ns     190ns  615.23us  cuFuncSetBlockShape
                    0.34%  103.85ms    500008     207ns     150ns  133.77us  cuCtxGetDevice

while under W11:



      API calls:   50.10%  9.86267s   1000002  9.8620us  4.8000us  11.157ms  cuLaunchKernel
                   37.42%  7.36676s    500005  14.733us  1.3000us  546.00us  cuMemFree
                    4.69%  923.75ms    500005  1.8470us  1.3000us  370.70us  cuMemAlloc
                    3.18%  626.29ms         6  104.38ms  1.7000us  625.95ms  cuEventSynchronize
                    1.72%  338.85ms         1  338.85ms  338.85ms  338.85ms  cuCtxSynchronize
                    1.16%  227.65ms   1000002     227ns     100ns  127.70us  cuFuncSetBlockShape
                    1.02%  200.15ms         1  200.15ms  200.15ms  200.15ms  cuCtxCreate
                    0.44%  86.538ms    500008     173ns     100ns  95.300us  cuCtxGetDevice

Do you see the difference?

cuMemFree, cuCtxCreate, cuEventSynchronize (WSL first, then W11):

60.04%  18.1712s    500005  36.342us  1.3330us  778.77us  cuMemFree
37.42%  7.36676s    500005  14.733us  1.3000us  546.00us  cuMemFree
1.84%  557.04ms         1  557.04ms  557.04ms  557.04ms  cuCtxCreate
1.02%  200.15ms         1  200.15ms  200.15ms  200.15ms  cuCtxCreate
1.02%  308.12ms         6  51.353ms  23.554us  298.85ms  cuEventSynchronize
3.18%  626.29ms         6  104.38ms  1.7000us  625.95ms  cuEventSynchronize

The numbers are quite different, aren't they?
Now, what is really interesting:

                    0.00%  4.0080us         7     572ns     140ns  1.0830us  cuDeviceComputeCapability
                    0.00%  3.7380us         4     934ns     842ns  1.0020us  cuModuleGetFunction
                    0.00%  3.5370us         3  1.1790us     701ns  1.7940us  cuDeviceGetCount
                    0.00%  3.4060us         2  1.7030us     912ns  2.4940us  cuEventDestroy
                    0.00%  2.6950us         2  1.3470us     842ns  1.8530us  cuDeviceGet

vs

                    0.00%  5.3000us         7     757ns     300ns  2.3000us  cuDeviceComputeCapability
                    0.00%  5.1000us         3  1.7000us     200ns  3.8000us  cuDeviceGetCount
                    0.00%  3.4000us         8     425ns     200ns     700ns  cuDeviceGetAttribute
                    0.00%  2.6000us         2  1.3000us     500ns  2.1000us  cuEventDestroy
                    0.00%  1.9000us         2     950ns     300ns  1.6000us  cuDeviceGet

Why are these numbers different?!

Why are cuDeviceGetCount, cuDeviceGet, cuEventDestroy, … different? They are utility calls; there is nothing special about them.

Lucky Microsoft guys: now they can easily identify the weak points of Windows and make it better.
Lucky users like us can learn from this, make their programs run faster,
and write better programs.
Elementwise is OK under WSL PyCUDA = good news for people who believe that high-level CUDA is possible.
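
For comparison, the same high-level idea in CuPy would look roughly like this (a sketch along the lines of the PyCUDA Elementwise section above; I haven't benchmarked this version here):

import numpy as np
import cupy as cp

# loop-of-sin kernel, with the loop in C, applied elementwise
gpusin = cp.ElementwiseKernel(
    'float32 x, int32 n_iter',
    'float32 y',
    '''
    float v = x;
    for (int n = 0; n < n_iter; n++) { v = sinf(v); }
    y = v;
    ''',
    'gpusin')

a = cp.ones(64 * 1024, dtype=cp.float32)
dest = gpusin(a, np.int32(500000))
print(dest[:3])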

Let us try something cool.

The n-body problem is a very demanding computational problem.
The graphical demo does not show up on screen for the GPU run, neither on W11 nor on WSL,
but -benchmark works and shows us the speed of the computation.

We will do 4 tests: single/double precision on CPU/GPU.


cpu/w11


(base) E:\cuda-samples-11.6\bin\win64\Release>nbody -cpu -benchmark

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Windowed mode
Simulation data stored in video memory
Single precision floating point simulation
1 Devices used for simulation
Simulation with CPU
4096 bodies, total time for 10 iterations: 731.708 ms
= 0.229 billion interactions per second
= 4.586 single-precision GFLOP/s at 20 flops per interaction

(base) E:\cuda-samples-11.6\bin\win64\Release>nbody -cpu -fp64 -benchmark

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Windowed mode
Simulation data stored in video memory
Double precision floating point simulation
1 Devices used for simulation
Simulation with CPU
4096 bodies, total time for 10 iterations: 667.029 ms
= 0.252 billion interactions per second
= 7.546 double-precision GFLOP/s at 30 flops per interaction


cpu/wsl


mabd@LAPTOP-T8DQ9UK0:~/cuda-samples-11.6/Samples/5_Domain_Specific/nbody$ ./nbody -cpu -benchmark

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Windowed mode
Simulation data stored in video memory
Single precision floating point simulation
1 Devices used for simulation
Simulation with CPU
4096 bodies, total time for 10 iterations: 3213.046 ms
= 0.052 billion interactions per second
= 1.044 single-precision GFLOP/s at 20 flops per interaction

mabd@LAPTOP-T8DQ9UK0:~/cuda-samples-11.6/Samples/5_Domain_Specific/nbody$ ./nbody -cpu -fp64 -benchmark

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Windowed mode
Simulation data stored in video memory
Double precision floating point simulation
1 Devices used for simulation
Simulation with CPU
4096 bodies, total time for 10 iterations: 4042.778 ms
= 0.041 billion interactions per second
= 1.245 double-precision GFLOP/s at 30 flops per interaction


gpu/w11


(base) E:\cuda-samples-11.6\bin\win64\Release>nbody -benchmark

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Windowed mode
Simulation data stored in video memory
Single precision floating point simulation
1 Devices used for simulation
GPU Device 0: “Turing” with compute capability 7.5

Compute 7.5 CUDA device: [NVIDIA GeForce GTX 1660 Ti with Max-Q Design]
24576 bodies, total time for 10 iterations: 57.415 ms
= 105.195 billion interactions per second
= 2103.907 single-precision GFLOP/s at 20 flops per interaction

(base) E:\cuda-samples-11.6\bin\win64\Release>nbody -fp64 -benchmark

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Windowed mode
Simulation data stored in video memory
Double precision floating point simulation
1 Devices used for simulation
GPU Device 0: “Turing” with compute capability 7.5

Compute 7.5 CUDA device: [NVIDIA GeForce GTX 1660 Ti with Max-Q Design]
24576 bodies, total time for 10 iterations: 1400.539 ms
= 4.312 billion interactions per second
= 129.374 double-precision GFLOP/s at 30 flops per interaction


gpu/wsl


mabd@LAPTOP-T8DQ9UK0:~/cuda-samples-11.6/Samples/5_Domain_Specific/nbody$ ./nbody -benchmark

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Windowed mode
Simulation data stored in video memory
Single precision floating point simulation
1 Devices used for simulation
GPU Device 0: “Turing” with compute capability 7.5

Compute 7.5 CUDA device: [NVIDIA GeForce GTX 1660 Ti with Max-Q Design]
24576 bodies, total time for 10 iterations: 50.156 ms
= 120.420 billion interactions per second
= 2408.399 single-precision GFLOP/s at 20 flops per interaction

mabd@LAPTOP-T8DQ9UK0:~/cuda-samples-11.6/Samples/5_Domain_Specific/nbody$ ./nbody -fp64 -benchmark

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Windowed mode
Simulation data stored in video memory
Double precision floating point simulation
1 Devices used for simulation
GPU Device 0: “Turing” with compute capability 7.5

Compute 7.5 CUDA device: [NVIDIA GeForce GTX 1660 Ti with Max-Q Design]
24576 bodies, total time for 10 iterations: 1419.236 ms
= 4.256 billion interactions per second
= 127.670 double-precision GFLOP/s at 30 flops per interaction
mabd@LAPTOP-T8DQ9UK0:~/cuda-samples-11.6/Samples/5_Domain_Specific/nbody$

Let me summarize the numbers; the numbers talk.

cpu/w11: f32
= 4.586 single-precision GFLOP/s at 20 flops per interaction
cpu/w11:f64
= 7.546 double-precision GFLOP/s at 30 flops per interaction

cpu/wsl:f32
= 1.044 single-precision GFLOP/s at 20 flops per interaction
cpu/wsl:f64
= 1.245 double-precision GFLOP/s at 30 flops per interaction

WINDOWS WINS

gpu/w11:f32
= 2103.907 single-precision GFLOP/s at 20 flops per interaction
gpu/w11:f64
= 129.374 double-precision GFLOP/s at 30 flops per interaction

gpu/wsl:f32
= 2408.399 single-precision GFLOP/s at 20 flops per interaction
gpu/wsl:f64
= 127.670 double-precision GFLOP/s at 30 flops per interaction

======================================
VERY INTERESTING CASE
gpu/w11:f32
= 2103.907 single-precision GFLOP/s at 20 flops per interaction
gpu/wsl:f32
= 2408.399 single-precision GFLOP/s at 20 flops per interaction
I gain about 300 single-precision GF/s by using wsl:f32

300GF/s WOW

GOOD NEWS

1- The difference between GPU and CPU is amazing.
2- The CPU calculations are single-threaded. THAT IS VERY IMPORTANT.

You should not look at this as (2400-2100)/2100 = 1/7 ~ 14% gain.
No, you should look at it as (2400-2100)/7 = 300/7 ~ 43, i.e. ~4300% gain,
where the 7 in the denominator is the ~7 GF/s from the single-threaded CPU calculation.

2.5 TF/s on a laptop, and 0.3 TF/s gained by one mouse click.
REALLY COOL

To be honest, I have just noticed that the number of particles differs:
cpu = 4096 bodies, total time for 10 iterations
gpu = 24576 bodies, total time for 10 iterations

I should have matched these, and consequently the CPU scores would probably come out lower;
but anyway, ~7 GF/s for a single-threaded ~4 GHz CPU is already a lot.

This correction does not change the main result, wsl-gpu-f32 vs w11-gpu-f32;
that is the main point, and those two runs were done under the same conditions.
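
If I remember the sample's options correctly, the body count can be forced with -numbodies, so the comparison could be redone under equal conditions with something like:

./nbody -benchmark -numbodies=4096
./nbody -cpu -benchmark -numbodies=4096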

ok.

What next? GPUArray with PyCUDA + Numba + CuPy,
or Julia. Julia is great. People know Sage, but very few know that Julia has OSCAR.
Sage (Python) under W11 was terribly slow, even slower than pure Python.
No wonder William Stein moved new releases of Sage to WSL.

But it would be very nice to compare OSCAR/FLINT vs PARI (multithreaded), and
OSCAR/Julia has some nice parallelization tools, not to mention Julia/CUDA
vs CuPy/Numba. WOW.

OSCAR/Julia lives in WSL, not W11, but now I want to check whether they work with CUDA.
With Sage on Windows it was impossible, because LLVM is not installable over a Cygwin shell,
so things like Numba or even PyCUDA did not work with Sage/Windows,
and I haven't seen Sage/Linux combined with PyCUDA-Numba-CuPy either.

Now, with WSL and OSCAR/Julia we have a new alternative, and Julia/CUDA already exists.
Maybe the big problem is that the OSCAR docs/tutorials are still rather thin.

I am using WSL Ubuntu 18.04 and W11.
By the way, a couple of weeks ago I installed perf for WSL from
Index of /pub/linux/kernel/tools/perf/

Initially I followed the answer by Gloit from

linux - Is there any method to run perf under WSL? - Stack Overflow

which says:

You can install linux-tools-generic.

apt install linux-tools-generic

Then run perf using the install path /usr/lib/linux-tools/<linux-version>-generic/perf.

Then I reinstalled another perf from
Index of /pub/linux/kernel/tools/perf/v5.10.0/
which matches my uname -r, and just followed the build instructions in
https://mirrors.edge.kernel.org/pub/linux/kernel/tools/perf/HOWTO.build.perf

It helps alongside the standard Python profilers, like KCacheGrind.
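
For example (just the generic perf workflow, nothing WSL-specific), something like:

perf record -F 99 -g -- python3 simple_speed_test.py
perf report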

I had installed this before installing the new CUDA Toolkit 11.7 and HPC SDK 22.5,
and now nvprof works for me.

To install the CUDA tools, first I installed CUDA Toolkit 11.7 for Windows/W11 with the new 516 driver
(the updated W11 drivers before that were for CUDA Toolkit 11.6, something around 512).
Then, to install CUDA Toolkit 11.7 for WSL, I simply followed the download page

CUDA Toolkit 11.7 Downloads | NVIDIA Developer

and chose

Linux —> x86_64 —> WSL-Ubuntu —> 2.0 —> Runfile (local)

then ran the two steps:
1- wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda_11.7.0_515.43.04_linux.run
2- sudo sh cuda_11.7.0_515.43.04_linux.run

In step 2 I used the defaults and didn't change anything,
and it works fine after exporting the correct path:

export PATH=/usr/local/cuda/bin:$PATH

Now life is much easier than the first time, two years ago.
Many webpages about WSL are old and outdated.

I installed the HPC SDK following their instructions at

NVIDIA HPC SDK Current Release Downloads | NVIDIA Developer

for Linux x86_64 DEB

$ echo 'deb [trusted=yes] https://developer.download.nvidia.com/hpc-sdk/ubuntu/amd64 /' | sudo tee /etc/apt/sources.list.d/nvhpc.list
$ sudo apt-get update -y
$ sudo apt-get install -y nvhpc-22-5

Now to make it work I use

export PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/22.5/compilers/bin:$PATH
export PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/22.5/cuda/11.7/bin:$PATH

Then I found a better way

module load /opt/nvidia/hpc_sdk/modulefiles/nvhpc/22.5

That is basically my system. Installing everything now is much easier and more straightforward.

GitHub - pypr/compyle: Execute a subset of Python on HPC platforms

Welcome to Compyle’s documentation! — Compyle 0.9.dev documentation

Now, let's continue:
Numba, Cython/OpenMP, PyCUDA, NumPy, CuPy.
Here is the big picture first; then we shall see the details and the code.
There is a nice package, Compyle, which generates parallel output from Python code:
1- CPU: Cython + OpenMP
2- GPU: OpenCL/PyCUDA
Actually, it promises that Cython/OpenMP is faster than Numba, though these
words should be taken with a grain of salt. Which Numba?
Numba is very powerful, and it is really a big project with many backends.

But I like to have something that generates workable Cython+OpenMP for me.
It is useful to have an alternative. Compyle has its own limitations, but it is useful,
and the code base is small, so you can look inside it more easily than Numba's.

I am interested in their benchmark, but I shall modify it.
At least for me, CUDA/PyCUDA is a better CUDA than CUDA/PyOpenCL.

We can test it: in my runs, CUDA/PyCUDA gives about double the performance.

To make it CUDA/PyCUDA, all that is needed is to change their backend in just two
files, kernel_vm and benchmark_vm. So good.

If you look inside their numba_vm, you will find they used just @jit.
Change it, if you like, to @njit(parallel=True),
import prange, and change range to prange, and the code will fly.
But as I said, I am more interested in something that generates for me
workable Cython/OpenMP. It is really unfair, though, to compare multithreaded
Cython with single-threaded Numba. Moreover, they should have used cfunc
in their Numba code; Numba would then be very fast.

Again, I am using it to show newbies like me how to generate Cython/OpenMP
code without pain.

Also, their benchmark reports ratios, not absolute numbers, which means it
is not transparent how to compare WSL vs W11.

But that is easily solved: measure absolute values for one test and everything else will scale.

Now, what about the CUDA test, besides changing PyOpenCL to PyCUDA?
CUDA shines when the problem gets bigger and bigger, so to really see CUDA
we should pump the values up higher. I think in their example they tried to make it
easily testable on small GPUs, but if you have more RAM, you can go higher
and beat the CPU. The CPU gain levels off because you don't have that many cores,
whereas CUDA can keep going.
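
A rough way to see this "pump up N" effect (my own toy comparison, sizes picked arbitrarily) is to time the same array operation with NumPy and CuPy at growing sizes:

import time
import numpy as np
import cupy as cp

for n in (1_000_000, 10_000_000, 100_000_000):
    x = np.random.rand(n).astype(np.float32)

    t0 = time.perf_counter()
    np.sin(x).sum()
    cpu_s = time.perf_counter() - t0

    x_gpu = cp.asarray(x)                 # HtoD copy kept outside the timing
    cp.cuda.Device().synchronize()
    t0 = time.perf_counter()
    cp.sin(x_gpu).sum()
    cp.cuda.Device().synchronize()
    gpu_s = time.perf_counter() - t0

    print("n=%d  cpu=%.4fs  gpu=%.4fs" % (n, cpu_s, gpu_s))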