Python / Numba further optimization on CUDA?

This is my first Python / Numba run-time comparison, just to help me learn Numba; I'm only taking my first steps (the final goal is a much more complex program that computes arithmetic operations on thousands of 2D arrays across multiple GPUs using CUDA). The comparison is a simple program that prints all prime numbers in an interval. The first version uses the default CPython interpreter and runs in approximately 12.38 seconds for the interval 2-50000. The second version uses Numba and jit and runs in about 0.70 seconds with the same interval, already a nice speedup of roughly 17x. I didn't even exclude the compilation time from code #2's total (the timer starts before the first call, which is what triggers compilation).
I was wondering:

  1. What could be optimized just with Numba and jit (and maybe NumPy)? (A sketch of what I mean follows this list.)
  2. What further speed increase can I expect with CUDA on the latest GPU architectures? (Sketch at the end of the post.)

Thanks!
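
For (1), to make it concrete: the obvious algorithmic change I know of is stopping trial division at sqrt(num) and skipping even divisors. A minimal sketch of what I mean (the is_prime name is mine; I believe math.sqrt is supported in nopython mode):

from math import sqrt
from numba import jit

@jit(nopython=True)
def is_prime(num):
    # Any factor larger than sqrt(num) pairs with one below it,
    # so trial division only needs to test up to sqrt(num).
    if num < 2:
        return False
    if num % 2 == 0:
        return num == 2
    for i in range(3, int(sqrt(num)) + 1, 2):
        if num % i == 0:
            return False
    return True

That alone should cut the inner loop from O(n) to O(sqrt(n)) per candidate, but I'm asking whether Numba or NumPy offer anything beyond this kind of algorithmic change.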

Code #1:
import timeit
start = timeit.default_timer()

lower = 2
upper = 50000

print("Prime numbers between", lower, "and", upper, "are:")

for num in range(lower, upper + 1):
    if num > 1:
        for i in range(2, num):
            if (num % i) == 0:
                break
        else:
            # for/else: the else branch runs only when the loop
            # finishes without hitting break, i.e. no divisor found
            print(num)
stop = timeit.default_timer()
print('Time: ', stop - start)

Code #2:
from numba import jit
import timeit
start = timeit.default_timer()

@jit(nopython=True)
def go_fast(num):
    if num > 1:
        for i in range(2, num):
            if (num % i) == 0:
                break
        else:
            # for/else again: no divisor found, so num is prime
            print(num)
    return

lower = 2
upper = 50000

print("Prime numbers between", lower, "and", upper, "are:")
for x in range(lower, upper + 1):
    go_fast(x)
stop = timeit.default_timer()
print('Time: ', stop - start)
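
For (2), here's roughly what I'm picturing as a first numba.cuda attempt: one thread per candidate number, each writing a flag into an array instead of printing (I assume printing from inside a kernel would serialize everything). This is only a sketch based on my reading of the numba.cuda docs; the kernel name and launch parameters are my own guesses:

import numpy as np
from numba import cuda

@cuda.jit
def prime_flags_kernel(lower, flags):
    # One thread per candidate: thread i tests num = lower + i.
    i = cuda.grid(1)
    if i < flags.size:
        num = lower + i
        prime = num > 1
        j = 2
        while prime and j * j <= num:
            if num % j == 0:
                prime = False
            j += 1
        flags[i] = 1 if prime else 0

lower, upper = 2, 50000
n = upper - lower + 1
flags = cuda.device_array(n, dtype=np.uint8)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
prime_flags_kernel[blocks, threads_per_block](lower, flags)

primes = np.arange(lower, upper + 1)[flags.copy_to_host().astype(bool)]
print(primes)

Is this the right general shape, and what kind of speedup over the jit version would be realistic on a recent GPU?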