Is the following feasible using CUDA and a GPU?

I’m unfamiliar with CUDA but have a problem which I think may be suited to it. Before spending a lot of time reading documentation and money on hardware, could you please tell me whether the following is feasible and sensible?

load 5000 matrices of bytes, each 1000 * 1000, from the host onto the GPU // assume there is sufficient memory on the GPU
repeat {
  on the host calculate the index of a matrix on the GPU
  select the matrix on the GPU
  form a matrix of floats on the GPU from the selected matrix of bytes using float = toInt(byte) * 256.0
  repeat {
    calculate a vector on host
    copy vector to GPU
    multiply vector by float matrix
    copy result back to host 
  } until some condition is satisfied on the host
} until some condition is satisfied on the host

I’d prefer to avoid C/C++ on the host if possible. Python would be fine.

It should be feasible. It’s hard to avoid some amount of C/C++ entirely if you want fast CUDA code, but pycuda lets you write most of the host code in Python, with only the kernel written in CUDA C++. numba goes further and lets you write the kernel code itself in Python, though with more limited flexibility than CUDA C++.
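
To give you a feel for it, here is a rough, untested sketch of your loop written entirely in Python with numba. The matrix count, launch parameters, loop conditions and random data are placeholders for your real logic, and the hand-written matrix-vector kernel is only illustrative:

import numpy as np
from numba import cuda

N = 1000            # matrix dimension (square 1000 x 1000, as in your description)
NUM_MATRICES = 8    # placeholder; use 5000 in the real problem, memory permitting

@cuda.jit
def bytes_to_floats(mats, idx, dst):
    # dst[i, j] = float(mats[idx, i, j]) * 256.0, one thread per element
    i, j = cuda.grid(2)
    if i < dst.shape[0] and j < dst.shape[1]:
        dst[i, j] = mats[idx, i, j] * 256.0

@cuda.jit
def matvec(mat, vec, out):
    # out[i] = sum over j of mat[i, j] * vec[j], one thread per row
    i = cuda.grid(1)
    if i < mat.shape[0]:
        acc = 0.0
        for j in range(mat.shape[1]):
            acc += mat[i, j] * vec[j]
        out[i] = acc

# load all byte matrices onto the GPU once (random data stands in for yours)
h_mats = np.random.randint(0, 256, size=(NUM_MATRICES, N, N), dtype=np.uint8)
d_mats = cuda.to_device(h_mats)

d_fmat = cuda.device_array((N, N), dtype=np.float32)
d_out = cuda.device_array(N, dtype=np.float32)

threads2d = (16, 16)
blocks2d = ((N + 15) // 16, (N + 15) // 16)

for outer in range(3):                    # stand-in for the outer host condition
    idx = outer % NUM_MATRICES            # index calculated on the host
    bytes_to_floats[blocks2d, threads2d](d_mats, idx, d_fmat)
    for inner in range(3):                # stand-in for the inner host condition
        h_vec = np.random.rand(N).astype(np.float32)   # vector calculated on host
        d_vec = cuda.to_device(h_vec)                  # copy vector to GPU
        matvec[(N + 255) // 256, 256](d_fmat, d_vec, d_out)
        h_result = d_out.copy_to_host()                # copy result back to host

The per-row matvec kernel is the simplest thing that works; for real performance you would probably tile it or call a GPU BLAS routine instead. But the overall structure, one upload of the byte matrices and only small vector transfers inside the loop, matches what you described and keeps the large data resident on the device.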

Thanks!