I’m unfamiliar with CUDA, but I have a problem which I think may be suited to it. Before spending a lot of time reading documentation and money on hardware, could you please tell me whether the following is feasible and sensible?
load 5000 1000×1000 matrices of bytes from the host onto the GPU // assume there is sufficient memory on the GPU (~5 GB)
repeat {
    on the host, calculate the index of a matrix on the GPU
    select that matrix on the GPU
    form a matrix of floats on the GPU from the selected matrix of bytes, using float = toInt(byte) * 256.0
    repeat {
        calculate a vector on the host
        copy the vector to the GPU
        multiply the vector by the float matrix
        copy the result back to the host
    } until some condition is satisfied on the host
} until some condition is satisfied on the host
I’d prefer to avoid C/C++ on the host if possible. Python would be fine.
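For what it’s worth, here is one way the steps above might look in Python, assuming the CuPy library (a NumPy-compatible GPU array library) as the host-side interface. This is only a sketch: the sizes are scaled down, the index calculation and loop conditions are placeholders, and it falls back to plain NumPy so it runs without a GPU.

```python
import numpy as np  # host-side arrays

try:
    import cupy as xp            # device-side arrays (GPU), if available
    on_gpu = True
except ImportError:
    import numpy as xp           # CPU stand-in so the sketch runs anywhere
    on_gpu = False

N_MATRICES, DIM = 50, 100        # scaled down from 5000 matrices of 1000x1000

# Step 1: load all byte matrices onto the device once, up front.
matrices = xp.asarray(
    np.random.randint(0, 256, size=(N_MATRICES, DIM, DIM), dtype=np.uint8)
)

for outer in range(3):                       # outer repeat ... until loop
    idx = outer % N_MATRICES                 # host-side index calculation (placeholder)
    # Byte matrix -> float matrix, computed on the device.
    fmat = matrices[idx].astype(xp.float32) * 256.0
    for inner in range(3):                   # inner repeat ... until loop
        host_vec = np.ones(DIM, dtype=np.float32)   # vector computed on the host
        vec = xp.asarray(host_vec)                  # copy vector host -> device
        result = fmat @ vec                         # matrix-vector product on the device
        host_result = xp.asnumpy(result) if on_gpu else result  # device -> host
```

The same structure would work with other Python GPU stacks (e.g. Numba or PyCUDA), but CuPy keeps the host code closest to ordinary NumPy, which matches the preference for avoiding C/C++.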