I have divided the workload into 4096 threads (64x64), and each thread calls a kernel function. Inside the kernel function, 50 two-dimensional arrays are used; each is allocated in CPU memory, holds double-precision floating-point values, and has a size of 64x64. Could you help me calculate the GPU memory required for this setup?
Are you talking about CPU or GPU threads?
BTW: 64 GPU blocks with 64 GPU threads each would not be enough to fully occupy modern GPUs.
Normally CPU threads call kernel functions (except when using Dynamic Parallelism).
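For reference, a minimal host-side launch of such a 64x64 configuration could look like the sketch below (the kernel name, its body, and the allocation size are placeholders, not your actual code):

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel: each of the 64 * 64 = 4096 threads does some work.
__global__ void myKernel(double *data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // 0 .. 4095
    data[tid] = tid;                                  // placeholder work
}

int main()
{
    double *d_data = nullptr;
    cudaMalloc(&d_data, 4096 * sizeof(double));  // placeholder allocation
    myKernel<<<64, 64>>>(d_data);                // 64 blocks x 64 threads = 4096 threads, launched from a CPU thread
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```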
Per thread: 50 * 64 * 64 * 8 bytes ≈ 1.6 MBytes?
For all 64 * 64 = 4096 threads: 4096 * 1.6 MBytes ≈ 6.7 GBytes?
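To double-check the arithmetic, here is a quick back-of-the-envelope sketch in plain host code, assuming 50 arrays of 64x64 doubles per thread and 64x64 threads as described above:

```cpp
#include <cstdio>

int main()
{
    const size_t numArrays  = 50;
    const size_t arrayElems = 64 * 64;         // elements per 2D array
    const size_t elemBytes  = sizeof(double);  // 8 bytes
    const size_t numThreads = 64 * 64;         // 4096 threads

    const size_t perThread = numArrays * arrayElems * elemBytes;  // 1,638,400 bytes
    const size_t total     = perThread * numThreads;              // 6,710,886,400 bytes

    printf("Per thread: %zu bytes (~%.2f MB)\n", perThread, perThread / 1e6);
    printf("Total:      %zu bytes (~%.2f GB, %.2f GiB)\n",
           total, total / 1e9, total / (1024.0 * 1024.0 * 1024.0));
    return 0;
}
```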
What do you mean by saying the arrays in CPU memory are inside the kernel? That they are accessed there?
There are different ways to handle it if you do not have enough GPU memory.
Are the ~6.7 GBytes used as buffer memory, or are they needed for input and output?
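If it turns out that the data might not fit, one simple sanity check before allocating is to query the free device memory with cudaMemGetInfo and compare it against the required total, roughly like this sketch (the hard-coded required size is just the figure estimated above):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);   // free and total device memory in bytes

    // 50 arrays * 64*64 doubles per array * 4096 threads ≈ 6.7 GB
    const size_t required = 50ull * 64 * 64 * sizeof(double) * 64 * 64;

    printf("GPU memory: %zu bytes free of %zu bytes total\n", freeBytes, totalBytes);
    if (freeBytes < required)
        printf("Not enough free device memory; consider processing the work "
               "in chunks or streaming the data to the GPU.\n");
    return 0;
}
```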