I’m working with 4D data and would appreciate any suggestions on how best to parallelize the computation with CUDA. The first three dimensions are the x, y, z coordinates and the fourth is time (about 1000 time points). The data set is about 900 MB.

For each spatial location, I need to gather the surrounding neighbors (around 40-90 of them) and extract the time series of all those neighbors for pattern extraction. Each location can be processed independently, and so can each time point. I’m currently launching a parent kernel over the spatial locations, which then launches a child kernel over the time domain for each neighborhood, but the speedup is smaller than I expected. One reason might be the frequent memory access (the 900 MB is loaded into unified memory). My current code splits the 3D space into a series of small blocks in a for loop and runs the kernel on them one at a time, synchronizing after each iteration, to avoid overwhelming the GPU.

My questions are: Would pinned memory be a better choice than unified memory? Would creating multiple streams improve the speed by avoiding the synchronization after each loop iteration? Any other suggestions for the overall framework?
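To make the pinned-memory and stream questions concrete, here is a minimal sketch of the variant I am considering. It is not my actual code. It assumes a voxel-major layout (`data[voxel * nT + t]`), a single flat kernel with one thread block per voxel and threads striding over time points instead of the parent/child launch, and chunks that are self-contained (neighbor indices are local to the chunk, so halo handling is left out). The names `neighborhoodMean` and `runChunked`, the sizes, and the averaging step are all placeholders for the real pattern extraction.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            return -1;                                                     \
        }                                                                  \
    } while (0)

// Placeholder for the real pattern extraction: one block per voxel, threads
// stride over time points, each thread averages the neighbors at its time point.
__global__ void neighborhoodMean(const float* data, const int* nbr, float* out,
                                 int nVox, int nNbr, int nT)
{
    int v = blockIdx.x;
    if (v >= nVox) return;
    for (int t = threadIdx.x; t < nT; t += blockDim.x) {
        float s = 0.0f;
        for (int k = 0; k < nNbr; ++k)
            s += data[(size_t)nbr[v * nNbr + k] * nT + t];  // coalesced over t
        out[(size_t)v * nT + t] = s / nNbr;
    }
}

// Processes all chunks across several streams; returns the number of wrong
// output values (0 if everything matches) or -1 on a CUDA error.
int runChunked()
{
    const int nChunks = 8, voxPerChunk = 256, nNbr = 4, nT = 1000, nStreams = 2;
    const size_t chunkElems = (size_t)voxPerChunk * nT;
    const size_t chunkBytes = chunkElems * sizeof(float);

    // Pinned host buffers: required for cudaMemcpyAsync to be truly asynchronous.
    float *hIn = nullptr, *hOut = nullptr;
    CHECK(cudaMallocHost((void**)&hIn, nChunks * chunkBytes));
    CHECK(cudaMallocHost((void**)&hOut, nChunks * chunkBytes));
    for (size_t i = 0; i < nChunks * chunkElems; ++i) hIn[i] = 1.0f;

    // Toy neighbor table with chunk-local indices, shared by every chunk.
    std::vector<int> hNbr((size_t)voxPerChunk * nNbr);
    for (int v = 0; v < voxPerChunk; ++v)
        for (int k = 0; k < nNbr; ++k)
            hNbr[v * nNbr + k] = (v + k) % voxPerChunk;
    int* dNbr = nullptr;
    CHECK(cudaMalloc((void**)&dNbr, hNbr.size() * sizeof(int)));
    CHECK(cudaMemcpy(dNbr, hNbr.data(), hNbr.size() * sizeof(int),
                     cudaMemcpyHostToDevice));

    // One input and one output device buffer per stream.
    cudaStream_t streams[nStreams];
    float *dIn[nStreams], *dOut[nStreams];
    for (int s = 0; s < nStreams; ++s) {
        CHECK(cudaStreamCreate(&streams[s]));
        CHECK(cudaMalloc((void**)&dIn[s], chunkBytes));
        CHECK(cudaMalloc((void**)&dOut[s], chunkBytes));
    }

    // No synchronization inside the loop: work queued in one stream runs in
    // order, so reusing that stream's buffers for a later chunk is safe, while
    // copies and kernels in different streams are free to overlap.
    for (int c = 0; c < nChunks; ++c) {
        int s = c % nStreams;
        CHECK(cudaMemcpyAsync(dIn[s], hIn + c * chunkElems, chunkBytes,
                              cudaMemcpyHostToDevice, streams[s]));
        neighborhoodMean<<<voxPerChunk, 256, 0, streams[s]>>>(
            dIn[s], dNbr, dOut[s], voxPerChunk, nNbr, nT);
        CHECK(cudaMemcpyAsync(hOut + c * chunkElems, dOut[s], chunkBytes,
                              cudaMemcpyDeviceToHost, streams[s]));
    }
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());  // single sync after all chunks are queued

    // Every input is 1.0f, so every neighborhood mean should be exactly 1.0f.
    int bad = 0;
    for (size_t i = 0; i < nChunks * chunkElems; ++i)
        if (hOut[i] != 1.0f) ++bad;

    for (int s = 0; s < nStreams; ++s) {
        cudaFree(dIn[s]);
        cudaFree(dOut[s]);
        cudaStreamDestroy(streams[s]);
    }
    cudaFree(dNbr);
    cudaFreeHost(hIn);
    cudaFreeHost(hOut);
    return bad;
}

int main()
{
    int bad = runChunked();
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

This should build with plain `nvcc` on a single source file. Whether the copies and kernels actually overlap depends on the device and on the chunk size, so I would check the timeline in a profiler before drawing conclusions.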