Hello!
I’m a beginner in CUDA and new to this forum. (Sorry, my English is pretty poor.)
I’m looking for a concept like shared memory in CUDA, but backed by device memory.
My kernel needs a huge amount of memory to compute its result, something like 5000 doubles per thread. This memory is only used as scratch space while the thread runs; I don’t need the data stored in it afterwards.
Currently all this memory is allocated before the kernel call in one big cudaMalloc. But this way, the VRAM of my Titan X is almost full (about 11 of 12 GB), and I will soon need more space.
The easy answer is to call malloc on the device inside each thread, but that hits the execution time really hard.
So I was thinking about a way to allocate only the memory needed by the threads that are actually running at the same time. Exactly like shared memory, but handled in code, because the shared memory size per thread block is not enough.
Go from this:
size_t nb_total_thread = 1500000;
size_t nb_double_required_per_thread = 5000;
size_t nb_compute_double = nb_total_thread * nb_double_required_per_thread;
double *compute_double;
cudaMalloc((void **) &compute_double, nb_compute_double * sizeof(double));
to something like this:
size_t nb_total_thread_per_block = 1024;
size_t nb_double_required_per_thread = 5000;
size_t max_block_executed_at_the_same_time = 8; // placeholder value (I don't know yet how to get that info)
size_t nb_compute_double = nb_total_thread_per_block
* nb_double_required_per_thread * max_block_executed_at_the_same_time;
double *compute_double;
cudaMalloc((void **) &compute_double, nb_compute_double * sizeof(double));
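For the missing "blocks executed at the same time" value, the CUDA runtime has an occupancy query that might be used. The following is a minimal sketch, assuming the kernel is launched with no dynamic shared memory. The empty `my_kernel` body is a stand-in for the real kernel, and `max_resident_blocks` and `scratch_doubles` are hypothetical helper names introduced here for illustration.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Stand-in for the real kernel. The occupancy query needs the kernel symbol
// because the result depends on its register and shared memory usage.
__global__ void my_kernel(double *compute_double) { (void) compute_double; }

// Upper bound on the number of blocks of my_kernel that can be resident on
// the current device at the same time. Returns 0 if a CUDA call fails.
size_t max_resident_blocks(int nb_thread_per_block)
{
    int device = 0;
    int nb_sm = 0;
    int blocks_per_sm = 0;
    if (cudaGetDevice(&device) != cudaSuccess) return 0;
    if (cudaDeviceGetAttribute(&nb_sm, cudaDevAttrMultiProcessorCount,
                               device) != cudaSuccess) return 0;
    // Last argument: dynamic shared memory per block (assumed 0 here).
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, my_kernel, nb_thread_per_block, 0) != cudaSuccess)
        return 0;
    return (size_t) blocks_per_sm * (size_t) nb_sm;
}

// Number of doubles needed when only the resident threads own scratch space.
size_t scratch_doubles(size_t nb_resident_blocks, size_t nb_thread_per_block,
                       size_t nb_double_per_thread)
{
    return nb_resident_blocks * nb_thread_per_block * nb_double_per_thread;
}
```

With the placeholder values above (8 blocks, 1024 threads, 5000 doubles) this gives 40,960,000 doubles, about 328 MB. The query only bounds the buffer size: blockIdx.x alone does not say which slice of the buffer is free, so the kernel still needs some way to assign slices.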
I assume that all threads of a block have to finish their task before a new block starts.
Is this a viable solution? I don’t know yet how the current thread will find which part of the memory it can use, but I think it’s possible. My current kernel looks like this:
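__global__
void my_kernel(double *compute_double, size_t nb_double_required_per_thread)
{
    // size_t avoids overflow: index * 5000 does not fit in an int for 1500000 threads
    size_t index = threadIdx.x + (size_t) blockIdx.x * blockDim.x;
    // I will try to find a smarter index that does not overlap the doubles
    // already used by other running blocks
    // for now, each thread gets its own slice of nb_double_required_per_thread doubles
    double *thread_compute_double = compute_double + index * nb_double_required_per_thread;
    // do some stuff here with thread_compute_double, like storing
    // intermediate results and multiplying them
}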
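One way the "smart index" could work is a grid-stride loop: launch only as many blocks as the buffer has slices for, give each thread one fixed slice, and let each thread loop over several work items. This is a sketch of that swapped-in technique, not the original kernel. The names `my_kernel_strided`, `run_strided`, `results` and `nb_items`, as well as the placeholder computation, are hypothetical and only there to make the example complete.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

__global__ void my_kernel_strided(double *compute_double, double *results,
                                  size_t nb_items, size_t nb_double_per_thread)
{
    size_t tid = threadIdx.x + (size_t) blockIdx.x * blockDim.x;
    size_t nb_threads = (size_t) gridDim.x * blockDim.x;
    // Fixed slice per launched thread: no overlap, since every launched
    // thread has a distinct tid and the buffer holds nb_threads slices.
    double *thread_compute_double = compute_double + tid * nb_double_per_thread;
    // Each thread handles items tid, tid + nb_threads, ... and reuses its slice.
    for (size_t item = tid; item < nb_items; item += nb_threads) {
        // Placeholder work: fill the scratch slice, then reduce it.
        for (size_t k = 0; k < nb_double_per_thread; ++k)
            thread_compute_double[k] = (double) (item + k);
        double sum = 0.0;
        for (size_t k = 0; k < nb_double_per_thread; ++k)
            sum += thread_compute_double[k];
        results[item] = sum;
    }
}

// Host wrapper: the scratch buffer is sized for the launched threads only,
// not for nb_items. Returns an empty vector if a CUDA call fails.
std::vector<double> run_strided(size_t nb_items, size_t nb_double_per_thread,
                                int nb_blocks, int nb_thread_per_block)
{
    std::vector<double> host_results(nb_items, 0.0);
    double *compute_double = nullptr;
    double *results = nullptr;
    size_t nb_threads = (size_t) nb_blocks * (size_t) nb_thread_per_block;
    if (cudaMalloc((void **) &compute_double,
                   nb_threads * nb_double_per_thread * sizeof(double)) != cudaSuccess)
        return {};
    if (cudaMalloc((void **) &results, nb_items * sizeof(double)) != cudaSuccess) {
        cudaFree(compute_double);
        return {};
    }
    my_kernel_strided<<<nb_blocks, nb_thread_per_block>>>(
        compute_double, results, nb_items, nb_double_per_thread);
    // cudaMemcpy waits for the kernel and reports launch or execution errors.
    cudaError_t err = cudaMemcpy(host_results.data(), results,
                                 nb_items * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(results);
    cudaFree(compute_double);
    if (err != cudaSuccess) return {};
    return host_results;
}
```

For example, `run_strided(1000, 4, 2, 64)` processes 1000 items with scratch space for only 128 threads. With this design the buffer size depends on the grid size, not on the number of items. An interleaved layout, where element k of thread tid lives at compute_double[k * nb_threads + tid], may give better memory coalescing, but that is not measured here.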
I don’t know if I’m being very clear.