Hi,
I have been writing CUDA kernels for about 2 years and have gotten used to the parallel programming style.
Now that I have to implement a custom memory allocator on the CPU side, I am thinking of using the same style.
What do I mean? Instead of writing sequential code, I emulate parallel programming, for example:
uint bitm_malloc(nm_pool_t *pptr, uint *sizes, uint num_sizes, uint *indexes, uint *errors) {
    /* ... */
    /* per-"thread" working arrays, analogous to shared memory in CUDA */
    uint s_top_level[BITM_NUM_THREADS], u_top_level[BITM_NUM_THREADS];
    uint zeros[BITM_NUM_THREADS];
    uint result[BITM_NUM_THREADS];
    uint i;
    /* ... */
    for (i = 0; i < num_sizes; i++) {
        errors[i] = 0;
        zeros[i] = __builtin_clz(sizes[i]);       /* leading zeros of the requested size */
        s_top_level[i] = pptr->top_level;         /* top level available in the pool */
        u_top_level[i] = 32 - zeros[i];           /* level required for this request */
        if (u_top_level[i] > s_top_level[i]) {
            errors[i] = BITM_ERR_ELTSIZE_OVERFLOW;
        }
    }
The above code emulates the GPU programming style. The caller specifies how many memory blocks it wants allocated in the sizes array, and the addresses are returned in the indexes array. The result array will hold the return code for each request, indicating whether it succeeded or not. Whatever is inside the for {} body is supposed to execute in 'parallel' (maybe I will unroll it later). I am declaring arrays the way I would declare shared memory in CUDA; they should sit in the L1 cache by default. The number of 'threads', i.e. the number of 'simultaneous' requests that will be processed, is defined like this:
#define BITM_NUM_THREADS 16
// chosen to match a 64-byte cache line (16 * sizeof(uint) = 64)
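To show the call pattern I have in mind, here is a rough usage sketch. The pool setup call (bitm_pool_init) is only a placeholder name for whatever init function I end up with, and the request sizes are arbitrary:

    /* rough usage sketch -- bitm_pool_init is only a placeholder name */
    uint num = 4;                               /* requests in this batch (up to BITM_NUM_THREADS) */
    uint sizes[BITM_NUM_THREADS] = { 64, 128, 32, 256 };
    uint indexes[BITM_NUM_THREADS];
    uint errors[BITM_NUM_THREADS];

    nm_pool_t pool;
    bitm_pool_init(&pool);                      /* hypothetical init, not the real API */

    bitm_malloc(&pool, sizes, num, indexes, errors);

    for (uint i = 0; i < num; i++) {
        if (errors[i] == 0) {
            /* indexes[i] refers to the block allocated for request i */
        }
    }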
I had the idea and I am now testing it, but my question (before I write too much code) is: will I actually get faster execution on the CPU with this programming style, given that a CPU core executes the instructions sequentially? Does anyone use a parallel programming style on the CPU side? If so, what speed-up did you achieve?
Thanks in advance.