Hi Brano, thanks for your comments. You did not misunderstand the problem :). Thanks for the tip about the parallel prefix scan, I will take a look at it. We managed to collapse all the nested for loops into a single one; see below the piece of code (still the C version) that we are going to use.
const long TA = SIZE_J * SIZE_K * SIZE_L;
const long TB = SIZE_K * SIZE_L;

// One flat loop over all 100,000,000 combinations: the linear index l is
// decomposed into one index per array (plain integer division, so ceil() is not needed).
for (long l = 0; l < 100000000; l++)
{
    dif = PG_H(NO, fprS[(l / TA) % fprS_SIZE],
               K,  fprV[(l / TB) % SIZE_J],
               T,  fprJ[(l / SIZE_L) % SIZE_K],
                   fprC[l % SIZE_L]) * TP - MV;
    if (dif < min)
    {
        min = dif;
    }
}
Now we are working on the kernel invocation, and we believe it needs to be something like the following (pseudocode), working with chunks of data from our arrays.
for (each arraysChunk)
{
    kernel<<<22, 512>>>(..., arraysChunk, some_offset, ...);

    // advance the offset by the amount of work done in this launch
    some_offset = some_offset + chunk_size;

    // move on to the next chunk of the arrays
    arraysChunk += chunk_size;
}
Since we are using 22 x 512 = 11,264 threads per launch, we will need roughly 9,000 host loop iterations (100,000,000 / 11,264) to complete our 100,000,000 pricing calculations per client.
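On the kernel side, we imagine each thread deriving its global work item from the launch offset, roughly like this (only a sketch; the names are placeholders and computeOneDif stands for the real pricing computation):

__global__ void kernel(double *results, const long some_offset, const long total_work)
{
    // global work index of this thread within the whole 100,000,000-item job
    long gid = some_offset + blockIdx.x * blockDim.x + threadIdx.x;

    if (gid < total_work)
        results[gid - some_offset] = computeOneDif(gid);   // hypothetical helper
}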
We will run some tests and experiment with different configurations. Thank you.
You probably want to use more blocks, and perform work on larger sets of data.
Also, you will have to upload the data to the GPU in each iteration before invoking the kernel.
Looking at your pseudocode, be aware that modulo operations and divisions are expensive on a GPU. If you are calculating modulo with a number that is a power of 2, you can replace it with a bitwise AND, and a division by a power of 2 can be replaced with a right shift.
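For example, when the divisor is a power of 2:

unsigned int i = 1000;

unsigned int a = i % 256;   // modulo the expensive way
unsigned int b = i & 255;   // same result with a cheap bitwise AND (255 = 256 - 1)

unsigned int c = i / 256;   // division the expensive way
unsigned int d = i >> 8;    // same result with a cheap right shift (256 = 2^8)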
Another possibility to generate a unique index for each thread is to think of the 32-bit unsigned int thread index as several concatenated indices. Then you can just mask out the relevant part so that each thread knows where to do its own lookup.
Example: thread idx would be responsible for fprS[(idx>>24)&255], K, fprV[(idx>>16)&255], fprJ[(idx>>8)&255], fprC[idx&255], and the output index for dif is idx. I have not tried this out but I guess that it should work.
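In kernel code that could look something like this (untested sketch; I am assuming PG_H is callable as a __device__ function and guessing the parameter types):

__global__ void priceKernel(double *dif,
                            const double *fprS, const double *fprV,
                            const double *fprJ, const double *fprC,
                            double NO, double K, double T, double TP, double MV)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Treat the 32-bit thread index as four concatenated 8-bit indices,
    // one per input array (assumes each array has 256 elements; with fewer
    // elements a bounds check or smaller masks would be needed).
    unsigned int iS = (idx >> 24) & 255;
    unsigned int iV = (idx >> 16) & 255;
    unsigned int iJ = (idx >>  8) & 255;
    unsigned int iC =  idx        & 255;

    dif[idx] = PG_H(NO, fprS[iS], K, fprV[iV], T, fprJ[iJ], fprC[iC]) * TP - MV;
}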
Hi Brano, we found a better way to go through all possible combinations among the elements of our data structures (recall the data structures described earlier).
We just used the CUDA built-in variables to do the job, and besides that, our results now match the CPU's. Our best time so far with the PG_D device function is 100 million calculations (four data structures of 100 elements each) in less than one second, and one billion calculations (one data structure with 1000 elements and three others with 100 elements each) in 30 seconds.
See below the C code that calls our kernel.
#define MAXARRAYSIZE 50000000
...
// 1000 x 100 blocks of 100 x 5 threads = 50,000,000 threads per launch
dim3 dimGrid(1000, 100, 1);
dim3 dimBlock(100, 5, 1);
for (unsigned short int k = 0; k < 20; k++)
{
    // calling the CUDA kernel <<<number of blocks, threads per block>>>
    // on the next block of data (the offset 5 * k selects the current chunk)
    CRD<<<dimGrid, dimBlock>>>(devPtrR, devPtrS, devPtrJ, devPtrV, devPtrC, NO, K, T, TP, MV, 5 * k);

    // copying memory back from device to host
    cudaMemcpy(C, devPtrR, MEMSIZER, cudaMemcpyDeviceToHost);

    // get the minimum of this chunk on the CPU
    for (long z = 0; z < MAXARRAYSIZE; z++)
    {
        if (C[z] < minAux)
            minAux = C[z];
    }
}
The “magic” lies in the kernel code, where the CUDA built-in variables (blockIdx and threadIdx) select the element of each data structure that a thread works on.
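Roughly, the idea is the following (just an illustrative sketch of how the indices could map, not our exact kernel code):

// Sketch only -- the index mapping here is an illustration derived from the launch above.
__global__ void CRD(double *R, const double *S, const double *J,
                    const double *V, const double *C,
                    double NO, double K, double T, double TP, double MV,
                    int offset)
{
    int iS = blockIdx.x;            // 0..999: the 1000-element structure
    int iV = blockIdx.y;            // 0..99
    int iJ = threadIdx.x;           // 0..99
    int iC = threadIdx.y + offset;  // 0..4 in this launch, shifted by 5 * k on the host

    // 1000 x 100 blocks of 100 x 5 threads = 50,000,000 results per launch
    long out = ((long)(blockIdx.x * gridDim.y + blockIdx.y) * blockDim.x + threadIdx.x)
               * blockDim.y + threadIdx.y;

    R[out] = PG_D(NO, S[iS], K, V[iV], T, J[iJ], C[iC]) * TP - MV;
}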
We are very satisfied with the execution times, but we will try to improve them even further with streams, pinned memory, multiple GPUs, parallel prefix scan and everything else we have at our disposal.
If we really manage to get parallel prefix scan working, we can certainly eliminate the data structure C that holds the results for each thread.
We do have one doubt: we are facing a very slow allocation on the GPU at our first call to cudaMalloc. This first call is taking almost 90% of our execution time. Is there anything we can do to improve the execution time of this first call?
I guess I will give the nvidia-smi utility a try and see if that fixes the performance problem with the first cudaMalloc.
Exclusive compute mode. You can do that by: “nvidia-smi -g 0 -c 1” where -g denotes the ID of the GPU (so you probably have to do “nvidia-smi -g 1 -c 1”, too) and -c specifies the compute mode: 1 means exclusive compute mode (0 is the default). Test it by using: “nvidia-smi -s” → All GPUs and their compute mode number are listed.
Be careful to read/write data in global memory in a coalesced way, by letting threads in the x-dimension access data close to each other.
For the output C you could swap the places of threadIdx.x and threadIdx.y.
Also, the block configuration is not optimal. It would be better to have the x-dimension be a multiple of 16 (a half warp) and the total number of threads per block a multiple of 16 as well.
This makes it a little more complicated for your solution, but the one I suggested above would not have these conflicts.
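For example, a block of dim3 dimBlock(128, 4, 1) keeps the x-dimension a multiple of 16, and an output index like the one in this sketch keeps neighbouring threads writing to neighbouring addresses:

// Sketch: 128 x 4 = 512 threads per block, x-dimension a multiple of a half warp.
__global__ void writeCoalesced(double *out)
{
    // Consecutive threadIdx.x values write consecutive addresses -> coalesced store.
    int i = (blockIdx.x * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;
    out[i] = 0.0;   // placeholder for the real result
}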
If you manage to use cudpp to find the min on the GPU, you will gain performance because you don't have to copy a huge amount of data back from the GPU to the CPU and run a loop on the CPU to find the min.
Guys, we are going to have a lot of trouble working with data structures of size 10000, 100, 100 and 100 (this will be one of our test cases), because the C array would need 80 GB of allocated doubles, and then we would have to walk over that whole array to find the min value.
I was talking to my team and maybe we will need some kind of atomicMin() that works with double values, because if we try to use parallel prefix scan with min, we still need to create the 80 GB array for the scenario I mentioned. But if we manage to do the atomicMin on double, we can eliminate the C array inside our kernel, so I have a question.
This topic [topic=199402]atomicMin on Char[/topic] is very interesting, as Tera shows that it is possible to build any atomic function on types up to 32 bits, so:
Is it possible to implement an atomicMin() for a 64-bit type, in our case double values? Thanks.
Yes, of course you can build an atomicMin on double in global memory if your device is compute capability 1.2 or higher.
I haven't followed this topic closely though, so I'm not sure what you want to use it for. In almost all cases a reduction scheme should work better than using atomic operations, particularly if they have to be built in software.
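By a reduction scheme I mean something along these lines (just a sketch; it assumes 256 threads per block, and doubles require compute capability 1.3):

#include <float.h>

// Block-wise min reduction: each block produces one partial minimum.
__global__ void blockMin(const double *in, double *blockMins, int n)
{
    __shared__ double smem[256];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Load one element per thread, padding past the end with the largest double.
    smem[tid] = (gid < n) ? in[gid] : DBL_MAX;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (tid < stride)
            smem[tid] = fmin(smem[tid], smem[tid + stride]);
        __syncthreads();
    }

    // Thread 0 writes this block's minimum; the per-block results are then
    // reduced in a second small pass (or copied back and finished on the CPU).
    if (tid == 0)
        blockMins[blockIdx.x] = smem[0];
}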
Hi Tera, thanks for your remark once again. We need to use this atomicMin on double instead of the cudppScan function because, with cudppScan, we need to keep an array of size X and use it to feed the cudppScan function; am I right about that? This array of size X is too big to fit on any card, so we really need to eliminate it from our kernel.
So we thought it would be better to use atomicMin on double (we made some tests with the int version, which worked just fine) to do the “find the min value” part of our algorithm without creating the array of size X.
Can you point us in the right direction so we can build our own atomicMin that works on doubles? Thank you very much.
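The usual way to build it is to reinterpret the bits of the double as a 64-bit integer and loop on atomicCAS (a minimal sketch; it needs compute capability 1.2 or higher for 64-bit atomicCAS in global memory):

__device__ double atomicMinDouble(double *address, double val)
{
    unsigned long long int *addr = (unsigned long long int *)address;
    unsigned long long int old = *addr, assumed;

    do
    {
        assumed = old;
        if (__longlong_as_double(assumed) <= val)
            break;   // stored value is already smaller or equal, nothing to do
        // Reinterpreting the bits of the double is exact, so no precision is lost.
        old = atomicCAS(addr, assumed, __double_as_longlong(val));
    } while (assumed != old);

    return __longlong_as_double(old);
}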
Atomic operations will kill the performance. You can see the problem with atomics as “trying to fill a bathtub with water through a straw”, where your goal is to fill the bathtub as fast as you can :D
I understand your concern about the huge memory issue. But you can think of the problem as subsets of work you do on the GPU.
So on the CPU side you loop over subsets of work, each of which generates ~1 GB of data in the C output, and you use cudpp to find the min of that subset. Then you transfer this single value back to the CPU and continue with the next iteration. In total you will process 80 GB of data in 80 iterations (if using 1 GB of storage for C), send back only 80 doubles of min values to the CPU (one per iteration), and accumulate the min in each iteration to get the final result.
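In rough pseudo-code (all the names here are placeholders):

double globalMin = DBL_MAX;   // needs <float.h>

for (int chunk = 0; chunk < numChunks; chunk++)
{
    // 1. Compute one chunk of dif values into d_C (about 1 GB of doubles).
    CRD<<<dimGrid, dimBlock>>>(d_C, ..., offsetForChunk(chunk));

    // 2. Find the minimum of d_C on the GPU (cudpp or your own reduction).
    double chunkMin = gpuMin(d_C, elementsPerChunk);   // hypothetical helper

    // 3. Only a single double is copied back per chunk; accumulate on the CPU.
    if (chunkMin < globalMin)
        globalMin = chunkMin;
}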
This isn't a real problem, even if it takes minutes. You need to divide your program into 3 parts: 1) initialization (e.g. allocating buffers), 2) processing, 3) shutdown (e.g. deallocating buffers).
In stage 1 you do all the allocations needed for any later processing. This is considered “offline”, so the cost of the slow first cudaMalloc is paid once, before the processing stage you actually care about.
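In outline it would look something like this (just a sketch, reusing the names from the code posted above):

// 1. Initialization ("offline"): pay the first-cudaMalloc cost once, up front.
cudaMalloc((void **)&devPtrR, MEMSIZER);
// ... allocate the other device buffers and upload the input data ...

// 2. Processing: only kernel launches and copies, no allocations inside the loop.
for (unsigned short int k = 0; k < 20; k++)
{
    CRD<<<dimGrid, dimBlock>>>(devPtrR, devPtrS, devPtrJ, devPtrV, devPtrC, NO, K, T, TP, MV, 5 * k);
    cudaMemcpy(C, devPtrR, MEMSIZER, cudaMemcpyDeviceToHost);
    // ... find the minimum of this chunk ...
}

// 3. Shutdown: release everything.
cudaFree(devPtrR);
// ... free the other device buffers ...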
Hi Tera, thank you so much, this is exactly what I was trying to do. I will test it later today and then come back to tell you the results. I'm a little concerned about the “__double_as_longlong” intrinsic and possibly losing precision, but our tests will answer that too. Thank you so much for your help.
Hi Brano, thanks once again. When we did some tests with atomicMin for int, we did indeed lose some performance, something like 2 seconds (which in our case is a lot); without the atomic operation our time is less than 1 second. On the other hand, with atomic operations we do not need to keep a huge amount of memory, so to be honest we do not know yet which way we are going to choose; we will run a lot of tests and see what is best for our program. Thanks for your remarks, we will take them into consideration.
Hi Jimmy, thanks for your concern. The problem is that in our case the time spent in that first allocation is a real issue, so we really need to do something about it. If nothing can be done, we will have to live with it the way it is. Thanks.