Hi Brano, thanks for your comments. You did not misunderstand the problem :). Thanks for the tip about the parallel prefix scan, I will take a look at it. We managed to collapse all the nested for loops into a single one; see below the piece of code (still the C version) that we are going to use.
const long TA = SIZE_J * SIZE_K * SIZE_L;
const long TB = SIZE_K * SIZE_L;

// One flat loop over all 100,000,000 combinations: the linear index l is
// decomposed into one index per array (plain integer division, so ceil() is not needed).
for (long l = 0; l < 100000000; l++)
{
    dif = PG_H(NO, fprS[(l / TA) % fprS_SIZE],
               K,  fprV[(l / TB) % SIZE_J],
               T,  fprJ[(l / SIZE_L) % SIZE_K],
                   fprC[l % SIZE_L]) * TP - MV;
    if (dif < min)
    {
        min = dif;
    }
}
Now we are working on the kernel invocation, and we believe it needs to be something like the following (pseudocode), working with chunks of data from our arrays.
for (each arraysChunk)
{
    kernel<<<22, 512>>>(..., arraysChunk, some_offset, ...);

    // advance the offset by the amount of work done in this launch
    some_offset = some_offset + chunk_size;

    // move on to the next chunk of the arrays
    arraysChunk += chunk_size;
}
Since we are using 22 x 512 = 11,264 threads per launch, we will need roughly 9,000 host loop iterations (100,000,000 / 11,264) to complete our 100,000,000 pricing calculations per client.
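On the kernel side, we imagine each thread deriving its global work item from the launch offset, roughly like this (only a sketch; the names are placeholders and computeOneDif stands for the real pricing computation):

__global__ void kernel(double *results, const long some_offset, const long total_work)
{
    // global work index of this thread within the whole 100,000,000-item job
    long gid = some_offset + blockIdx.x * blockDim.x + threadIdx.x;

    if (gid < total_work)
        results[gid - some_offset] = computeOneDif(gid);   // hypothetical helper
}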
We will run some tests and experiment with different configurations. Thank you.
You probably want to use more blocks, and perform work on larger sets of data.
Also, you will have to upload the data to the GPU in each iteration before invoking the kernel.
Looking at your pseudocode, be aware that modulo operations and divisions are expensive on a GPU. If you are calculating modulo with a number that is a power of 2, you can replace it with a bitwise AND, and a division by a power of 2 can be replaced with a right shift.
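For example, when the divisor is a power of 2:

unsigned int i = 1000;

unsigned int a = i % 256;   // modulo the expensive way
unsigned int b = i & 255;   // same result with a cheap bitwise AND (255 = 256 - 1)

unsigned int c = i / 256;   // division the expensive way
unsigned int d = i >> 8;    // same result with a cheap right shift (256 = 2^8)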
Another possibility to generate a unique index for each thread is to think of the 32-bit unsigned int thread index as several concatenated indices. Then you can just mask out the relevant part so that each thread knows where to do its own lookup.
Example: thread idx would be responsible for fprS[(idx>>24)&255], K, fprV[(idx>>16)&255], fprJ[(idx>>8)&255], fprC[idx&255], and the output index for dif is idx. I have not tried this out but I guess that it should work.
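In kernel code that could look something like this (untested sketch; I am assuming PG_H is callable as a __device__ function and guessing the parameter types):

__global__ void priceKernel(double *dif,
                            const double *fprS, const double *fprV,
                            const double *fprJ, const double *fprC,
                            double NO, double K, double T, double TP, double MV)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Treat the 32-bit thread index as four concatenated 8-bit indices,
    // one per input array (assumes each array has 256 elements; with fewer
    // elements a bounds check or smaller masks would be needed).
    unsigned int iS = (idx >> 24) & 255;
    unsigned int iV = (idx >> 16) & 255;
    unsigned int iJ = (idx >>  8) & 255;
    unsigned int iC =  idx        & 255;

    dif[idx] = PG_H(NO, fprS[iS], K, fprV[iV], T, fprJ[iJ], fprC[iC]) * TP - MV;
}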
Hi Brano, we found a better way to go through all possible combinations among the elements of our data structures (recall the data structures described earlier).
We just used the CUDA built-in variables to do the job, and besides that, our results now match the CPU's. Our best time so far with the PG_D device function is 100 million calculations (four data structures of 100 elements each) in less than one second, and one billion calculations (one data structure with 1000 elements and three others with 100 elements each) in 30 seconds.
See below the C code that calls our kernel.
#define MAXARRAYSIZE 50000000
...
// 1000 x 100 blocks of 100 x 5 threads = 50,000,000 threads per launch
dim3 dimGrid(1000, 100, 1);
dim3 dimBlock(100, 5, 1);
for (unsigned short int k = 0; k < 20; k++)
{
    // calling the CUDA kernel <<<number of blocks, threads per block>>>
    // on the next block of data (the offset 5 * k selects the current chunk)
    CRD<<<dimGrid, dimBlock>>>(devPtrR, devPtrS, devPtrJ, devPtrV, devPtrC, NO, K, T, TP, MV, 5 * k);

    // copying memory back from device to host
    cudaMemcpy(C, devPtrR, MEMSIZER, cudaMemcpyDeviceToHost);

    // get the minimum of this chunk on the CPU
    for (long z = 0; z < MAXARRAYSIZE; z++)
    {
        if (C[z] < minAux)
            minAux = C[z];
    }
}
The “magic” lies in the kernel code, where the CUDA built-in variables (blockIdx and threadIdx) select the element of each data structure that a thread works on.
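Roughly, the idea is the following (just an illustrative sketch of how the indices could map, not our exact kernel code):

// Sketch only -- the index mapping here is an illustration derived from the launch above.
__global__ void CRD(double *R, const double *S, const double *J,
                    const double *V, const double *C,
                    double NO, double K, double T, double TP, double MV,
                    int offset)
{
    int iS = blockIdx.x;            // 0..999: the 1000-element structure
    int iV = blockIdx.y;            // 0..99
    int iJ = threadIdx.x;           // 0..99
    int iC = threadIdx.y + offset;  // 0..4 in this launch, shifted by 5 * k on the host

    // 1000 x 100 blocks of 100 x 5 threads = 50,000,000 results per launch
    long out = ((long)(blockIdx.x * gridDim.y + blockIdx.y) * blockDim.x + threadIdx.x)
               * blockDim.y + threadIdx.y;

    R[out] = PG_D(NO, S[iS], K, V[iV], T, J[iJ], C[iC]) * TP - MV;
}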
We are very satisfied with the execution times, but we will try to improve them even further with streams, pinned memory, multiple GPUs, parallel prefix scan and everything else we have at our disposal.
If we really manage to get parallel prefix scan working, we can certainly eliminate the data structure C that holds the results for each thread.
We do have one doubt: we are facing a very slow allocation on the GPU at our first call to cudaMalloc. This first call is taking almost 90% of our execution time. Is there anything we can do to improve the execution time of this first call?
I guess I will give the nvidia-smi utility a try and see if that fixes the performance problem with the first cudaMalloc.
Exclusive compute mode. You can do that by: “nvidia-smi -g 0 -c 1” where -g denotes the ID of the GPU (so you probably have to do “nvidia-smi -g 1 -c 1”, too) and -c specifies the compute mode: 1 means exclusive compute mode (0 is the default). Test it by using: “nvidia-smi -s” → All GPUs and their compute mode number are listed.
Be careful to read/write data in global memory in a coalesced way, by letting threads in the x-dimension access data close to each other.
For the output C you could swap the places of threadIdx.x and threadIdx.y.
Also, the block configuration is not optimal. It would be better to have the x-dimension be a multiple of 16 (a half warp) and the total number of threads per block a multiple of 16 as well.
This makes it a little more complicated for your solution, but the one I suggested above would not have these conflicts.
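For example, a block of dim3 dimBlock(128, 4, 1) keeps the x-dimension a multiple of 16, and an output index like the one in this sketch keeps neighbouring threads writing to neighbouring addresses:

// Sketch: 128 x 4 = 512 threads per block, x-dimension a multiple of a half warp.
__global__ void writeCoalesced(double *out)
{
    // Consecutive threadIdx.x values write consecutive addresses -> coalesced store.
    int i = (blockIdx.x * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;
    out[i] = 0.0;   // placeholder for the real result
}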
If you manage to use cudpp to find the min on the GPU, you will gain performance because you don't have to copy a huge amount of data back from the GPU to the CPU and run a loop on the CPU to find the min.
Guys, we are going to have a lot of trouble working with data structures of size 10000, 100, 100 and 100 (this will be one of our test cases), because the C array would need 80 GB of allocated doubles, and then we would have to walk over that whole array to find the min value.
I was talking to my team and maybe we will need some kind of atomicMin() that works with double values, because if we try to use parallel prefix scan with min, we still need to create the 80 GB array for the scenario I mentioned. But if we manage to do the atomicMin on double, we can eliminate the C array inside our kernel, so I have a question.
This topic [topic=199402]atomicMin on Char[/topic] is very interesting, as Tera shows that it is possible to build any atomic function on types up to 32 bits, so:
Is it possible to implement an atomicMin() for a 64-bit type, in our case double values? Thanks.
Yes, of course you can build an atomicMin on double in global memory if your device is compute capability 1.2 or higher.
I haven't followed this topic closely though, so I'm not sure what you want to use it for. In almost all cases a reduction scheme should work better than using atomic operations, particularly if they have to be built in software.
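By a reduction scheme I mean something along these lines (just a sketch; it assumes 256 threads per block, and doubles require compute capability 1.3):

#include <float.h>

// Block-wise min reduction: each block produces one partial minimum.
__global__ void blockMin(const double *in, double *blockMins, int n)
{
    __shared__ double smem[256];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Load one element per thread, padding past the end with the largest double.
    smem[tid] = (gid < n) ? in[gid] : DBL_MAX;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (tid < stride)
            smem[tid] = fmin(smem[tid], smem[tid + stride]);
        __syncthreads();
    }

    // Thread 0 writes this block's minimum; the per-block results are then
    // reduced in a second small pass (or copied back and finished on the CPU).
    if (tid == 0)
        blockMins[blockIdx.x] = smem[0];
}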
Hi Tera, thanks for your remark once again. We need to use this atomicMin on double instead of the cudppScan function because, with cudppScan, we need to keep an array of size X and use it to feed the cudppScan function; am I right about that? This array of size X is too big to fit on any card, so we really need to eliminate it from our kernel.
So we thought it would be better to use atomicMin on double (we made some tests with the int version, which worked just fine) to do the “find the min value” part of our algorithm without creating the array of size X.
Can you point us in the right direction so we can build our own atomicMin that works on doubles? Thank you very much.
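The usual way to build it is to reinterpret the bits of the double as a 64-bit integer and loop on atomicCAS (a minimal sketch; it needs compute capability 1.2 or higher for 64-bit atomicCAS in global memory):

__device__ double atomicMinDouble(double *address, double val)
{
    unsigned long long int *addr = (unsigned long long int *)address;
    unsigned long long int old = *addr, assumed;

    do
    {
        assumed = old;
        if (__longlong_as_double(assumed) <= val)
            break;   // stored value is already smaller or equal, nothing to do
        // Reinterpreting the bits of the double is exact, so no precision is lost.
        old = atomicCAS(addr, assumed, __double_as_longlong(val));
    } while (assumed != old);

    return __longlong_as_double(old);
}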
Atomic operations will kill the performance. You can see the problem with atomics as “trying to fill a bathtub with water through a straw”, where your goal is to fill the bathtub as fast as you can :D
I understand your concern about the huge memory issue. But you can think of the problem as subsets of work you do on the GPU.
So on the CPU side you loop over subsets of work, each of which generates ~1 GB of data in the C output, and you use cudpp to find the min of that subset. Then you transfer this single value back to the CPU and continue with the next iteration. In total you will process 80 GB of data in 80 iterations (if using 1 GB of storage for C), send back only 80 doubles of min values to the CPU (one per iteration), and accumulate the min in each iteration to get the final result.
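In rough pseudo-code (all the names here are placeholders):

double globalMin = DBL_MAX;   // needs <float.h>

for (int chunk = 0; chunk < numChunks; chunk++)
{
    // 1. Compute one chunk of dif values into d_C (about 1 GB of doubles).
    CRD<<<dimGrid, dimBlock>>>(d_C, ..., offsetForChunk(chunk));

    // 2. Find the minimum of d_C on the GPU (cudpp or your own reduction).
    double chunkMin = gpuMin(d_C, elementsPerChunk);   // hypothetical helper

    // 3. Only a single double is copied back per chunk; accumulate on the CPU.
    if (chunkMin < globalMin)
        globalMin = chunkMin;
}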
This isn't a real problem, even if it takes minutes. You need to divide your program into 3 parts: 1) initialization (e.g. allocating buffers), 2) processing, 3) shutdown (e.g. deallocating buffers).
In stage 1 you do all the allocations needed for any later processing. This is considered “offline”, so the cost of the slow first cudaMalloc is paid once, before the processing stage you actually care about.
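In outline it would look something like this (just a sketch, reusing the names from the code posted above):

// 1. Initialization ("offline"): pay the first-cudaMalloc cost once, up front.
cudaMalloc((void **)&devPtrR, MEMSIZER);
// ... allocate the other device buffers and upload the input data ...

// 2. Processing: only kernel launches and copies, no allocations inside the loop.
for (unsigned short int k = 0; k < 20; k++)
{
    CRD<<<dimGrid, dimBlock>>>(devPtrR, devPtrS, devPtrJ, devPtrV, devPtrC, NO, K, T, TP, MV, 5 * k);
    cudaMemcpy(C, devPtrR, MEMSIZER, cudaMemcpyDeviceToHost);
    // ... find the minimum of this chunk ...
}

// 3. Shutdown: release everything.
cudaFree(devPtrR);
// ... free the other device buffers ...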
Hi Tera, thank you so much, this is exactly what I was trying to do. I will test it later today and then come back to tell you the results. I'm a little concerned about the “__double_as_longlong” intrinsic and possibly losing precision, but our tests will answer that too. Thank you so much for your help.
Hi Brano, thanks once again. When we did some tests with atomicMin for int, we did indeed lose some performance, something like 2 seconds (which in our case is a lot); without the atomic operation our time is less than 1 second. On the other hand, with atomic operations we do not need to keep a huge amount of memory, so to be honest we do not know yet which way we are going to choose; we will run a lot of tests and see what is best for our program. Thanks for your remarks, we will take them into consideration.
Hi Jimmy, thanks for your concern. The problem is that in our case the time spent in that first allocation is a real issue, so we really need to do something about it. If nothing can be done, we will have to live with it the way it is. Thanks.