Parallelization, thread scheduling: misunderstanding of thread scheduling

Hi,

I’ve been trying to solve this issue for weeks, and since I don’t understand how this works, I hoped someone here could help me.

Here is the source code I want to parallelize. What I want to do is quite simple: I want to remove the for loop over the integer k and replace it with parallelization inside SimulatePath.

Currently it is not parallelized, since I can’t figure out how threads are scheduled.

Simply replacing the loop index k with a thread number does not seem to work, and neither does a group number. I’ve checked the NVIDIA samples too, and I still can’t understand why it’s not working. In fact, I don’t understand how threads are scheduled.

__global__ void SimulatePath(TOptionPlan plan, int k)
{
	float result;

	result = plan.p0 * exp(plan.m_A0 + plan.m_B0 * plan.d_Samples[k*plan.pathN]);
	for(int i = 2 + k * plan.pathN; i <= plan.pathN + k * plan.pathN; i ++){
		result = result * exp( plan.m_A + plan.m_B * plan.d_Samples[(i-1)] );
		plan.d_Buffer[k].Expected = result;
	}
}

for(int k = 0; k < plan.optionCount; k++)
{
	SimulatePath<<<1,1,0>>>(plan, k);
}

I must be missing a really important point, but I can’t figure out what it is. It should be easy :(

Thanks a lot for your help. I’ve been trying to solve this for weeks…

Assuming plan.optionCount <= 512:

__global__ void SimulatePath(TOptionPlan plan)
{
	int k = threadIdx.x;
	float result;

	result = plan.p0 * exp(plan.m_A0 + plan.m_B0 * plan.d_Samples[k*plan.pathN]);
	for(int i = 2 + k * plan.pathN; i <= plan.pathN + k * plan.pathN; i ++){
		result = result * exp( plan.m_A + plan.m_B * plan.d_Samples[(i-1)] );
	}
	plan.d_Buffer[k].Expected = result;
}

{
	SimulatePath<<<1, plan.optionCount, 0>>>(plan);
}
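
To see which samples each thread consumes, the kernel's loop bounds can be replayed on the host. The following is only a sketch in plain C++ (no GPU involved); SampleRange is a hypothetical helper that is not part of the code above, and it assumes the d_Samples layout implied by the kernel, with pathN consecutive samples per option.

```cpp
#include <utility>

// Hypothetical helper (not part of the thread's code): returns the first and
// last d_Samples index that thread k reads, following the kernel's loop bounds.
std::pair<int, int> SampleRange(int k, int pathN)
{
    int first = k * pathN;  // index read by the initial assignment to result
    int last = first;
    // Same bounds as the kernel loop; each iteration reads d_Samples[i - 1].
    for (int i = 2 + k * pathN; i <= pathN + k * pathN; i++) {
        last = i - 1;
    }
    return {first, last};
}
```

With pathN = 4, for instance, thread 0 reads indices 0 to 3 and thread 2 reads indices 8 to 11, so consecutive threads work on consecutive, non-overlapping slices of d_Samples.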

Thanks a lot :) It’s working well for optionCount <= 512.
That was my mistake then… I didn’t pay attention to the fact that only 512 threads can be launched in a single block.

As for me, I need to launch many more options than 512 (around 10 000 or 100 000). :( What shall I do? Use a global memory array to store intermediate results?
That doesn’t seem very optimized…

Thanks a lot for your help.

Not a problem:

__global__ void SimulatePath(TOptionPlan plan, unsigned int n)
{
	int k = blockIdx.x * blockDim.x + threadIdx.x;
	float result;

	if (k < n) {
		result = plan.p0 * exp(plan.m_A0 + plan.m_B0 * plan.d_Samples[k*plan.pathN]);
		for(int i = 2 + k * plan.pathN; i <= plan.pathN + k * plan.pathN; i ++) {
			result = result * exp( plan.m_A + plan.m_B * plan.d_Samples[(i-1)] );
		}
		plan.d_Buffer[k].Expected = result;
	}
}

{
	unsigned int num_threads = plan.optionCount;
	unsigned int threads_per_block = 128;
	unsigned int num_blocks = (num_threads + threads_per_block - 1) / threads_per_block;
	SimulatePath<<<num_blocks, threads_per_block, 0>>>(plan, num_threads);
}
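
The block count and the k < n guard can be checked without a GPU. The sketch below is plain host-side C++ that emulates the grid with two nested loops; NumBlocks and CoverageCounts are hypothetical names introduced only for this illustration, and the loops merely stand in for blockIdx.x and threadIdx.x (real threads run in parallel, not in this order).

```cpp
#include <vector>

// Hypothetical helper: number of blocks needed for n work items,
// rounded up so a partially filled last block is still launched.
unsigned int NumBlocks(unsigned int n, unsigned int threads_per_block)
{
    return (n + threads_per_block - 1) / threads_per_block;
}

// Hypothetical host-side emulation: counts how often each option index k < n
// is visited by a grid of NumBlocks(n, threads_per_block) blocks.
std::vector<int> CoverageCounts(unsigned int n, unsigned int threads_per_block)
{
    unsigned int num_blocks = NumBlocks(n, threads_per_block);
    std::vector<int> hits(n, 0);
    for (unsigned int block = 0; block < num_blocks; block++) {               // stands in for blockIdx.x
        for (unsigned int thread = 0; thread < threads_per_block; thread++) { // stands in for threadIdx.x
            unsigned int k = block * threads_per_block + thread;
            if (k < n) {  // same guard as in the kernel
                hits[k]++;
            }
        }
    }
    return hits;
}
```

For 10 000 options and 128 threads per block this gives 79 blocks, i.e. 10 112 threads; the 112 threads past the end fail the guard and do nothing, and every option index is handled exactly once.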

Wow thanks a lot man. Simple, elegant, very nice solution :)
Thanks !

It’s the standard solution. Save your praise for some really elegant stuff.
