Please help us. We are new to CUDA, and could somebody please help us with using threads on an array of up to 256,000 elements of some structure? This is not the actual structure, but it is similar:
struct Point { float x; float y; float z; };
We would like to add and multiply some values into an array of this structure. Please, this is urgent.
This is a very general question, and I don’t know how to answer you. Manipulating such arrays is easy in CUDA, just use 1 thread per element. Make sure you read the programming guide, it is much more than just a reference. Especially pay attention to the section on memory coalescing.
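To make that concrete, here is a minimal sketch of the one-thread-per-element pattern on a struct like yours (the kernel name and the scale/offset parameters are placeholders for whatever add/multiply you actually need):

struct Point { float x; float y; float z; };

// One thread per element: thread with global index i transforms points[i].
__global__ void transform_points(Point* points, int n, float scale, float offset)
{
    unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        points[i].x = points[i].x * scale + offset;
        points[i].y = points[i].y * scale + offset;
        points[i].z = points[i].z * scale + offset;
    }
}

One caveat: an array of 12-byte structs like this does not coalesce perfectly on current hardware, which is exactly why the coalescing section of the guide is worth reading; it describes the structure-of-arrays layout that does coalesce.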
Yep, I already read the guide. I also checked the samples. However, we tried to do the same and it only works for 39,200 elements. I am really sorry, but we really need a fast solution to this problem; we only have 5 days left. If you could help us, it would help us greatly.
We just need a simple way to add and multiply some values over 256,000 points in an array.
Well, nobody can help you if you don’t provide some more details, the problem could be anything. Code that reproduces the problem would be useful.
What error message are you getting? Are you checking for errors after every kernel launch? What behavior are you seeing?
“It only works for 39,200 elements” doesn’t explain much. You could be exceeding the device’s maximum grid configuration limits, you could be writing past the end of an allocated array, or you could be hitting the 5-second limitation.
Now, will I be able to change the .cu file so as to use threads on the GPU? Our NUM is not limited to 16000 but can be 64000 or 256000 as well.
We really need to use threads to optimize everything. Please help! :( :wacko:
I would suggest you start by looking at the examples in the SDK. Your code does not look like something written for a GPU at all, so better to understand the well-written examples before trying to write something yourself.
I would do something like:
__global__ void compute_testd(float* temp, int nSize)
{
    // One thread per element: build a global index from the block and thread IDs.
    unsigned int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < nSize)
        temp[index] = 3.0f;
}

extern "C" void test(float* temp, int nSize)
{
    float* tempd;
    cudaMalloc((void**)&tempd, sizeof(float) * nSize);
    cudaMemcpy(tempd, temp, sizeof(float) * nSize, cudaMemcpyHostToDevice);
    compute_testd<<<ceil(nSize/256), 256>>>(tempd, nSize);
    cudaMemcpy(temp, tempd, sizeof(float) * nSize, cudaMemcpyDeviceToHost);
    cudaFree(tempd);
}
This will have every single thread on the GPU loop over nSize elements. If nSize is large, you are going to hit the 5s limitation. You should check for errors after calling your kernel to discover when this occurs. Either use CUT_CHECK_ERROR from the SDK or call cudaThreadSynchronize and then get the error code and print the error message if it is not cudaSuccess.
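For example, a minimal sketch of that check (cudaThreadSynchronize is the runtime call this CUDA version uses; the wrapper function name here is made up):

#include <cstdio>

// Launch, then synchronize so any execution error is visible, then report it.
void checked_launch(float* tempd, int nSize, int blocks)
{
    compute_testd<<<blocks, 256>>>(tempd, nSize);
    cudaThreadSynchronize();               // wait for the kernel to actually finish
    cudaError_t err = cudaGetLastError();  // pick up any launch or execution error
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));
}

A kernel launch is asynchronous, so without the synchronize the error would only show up at some later CUDA call.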
DenisR’s modifications to the code are probably what you really want: each thread handles a single element with coalesced reads and writes.
So my kernel code does give you the right results? Because the code you posted is not the same.
This again has all the threads in one block doing the same work. Also, __syncthreads() is not needed, since there are no dependencies between threads. And any code that does not use threadIdx is basically buggy CUDA code.
Check how my kernel code calculates the index into the array, and read the programming guide again about grid and block dimensions:
__global__ void compute_testd(float* temp, int nSize)
{
    unsigned int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < nSize)
        temp[index] = 3.0f;
}
Somebody said that one thread accesses one index. But there are only 512 threads available, right? What if I have more than 1000 values; how will it handle this?
Also, from what I understand, if nSize is, for example, 1000, it allocates 3 blocks and has 256 threads. If so, how does the device advance through each piece of data, since 3 blocks are allocated for each thread?
No, there are not only 512 threads available; there is a maximum of 512 threads per block! And you have it the wrong way around: there are not 3 blocks allocated for each thread; there are (in this example) 256 threads allocated for each block.
In the above code, only 256 threads are running per block (that way 3 blocks can run per multiprocessor, whereas you can only have 1 block per multiprocessor when using 512 threads per block; check the occupancy calculator for details). And each block processes a different range of indices.
blockIdx.x gives you which block the thread belongs to.
blockDim.x gives you how many threads there are per block.
So if nSize = 1000 and the number of threads per block is 256, you will get 4 blocks (and not 3, look at the ceil()).
The first block will have blockIdx.x = 0, so it will process elements 0 - 255 (threadIdx.x).
The second block will have blockIdx.x = 1, so it will process elements 256 - 511 (threadIdx.x + 1*256).
The third block will have blockIdx.x = 2, so it will process elements 512 - 767 (threadIdx.x + 2*256).
The fourth block will have blockIdx.x = 3, so it will process elements 768 - 1023 (threadIdx.x + 3*256).
That is the reason for the if (index < nSize), otherwise the last 24 threads of the fourth block would be writing past the end of the array.
Another option would be to pad the input array to a multiple of 256, so you can skip the if (index < nSize).
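For example, a rough sketch of that padding (assuming 256 threads per block as above; the padded tail gets written too, but is simply never copied back):

int padded = 256 * ((nSize + 255) / 256);            // round nSize up to a multiple of 256
cudaMalloc((void**)&tempd, sizeof(float) * padded);
cudaMemcpy(tempd, temp, sizeof(float) * nSize, cudaMemcpyHostToDevice);
compute_testd<<<padded / 256, 256>>>(tempd, padded); // every thread now has a valid element
cudaMemcpy(temp, tempd, sizeof(float) * nSize, cudaMemcpyDeviceToHost);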
I hope this helps, and I would really advise you to read the programming guide again and study the examples from the SDK.
I read the book and now I somehow understand how it works! However, I tried using ceil(nSize/256)+1, 256 instead of ceil(nSize/256), 256, because when I try it with 16000, the last block is not changed. Will that be okay?
And also: what if, in the device code with threads, I want to check whether the total of all the values I added to an array is, let’s say, 1000? How would I do that? With a shared variable among all threads?
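About the ceil(nSize/256)+1 question, the likely reason the last block was not changed: in C, nSize/256 is integer division, so ceil() never sees a fractional part. With nSize = 16000 that gives 16000/256 = 62 blocks, which leaves the last 128 elements untouched. Your +1 is safe, because the if (index < nSize) guard keeps the extra threads from writing; it just launches one unnecessary block whenever nSize is already a multiple of 256. The usual idiom rounds up in integer arithmetic instead:

int blocks = (nSize + 255) / 256;   // integer round-up: 16000 -> 63 blocks, 15872 -> 62
compute_testd<<<blocks, 256>>>(tempd, nSize);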
I may sound like a broken record, but I think you should just read the programming guide to find out about shared variables.
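That said, for reference: the usual pattern for that kind of total is a shared-memory reduction, where each block sums its own elements into one partial result and the host adds up the partial results. A minimal sketch (assuming 256 threads per block; the reduction sample in the SDK is the full treatment):

__global__ void block_sum(const float* data, float* partial, int nSize)
{
    __shared__ float cache[256];                 // one slot per thread in the block
    unsigned int index = threadIdx.x + blockIdx.x * blockDim.x;
    cache[threadIdx.x] = (index < nSize) ? data[index] : 0.0f;
    __syncthreads();
    // Tree reduction within the block: halve the active threads each step.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = cache[0];          // one partial sum per block
}

The host then copies partial[] back (one float per block) and adds those few values on the CPU to check against your 1000.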
The 5-second limitation is the fact that on Windows (and on Linux, if the CUDA device is your primary display adapter) a kernel cannot run for more than 5 seconds; otherwise a watchdog resets the card and the kernel does not run to completion. The error you will get when using CUT_CHECK_ERROR will be “unspecified launch failure”.
Hehehe… I will look into it then. Thanks for all the info. I hope I can still ask some questions in the future.
By the way, we are actually doing this for our thesis; we are trying to improve an algorithm. If you want, we could list you among our resource persons, since you were a big help to us. Well, only if you want. Many thanks again. Keep in touch.
I also want to ask which GPU I should select in the CUDA Occupancy Calculator for a GeForce 8400 GS. I searched for it and it says it is a G86, but there is no G86 in the list. Could you help me with this? ^_^