Please help us. We are new to CUDA, and could somebody please help us with using threads on an array of up to 256,000 elements of some structure? This is not the actual structure, but it is similar:
struct Point { float x; float y; float z; };
We would like to add and multiply some values into an array of this structure. Please, this is urgent.
This is a very general question, and I don’t know how to answer you. Manipulating such arrays is easy in CUDA, just use 1 thread per element. Make sure you read the programming guide, it is much more than just a reference. Especially pay attention to the section on memory coalescing.
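To make that concrete, here is a minimal sketch of the one-thread-per-element pattern on a struct like yours (the kernel name and the scale/offset parameters are placeholders for whatever add/multiply you actually need):

struct Point { float x; float y; float z; };

// One thread per element: thread with global index i transforms points[i].
__global__ void transform_points(Point* points, int n, float scale, float offset)
{
    unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        points[i].x = points[i].x * scale + offset;
        points[i].y = points[i].y * scale + offset;
        points[i].z = points[i].z * scale + offset;
    }
}

One caveat: an array of 12-byte structs like this does not coalesce perfectly on current hardware, which is exactly why the coalescing section of the guide is worth reading; it describes the structure-of-arrays layout that does coalesce.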
Yep, I already read the guide. I also checked the samples. However, we tried to do the same and it only works for 39,200 elements. I am really sorry, but we really need a fast solution to this problem; we only have 5 days left. If you could help us, it would help us greatly.
We just need a simple way to add and multiply some values over 256,000 points in an array.
Well, nobody can help you if you don’t provide some more details, the problem could be anything. Code that reproduces the problem would be useful.
What error message are you getting? Are you checking for errors after every kernel launch? What behavior are you seeing?
“It only works for 39,200 elements” doesn’t explain much. You could be exceeding the device’s maximum grid configuration limits, you could be writing past the end of an allocated array, or you could be hitting the 5-second limitation.
Now, will I be able to change the .cu file so as to use threads on the GPU? Our NUM is not limited to 16000 but can be 64000 or 256000 as well.
We really need to use threads to optimize everything. Please help! :( :wacko:
I would suggest you start by looking at the examples in the SDK. Your code does not look like something written for a GPU at all, so better to understand the well-written examples before trying to write something yourself.
I would do something like:
__global__ void compute_testd(float* temp, int nSize)
{
    // One thread per element: build a global index from the block and thread IDs.
    unsigned int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < nSize)
        temp[index] = 3.0f;
}

extern "C" void test(float* temp, int nSize)
{
    float* tempd;
    cudaMalloc((void**)&tempd, sizeof(float) * nSize);
    cudaMemcpy(tempd, temp, sizeof(float) * nSize, cudaMemcpyHostToDevice);
    compute_testd<<<ceil(nSize/256), 256>>>(tempd, nSize);
    cudaMemcpy(temp, tempd, sizeof(float) * nSize, cudaMemcpyDeviceToHost);
    cudaFree(tempd);
}
This will have every single thread on the GPU loop over nSize elements. If nSize is large, you are going to hit the 5s limitation. You should check for errors after calling your kernel to discover when this occurs. Either use CUT_CHECK_ERROR from the SDK or call cudaThreadSynchronize and then get the error code and print the error message if it is not cudaSuccess.
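For example, a minimal sketch of that check (cudaThreadSynchronize is the runtime call this CUDA version uses; the wrapper function name here is made up):

#include <cstdio>

// Launch, then synchronize so any execution error is visible, then report it.
void checked_launch(float* tempd, int nSize, int blocks)
{
    compute_testd<<<blocks, 256>>>(tempd, nSize);
    cudaThreadSynchronize();               // wait for the kernel to actually finish
    cudaError_t err = cudaGetLastError();  // pick up any launch or execution error
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));
}

A kernel launch is asynchronous, so without the synchronize the error would only show up at some later CUDA call.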
DenisR’s modifications to the code are probably what you really want: each thread handles a single element with coalesced reads and writes.
So my kernel code does give you the right results? Because the code you posted is not the same.
This again has all the threads in one block doing the same work. Also, __syncthreads() is not needed, since there are no dependencies between threads. And any code that does not use threadIdx is basically buggy CUDA code.
Check how my kernel code calculates the index into the array, and read the programming guide again about grid and block dimensions:
__global__ void compute_testd(float* temp, int nSize)
{
    unsigned int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < nSize)
        temp[index] = 3.0f;
}
Somebody said that one thread accesses one index. But there are only 512 threads available, right? What if I have more than 1000 values; how will it handle this?
Also, from what I understand, if nSize is, for example, 1000, it allocates 3 blocks and has 256 threads. If so, how does the device advance through each piece of data, since 3 blocks are allocated for each thread?
No, there are not only 512 threads available; there is a maximum of 512 threads per block! And you have it the wrong way around: there are not 3 blocks allocated for each thread; there are (in this example) 256 threads allocated for each block.
In the above code, only 256 threads are running per block (that way 3 blocks can run per multiprocessor, whereas you can only have 1 block per multiprocessor when using 512 threads per block; check the occupancy calculator for details). And each block processes a different range of indices.
blockIdx.x gives you which block the thread belongs to.
blockDim.x gives you how many threads there are per block.
So if nSize = 1000 and the number of threads per block is 256, you will get 4 blocks (and not 3, look at the ceil()).
The first block will have blockIdx.x = 0, so it will process elements 0 - 255 (threadIdx.x).
The second block will have blockIdx.x = 1, so it will process elements 256 - 511 (threadIdx.x + 1*256).
The third block will have blockIdx.x = 2, so it will process elements 512 - 767 (threadIdx.x + 2*256).
The fourth block will have blockIdx.x = 3, so it will process elements 768 - 1023 (threadIdx.x + 3*256).
That is the reason for the if (index < nSize), otherwise the last 24 threads of the fourth block would be writing past the end of the array.
Another option would be to pad the input array to a multiple of 256, so you can skip the if (index < nSize).
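For example, a rough sketch of that padding (assuming 256 threads per block as above; the padded tail gets written too, but is simply never copied back):

int padded = 256 * ((nSize + 255) / 256);            // round nSize up to a multiple of 256
cudaMalloc((void**)&tempd, sizeof(float) * padded);
cudaMemcpy(tempd, temp, sizeof(float) * nSize, cudaMemcpyHostToDevice);
compute_testd<<<padded / 256, 256>>>(tempd, padded); // every thread now has a valid element
cudaMemcpy(temp, tempd, sizeof(float) * nSize, cudaMemcpyDeviceToHost);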
I hope this helps, and I would really advise you to read the programming guide again and study the examples from the SDK.
I read the book and now I somehow understand how it works! However, I tried using ceil(nSize/256)+1, 256 instead of ceil(nSize/256), 256, because when I try it with 16000, the last block is not changed. Will that be okay?
And also: what if, in the device code with threads, I want to check whether the total of all the values I added to an array is, let’s say, 1000? How would I do that? With a shared variable among all threads?
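About the ceil(nSize/256)+1 question, the likely reason the last block was not changed: in C, nSize/256 is integer division, so ceil() never sees a fractional part. With nSize = 16000 that gives 16000/256 = 62 blocks, which leaves the last 128 elements untouched. Your +1 is safe, because the if (index < nSize) guard keeps the extra threads from writing; it just launches one unnecessary block whenever nSize is already a multiple of 256. The usual idiom rounds up in integer arithmetic instead:

int blocks = (nSize + 255) / 256;   // integer round-up: 16000 -> 63 blocks, 15872 -> 62
compute_testd<<<blocks, 256>>>(tempd, nSize);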
I may sound like a broken record, but I think you should just read the programming guide to find out about shared variables.
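That said, for reference: the usual pattern for that kind of total is a shared-memory reduction, where each block sums its own elements into one partial result and the host adds up the partial results. A minimal sketch (assuming 256 threads per block; the reduction sample in the SDK is the full treatment):

__global__ void block_sum(const float* data, float* partial, int nSize)
{
    __shared__ float cache[256];                 // one slot per thread in the block
    unsigned int index = threadIdx.x + blockIdx.x * blockDim.x;
    cache[threadIdx.x] = (index < nSize) ? data[index] : 0.0f;
    __syncthreads();
    // Tree reduction within the block: halve the active threads each step.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = cache[0];          // one partial sum per block
}

The host then copies partial[] back (one float per block) and adds those few values on the CPU to check against your 1000.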
The 5-second limitation is the fact that on Windows (and on Linux, if the CUDA device is your primary display adapter) a kernel cannot run for more than 5 seconds; otherwise a watchdog resets the card and the kernel does not run to completion. The error you will get when using CUT_CHECK_ERROR will be “unspecified launch failure”.
Hehehe… I will look into it then. Thanks for all the info. I hope I can still ask some questions in the future.
By the way, we are actually doing this for our thesis; we are trying to improve an algorithm. If you want, we could list you among our resource persons, since you were a big help to us. Well, only if you want. Many thanks again. Keep in touch.
I also want to ask which GPU I should select in the CUDA Occupancy Calculator for a GeForce 8400 GS. I searched for it and it says it is a G86, but there is no G86 in the list. Could you help me with this? ^_^