I’m trying to parallelize a simple task that I think may be a good fit for the CUDA architecture.
- I’m working on Windows 10
- I’m using Visual Studio to compile (with the “Release” configuration)
So, I have a function f(x), similar to a quadratic function (ax^2 + bx + c = 0), and I’m trying to run several threads, each with strided values for x. This is the function I refer to in the rest of this post.
It works like:
    someValue = 0
    while (f(x) != someValue)
        solve someValue = f(x)
        add offset to vars a, b, c to get new x
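To make the strided pattern concrete, here is a minimal sketch of what such a search could look like. It is not the actual kernel from this post: `f`, its coefficients, `searchValueSketch`, `target`, and `limit` are all hypothetical stand-ins, and host-side error checks are left out for brevity.

```cuda
#include <climits>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for f(x); the real function is not shown in the post.
__host__ __device__ long long f(int x)
{
    const long long a = 1, b = -3, c = 2;  // assumed coefficients
    return a * x * x + b * x + c;
}

// Each thread tests x = tid, tid + stride, tid + 2 * stride, ...
__global__ void searchValueSketch(int target, int limit, int *result)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;  // total number of threads launched

    for (int x = tid; x < limit; x += stride) {
        if (f(x) == target) {
            atomicMin(result, x);  // keep the smallest matching x
        }
    }
}

int main()
{
    const int target = 0;        // value we want f(x) to hit
    const int limit  = 1 << 20;  // search x in [0, limit)
    int h_result = INT_MAX;      // INT_MAX means "nothing found yet"
    int *d_result = nullptr;

    cudaMalloc(&d_result, sizeof(int));
    cudaMemcpy(d_result, &h_result, sizeof(int), cudaMemcpyHostToDevice);

    searchValueSketch<<<16, 16>>>(target, limit, d_result);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_result);

    if (h_result == INT_MAX) {
        printf("no x found\n");
    } else {
        printf("smallest x with f(x) == %d: %d\n", target, h_result);
    }
    return 0;
}
```

Because the stride is computed from `gridDim.x * blockDim.x`, the same kernel covers the whole range whatever launch configuration is used, so changing `<<<16, 16>>>` to another size should not require touching the indexing.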
I had C code that does this very task sequentially, so porting it to CUDA was easy.
Now, while testing, several things occur:
If I invoke searchValue<<<32, 1>>>, or just searchValue<<<16, 16>>>, which I understand is not a huge number of threads, the application crashes.
If I invoke searchValue<<<16, 1>>>, it works, but it runs really slowly.
About the first point, I made a temporary fix so that searchValue only searches for the first f(x) value, and I profiled it:
==11564== Profiling result:
Start  Duration  Grid Size  Block Size  Regs*  SSMem*  DSMem*  Size  Throughput  SrcMemType  DstMemType  Device  Context  Stream  Name
357.38ms  3.9680us  -  -  -  -  -  12.023KB  2.8897GB/s  Pageable  Device  GeForce 920M (0)  1  7  [CUDA memcpy HtoD]
357.42ms  7.9680us  -  -  -  -  -  12.023KB  1.4391GB/s  Pageable  Device  GeForce 920M (0)  1  7  [CUDA memcpy HtoD]
357.47ms  1.0240us  -  -  -  -  -  8B  7.4506MB/s  Pageable  Device  GeForce 920M (0)  1  7  [CUDA memcpy HtoD]
357.50ms  3.1680us  -  -  -  -  -  4.0078KB  1.2065GB/s  Pageable  Device  GeForce 920M (0)  1  7  [CUDA memcpy HtoD]
359.50ms  8.3200us  -  -  -  -  -  4.0078KB  470.42MB/s  Pageable  Device  GeForce 920M (0)  1  7  [CUDA memcpy HtoD]
359.65ms  28.6490s  (16 1 1)  (1 1 1)  72  1B  0B  -  -  -  -  GeForce 920M (0)  1  7  searchValue(void*, void*, int*, void*, void*)
29.1263s  2.8160us  -  -  -  -  -  4.0078KB  1.3573GB/s  Device  Pageable  GeForce 920M (0)  1  7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
So, as far as I can see (I don’t know much about this kind of architecture), the memory usage is OK (I read that 80k registers are allowed per block, and of course my shared memory usage is low).
Then, when I invoke the kernel as searchValue<<<32, 1>>>, it hits
an illegal memory access was encountered.
I ran cuda-memcheck with the low settings (<<<16, 1>>>) and it didn’t find any issue, but with the “high settings” (<<<32, 1>>>) it found a NULL pointer (which is not NULL when the settings are different).
The NULL pointer is one of the input parameters of the function, so could it be running out of memory?
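One way to check whether an allocation or copy is failing before the kernel ever runs is to test the return value of every runtime call, plus the launch itself. The following is only a minimal sketch under assumptions: `CUDA_CHECK` and `touchBuffer` are hypothetical names, and the kernel is a placeholder rather than searchValue.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with file and line if a runtime call fails.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,         \
                    cudaGetErrorString(err_));                         \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Placeholder kernel: each thread writes its own index, guarded by n.
__global__ void touchBuffer(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = i;
    }
}

int main()
{
    const int n = 32;
    int *d_data = nullptr;

    // A failed cudaMalloc leaves d_data unusable; catch it here, not in the kernel.
    CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(int)));

    touchBuffer<<<32, 1>>>(d_data, n);
    CUDA_CHECK(cudaGetLastError());       // errors in the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    CUDA_CHECK(cudaFree(d_data));
    printf("kernel finished without errors\n");
    return 0;
}
```

If every call before the launch succeeds and the pointer is still NULL inside the kernel, the allocation itself is probably not the cause, and the way the pointer is indexed or passed per block would be the next thing to look at.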
Then, on the second point (performance): the program computes really slowly. I copied the same code (excluding the CUDA-specific tokens) into C and tested:
- CUDA did about 600 iterations in 30 seconds
- C did 100,000 iterations in 40 seconds
So, either my graphics card is somehow really bad (I’m working on a laptop with a GeForce 920M), or I’m making some big mistakes in my CUDA approach.
As the coding part may be a little off-topic, I’d prefer to focus on the first issue, which may be causing the second one…
So, can somebody please help me with the first point?
Thanks a lot!