Bad performance or bad coding?

Hello:
I’m trying to parallelize a simple task that I think may be a good fit for the CUDA architecture.

  • I’m working on Windows 10
  • I’m using Visual Studio to compile (with the “Release” profile)

So, I have a function f(x), similar to a quadratic (f(x) = ax^2 + bx + c), and I’m trying to run several threads, each with strided values of x. In this post I’ll call this function searchValue.

It works like:

someValue = 0

while (f(x) != someValue)
  compute f(x)
  add an offset to the vars a, b, c to get a new x
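
To illustrate the pattern (a sketch only: plain long longs stand in for my actual number type, and the names are made up), the strided search looks roughly like:

__global__ void searchSketch(long long a, long long b, long long c,
                             long long target,
                             long long* result, volatile int* found) {
  //each thread starts at its global index and advances by the grid-wide stride
  long long x = blockIdx.x * blockDim.x + threadIdx.x;
  long long stride = (long long)blockDim.x * gridDim.x;

  while (*found == 0) {   //volatile, so threads notice when the flag flips
    if (a * x * x + b * x + c == target) {
      *result = x;        //record the solution
      *found = 1;         //signal the other threads to stop
    }
    x += stride;          //this thread's next candidate
  }
}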

I already had C code that performs this exact task sequentially, so I ported it to CUDA easily.

Now, while testing, several things occur:

  • If I invoke searchValue<<<32, 1>>>, or even searchValue<<<16, 16>>> (which, as I understand it, is not a huge number of threads), the application crashes.

  • If I invoke searchValue<<<1, 1>>> or searchValue<<<16, 1>>>, it works, but really slowly.

Regarding the first point, I made a temporary fix so that searchValue only searches for the first f(x) value, and I profiled it as searchValue<<<16, 1>>>:

==11564== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
357.38ms  3.9680us                    -               -         -         -         -  12.023KB  2.8897GB/s    Pageable      Device  GeForce 920M (0         1         7  [CUDA memcpy HtoD]
357.42ms  7.9680us                    -               -         -         -         -  12.023KB  1.4391GB/s    Pageable      Device  GeForce 920M (0         1         7  [CUDA memcpy HtoD]
357.47ms  1.0240us                    -               -         -         -         -        8B  7.4506MB/s    Pageable      Device  GeForce 920M (0         1         7  [CUDA memcpy HtoD]
357.50ms  3.1680us                    -               -         -         -         -  4.0078KB  1.2065GB/s    Pageable      Device  GeForce 920M (0         1         7  [CUDA memcpy HtoD]
359.50ms  8.3200us                    -               -         -         -         -  4.0078KB  470.42MB/s    Pageable      Device  GeForce 920M (0         1         7  [CUDA memcpy HtoD]
359.65ms  28.6490s             (16 1 1)         (1 1 1)        72        1B        0B         -           -           -           -  GeForce 920M (0         1         7  searchValue(void*, void*, int*, void*, void*) [118]
29.1263s  2.8160us                    -               -         -         -         -  4.0078KB  1.3573GB/s      Device    Pageable  GeForce 920M (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy

So, as far as I can see (I don’t know a lot about this kind of architecture), the memory usage is OK (I read that 80K registers are allowed per block, and I’m certainly low on shared memory).

Then, when I invoke the kernel as searchValue<<<32, 1>>>, it fails with “an illegal memory access was encountered”.

I ran cuda-memcheck with the low settings (<<<16, 1>>>) and it didn’t find any issue, but with the “high settings” (<<<32, 1>>>) it found a NULL pointer (though the pointer is not NULL when the settings are different).

The NULL pointer comes from one of the function’s input parameters, so could it be running out of memory?

Then, on to the second point (performance). The program computes really slowly; I copied the same code (excluding the CUDA-specific tokens) into plain C and tested:

  • CUDA did about 600 iterations in 30 seconds
  • C did 100,000 iterations in 40 seconds

So, either my graphics card is somehow really bad (I’m working on a laptop with a GeForce 920M), or I’m making some big mistakes in my CUDA approach.

As the coding part may be a little off-topic, I’d prefer to focus on the first issue, which may be causing the second one…

So, can somebody please help me with the first point?

Thanks a lot!

I would suggest using the method outlined here to identify the specific line of code that is generating the cuda-memcheck error. Work backwards from that point to identify precisely how the invalid address is being generated. If you need help with that, get the process started, post your results including the line of code, and ask questions about what to do next.
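
(Concretely: build with device line information and run under the tool, something like

nvcc -lineinfo -o EQSolver cuda_module.cu
cuda-memcheck EQSolver.exe

so the report includes a file and line number.)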

Hello @Robert_Crovella.
I did that and solved some earlier memory errors, which no longer occur.
In this specific scenario, the failing code is:

mul(&sdist[0], bidx, m); //sdist[0] *= bidx;
mul(&sdist[1], bidx, m); //sdist[1] *= bidx; --> This one (line 312)
mul(&sdist[2], bidx, m);  //sdist[2] *= bidx;

So bidx is OK, since it works on the previous line; also, sdist comes from:

memcpy(&sdist[0], &((struct BigInteger*)dist)[0], sizeof(struct BigInteger));
memcpy(&sdist[1], &((struct BigInteger*)dist)[1], sizeof(struct BigInteger)); // line 303
memcpy(&sdist[2], &((struct BigInteger*)dist)[2], sizeof(struct BigInteger));

And dist comes from the function signature:

__global__ void searchValue(void* dif, void* dist, int* arr, void* num, void* ret)

So, the point is

  • If I invoke searchValue<<<1, 1>>> or searchValue<<<16, 1>>>, it works, and in each thread dist has a value (the same initial value for every thread).
  • At line 312 all three parameters should have valid addresses: every element of sdist is populated from the same origin variable; bidx is used in the previous statement without error; and the same goes for m.

Also, the variable’s address points to 0x00000000 (NULL), which makes it a bit weird (it doesn’t look like an out-of-bounds array or pointer).

I will try to reproduce the scenario later and post some more cuda-memcheck information.

Regards.

That doesn’t show where sdist comes from. Show the definition of sdist and how space for it is allocated.

Hello @Robert_Crovella:

As I said, here’s some more information about the failing execution (with <<<32, 1>>>):

========= Error: process didn't terminate successfully
========= Invalid __global__ write of size 1
=========     at 0x00008028 in C:/_CUDA/RSA/C - Original/cuda_module.cu:781:searchValue(void*, void*, int*, void*, void*)
=========     by thread (0,0,0) in block (28,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuMemcpy2DAsync + 0x1ba01e) [0x1c8755]
=========     Host Frame:C:\Users\dosca\source\repos\RSAAnalyzer - CUDA\x64\Release\EQSolver.exe (cudart::cudaApiLaunchKernel + 0xf6) [0x1c86]
=========     Host Frame:C:\Users\dosca\source\repos\RSAAnalyzer - CUDA\x64\Release\EQSolver.exe (cudaLaunchKernel + 0x1c4) [0x1644]
=========     Host Frame:C:\Users\dosca\source\repos\RSAAnalyzer - CUDA\x64\Release\EQSolver.exe (searchValue + 0xf7) [0xfb27]
=========     Host Frame:C:\Users\dosca\source\repos\RSAAnalyzer - CUDA\x64\Release\EQSolver.exe (main + 0x53d) [0x100fd]
=========     Host Frame:C:\Users\dosca\source\repos\RSAAnalyzer - CUDA\x64\Release\EQSolver.exe (__scrt_common_main_seh + 0x10c) [0x106d4]
=========     Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x17c24]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x6d4d1]
=========
========= Invalid __global__ write of size 1
=========     at 0x00008028 in C:/_CUDA/RSA/C - Original/cuda_module.cu:781:searchValue(void*, void*, int*, void*, void*)
=========     by thread (0,0,0) in block (5,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuMemcpy2DAsync + 0x1ba01e) [0x1c8755]
=========     Host Frame:C:\Users\dosca\source\repos\RSAAnalyzer - CUDA\x64\Release\EQSolver.exe (cudart::cudaApiLaunchKernel + 0xf6) [0x1c86]
=========     Host Frame:C:\Users\dosca\source\repos\RSAAnalyzer - CUDA\x64\Release\EQSolver.exe (cudaLaunchKernel + 0x1c4) [0x1644]
=========     Host Frame:C:\Users\dosca\source\repos\RSAAnalyzer - CUDA\x64\Release\EQSolver.exe (searchValue + 0xf7) [0xfb27]
=========     Host Frame:C:\Users\dosca\source\repos\RSAAnalyzer - CUDA\x64\Release\EQSolver.exe (main + 0x53d) [0x100fd]
=========     Host Frame:C:\Users\dosca\source\repos\RSAAnalyzer - CUDA\x64\Release\EQSolver.exe (__scrt_common_main_seh + 0x10c) [0x106d4]
=========     Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x17c24]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x6d4d1]
=========
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuMemcpy2DAsync + 0x2fa194) [0x3088cb]
=========     Host Frame:C:\Users\dosca\source\repos\RSAAnalyzer - CUDA\x64\Release\EQSolver.exe (cudaDeviceSynchronize + 0xf8) [0x1218]
=========     Host Frame:C:\Users\dosca\source\repos\RSAAnalyzer - CUDA\x64\Release\EQSolver.exe (main + 0x55d) [0x1011d]
=========     Host Frame:C:\Users\dosca\source\repos\RSAAnalyzer - CUDA\x64\Release\EQSolver.exe (__scrt_common_main_seh + 0x10c) [0x106d4]
=========     Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x17c24]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x6d4d1] 

Line 781, in the mul function, contains:

//initialize the BIT
memcpy(&((struct BIT*)((struct memory*)m)->biBIT)->BI[0], ((struct memory*)m)->mzero, sizeof(struct BigInteger)); //Line 781
memcpy(&((struct BIT*)((struct memory*)m)->biBIT)->BI[1], va, sizeof(struct BigInteger));

The write is being made to biBIT->BI[0], which is part of the memory struct.

At this point, several writes to memory have already been made in the same function:

CUBImemcpy(((struct memory*)m)->mzero, 0); //Line 735
CUBImemcpy(((struct memory*)m)->mret, 0);
CUBImemcpy(((struct memory*)m)->mone, 1);

Memory is allocated once per thread:

struct memory* m = (struct memory*)malloc(CUgetMemorySize()); //Line 270, previous to "mul" call

[...]
//initial value setting
CUinit((void**)m); //Line 290

Then, just to clarify

__device__ size_t CUgetMemorySize() {
  return (sizeof(struct BigInteger) * 19) + sizeof(struct BIT);
}

__device__ void CUinit(void** m) {
  //addition
  ((struct memory*)m)->vt = malloc(sizeof(struct BigInteger));

  //subtraction
  ((struct memory*)m)->stmp = malloc(sizeof(struct BigInteger));

  //multiplication
  ((struct memory*)m)->mpart = malloc(sizeof(struct BigInteger));
  ((struct memory*)m)->mret = malloc(sizeof(struct BigInteger));
  ((struct memory*)m)->mzero = malloc(sizeof(struct BigInteger));
  ((struct memory*)m)->mone = malloc(sizeof(struct BigInteger));
  ((struct memory*)m)->mtmp = malloc(sizeof(struct BigInteger));

  //division
  ((struct memory*)m)->done = malloc(sizeof(struct BigInteger));
  ((struct memory*)m)->dtmp = malloc(sizeof(struct BigInteger));
  ((struct memory*)m)->dret = malloc(sizeof(struct BigInteger));
  ((struct memory*)m)->dTemp = malloc(sizeof(struct BigInteger));
  ((struct memory*)m)->biTemp = malloc(sizeof(struct BigInteger));

  //square root
  ((struct memory*)m)->sret = malloc(sizeof(struct BigInteger));
  ((struct memory*)m)->sraw = malloc(sizeof(struct BigInteger));
  ((struct memory*)m)->sbase = malloc(sizeof(struct BigInteger));
  ((struct memory*)m)->szero = malloc(sizeof(struct BigInteger));

  //power
  ((struct memory*)m)->bres = malloc(sizeof(struct BigInteger));
  ((struct memory*)m)->btmp = malloc(sizeof(struct BigInteger));

  //append
  ((struct memory*)m)->aaux = malloc(sizeof(struct BigInteger));

  //BIT
  ((struct memory*)m)->biBIT = malloc(sizeof(struct BIT));

  //common values
  CU_BI_initialize();

  CUBImemcpy(((struct memory*)m)->vt, 2);
}

Finally, you also asked about sizes, so here are the structs I’m using:

//struct
struct BigInteger {
  char k;
  int count;
  char n[4096];
};

struct BIT {
  struct BigInteger BI[10];
  int status[10];
};

So, the memory size would be (if I’m not wrong):

(4097 + 4) * 19 + (4097 + 4) * 10 = (4097 + 4) * 29 = 118,929 bytes per thread
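
A quick host-side check of the real sizes may be worthwhile, since the compiler adds alignment padding (a minimal sketch):

#include <stdio.h>

struct BigInteger {
  char k;
  int count;
  char n[4096];
};

struct BIT {
  struct BigInteger BI[10];
  int status[10];
};

int main(void) {
  //k is padded so count stays 4-byte aligned: BigInteger is 4104 bytes on x64, not 4101
  printf("BigInteger: %zu bytes\n", sizeof(struct BigInteger));
  printf("BIT:        %zu bytes\n", sizeof(struct BIT));
  return 0;
}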

You also asked about sdist, so let me clarify (sorry for the omission in my first post):

__global__ void searchValue(void* dif, void* dist, int* arr, void* num, void* ret) {
  [...]
  //dist setting
  struct BigInteger* sdist = (struct BigInteger*)malloc(sizeof(struct BigInteger) * 3);
  [...]
  //set dist
  memcpy(&sdist[0], &((struct BigInteger*)dist)[0], sizeof(struct BigInteger));
  memcpy(&sdist[1], &((struct BigInteger*)dist)[1], sizeof(struct BigInteger));
  memcpy(&sdist[2], &((struct BigInteger*)dist)[2], sizeof(struct BigInteger));
  [...]
  mul(&sdist[0], bidx, m);
  mul(&sdist[1], bidx, m);
  mul(&sdist[2], bidx, m);

Also, bidx is the BigInteger representation of int idx = blockIdx.x * blockDim.x + threadIdx.x, and m is an instance of struct memory that is passed to mul as void* (__device__ void mul(void* va, void* vb, void* m)).

I hope this information helps to find the problem :-)

Thanks.

In-kernel malloc will return a null pointer (0x00000000) if the allocation fails. The allocation can fail if you run into the default limit of 8MB across your entire device.

Test for this after the malloc call by checking for a returned null pointer, and then printf/return, or assert:

#include <assert.h>   //needed for device-side assert

struct BigInteger* sdist = (struct BigInteger*)malloc(sizeof(struct BigInteger) * 3);
assert(sdist != 0);   //halts the kernel if the device-heap allocation failed

Thanks for the quick answer.

I added this check:

__global__ void searchValue(void* dif, void* dist, int* arr, void* num, void* ret) {
  //block trap
  __shared__ char s;

  //local trap
  int ex = -1;

  //local vars
  //index setting
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int std = blockDim.x * gridDim.x;
  char* sidx = (char*)malloc(sizeof(char) * 100);
  char* sstd = (char*)malloc(sizeof(char) * 100);
  struct BigInteger* bidx = (struct BigInteger*)malloc(sizeof(struct BigInteger));
  struct BigInteger* bstd = (struct BigInteger*)malloc(sizeof(struct BigInteger));

  //dist setting
  struct BigInteger* sdist = (struct BigInteger*)malloc(sizeof(struct BigInteger) * 3);
  
  //offset setting
  struct BigInteger* sdif = (struct BigInteger*)malloc(sizeof(struct BigInteger) * 3);
  
  //distance setting
  struct BigInteger* sdix = (struct BigInteger*)malloc(sizeof(struct BigInteger) * 3);
  
  //solving
  struct BigInteger* sols = (struct BigInteger*)malloc(sizeof(struct BigInteger) * 2);
  
  //general purpose
  struct memory* m = (struct memory*)malloc(CUgetMemorySize());
  char* st1;
  char* st2;
  char* st3;

  if (sdist == NULL)
    printf("sdist is NULL\n");

  if (sdif == NULL)
    printf("sdif is NULL\n");

  if (sdix == NULL)
    printf("sdix is NULL\n");

  if (sols == NULL)
    printf("sols is NULL\n");

  if (m == NULL)
    printf("m is NULL\n");

Then I launched <<<16, 16>>> and got a bunch of “xxx is NULL” messages. So, I’m running out of memory? How can this situation be solved?

Thanks a lot for the help.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#dynamic-global-memory-allocation-and-operations

B.31.1. Heap Memory Allocation

The device memory heap has a fixed size that must be specified before any program using malloc(), __nv_aligned_device_malloc() or free() is loaded into the context. A default heap of eight megabytes is allocated if any program uses malloc() or __nv_aligned_device_malloc() without explicitly specifying the heap size.

The following API functions get and set the heap size:

  • cudaDeviceGetLimit(size_t* size, cudaLimitMallocHeapSize)
  • cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t size)

The heap size granted will be at least size bytes. cuCtxGetLimit() and cudaDeviceGetLimit() return the currently requested heap size.
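
A minimal host-side sketch of how that might look (the figures are placeholders; substitute your own per-thread estimate):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  //must run before the first kernel launch that uses in-kernel malloc
  size_t perThread = 250000;  //placeholder: bytes of device-heap malloc per thread
  int    nThreads  = 32;      //placeholder: total threads to be launched
  cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize,
                                       perThread * (size_t)(nThreads + 1));
  printf("set heap: %s\n", cudaGetErrorString(err));

  size_t granted = 0;
  cudaDeviceGetLimit(&granted, cudaLimitMallocHeapSize);
  printf("heap granted: %zu bytes\n", granted);
  return 0;
}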

Hello:

Now I think part of the error is gone, but I’m still having memory issues. This one is different (I think it’s now a stack-size problem).

So, I’m now using the cudaDeviceSetLimit function, passing a calculated value:

cn: 1, cm: 16
Memory usage summary (kernel):
        struct memory     size: 135472
        struct BigInteger size: 4104
        char              size: 1
        int               size: 4
        ------------------------
        struct memory     count: 1    >                 135472
        struct BigInteger count: 26   >                 106704
        int               count: 2    >                 8
        char              count: 12488 >                12488
TOTAL MEMORY USAGE: 254672 BYTES PER THREAD

Requested Threads: 16
        Setting memory for 17 threads (4329424 bytes)
Memory allocation: no error
Kernel Launch: no error
Error on CUnewBI!
Error on CUnewBI!
Error on CUnewBI!
Error on CUnewBI!
Error on CUnewBI!
Error on CUnewBI!
Error on CUnewBI!
Error on CUnewBI!
Error on CUnewBI!
Error on CUnewBI!
Error on CUnewBI!
Error on CUnewBI!
Error on CUnewBI!
Error on CUnewBI!
Error on CUnewBI!
Error on CUnewBI!
Kernyel Sync: unspecified launch failure
========= Error: process didn't terminate successfully
========= Invalid __global__ write of size 1
=========     at 0x00000eb8 in C:/_CUDA/RSA/C - Original/cuda_module.cu:514:searchValue(void*, void*, int*, void*, void*)
=========     by thread (15,0,0) in block (0,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuMemcpy2DAsync + 0x1ba01e) [0x1c8755]
=========     Host Frame:C:\Users\dosca\source\repos\RSAAnalyzer - CUDA\x64\Release\cuda-10.1-template-2019.exe (cudart::cudaApiLaunchKernel + 0xf6) [0x1e76]
=========     Host Frame:C:\Users\dosca\source\repos\RSAAnalyzer - CUDA\x64\Release\cuda-10.1-template-2019.exe (cudaLaunchKernel + 0x1c4) [0x17a4]
=========     Host Frame:C:\Users\dosca\source\repos\RSAAnalyzer - CUDA\x64\Release\cuda-10.1-template-2019.exe (searchValue + 0xf7) [0xfa87]
=========     Host Frame:C:\Users\dosca\source\repos\RSAAnalyzer - CUDA\x64\Release\cuda-10.1-template-2019.exe (main + 0x64b) [0x1016b]
=========     Host Frame:C:\Users\dosca\source\repos\RSAAnalyzer - CUDA\x64\Release\cuda-10.1-template-2019.exe (__scrt_common_main_seh + 0x10c) [0x10734]
=========     Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x17c24]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x6d4d1]
=========

In this execution, the error is located at:

__device__ void CUnewBI(void* dst, char* s, int sig) {
  int i = cu_strlen(s) - 1;
  int f = i;
  int j = 0;
  int c;
  int ssig = sig;

  if (dst == NULL)
    printf("Error on CUnewBI!\n");

  //set the type
  ((struct BigInteger*)dst)->k = 'i'; //Line 514, dst is null

In the searchValue function there are only two calls to CUnewBI, both of them wrapped in if statements, and both after the general memory check:

//memory check
if (sidx == NULL)
  printf("Error on sidx allocation on thread %i\n", idx);

if (sstd == NULL)
  printf("Error on sstd allocation on thread %i\n", idx);

if (bidx == NULL)
  printf("Error on bidx allocation on thread %i\n", idx);

if (bstd == NULL)
  printf("Error on bstd allocation on thread %i\n", idx);

if (sdist == NULL)
  printf("Error on sdist allocation on thread %i\n", idx);

if (sdif == NULL)
  printf("Error on sdif allocation on thread %i\n", idx);

if (sdix == NULL)
  printf("Error on sdix allocation on thread %i\n", idx);

if (sols == NULL)
  printf("Error on sols allocation on thread %i\n", idx);

if (m == NULL)
  printf("Error on memory allocation on thread %i\n", idx);

//str to bi
if (bidx == NULL) {
  printf("Error on bidx\n");
} else {
  CUnewBI(bidx, sidx, 0);
}

if (bstd == NULL) {
  printf("Error on sidx\n");
} else {
  CUnewBI(bstd, sstd, 0);
}

So, in this case the memory allocation looks OK (I also “saved” some additional space, just in case…), but it looks like the stack data is getting lost.

I saw there’s a cudaLimitStackSize value to pass to the cudaDeviceSetLimit function; might that solve this issue? If so, how can I calculate the appropriate stack size?

Thanks.

Are you building this project with relocatable device code?

Hello @Robert_Crovella

It’s not a must for it to be relocatable. I have two devices to test on, and Visual Studio on both, so I can compile it into two different executables.

Let me restate the question. Are you compiling with the Visual Studio project setting:

project properties…Configuration Properties…CUDA C/C++…Common…Generate Relocatable Device Code

set to Yes, or No. Just looking for a Yes or No response.

Hi, sorry…
I have it set to “No”. In this same section, I have the “--compile” and “-cudart static” flags set.

Thanks

Ok then my best guess is that it is unlikely to be a stack issue. The compiler would inline the call to CUnewBI. But I agree that it is curious that the pointer in question appears to be non-NULL prior to the call but then NULL within the function body. I don’t know that I would be able to explain that without having a complete code to work with.

Hello:
I did a bit of debugging and found the mistake. The cuda-memcheck output suggests the error was in the searchValue function, but it was really in a device function called from it (it may have been inlined during compilation). So the real problem was the heap memory again, as I did not take into account all the mallocs done in the kernel and device functions.

I made a small fix to the code and now it works fine with appropriate values (for my limited GPU that’s <<<1, 20>>>, but I’ll be working on making it use less heap memory so I can launch more threads).

Now I can focus on the performance of the kernel, which is quite slow compared to the plain C version…

At this point, I have avoided all global memory (except for one shared byte (char) used for thread sync), and the performance increased a bit, but not much. Is there maybe a compiler setting that could impact performance?

Thanks for all the support.

If you happen to be compiling a debug project, you should switch to compiling a release project if you care about performance. (Yes, I note in your initial posting you said you are building a Release project.)

However, kernel launch configurations such as <<<1,...>>> or <<<...,1>>> are generally miserable from a performance perspective. The first two performance objectives for any CUDA developer are:

  1. Use lots of threads
  2. Make efficient use of the memory subsystems

Launch configurations like the two I indicated above fail to achieve objective 1. You should address that as soon as possible to improve the performance of your code.

The 920M is a very “small” GPU, so you may not have to go very far to meet this objective. Something like <<<64,64>>> or higher/similar might get you there. But if you run on a large GPU, you may need to go as high as something like <<<320, 512>>> or similar to get best performance (even higher on Ampere).
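
If you would rather not hand-tune those numbers, the runtime’s occupancy API can suggest a block size for your kernel (a sketch; the kernel arguments are elided):

int minGridSize = 0, blockSize = 0;
//ask the runtime for the block size that maximizes occupancy for this kernel
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, searchValue, 0, 0);
searchValue<<<minGridSize, blockSize>>>(/* your arguments */);

Since your kernel already computes a grid-wide stride, it should tolerate whatever configuration this returns.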

Hi Robert:
Thanks for the answers. I’ll work on minimizing heap usage, then (the current limitation on reaching a larger number of threads).

Is there any guide on how to program with a better approach for CUDA? Maybe I’m missing some important concepts.

Thanks.

This may be of interest. There are many resources; Google is your friend.

Hello Robert:

I spent all day rethinking my solution and found a way to gain some room (now I can launch 340 threads, for example as <<<1, 340>>> or <<<17, 20>>>).

It still runs slowly; I assume it’s because of the lack of parallel threads (as you said, my goal would be <<<64, 64>>>, which is too far off for now).

So, I will back up a bit and rethink the whole thing.
Is there anywhere on this forum to ask for collaboration? I was searching for BigInteger computation but didn’t find a clear solution, and a well-made BigInteger library would help me a lot…

Thanks!