How to parallelize a factorial calculation?

Hi,
I’m Peter, and I’m new to this forum.
To pass a course, my lecturer gave me the task of computing the factorial of 400 (400!) on a GPU with CUDA.
I suspect that 400! is out of range for every built-in variable type, so I wrote code that multiplies numbers stored as C strings. My idea for parallelizing the factorial is to split up the multiplication (1 multiplication = 1 thread), but in my opinion this solution has a disadvantage: I have to launch the kernel in a loop and, in each iteration, copy a new portion of data to the device (a lot of memory operations).
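
To make the question concrete, here is a minimal sketch in plain host-side C++ (not CUDA, and not my actual code). It assumes the numbers are stored as base-10^9 limbs instead of C strings, and it shows one possible alternative to launching a kernel per multiplication: the range 1..n is split into chunks, each partial product is computed independently (this is the part that could be given to separate threads or blocks), and the partial products are then combined pairwise. All names (Big, mul_small, mul, range_product, factorial_chunked, to_decimal) are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Big number as little-endian limbs in base 10^9 (an assumption of this sketch).
using Big = std::vector<uint32_t>;
constexpr uint64_t BASE = 1000000000ULL;

// Multiply a big number in place by a small factor.
void mul_small(Big& a, uint32_t f) {
    uint64_t carry = 0;
    for (auto& limb : a) {
        uint64_t cur = uint64_t(limb) * f + carry;
        limb = uint32_t(cur % BASE);
        carry = cur / BASE;
    }
    if (carry) a.push_back(uint32_t(carry));
}

// Schoolbook multiplication of two big numbers.
Big mul(const Big& a, const Big& b) {
    Big r(a.size() + b.size(), 0);
    for (std::size_t i = 0; i < a.size(); ++i) {
        uint64_t carry = 0;
        for (std::size_t j = 0; j < b.size() || carry; ++j) {
            uint64_t prod = (j < b.size()) ? uint64_t(a[i]) * b[j] : 0;
            uint64_t cur = r[i + j] + carry + prod;
            r[i + j] = uint32_t(cur % BASE);
            carry = cur / BASE;
        }
    }
    while (r.size() > 1 && r.back() == 0) r.pop_back();
    return r;
}

// Product lo * (lo+1) * ... * hi; returns 1 for an empty range.
// Each call is independent of the others, so each could run in parallel.
Big range_product(uint32_t lo, uint32_t hi) {
    Big r{1};
    for (uint32_t k = lo; k <= hi; ++k) mul_small(r, k);
    return r;
}

// n! computed from `chunks` independent partial products,
// combined pairwise (a tree reduction).
Big factorial_chunked(uint32_t n, uint32_t chunks) {
    assert(chunks > 0);
    std::vector<Big> parts;
    for (uint32_t c = 0; c < chunks; ++c) {
        uint32_t lo = uint32_t(uint64_t(c) * n / chunks) + 1;
        uint32_t hi = uint32_t(uint64_t(c + 1) * n / chunks);
        parts.push_back(range_product(lo, hi));
    }
    while (parts.size() > 1) {
        std::vector<Big> next;
        for (std::size_t i = 0; i + 1 < parts.size(); i += 2)
            next.push_back(mul(parts[i], parts[i + 1]));
        if (parts.size() % 2) next.push_back(parts.back());
        parts.swap(next);
    }
    return parts[0];
}

// Convert to a decimal string; every limb except the top one is zero-padded.
std::string to_decimal(const Big& a) {
    std::string s = std::to_string(a.back());
    for (std::size_t i = a.size() - 1; i-- > 0;) {
        std::string limb = std::to_string(a[i]);
        s += std::string(9 - limb.size(), '0') + limb;
    }
    return s;
}
```

For example, to_decimal(factorial_chunked(400, 8)) gives the decimal digits of 400!. With this layout the input numbers would only need to be copied to the device once, but whether it maps well to CUDA is exactly what I am unsure about.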
I saw a benchmark on the Internet, at total-oc.ru, in which the factorial of 750k is computed. Maybe you know a solution for the factorial. Please help me.

PS. Sorry for my English.