Comment
Comments
Comments in green were written by me. Comments in blue were not written by me.
@Lord Sméagol: Right, I wasn't using SMT (I meant it when I said without multithreading, sorry for the bad wording). I tried to use intel-based VM before with 4 cpu/8 threads, but the speedup was about 5.5 times only, and the price was only 20% less (I used Azure spot instances which are not so expensive, but some automation is needed to restart them every time they are stopped by Azure).
To give you all details, my program spent 21 days on an AMD EPYC 9004 (8 cores without SMT, Azure spot instance Standard F8als v6) using 8 threads (that is, about 160 CPU-days!)
I've published the source code, still planning to write about the optimizations: https://github.com/lightln2/partridge-...
To give you all details, my program spent 21 days on an AMD EPYC 9004 (8 cores without SMT, Azure spot instance Standard F8als v6) using 8 threads (that is, about 160 CPU-days!)
I've published the source code, still planning to write about the optimizations: https://github.com/lightln2/partridge-...
(anonymous)
on /blog/119
on /blog/119
@(anonymous): Thanks for the clarification.
I'm still (slowly) building my asm funcion. I think I have settled on register allocation, leaving only rcx as a 'scratch' register because cl will be needed for some variable shifts.
I also use the xmm registers (14 so far) to minimize memory operations to hopefully let HT/SMT get some decent gains.
How long would Matt Parker's 'terrible Python code' take to solve this problem ? :)
Ok, his maths knowledge might produce some decent algorithms, but it would help him a lot to use something that compiles to native code.
I'm still (slowly) building my asm funcion. I think I have settled on register allocation, leaving only rcx as a 'scratch' register because cl will be needed for some variable shifts.
I also use the xmm registers (14 so far) to minimize memory operations to hopefully let HT/SMT get some decent gains.
How long would Matt Parker's 'terrible Python code' take to solve this problem ? :)
Ok, his maths knowledge might produce some decent algorithms, but it would help him a lot to use something that compiles to native code.
Lord Sméagol
on /blog/119
on /blog/119

Going by your time differences between single and multi-thread, I assume you are not using SMT.
My performance is limited by my old tech (Ivy Bridge E5-2697-v2) !
I have got multi-threading working in C and it looks like there are no bugs :)
But I am still not getting any benefit from HyperThreading, maybe 16 general purpose registers aren't enough for the compiler!
I have just started converting my inner search function to asm, which will give me full control of ALL registers and let me use some coding tricks that are not availble using C.
Once I get it going, I will try a '10' run, which should verify your results.
on /blog/119