Comment

Comments

Comments in green were written by me. Comments in blue were not written by me.

2025-12-01

I think I solved the problem for n=10. I got 290,794,520 different solutions (or 36,349,315 solutions excluding symmetries). It took 3 weeks on an 8-core AMD EPYC (without multithreading).
For comparison: my program runs 2 seconds for n=8 (16 seconds single-threaded), and 15 minutes for n=9 (2 hours single-threaded). It has a number of quite tricky optimizations, I should write about those sometime.
Of course the result needs to be verified as the program is quite complex and might have bugs.

Oleg
on /blog/119

×1

2025-12-27

@Oleg: I pulled your code into Visual Studio 2026 and dealt with some warnings:

The COLLAPSES initializer was easy: Use (char) for 0x80

I changed unsafe localtime to:

struct tm buf;
auto err = localtime_s(&buf, &cur_time);
return std::put_time(&buf, "%F %T");

I did a quick change of __tzcnt_u32, __lzcnt32 to use _tzcnt_u32, _lzcnt_u32

Setting the compiler to use AVX and optimize for speed.

An '8' run worked, so I tried '9'

And it found only 1,729,930 solutions!

I suspected _tzcnt_u32, _lzcnt_u32 might be causing it, so I covered that:

#define __tzcnt_u32(v) ((v) ? (_tzcnt_u32(v)) : (32)) // Match BMI : should return 32 for value 0
#define __lzcnt32(v) ((v) ? (_lzcnt_u32(v)) : (32)) // Match BMI : should return 32 for value 0

but it didn't fix it.

Anyway, I decided to test multi-threading.
First I hunted for the initial depth sweet spot:

>Puzzle_Oleg.exe run 9 8 24
>Puzzle_Oleg.exe run 9 9 24
...
>Puzzle_Oleg.exe run 9 18 24
>Puzzle_Oleg.exe run 9 19 24

And found that 15 was fastest. [It didn't like 19]

'9' run [with the 1,729,930 problem] on Xeon E5-2697-v2

12t: 11m 52s
24t: 9m 1s

so HyperThreading is helping by about 48%

Lord Sméagol
on /blog/119

2026-01-01

@Lord Sméagol: Hello and happy New Year!
9 minutes is cool!
The answer is wrong because of _lzcnt instruction, as you suspected, as turns out it works differently on different cpus: https://nextmovesoftware.com/blog/2017...
With this error, solutions having 1x1 square directly in the center are not counted.

I guess, gcc/clang do it correctly because I specify -march=native (so it checks cpu and generates correct instruction), and run where I compile. But it's a potential problem I probably need to add some assertions to the code.

Maybe on your hardware you can either use WSL and clang compiler, or set constexpr bool USE_SSE_QUADRANT_FILL=false, to fall back to slower.
You could also try to use BitScanReverse instead of __lzcnt, but it has different input/output so I'm not sure how hard would that be to fix it.

Oleg
on /blog/119

2026-01-13

@Oleg:

The includes got filtered:

Util.h

//#include [bits/stdc++.h] // not including all
#include [filesystem] // just include what's needed
#include [array] // just include what's needed
#include [mutex] // just include what's needed

:)

Lord Sméagol
on /blog/119

2026-01-02

@Oleg:

I removed my macros:
#define __tzcnt_u32(v) ((v) ? (_tzcnt_u32(v)) : (32))
#define __lzcnt32(v) ((v) ? (_lzcnt_u32(v)) : (32))
replacing them with simple inline code

Util.h

//#include // not including all
#include // just include what's needed
#include // just include what's needed
#include // just include what's needed

#if 1 // use safe localtime
struct tm buf; // use safe localtime
auto err = localtime_s(&buf, &cur_time); // use safe localtime
return std::put_time(&buf, "%F %T"); // use safe localtime
#else // use safe localtime
return std::put_time(std::localtime(&cur_time), "%F %T");
#endif // use safe localtime

State.h

changed _mm_set_epi8(0x80 to -0x80 to stop warnings

inline replacement:
//int i = __tzcnt_u32(mask); // for no BMI; without zero test, as not needed here
int i = _tzcnt_u32(mask); // for no BMI; without zero test, as not needed here

inline replacement:
//int last_idx_before_mid = 31 - __lzcnt32(off_mask); // for no BMI; without zero test, as not needed here
int last_idx_before_mid = _lzcnt_u32(off_mask); // for no BMI; without zero test, as not needed here

Solver.h

inline replacement:
//return ini.size(); // to stop warning
return (int)ini.size(); // to stop warning

inline replacement:
//const int dim = __tzcnt_u32(mask); // for no BMI; without zero test, as not needed here
const int dim = _tzcnt_u32(mask); // for no BMI; without zero test, as not needed here

I tried '9' runs: with asserts: 10:31, without: 10:18 (saved 2%)
A minute slower than the faulty version, but still not too bad for a 2013 (Q3) CPU :)

Lord Sméagol
on /blog/119

2026-01-01

@Oleg: Happy new year!

I just added this:

#if 0
int last_idx_before_mid = 31 - __lzcnt32(off_mask); // 31 - LZCNT ==> index of MSb
#else

// if off_mask can never be zero, no need for check to override BSR result
assert(off_mask);
// a '9' run didn't reveal any 0 [you would know for sure for other sizes]

// need unsigned long result
unsigned long last_idx_before_mid;

// get index of MSb [no need for adjustment if off_mask can never be zero]
_BitScanReverse(&last_idx_before_mid, off_mask);
#endif

a run of '9' now produces the correct result: 1,730,280 :)

Lord Sméagol
on /blog/119

2025-12-06

@Oleg: Great!
Going by your time differences between single and multi-thread, I assume you are not using SMT.

My performance is limited by my old tech (Ivy Bridge E5-2697-v2) !
I have got multi-threading working in C and it looks like there are no bugs :)
But I am still not getting any benefit from HyperThreading, maybe 16 general purpose registers aren't enough for the compiler!

I have just started converting my inner search function to asm, which will give me full control of ALL registers and let me use some coding tricks that are not availble using C.

Once I get it going, I will try a '10' run, which should verify your results.

Lord Sméagol
on /blog/119

2025-12-10

@Lord Sméagol: Right, I wasn't using SMT (I meant it when I said without multithreading, sorry for the bad wording). I tried to use intel-based VM before with 4 cpu/8 threads, but the speedup was about 5.5 times only, and the price was only 20% less (I used Azure spot instances which are not so expensive, but some automation is needed to restart them every time they are stopped by Azure).
To give you all details, my program spent 21 days on an AMD EPYC 9004 (8 cores without SMT, Azure spot instance Standard F8als v6) using 8 threads (that is, about 160 CPU-days!)
I've published the source code, still planning to write about the optimizations: https://github.com/lightln2/partridge-...

(anonymous)
on /blog/119

2025-12-11

@(anonymous): Thanks for the clarification.
I'm still (slowly) building my asm funcion. I think I have settled on register allocation, leaving only rcx as a 'scratch' register because cl will be needed for some variable shifts.
I also use the xmm registers (14 so far) to minimize memory operations to hopefully let HT/SMT get some decent gains.

How long would Matt Parker's 'terrible Python code' take to solve this problem ? :)
Ok, his maths knowledge might produce some decent algorithms, but it would help him a lot to use something that compiles to native code.

Lord Sméagol
on /blog/119

Comment

Comments

Archive

May 2026

Apr 2026

Feb 2026

Dec 2025

Nov 2025

Sep 2025

Aug 2025

Jun 2025

Mar 2025

Jan 2025

Dec 2024

Nov 2024

Feb 2024

Jan 2024

Dec 2023

Nov 2023

Sep 2023

May 2023

Apr 2023

Mar 2023

Feb 2023

Jan 2023

Dec 2022

Nov 2022

Oct 2022

Mar 2022

Feb 2022

Jan 2022

Dec 2021

Nov 2021

Sep 2021

May 2021

Jan 2021

Dec 2020

Nov 2020

Jul 2020

May 2020

Mar 2020

Feb 2020

Jan 2020

Dec 2019

Nov 2019

Sep 2019

Jul 2019

Jun 2019

Apr 2019

Mar 2019

Jan 2019

Dec 2018

Nov 2018

Sep 2018

Jul 2018

Jun 2018

May 2018

Apr 2018

Mar 2018

Jan 2018

Dec 2017

Nov 2017

Jun 2017

Mar 2017

Feb 2017

Jan 2017

Dec 2016

Nov 2016

Oct 2016

Sep 2016

Jul 2016

Jun 2016

May 2016

Mar 2016

Jan 2016

Nov 2015

Oct 2015

Aug 2015

Mar 2015

Jan 2015