Blog
2025-09-06
The partridge puzzle
Showing all comments about the post The partridge puzzle.
Comments
@Oleg:
I removed my macros:
#define __tzcnt_u32(v) ((v) ? (_tzcnt_u32(v)) : (32))
#define __lzcnt32(v) ((v) ? (_lzcnt_u32(v)) : (32))
replacing them with simple inline code
Util.h
//#include <bits/stdc++.h> // not including all
#include <filesystem> // just include what's needed
#include <array> // just include what's needed
#include <mutex> // just include what's needed
#if 1 // use safe localtime
struct tm buf; // use safe localtime
auto err = localtime_s(&buf, &cur_time); // use safe localtime
return std::put_time(&buf, "%F %T"); // use safe localtime
#else // use safe localtime
return std::put_time(std::localtime(&cur_time), "%F %T");
#endif // use safe localtime
State.h
changed _mm_set_epi8(0x80 to -0x80 to stop warnings
inline replacement:
//int i = __tzcnt_u32(mask); // for no BMI; without zero test, as not needed here
int i = _tzcnt_u32(mask); // for no BMI; without zero test, as not needed here
inline replacement:
//int last_idx_before_mid = 31 - __lzcnt32(off_mask); // for no BMI; without zero test, as not needed here
int last_idx_before_mid = _lzcnt_u32(off_mask); // for no BMI; without zero test, as not needed here
Solver.h
inline replacement:
//return ini.size(); // to stop warning
return (int)ini.size(); // to stop warning
inline replacement:
//const int dim = __tzcnt_u32(mask); // for no BMI; without zero test, as not needed here
const int dim = _tzcnt_u32(mask); // for no BMI; without zero test, as not needed here
I tried '9' runs: with asserts: 10:31, without: 10:18 (saved 2%)
A minute slower than the faulty version, but still not too bad for a 2013 (Q3) CPU :)
Lord Sméagol
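A minimal sketch related to the localtime change above, assuming the goal is a timestamp string that is safe to build from several threads. The helper name format_local_time is hypothetical and is not part of the original Util.h. Note that localtime_s (MSVC) and localtime_r (POSIX) take their arguments in opposite order. Also, if the enclosing function returns the result of std::put_time directly (the snippet suggests this, but the full function is not shown), the manipulator would keep a pointer to the local buf after it goes out of scope; formatting into a string inside the function sidesteps that.

```cpp
#include <ctime>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper (not from the original Util.h): formats a time_t as
// "YYYY-MM-DD HH:MM:SS" using the platform's thread-safe localtime variant.
inline std::string format_local_time(std::time_t t) {
    std::tm buf{};
#if defined(_WIN32)
    localtime_s(&buf, &t);   // Microsoft CRT: (tm*, const time_t*)
#else
    localtime_r(&t, &buf);   // POSIX: (const time_t*, tm*)
#endif
    std::ostringstream out;
    out << std::put_time(&buf, "%F %T");  // buf is still alive while it is streamed
    return out.str();
}
```

Usage would be along the lines of std::cout << format_local_time(std::time(nullptr)).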
@Oleg: Happy new year!
I just added this:
#if 0
int last_idx_before_mid = 31 - __lzcnt32(off_mask); // 31 - LZCNT ==> index of MSb
#else
// if off_mask can never be zero, no need for check to override BSR result
assert(off_mask);
// a '9' run didn't reveal any 0 [you would know for sure for other sizes]
// need unsigned long result
unsigned long last_idx_before_mid;
// get index of MSb [no need for adjustment if off_mask can never be zero]
_BitScanReverse(&last_idx_before_mid, off_mask);
#endif
a run of '9' now produces the correct result: 1,730,280 :)
Lord Sméagol
@Lord Sméagol: Hello and happy New Year!
9 minutes is cool!
The answer is wrong because of the _lzcnt instruction, as you suspected; it turns out it works differently on different CPUs: https://nextmovesoftware.com/blog/2017...
With this error, solutions having the 1x1 square directly in the center are not counted.
I guess gcc/clang do it correctly because I specify -march=native (so the compiler checks the CPU and generates the correct instruction) and run where I compile. But it's a potential problem; I probably need to add some assertions to the code.
Maybe on your hardware you can either use WSL and the clang compiler, or set constexpr bool USE_SSE_QUADRANT_FILL=false to fall back to the slower code path.
You could also try to use BitScanReverse instead of __lzcnt, but it has different input/output, so I'm not sure how hard it would be to fix.
Oleg
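A hedged sketch of how the different input/output of BitScanReverse could be hidden behind one helper. The background, as far as I know: LZCNT is encoded as BSR with a REP prefix, so a CPU without LZCNT support ignores the prefix and executes BSR, which yields the index of the highest set bit instead of the leading-zero count. The name msb_index is hypothetical and is not taken from the solver's source; the helper assumes the mask is never zero, which matches the assert in the comment above.

```cpp
#include <cassert>
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Hypothetical helper: index (0..31) of the most significant set bit.
// Precondition: mask != 0 (BSR leaves its result undefined for zero).
inline int msb_index(std::uint32_t mask) {
    assert(mask != 0);
#if defined(_MSC_VER)
    unsigned long idx;                 // _BitScanReverse writes the index here
    _BitScanReverse(&idx, mask);
    return static_cast<int>(idx);
#elif defined(__GNUC__) || defined(__clang__)
    return 31 - __builtin_clz(mask);   // the compiler picks BSR or LZCNT for the target
#else
    int idx = 0;                       // portable fallback: shift until empty
    while (mask >>= 1) ++idx;
    return idx;
#endif
}
```

With such a helper, a line like int last_idx_before_mid = msb_index(off_mask); would not depend on which instruction the CPU actually executes.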
@Oleg: I pulled your code into Visual Studio 2026 and dealt with some warnings:
The COLLAPSES initializer was easy: Use (char) for 0x80
I changed unsafe localtime to:
struct tm buf;
auto err = localtime_s(&buf, &cur_time);
return std::put_time(&buf, "%F %T");
I did a quick change of __tzcnt_u32, __lzcnt32 to use _tzcnt_u32, _lzcnt_u32
Setting the compiler to use AVX and optimize for speed.
An '8' run worked, so I tried '9'
And it found only 1,729,930 solutions!
I suspected _tzcnt_u32, _lzcnt_u32 might be causing it, so I covered that:
#define __tzcnt_u32(v) ((v) ? (_tzcnt_u32(v)) : (32)) // Match BMI : should return 32 for value 0
#define __lzcnt32(v) ((v) ? (_lzcnt_u32(v)) : (32)) // Match BMI : should return 32 for value 0
but it didn't fix it.
Anyway, I decided to test multi-threading.
First I hunted for the initial depth sweet spot:
>Puzzle_Oleg.exe run 9 8 24
>Puzzle_Oleg.exe run 9 9 24
...
>Puzzle_Oleg.exe run 9 18 24
>Puzzle_Oleg.exe run 9 19 24
And found that 15 was fastest. [It didn't like 19]
'9' run [with the 1,729,930 problem] on Xeon E5-2697-v2
12t: 11m 52s
24t: 9m 1s
so HyperThreading is helping by about 32% (712 s down to 541 s)
Lord Sméagol
@(anonymous): Thanks for the clarification.
I'm still (slowly) building my asm function. I think I have settled on register allocation, leaving only rcx as a 'scratch' register because cl will be needed for some variable shifts.
I also use the xmm registers (14 so far) to minimize memory operations to hopefully let HT/SMT get some decent gains.
How long would Matt Parker's 'terrible Python code' take to solve this problem? :)
Ok, his maths knowledge might produce some decent algorithms, but it would help him a lot to use something that compiles to native code.
Lord Sméagol
@Lord Sméagol: Right, I wasn't using SMT (I meant it when I said without multithreading, sorry for the bad wording). I tried an Intel-based VM before with 4 CPUs/8 threads, but the speedup was only about 5.5 times, and the price was only 20% lower (I used Azure spot instances, which are not so expensive, but some automation is needed to restart them every time Azure stops them).
To give you all details, my program spent 21 days on an AMD EPYC 9004 (8 cores without SMT, Azure spot instance Standard F8als v6) using 8 threads (that is, about 160 CPU-days!)
I've published the source code, still planning to write about the optimizations: https://github.com/lightln2/partridge-...
(anonymous)
@Oleg: Great!
Going by your time differences between single and multi-thread, I assume you are not using SMT.
My performance is limited by my old tech (Ivy Bridge E5-2697-v2) !
I have got multi-threading working in C and it looks like there are no bugs :)
But I am still not getting any benefit from HyperThreading, maybe 16 general purpose registers aren't enough for the compiler!
I have just started converting my inner search function to asm, which will give me full control of ALL registers and let me use some coding tricks that are not available using C.
Once I get it going, I will try a '10' run, which should verify your results.
Lord Sméagol
I think I solved the problem for n=10. I got 290,794,520 different solutions (or 36,349,315 solutions excluding symmetries). It took 3 weeks on an 8-core AMD EPYC (without multithreading).
For comparison: my program runs 2 seconds for n=8 (16 seconds single-threaded), and 15 minutes for n=9 (2 hours single-threaded). It has a number of quite tricky optimizations, I should write about those sometime.
Of course the result needs to be verified as the program is quite complex and might have bugs.
Oleg
@Danila P.: I think I have pushed VB.Net to its limit!
I moved the board from byte cells to 64 bit integer bitmaps (1 for each row);
using 8, 9, 10 variables for the 'active' rows instead of an array to reduce memory access.
I also partitioned the '9' and '10' search so other computers (I have a 6c/12t i7 3930K and a few quad cores) can assist using a mapped network drive.
All 2,332 distinct solutions of the '8' puzzle (12c/24t) now 16 secs (was 26 sec).
I have ported most of it to C ... just need to get the multi-threading working!
Lord Sméagol
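A rough sketch of the row-bitmap idea from the comment above, written in C++ rather than VB.Net. Everything here is an assumption for illustration: the Board type, its member names, and the use of a vector (the comment describes separate variables for the active rows instead of an array). It also assumes the board side is at most 64, which holds up to the '10' puzzle (side 55).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical board: one 64-bit word per row; bit c set means cell (row, c) is occupied.
struct Board {
    std::vector<std::uint64_t> rows;

    explicit Board(int side) : rows(side, 0) { assert(side >= 1 && side <= 64); }
    int side() const { return static_cast<int>(rows.size()); }

    // Bit mask covering columns [col, col + size).
    static std::uint64_t span(int col, int size) {
        const std::uint64_t ones = (size == 64) ? ~0ULL : ((1ULL << size) - 1);
        return ones << col;
    }

    // True if a size x size square with its top-left corner at (row, col) is free.
    bool fits(int row, int col, int size) const {
        if (row + size > side() || col + size > side()) return false;
        const std::uint64_t m = span(col, size);
        for (int r = row; r < row + size; ++r)
            if (rows[r] & m) return false;   // one AND tests a whole row of the square
        return true;
    }

    // Place the square, or remove it again when backtracking (XOR flips the cells).
    void toggle(int row, int col, int size) {
        const std::uint64_t m = span(col, size);
        for (int r = row; r < row + size; ++r) rows[r] ^= m;
    }
};
```

The point of the layout is that testing or placing a k x k square touches k words instead of k*k byte cells.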
@Dan: I have added symmetry optimization (still in VB.Net):
(Scanning for free space from top to bottom, left to right)
Only place 1x1 in one octant; The tests are simple, the savings are huge.
Even size: 8->36, 11->66, 12->78, 15->120
. . . . . . . . . .
. . . . . . . . . .
. . \ x x x x x . .
. . + \ x x x x . .
. . + + \ x x x . .
. . # # # # # # . .
. . # # # # # # . .
. . # # # # # # . .
. . . . . . . . . .
. . . . . . . . . .
Odd size: 9->45, 10->55, 13->91, 14->105
. . . . . . . . . . .
. . . . . . . . . . .
. . \ x x x x x x . .
. . + \ x x x x x . .
. . + + \ x x x x . .
. . = = = * x x x . .
. . # # # # # # # . .
. . # # # # # # # . .
. . # # # # # # # . .
. . . . . . . . . . .
. . . . . . . . . . .
+ No symmetry check needed
\ Check transpose only
= Check vertical flip only
* Check all 7 symmetries
. 1x1 impossible here
x don't place 1x1 here
# backtrack if 1x1 not placed yet
'8' 2,332 distinct: (12c/24t) 26 sec, (1t) 3m56s
'9' 216,285 distinct: (12c/24t) 3h32m41s, (1t) [32.2 hours estimated]
12c/24t is only getting a 9x improvement over single thread :(
I need to think about making an asm version :)
Lord Sméagol
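One way to read the two diagrams above as code, offered as a sketch and not as the commenter's actual VB.Net tests. The enum and function names are hypothetical. The border marked '.' (where a 1x1 cannot go) is not modelled; only the octant logic is. Rows and columns are 0-based.

```cpp
#include <cassert>

// Board side for the order-n puzzle: 1 + 2 + ... + n (8 -> 36, 9 -> 45).
constexpr int board_side(int n) { return n * (n + 1) / 2; }

// Hypothetical labels mirroring the legend: + \ = * x #
enum class Cell { NoCheck, Transpose, VFlip, AllSymmetries, Skip, Backtrack };

// Classify a candidate position (row r, column c) for the 1x1 square.
inline Cell classify(int side, int r, int c) {
    const bool odd = (side % 2) != 0;
    const int mid = side / 2;               // index of the middle row when side is odd
    if (odd && r == mid) {                  // the row lying on the horizontal axis
        if (c < mid) return Cell::VFlip;            // '='
        if (c == mid) return Cell::AllSymmetries;   // '*'
        return Cell::Skip;                          // 'x'
    }
    if (r >= mid) return Cell::Backtrack;   // '#': lower half reached without a 1x1
    if (c < r) return Cell::NoCheck;        // '+': strictly inside the octant
    if (c == r) return Cell::Transpose;     // '\': on the main diagonal
    return Cell::Skip;                      // 'x': outside the octant
}
```

For example, on the 45x45 board of the '9' puzzle, classify(45, 22, 22) is the single centre cell where all 7 symmetries would need checking.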
@Dan: 9.5 min on what CPU, RAM ? My rig is 3.5 GHz single thread Ivy Bridge, DDR3 1600, which will be holding me back quite a lot compared to modern kit!
The main problem with VB.Net (well, .Net itself) is only allowing you to disable integer overflow checks.
I remember VB6 let you also disable array bounds checking. That option SHOULD be available in .Net!
--> Test your code in Debug mode ... ok, it works, release mode with no checking --> let it rip!
My VB.Net prog spits out solutions as it finds them, but this hardly impacts performance. There is a single render thread that waits for a solution from a shared queue that [24 in my case] worker threads are feeding.
I would like to keep all the hot stuff in registers, but I don't think C will give me enough control over that, so maybe it's asm time!
Lord Sméagol
@Lord Sméagol: Great job! I guess VB is pretty slow.
The world record in single thread mode for the '8' puzzle is 9.5 min, which can be further improved by taking the diagonal symmetry into account.
You can find the code in the comments to my article: https://habr.com/ru/articles/889410/
Dan
I decided to write a multi-threaded Partridge solver (in VB.Net) as I have a Xeon E5-2697-v2 12c/24t.
It finds all 18,656 solutions for the '8' puzzle in 2 minutes and all 1,730,280 solutions for the '9' puzzle in 21 hours, which isn't too bad for a 13 year old PC :)
Testing the '8' solver on a single thread took just under 24 minutes (only 12 times 2 minutes, not 24 times), hinting that Hyper-threading memory bandwidth is the bottleneck, so another board storage method is needed to improve the scalability!
Lord Sméagol
Hi,
I am the author of the OEIS sequence. It's a pity that the sequence was not mentioned in Matt Parker's video.
Earlier this year I did some analysis of the solutions: https://habr.com/ru/articles/889958/
In particular, there are solutions where all squares from 1 to 9 stack in one row or column (1+2+...+9 = 45).
As for the symmetry, the proof is the following: the symmetry could be horizontal (which is nearly the same as vertical) or diagonal.
In the horizontal case, the square of size 1 must be located on the center line. It will be either next to the wall or between 2 larger squares that are centered on the center line. In both cases a lane of width 1 arises that cannot be filled with any other square.
In the diagonal case, the square of size 1 must be on the diagonal, and at first sight there is no lane of width 1. But as soon as you place all the diagonal squares and then any square adjacent to the square of size 1, such a lane arises.
Your heatmap for size 1 is great!
Danila P.
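For reference, the arithmetic behind the comment above can be written out. The first line is the standard sum-of-cubes identity that makes the puzzle possible (k squares of side k exactly fill a square of side 1+2+...+n); the second line checks it for n = 9 and, assuming the argument above that no solution is symmetric, relates the distinct and total counts quoted elsewhere in this thread.

```latex
\[
\sum_{k=1}^{n} k \cdot k^{2} \;=\; \sum_{k=1}^{n} k^{3} \;=\; \left(\frac{n(n+1)}{2}\right)^{2}
\]
\[
n = 9:\quad 1^{3} + 2^{3} + \dots + 9^{3} = 2025 = 45^{2},
\qquad 8 \times 216{,}285 = 1{,}730{,}280
\]
```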
@Chris B: I put a little more text on the interactive page. Hopefully it's a bit clearer now what it's expecting (click and click, not drag)
Matthew
In the video Matt mentions the equivalence of flipping a sub-rectangle, so that leads me to a few questions.
1) How many unique solutions are there if solutions related by flipping/rotating a sub-rectangle (or the whole thing) are considered equivalent?
To make it a proper equivalence class, you would need to be allowed multiple flips/rotations, such as flipping a sub-rectangle followed by flipping the whole thing. Which leads me to some other questions:
2) What is the greatest distance (number of flips/rotations needed) between two solutions within the same equivalence class? And 3) are there sub-rectangles that only appear after an initial flip?
Finally, 4) what happens if we extend all of the above to not just sub rectangles, but any flippable or rotatable sub shape?
Dan
@Chris B: Oh never mind, I just went back there and found the correct way to move the squares. Duh, sorry!
(anonymous)
Hi Dr Matt
Just came here from Matt Parker's YouTube post of 28th Aug re the Partridge problem. I tried - and failed at the first click - your interactive model / puzzle "squares" when trying to drag a square resulted in the "blocked" icon to say "Naah, you're trying to do something illegal". I saw something that indicated it's written in JavaScript - I do have JavaScript set as "allowed" by default so that's evidently not the problem...
Chris B

The includes got filtered:
Util.h
//#include [bits/stdc++.h] // not including all
#include [filesystem] // just include what's needed
#include [array] // just include what's needed
#include [mutex] // just include what's needed
:)