mscroggs.co.uk
mscroggs.co.uk

subscribe

Comment

Comments

Comments in green were written by me. Comments in blue were not written by me.
@Oleg: I pulled your code into Visual Studio 2026 and dealt with some warnings:

The COLLAPSES initializer was easy: Use (char) for 0x80

I changed unsafe localtime to:

struct tm buf;
auto err = localtime_s(&buf, &cur_time);
return std::put_time(&buf, "%F %T");

I did a quick change of __tzcnt_u32, __lzcnt32 to use _tzcnt_u32, _lzcnt_u32

Setting the compiler to use AVX and optimize for speed.

An '8' run worked, so I tried '9'

And it found only 1,729,930 solutions!

I suspected _tzcnt_u32, _lzcnt_u32 might be causing it, so I covered that:

#define __tzcnt_u32(v) ((v) ? (_tzcnt_u32(v)) : (32)) // Match BMI : should return 32 for value 0
#define __lzcnt32(v) ((v) ? (_lzcnt_u32(v)) : (32)) // Match BMI : should return 32 for value 0

but it didn't fix it.

Anyway, I decided to test multi-threading.
First I hunted for the initial depth sweet spot:

>Puzzle_Oleg.exe run 9 8 24
>Puzzle_Oleg.exe run 9 9 24
...
>Puzzle_Oleg.exe run 9 18 24
>Puzzle_Oleg.exe run 9 19 24

And found that 15 was fastest. [It didn't like 19]

'9' run [with the 1,729,930 problem] on Xeon E5-2697-v2

12t: 11m 52s
24t: 9m 1s

so HyperThreading is helping by about 48%
Lord Sméagol
on /blog/119
               
@Lord Sméagol: Hello and happy New Year!
9 minutes is cool!
The answer is wrong because of _lzcnt instruction, as you suspected, as turns out it works differently on different cpus: https://nextmovesoftware.com/blog/2017...
With this error, solutions having 1x1 square directly in the center are not counted.

I guess, gcc/clang do it correctly because I specify -march=native (so it checks cpu and generates correct instruction), and run where I compile. But it's a potential problem I probably need to add some assertions to the code.

Maybe on your hardware you can either use WSL and clang compiler, or set constexpr bool USE_SSE_QUADRANT_FILL=false, to fall back to slower.
You could also try to use BitScanReverse instead of __lzcnt, but it has different input/output so I'm not sure how hard would that be to fix it.
Oleg
on /blog/119
               
@Oleg:

The includes got filtered:

Util.h

//#include [bits/stdc++.h] // not including all
#include [filesystem] // just include what's needed
#include [array] // just include what's needed
#include [mutex] // just include what's needed

:)
Lord Sméagol
on /blog/119
               
@Oleg:

I removed my macros:
#define __tzcnt_u32(v) ((v) ? (_tzcnt_u32(v)) : (32))
#define __lzcnt32(v) ((v) ? (_lzcnt_u32(v)) : (32))
replacing them with simple inline code


Util.h

//#include // not including all
#include // just include what's needed
#include // just include what's needed
#include // just include what's needed

#if 1 // use safe localtime
struct tm buf; // use safe localtime
auto err = localtime_s(&buf, &cur_time); // use safe localtime
return std::put_time(&buf, "%F %T"); // use safe localtime
#else // use safe localtime
return std::put_time(std::localtime(&cur_time), "%F %T");
#endif // use safe localtime


State.h

changed _mm_set_epi8(0x80 to -0x80 to stop warnings

inline replacement:
//int i = __tzcnt_u32(mask); // for no BMI; without zero test, as not needed here
int i = _tzcnt_u32(mask); // for no BMI; without zero test, as not needed here

inline replacement:
//int last_idx_before_mid = 31 - __lzcnt32(off_mask); // for no BMI; without zero test, as not needed here
int last_idx_before_mid = _lzcnt_u32(off_mask); // for no BMI; without zero test, as not needed here


Solver.h

inline replacement:
//return ini.size(); // to stop warning
return (int)ini.size(); // to stop warning

inline replacement:
//const int dim = __tzcnt_u32(mask); // for no BMI; without zero test, as not needed here
const int dim = _tzcnt_u32(mask); // for no BMI; without zero test, as not needed here


I tried '9' runs: with asserts: 10:31, without: 10:18 (saved 2%)
A minute slower than the faulty version, but still not too bad for a 2013 (Q3) CPU :)
Lord Sméagol
on /blog/119
               
@Oleg: Happy new year!

I just added this:

#if 0
int last_idx_before_mid = 31 - __lzcnt32(off_mask); // 31 - LZCNT ==> index of MSb
#else

// if off_mask can never be zero, no need for check to override BSR result
assert(off_mask);
// a '9' run didn't reveal any 0 [you would know for sure for other sizes]

// need unsigned long result
unsigned long last_idx_before_mid;

// get index of MSb [no need for adjustment if off_mask can never be zero]
_BitScanReverse(&last_idx_before_mid, off_mask);
#endif

a run of '9' now produces the correct result: 1,730,280 :)
Lord Sméagol
on /blog/119
               

Archive

Show me a random blog post
 2026 

May 2026

World Cup stickers 2026

Apr 2026

A new puzzle every day
Mixing Wordle with other games

Feb 2026

Christmas (2025) is over
 2025 

Dec 2025

Christmas card 2025

Nov 2025

Christmas (2025) is coming!

Sep 2025

The partridge puzzle

Aug 2025

TMiP 2025 puzzle hunt

Jun 2025

A nonogram alphabet

Mar 2025

How to write a crossnumber

Jan 2025

Christmas (2024) is over
Friendly squares
 2024 

Dec 2024

A regular expression Christmas puzzle
Christmas card 2024

Nov 2024

Christmas (2024) is coming!

Feb 2024

Zines, pt. 2

Jan 2024

Christmas (2023) is over
 2023 
▼ show ▼
 2022 
▼ show ▼
 2021 
▼ show ▼
 2020 
▼ show ▼
 2019 
▼ show ▼
 2018 
▼ show ▼
 2017 
▼ show ▼
 2016 
▼ show ▼
 2015 
▼ show ▼
 2014 
▼ show ▼
 2013 
▼ show ▼
 2012 
▼ show ▼

Tags

bempp matrix multiplication talking maths in public pascal's triangle people maths world cup convergence pokémon wordle zines inverse matrices matrix of minors kings inline code hexapawn weather station youtube anscombe's quartet manchester science festival numerical analysis tetris braiding computational complexity big internet math-off arrangement puzzles triangles polynomials approximation gather town raspberry pi the aperiodical data visualisation turtles edinburgh craft squares databet menace nine men's morris finite group ternary wordle guest posts golden spiral go noughts and crosses python mean christmas card dates wool bluesky sport flexagons trigonometry correlation games machine learning 24 hour maths martin gardner dinosaurs crochet mathsteroids reddit national lottery geometry logic phd folding tube maps pythagoras logo pac-man arithmetic matrix of cofactors pokémon error bars gaussian elimination geogebra christmas live stream draughts realhats mathslogicbot coins rhombicuboctahedron asteroids coventry royal baby alphabets european cup palindromes partridge puzzle signorini conditions data reuleaux polygons oeis hyperbolic surfaces interpolation mathsjam light platonic solids advent calendar datasaurus dozen exponential growth crossnumber nonograms simultaneous equations dataset harriss spiral sobolev spaces rust runge's phenomenon london preconditioning football chebyshev weak imposition bubble bobble plastic ratio programming gerry anderson graphs speed probability manchester frobel numbers estimation ucl logs standard deviation graph theory sorting dragon curves finite element method crossnumbers warwick pi approximation day newcastle latex regular expressions bots friendly squares stirling numbers misleading statistics fonts matrices hannah fry a gamut of games quadrilaterals stickers cambridge chalkdust magazine determinants php books pi video games puzzles game show probability fractals matt parker javascript pizza cutting accuracy crosswords radio 4 propositional calculus wave scattering thirteen golden ratio rugby curvature boundary element methods electromagnetic field final fantasy london underground chess map projections tmip game of life countdown statistics captain scarlet hats folding paper fence posts errors binary sound tennis kenilworth recursion cross stitch news bodmas royal institution

Archive

Show me a random blog post
▼ show ▼
© Matthew Scroggs 2012–2026