Comment
Comments
Comments in green were written by me. Comments in blue were not written by me.
@Lord Sméagol: Hello and happy New Year!
9 minutes is cool!
The answer is wrong because of _lzcnt instruction, as you suspected, as turns out it works differently on different cpus: https://nextmovesoftware.com/blog/2017...
With this error, solutions having 1x1 square directly in the center are not counted.
I guess, gcc/clang do it correctly because I specify -march=native (so it checks cpu and generates correct instruction), and run where I compile. But it's a potential problem I probably need to add some assertions to the code.
Maybe on your hardware you can either use WSL and clang compiler, or set constexpr bool USE_SSE_QUADRANT_FILL=false, to fall back to slower.
You could also try to use BitScanReverse instead of __lzcnt, but it has different input/output so I'm not sure how hard would that be to fix it.
9 minutes is cool!
The answer is wrong because of _lzcnt instruction, as you suspected, as turns out it works differently on different cpus: https://nextmovesoftware.com/blog/2017...
With this error, solutions having 1x1 square directly in the center are not counted.
I guess, gcc/clang do it correctly because I specify -march=native (so it checks cpu and generates correct instruction), and run where I compile. But it's a potential problem I probably need to add some assertions to the code.
Maybe on your hardware you can either use WSL and clang compiler, or set constexpr bool USE_SSE_QUADRANT_FILL=false, to fall back to slower.
You could also try to use BitScanReverse instead of __lzcnt, but it has different input/output so I'm not sure how hard would that be to fix it.
Oleg
on /blog/119
on /blog/119
@Oleg:
I removed my macros:
#define __tzcnt_u32(v) ((v) ? (_tzcnt_u32(v)) : (32))
#define __lzcnt32(v) ((v) ? (_lzcnt_u32(v)) : (32))
replacing them with simple inline code
Util.h
//#include // not including all
#include // just include what's needed
#include // just include what's needed
#include // just include what's needed
#if 1 // use safe localtime
struct tm buf; // use safe localtime
auto err = localtime_s(&buf, &cur_time); // use safe localtime
return std::put_time(&buf, "%F %T"); // use safe localtime
#else // use safe localtime
return std::put_time(std::localtime(&cur_time), "%F %T");
#endif // use safe localtime
State.h
changed _mm_set_epi8(0x80 to -0x80 to stop warnings
inline replacement:
//int i = __tzcnt_u32(mask); // for no BMI; without zero test, as not needed here
int i = _tzcnt_u32(mask); // for no BMI; without zero test, as not needed here
inline replacement:
//int last_idx_before_mid = 31 - __lzcnt32(off_mask); // for no BMI; without zero test, as not needed here
int last_idx_before_mid = _lzcnt_u32(off_mask); // for no BMI; without zero test, as not needed here
Solver.h
inline replacement:
//return ini.size(); // to stop warning
return (int)ini.size(); // to stop warning
inline replacement:
//const int dim = __tzcnt_u32(mask); // for no BMI; without zero test, as not needed here
const int dim = _tzcnt_u32(mask); // for no BMI; without zero test, as not needed here
I tried '9' runs: with asserts: 10:31, without: 10:18 (saved 2%)
A minute slower than the faulty version, but still not too bad for a 2013 (Q3) CPU :)
I removed my macros:
#define __tzcnt_u32(v) ((v) ? (_tzcnt_u32(v)) : (32))
#define __lzcnt32(v) ((v) ? (_lzcnt_u32(v)) : (32))
replacing them with simple inline code
Util.h
//#include // not including all
#include // just include what's needed
#include // just include what's needed
#include // just include what's needed
#if 1 // use safe localtime
struct tm buf; // use safe localtime
auto err = localtime_s(&buf, &cur_time); // use safe localtime
return std::put_time(&buf, "%F %T"); // use safe localtime
#else // use safe localtime
return std::put_time(std::localtime(&cur_time), "%F %T");
#endif // use safe localtime
State.h
changed _mm_set_epi8(0x80 to -0x80 to stop warnings
inline replacement:
//int i = __tzcnt_u32(mask); // for no BMI; without zero test, as not needed here
int i = _tzcnt_u32(mask); // for no BMI; without zero test, as not needed here
inline replacement:
//int last_idx_before_mid = 31 - __lzcnt32(off_mask); // for no BMI; without zero test, as not needed here
int last_idx_before_mid = _lzcnt_u32(off_mask); // for no BMI; without zero test, as not needed here
Solver.h
inline replacement:
//return ini.size(); // to stop warning
return (int)ini.size(); // to stop warning
inline replacement:
//const int dim = __tzcnt_u32(mask); // for no BMI; without zero test, as not needed here
const int dim = _tzcnt_u32(mask); // for no BMI; without zero test, as not needed here
I tried '9' runs: with asserts: 10:31, without: 10:18 (saved 2%)
A minute slower than the faulty version, but still not too bad for a 2013 (Q3) CPU :)
Lord Sméagol
on /blog/119
on /blog/119
@Oleg: Happy new year!
I just added this:
#if 0
int last_idx_before_mid = 31 - __lzcnt32(off_mask); // 31 - LZCNT ==> index of MSb
#else
// if off_mask can never be zero, no need for check to override BSR result
assert(off_mask);
// a '9' run didn't reveal any 0 [you would know for sure for other sizes]
// need unsigned long result
unsigned long last_idx_before_mid;
// get index of MSb [no need for adjustment if off_mask can never be zero]
_BitScanReverse(&last_idx_before_mid, off_mask);
#endif
a run of '9' now produces the correct result: 1,730,280 :)
I just added this:
#if 0
int last_idx_before_mid = 31 - __lzcnt32(off_mask); // 31 - LZCNT ==> index of MSb
#else
// if off_mask can never be zero, no need for check to override BSR result
assert(off_mask);
// a '9' run didn't reveal any 0 [you would know for sure for other sizes]
// need unsigned long result
unsigned long last_idx_before_mid;
// get index of MSb [no need for adjustment if off_mask can never be zero]
_BitScanReverse(&last_idx_before_mid, off_mask);
#endif
a run of '9' now produces the correct result: 1,730,280 :)
Lord Sméagol
on /blog/119
on /blog/119

The COLLAPSES initializer was easy: Use (char) for 0x80
I changed unsafe localtime to:
struct tm buf;
auto err = localtime_s(&buf, &cur_time);
return std::put_time(&buf, "%F %T");
I did a quick change of __tzcnt_u32, __lzcnt32 to use _tzcnt_u32, _lzcnt_u32
Setting the compiler to use AVX and optimize for speed.
An '8' run worked, so I tried '9'
And it found only 1,729,930 solutions!
I suspected _tzcnt_u32, _lzcnt_u32 might be causing it, so I covered that:
#define __tzcnt_u32(v) ((v) ? (_tzcnt_u32(v)) : (32)) // Match BMI : should return 32 for value 0
#define __lzcnt32(v) ((v) ? (_lzcnt_u32(v)) : (32)) // Match BMI : should return 32 for value 0
but it didn't fix it.
Anyway, I decided to test multi-threading.
First I hunted for the initial depth sweet spot:
>Puzzle_Oleg.exe run 9 8 24
>Puzzle_Oleg.exe run 9 9 24
...
>Puzzle_Oleg.exe run 9 18 24
>Puzzle_Oleg.exe run 9 19 24
And found that 15 was fastest. [It didn't like 19]
'9' run [with the 1,729,930 problem] on Xeon E5-2697-v2
12t: 11m 52s
24t: 9m 1s
so HyperThreading is helping by about 48%
on /blog/119