Hi Spp (& everyone else!),
We’ve done extensive post-mortems and analysis of the server dump logs. We’ve even had third parties review what caused the whole system to grind to a halt after receiving only 22 message packets from the outside world (given that we’ve handled up to 267 million per second in the past, 22 is a frankly pathetic number).
The answer, however, is that we simply don’t know what caused the lock-up.
In classic British understatement, this is “unfortunate”.
It could be that there’s an issue in the AoA codebase somewhere (which seems unlikely, as the network and server code has been fixed for some time now and has worked perfectly every time, even under extraordinary load).
It could be an issue with the number of cores we dedicated to this particular test (just to see if we could, we ran this last test on a machine with a single CPU core), and perhaps the cross-CPU needs of some threads caused the seizure.
It could be that we were running this test on Windows Server 2016 (which is not yet publicly available) and there’s an error down there in the depths of the kernel that we’re unaware of (and presumably MS as well, given it’s supposed to be going RTM in August!).
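For the technically curious, here’s a rough illustration of the single-core idea. This is only a sketch of how a Windows process can be confined to one core - it’s not our actual test harness, and everything in it is made up for the example:

    #include <windows.h>
    #include <cstdio>

    int main()
    {
        // Restrict the current process (and every thread it creates) to
        // logical processor 0, mimicking a single-core machine.
        DWORD_PTR singleCoreMask = 1; // bit 0 => logical processor 0
        if (!SetProcessAffinityMask(GetCurrentProcess(), singleCoreMask))
        {
            std::printf("SetProcessAffinityMask failed: %lu\n", GetLastError());
            return 1;
        }
        std::printf("Pinned to logical processor 0 - any thread that assumes it\n"
                    "can run on another core gets scheduled back onto this one.\n");
        // ... normally the server loop would start here ...
        return 0;
    }

Any thread that quietly assumes it can run alongside another thread on a different core ends up time-slicing on the same one, which is exactly the kind of situation where a latent ordering or contention bug can show itself.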
We can’t replicate the error, despite having tried quite hard to do so. We wish we could replay all the data that was coming into the server - especially packet #22 - but we can’t, as all the client/server comms are SSL encrypted, so there’s no ‘do-over’ available to reproduce the bug.
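It doesn’t help with the test that’s already happened, but for future runs one way to make an encrypted capture inspectable afterwards (assuming an OpenSSL-based TLS stack, which is purely my assumption for this sketch) is a keylog callback, which writes the per-session secrets in the standard format that tools like Wireshark can use to decrypt a packet capture taken alongside the run:

    #include <openssl/ssl.h>
    #include <cstdio>

    static FILE* g_keylog = nullptr;

    // Called by OpenSSL once per session secret; "line" is already in the NSS
    // key log format Wireshark understands (requires OpenSSL 1.1.1 or later).
    static void keylog_cb(const SSL* /*ssl*/, const char* line)
    {
        if (g_keylog)
        {
            std::fprintf(g_keylog, "%s\n", line);
            std::fflush(g_keylog);
        }
    }

    int main()
    {
        SSL_CTX* ctx = SSL_CTX_new(TLS_server_method());
        g_keylog = std::fopen("playtest_keys.log", "a"); // hypothetical file name
        SSL_CTX_set_keylog_callback(ctx, keylog_cb);

        // ... accept TLS connections with ctx as usual; a raw packet capture
        // taken during the run can later be decrypted with the logged keys ...

        SSL_CTX_free(ctx);
        if (g_keylog) std::fclose(g_keylog);
        return 0;
    }

Again: that’s an illustration of the general technique, not a description of how our network layer is actually built.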
We have both Intel & Microsoft on board to help track this down. It’s quite important, even outside of the very important universe that is AoA.
That’s the bad news, but now for the better news…
After discussions, we have collectively found something that might be a candidate for the cause of the error we experienced in the last playtest.
We can’t tell you what we think it is, because it’s quite deep down there - inside things way, way out of our control.
We have, however, decided to run the next playtest using the same codebase and infrastructure as this last test - simply with additional logging procedures in place - to see if the problem recurs and, if it does, hopefully capture the specific inbound messages that caused it.
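To give a flavour of what that “additional logging” can look like (a minimal sketch, not code from the actual AoA server - all names and sizes are illustrative): a small ring buffer of recently received, already-decrypted messages that can be flushed to disk the moment a hang is suspected.

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <fstream>
    #include <mutex>
    #include <string>
    #include <vector>

    // Keeps the last N decrypted inbound messages in memory; flush() writes
    // them out (oldest first, length-prefixed) so a suspicious run could be
    // replayed against a test build later.
    class InboundMessageLog
    {
    public:
        void record(const std::vector<std::uint8_t>& payload)
        {
            std::lock_guard<std::mutex> lock(mutex_);
            slots_[next_ % slots_.size()] = payload;
            ++next_;
        }

        void flush(const std::string& path)
        {
            std::lock_guard<std::mutex> lock(mutex_);
            std::ofstream out(path, std::ios::binary);
            const std::size_t count = next_ < slots_.size() ? next_ : slots_.size();
            const std::size_t start = next_ - count;
            for (std::size_t i = 0; i < count; ++i)
            {
                const auto& msg = slots_[(start + i) % slots_.size()];
                const std::uint32_t len = static_cast<std::uint32_t>(msg.size());
                out.write(reinterpret_cast<const char*>(&len), sizeof(len));
                out.write(reinterpret_cast<const char*>(msg.data()),
                          static_cast<std::streamsize>(msg.size()));
            }
        }

    private:
        std::mutex mutex_;
        std::array<std::vector<std::uint8_t>, 1024> slots_;
        std::size_t next_ = 0;
    };

    // Usage sketch: call record() right after decryption for every inbound
    // message, and flush("hang_dump.bin") from a watchdog when the server
    // stops responding.

The idea being: if message #22 (or its successor) strikes again, we should have the exact bytes that did it.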
We will also have a fallback plan in place (a different codebase, with a different configuration) so that, if we can’t get the system up within 10 or 15 seconds, we can redirect all the players to a “known good” hardware/OS/software configuration that we’ve used in the past, so you can all enjoy a playtest anyway.
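Conceptually the fallback logic is very simple - something along these lines, where the health probe and the redirect hook are entirely hypothetical stand-ins for whatever the login service actually does:

    #include <chrono>
    #include <cstdio>
    #include <thread>

    // Hypothetical hooks - in reality these would talk to the login service
    // or matchmaker; stubbed here so the sketch stands alone.
    static bool primary_server_is_healthy() { return false; } // pretend it never comes up
    static void redirect_players_to_known_good_cluster()
    {
        std::puts("Redirecting players to the known-good configuration...");
    }

    int main()
    {
        using namespace std::chrono;
        const auto deadline = steady_clock::now() + seconds(15);

        // Give the experimental configuration up to ~15 seconds to come up.
        while (steady_clock::now() < deadline)
        {
            if (primary_server_is_healthy())
            {
                std::puts("Experimental configuration is up - carrying on.");
                return 0;
            }
            std::this_thread::sleep_for(seconds(1));
        }

        // It didn't come up in time: fail over so the playtest happens anyway.
        std::puts("Experimental configuration didn't come up in time - failing over.");
        redirect_players_to_known_good_cluster();
        return 0;
    }

Either way, you get a playtest: best case on the instrumented build, worst case on the configuration we already know works.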
Hope that explains things, as far as I’m able to explain them.
Regards,
SC