Read Loadtesting for Open Beta, Part 1 and Part 2 to catch up on the spine-tingling story so far!
When we last left our hero, our differential state update change was a resounding success and reduced the network bandwidth utilization from 98% to 3%, and it looked like we could move on to optimizing the lobbyserver code itself to get to our goal of 1000 simultaneous loadtesting robots, until we noticed some of our robots were missing! This led me on a wild and wooly chase through the code, which I will recount for you now…
Where’d the robots go?
The first order of business was to figure out why some robots were dying when they weren’t supposed to, and some weren’t dying when they were supposed to. Robots: they never do what you tell them.
If you look at this graph of the number of running robots from last time, you can see that right off the bat, a bunch of them die on all the machines, and then they keep dying for about 30 seconds, and then it stabilizes. Each of these machines should have 50 robots running solidly during the test period.
The number of loadtest robots running on each EC2 instance.
Then, to make matters worse, some of them don’t die when they’re supposed to at the end of the test. In the graph above, they only all finally die when I kill them manually from a separate script at 03:16:30. This points towards two different problems I’m going to have to debug on the test machines…that only manifest themselves intermittently…with gdb…in the cloud. Good times!
Okay, first things first, let’s see if the robots will tell me where they’re going. The lobbyclient robots can output verbose log files, but I had them turned off because I was worried about bogging down the client machines. It turns out this isn’t much of a problem as I’ll discuss below, so I turned on logging and re-ran a test. Then I ssh’d into one of the servers, and looked at the log files. Well, before I looked the files themselves, I just did an ls of the directory:
-rw-r--r-- 1 root root 258577 Mar 5 03:02 out59
-rw-r--r-- 1 root root 332320 Mar 5 03:02 out60
-rw-r--r-- 1 root root 177743 Mar 5 03:02 out61
-rw-r--r-- 1 root root 181639 Mar 5 03:02 out62
-rw-r--r-- 1 root root 264535 Mar 5 03:02 out63
-rw-r--r-- 1 root root 333515 Mar 5 03:02 out64
-rw-r--r-- 1 root root 282875 Mar 5 03:02 out65
-rw-r--r-- 1 root root 271040 Mar 5 03:02 out66
-rw-r--r-- 1 root root 264 Mar 5 03:01 out67
-rw-r--r-- 1 root root 264 Mar 5 03:01 out68
-rw-r--r-- 1 root root 284838 Mar 5 03:02 out69
-rw-r--r-- 1 root root 332967 Mar 5 03:02 out70
-rw-r--r-- 1 root root 303352 Mar 5 03:02 out71
-rw-r--r-- 1 root root 310596 Mar 5 03:02 out72
-rw-r--r-- 1 root root 194669 Mar 5 03:02 out73
-rw-r--r-- 1 root root 313193 Mar 5 03:02 out74
-rw-r--r-- 1 root root 238246 Mar 5 03:02 out75
-rw-r--r-- 1 root root 264190 Mar 5 03:02 out76
-rw-r--r-- 1 root root 198096 Mar 5 03:02 out77
-rw-r--r-- 1 root root 233980 Mar 5 03:02 out78
-rw-r--r-- 1 root root 264 Mar 5 03:01 out79
-rw-r--r-- 1 root root 264 Mar 5 03:01 out80
-rw-r--r-- 1 root root 301029 Mar 5 03:02 out81
-rw-r--r-- 1 root root 299694 Mar 5 03:02 out82
-rw-r--r-- 1 root root 264 Mar 5 03:01 out83
-rw-r--r-- 1 root root 351158 Mar 5 03:02 out84
-rw-r--r-- 1 root root 188071 Mar 5 03:02 out85
-rw-r--r-- 1 root root 242228 Mar 5 03:02 out86
Well, there’s a clue, at least for the early-dyers. The contents of those 264 byte log files look like this:
Lobby Standalone Client: 1000.0.0.5
init genrand w/0, first val is 1178568022
Running for 61 seconds.
LobbyClient started, v1000.0.0.5 / v12
LobbyClient UDP bound to port 32921
lobbyclient: sendto_kdc.c:617: cm_get_ssflags: Assertion `i < selstate->nfds' failed.
A-ha! sendto_kdc.c is a file in the Kerberos libraries, which I use for login authentication.
I really love Kerberos, the architecture just feels right to me, the API is simple, clean, and flexible, it’s cross-platform and open source, so I’ve been able to contribute features and bug fixes as I’ve used it and trace into the code when I was confused about something, and the folks at MIT that develop it are smart, knowledgeable, open-minded, and don’t mind some crazy indie game developer asking dumb questions about the best way to do things that were pretty clearly not part of the original university and enterprise use-cases. Most importantly, it’s battle-tested; it’s used by tons of different applications, and it’s the foundation of the modern Windows domain and Xbox authentication systems, so I know it works. The last thing you ever want to do is roll your own authentication system.
So, that assert’s the first place to look for the early-dying robots.
Next, I looked into the never-dying robots. I logged into one of the machines that still had zombie robots running, ran pidof lobbyclient to figure out the process ID of one of them, and attached gdb to the robot. A quick thread apply all backtrace full and I found the thread that was hanging while the main thread was trying to join them and exit cleanly. It looked like the bad code was in a call to poll, and it just so happened it was in sendto_kdc.c as well! I realized I was going to need some debug symbols, but this was easy since I build the Kerberos libraries myself, so a quick scp of the debuginfo rpm and reattaching gdb and I could dig down a bit deeper.
The Kerberos libraries are built with optimizations on, which always makes debugging interesting, but I think it builds programming character to debug optimized code, so I don’t mind. Here’s the code in question:
if (in->end_time.tv_sec == 0)
timeout = -1;
else {
e = k5_getcurtime(&now);
if (e)
return e;
timeout = (in->end_time.tv_sec - now.tv_sec) * 1000 +
(in->end_time.tv_usec - now.tv_usec) / 1000;
}
/* We don't need a separate copy of the selstate for poll, but use one
* anyone for consistency with the select wrapper. */
*out = *in;
*sret = poll(out->fds, out->nfds, timeout);
Well, these loadtesting machines are under some load themselves so they can be a bit sluggish, and there’s a problem with this code in that scenario if the call to k5_getcurtime() happens later than the in->end_time passed in by the caller. As it says on the poll manpage, “Specifying a negative value in timeout means an infinite timeout.” Digging around on the stack verified the timeout was negative.
Okay, so now we have a pretty good clue for each of the problems. The second problem with the poll timeout seemed easy to fix, but the first one was pretty mysterious and might take some real debugging. I decided to check with the krbdev mailing list to see if they had any ideas while I looked into the problems more deeply. While doing so, I looked at the main Kerberos source repository and found a commit for the timeout problem, so it had already been fixed in a later version. I was hoping maybe this was true of the assert as well. True to form, the most excellent Greg Hudson replied with three more commits he thought might help. Meanwhile, I hacked the code to loop on a call to sleep() instead of asserting to convert the early-dyers into never-dying zombies so I could attach the debugger, since that’d worked so well on the second problem.
Sadly, although the negative-timeout-check fixed the original zombies, none of the fixes prevented the assert problem. It wasn’t asserting anymore because the asserters were now looping, so now I had more zombies to deal with.
Lots of zombie robots!
Time to get down and dirty and debug it for real.
As an aside, it’s a weird feeling when you’re debugging something on an EC2 instance, since you’re paying for it hourly. I felt a definite pressure to hurry up and debug faster…oh no, there went another $0.06 * 5 instances!
Too deep we delved there, and woke the nameless fear!
Like I said, debugging optimized code builds character, and I built a lot of character with this bug. The assert was in a function that was inlined by the optimizer, which was in a function that was inlined by the optimizer, which was in a loop, which looked like it had been unrolled. It was slow going, with lots of restarts and stuffing values into memory and registers so the code would execute again. At one point, I thought I’d narrowed it down to a compiler bug in gcc, because it seemed like a variable wasn’t getting reloaded from the stack correctly sometimes, but it was really hard to tell with all the inlining. Even thinking it was a compiler bug was pretty silly and that thought always violates Assume it’s a Bug, so I should have known better, but it happens.
Finally, a combination of stepping through the code, and looking at the code, and modifying the code revealed the problem. Here’s the source file at the version I was debugging, linked to the area of the code where the bug lurked. If you search for “host+1”, you will see that it occurs twice, once inside the loop, and once outside the loop. This is what threw me when I was debugging…initially I didn’t notice there were two separate calls to service_fds(), so in the debugger I thought it was looping again but loading weird values. I can only assume the second call almost never occurred in the wild for anybody but me after the inner loop on hosts completed, because in that case host+1 is n_conns+1, which is out-of-bounds for the connections. This bug was easy for me to fix locally, and it looks like it was (inadvertently?) fixed in this commit in the main Kerberos code.
Thank goodness for open source code, where you can modify it and debug it when you run into troubles!
No more zombies!
Moar Robots!
Now that I (thought I) was done debugging the robots, and I still had 5 EC2 instances running, I decided to see how well the instances did with 100 robots on each. My original tests indicated I could only run about 50 per m1.small instance, but the client also got a lot more efficient with the differential state update change described last time, and it turns out 100 per instance is no problem, as you can see here:
Top on an m1.small instance running 100 robots at only 20% CPU.
The lobby was a little more grim with 501 clients:
500 robots and me.
Here’s how the CPU looks with all these robots in the lobby, chatting at each other:
atop in CPU mode with 500 robots in the lobby jabbering.
There are two cores in this machine, which is why the lobbyserver is at 115% CPU. It’s mostly single-threaded for simplicity, but it uses threads for servicing network connections.
However, once the robots start playing each other, the CPU usage drops a bunch:
Stop talking, start playing!
This is pretty good news. I think it means the chat system needs some work, because when everybody’s in the lobby all the chats go to all the players, but when people in are a match, chats only go between those two players, and they don’t get any of the lobby chats. We’ll find out soon as I describe below. Memory looks pretty good with 501 clients, staying at about 256kb per client:
2013/03/16-04:53:11: MEMORY_POSIX 501/993/492: resident 25540/25540, virtual 198000/198000
2013/03/16-04:53:11: MEMORY_NEW 501/993/492: bytes 132098963, news 69166, deletes 55478
One last atop screenshot…this one is while the robots are starting up and connecting, but before they’re in the lobby:
Loadtest startup performance.
This one shows Kerberos and OpenLDAP taking a fair amount of time at the start of a new loadtest. I use LDAP as the database backend for Kerberos, among other things, and when all of these robots are trying to get login tickets at the same time, it bogs down a bit. I’m not too worried about this profile, since this scenario of 500 people all needing tickets at the same time is going to be rare (the tickets last a while, so this doesn’t happen every time), and there are well-known ways of scaling Kerberos and OpenLDAP if I need them.
Finally, here’s a shot of the 100 robots per instance:
Wait a second…
Oh no! Who the hell is that single zombie robot at the end on instance 4!?! Sigh. I find that machine, log in, attach the debugger, and check it out. It looks like I have a pretty rare deadlock between two threads during shutdown. I’m just going to ignore it for now and deal with it later. All the bugs above were preventing robots from doing a good job at loadtesting, while this one is just preventing 1 out of 500 from shutting down completely…it can wait. Here’s a shot of this guy, still in the lobby, mocking me:
At least I have one more Sniper win on this debug server than this troll!
There’s actually another bug I found in the new differential state update code while I was testing this, where the server will send a duplicate client sometimes, but I had a comment in the code that I thought it might be possible, and now I know it is. It turns out when you have 500 clients pounding on a server, you find bugs.
Coming Up Next Time
Okay, so now we’ve got things where I can easily run a predictable number of loadtesting robots against the debug lobbyserver, and I’ve got some high level profiles telling me that I’m now CPU bound inside the server itself. That points to a clear next step: profile the code. I use an old hacked up version of Sean Barrett’s iprof for all my client runtime profiling, so my next task is to integrate that into the server code, and get it running on Linux. That shouldn’t be too hard, and then I’ll be able to tell what’s actually taking the time when a lot of clients are in the lobby.
My prediction, based on the above, is that the chat message handling is going to be the main culprit. If so, it’ll be easy to queue up the chats and send them out in bunches, but I need to be careful here, because the robots chat a lot more than real humans would right now, so I don’t want to spend too much time optimizing this. I think I’ll keep the robots as they are for the initial profiles, and then dial back their chattiness to more realistic levels after I’ve plucked the low-hanging chat fruit. I also need to teach the robots how to use lobby rooms for a more realistic test.
Finally, I’m wondering if my usage of select() is going to be an issue as I get close to 1000 robots. I may need to port to epoll(). We shall see!
“Assume Nothing!”
And finally, the SimCity launch has given me pause…I’m still forging ahead with my 1000 simultaneous goal, but I really hope it’s enough and things go smoothly. I would much rather have a slow buildup of players over the next year as I roll out more cool stuff than a giant spike that melts everything and makes players grumpy.
On to Part 4…