Come ask questions of me and Jonathan Blow in San Francisco on Wednesday!

Update:  Okay, the video is up from this talk; it’s a mix of general indie game stuff, some SpyParty and The Witness stuff, and miscellaneous stuff:

Jonathan Blow and I will be answering—or trying to answer—questions from the audience on Wednesday, May 8th, at 7pm PDT in San Francisco as guests of the SF IGDA.  We did a kind of similar thing a while back on Kotaku, and it turned out pretty well, so hopefully it’ll work again this time.  It’s open to the public, and it’s hosted by Dolby Labs:

Dolby Laboratories Inc.
100 Potrero Ave, San Francisco, CA

Here’s the official announcement.  I assume the questions will mostly be about indie game development.  I hope most of the questions are about indie game development.  Come ask questions about indie game development, please.

This post was really just an excuse to show you this picture.

One Year of SpyParty

Wow, SpyParty has the best beta testers in the world!  Not only are they really patient while waiting for invites, patient while finding good repros for bugs, more patient while waiting for fixes for all the bugs they find, and patient and super helpful while mentoring new players…but they’re also hilarious and fun and creative and just generally great to hang out with!

I’ve been heads down on trying to get the lobby scalable for open beta, so I didn’t even realize today is the one year anniversary of the first batch of invites I sent to strangers!  To celebrate this auspicious day, zerotka made this amazing video and posted it on the forums; it’s simply wonderful:

Thank you so much, I love you all!  More invites soon!

Loadtesting for Open Beta, Part 3

Read Loadtesting for Open Beta, Part 1 and Part 2 to catch up on the spine-tingling story so far!

When we last left our hero, our differential state update change was a resounding success and reduced the network bandwidth utilization from 98% to 3%, and it looked like we could move on to optimizing the lobbyserver code itself to get to our goal of 1000 simultaneous loadtesting robots, until we noticed some of our robots were missing!  This led me on a wild and wooly chase through the code, which I will recount for you now…

Where’d the robots go?

The first order of business was to figure out why some robots were dying when they weren’t supposed to, and some weren’t dying when they were supposed to.  Robots: they never do what you tell them.

If you look at this graph of the number of running robots from last time, you can see that right off the bat, a bunch of them die on all the machines, and then they keep dying for about 30 seconds, and then it stabilizes.  Each of these machines should have 50 robots running solidly during the test period.

The number of loadtest robots running on each EC2 instance.

Then, to make matters worse, some of them don’t die when they’re supposed to at the end of the test.  In the graph above, they only all finally die when I kill them manually from a separate script at 03:16:30.  This points towards two different problems I’m going to have to debug on the test machines…that only manifest themselves intermittently…with gdb…in the cloud. Good times!

Okay, first things first, let’s see if the robots will tell me where they’re going.  The lobbyclient robots can output verbose log files, but I had them turned off because I was worried about bogging down the client machines.  It turns out this isn’t much of a problem, as I’ll discuss below, so I turned on logging and re-ran a test.  Then I ssh’d into one of the loadtest machines and looked at the log files.  Well, before I looked at the files themselves, I just did an ls of the directory:

-rw-r--r-- 1 root root 258577 Mar  5 03:02 out59
-rw-r--r-- 1 root root 332320 Mar  5 03:02 out60
-rw-r--r-- 1 root root 177743 Mar  5 03:02 out61
-rw-r--r-- 1 root root 181639 Mar  5 03:02 out62
-rw-r--r-- 1 root root 264535 Mar  5 03:02 out63
-rw-r--r-- 1 root root 333515 Mar  5 03:02 out64
-rw-r--r-- 1 root root 282875 Mar  5 03:02 out65
-rw-r--r-- 1 root root 271040 Mar  5 03:02 out66
-rw-r--r-- 1 root root    264 Mar  5 03:01 out67
-rw-r--r-- 1 root root    264 Mar  5 03:01 out68
-rw-r--r-- 1 root root 284838 Mar  5 03:02 out69
-rw-r--r-- 1 root root 332967 Mar  5 03:02 out70
-rw-r--r-- 1 root root 303352 Mar  5 03:02 out71
-rw-r--r-- 1 root root 310596 Mar  5 03:02 out72
-rw-r--r-- 1 root root 194669 Mar  5 03:02 out73
-rw-r--r-- 1 root root 313193 Mar  5 03:02 out74
-rw-r--r-- 1 root root 238246 Mar  5 03:02 out75
-rw-r--r-- 1 root root 264190 Mar  5 03:02 out76
-rw-r--r-- 1 root root 198096 Mar  5 03:02 out77
-rw-r--r-- 1 root root 233980 Mar  5 03:02 out78
-rw-r--r-- 1 root root    264 Mar  5 03:01 out79
-rw-r--r-- 1 root root    264 Mar  5 03:01 out80
-rw-r--r-- 1 root root 301029 Mar  5 03:02 out81
-rw-r--r-- 1 root root 299694 Mar  5 03:02 out82
-rw-r--r-- 1 root root    264 Mar  5 03:01 out83
-rw-r--r-- 1 root root 351158 Mar  5 03:02 out84
-rw-r--r-- 1 root root 188071 Mar  5 03:02 out85
-rw-r--r-- 1 root root 242228 Mar  5 03:02 out86

Well, there’s a clue, at least for the early-dyers.  The contents of those 264 byte log files look like this:

Lobby Standalone Client: 1000.0.0.5
init genrand w/0, first val is 1178568022
Running for 61 seconds.
LobbyClient started, v1000.0.0.5 / v12
LobbyClient UDP bound to port 32921
lobbyclient: sendto_kdc.c:617: cm_get_ssflags: Assertion `i < selstate->nfds' failed.

A-ha!  sendto_kdc.c is a file in the Kerberos libraries, which I use for login authentication.

I really love Kerberos.  The architecture just feels right to me, the API is simple, clean, and flexible, and it’s cross-platform and open source, so I’ve been able to contribute features and bug fixes as I’ve used it, and trace into the code when I was confused about something.  The folks at MIT who develop it are smart, knowledgeable, open-minded, and don’t mind some crazy indie game developer asking dumb questions about the best way to do things that were pretty clearly not part of the original university and enterprise use-cases.  Most importantly, it’s battle-tested; it’s used by tons of different applications, and it’s the foundation of the modern Windows domain and Xbox authentication systems, so I know it works.  The last thing you ever want to do is roll your own authentication system.

So, that assert’s the first place to look for the early-dying robots.

Next, I looked into the never-dying robots.  I logged into one of the machines that still had zombie robots1 running, ran pidof lobbyclient to figure out the process ID of one of them, and attached gdb to the robot.  A quick thread apply all backtrace full and I found the thread that was hanging while the main thread was trying to join it and exit cleanly.  It looked like the bad code was in a call to poll, and it just so happened it was in sendto_kdc.c as well!  I realized I was going to need some debug symbols, but this was easy since I build the Kerberos libraries myself,2 so after a quick scp of the debuginfo rpm and a gdb reattach, I could dig down a bit deeper.

The Kerberos libraries are built with optimizations on, which always makes debugging interesting, but I think it builds programming character to debug optimized code, so I don’t mind.3  Here’s the code in question:

    if (in->end_time.tv_sec == 0)
        timeout = -1;
    else {
        e = k5_getcurtime(&now);
        if (e)
            return e;
        timeout = (in->end_time.tv_sec - now.tv_sec) * 1000 +
            (in->end_time.tv_usec - now.tv_usec) / 1000;
    }
    /* We don't need a separate copy of the selstate for poll, but use one
     * anyway for consistency with the select wrapper. */
    *out = *in;
    *sret = poll(out->fds, out->nfds, timeout);

Well, these loadtesting machines are under some load themselves, so they can be a bit sluggish, and there’s a problem with this code in that scenario: if the call to k5_getcurtime() returns a time later than the in->end_time passed in by the caller, the computed timeout goes negative.  As it says on the poll manpage, “Specifying a negative value in timeout means an infinite timeout.”  Digging around on the stack verified the timeout was indeed negative.
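
The essence of the fix is just to clamp the computed timeout so it can never go negative; in sketch form (my paraphrase of the idea, not the actual upstream change), it’s something like this:

#include <sys/time.h>

/* Sketch of the idea behind the fix (not the actual krb5 commit): compute the
   poll() timeout in milliseconds from an absolute end time, clamping at zero
   so a machine that's already past end_time polls and returns immediately
   instead of passing a negative value, which poll() treats as "wait forever". */
static int
timeout_ms_until(const struct timeval *end_time, const struct timeval *now)
{
    long ms = (end_time->tv_sec - now->tv_sec) * 1000 +
              (end_time->tv_usec - now->tv_usec) / 1000;
    return ms < 0 ? 0 : (int)ms;
}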

Okay, so now we have a pretty good clue for each of the problems.  The second problem with the poll timeout seemed easy to fix, but the first one was pretty mysterious and might take some real debugging.  I decided to check with the krbdev mailing list to see if they had any ideas while I looked into the problems more deeply.  While doing so, I looked at the main Kerberos source repository and found a commit for the timeout problem, so it had already been fixed in a later version.  I was hoping maybe this was true of the assert as well.  True to form, the most excellent Greg Hudson replied with three more commits he thought might help.  Meanwhile, I hacked the code to loop on a call to sleep() instead of asserting to convert the early-dyers into never-dying zombies so I could attach the debugger, since that’d worked so well on the second problem.
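
The hack itself was nothing fancy, roughly this kind of thing (a sketch of the idea, not the literal patch I applied to the library):

#include <stdio.h>
#include <unistd.h>

/* Sketch of the debugging hack (not the literal patch): instead of letting
   assert() abort the robot, print a message and spin forever so gdb can be
   attached to the still-running process and its state inspected. */
#define ASSERT_OR_WAIT_FOR_GDB(cond)                                    \
    do {                                                                \
        if (!(cond)) {                                                  \
            fprintf(stderr, "assertion '%s' failed, looping for gdb\n", \
                    #cond);                                             \
            while (1)                                                   \
                sleep(1);                                               \
        }                                                               \
    } while (0)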

Sadly, although the negative-timeout-check fixed the original zombies, none of the fixes prevented the assert problem.  It wasn’t asserting anymore because the asserters were now looping, so now I had more zombies to deal with.

Lots of zombie robots!

Time to get down and dirty and debug it for real.

As an aside, it’s a weird feeling when you’re debugging something on an EC2 instance, since you’re paying for it hourly.  I felt a definite pressure to hurry up and debug faster…oh no, there went another $0.06 * 5 instances!

Too deep we delved there, and woke the nameless fear!

Like I said, debugging optimized code builds character, and I built a lot of character with this bug.  The assert was in a function that was inlined by the optimizer, which was in a function that was inlined by the optimizer, which was in a loop, which looked like it had been unrolled.  It was slow going, with lots of restarts and stuffing values into memory and registers so the code would execute again.  At one point, I thought I’d narrowed it down to a compiler bug in gcc, because it seemed like a variable wasn’t getting reloaded from the stack correctly sometimes, but it was really hard to tell with all the inlining.  Even thinking it was a compiler bug was pretty silly and that thought always violates Assume it’s a Bug, so I should have known better, but it happens. 

Finally, a combination of stepping through the code, looking at the code, and modifying the code revealed the problem.  Here’s the source file at the version I was debugging, linked to the area of the code where the bug lurked.  If you search for “host+1”, you will see that it occurs twice, once inside the loop and once outside the loop.  This is what threw me when I was debugging…initially I didn’t notice there were two separate calls to service_fds(), so in the debugger I thought it was looping again but loading weird values.  I can only assume that, for anybody but me, the second call (the one after the inner loop on hosts completes) almost never occurred in the wild, because in that case host+1 is n_conns+1, which is out-of-bounds for the connections.4  This bug was easy for me to fix locally, and it looks like it was (inadvertently?) fixed in this commit in the main Kerberos code.
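
To make the shape of the bug clearer, here’s an illustrative sketch of the pattern (this is just the structure with stand-in names, not the real krb5 code):

#include <stddef.h>

struct conn_state { int fd; };   /* stand-in for the real connection state */

/* stand-in helper: look at the first n_to_check connections, return nonzero
   when we're done waiting */
static int
service_fds(struct conn_state *conns, size_t n_to_check)
{
    (void)conns; (void)n_to_check;
    return 0;
}

/* host+1 is used as a "number of connections to look at" both inside the
   loop over hosts and again after it */
static void
wait_for_answers(struct conn_state *conns, size_t n_conns)
{
    size_t host;

    for (host = 0; host < n_conns; host++) {
        if (service_fds(conns, host + 1))   /* fine: host + 1 <= n_conns */
            return;
    }
    /* here host == n_conns, so this call looks at n_conns + 1 connections,
       one past the end of the valid set */
    if (service_fds(conns, host + 1))
        return;
}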

Thank goodness for open source code, where you can modify it and debug it when you run into troubles!

No more zombies!

Moar Robots!

Now that I (thought I) was done debugging the robots, and I still had 5 EC2 instances running, I decided to see how well the instances did with 100 robots on each.  My original tests indicated I could only run about 50 per m1.small instance, but the client also got a lot more efficient with the differential state update change described last time, and it turns out 100 per instance is no problem, as you can see here:

Top on an m1.small instance running 100 robots at only 20% CPU.

 The lobby was a little more grim with 501 clients:

500 robots and me.

 Here’s how the CPU looks with all these robots in the lobby, chatting at each other:

atop in CPU mode with 500 robots in the lobby jabbering.

There are two cores in this machine, which is why the lobbyserver is at 115% CPU.  It’s mostly single-threaded for simplicity, but it uses threads for servicing network connections.

However, once the robots start playing each other, the CPU usage drops a bunch:

Stop talking, start playing!

This is pretty good news.  I think it means the chat system needs some work, because when everybody’s in the lobby all the chats go to all the players, but when people are in a match, chats only go between those two players, and they don’t get any of the lobby chats.  We’ll find out soon, as I describe below.  Memory looks pretty good with 501 clients, staying at about 256kb per client:

2013/03/16-04:53:11: MEMORY_POSIX 501/993/492: resident 25540/25540, virtual 198000/198000
2013/03/16-04:53:11: MEMORY_NEW 501/993/492: bytes 132098963, news 69166, deletes 55478

One last atop screenshot…this one is while the robots are starting up and connecting, but before they’re in the lobby:

Loadtest startup performance.

This one shows Kerberos and OpenLDAP taking a fair amount of time at the start of a new loadtest.  I use LDAP as the database backend for Kerberos, among other things, and when all of these robots are trying to get login tickets at the same time, it bogs down a bit.  I’m not too worried about this profile, since this scenario of 500 people all needing tickets at the same time is going to be rare (the tickets last a while, so this doesn’t happen every time), and there are well-known ways of scaling Kerberos and OpenLDAP if I need them.

Finally, here’s a shot of the 100 robots per instance:

Wait a second…

Oh no!  Who the hell is that single zombie robot at the end on instance 4!?!  Sigh.  I find that machine, log in, attach the debugger, and check it out.  It looks like I have a pretty rare deadlock between two threads during shutdown.  I’m just going to ignore it for now and deal with it later.  All the bugs above were preventing robots from doing a good job at loadtesting, while this one is just preventing 1 out of 500 from shutting down completely…it can wait.  Here’s a shot of this guy, still in the lobby, mocking me:

At least I have one more Sniper win on this debug server than this troll!

There’s actually another bug I found in the new differential state update code while I was testing this, where the server will sometimes send a duplicate client, but I had a comment in the code noting I thought that might be possible, and now I know it is.  It turns out when you have 500 clients pounding on a server, you find bugs.

Coming Up Next Time

Okay, so now we’ve got things where I can easily run a predictable number of loadtesting robots against the debug lobbyserver, and I’ve got some high level profiles telling me that I’m now CPU bound inside the server itself.  That points to a clear next step:  profile the code.  I use an old hacked up version of Sean Barrett’s iprof for all my client runtime profiling, so my next task is to integrate that into the server code, and get it running on Linux.  That shouldn’t be too hard, and then I’ll be able to tell what’s actually taking the time5 when a lot of clients are in the lobby.

My prediction, based on the above, is that the chat message handling is going to be the main culprit.  If so, it’ll be easy to queue up the chats and send them out in bunches, but I need to be careful here, because the robots chat a lot more than real humans would right now, so I don’t want to spend too much time optimizing this.  I think I’ll keep the robots as they are for the initial profiles, and then dial back their chattiness to more realistic levels after I’ve plucked the low-hanging chat fruit.  I also need to teach the robots how to use lobby rooms for a more realistic test.
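
If the profile does point at chat, the batching idea is simple enough; here’s a rough sketch with made-up names and structure (not the actual lobbyserver code):

/* Rough sketch of the batching idea (made-up names, not the lobbyserver
   code): instead of broadcasting each chat to every lobby player as it
   arrives, queue chats during the tick and flush them to everybody as a
   single packet. */
#define MAX_QUEUED_CHATS 64

struct chat_queue {
    int         count;
    const char *messages[MAX_QUEUED_CHATS];
};

static void
queue_chat(struct chat_queue *q, const char *message)
{
    if (q->count < MAX_QUEUED_CHATS)
        q->messages[q->count++] = message;
    /* else: flush early or drop, depending on policy */
}

static void
flush_chats_to_lobby(struct chat_queue *q)
{
    if (q->count == 0)
        return;
    /* build one TYPE_LOBBY_MESSAGE_PACKET containing all q->count chats and
       send it to each lobby player once, instead of one packet per chat */
    q->count = 0;
}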

Finally, I’m wondering if my usage of select() is going to be an issue as I get close to 1000 robots.  I may need to port to epoll().  We shall see!
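
For reference, the core of an epoll() loop is pretty small; here’s a minimal sketch of the standard Linux API usage (not my actual lobbyserver networking code):

#include <sys/epoll.h>

/* Minimal epoll() sketch (standard Linux API usage, not the lobbyserver's
   code): unlike select(), epoll has no FD_SETSIZE-style limit (commonly 1024
   descriptors) and doesn't rescan every descriptor on each call. */
#define MAX_EVENTS 64

static void
event_loop(int listen_fd)
{
    struct epoll_event ev, events[MAX_EVENTS];
    int epfd = epoll_create1(0);

    ev.events = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    for (;;) {
        int i, n = epoll_wait(epfd, events, MAX_EVENTS, -1);

        for (i = 0; i < n; i++) {
            if (events[i].data.fd == listen_fd) {
                /* accept() the new connection and EPOLL_CTL_ADD its fd */
            } else {
                /* recv() and handle data on events[i].data.fd */
            }
        }
    }
}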

“Assume Nothing!”

And finally, the SimCity launch has given me pause…I’m still forging ahead with my 1000 simultaneous goal, but I really hope it’s enough and things go smoothly.  I would much rather have a slow buildup of players over the next year as I roll out more cool stuff than a giant spike that melts everything and makes players grumpy.

On to Part 4…

  1. ZOMBIE ROBOTS!!!
  2. I have some local patches I haven’t cleaned up enough to contribute yet
  3. gdb is not the best for assembly language debugging, but I did learn about “layout asm”, which helps a bit.
  4. It never crashed because conns has a preallocated number of connections that was always bigger than n_conns+1
  5. This is only partially true, because iprof is single-threaded…I really wish there was a good cross-platform light-weight way to get per-thread timings.

Come to drawnonward’s 100th stream party on Thursday, March 7th at midnight PST!

I guess Thursday midnight is technically Friday morning, unless midnight is defined as 00:00-ε or something, but anyway, the Official SpyParty Late Night Streamer™ drawnonward is celebrating his 100th stream on Thursday night at midnight, Pacific Standard Time (GMT-8), which means it’s Friday at 8am GMT (I think, unless I messed that math up), on his twitch.tv channel:

http://www.twitch.tv/drawnonward

It should be fun and rather silly…he posted in the beta forums asking for ideas and people are gifting him random Steam stuff to give away to people in stream chat, he’s going to dress up as one of the current SpyParty characters (he’s not telling which one!), and I’ll be joining him from midnight to 1am to play on his stream. Sadly, my internet this week is kind of slow and terrible, so I don’t think I can stream the other side on the SpyParty channel, but I should be able to join him via voice chat.  My personal dream is that he’ll actually start using a non-default icon/image for his channel by then.

Come along, hang out in the stream chat, and it should be a good time. The big question is whether I’ll have this build done and tested before then. Maybe you can watch a brand new build fail spectacularly, live, while I curse over TeamSpeak!

PS. I think I have figured out who is killing my robots, so Part 3 of the Loadtesting for Open Beta series will be posted soon!

Loadtesting for Open Beta, Part 2

In our last exciting episode of Loadtesting for Open Beta, we did some initial profiling to see how the lobbyserver held up under attack by a phalanx of loadtesting robots spawned in the cloud. It didn’t hold up, obviously, or the beta would already be open.

Specifically, it failed by saturating the server’s 100Mbps network link, which turned out to be a great way to fail because it meant there were some pretty simple things I could do to optimize the bandwidth utilization.  I had done the initial gamelobby protocol in the simplest way possible, so every time any player state changed, like a new connection, or switching from chatting in the lobby to playing, it sent out the entire list of player states to everybody.  This doesn’t scale at all, since as you add more players, most aren’t changing state, but you’re sending all of their states out to everybody even if only one changes.  This doesn’t mean it was the wrong way to program it initially; it’s really important when you’re writing complicated software1 to do things the simplest way possible, as long as you have a vague plan for what you’ll do if it turns into a problem later.  In this case, I knew what I was doing was probably not going to work in the long run, but it got things up and running more quickly than overengineering some fancy solution I might not have needed, and I waited until it actually was a problem before fixing it.

Tell Me Something I Don’t Know

The solution to this problem is pretty obvious: differential state updates.  Or, in English, only send the stuff that’s changed to the people who care about it.  Doing differential updates is significantly more complicated than just spamming everybody with everything, however.  You still have to send the initial state of all the current players when new players log in, you have to be able to add and remove players in the protocol, which you didn’t have to do before because you were just sending the complete new state every time, etc.

This was going to be a fairly large change, so I took it by steps.  I knew that I’d have to send out the complete state of everybody to new logins, so it made sense to start by optimizing that initial packet using normal data size optimization techniques.  I pretty easily got it from about 88 bytes per player down to 42 bytes per player, which is nice, because my goal for these optimizations is 1000 simultaneous players, and at 88 bytes they wouldn’t all fit in my 64kb maximum packet size (1000 * 88 = 88,000 bytes, which is over the 65,536 byte limit), whereas at 42 bytes they should fit, no problem (only 42,000 bytes), so I don’t have to add any kind of break-up-the-list-across-packets thing.  However, it turns out I actually got the ability to send the entire list across multiple packets while I was doing this, because I had to program the ability to add players as part of the differential updates, so now I could just use that packet type to send any clients in a really large player list that didn’t fit in a single packet.  But, like I said in the last episode, although I don’t think I’ll hit 1000 simultaneous outside of load testing for a while, it’s always nice to know you have that sort of thing in your back pocket for the future.

Once I’d tested the new optimized player list, I started making the updates differential.  New players get the initial list, and then they’re considered up-to-date and just get updates along with everybody else.  The list of new players is sent as additions to players already in the lobby.  For each player, I track some simple flags about what’s been updated in their state, so if they set or clear their /away message for example, that flag is set, and I only send that information.
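
Conceptually, the flag tracking is just something like this (a simplified sketch with made-up names, not the actual protocol code):

#include <stdio.h>

/* Simplified sketch of the dirty-flag idea (made-up names, not the actual
   lobbyserver code): each player carries a bitmask of which pieces of state
   changed since the last update, and the differential packet only includes
   those fields. */
enum {
    DIRTY_AWAY_MESSAGE = 1 << 0,
    DIRTY_STATS        = 1 << 1,
    DIRTY_MATCH_STATE  = 1 << 2,
};

struct player {
    unsigned dirty;              /* which fields changed since the last send */
    char     away_message[64];
    int      wins;
    /* ... */
};

static void
set_away_message(struct player *p, const char *message)
{
    snprintf(p->away_message, sizeof p->away_message, "%s", message);
    p->dirty |= DIRTY_AWAY_MESSAGE;      /* only this field gets sent */
}

static void
append_player_update(struct player *p /*, packet buffer, ... */)
{
    if (p->dirty & DIRTY_AWAY_MESSAGE) { /* write away message into packet */ }
    if (p->dirty & DIRTY_STATS)        { /* write wins, game time, etc. */ }
    if (p->dirty & DIRTY_MATCH_STATE)  { /* write lobby/playing state */ }
    p->dirty = 0;                        /* player is now up-to-date */
}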

In programming, usually when you’ve got the right design, you get some unintentional upside, and this case was no different.  Previously, I was not sending live updates to player stats (wins, game time, etc.) to the players in the lobby until the player was done playing the match, or some other state changed that caused everybody’s state to be re-sent.  Now, since the differential updates are efficient, I’m updating player stats in real time as well, so people in the lobby can see wins as they accumulate for players in matches, which is nice and how you’d expect it to work.

Results

It basically worked exactly as planned.  After lots of debugging, of course.  Here you can see the profiles for one of the loadtests, which got to 340 simultaneous players in the lobby:

I really need to have the robot Sniper win sometimes.

 

atop in memory mode

 

atop in cpu mode

Look ma, 3% network utilization!  That’s what’s so awesome about a really spiky profile…when you pound one of the spikes down, things just get better!

Here’s the new table of packet sizes for this run.  If you compare this with the previous results, you can see the PLAYER_LIST packets are way way way smaller, and this table was accumulated from two longer test runs, so it’s not even a fair comparison!  It’s interesting, because the TYPE_LOBBY_MESSAGE_PACKET is smaller as well, and I think that’s because now the robots can actually start games since the network isn’t saturated, and this means they don’t broadcast chats to the entire lobby while they’re playing, so that’s a nice side effect of optimizing the bandwidth.

Packet Type Total Bytes
TYPE_LOBBY_MESSAGE_PACKET 58060417
TYPE_LOBBY_PLAYER_LIST_UPDATE_PACKET 29751413
TYPE_CLIENT_GAME_JOURNAL_PACKET 18006186
TYPE_LOBBY_ROOM_LIST_PACKET 16674479
TYPE_LOBBY_PLAYER_LIST_ADDITION_PACKET 4280563
TYPE_LOBBY_PLAYER_LIST_PACKET 3482691
TYPE_CLIENT_MESSAGE_PACKET 1501822
TYPE_CLIENT_LOGIN_PACKET 477356
TYPE_CLIENT_INVITE_PACKET 435368
TYPE_LOBBY_INVITE_PACKET 275781
TYPE_LOBBY_LOGIN_PACKET 235878
TYPE_LOBBY_GAME_ID_PACKET 96000
TYPE_LOBBY_GAME_OVER_PACKET 68901
TYPE_CLIENT_GAME_ID_CONFIRM_PACKET 40257
TYPE_LOBBY_PLAY_PACKET 32498
TYPE_CLIENT_IN_MATCH_PACKET 25714
TYPE_LOBBY_IN_MATCH_PACKET 21204
TYPE_CLIENT_CANDIDATE_PACKET 16089
TYPE_CLIENT_PLAY_PACKET 12419
TYPE_CLIENT_GAME_ID_REQUEST_PACKET 9610
TYPE_LOBBY_WELCOME_PACKET 4494
TYPE_CLIENT_JOIN_PACKET 4494
TYPE_KEEPALIVE_PACKET 1011
TYPE_CLIENT_IDLE_PACKET 24

Hmm, I just noticed as I’m writing this that the resident memory utilization in the atop screenshot is way lower now than before…I wonder why… On the application side I take about 250kb per player right now, which at 340 players should be about 85MB.  Looking at the lobbyserver logs, right about when the screenshot was taken, the lobby self-reported this data:

2013/03/03-02:13:15: MEMORY_POSIX 348/757/409: resident 12808/12808, virtual 160276/160276
2013/03/03-02:13:15: MEMORY_NEW 348/757/409: bytes 91766974, news 45707, deletes 36155

The MEMORY_NEW stats look about right for this load and my quick math, but the MEMORY_POSIX stats—which are read from /proc/pid/status—match the atop results: the expected virtual size, but low resident.  Maybe it was just paged out for a second, or maybe I’m not touching much of that 250kb, so it doesn’t stay resident.  A lot of it is network buffers, so it makes some sense that less of it would be resident with this lower-bandwidth protocol than in the last profile, because less buffering needs to be done.  I’ll have to investigate this more.
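
(For reference, the MEMORY_POSIX numbers come out of /proc/pid/status, presumably the VmSize and VmRSS lines; a minimal sketch of that kind of read, not my exact logging code, looks like this:)

#include <stdio.h>

/* Minimal sketch of reading virtual/resident sizes from /proc/self/status
   (not the exact lobbyserver logging code): the kernel reports VmSize and
   VmRSS in kB. */
static void
read_memory_kb(long *virtual_kb, long *resident_kb)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");

    *virtual_kb = *resident_kb = 0;
    if (!f)
        return;
    while (fgets(line, sizeof line, f)) {
        if (sscanf(line, "VmSize: %ld", virtual_kb) == 1)
            continue;
        sscanf(line, "VmRSS: %ld", resident_kb);
    }
    fclose(f);
}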

Up Next, The Case of the Missing Robots

So, the bandwidth optimizations were a resounding success!  Plus, both the CPU and memory utilization of the lobbyserver are really reasonable and haven’t been optimized at all, so we’re sitting pretty for getting to 1000 simultaneous robots…

Except, where are the remaining 160 robots?  In the test above, I ran 10 EC2 instances, each with 50 robots, thinking the optimizations might let me get to 500 simultaneous and find the next performance issue…but it never got above 340 in the lobby.  I updated my perl loadtesting framework and had each instance output how many lobbyclients were running every two seconds with this shell command over ssh:

'while true; do echo `date +%T`,`pidof lobbyclient | wc -w`; sleep 2; done'

And then I loaded that into gnuplot,2 and graphed the number of robots on each instance:

The number of loadtest robots running on each EC2 instance.

You can see that they all started up with 50, but then a bunch of them lost clients until they found a steady state.   Something is killing my robots, and I need to figure out what it is…

Turn the page to Part 3…

  1. especially by yourself!
  2. …which I hate, but I forgot to install excel on my new laptop, and Google’s spreadsheet sucks at pivottables, and the Office for Web excel doesn’t even have them as far as I could tell!