Optimizing Your Application: The Full Story

This talk was given on Mar 19, 2020. Slides are available here.

I recently undertook a large performance optimisation project on WorldEdit (a large Minecraft mod with many millions of users), which required utilising numerous methods of optimisation. This talk goes over the process and discusses the findings.

This talk covers how to undertake optimisation projects, as well as the tools required. Some of the key points covered are:

  • CPU Profilers
  • Finding hot spots
  • Design decisions
  • Environmental Impact
  • Garbage Collectors
  • Third-Party Software
  • Optimal VM Flags

Looking at optimisation from a non-development perspective is essential as well, so relevant product and design goals are discussed in this talk.

About the Author

Maddy Miller

Hi, I'm Maddy Miller, a Senior Software Engineer at Clipchamp at Microsoft. In my spare time I love writing articles, and I also develop the Minecraft mods WorldEdit, WorldGuard, and CraftBook. My opinions are my own and do not represent those of my employer in any capacity.

Like this talk? Why not share it with others!

Transcript

Hey everyone, I'm giving a talk on how to optimize your application and the full story behind it.

Just a little about me: I'm a software engineer at Clipchamp; I graduated from QUT in 2018; I'm a former president of Code Network, the computing society at QUT; I've been programming since around the age of seven; and I'm a developer of WorldEdit, WorldGuard, and CraftBook. These are large Minecraft mods. I've been involved with them all since around 2011, as CraftBook lead since 2011 and a WorldEdit developer since 2016. Together they have around 4 billion downloads and 17 million end-users. These stats are really sketchy; I don't know how accurate they are realistically, so just take them as completely random numbers. And yeah, I was formerly a developer for The Powder Toy, and also I'm really into emojis.

So just a quick thing, because this is a remote talk: I'm not too used to doing these, so I can't really judge the audience. If anyone has any questions or anything, just put them in the chat and I will try to address any questions or concerns.

Yeah, so this talk basically walks through a real-life optimization project of the Minecraft mod WorldEdit. It covers using a profiler, how to look for hot spots, and how to fix them once they're identified. It also discusses the key takeaways and other findings. And yeah, feel free to just ask any questions during the talk. Not the raise-your-hand thing, because I won't see you; I can't really see everyone when it's remote.

Yeah, so just some quick background info on what WorldEdit is, because this talk uses it as an example. It is a Minecraft mod used for large-scale terraforming of the terrain. Minecraft's a block-based game; this lets you modify the blocks. So basically you can perform potentially tens of millions of operations in very short amounts of time. That's a very large operation and therefore can be very slow. Minecraft itself [coughs sorry] has been getting slower with each update, and WorldEdit, therefore, has as well. WorldEdit also makes many trade-offs that improve memory usage at the cost of performance.

And yeah, so the first question you should always ask when wanting to optimize something is why. It's important to have a reason to actually dedicate time towards optimization. If you don't, you're basically just optimizing for the wrong reasons. WorldEdit is currently known as being pretty slow; it's not a good look. People now expect it to kick them off their servers due to it timing out, and that's not something people should have to expect; normal usage shouldn't just break everything. Also, a few unsafe forks of the software have popped up that claim better performance and cause a fair few unexpected issues. And it's a really bad user experience.

So the first steps that I took once I actually decided to optimize this were coming up with a standard test that could be reproduced for higher accuracy. What I chose was setting a hundred million blocks to stone. It doesn't really matter if you don't know what that is; just know that it's a standardized test. A specific area of a specific world was also chosen to minimize the impact of external variables. Again, it doesn't matter if you don't know what that is, but minimizing external factors that can impact performance is important, because otherwise you don't know whether you're testing external factors or your code. Yeah, and then use a CPU profiler. Never optimize your code without metrics, because it probably is not behaving as you expect. There's a lot of stuff usually running under the hood that changes the performance profile of things. If you don't know what the actual metrics are, you should not be touching it. Also, CPU profilers will slow things down, so no time frame that I mention in this talk is the real-life time frame. If it were, I would have done this much sooner. In my quick testing, these results are about 10% slower than without a profiler. And also, I used YourKit as my profiler, a Java profiler that's decent-ish.

So just a quick thing, a flame graph (which I'm gonna use very soon) is a graph that visualizes a performance profile. The y-axis is the stack depth, so methods that call other methods. And the x-axis is, in this case, how long the code took, in descending order. It's not a timeline graph, so the x-axis does not increase as time increases; it's actually the amount of time the code took. In this case, as you can see, there's this function here, that was the main function that got called, and then the ones on top of it are the ones that it called, and the amount of time that each of those took.

So the findings of my initial profiling were that, yeah, setting a hundred million blocks took 155 seconds on the main thread. That's a lot of time. This was also on a very high-powered machine, so low-powered machines are probably going to have a much worse time. Just to put that into perspective, the game client times out if the server doesn't respond for 60 seconds. So this is over double that time frame, almost triple. So basically it's really bad. Also, that's the screenshot of the profile output, which I will get into in more detail in a second.

So yeah, these four areas that have been squared off are the main focal areas that I immediately identified after going into it. I decided on these as they are basically all distinct code paths and also each have their own area of functionality.

So firstly, that red box, which was on the right-hand side there. This code actually happens first, despite being last on the graph. This basically pulls data into a buffer. It takes 18 seconds in this profile, and I immediately noticed that there were some very slow Java stream accesses, and also very slow map accesses. It's probably kind of hard to see on this slide, but right over at the end there are streams, and the rest of it is basically map accesses.

The blue box is basically something that reorganizes the list of operations, to put them in the most efficient order to actually set back to the world. The game does a bit of caching, so we try to make the best use of it. This took 27 seconds, which is really bad because it's literally just a single function call. So the immediate takeaway was that it's a very slow function call: it's doing a copy and sort of a massive map in memory. It's slow. That's bad. Also, there are some leftover slow map accesses from the last box here as well, which also slow things down a little bit.

The green box is a segment of code where we queue lighting updates. The game asynchronously updates lighting using a lighting thread. This takes 42 seconds, and I immediately noticed that this shouldn't be in the profile, because the game updates lighting asynchronously. So why is it impacting the main thread? It's because it's blocking for some reason, so this was obviously something that made sense to investigate.

The purple box is a segment of code where WorldEdit performs various side effects such as updating neighboring blocks and all that kind of stuff. It took 61 seconds, so this is basically the biggest chunk of time. The immediate takeaways here were that there are a lot of calls to an event manager. That's like everything from this point onwards. And this basically means other things that listen to those events can massively slow things down further, and we don't really have any control over that.

Yeah, so what now? We've investigated and identified a few problems, so now it makes sense to actually work out why these are happening and how to fix them. So if you're unsure about something, research it. That's basically the best way of working out how you can go about fixing things. Also, once you've made changes, always rerun the profiler to make sure what you did was actually effective.

So in terms of the slow stream accesses, this is something that is fairly common in code. Java streams are pretty slow, so we unrolled the stream into just some basic imperative code, which caused a major speed-up: this change sped this method up by 2.5 times. We also removed optionals, but this didn't actually do anything anyway because the JVM was already optimizing them out. We still went ahead with removing them just because we're not sure how reliable that is. Maybe it sometimes happens, maybe it's always; who knows, it's kind of a black box.
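To give a rough idea of what unrolling a stream looks like, here's a minimal sketch (not WorldEdit's actual code; the method and data are made up for illustration), showing the same filter-map-sum written as a Java stream and as a plain loop:

```java
import java.util.List;

public class StreamUnrollExample {
    // Stream version: concise, but builds a pipeline and lambdas on a hot path.
    static int sumOfEvenSquaresStream(List<Integer> values) {
        return values.stream()
                .filter(v -> v % 2 == 0)
                .mapToInt(v -> v * v)
                .sum();
    }

    // Unrolled version: the same work as a basic imperative loop.
    static int sumOfEvenSquaresLoop(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            if (v % 2 == 0) {
                sum += v * v;
            }
        }
        return sum;
    }
}
```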

Yeah, so with fixing the slow map accesses, this one was pretty interesting to go into. We had a custom map implementation. If anyone doesn't know what a map is, it's basically a data structure that maps from a key to a value, in a hopefully efficient way. For memory reasons, we had basically two layers to try compacting everything into a really efficient lookup table, and this was really slow. We ended up optimizing the load factors of this map. If anyone doesn't know what load factors are, they are basically numbers that tell a map when it should resize. Maps allocate memory depending on the implementation, and then as they fill up, they have to allocate more memory; the load factor tells them at what point of being filled up they should allocate more. We'd set ours to only allocate more once they'd completely filled up, which is really good for memory efficiency, but it also meant they got really slow as they started filling up. So we tweaked them to 0.9 and 0.75, which gave us the best speed-up in the testing we did. Also, the second internal map required looking up another map to actually identify the keys. That's not a great idea, and it slowed stuff down a fair bit. And there were a fair few places where data was recomputed when we could have just reused it. So we fixed that and got an overall 3.5 times speed-up across that block.
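WorldEdit's maps are custom, but the load-factor idea is the same one exposed by the standard HashMap constructor; here's a minimal sketch, assuming a plain HashMap:

```java
import java.util.HashMap;
import java.util.Map;

public class LoadFactorExample {
    public static void main(String[] args) {
        // Load factor 1.0: the map only resizes once it is completely full.
        // Great for memory, but inserts and lookups degrade as the buckets fill up.
        Map<Integer, String> memoryTight = new HashMap<>(16, 1.0f);

        // Load factor 0.75 (HashMap's default): resizes earlier,
        // trading a bit of memory for consistently fast access.
        Map<Integer, String> balanced = new HashMap<>(16, 0.75f);
    }
}
```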

With the sorted memory copy, this one was fairly easy to fix because we realized that we were creating an immutable copy, and it didn't need to be immutable. Also, it was being done on a single thread, and there were no data dependencies here that prevented us from doing this in a multi-threaded way. Java has a really nice parallel sorting system, so we just switched to that. This is not really the fastest solution, but it gave us really good speed-ups, and anything faster would have vastly overcomplicated the code. So we decided this was good enough, which is a trade-off that often makes sense when doing optimization. The improvement we got here was 28 times. This, of course, is very dependent on the machine, because different machines have different numbers of cores, or the cores are capable of different things. For example, old AMD cores are not as good at multi-threading integer operations, for design reasons. So yeah, basically it can vary a lot. However, most users of WorldEdit use server hardware, which usually has a lot of cores, so this made sense to actually implement.
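The parallel sorting system referred to here is Arrays.parallelSort, which splits the work across the common fork/join pool; a minimal sketch with made-up data:

```java
import java.util.Arrays;
import java.util.Random;

public class ParallelSortExample {
    public static void main(String[] args) {
        long[] data = new Random(42).longs(10_000_000).toArray();

        long[] sequential = data.clone();
        Arrays.sort(sequential);          // single-threaded sort

        long[] parallel = data.clone();
        Arrays.parallelSort(parallel);    // work is split across the fork/join pool
    }
}
```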

In terms of the synchronous lighting issue we identified, it turns out that there's actually a maximum queue size. So we were queuing updates, it hit the queue limit, and then all further updates happened synchronously. When we looked into this, we realized it's not actually a small fix, so we didn't fix it. But it is definitely something that we identified as a major problem. So identifying a slow point does not mean you can easily fix it, but at least once it's on your radar, you can think about it in the future whenever you need to rewrite that section of code.
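To illustrate the failure mode (this is a hypothetical sketch, not the game's actual lighting code): a bounded queue that silently falls back to doing the work on the calling thread once it fills up, which is exactly the kind of thing that shows up as blocking on the main thread.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class LightingQueueSketch {
    // Hypothetical bounded queue of pending lighting updates.
    private final BlockingQueue<Runnable> pending = new ArrayBlockingQueue<>(10_000);

    void enqueueLightingUpdate(Runnable update) {
        // offer() returns false once the queue is full...
        if (!pending.offer(update)) {
            // ...and this fallback runs the update synchronously on the calling
            // (main) thread, which is why the "asynchronous" work appeared in the profile.
            update.run();
        }
    }
}
```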

Yeah, so the outcome of that initial optimization work was getting it down from 155 seconds to 111, which is pretty substantial. However, it's not really what we wanted, because it's still way too slow. A majority of the remaining time was outside of our control, which means it's a lot harder to optimize from this point. Most of it was actually updating the blocks within Minecraft, so that meant it was time for us to explore other avenues for optimization.

So, something that's fairly important to do when doing more serious optimization work is to work out what a typical machine or typical use case for your software is, and test in those situations. So what I did was install the most common mods that people usually have alongside it and test with that. It turns out just installing one of them, one in particular called CoreProtect, made things 40 times slower. So that's very, very significant. And then adding just 10 of the other common mods slowed things down to 100 times slower overall. This is very bad; my test case took four hours to run. That's not something people want to use. And yeah, the profiler kept crashing when creating its output, so I don't have one of those fancy pictures. But this means that in an average case, most users are actually going to be sitting there for hours, which is very, very bad.

So this, of course, required a solution. The first thing we did was remove the event system calls happening in the purple box, because that mostly prevented other mods from slowing things down. However, some people do depend on this system existing, so it can't be removed without impacting the usability of the product. We also looked into it further and found a few other things that most people would not need, but some people do, such as updating the entity AI. So because people had also been requesting fine-grained control over what actually gets updated for a fair while, we decided to create a new feature which, as a side effect, produced massive performance gains.

So we added a system, ironically called side effects, that allows the user to toggle everything that happens in Minecraft when a block actually changes. We disabled the rarely used ones by default, and this basically means that people can decide what is necessary, and therefore speed things up when they don't need to do everything. It now takes 15 seconds with all side effects disabled, and 31 seconds with the defaults enabled. This time frame is similar regardless of whether the typical or isolated case is used; however, enabling the events side effect, which still calls that event system, slows it down to something similar to the typical case. We also did some further optimizations while investigating all this, which got about a five to ten percent performance boost, but that wasn't as major as just setting up this whole system.
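As a rough sketch of what a toggle system like this can look like (hypothetical names, not WorldEdit's actual API):

```java
import java.util.EnumSet;
import java.util.Set;

public class SideEffectSketch {
    // Hypothetical set of things the game normally does when a block changes.
    enum SideEffect { NEIGHBOR_UPDATES, LIGHTING, ENTITY_AI, EVENTS }

    // Rarely needed, expensive effects are off by default; users opt back in per edit.
    static final Set<SideEffect> DEFAULTS =
            EnumSet.of(SideEffect.NEIGHBOR_UPDATES, SideEffect.LIGHTING);

    static void applyBlockChange(Set<SideEffect> effects) {
        if (effects.contains(SideEffect.NEIGHBOR_UPDATES)) { /* update neighbours  */ }
        if (effects.contains(SideEffect.LIGHTING))         { /* queue lighting     */ }
        if (effects.contains(SideEffect.ENTITY_AI))        { /* update entity AI   */ }
        if (effects.contains(SideEffect.EVENTS))           { /* fire change events */ }
    }
}
```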

So at this point, we decided that performance was now acceptable: you can do a hundred-million-block edit in a reasonable amount of time, it doesn't really impact user behavior anymore, and most edits now occur in well under a second. We also came up with some future ideas, such as creating a deferred update system (because not everything needs to happen at once), refactoring some of the pipeline to create less duplication of data, and performing the buffer processing on another thread.

So yeah like from going through this process, there were a few things that I found interesting that didn't really come up in the other slides.

The first is that, depending on your language, your virtual machine or garbage collector can be tweaked. For some reason a lot of users add custom JVM flags to their process, generally not knowing what they do. They just read somewhere "hey, this is good" and then add it. That's often not a good idea, but the flags actually can be used well. If you know what you're doing, you can tweak the garbage collector to suit the application, and this can lead to substantial gains. Incorrectly tweaking the garbage collector, however, can massively slow down the application. For example, many users were limiting the maximum GC pause time to 100 milliseconds. This meant it wasn't able to actually clean up garbage because there was too much of it, which caused really bad performance and often just led to an out-of-memory crash.
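For reference, this kind of tuning is done with flags along these lines; the values here are illustrative only, and the right ones depend entirely on the application and heap size:

```
# Illustrative only: pick values from profiling, not copy-paste.
# -XX:MaxGCPauseMillis is the pause-time target that users were setting to 100 ms.
java -Xms4G -Xmx4G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -jar server.jar
```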

Another one was that Java's just-in-time optimizing compiler was a lot better than I expected it to be. It was completely removing method calls; it was removing object allocations; it was doing a lot of stuff that I did not expect. But this does mean it is hard to determine if something will actually optimize your code, because Java may be doing it already. And most other languages do very similar things; the C compiler does a lot of optimizations before the code even runs. With JavaScript in the web browser, the V8 team have done a lot of really weird stuff to make things faster.

Yeah, so what are the main takeaways of this whole optimization project?

Firstly, have a reason and a goal. Don't optimize your code just because you think "hey, I can make this faster". That probably means you're going to do the wrong thing and write really, really messy code just because you think it can be faster. Treat it as a scientific process, basically: have a problem and a goal, so you're trying to solve this problem to reach your goal. Also, it's a good idea to have a threshold for what's considered good enough, because otherwise you may be doing this literally forever. Also, premature optimization is bad; I was going to include a Donald Knuth quote, but too many words on a slide is apparently bad as well.

Using a profiler: I already went over that at the start, but just to reiterate, it is very important. You need metrics. You should always test with a profiler and validate with a profiler. Also, keep the test case consistent; if it's different data, you're not validating anything, you're just seeing different results. And try to incrementally fix the problem. Don't run the profiler, do a ton of stuff, and then run it again. Always do small things and rerun the profiler, so you can actually be sure that the small change you've just made is what improved performance. Also, don't make changes without evidence that something is a problem. If you see a method and think "oh yeah, I think this is the reason it's slow" and then just make a change, that's bad; you've just gone against the previous points. Never assume that what you're doing is actually benefiting things.

Also, test with various hardware and setups. Try to optimize for the devices and setups your users will use. That covers things like core count, multi-threading, caching data if you expect there to be a lot of memory available, and also really slow IO. Like, if this is a device that's meant to be deployed in an area with really bad internet, like Australia, you probably need to put some extra effort into handling that. This also helps identify specific bottlenecks, like in the case of slow IO. Oh yeah, various network conditions are also fairly important. If you don't test performance when your application is on a very spotty network connection, you might miss that, for example, a network thread is spitting errors and those errors are slowing down the network thread. That's fairly common in Java, because stack trace generation is fairly slow.
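On that last point, the cost in Java comes from capturing the stack trace when the exception is constructed. One common mitigation, shown here as an illustrative sketch rather than anything from WorldEdit, is an exception type for expected, high-frequency failures that skips the stack walk entirely:

```java
public class NoTraceException extends RuntimeException {
    public NoTraceException(String message) {
        // The four-argument constructor lets us disable stack trace capture
        // (writableStackTrace = false), which is the expensive part when
        // exceptions are thrown frequently on a network thread.
        super(message, null, false, false);
    }
}
```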

Yeah, so also look at the user requirements, because you may sometimes find that, yes, this code cannot be made faster; however, this code was written to satisfy a user requirement, and that requirement can sometimes be rewritten in a way that allows the code to run faster. In the case that it can't, try negotiating with the product team and saying that it'll provide a better user experience. Those are words product teams like to hear, so sometimes that'll work. So if it's as fast as it can be, try changing what it is.

Also, keep code readable and maintainable. If you're positive that making the code overly complex is required to make it acceptably fast, then do so, but otherwise it really just makes the code harder to follow. There are cases where it makes sense to implement complex functionality yourself, but realistically other people will already have done it, and retaining maintainability is just as important as optimizing the code. Slowing down the ability to modify the code is arguably worse than having a slow application. Write code well first; then, if you need to, write it fast.

This one's not as important, I guess, but reading up on how the language and runtime work can actually lead to a few really cool optimization ideas. For example, the V8 dev blogs are really, really insightful for different ways you can optimize web applications. And understanding the environment helps prevent you from fighting it. For example, if the environment is doing something and you are doing something else, and they conflict, you're basically just fighting the runtime, which is just going to waste your time and not actually improve anything.

Another one is to not overdo it. You can probably always make it faster; that doesn't mean you should. As I said before, you will probably be there forever if you try. There is a point where the current speed is acceptable, and that's the point where you should probably stop. Speed boosts have diminishing returns based on psychology as well, which is pretty cool. It's been shown that in some cases, something being faster than a certain point is actually worse. That link there was to a post saying Facebook slowed down their loading screens, because before they did that, users thought the site was insecure since it appeared to load instantly. So they'd think, "oh wait, this can't be secure". So basically, something working too fast makes users think nothing happened.

Also, other software can impact yours. Test with other common software that yours is going to be used with. This is probably easiest with websites, because browser extensions are very, very common, and you'd be surprised at how badly some common browser extensions actually impact the speed of various websites.

Also, make sure it doesn't get to this point. If your performance is so bad that you need to basically stop everything and go through a performance audit, it means that you haven't really been caring about it the whole time. So have metrics. Metrics are really nice; they basically let you know when something has slowed down your code. So have some user telemetry if that's available, or just general performance testing, so you can basically plot out a timeline of "hey, this change made the code slow, why is that?". Also, once you've got a negative perception about performance, it's very hard to get rid of it. Java is a key example of that: everyone thinks Java is slow. It was in the 1990s; it's not anymore. But everyone thinks it still is. Why did my emails open up?

Okay, yeah so performance is a key aspect of user experience. Treat optimization as a priority alongside new features. People don't want to use a slow application, no matter how cool it is or how many features you add.

Yeah, thanks everyone for listening and watching my talk!