Managing Large Open-Source Projects | Talk

(Transcript is auto-generated from audio recording; may contain errors)

Thank you. So, uh, yep, today I will be giving a talk on how to run a large-scale open source project. It's mostly catered for less technical people, so there will be a bit of "if you don't know something, just put your hand up and I'll explain it." But it does go into technical detail because this is something that everyone needs to know, not just hardcore programmers.

So, just honestly about me: I graduated from QUT, not UQ. I am the former president of Code Network and currently a software engineer at Clipchamp, as you can see by my shirt. I have been programming since the age of seven, I believe, and I'm the lead developer of WorldEdit, WorldGuard, and CraftBook, which are three of the largest Minecraft mods in existence. I've been involved since 2011 or 2012; I can't remember exactly, so sometimes I write 2011, sometimes 2012. I've been leading CraftBook since then, and I've been with WorldEdit and WorldGuard since 2016, but I've been involved in all of those projects since either 2011 or 2012. Together, they have over 4 billion downloads and around 17 million end users. I'm also a former developer for The Powder Toy, which was a particle physics game. I'm also an emoji enthusiast, but sadly I could not include any emojis in this slide because if anyone's ever tried in Google Slides, it kind of ruins the emojis.

This talk is not about my opinions; there isn't really a right way to do this. It's very dependent on what you're trying to do, what kind of project you've got, and what kind of community you've got, which I'll go into during the talk. If people do want to know my personal opinions, I can answer that during questions. So, what it will cover is what you should do before you begin, how you should encourage contributions, how you should manage the community and support, and also how to manage yourself because that's a very important part that a lot of people don't get right. And yeah, if anyone has questions during the talk, just raise your hand and I'll answer it as long as there's time.

So, before you begin, there are a few pretty important steps to take because they're kind of hard to change afterward. You can change them, but it's a lot more difficult. You need to choose a license, a versioning scheme, a version control system, a host, and also a release strategy.

With choosing a license, a lot of people get really hung up on this, and this prevents them from properly publishing something as open source. You don't need to worry about it too much; there are just a few small things you need to know. You don't need to be a lawyer to understand the licenses. If you really want to read them in detail, they do use some complex terms, but they're usually written in an American legal way, which is basically redefining every word of the English language.

Most of the common open source licenses are pretty similar, with a few key differences that fit different situations. You've got very permissive ones such as MIT, Apache, and Mozilla, and they basically allow other people to take the code and do whatever they want with it. They can make their own closed commercial copy if they really wanted to. Then there are ones that aren't as permissive, such as GPL, AGPL, and LGPL. When I say permissiveness here, being less permissive basically means preventing people from taking the code and doing certain things with it, such as commercial use, redistribution, and all that. GPL and its derivatives don't allow people to make their own closed source copy and then distribute it.

Does everyone understand what open source is as a concept first? Yep, cool, that's good. It's currently one step up on QUT. Most open source licenses do allow commercial use, so if you want to make something that can't be used commercially, you'll probably need to look for a license that does that. But all the big ones basically allow it. They also mostly allow modification and distribution because that's kind of one of the points of open source software. You want to be able to spread it. Licenses that don't do these are sometimes referred to as shared source. For example, Microsoft today open-sourced their terminal application and also announced a new one, which they released a very interesting trailer for. I've never seen a trailer for a terminal, so it was kind of hot. Those licenses don't allow people to commercially redistribute it because that doesn't really fit Microsoft's interests. Those are referred to as shared source because it's not technically open source.

Don't put too much thought into your license choice. Basically, just work out one that seems to fit what you want the project to be and then go with it. I usually go with MIT just because a lot of different projects go with it, and it's fairly simple. The next thing I think it's important to mention is that you're not bound to a single license. You can use as many as you wish, and dual licensing is quite common. For example, a main use case of dual licensing is if you've pulled any code from somewhere else. If that's under a license, you can't really just change the license of that unless you write, as a rule of thumb, like 70% of it. Usually, that's where you can re-license it, but it still is a bit of a gray area. If you do that, you can have dual licensing where, say, one folder of the project may fall under one license and the rest under another.

Choosing a versioning scheme is pretty important because you don't want to change it up all the time. For example, 2019.1, 1903, and 19H1 are all, as far as I'm aware, versions for the most recent Windows update. It very heavily depends on what your project is and how you plan to release it. If you've got a rolling release product that has multiple releases a year, maybe choose something like 2019.1. That's what JetBrains currently uses for their applications, and it seems to work pretty well for them. For example, I believe Ubuntu uses the 1903 system as well.

If you're creating a library, that's usually pretty different because you usually need to specify when something's breaking, like when you add new features or if something's just a small patch that people can safely update to without actually having to test that the whole application doesn't explode. For something like that, there's semantic versioning or SemVer, which uses major.minor.patch, where major indicates breaking changes, minor indicates generally new features, and patch indicates just bug fixes or something like that. If you want to find out more information about that, there's semver.org, which is kind of just the website for it. Definitely make sure to be consistent. As I've said before, if you're running a library or something where you change versioning schemes all the time, no one's going to trust your library. If no one trusts your library, they're not going to use it unless you're writing JavaScript because then no one seems to care. I was using a library, and they changed from 1.0.3 to 0.5.4. I don't understand the logic of that version because everything sees it as less than.
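To make the major.minor.patch idea concrete, here is a minimal sketch of how a tool might compare two versions and decide what kind of update it is looking at. The function names `parse_semver` and `classify_update` are made up for illustration, and the sketch assumes plain three-part versions; it ignores the pre-release and build-metadata suffixes that semver.org also defines.

```python
from typing import Tuple


def parse_semver(version: str) -> Tuple[int, int, int]:
    """Split a plain 'major.minor.patch' string into a tuple of ints."""
    major, minor, patch = version.split(".")
    return int(major), int(minor), int(patch)


def classify_update(old: str, new: str) -> str:
    """Describe the step from `old` to `new` (hypothetical helper)."""
    old_v, new_v = parse_semver(old), parse_semver(new)
    if new_v == old_v:
        return "same"
    if new_v < old_v:
        # Tuples compare element by element, so 0.5.4 sorts before 1.0.3.
        return "downgrade"
    if new_v[0] != old_v[0]:
        return "major"  # breaking changes: test before updating
    if new_v[1] != old_v[1]:
        return "minor"  # new features, should stay compatible
    return "patch"      # bug fixes only


print(classify_update("1.0.3", "1.0.4"))
print(classify_update("1.0.3", "0.5.4"))
```

Comparing the parsed numbers rather than the raw strings matters: as text, "1.10.0" sorts before "1.9.0", but as tuples it correctly comes after. It also shows why a jump from 1.0.3 to 0.5.4 looks like a downgrade to every tool.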

Choosing a version control system and host is usually pretty simple because nowadays, unless you have a very specific reason not to, everyone basically just uses Git and then generally also GitHub. A lot of people will fight me on that GitHub one since the Microsoft acquisition, but it is the same website, still works, it's great. GitHub has a big community around it, which is why it's kind of a de facto choice. It's very easy to publish your project there and actually get it seen. Some communities do use other methods like Subversion, Mercurial, TFS, but they're usually the kinds of communities where if you're publishing something for it, you know that they're using something different. For example, the Linux kernel often accepts contributions via mailing lists and privately posted repositories. However, there are mirrors on things like GitHub, so there isn't really much of a reason not to use them unless you really don't like them or you have a good reason not to.

Choosing a release strategy is pretty important because if you just randomly make releases and don't tell anyone what's in them, they're not really going to know what things have changed and probably won't update. Keeping a changelog is very important. If you can, it's usually a good idea to auto-generate them from commit messages. It just saves time, which also means don't just write one-word commit messages like "fix" or "a" or sometimes people use single emojis. As much as I do enjoy that, it's probably not the best idea. It's good to understand how often you should be making releases because it may seem simple to just push them out whenever you make a change, but is your actual application something where users are happy to update that often? If you have an auto-updater, generally it's fine. I believe Chrome updates happen every few days, and it just happens seamlessly. If updating is more of a hassle for a user, like if they need to download a new installer and reinstall the program, probably don't make daily releases because it's going to get really annoying for the users, and you'll find them staying on older versions. Faster releases do mean users get features faster; however, it also means they need to update more often. It's good to keep a good balance and also consider how users will find out about these releases because if they don't know new updates were made, they also won't update. If it's an application with an auto-updater, that's very easy. Otherwise, consider update prompts or something like that. It's fairly easy for libraries because usually people just pin them at what they want, and then whenever they have time and want to update everything, they just go through and check if there's a new version available.
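As a rough sketch of auto-generating a changelog from commit messages, the snippet below groups commit subject lines into sections. It assumes the project has agreed on prefixes such as "feat:" and "fix:" at the start of each subject; that convention, the section titles, and the name `build_changelog` are assumptions for illustration, not something the talk prescribes. The subjects themselves could come from something like `git log --pretty=format:%s`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumed prefix convention; adjust to whatever the project actually uses.
SECTION_TITLES = {"feat": "Features", "fix": "Bug fixes"}


def build_changelog(version: str, subjects: Iterable[str]) -> str:
    """Group commit subject lines into a plain-text changelog entry."""
    sections: Dict[str, List[str]] = defaultdict(list)
    for subject in subjects:
        prefix, sep, rest = subject.partition(":")
        key = prefix.strip().lower()
        if sep and key in SECTION_TITLES:
            sections[SECTION_TITLES[key]].append(rest.strip())
        else:
            # Anything without a known prefix is kept, just not categorised.
            sections["Other"].append(subject.strip())

    lines = [f"Version {version}"]
    for title in ("Features", "Bug fixes", "Other"):
        if sections[title]:
            lines.append(f"{title}:")
            lines.extend(f"  - {entry}" for entry in sections[title])
    return "\n".join(lines)


print(build_changelog("1.2.0", [
    "feat: add region selection wand",
    "fix: handle empty clipboard",
    "bump dependencies",
]))
```

This only works if the commit messages carry real information, which is the reason one-word messages like "fix" make an auto-generated changelog useless.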

So, that's the getting started section covered. Now, encouraging contributions. There are a few key steps here, such as having a public plan, advertising that the project is open source, making it easy to contribute to, and also the tooling involved. Having a public plan is very important because otherwise people won't understand what you want to achieve with the project. If they don't know what you're trying to do, they won't contribute because they're not sure if their contribution will actually be accepted. If it has a roadmap or something like that, it improves people's ability to contribute because they can say, "Oh, that's a cool thing. I know how to do that. I can make it." It also just gives the project a lot more credibility. People don't really want a project that switches directions every five months because if it does, then they're going to feel left behind when they're trying to use the original functionality, and suddenly your calculator app has turned into a placement service or something.

Advertising that the project is open source is super important because unless it's something very obviously open source, like a developer library that people will look up, they probably won't know. If you've released a calendar application, most people are just going to assume it's a calendar application and that they can't see the code. It very much depends on the type of project. If you've got a website, usually people have those little "Fork this on GitHub" banners or something like that. If it's a desktop application, you'll probably need to do something similar or have something on the download page or just something like that. Using GitHub definitely makes that easier just because of all their social icons, as mentioned before.

Definitely make it easy to contribute. If people are intimidated or scared of contributing, they definitely won't. Keep the project consistent with strict contributing guidelines. Have a contributing file that explains everything. Make sure it states what to do so people don't just make PRs that say "fix issues" and then get rejected because they haven't explained what they've actually fixed. One way to make this a lot easier is to have an issue tracker with templates to ensure that what people report is actually a meaningful issue. If people just report "I have an error, it's broken," you're just going to immediately throw that out because it's not at all helpful. Also, having issues for the items on the roadmap helps people keep track of them. For example, if you just have items on the roadmap and there have been discussions about them in an IRC channel or whatever people use these days, that's probably not going to be known to someone who just goes to the GitHub page looking to contribute. Labels are super useful. Have bug labels, feature labels, or labels to indicate that an issue is easy to get started with. Reviewing PRs, being active, and responding to issues are super important because no one's going to contribute if they don't think that PRs are actually ever going to be looked at.

Tooling is something that most bigger open source projects get very right, but some don't, and then they kind of just feel weird to contribute to. Continuous integration is a massive, great thing here because it allows PRs to be automatically tested, assuming you've actually written tests. Writing tests is important. Things like Travis CI are free for open source projects. It integrates with GitHub super well. It has a little integration icon and doesn't actually pass a pull request until it's passed all the tests. If you need something that's a lot more custom and powerful, Jenkins, TeamCity, or CircleCI will work well. Both Jenkins and TeamCity are free to a certain point. CircleCI, I'm not too sure about their payment. Deploying artifacts to the archive system of your language is also super useful because that gives people a chance to test recent development releases so that you don't make a release and then find out there's this massive showstopper bug for 90% of users. For Java, it's usually Maven Central. For JavaScript, it's npm. Also, have extensive unit testing, as mentioned before, and run the tests on the pull requests, also potentially on Git hooks, but I usually find that's fairly annoying. Definitely set up automatic linting and style checking because otherwise people won't know that they've written something that goes against the style guide, and then you comment saying, "Oh, this is the wrong style," and then they kind of lose interest because it's such a minor change that they don't really care enough to fix it. Automatic checks just kind of give people more enthusiasm about actually making changes. On JS projects, there's a package called Husky, which automates installing of Git hooks.

Does everyone here know what Git hooks are? No? Okay. Basically, Git hooks are little scripts that run when you do certain Git actions, such as committing or pushing or something like that. It allows you to basically make something run every time you try doing something with Git.
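For a concrete picture, here is a minimal sketch of a `commit-msg` hook written in Python. Git runs this hook with the path to the file containing the proposed commit message as its first argument, and a non-zero exit status aborts the commit. The specific rules (a minimum subject length and more than one word) and the name `find_problems` are made-up examples, not rules from any particular project.

```python
#!/usr/bin/env python3
import sys
from typing import List

MIN_SUBJECT_LENGTH = 10  # hypothetical threshold for this sketch


def find_problems(message: str) -> List[str]:
    """Return a list of complaints about a commit message (empty if fine)."""
    lines = message.strip().splitlines()
    subject = lines[0].strip() if lines else ""
    problems = []
    if len(subject) < MIN_SUBJECT_LENGTH:
        problems.append(f"subject is shorter than {MIN_SUBJECT_LENGTH} characters")
    if len(subject.split()) < 2:
        problems.append("subject is a single word (or empty)")
    return problems


def main(argv: List[str]) -> int:
    # argv[1] is the path Git hands to the commit-msg hook.
    with open(argv[1], encoding="utf-8") as handle:
        problems = find_problems(handle.read())
    for problem in problems:
        print(f"commit rejected: {problem}", file=sys.stderr)
    return 1 if problems else 0


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

To try it, the file would be saved as `.git/hooks/commit-msg` in a repository and made executable. Hooks in that folder are not shared when someone clones the repository, which is the gap that tools like Husky fill for JS projects.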

Managing the community and support is also super important because if no one knows how to use your product, they're not going to. Is everyone else understanding everything at this point? Yeah? Cool.

Documentation is important. It's even bolded, so just make sure everyone's aware. If people don't know about a feature, the feature doesn't exist because they don't know about it. How are they going to use it? Therefore, don't bother writing something unless you're planning on documenting it. This means have documentation, make it extensive, and also open source. If it's an open source project, it makes sense to also get community contributions to the documentation. Services such as Read the Docs are super useful because they basically provide a versioned, free hosted documentation service. Try auto-generating the docs as much as you can just because writing docs isn't fun. It's very time-consuming, especially if it's something like a massive table of routes or something. You may as well auto-generate that from within the application because you probably have that information available. Anything such as configuration as well is super easy to auto-generate. The docs are a first-class citizen and should be treated with equal or greater importance to your actual application because it's basically what people will look at when deciding to use your application, and then it's what they'll look at when they don't know how to use it.
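As a small sketch of auto-generating docs for configuration, the snippet below renders a reference table straight from the same option definitions the application would load. The `ConfigOption` class, the option names, and the pipe-table output format are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ConfigOption:
    """One configuration entry; the single source of truth for code and docs."""
    key: str
    default: object
    description: str


# Hypothetical options, standing in for whatever the application defines.
OPTIONS = [
    ConfigOption("max-blocks", 10000, "Largest edit allowed in one operation."),
    ConfigOption("debug", False, "Print extra logging."),
]


def render_config_docs(options: Iterable[ConfigOption]) -> str:
    """Render the options as a pipe-delimited table, sorted by key."""
    rows = ["| Option | Default | Description |", "| --- | --- | --- |"]
    for opt in sorted(options, key=lambda o: o.key):
        rows.append(f"| `{opt.key}` | `{opt.default!r}` | {opt.description} |")
    return "\n".join(rows)


print(render_config_docs(OPTIONS))
```

Because the table is built from the definitions themselves, adding an option without a description becomes hard to do by accident, and the docs cannot drift away from the real defaults.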

Handling support is something that a lot of large open source projects don't do super well. Make sure you have super consistent issue tracking; otherwise, people just randomly message you on Telegram at 2 AM. That's happened to me multiple times. Also, discourage general questions through the issue tracker. I'll talk more about that in the next slide. Keep milestones and assign issues to them. That way, people kind of know when something's coming out. They'll still ask you constantly when it's coming, but at least some users will look at the milestones and say, "Oh, it's coming out." Definitely try to automate as much as possible. This image here is an example where I've got a bot that scans Pastebin for known errors and then responds with how to solve it. Basically, that cuts down about 80% of the issues that I get reported, which is great because otherwise, I wouldn't sleep. Definitely be friendly and approachable. If someone joins, asks for help, and then you yell at them for asking for help, they're not going to use your product anymore. Also, definitely understand that not all your users will be native English speakers because a lot of the time they'll be using English that you may not completely understand, or they may not be speaking English at all, and you still need to be nice to them and help them, which a lot of projects seem to not do great at.
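The core of a bot like that can be quite small. The sketch below is not the actual bot from the slide; it only illustrates the idea of matching pasted log text against a list of known error patterns and replying with canned advice. The two patterns are real Java error names, but the advice strings and the names `KnownError` and `suggest_fixes` are made up here, and fetching the paste and posting the reply are left out.

```python
import re
from typing import List, NamedTuple


class KnownError(NamedTuple):
    pattern: re.Pattern
    advice: str


# Hypothetical catalogue; a real project would grow this from past issues.
KNOWN_ERRORS = [
    KnownError(
        re.compile(r"java\.lang\.OutOfMemoryError"),
        "The server ran out of memory. Try giving Java a larger heap.",
    ),
    KnownError(
        re.compile(r"UnsupportedClassVersionError"),
        "The plugin was built for a newer Java version. Update Java.",
    ),
]


def suggest_fixes(log_text: str) -> List[str]:
    """Return advice for every known error that appears in the log."""
    return [err.advice for err in KNOWN_ERRORS if err.pattern.search(log_text)]


sample = "Exception in thread main java.lang.OutOfMemoryError: Java heap space"
for advice in suggest_fixes(sample):
    print(advice)
```

An unmatched log simply produces no advice, so anything the bot does not recognise still reaches a human.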

Creating a community is a super useful part of this because it basically allows you to kind of offshore support to an extent. The types of community will vastly differ based on what your actual application is. If you have a C terminal library, you're probably going to have a very particular crowd of people. If you have a game, you're probably going to have a group of 12-year-olds. Both of them have their pros and cons, but you just need to understand your target audience and then base the community around that. If you're expecting a more professional community, consider something that's slower-paced like forums, mailing lists, or GitHub discussions. If people are going to be fairly professional, they're probably going to either be using this at work or potentially after they've gotten home from work. They're not going to want to be in a constant chatting environment. If you're wanting to build a very strong community that's close-knit, consider something like Discord or Gitter because it basically allows you to have general and off-topic channels where people just kind of chat in general, which keeps them in the community space. This means if someone has a question, other people are already there and can answer it. Handle any general questions and as much support as you can through the community to try drawing people in and actually have them stay there because that basically creates a self-sustaining community where people join because they need help, they're helped by the community, and then they stay around to help others.

Telemetry is very controversial. Some people really hate it; some people really like it. Always have it optional. Consider opt-in versus opt-out. Basically, the downsides of each are: with opt-in, you're going to have a lot fewer people actually turning it on, so you're going to have the telemetry of a certain type of person. If all of your audience is somewhat that kind of person, it can work, but it won't always be great. For example, if you are making a game mod, a lot of 12-year-olds aren't going to enable opt-in telemetry, but then people who know what they're doing may. Whereas with opt-out, you then have the issue of people complaining about privacy. Definitely keep all data truly anonymous, and by that, I mean actually anonymous. A lot of people think, "Okay, we're just uploading random locations people have been with an identifier. Their name may not be attached to that." But once you look through it and say, "Okay, this person's been to these places," you can easily match up that data with something else, and in that case, it's not really anonymous. Definitely keep it good. Since open source projects are generally available to the European audience, it's almost necessary to use opt-in to comply with GDPR.

Does that include more than just basic usage stats? It depends on what you're collecting because even something like an IP address can be considered personal information, so that falls under it. Even hashes of identifiable information can be considered. Definitely keep local laws into consideration when doing something like telemetry.
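One way to picture "opt-in and no personal data" in code is an allowlist: nothing is reported unless the user has switched telemetry on, and only explicitly approved fields ever leave the machine. This is a sketch under those assumptions; the field names and the function `build_report` are hypothetical, and an allowlist on its own does not guarantee anonymity or legal compliance.

```python
from typing import Any, Dict, Optional

# Only these fields may ever be reported (hypothetical names).
ALLOWED_FIELDS = {"app_version", "feature", "error_type"}


def build_report(event: Dict[str, Any],
                 telemetry_enabled: bool = False) -> Optional[Dict[str, Any]]:
    """Return the data that would be sent, or None if nothing may be sent."""
    if not telemetry_enabled:
        # Opt-in: the default is to report nothing at all.
        return None
    # Drop everything that is not explicitly allowed, such as IPs or user IDs.
    return {key: value for key, value in event.items() if key in ALLOWED_FIELDS}


event = {"app_version": "7.1", "feature": "brush", "ip": "203.0.113.5"}
print(build_report(event))
print(build_report(event, telemetry_enabled=True))
```

Starting from an allowlist rather than stripping known-bad fields means a newly added field stays private until someone consciously decides it is safe to collect.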

With it being an open source project or a big one, maybe you have heaps of developers. Who gets access to the telemetry? If it was something you're running, are you the only one who sees them, or are they public for anyone who wants to look at them and develop based on them? Generally, I think telemetry should be public, and if there's a problem with it being public, you've probably got the wrong information in your telemetry. Having it public just means anyone can see it, so if someone wants to make a contribution, they also have access to that information because it should be a community project rather than just the team. Definitely keep track of feature usage, errors, versions, etc. For example, if as of a certain version a certain feature starts getting less used, you may wonder, "Did it get harder to use in this version? Did something else come in that replaced it?" If you want to develop features that people are using, it's good to actually know what people are using. Try keeping track of manual things as well. For example, if a lot of people are suddenly asking you about a setup question, maybe it's not documented well, maybe it doesn't work, or something like that. Also, definitely actually use the data. If no one uses a feature, work out why. If people just don't want it, maybe get rid of it. If no one's using it because they don't know how to, document it. Spend more time on popular features because that's what people are actually using.
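As a sketch of what "actually use the data" could look like, the snippet below turns raw usage events into each feature's share of usage per version, which makes a drop between versions easy to spot. The event shape, a (version, feature) pair, and the name `usage_share` are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def usage_share(events: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Map each version to the fraction of its events that used each feature."""
    counts: Dict[str, Counter] = {}
    for version, feature in events:
        counts.setdefault(version, Counter())[feature] += 1

    shares: Dict[str, Dict[str, float]] = {}
    for version, counter in counts.items():
        total = sum(counter.values())
        shares[version] = {feature: n / total for feature, n in counter.items()}
    return shares


# Made-up events: "brush" falls from half of all usage to a quarter.
events = [
    ("7.0", "brush"), ("7.0", "copy"),
    ("7.1", "copy"), ("7.1", "copy"), ("7.1", "copy"), ("7.1", "brush"),
]
for version, features in usage_share(events).items():
    print(version, features)
```

Shares are used instead of raw counts so that a version with fewer users overall does not look like a drop in every feature at once.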

This is arguably the most important step that a lot of programmers don't seem to understand. They just kind of program all day and night, don't sleep, and somehow look like that, which is not actually possible. It won't last forever. Don't burn out. Manage your time. Don't just program; do other things. Don't let the project itself take a toll on you, especially the...

Ask the community for help if you need it because it's a community project, so don't be afraid to use the community for assistance and everything like that. Don't let impatient or abusive users get to you. I've had a lot of very interesting users, so it can be quite a time. Also, make sure you do things other than programming. Have non-programming hobbies. Sometimes when I say that, people just kind of feel personally attacked, because the question is, like, "Is writing C libraries another hobby, because I'm normally a web developer?" It's not a non-programming hobby; it's still programming.

Do social activities, which is another thing that people usually glaze their eyes over at. It's kind of important to balance your time. Also, have downtime; that can be anything from reading, watching Netflix, sleeping, going for a walk, something like that. Know what to expect. Open source software is not a way to get rich. I guarantee it. You won't necessarily get any money, and you won't necessarily cover running costs. It's still worth trying to get money, just not in a very intrusive way. Do stuff like create a Patreon, include donation links, and all that kind of stuff.

Some people will demand support, like actually full-on demand it. Some people will be fairly abusive. I usually get about three death threats a week, which is kind of... yeah. Some people will take advantage of you. I was recently spear-phished in a very specific manner. Spear phishing is basically a phishing attack where someone finds out information about you to work out the thing that you will actually fall for and then specifically targets you. I can get into that during questions if people want me to because people were interested when I did this talk at Code Network.

Some people don't have boundaries, just like the Korean Telegram messages I've received. There's a chance people will make ridiculously large amounts of money off your software. I know someone who made, I think it was 2 million US dollars off CraftBook. I'm not going to get into who that was because of reasons. I've also done a talk on terrible support requests before. It's like a five-minute talk that I can quickly do at some point if people are interested.

Does anyone have any questions?

"You can't say someone made two million dollars off a Minecraft mod and not explain."

They were selling the features of the plugin on a donation store for, I think, ten dollars each, and they had a lot of users. After, I think it was two years, they made about two million dollars. Is that like where if it was GPL, they could only have... oh yeah, right. To do that legally, they would have had to also tell everyone where the free source code was. It was a Minecraft mod, so selling it to players, even if they had the source code, they could install it on their own server, but they wanted to be able to use it on that server.

"What was the spear phishing attack?"

There's an upstream project to WorldEdit, which is one of the mods, called Spigot. They've had this bug which I've been trying to track down because it only affects four people. Only four people have reported this bug, and it basically entirely breaks WorldEdit. A few days ago, I actually found out what it was. For people who have Polish machines, so machines where the system language is set to Polish, when Java's toUpperCase method is called, it adds an accent above the "i"s, which then makes the word "Minecraft" have an accented character, and therefore it breaks when people try to use block names, which is the most obscure issue.

"What did they do to get you?"

When I was trying to work this out, I was asking people to send me their Minecraft server zipped up so I could run it and see if it would still occur for me. Never accept executable code from someone who you don't know or trust. It's kind of a bad idea. They made a modified version of WorldEdit which was installed on the server they sent me, and in there was code to go into my Minecraft launcher's auth file and pull out the authentication token, which isn't a password, so they didn't have full access to the account, but they could log in as me. They used this access to join Minecraft servers, say they were the developer of WorldEdit, and try getting extra permissions, which is the most effort I've ever seen anyone go to in order to do so little because you can't do anything with that.

"What did you do for The Powder Toy?"

Back in 2010, I was just making random contributions to it. I can't specifically remember what I added. I know I definitely added the convert element, also the favorites menu, and a fair few random things like the scripting engine. This is still running? Yeah. Does anyone have any other questions?

"What would you recommend overall? Would you recommend starting your own open source project or trying to get onto one? What's the better experience?"

All of the big ones that I've been involved with already existed when I got involved, which is kind of easier because you don't need to do the initial groundwork of getting a user base. But that kind of requires you to find something you're really passionate about. If you can't find that, it may be easier just to start your own, but it is definitely a lot harder to start your own because initially getting users is difficult. Any other questions?