If you write code for research, you’re missing out if you’re not on GitHub. GitHub is a collaborative coding website that hosts over 1 million open source projects, is increasingly being used by scientists who code, and has even hired a science guru to make the platform work better for researchers.
GitHub makes coding research software easier with its excellent version control, solid tools for collaboration, and real-time feedback and reviews. Even better, GitHub can tell you much more about the interest in, use and adaptation of your open source software and code than simply posting it to your website can.
In this guide, we’ll give you a very high-level overview of how GitHub works, and some of the benefits you can expect to see if you share your code on GitHub.
Learn the basics of Git and GitHub
GitHub is built on top of the distributed version control system, Git. Git allows multiple users to edit a single piece of software code at once. Simply put, it tracks edits and allows each to be applied without overwriting the other edits.
GitHub is a hosting platform for open source software that takes a lot of the pain out of using Git. Users create profiles on the site, download software to their machines, and start coding. If you’re using a Mac, GitHub’s desktop software can do most of the heavy lifting for you, making it relatively easy to push your local code to the cloud and vice versa.
Individual software projects are hosted in GitHub “repositories”. Later on in this challenge, you’ll create repositories for your code.
When you’re ready to collaborate, you can search others’ repositories, “fork” their code for your reuse, and suggest changes via “pull requests.” You can also invite others to collaborate on your code–more on that below.
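If you prefer the command line to the desktop software, the fork-and-pull-request routine boils down to a handful of Git commands. The snippet below is only a rough sketch of that sequence, wrapped in Python so the steps are easy to read and reuse. The function names, the fork URL, the branch name, and the commit message are all made up for illustration; the pull request itself is still opened on the GitHub website after the push.

```python
import subprocess


def clone_command(fork_url):
    """Command that downloads your fork to your machine."""
    return ["git", "clone", fork_url]


def change_commands(branch, message):
    """Commands that record a change on a new branch and publish it to your fork."""
    return [
        ["git", "checkout", "-b", branch],                    # start a new branch
        ["git", "add", "--all"],                              # stage every edit
        ["git", "commit", "-m", message],                     # record the edits
        ["git", "push", "--set-upstream", "origin", branch],  # send the branch to GitHub
    ]


def run_all(commands, cwd):
    """Run each command inside the folder `cwd`, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=cwd, check=True)


# Hypothetical values: print the steps without running anything.
for step in change_commands("fix-typo", "Fix typo in Readme"):
    print(" ".join(step))
```

Calling `run_all(change_commands(...), cwd="path/to/your/clone")` would actually execute the steps, assuming Git is installed and the folder is a clone of your fork.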
Full-on Git & GitHub tutorials are beyond the scope of this post, but I encourage you to check out Lauren Orsini’s excellent GitHub primer (Part 1 & Part 2) to begin learning the basics of Git.
Set up a GitHub profile
Once you’ve got your local software set up, it’s time to create a GitHub profile. This is the centralized place where all of your code and contributions will be collected.
Here are some tips for creating a profile that will make your academic code shine:
- Choose a photo following the recommendations we discussed in our LinkedIn challenge
- Include a link to your professional website, so others can easily learn more about your research
- In the “Company” field, add your area of research or title alongside your institution name, so it reads “Marine Biologist at UC Santa Barbara” rather than just “UC Santa Barbara”
- Add your best code to well-documented individual repositories (more on how to do that in a moment)
By following all of these tips, you’ll have a profile that’s much more searchable on GitHub. Plus, a complete profile that showcases your authority will make you more appealing to potential collaborators.
Create repositories for your code
Once your profile is complete, it’s time to get your code online. Individual projects go into GitHub repositories. And repository-based reuse and interest metrics can help us learn about how our software is being used by others. Here are some tips for creating a great repository.
Choose a short but descriptive title for your repository: it will help with both memorability and SEO. Naming your repository after the software itself is a good choice.
Create a killer Readme file: you want your code to be reusable, don’t you? Documentation is a huge boost to reusability, and a Readme file is the best place to keep your documentation. The Frontier Group recommends including the following:
- The name of the project
- The name and contact details of the client and any 3rd party vendors
- The names of the developers on the project
- A brief description of the project, including the answer to the age-old question “What problem is this project solving?”
- An outline of the technologies in the project. e.g.: Framework (Rails/iOS/Android/Gameboy Colour), programming language, database, ORM.
- Links to any related projects (e.g.: Is this a Rails API that has corresponding iOS and Android clients?)
- Links to online tools related to the application (e.g.: Links to the Basecamp project, a link to the dropbox where all the wireframes are stored, a link to the Pivotal Tracker project)
Consider also adding information about the grant that funded the development of this code, and links to any related publications. To increase your SEO, try to also include keywords that others who might be interested in your software might search for.
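To make the checklist above concrete, here is a minimal sketch of a helper that assembles a Readme skeleton from those pieces. It is not an official template: the function name, the section titles, and the example project details are all invented for illustration, and the output assumes you want a Markdown file such as Readme.md.

```python
def build_readme(name, description, developers, technologies,
                 related_links=(), funding=None, publications=()):
    """Assemble a Markdown Readme from the recommended pieces of information."""
    lines = [f"# {name}", "", description, ""]

    def add_section(title, items):
        # Skip any section that has nothing to list.
        if items:
            lines.append(f"## {title}")
            lines.extend(f"- {item}" for item in items)
            lines.append("")

    add_section("Developers", developers)
    add_section("Technologies", technologies)
    add_section("Related projects and tools", related_links)
    add_section("Funding", [funding] if funding else [])
    add_section("Related publications", publications)
    return "\n".join(lines)


# Hypothetical project details, for illustration only.
print(build_readme(
    name="seagrass-model",
    description="Estimates seagrass coverage from survey transects.",
    developers=["Jane Doe"],
    technologies=["Python"],
    funding="Supported by a hypothetical research grant.",
))
```

Paste the printed text into your Readme file and fill in the details by hand; the point is simply that every item on the list gets its own clearly labeled spot.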
Choose an open license: In a separate License.md file, include a license that clearly explains what rights you grant to others who want to reuse and adapt your code. There are strong feelings about which open licenses are most appropriate, and pros and cons for each that are worth looking into, but we prefer relatively permissive licenses like the MIT license (in fact, that’s the license Impactstory’s code is under).
Add collaborators: Invite anyone who has contributed to developing the code to be a collaborator on your personal code. For code that’s not yours but instead is part of the work an organization or institution does, you can also create an “organization” for code repositories. For example, Matt Jones belongs to the rOpenSci and DataONE organizations on GitHub. For more information on adding others to a GitHub organization, see this guide.
GitHub for sharing data
Some researchers like using GitHub for storing and working with numerical data. The advantage is that the data is stored in a repository alongside the code used to analyze it, turning your research project into a single, neatly packaged, reproducible object.
For some examples of how others use GitHub for data, check out Carl Boettiger’s R workflow, Caitlin Rivers’ Ebola data archive, and OKFN’s government data archive.
Drawbacks to using GitHub to store your data include its lack of a solid preservation strategy and the fact that it doesn’t specialize in one kind of data, as repositories like the Protein Data Bank do, which makes it difficult to find data to reuse.
Mint DOIs for your code
Now that your code (and possibly also your data) is online, let’s make it easier to track its impacts.
A challenge for tracking the scholarly impact of research software is the lack of persistent identifiers for code. That’s why Mozilla Science, GitHub, Zenodo, and Figshare partnered to begin issuing DOIs for code repositories on GitHub. DOIs are the identifiers that publications often include in their citations.
To learn how to create a DOI for your code, check out this guide to connecting Zenodo to a GitHub repository to mint a DOI.
Once you’ve gotten DOIs for your repositories, put them into each of your repositories’ Readme files alongside a preferred citation. It’ll make it easier for others to cite your code in their papers.
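If you’re not sure what a preferred citation should look like, the sketch below shows one possible layout. The format is an assumption rather than an official standard, the function name is invented, and the authors, title, and DOI are placeholders (the zeros stand in for whatever identifier you are actually issued).

```python
def preferred_citation(authors, year, title, version, doi):
    """Format a software citation that ends with a resolvable DOI link."""
    author_text = ", ".join(authors)
    return (
        f"{author_text} ({year}). {title} (Version {version}) "
        f"[Computer software]. https://doi.org/{doi}"
    )


# Placeholder metadata, for illustration only.
print(preferred_citation(
    authors=["Doe, J.", "Roe, R."],
    year=2014,
    title="seagrass-model",
    version="1.0.0",
    doi="10.5281/zenodo.0000000",
))
```

Whatever layout you choose, keep it identical across all of your repositories so that people citing more than one of your projects don’t have to guess.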
Sit back and watch the forks & stars roll in
Citations are far from the only type of impact you can start to accrue if your code is made openly available on GitHub. GitHub has some good metrics that can tell you how your code is being reused, commented upon, and so on–in real time. Some GitHub metrics to know about include:
- Stars: some GitHub users “star” repositories as a means of showing appreciation for your work; others use them as a bookmark, so they can find and revisit your code more easily.
- Forks: a “fork” is created when another user copies one of your repositories so they can explore and experiment without affecting your original code. It’s a good signal of reuse.
- Pull requests: When a user wants to suggest changes to your code, they’ll issue a pull request. The number of pull requests and the identities of contributors can be good indicators of how collaborative your work is and who your high-profile collaborators are.
Each of these metrics can tell a more nuanced story of the use of your code in your discipline than citations alone can.
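You can read these numbers on each repository’s page, but if you’d like to log them over time, GitHub’s public REST API returns them as JSON. Here is a minimal sketch that uses only Python’s standard library and covers stars and forks. The function names and the User-Agent string are made up for illustration, and unauthenticated requests are rate limited, so treat it as a starting point rather than a finished tool.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com/repos"


def summarize_repo(payload):
    """Pick the star and fork counts out of a repository's JSON description."""
    return {
        "stars": payload.get("stargazers_count", 0),
        "forks": payload.get("forks_count", 0),
    }


def fetch_repo(owner, repo):
    """Download the public JSON description of one repository."""
    request = urllib.request.Request(
        f"{API_ROOT}/{owner}/{repo}",
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "metrics-sketch",  # hypothetical client name
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example with a hand-written payload, so nothing is downloaded here.
print(summarize_repo({"stargazers_count": 12, "forks_count": 3}))
```

To check a real repository, call `summarize_repo(fetch_repo("owner-name", "repo-name"))` with your own account and repository names.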
Limitations
Despite its popularity in some circles, GitHub has notable limitations. The biggest is that learning Git can be too high a barrier to entry for some to overcome.
GitHub’s file size limits and usability are drawbacks for others. Moreover, problems with GitHub’s search function make it difficult to search for code or to rank results by relevance when searching code documentation. A good workaround is to use a regular search engine like Google.
And, finally, GitHub is a for-profit company. They reserve the right to delete your code and data at any time, for any reason, making the long-term storage of code a questionable proposition.
Homework
First things first: read these excellent tutorials [1] [2] [3] [4] and practice using Git and GitHub. Once you’ve got your footing, it’s time to get your code online.
Deposit at least one of your best-known software projects or code snippets in a GitHub repository. Then, mint a DOI for it and add your preferred citation to the top of your Readme.md file.
Finally, get social! GitHub’s major strength lies in its social networking features, so try a few Google searches to see if you can find and follow researchers in your field. Bonus points for exploring their repositories to see if there’s any code you can borrow/fork for future projects.
Tomorrow, you’ll have it a bit easier: we’re going to get you onto Slideshare!
Please also post your mediocre and bad code, especially if it was used to generate results for a paper!
+1000! As a “how-to” for beginners, we didn’t want to overwhelm folks with the concept of putting *everything* online, but I couldn’t agree more that code that underpins papers should absolutely go onto GitHub. Thanks, Michael!