Git Lessons Learned

If you're doing any kind of programmatic data analysis or software development, you need to use Git. It is a really powerful version control system. This post is about the lessons I learned the hard way while developing a machine learning algorithm for the analysis of large-scale neural recordings.

If you're new to Git, check out this one-page summary on how to use it https://github.com/marius10p/NeuralDataScienceCSHL2019/blob/master/HowToGit.md or this free online course from Udacity https://classroom.udacity.com/courses/ud123

Here's my repository that this post is based on https://github.com/mariakesa/EnsemblePursuit

Lesson number 1: Make an effective folder structure for your project to organize your work.

Here's a paper on organizing data analysis projects in bioinformatics: "A Quick Guide to Organizing Computational Biology Projects", Noble 2009 https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424 One of its key pieces of advice is to organize your work into meaningful folders. This helps manage complexity, because it is a true nightmare to sift through Jupyter notebooks and code that are not organized by topic. When I first started my project I had notebooks and .py files all in the same folder, and it quickly became unmanageable. The lesson learned is: structure your work. Think hard about how to organize your computational research project from the very beginning.
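To make this concrete, here is a minimal sketch of setting up a folder structure with Python's pathlib. The folder names (data, src, notebooks, results, doc) and the function create_project_layout are my own illustrative assumptions, not a layout prescribed by this post, so adapt them to your project.

```python
from pathlib import Path
import tempfile

# Hypothetical folder names, chosen for illustration only.
PROJECT_FOLDERS = ["data", "src", "notebooks", "results", "doc"]


def create_project_layout(root, folders=PROJECT_FOLDERS):
    """Create the listed subfolders under root and return their paths."""
    root = Path(root)
    created = []
    for name in folders:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        # Git does not track empty folders, so add a placeholder file.
        (folder / ".gitkeep").touch()
        created.append(folder)
    return created


if __name__ == "__main__":
    # Demonstrate in a temporary directory so nothing real is touched.
    with tempfile.TemporaryDirectory() as tmp:
        for folder in create_project_layout(tmp):
            print(folder.name)
```

The .gitkeep file is only a common convention for getting an otherwise empty folder into a commit; Git itself gives the name no special meaning.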

Lesson number 2: Occam's Razor. Entities should not be multiplied beyond necessity.

Right now my GitHub repo contains a bunch of branches, 7 to be precise. When I created the branches, what I had in mind was to split my work by feature. However, now I'm stuck with a maintenance nightmare. I have code that is shared between several branches, and committing and merging changes to the same file across them is a real mess. It's error-prone and it's messy as heck.

I think a better strategy is to have just two branches: a devel branch where you work on features and a clean master branch that is the face of your repository, i.e. the finalized code that you and others have used. You can organize the features you're developing by having folders in the devel branch that group together related snippets of code. This is a case where Lesson 1, organizing your code into folders, is useful again.

Lesson number 3: Make your commits small and have one theme.

Forgive me father, I have sinned. My repository contains a bunch of unreadable commits: single commits with unrelated changes to multiple files, and random gibberish commits made only to save my changes as I moved between branches. It is an unholy mess. Ideally, a commit covers one specific change and has a descriptive message saying what you did.

Lesson number 4: If you're using a piece of code more than once, turn it into a script.

An important part of reproducibility is being able to run the full analysis that leads to a figure in a paper with just one push of a button. Classes and functions in Python are really useful for organizing your code. This is something that I did well in this project. I made pipeline classes for analyzing datasets and used these classes in Jupyter notebooks. I was used to organizing code into pipelines as I had previously done an internship analyzing satellite image data with convolutional neural networks. This is where I learned that classes in Python are perfect for encapsulating particular stages of an analysis and tying different functions together so that they are easy to apply to the data the class holds. It is really good to have your entire pipeline in one class and to have every function in the class commented. It really helps with reproducibility.
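As a rough illustration of the idea, here is a toy pipeline class. It is not the code from my repository: the class name RecordingPipeline, its methods, and the z-score-then-peak analysis are all made up for this sketch, and it uses only the standard library so it runs anywhere.

```python
from statistics import mean, pstdev


class RecordingPipeline:
    """Hypothetical pipeline: each method is one stage of the analysis."""

    def __init__(self, recordings):
        # recordings: dict mapping a channel name to a list of floats
        self.recordings = recordings

    def normalize(self, values):
        """Z-score one channel; a constant channel maps to all zeros."""
        mu = mean(values)
        sigma = pstdev(values)
        if sigma == 0:
            return [0.0 for _ in values]
        return [(v - mu) / sigma for v in values]

    def peak(self, values):
        """Return the largest absolute value in a channel."""
        return max(abs(v) for v in values)

    def run(self):
        """Run every stage in order and return one summary per channel."""
        summary = {}
        for name, values in self.recordings.items():
            summary[name] = self.peak(self.normalize(values))
        return summary


if __name__ == "__main__":
    pipeline = RecordingPipeline({"a": [1.0, 2.0, 3.0], "b": [5.0, 5.0, 5.0]})
    print(pipeline.run())
```

In a notebook you would then only need to construct the class and call run(), which is the "one push of a button" part; the stages themselves live in a .py file under version control.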

Lesson number 5: Have separate folders for draft notebooks and clean notebooks.

I learned from the Harvard Reproducibility MOOC that it is good to track what you tried and any analyses that you did even if these analyses are preliminary and don't make it to the paper that you're writing. Playing around with data in a draft Jupyter notebook is often the first step to developing a feature and later turning it into a .py script. These things are messy. You're just trying stuff out. It is useful to have a separate folder for draft notebooks. I found it to be better than having clean and exploratory notebooks intermingled in a folder. I tried adding the '_draft' suffix to my messy notebooks, but it makes more sense to just have them in a folder in the devel branch.
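If you already have notebooks marked with a '_draft' suffix, something like the following sketch could sort them into their own folder. The function name move_drafts and the folder name drafts are assumptions I picked for the example.

```python
from pathlib import Path
import tempfile


def move_drafts(notebook_dir, drafts_name="drafts", suffix="_draft"):
    """Move notebooks whose name ends in suffix into a drafts subfolder."""
    notebook_dir = Path(notebook_dir)
    drafts_dir = notebook_dir / drafts_name
    drafts_dir.mkdir(exist_ok=True)
    moved = []
    # sorted() collects the matches first, so moving files is safe.
    for notebook in sorted(notebook_dir.glob("*.ipynb")):
        if notebook.stem.endswith(suffix):
            target = drafts_dir / notebook.name
            notebook.rename(target)
            moved.append(target.name)
    return moved


if __name__ == "__main__":
    # Demonstrate on throwaway files in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ["figure1.ipynb", "pca_draft.ipynb", "test_draft.ipynb"]:
            (Path(tmp) / name).touch()
        print(move_drafts(tmp))
```

This moves plain files on disk. In a repository that already tracks the notebooks you would still need to stage the moves afterwards, or use git mv for each file instead.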

Hope this helps somebody who is starting to work on large-scale datasets :-)
