It was one of the best decisions I made during my PhD.
I officially started my PhD in January 2015, although the internship that eventually turned into the thesis project started in September 2013.
I wrote the articles that became chapters of the thesis gradually over those four years, which meant the manuscript itself took only a month or maybe two of writing, rewriting and organization, and that was relatively easy and painless.
And I think a lot of it has to do with the fact that my whole thesis work was version controlled in git.
As the internship/PhD project was in applied statistics, it was clear from the beginning that I would need to write quite a lot of code.
I liked writing code (still do), and had done it somewhat professionally during previous internships. In particular, I had already faced the problem of working on a program over a long period of time with multiple people, and therefore of not losing ideas expressed in code across multiple people and/or versions of myself from different periods.
I knew that version controlling all my code was the only way to keep myself sane.
So I set up a git repository when I had around 50 lines of R code (maybe two weeks into my internship; I spent some time getting lost in articles at the beginning).
When the internship was finished, I wrote my internship report and did my master’s defense. Then one of my advisors suggested that I could probably make an article out of the report and submit it somewhere.
I looked at the folder that contained the internship report.
In it were the LaTeX source file, the PDF of the submitted report, multiple graphics files of results, and a bunch of miscellaneous intermediate files generated by the LaTeX compilation.
On the other hand, I had the feedback and suggested modifications from my advisors and from the people at the defense.
I made a mental estimate of the changes I would have to make to get this report publishable.
This is when I realized I had two options:
1. copying the folder, calling it internship_report_2_article, and therefore ending up with a folder structure like this:

       internship
       internship_code
       internship_report
       thesis

2. using a disciplined way to manage the multiple versions of written work, as well as the code.
The second option is probably a bit more work upfront, but a no-brainer in the long run.
I scoured the Internet for tips, used some of them, and gradually changed the way I used the whole system throughout the four years. By the end of my thesis, I had arrived at a set of practices, which this blog post summarizes. As is often the case with programming practices, they are not perfect, and they are not ideal for every situation and every person. They are the result of a long-term negotiation between the goals I needed to achieve, the suggestions and tips I found on the Internet, my technical ability, and the real-world constraints of the moment.
Most online tips recommend committing only the files you modify by hand (so the LaTeX files) and ignoring everything else. I tried that out and in the end decided to version control both the .tex files and the .pdf files. There are two reasons for it:
The practical downside of this practice is that the git repository is larger, but on modern computers, disk space is nearly never the issue.
One could argue that versioning is for keeping the diffs.
If a file is binary (like a PDF), then it’s not meant to be in the git repository.
I agree with this argument in principle.
One way to perfectly solve this problem would be to hook up the (mostly LaTeX) repository to a continuous integration system which compiles and saves the generated artefacts (the PDFs) periodically or after every commit (ShareLaTeX or Overleaf sort of do that).
But that would have been much more work: a system that complex breaks more easily and takes more time to maintain, and it would not have been worthwhile since I was the only one using it.
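So I stuck with committing the PDFs. Concretely, the policy of keeping the .tex and .pdf files while ignoring the intermediate files boils down to a short .gitignore; here is a sketch of the kind of patterns involved (the exact list depends on the LaTeX toolchain; note that *.pdf is deliberately not ignored):

    # A sketch of the .gitignore for a report/article repository:
    # LaTeX build artefacts are ignored, while the .tex sources and the
    # compiled .pdf stay under version control.
    cat > .gitignore <<'EOF'
    *.aux
    *.log
    *.out
    *.toc
    *.bbl
    *.blg
    *.synctex.gz
    EOF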
As I mentioned above, the code part of my thesis was version controlled right from the beginning1. There are plenty of resources on how to correctly version control a piece of software, as it is one of the most common practices in software development.
One thing that was not obvious when I started my thesis, but became gradually clearer was the concept of a library.
As I started out from statistics and not computer science, the code I wrote before my thesis consisted mostly of stand-alone “algorithms”: either a single function (in the best case), or a single script with parameters defined at the beginning of the file, calling functions defined in that same script (in R, for example) or in files next to it.
For my internship, it was mostly enough.
In this internship, I did something which was typical for someone’s first scientific research project: I put two classic methods together to solve a new problem2.
The code consists of several R scripts, each in its own file.
Every script serves its purpose: it reads some data (in the form of files), does something to that data, and saves the results somewhere, either as more data files in a specific place on the disk, to be read by other scripts, or as graphics or tables in a specific place on the disk, to be read by a human.
The way I used them was to run the scripts one by one and collect the results at the end to use in the report.
I felt safe tweaking the scripts and obtaining new results, as everything was in a git repository, so I could always go back to an earlier version if something went wrong.
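That safety net is nothing more than the usual git archaeology; something along these lines (the file name and commit hash are placeholders):

    # Look at the history of one script and, if needed, bring back an
    # older version of it (file name and commit hash are placeholders).
    git log --oneline -- simulation_script.R
    git checkout a1b2c3d -- simulation_script.R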
When I started the PhD research proper, this system started showing some shortcomings. Once I dug into the problem a little, though, the answer turned out to be quite straightforward:
I needed to separate my code into a library, and things (mostly scripts) that use the library to run experiments and achieve the actual goals, for example producing graphics which compare algorithms.
In the programming world, this is referred to as the library layer and the application layer.
The library part was not too hard to set up.
In the R community, everyone writes packages.
The standard practice is to release a package when publishing an article.
Entire websites and books (this one, for example) are dedicated to teaching you how to do that properly.
The script part, however, is a bit more difficult to organize.
I spent quite some time looking into how to set up the folder structure for research projects, and how to do “reproducible research”.
There are some online resources on the topic (seminars, books, articles, even a MOOC).
But things were a lot less clear, in my opinion for two reasons:

1. For reproducible research, there is no equivalent of an R package (defined as a specific way to structure R code). There are tools, knitr for example, but they are only tools which address the needs of some parts of your work. There is no software that you could use, or end-to-end procedure that you could follow, that would guarantee that your research is reproducible. Contrast this with R packages: you basically just click on the “new package” button in RStudio, and then all you need to do is fill up the empty folders with your code. All the methodological thinking is done, and you only need to focus on the content.

2. With reproducible research, what you have is a list of principles. You may agree with all of the principles, but you still need to design your own system so that you can actually follow them. This is reasonable to me, as doing scientific experiments is intrinsically an activity with more loose ends than developing software. It was not uncommon for researchers in the past to design their own note-taking systems or experimental procedures.
The code folder looks like this:

    main_thesis_work
        my_R_package
            R
                various_R_files_with_functions.R
            data
                reusable_datasets
            man
                R_documentation_generated_using_Roxygen
            src
                things_in_Cpp.cpp
            ...
        everything_else
            data
                data_scripts
                    scripts_to_regenerate_reusable_datasets_from_raw_datasets.R
                2016_06_01
                    metadata.txt
                    metadata.RDS
                    actual_dataset.RDS
                2016_07_01
                ...
            analysis
                2016_06_01
                    graphic_obtained_with.pdf
                2016_07_01
                ...
            scripts
                2016_06_01_script_name_which_explains_what_the_experiment_is_about.R
                ...
    internship
    internship_code
    internship_report
I used R almost exclusively in my thesis.
I coded some computation-intensive parts of the algorithms with Rcpp, but the code is mostly in R.
The R package part follows the standard package structure.
In the everything_else folder, though, I made several choices.
- The main_thesis_work folder is a single git repository. It is also a single R workspace. I set up RStudio so that the “build and reload” keyboard shortcut rebuilds and reloads the library folder (this is not specific to R; it’s a simple configuration available in most modern IDEs). A typical workflow: I have an experiment script open, run the experiment, realize that something is wrong, fix the problem in the library, rebuild and reload, and then rerun the experiment.
- I moved the reusable datasets out of the everything_else folder and put them in the package. Shipping some datasets with the package makes it easy for users of the package to start and try the algorithms directly. It also made it easy for me to test new things, without having to think about cleaning datasets.
- I ran all my experiments with knitr.3 The experiment code I wrote was all spin scripts, with a YAML header, knitr chunks and occasional markdown explanations. When I say I “run” an experiment, I mean I click on the “knit” button in RStudio, the computer does its work, and then graphics and caches are automatically saved into the analysis and data folders under the experiment’s date, because I configured this as the default chunk option (a sketch of what one run looks like follows below). This was so powerful; the ease of never having to worry about losing something was so empowering.
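To make that concrete, here is roughly what one run looked like from the command line, assuming default chunk options along the lines of fig.path and cache.path pointing at date-stamped subfolders of analysis and data (the script name below is a placeholder; clicking “knit” in RStudio does the equivalent of the render call):

    # Render one experiment script (a knitr spin script with a YAML header);
    # rmarkdown::render() understands .R files written in spin syntax.
    Rscript -e 'rmarkdown::render("scripts/2016_06_01_compare_smoothing_methods.R")'

    # With the chunk options set as described above, the outputs land in
    # date-stamped folders:
    ls analysis/2016_06_01/   # figures produced by the run
    ls data/2016_06_01/       # knitr cache for the run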
The folder structure above makes it easy to keep track of the experiments, which for me were the things most easily messed up. For the rest of the activities in my thesis, I adopted a structure that is pretty simple and flat.
    main_thesis_work
        my_R_package
            ...
        everything_else
            ...
    internship
    internship_code
    internship_report
    writings
        20140701_first_presentation_to_advisors
            prez.tex
            ...
        20140901_first_article
        20150501_presentation_at_seminar_1
        20150601_presentation_at_seminar_2
        20150701_presentation_at_seminar_3
        20150801_second_article
        ...
The folders in writings are independent and chronologically organized.
Most of them are individual git repositories.
When I needed graphics and tables produced by the experiments, I copy-pasted them from the experiment part of the main_thesis_work folder.
I find it nice to have these small individual git repositories, as the git log keeps a clear history of the modifications of that particular article or presentation. Especially for the articles, which can go through many iterations of rewriting following feedback from advisors and peer reviewers, having this history before my eyes really emboldens me to go ahead and make changes.
At the beginning of my thesis, I considered having a more sophisticated integration between the writings folder and the experiments, for example having the graphics automatically copied into the article folders, or writing my articles directly in the main_thesis_work folder, with R notebooks or something similar to Jupyter.
In the end, I decided to keep things simple and separated, and to do the copy-pasting by hand, for two reasons:
1. It would have meant relying on yet more tools (R notebooks or Jupyter).

2. I liked my knitr code better than using the notebook solutions. The upside of being able to control where, how and when the graphics are stored outweighed the downside of having the graphics in a separate window from the code.

I used git submodules to combine the articles written during the thesis into the actual manuscript. The folder structure looks like this:
    main_thesis_work
        my_R_package
            ...
        everything_else
            ...
    internship
    internship_code
    internship_report
    writings
        20140701_first_presentation_to_advisors
            prez.tex
            ...
        20140901_first_article
            .git
        20150501_presentation_at_seminar_1
        20150601_presentation_at_seminar_2
        20150701_presentation_at_seminar_3
        20150801_second_article
            .git
        ...
    thesis
        .git
        introduction
        chapter_from_first_article
            .git
        chapter_from_second_article
            .git
Using git submodules, the chapter_from_..._article folders each contain a clone of the corresponding article’s repository.
This means that the changes done in the article folders can be retrieved easily in the chapter folders, and vice versa.
In practice, the article folders and the chapter folders have different branches checked out. Typically, in the chapter folders, each chapter starts with an introduction placing it in the context of the thesis, while the articles are all independent of each other. When I receive reviewers’ feedback on a submission and modify the article in the article folder, all I need to do is
git pull article_repo article_branch
in the chapter folder, and the changes would be reflected in the thesis chapter.
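For reference, the wiring behind this is just standard git submodule usage; here is a sketch of how one chapter might be set up (the paths and branch name are placeholders, assuming the article repository lives in the writings folder):

    # In the thesis repository, add the article repository as a submodule,
    # checked out into the chapter folder (path and names are placeholders).
    cd thesis
    git submodule add /path/to/writings/20140901_first_article chapter_from_first_article

    # In the chapter folder, work on a thesis-specific branch that carries
    # the chapter introduction and thesis formatting.
    cd chapter_from_first_article
    git checkout -b thesis_chapter

    # Later, after revising the article in its own folder, the
    # "git pull article_repo article_branch" above brings those changes in:
    git pull origin master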
An external hard drive as origin, and working on several computers

I worked on my thesis both on a corporate computer and on my own laptop (much lighter and therefore easier to carry everywhere). I wasn’t allowed to put anything related to my thesis on GitHub (private repositories were a paid service at the time). The company had a GitLab server, but my personal computer did not have access to it.
To synchronize my work between the two computers, I bought an external hard drive, put a bare repository on it, and used that as the remote. It did not have all the functionality of GitHub, but with a graphical UI for git (SourceTree), it was good enough for me.
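Setting this up takes only a handful of commands; a sketch, with the drive’s mount point as a placeholder:

    # One-time setup: create a bare repository on the external drive and use
    # it as "origin" (the mount point /media/external is a placeholder).
    git init --bare /media/external/main_thesis_work.git

    cd ~/main_thesis_work
    git remote add origin /media/external/main_thesis_work.git
    git push -u origin master

    # On the other computer, with the drive plugged in:
    git clone /media/external/main_thesis_work.git

    # Day-to-day synchronization is then the usual pull/push:
    git pull origin master
    git push origin master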
1. I had taken a C++ class at ENSAE, and the only requirement for the only home assignment of the class was to submit a program as a group (in any object-oriented programming language) on GitHub.