It was one of the best decisions I made during my PhD.
I officially started my PhD in January 2015, although the internship that eventually turned into the thesis project started in September 2013.
I wrote the articles that became chapters of the thesis gradually over those four years, which meant the manuscript itself took only a month or maybe two of writing, rewriting and organization, and that was relatively easy and painless.
And I think a lot of it has to do with the fact that my whole thesis work was version controlled in git.
As the internship/PhD project was in applied statistics, it was clear from the beginning that I would need to write quite a lot of code.
I liked writing code (still do), and had done it somewhat professionally during previous internships. In particular, I had already faced the problem of working on a program over a long period of time with multiple people, and therefore of not losing ideas expressed in code across multiple people and/or versions of myself from different periods.
I knew that version controlling all my code was the only way to keep myself sane.
So I set up a git repository when I had around 50 lines of R code (maybe two weeks into my internship; I spent some time getting lost in articles at the beginning).
When the internship was finished, I wrote my internship report and did my master’s defense. Then one of my advisors suggested that I could probably make an article out of the report and submit it somewhere.
I looked at the folder that contained the internship report.
In it were the LaTeX source file, the PDF of the submitted report, multiple graphics files of results, and a bunch of miscellaneous intermediate files generated by the LaTeX compilation.
On the other hand, I had the feedback and suggested modifications from my advisors and from the people at the defense.
I made a mental estimate of the changes I would have to make to get this report publishable.
This is when I realized I had two options:
1. copying the folder, calling it internship_report_2_article, and therefore ending up with a folder structure like this:

       internship
       internship_code
       internship_report
       thesis

2. using a disciplined way to manage the multiple versions of written work, as well as the code.
The second option is probably a bit more work upfront, but a no-brainer in the long run.
I scoured the Internet for tips, used some of them, and gradually changed the way I used the whole system throughout the four years. By the end of my thesis, I had arrived at a set of practices, which this blog post summarizes. As is often the case with programming practices, they are not perfect, and they are not ideal for every situation and every person. They are the result of a long-term negotiation between the goals I needed to achieve, the suggestions and tips I found on the Internet, my technical ability, and the real-world constraints of the moment.
Most online tips recommend committing only the files you modify by hand (so the LaTeX files) and ignoring everything else. I tried that out and in the end decided to version control both the .tex files and the .pdf files. There are two reasons for it:
The practical downside of this practice is that the git repository is larger, but on modern computers, disk space is nearly never the issue.
One could argue that versioning is for keeping the diffs.
If a file is binary (like a PDF), then it’s not meant to be in the git repository.
I agree with this argument in principle.
One way to perfectly solve this problem would be to hook up the (mostly LaTeX) repository to a continuous integration system which compiles and saves the generated artefacts (the PDFs) periodically or after every commit (ShareLaTeX or Overleaf sort of do that).
But that would have been much more work: a system that complex breaks more easily and takes more time to maintain, and it would not have been worthwhile since I was the only one using it.
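So I stuck with committing the PDFs. Concretely, the policy of keeping the .tex and .pdf files while ignoring the intermediate files boils down to a short .gitignore; here is a sketch of the kind of patterns involved (the exact list depends on the LaTeX toolchain; note that *.pdf is deliberately not ignored):

    # A sketch of the .gitignore for a report/article repository:
    # LaTeX build artefacts are ignored, while the .tex sources and the
    # compiled .pdf stay under version control.
    cat > .gitignore <<'EOF'
    *.aux
    *.log
    *.out
    *.toc
    *.bbl
    *.blg
    *.synctex.gz
    EOF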
As I mentioned above, the code part of my thesis was version controlled right from the beginning1. There are plenty of resources on how to correctly version control a piece of software, as it is one of the most common practices in software development.
One thing that was not obvious when I started my thesis, but became gradually clearer was the concept of a library.
As I started out from statistics and not computer science, the code I wrote before my thesis consisted mostly of stand-alone “algorithms”: either a single function (in the best case), or a single script with parameters defined at the beginning of the file, calling functions defined in that same script (in R, for example) or in files next to it.
For my internship, it was mostly enough.
In this internship, I did something which was typical for someone’s first scientific research project: I put two classic methods together to solve a new problem2.
The code consists of several R scripts, each in its own file.
Every script serves its purpose: it reads some data (in the form of files), does something to that data, and saves the results somewhere, either as more data files in a specific place on the disk, to be read by other scripts, or as graphics or tables in a specific place on the disk, to be read by a human.
The way I used them was to run the scripts one by one and collect the results at the end to use in the report.
I felt safe tweaking the scripts and obtaining new results, as everything was in a git repository, so I could always go back to an earlier version if something went wrong.
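That safety net is nothing more than the usual git archaeology; something along these lines (the file name and commit hash are placeholders):

    # Look at the history of one script and, if needed, bring back an
    # older version of it (file name and commit hash are placeholders).
    git log --oneline -- simulation_script.R
    git checkout a1b2c3d -- simulation_script.R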
When I started the PhD research proper, this system started showing some shortcomings. Once I dug into the problem a little, though, the answer turned out to be quite straightforward:
I needed to separate my code into a library, and things (mostly scripts) that use the library to run experiments and achieve the actual goals, for example producing graphics which compare algorithms.
In the programming world, this is referred to as the library layer and the application layer.
The library part was not too hard to set up.
In the R community, everyone writes packages.
The standard practice is to release a package when publishing an article.
Entire websites and books (this one, for example) are dedicated to teaching you how to do that properly.
The script part, however, is a bit more difficult to organize.
I spent quite some time looking into how to set up the folder structure for research projects, and how to do “reproducible research”.
There are some online resources on the topic (seminars, books, articles, even a MOOC).
But things were a lot less clear, in my opinion for two reasons:

1. For reproducible research, there is no equivalent of an R package (defined as a specific way to structure R code). There are tools, knitr for example, but they are only tools which address the needs of some parts of your work. There is no software that you could use, or end-to-end procedure that you could follow, that would guarantee that your research is reproducible. Contrast this with R packages: you basically just click on the “new package” button in RStudio, and then all you need to do is fill up the empty folders with your code. All the methodological thinking is done, and you only need to focus on the content.

2. With reproducible research, what you have is a list of principles. You may agree with all of the principles, but you still need to design your own system so that you can actually follow them. This is reasonable to me, as doing scientific experiments is intrinsically an activity with more loose ends than developing software. It was not uncommon for researchers in the past to design their own note-taking systems or experimental procedures.
The code folder looks like this:

    main_thesis_work
        my_R_package
            R
                various_R_files_with_functions.R
            data
                reusable_datasets
            man
                R_documentation_generated_using_Roxygen
            src
                things_in_Cpp.cpp
            ...
        everything_else
            data
                data_scripts
                    scripts_to_regenerate_reusable_datasets_from_raw_datasets.R
                2016_06_01
                    metadata.txt
                    metadata.RDS
                    actual_dataset.RDS
                2016_07_01
                ...
            analysis
                2016_06_01
                    graphic_obtained_with.pdf
                2016_07_01
                ...
            scripts
                2016_06_01_script_name_which_explains_what_the_experiment_is_about.R
                ...
    internship
    internship_code
    internship_report
I used R almost exclusively in my thesis.
I coded some computation-intensive parts of the algorithms with Rcpp, but the code is mostly in R.
The R package part follows the standard package structure.
In the everything_else folder, though, I made several choices.
- The main_thesis_work folder is a single git repository. It is also a single R workspace. I set up RStudio so that the “build and reload” keyboard shortcut rebuilds and reloads the library folder (this is not specific to R; it’s a simple configuration available in most modern IDEs). A typical workflow: I have an experiment script open, run the experiment, realize that something is wrong, fix the problem in the library, rebuild and reload, and then rerun the experiment.
- I moved the reusable datasets out of the everything_else folder and put them in the package. Shipping some datasets with the package makes it easy for users of the package to start and try the algorithms directly. It also made it easy for me to test new things, without having to think about cleaning datasets.
- I ran all my experiments with knitr.3 The experiment code I wrote was all spin scripts, with a YAML header, knitr chunks and occasional markdown explanations. When I say I “run” an experiment, I mean I click on the “knit” button in RStudio, the computer does its work, and then graphics and caches are automatically saved into the analysis and data folders under the experiment’s date, because I configured this as the default chunk option (a sketch of what one run looks like follows below). This was so powerful; the ease of never having to worry about losing something was so empowering.
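To make that concrete, here is roughly what one run looked like from the command line, assuming default chunk options along the lines of fig.path and cache.path pointing at date-stamped subfolders of analysis and data (the script name below is a placeholder; clicking “knit” in RStudio does the equivalent of the render call):

    # Render one experiment script (a knitr spin script with a YAML header);
    # rmarkdown::render() understands .R files written in spin syntax.
    Rscript -e 'rmarkdown::render("scripts/2016_06_01_compare_smoothing_methods.R")'

    # With the chunk options set as described above, the outputs land in
    # date-stamped folders:
    ls analysis/2016_06_01/   # figures produced by the run
    ls data/2016_06_01/       # knitr cache for the run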
The folder structure above makes it easy to keep track of the experiments, which for me were the things most easily messed up. For the rest of the activities in my thesis, I adopted a structure that is pretty simple and flat.
    main_thesis_work
        my_R_package
            ...
        everything_else
            ...
    internship
    internship_code
    internship_report
    writings
        20140701_first_presentation_to_advisors
            prez.tex
            ...
        20140901_first_article
        20150501_presentation_at_seminar_1
        20150601_presentation_at_seminar_2
        20150701_presentation_at_seminar_3
        20150801_second_article
        ...
The folders in writings are independent and chronologically organized.
Most of them are individual git repositories.
When I needed graphics and tables produced by the experiments, I copy-pasted them from the experiment part of the main_thesis_work folder.
I find it nice to have these small individual git repositories, as the git log keeps a clear history of the modifications of that particular article or presentation. Especially for the articles, which can go through many iterations of rewriting following feedback from advisors and peer reviewers, having this history before my eyes really emboldens me to go ahead and make changes.
At the beginning of my thesis, I considered having a more sophisticated integration between the writings folder and the experiments, for example having the graphics automatically copied into the article folders, or writing my articles directly in the main_thesis_work folder, with R notebooks or something similar to Jupyter.
In the end, I decided to keep things simple and separated, and to do the copy-pasting by hand, for two reasons:
1. It would have meant relying on yet more tools (R notebooks or Jupyter).

2. I liked my knitr code better than using the notebook solutions. The upside of being able to control where, how and when the graphics are stored outweighed the downside of having the graphics in a separate window from the code.

I used git submodules to combine the articles written during the thesis into the actual manuscript. The folder structure looks like this:
    main_thesis_work
        my_R_package
            ...
        everything_else
            ...
    internship
    internship_code
    internship_report
    writings
        20140701_first_presentation_to_advisors
            prez.tex
            ...
        20140901_first_article
            .git
        20150501_presentation_at_seminar_1
        20150601_presentation_at_seminar_2
        20150701_presentation_at_seminar_3
        20150801_second_article
            .git
        ...
    thesis
        .git
        introduction
        chapter_from_first_article
            .git
        chapter_from_second_article
            .git
Using git submodules, the chapter_from_..._article folders each contain a clone of the corresponding article’s repository.
This means that the changes done in the article folders can be retrieved easily in the chapter folders, and vice versa.
In practice, the article folders and the chapter folders have different branches checked out. Typically, in the chapter folders, each chapter starts with an introduction placing it in the context of the thesis, while the articles are all independent of each other. When I receive reviewers’ feedback on a submission and modify the article in the article folder, all I need to do is
git pull article_repo article_branch
in the chapter folder, and the changes would be reflected in the thesis chapter.
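For reference, the wiring behind this is just standard git submodule usage; here is a sketch of how one chapter might be set up (the paths and branch name are placeholders, assuming the article repository lives in the writings folder):

    # In the thesis repository, add the article repository as a submodule,
    # checked out into the chapter folder (path and names are placeholders).
    cd thesis
    git submodule add /path/to/writings/20140901_first_article chapter_from_first_article

    # In the chapter folder, work on a thesis-specific branch that carries
    # the chapter introduction and thesis formatting.
    cd chapter_from_first_article
    git checkout -b thesis_chapter

    # Later, after revising the article in its own folder, the
    # "git pull article_repo article_branch" above brings those changes in:
    git pull origin master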
An external hard drive as origin, and working on several computers

I worked on my thesis both on a corporate computer and on my own laptop (much lighter and therefore easier to carry everywhere). I wasn’t allowed to put anything related to my thesis on GitHub (private repositories were a paid service at the time). The company had a GitLab server, but my personal computer did not have access to it.
To synchronize my work between the two computers, I bought an external hard drive, put a bare repository on it, and used that as the remote. It did not have all the functionality of GitHub, but with a graphical UI for git (SourceTree), it was good enough for me.
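Setting this up takes only a handful of commands; a sketch, with the drive’s mount point as a placeholder:

    # One-time setup: create a bare repository on the external drive and use
    # it as "origin" (the mount point /media/external is a placeholder).
    git init --bare /media/external/main_thesis_work.git

    cd ~/main_thesis_work
    git remote add origin /media/external/main_thesis_work.git
    git push -u origin master

    # On the other computer, with the drive plugged in:
    git clone /media/external/main_thesis_work.git

    # Day-to-day synchronization is then the usual pull/push:
    git pull origin master
    git push origin master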
1. I had taken a C++ class at ENSAE, and the only requirement for the only home assignment of the class was to submit a program as a group (in any object-oriented programming language) on GitHub.