How does git work?
My lies continue
I’m not going to answer that question, at least not fully.
The Git book has the details of the internal workings of git, and I see no reason to repeat all of that here.
However, there is one aspect of git I think is useful to understand: How does git track changes to files?
diff
Fundamentally, git determines how you’ve changed a file by using the UNIX diff command.
Let’s see how that works. Assume I have a file containing this text:[1]
You can use git in this way. But that's not the way you're
going to use it when you work with a scientific research group.
The truth is that you're going to use git with a repository of some sort:
- It may be that your files will be fairly independent with respect to
others in your working group. In that case, you'll probably be
asked, at the end of the summer, to upload your files to a central
git repository so that others can access what you've done.
- It's more likely that you'll be asked to work with files that
someone else has written. You'll copy the files to your own area,
and perhaps create one or more branches for your work. At some
point, after testing and review, you might be asked to merge your
work back into the original project.
For the sake of this example, I make a copy of this file and make a couple of changes. Can you spot the differences?
You can use git in this way. But that's not the way you're
going to use it when you work with a scientific research group.
The truth is that you're going to use git with a repository of some sort:
- It may be that your files will be fairly independent with respect to
others in your working group. In that case, you'll probably be
asked, at the end of the summer, to upload your files to a central
git repository so that others can access what you've done.
<p />
- It's more likely that you'll be asked to work with files that
someone else has written. You will copy the files to your own area,
and perhaps create one or more branches for your work. At some
point, after testing and review, you might be asked to merge your
work back into the original project.
Did you see the differences? Fortunately, you don’t have to look for
them by eye. You can let the diff command do that for
you. Assume that the first block of text above is in the file test.txt
and the second block is in test2.txt:
diff test.txt test2.txt
10a11,12
> <p />
>
12c14
< someone else has written. You'll copy the files to your own area,
---
> someone else has written. You will copy the files to your own area,
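Essentially, the diff command tells us the changes we’d have to make to test.txt to make it look like test2.txt. Here 10a11,12 means that lines 11 and 12 of test2.txt were added after line 10 of test.txt, and 12c14 means that line 12 of test.txt was changed into line 14 of test2.txt. Lines marked with < come from test.txt, and lines marked with > come from test2.txt.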
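If you’d like to experiment with this idea outside the command line, here is a minimal sketch using Python’s standard difflib module. It is only an illustration: the lists old and new are shortened stand-ins for test.txt and test2.txt that I made up for this example, and difflib prints a "unified" diff (the style that git diff shows) rather than the classic format above.

```python
import difflib

# Shortened stand-ins for test.txt and test2.txt (assumed for illustration).
old = [
    "- It's more likely that you'll be asked to work with files that",
    "someone else has written. You'll copy the files to your own area,",
    "and perhaps create one or more branches for your work.",
]
new = [
    "<p />",
    "- It's more likely that you'll be asked to work with files that",
    "someone else has written. You will copy the files to your own area,",
    "and perhaps create one or more branches for your work.",
]

# Compare the two versions line by line.
diff_lines = list(
    difflib.unified_diff(old, new, fromfile="test.txt", tofile="test2.txt", lineterm="")
)
for line in diff_lines:
    print(line)

# Skip the two header lines ("--- test.txt" and "+++ test2.txt"),
# then collect the lines that were removed from or added to the old version.
body = diff_lines[2:]
removed = [line[1:] for line in body if line.startswith("-")]
added = [line[1:] for line in body if line.startswith("+")]
```

After this runs, removed holds the one line that exists only in the old version, and added holds the two lines that exist only in the new one.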
In a directory in which you’ve typed git init, or when you use git clone to copy a remote repository, git creates a hidden directory named .git. Whenever you type git commit, git uses diff to compare each file with the version stored in .git.[2]
To a first approximation, the net effect is that git doesn’t keep a full, separate copy of every version of every file for every commit. Instead, it only tracks the differences.
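To make the idea of tracking only the differences concrete, here is a toy sketch in Python. It is not git’s actual storage format (the Git book covers that), and make_delta and apply_delta are hypothetical helper names invented for this illustration. The point is only that an old version plus a small record of changes is enough to rebuild a new version.

```python
from difflib import SequenceMatcher


def make_delta(old, new):
    """Record only the regions where `new` differs from `old`."""
    ops = SequenceMatcher(None, old, new, autojunk=False).get_opcodes()
    # Each entry: (start in old, end in old, lines that replace that range).
    return [(i1, i2, new[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]


def apply_delta(old, delta):
    """Rebuild the new version from the old lines plus the recorded changes."""
    result, pos = [], 0
    for i1, i2, replacement in delta:
        result.extend(old[pos:i1])    # unchanged lines, copied from the old version
        result.extend(replacement)    # lines that were inserted or changed
        pos = i2                      # skip old lines that were deleted or replaced
    result.extend(old[pos:])          # whatever is left after the last change
    return result


old = ["line one", "line two", "line three", "line four"]
new = ["line one", "line 2", "line three", "line four", "line five"]

delta = make_delta(old, new)
rebuilt = apply_delta(old, delta)
print(delta)
print(rebuilt == new)
```

Notice that the delta mentions only the changed line and the appended line; the three unchanged lines are never stored a second time.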
Why even mention this?
I’ve described this to explain why git is good at managing and tracking changes for some kinds of files, but is a poor choice for others.
Git is a good choice for:
text files, like the ones used to create the web pages you’re reading now;
source code, such as Python .py files, C++ .cpp files, and shell scripts like .sh files;
research papers, like one you might write for your work.
Git is a bad choice for:
binary files, such as compiled code and shared libraries;
images, such as .png files;
most database formats;
compressed data files, such as ROOT n-tuples.
The reason is that diff can’t do anything useful with binary files. It does a line-by-line comparison, and binary files don’t usually have line breaks. Even if they did, a small change in (for example) an image file typically causes an overall change in the file’s compressed contents.
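Here is a small Python sketch of that last point, under some assumptions: the made-up "event" lines stand in for a text file, and zlib stands in for whatever compression an image or data format might use. One line of the text is edited. A line-by-line comparison sees exactly one changed line, while the two compressed byte strings are compared position by position to see how far the change spreads. The exact byte count depends on the data and on the zlib version, so I make no promise about the number it prints.

```python
import difflib
import zlib

# A made-up text file of 100 distinct lines (assumed for illustration).
lines = [f"event {i}: energy = {i * 7 % 101} GeV" for i in range(100)]

# Make a copy and edit a single line.
edited = list(lines)
edited[3] = "event 3: energy = 999 GeV"

# Line-by-line comparison, in the spirit of diff.
matcher = difflib.SequenceMatcher(None, lines, edited, autojunk=False)
changed_lines = sum(
    i2 - i1 for tag, i1, i2, _, _ in matcher.get_opcodes() if tag != "equal"
)

# Compress both versions, then compare the compressed bytes position by position.
packed_a = zlib.compress("\n".join(lines).encode())
packed_b = zlib.compress("\n".join(edited).encode())
differing_bytes = sum(a != b for a, b in zip(packed_a, packed_b))
differing_bytes += abs(len(packed_a) - len(packed_b))

print(f"changed lines in the text: {changed_lines} of {len(lines)}")
print(f"differing bytes after compression: {differing_bytes} of {len(packed_a)}")
```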
That’s part of the reason why I advised against using the git add . command: it might idly add frequently-changing binary files to
git’s tracking, which would waste large amounts of disk space
with every commit.
Another potential issue is a .ipynb file, that is, a file used to
store a Jupyter notebook. If, by some feat of
discipline, there’s nothing stored in the file except lines of text,
then the result of diff won’t be large. But if there’s a
single figure embedded within it, even rendered Markdown, we have the same problem as we would with an image
file.
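To see why an embedded figure is a problem, here is a rough Python sketch. It assumes the common case in which a notebook stores a figure as base64-encoded text under an "image/png" key. The function fake_notebook is a hypothetical helper that builds a simplified, notebook-like structure (not a complete, valid .ipynb file), and random bytes stand in for real image data.

```python
import base64
import json
import os


def fake_notebook(image_bytes):
    """Return JSON text for a simplified, notebook-like structure."""
    cell = {
        "cell_type": "code",
        "source": ["plt.plot(x, y)"],
        "outputs": [
            {
                "output_type": "display_data",
                # The figure is stored as one long base64-encoded string.
                "data": {"image/png": base64.b64encode(image_bytes).decode("ascii")},
            }
        ],
    }
    return json.dumps({"cells": [cell]}, indent=1)


image = os.urandom(30_000)          # stand-in for the bytes of a PNG figure
text = fake_notebook(image)

line_lengths = [len(line) for line in text.splitlines()]
longest = max(line_lengths)
print(f"lines in the file: {len(line_lengths)}")
print(f"longest line: {longest} characters")
```

The whole figure lands on a single enormous line of text. If the figure changes at all, a line-by-line comparison has to treat that entire line as changed.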
There is an exception: If you have a binary file that is unlikely to
change, then it might make sense to include it in a git
repository. For example, consider the figures I created for this
tutorial. Those .png files are static, and almost certainly won’t
change once I’ve committed them.

Figure 139: https://xkcd.com/99/ by Randall Munroe. As long as the heart remains static, we’re fine. Change the heart to a diamond or a club, then diff will have a problem.