How does git work?

My lies continue

I’m not going to answer that question, at least not fully.

The Git book has the details of the internal workings of git, and I see no reason to repeat all of that here.

However, there is one aspect of git I think is useful to understand: How does git track changes to files?

diff

Fundamentally, git shows you how you’ve changed a file by making the same kind of comparison that the UNIX diff command makes.

Let’s see how that works. Assume I have a file containing this text: [1]

You can use git in this way. But that's not the way you're
going to use it when you work with a scientific research group.

The truth is that you're going to use git with a repository of some sort:

- It may be that your files will be fairly independent with respect to
  others in your working group. In that case, you'll probably be
  asked, at the end of the summer, to upload your files to a central
  git repository so that others can access what you've done.

- It's more likely that you'll be asked to work with files that
  someone else has written. You'll copy the files to your own area,
  and perhaps create one or more branches for your work. At some
  point, after testing and review, you might be asked to merge your
  work back into the original project.

For the sake of this example, I make a copy of this file and make a couple of changes. Can you spot the differences?

You can use git in this way. But that's not the way you're
going to use it when you work with a scientific research group.

The truth is that you're going to use git with a repository of some sort:

- It may be that your files will be fairly independent with respect to
  others in your working group. In that case, you'll probably be
  asked, at the end of the summer, to upload your files to a central
  git repository so that others can access what you've done.

<p />

- It's more likely that you'll be asked to work with files that
  someone else has written. You will copy the files to your own area,
  and perhaps create one or more branches for your work. At some
  point, after testing and review, you might be asked to merge your
  work back into the original project.

Did you see the differences? Fortunately, you don’t have to look for them by eye. You can let the diff command do that for you. Assume that the first block of text above is in file test.txt and the second block is in test2.txt:

diff test.txt test2.txt
10a11,12
> <p />
> 
12c14
<   someone else has written. You'll copy the files to your own area,
---
>   someone else has written. You will copy the files to your own area,

Essentially, the diff command tells us the changes we’d have to make to test.txt to make it look like test2.txt.
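
If you’d like to experiment with this idea from within Python, the standard library’s difflib module makes a similar comparison. What follows is only a sketch: it uses a shortened excerpt of the two blocks above rather than the real test.txt and test2.txt, and difflib reports its result in the "unified" format (lines prefixed with + and -), not the < and > format shown above.

```python
import difflib

# A shortened excerpt of the first block of text (stand-in for test.txt).
original = """- It's more likely that you'll be asked to work with files that
  someone else has written. You'll copy the files to your own area,
  and perhaps create one or more branches for your work.
""".splitlines(keepends=True)

# The same excerpt with the two changes applied (stand-in for test2.txt).
modified = """<p />

- It's more likely that you'll be asked to work with files that
  someone else has written. You will copy the files to your own area,
  and perhaps create one or more branches for your work.
""".splitlines(keepends=True)

# Compare the two lists of lines and collect the unified-format result.
diff_lines = list(
    difflib.unified_diff(original, modified, fromfile="test.txt", tofile="test2.txt")
)
print("".join(diff_lines), end="")
```

In the printed result, lines starting with + exist only in the second text, lines starting with - exist only in the first, and lines starting with a space are unchanged context.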

In a directory in which you’ve typed git init, or when you use git clone to copy a remote repository, git creates a hidden directory named .git. Whenever you type git commit, git uses diff to compare each file with the version stored in .git. [2]

The net effect, in this simplified picture, is that git doesn’t need to keep a full copy of every version of every file for every commit. Instead, it can track just the differences.

Why even mention this?

I’ve described this to explain why git is good at managing and tracking changes for some kinds of files, but is a poor choice for others.

Git is a good choice for:

  • text files, like the ones used to create the web pages you’re reading now;

  • source code, such as python .py files, C++ .cpp files, and shell scripts like .sh files;

  • research papers, like one you might write for your work.

Git is a bad choice for:

  • binary files, such as compiled code and shared libraries;

  • images such as .png files;

  • most database formats;

  • compressed data files, such as ROOT n-tuples.

The reason is that diff can’t do anything useful with binary files. It does a line-by-line comparison, and binary files don’t usually have meaningful line breaks. Even if they did, a small change in (for example) an image file typically changes the file’s compressed contents throughout.
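
Here is a small sketch of the line-break problem, using Python’s difflib module as a stand-in for diff. The 2,000-character string is made-up data that plays the role of a binary file with no line breaks; it is not a real image or library, and the variable names are only for illustration.

```python
import difflib

# Made-up stand-in for binary data: 2,000 characters with no line breaks.
blob_a = "0123456789abcdef" * 125
# Change exactly one character in the middle of the data.
blob_b = blob_a[:1000] + "X" + blob_a[1001:]

# With no line breaks, each "file" is a single (very long) line.
lines_a = blob_a.splitlines(keepends=True)
lines_b = blob_b.splitlines(keepends=True)

diff_lines = list(difflib.unified_diff(lines_a, lines_b, lineterm=""))

# Skip the two header lines, then keep only the removed (-) and added (+) lines.
changed = [line for line in diff_lines[2:] if line.startswith(("+", "-"))]
reported = sum(len(line) - 1 for line in changed)

print(f"{len(lines_a)} line(s) in the original")
print(f"1 character changed, {reported} characters reported as changed")
```

Because the whole string counts as one line, changing a single character makes the comparison report the entire "line" as removed and then added again.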

That’s part of the reason why I advised against using git add .: it might indiscriminately add frequently-changing binary files to git’s tracking, which would waste large amounts of disk space with every commit.
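
One way to see what git add . might sweep up is to scan the directory first. The sketch below is an illustration, not a git feature: looks_binary and find_binary_files are hypothetical helper names, the demonstration files are made up, and treating "contains a NUL byte in the first 8,000 bytes" as "binary" is a rough heuristic assumed to be good enough for this purpose.

```python
import os
import tempfile


def looks_binary(path, sample_size=8000):
    """Guess whether a file is binary by looking for a NUL byte near its start."""
    with open(path, "rb") as handle:
        return b"\x00" in handle.read(sample_size)


def find_binary_files(top):
    """Return relative paths of files under `top` that look binary, skipping .git."""
    found = []
    for folder, subfolders, filenames in os.walk(top):
        # Removing .git here stops os.walk from descending into it.
        if ".git" in subfolders:
            subfolders.remove(".git")
        for name in filenames:
            full_path = os.path.join(folder, name)
            if looks_binary(full_path):
                found.append(os.path.relpath(full_path, top))
    return sorted(found)


# Demonstration in a throwaway directory containing two made-up files.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "notes.txt"), "w") as handle:
        handle.write("plain text\n")
    with open(os.path.join(demo_dir, "data.bin"), "wb") as handle:
        handle.write(b"\x89PNG\x00\x01\x02")
    suspects = find_binary_files(demo_dir)
    print(suspects)
```

In your own work you would call find_binary_files(".") from the top of your working directory and then decide, file by file, what belongs in the repository.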

Another potential issue is a .ipynb file, that is, a file used to store a Jupyter notebook. If, by some feat of discipline, there’s nothing stored in the file except lines of text, then the result of diff won’t be large. But if there’s a single figure embedded within it, even rendered Markdown, we have the same problem as we would with an image file.
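
To see why an embedded figure is a problem, here is a sketch that builds a simplified notebook-like structure in Python. The layout of the dictionary is an assumption based on the general shape of the notebook JSON format, the "figure" is made-up bytes rather than a real plot, and the cell’s source line is only a placeholder.

```python
import base64
import json

# Made-up bytes standing in for a PNG figure; a real one would come from a plot.
fake_png = bytes(range(256)) * 40

# Simplified, assumed structure: one code cell with one embedded image output.
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["plt.plot(x, y)"],
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    # The image is stored as one long base64-encoded string.
                    "data": {"image/png": base64.b64encode(fake_png).decode("ascii")},
                }
            ],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

# Write the structure out as text, the way it would sit in a .ipynb file.
text = json.dumps(notebook, indent=1)
lines = text.splitlines()
longest = max(lines, key=len)
print(f"{len(lines)} lines; the longest has {len(longest)} characters")
```

Nearly all of the file’s size sits in that one enormous line, so any change to the figure makes a line-by-line comparison treat the whole encoded image as changed.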

There is an exception: If you have a binary file that is unlikely to change, then it might make sense to include it in a git repository. For example, consider the figures I created for this tutorial. Those .png files are static, and almost certainly won’t change once I’ve committed them.

[Image: xkcd comic "Binary Heart"]

Figure 139: https://xkcd.com/99/ by Randall Munroe. As long as the heart remains static, we’re fine. Change the heart to a diamond or a club, and diff will have a problem.


[1] No points will be awarded if you figure out from which file I excerpted this text.

[2] I’ve omitted a lot of details here, including the structure of the .git directory and the specific options supplied to diff. That’s part of the reason the title of this page is a lie.