# How does git work?

:::{admonition} My lies continue
:class: warning

I'm not going to answer that question, at least not fully. The [Git book](https://git-scm.com/book/en/v2/Git-Basics-Recording-Changes-to-the-Repository) has the details of the internal workings of git, and I see no reason to repeat all of that here.

However, there is one aspect of git I think is useful to understand: How does git track changes to files?
:::

## diff

Fundamentally, git determines how you've changed a file by using the UNIX [diff](https://www.geeksforgeeks.org/diff-command-linux-examples/) command. Let's see how that works. Assume I have a file containing this text:[^nopoints]

[^nopoints]: No points will be awarded if you figure out from which file I excerpted this text.

```text
You can use git in this way. But that's not the way you're going
to use it when you work with a scientific research group. The
truth is that you're going to use git with a repository of some
sort:

- It may be that your files will be fairly independent with respect
to others in your working group. In that case, you'll probably be
asked, at the end of the summer, to upload your files to a central
git repository so that others can access what you've done.

- It's more likely that you'll be asked to work with files that
someone else has written. You'll copy the files to your own area,
and perhaps create one or more branches for your work. At some
point, after testing and review, you might be asked to merge your
work back into the original project.
```

For the sake of this example, I make a copy of this file and make a couple of changes. Can you spot the differences?

```text
You can use git in this way. But that's not the way you're going
to use it when you work with a scientific research group. The
truth is that you're going to use git with a repository of some
sort:

- It may be that your files will be fairly independent with respect
to others in your working group. In that case, you'll probably be
asked, at the end of the summer, to upload your files to a central
git repository so that others can access what you've done.



- It's more likely that you'll be asked to work with files that
someone else has written. You will copy the files to your own area,
and perhaps create one or more branches for your work. At some
point, after testing and review, you might be asked to merge your
work back into the original project.
```

Did you see the differences? Fortunately, you don't have to look for them by eye. You can let the {command}`diff` command do that for you. Assume that the first block of text above is in file {file}`test.txt` and the second block is in {file}`test2.txt`:

```text
diff test.txt test2.txt
10a11,12
>
>
12c14
< someone else has written. You'll copy the files to your own area,
---
> someone else has written. You will copy the files to your own area,
```

Essentially, the {command}`diff` command tells us the changes we'd have to make to {file}`test.txt` to make it look like {file}`test2.txt`: add two lines after line 10, and change line 12.

In a directory in which you've typed `git init`, or when you use `git clone` to copy a remote repository, git creates a [hidden directory](https://twiki.nevis.columbia.edu/twiki/bin/view/Main/HiddenPackages) named `.git`. Whenever you type `git commit`, git uses {command}`diff` to compare each file with the version stored in `.git`.[^details]

[^details]: I've omitted a lot of details here, including the structure of the `.git` directory and the specific options supplied to {command}`diff`. For example, git has its own built-in comparison code rather than running the UNIX command, and it stores differences only when it packs its compressed snapshots of files. That's part of the reason the title of this page is a lie.

The net effect is that git doesn't keep multiple copies of every version of every file of every commit. Instead, it only tracks the differences.

## Why even mention this?

I've described this to explain why git is good at managing and tracking changes for some kinds of files, but is a poor choice for others.

Git is a good choice for:

- text files, like the ones used to create the web pages you're reading now;
- source code, such as Python `.py` files, C++ `.cpp` files, and shell scripts like `.sh` files;
- research papers, like one you might write for your work.

Git is a _bad_ choice for:

- binary files, such as compiled code and shared libraries;
- images such as `.png` files;
- most database formats;
- compressed data files, such as ROOT n-tuples.

The reason is that {command}`diff` can't work on binary files. It does a line-by-line comparison, and binary files don't usually have line breaks. Even if they did, a small change in (for example) an image file typically causes an overall change in the file's compressed contents.
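To make that last point concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the sample lines are made up, and I use Python's standard `difflib` and `zlib` modules as stand-ins for {command}`diff` and for whatever compression a real image or data format uses. It edits one line of a text, then compares the two versions both as plain text and as compressed bytes.

```python
import difflib
import zlib

# Hypothetical sample data: 1000 numbered lines of text.
original = [f"line {i}: some text that stays the same\n" for i in range(1000)]

# Make a copy and edit a single line near the start.
modified = list(original)
modified[10] = "line 10: this line was edited\n"

# Line-by-line comparison of the plain text, without context lines.
# Keep only the removed ("-") and added ("+") lines, skipping the
# "---" and "+++" file headers.
text_changes = [
    entry
    for entry in difflib.unified_diff(original, modified, n=0)
    if entry[0] in "+-" and not entry.startswith(("+++", "---"))
]
print("changed text lines:", len(text_changes))

# Compress both versions, as a stand-in for a compressed file format.
packed_a = zlib.compress("".join(original).encode())
packed_b = zlib.compress("".join(modified).encode())

# Count the byte positions at which the two compressed versions differ.
differing = sum(a != b for a, b in zip(packed_a, packed_b))
differing += abs(len(packed_a) - len(packed_b))
print("differing compressed bytes:", differing, "of", len(packed_a))
```

The text comparison reports only the one edited line (once as removed, once as added). The compressed versions are a different story: there is no guarantee that they resemble each other anywhere after the edit, and the exact number of differing bytes depends on the data and on the compression library.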
The inability to compare binary files is part of the reason why I {ref}`advised against ` using `git add .`: it might inadvertently add frequently changing binary files to git's tracking, which would waste large amounts of disk space with every commit.

Another potential issue is a `.ipynb` file, that is, a file used to store a {ref}`Jupyter notebook `. If, by some feat of discipline, there's nothing stored in the file except lines of text, then the result of {command}`diff` won't be large. But if there's a single figure embedded within it, even rendered {ref}`Markdown `, we have the same problem as we would with an image file.

There is an exception: If you have a binary file that is unlikely to change, then it might make sense to include it in a git repository. For example, consider the figures I created for this tutorial. Those `.png` files are static, and almost certainly won't change once I've committed them.

:::{figure-md} binary_heart-fig
:align: center

xkcd binary_heart by Randall Munroe. As long as the heart remains static, we're fine. Change the heart to a diamond or a club, and {command}`diff` will have a problem.
:::