How Git keeps track of content history -- a visual approach.


Because the Git version control system takes a rather novel approach to storing history, I thought I'd write a very quick-and-dirty explanation of it. Once new users understand how this works, a great deal about Git suddenly becomes a lot clearer. So let's see how well this works out. (I'm still touching up this post, so if you see a typo, leave a comment. Or just leave a comment, anyway.)

Let's start with a regular checkout (clone) of the mainstream Linux kernel source tree, somewhere in or under your home directory (none of this requires root access):

$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git

which will, after a while, give you a new directory named linux-2.6, so cd into that new directory, and check that you have only a branch named master:

$ git branch
* master
$

So far, so good. And here's where it gets interesting.

Git does not store history as the differences or "deltas" between changesets. Rather, every new "state" of the repository is identified by a SHA1 checksum written as 40 hexadecimal digits, which is a reference to a complete snapshot of all of the components that make up that state of that repository.

For instance, right this minute, I have a fully-updated clone of the mainstream kernel source tree, and that state is represented by a checksum I can see with, among other commands, git log:

$ git log
commit 06867fbb8abc936192195e5dcc4b63e12cc78f72
... snip ...

What this tells me is that the complete state of my branch at this minute is uniquely identified by that 40-digit hexadecimal value: 06867fbb8abc936192195e5dcc4b63e12cc78f72. But what does that value mean? Simple.

Git has an ls-tree subcommand that shows what that value represents for Git tree-like things. By "tree-like things", I don't mean simply a subdirectory like arch, but rather a specific state of a directory like arch. We can ask for the ls-tree information about our current clone with either of:

$ git ls-tree master
$ git ls-tree 06867fbb8abc936192195e5dcc4b63e12cc78f72

which, in my case, gives me:

$ git ls-tree master
100644 blob 57af07cf7e682e77de69d96587b0ca315ea611a1	.gitignore
100644 blob 9b0d0267a3c3f1ea75a674fe858fac2165a8b683	.mailmap
100644 blob ca442d313d86dc67e0a2e5d584b465bd382cbf5c	COPYING
100644 blob 44fce988eaac8cd22bfe5a5e753ae1bb58b3476d	CREDITS
040000 tree 2912a23c4db9ef335ec243fe45e3d1647e9b3469	Documentation
100644 blob b8b708ad6dc3815eb0d23bfea2c972d03b9477c0	Kbuild
100644 blob c13f48d65898487105f0193667648382c90d0eda	Kconfig
100644 blob 0e7a80aefa0c27d52e9dec4e2a29d181ff7575ac	MAINTAINERS
100644 blob ea51081812f38d5ee8dfaeaab060a9fb4a86ba67	Makefile
100644 blob 0d5a7ddbe3ee8d108bf7079909ddcec30dfb4560	README
100644 blob 55a6074ccbb715d99b642fa510d3c993121f453d	REPORTING-BUGS
040000 tree 441a71cc2d8c5ce4e108c26c67aaf52916400fb4	arch
040000 tree c76a4799e7bbc59a28dfbcb3900aab91935b34e7	block
040000 tree c983e8fa79bbaf07c9c4ed500e06ed6f371fa2b9	crypto
040000 tree 71ff18ac162474014e60665b49310c97fa933d93	drivers
040000 tree e30493c735e1dfdd08b48eb6a4f2fcd9c8a0a6fa	firmware
040000 tree 5b34106e527f8ab64d38cac8df419528aeda5379	fs
040000 tree 5fe09e8bfc976af195008cde96f215f929456698	include
040000 tree 7aeb965e47f6ce4e62af60302cb8ce84d40164e1	init
040000 tree 8940b1012ece9d3ce666556e746bc1b0deb30a0b	ipc
040000 tree e46e01200d1ced07ae062479e4e1e709f1bfebae	kernel
040000 tree d25648460ae9a0990f13d791cf0650f8a339a15e	lib
040000 tree 66cdde83db9c02546a730348368d3b70ad7ec1aa	mm
040000 tree dc81f8b5ce5a5ead0f7c87b123cc90f4b54d9ad0	net
040000 tree a51d84f8c8a1ee2749fd603ba8a7e169936b2b7b	samples
040000 tree 9776c9f7ce1b5782efa0159b247d63ba4e219bd7	scripts
040000 tree c38a8b1047565754630b701131b55521ca5ab90d	security
040000 tree a16d007119da248ed8752ca38c63631f50ea23ce	sound
040000 tree 7c75c99cb655d561c327890a7dcb4b5aca0fbab0	tools
040000 tree 1e3684ddfd7c6b964e6ca2ad6498e3f8b24a7762	usr
040000 tree b9a39d427b5ddb10e8c7c137ef3c81465502ccd4	virt

where master is the name of my current (and only) branch. In addition, whichever branch you currently have checked out can always be referred to as HEAD, which becomes useful once you're working with more than one branch. Since master is the branch checked out right now, you could have done the following just as easily:

$ git ls-tree HEAD
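
To check that these spellings really are interchangeable, here is a minimal sketch that uses a throwaway repository rather than the kernel tree. The file name README, its content and the Example identity are made up for the illustration. It also shows that the 40-digit value printed by git log names a commit object, and that git ls-tree follows that commit to the top-level tree it points to.

```shell
# Throwaway repository; README and the Example identity are illustrative only.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is named master
git config user.name Example
git config user.email example@example.com
echo 'hello world' > README
git add README
git commit -q -m 'first commit'

commit_id=$(git rev-parse HEAD)               # the 40-hex-digit value git log shows
object_type=$(git cat-file -t "$commit_id")   # what kind of object that value names
tree_id=$(git rev-parse 'HEAD^{tree}')        # the top-level tree that commit points to

# Four ways of asking for the same listing.
by_branch=$(git ls-tree master)
by_head=$(git ls-tree HEAD)
by_id=$(git ls-tree "$commit_id")
by_tree=$(git ls-tree "$tree_id")

echo "$object_type $commit_id points to tree $tree_id"
echo "$by_head"
```

All four listings should come out identical, since a branch name, HEAD and a commit ID are just different routes to the same tree.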

But what does all of that mean? And this is the point at which one gets enlightenment.

In the above, anything of type blob represents the current state of a file, while anything of type tree represents the current state of a directory. In short, the current state of my branch is identified by a single SHA1 checksum, and that unique checksum represents a table of everything in my current directory, with each entry consisting of a file mode, a Git type, a name and, again, a unique SHA1 checksum that completely defines the content of that object.
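
To make "identified by a checksum" a little more concrete, the sketch below computes a blob ID by hand, without Git. It relies on the fact that Git hashes a short header ("blob", the size in bytes, a NUL byte) followed by the raw file content. The content string is a made-up example, and the sketch assumes the sha1sum tool from GNU coreutils is installed.

```shell
# Hypothetical file content, chosen only for illustration.
content='hello world'

# Size in bytes of the content plus its trailing newline.
size=$(( $(printf '%s\n' "$content" | wc -c) ))

# Blob ID = SHA1 of "blob <size>", a NUL byte, then the raw file bytes.
blob_id=$( { printf 'blob %d\0' "$size"; printf '%s\n' "$content"; } | sha1sum | cut -d' ' -f1 )

echo "$blob_id"
# prints 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

Running git hash-object on a file holding those same bytes should report the same value, which is why the same file content ends up with the same blob ID in any repository.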

Let me emphasize that with an actual example. You can run the same command on any subdirectory given its current SHA1 value. For the tools/ directory, take the SHA1 value from the listing above:

$ git ls-tree 7c75c99cb655d561c327890a7dcb4b5aca0fbab0
040000 tree 2e26f5db3439122424dfc869fd6ee301a0427033	firewire
040000 tree e0e441c15192f380eaec8e74ff8fa3abbf857722	hv
040000 tree 671b9e241e6ce933662bc571ce6be59ca90f8569	perf
040000 tree 79cdf858c28055c58b276272b8a3ae1f6d9a3ba1	power
040000 tree 9a18299d98cd48e99faf74f30724bcf893cf99b4	slub
040000 tree bdeb0c783855e54278ee8e3e7b9abd89ebdc52c6	testing
040000 tree 0e22270907013b54e71919f51bcb1349292925bd	usb
040000 tree 439c3b825c5bc3496ec77430a711cf6078cd23fb	virtio
$

What the above tells us is not only that the current state of the tools directory consists of those subdirectories, but also that it consists of those precise states of those subdirectories. And there's one more thing that's worth knowing before showing how Git handles changes.

The current SHA1 value of 7c75c99cb655d561c327890a7dcb4b5aca0fbab0 for the tools/ directory is an absolutely unambiguous representation of its entire state, including all subdirectories. That means that if I had that value on my system and you had the same value on your system, we could be absolutely sure that our respective tools/ directories were identical to the byte. In other words, those SHA1 values not only identify content and its current state, they are computed from it so that, apart from astronomically unlikely flukes, two identical SHA1 values in Git will always represent exactly the same content. (For a commit ID such as the one git log showed earlier, the checksum also covers the parent commits, so two identical commit IDs mean exactly the same history as well.) But we're not done. Let's see what happens when you make a trivial change and save it.
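
Here is a small experiment along those lines, as a sketch only: two unrelated throwaway repositories are given identical files, and their tree IDs are compared. The helper make_repo, the file names and the commit messages are all invented for the illustration.

```shell
# make_repo MESSAGE: create a scratch repository with fixed content,
# commit it with the given message, and print the repository path.
make_repo() {
    dir=$(mktemp -d)
    (
        cd "$dir" || exit 1
        git init -q .
        git config user.name Example
        git config user.email example@example.com
        mkdir tools
        echo 'all:' > Makefile
        echo 'echo hi' > tools/run.sh
        git add .
        git commit -q -m "$1"
    ) > /dev/null
    echo "$dir"
}

repo_a=$(make_repo 'written by Alice')
repo_b=$(make_repo 'written by Bob')

# Tree IDs depend only on content, so they should match across repositories.
tree_a=$(git -C "$repo_a" rev-parse 'HEAD^{tree}')
tree_b=$(git -C "$repo_b" rev-parse 'HEAD^{tree}')
tools_a=$(git -C "$repo_a" rev-parse 'HEAD:tools')
tools_b=$(git -C "$repo_b" rev-parse 'HEAD:tools')

# Commit IDs also cover the message, author, date and parents, so they differ here.
commit_a=$(git -C "$repo_a" rev-parse HEAD)
commit_b=$(git -C "$repo_b" rev-parse HEAD)

echo "top-level trees: $tree_a $tree_b"
echo "tools/ trees:    $tools_a $tools_b"
echo "commits:         $commit_a $commit_b"
```

The two tree columns should agree, both for the top level and for tools/, while the commit IDs differ because the commit messages do.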

First, create a new branch you plan on throwing away later:

$ git branch junk
$ git branch
  junk
* master
$ git checkout junk
Switched to branch 'junk'
$ git branch
* junk
  master
$
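
Creating a branch copies nothing; the new name simply points at the same commit. A minimal sketch that checks this in a throwaway repository (the Makefile content and the Example identity are placeholders, not taken from the kernel tree):

```shell
# Throwaway repository with a single commit on master.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is named master
git config user.name Example
git config user.email example@example.com
echo 'NAME = Saber-toothed Squirrel' > Makefile
git add Makefile
git commit -q -m 'initial commit'

git branch junk

# Both branch names resolve to the very same commit ID ...
master_id=$(git rev-parse master)
junk_id=$(git rev-parse junk)

# ... and therefore to the very same top-level tree listing.
master_tree=$(git ls-tree master)
junk_tree=$(git ls-tree junk)

echo "master: $master_id"
echo "junk:   $junk_id"
```

Both lines of output should show the same 40-digit value.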

So far, so good. And since the new branch is absolutely identical to the original, we can run all of our git ls-tree commands and we'll get the same results:

$ git ls-tree master
... snip ...
$ git ls-tree junk
... snip ...
$

Now let's make and commit a small change in our junk branch and see what happens.

Edit, say, the top-level Makefile and change the NAME line to read:

NAME = rday

You can see the change with:

$ git diff
diff --git a/Makefile b/Makefile
index ea51081..2c9fa0a 100644
--- a/Makefile
+++ b/Makefile
@@ -2,7 +2,7 @@ VERSION = 3
 PATCHLEVEL = 2
 SUBLEVEL = 0
 EXTRAVERSION = -rc7
-NAME = Saber-toothed Squirrel
+NAME = rday
 
 # *DOCUMENTATION*
 # To see a list of typical targets execute "make help"

Now commit that change on the current junk branch and look at the log:

$ git commit -a
$ git log
commit 71fbc474bf971b973a6006dbb9091edc4c0f17be
Author: Robert P. J. Day 
Date:   Sat Dec 31 10:33:09 2011 -0500

    trivial change

commit 06867fbb8abc936192195e5dcc4b63e12cc78f72
Merge: 604a16b abb959f
Author: Linus Torvalds 
... snip ...

From the above, you can see that my junk branch now has one more commit than the master from which it branched, but how does Git represent that change? You can find out by running git ls-tree on both branches and comparing the results:

$ git ls-tree master
... master ls-tree output ...
$ git ls-tree junk
... junk ls-tree output ...
$

and if you somehow ran those outputs through the diff command (or looked really carefully), you'd see that the entire difference between the master and junk branches was:

< 100644 blob ea51081812f38d5ee8dfaeaab060a9fb4a86ba67	Makefile
---
> 100644 blob 2c9fa0a25469cfb134dbaed89bc4bef5508b63ef	Makefile

All of the other entries in the ls-tree output would be identical between the two branches since there was no difference anywhere else. So if one were handed both the master and junk branches with no clue as to where they came from, it would be a simple matter to tell from the git ls-tree output that both trees were absolutely identical, except for some kind of change in that top-level Makefile. Had the change been made deeper in the tree, say somewhere under tools/, the tools entry itself would show a new checksum, and you could follow it down with further git ls-tree calls.
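
The comparison can be scripted. The sketch below rebuilds the situation in a throwaway repository (the files README, Makefile and tools/run.sh are stand-ins made up for the illustration), saves the two listings to temporary files and runs them through diff. It also asks git diff-tree, which reports the differing entries directly.

```shell
# Throwaway repository with a few stand-in files on master.
repo=$(mktemp -d)
out=$(mktemp -d)          # holds the saved ls-tree listings
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is named master
git config user.name Example
git config user.email example@example.com
echo 'NAME = Saber-toothed Squirrel' > Makefile
echo 'read me' > README
mkdir tools
echo 'echo hi' > tools/run.sh
git add .
git commit -q -m 'initial commit'

# Make the trivial change on a junk branch.
git checkout -q -b junk
echo 'NAME = rday' > Makefile
git commit -q -a -m 'trivial change'

# Compare the two top-level listings; diff exits non-zero when they differ.
git ls-tree master > "$out/master.txt"
git ls-tree junk > "$out/junk.txt"
changes=$(diff "$out/master.txt" "$out/junk.txt" || true)
echo "$changes"

# The same question, asked of Git directly.
changed_paths=$(git diff-tree --name-only master junk)
echo "$changed_paths"
```

Only the Makefile entry should show up in either output; README and the tools tree keep their checksums on both branches.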

Questions? Comments? If you followed all that, should I keep going?
