Why Git is described as a "content tracking system" -- Part 1


As the next in what might turn into a series on the Git version control system, this post explains what it means to describe Git as a "content tracking" system. The concept is sometimes a bit difficult for newcomers to wrap their heads around, so let's explain it by way of comparison, with a couple of trivial Git examples to follow in Part 2.

Most people who have used a version control system (VCS) expect it to store the entire content of a new file when that file is first added to the repository. After that, as changes to the file are made and committed, the VCS stores only the differences (deltas) between versions.

Put another way, users expect that while a file's initial commit could be quite sizable, making and committing only small changes from then on means that only the differences between successive versions get stored -- very space-efficient, as users like to think.

But this model of VCS operation has an obvious consequence. While the VCS stores the (space-efficient) deltas between file versions, getting the current version of any file requires taking the initial version of the file and applying all of the stored deltas to it.

As a concrete example, if someone added a massive 100M file to such a VCS, then made and committed 100 minuscule changes to that file, what would end up in the VCS would be some representation of that original 100M file, plus 100 very small deltas corresponding to the successive changes made to that file over time.

A VCS that behaves this way can be said to store deltas. To fetch the current version of that file, it would have to start from the file's initial version and apply each of the 100 deltas to it, in the correct order, to eventually generate the current version.

In other words, a VCS that behaved this way would be very space-efficient, but the tradeoff is that simply checking out the latest version of a file would be computationally intensive. That's just the price you pay for a VCS with this model of operation.
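To make that cost concrete, here is a minimal Python sketch of the delta model described above. The delta format (offset, number of bytes to delete, bytes to insert) and the function names are made up for illustration; no particular VCS stores its deltas this way. The point is only that checking out the latest version means replaying every delta in order.

```python
from typing import List, Tuple

# Hypothetical delta format, for illustration only:
# (offset, number_of_bytes_to_delete, bytes_to_insert)
Delta = Tuple[int, int, bytes]


def apply_delta(content: bytes, delta: Delta) -> bytes:
    """Apply a single delta to one version, producing the next version."""
    offset, delete_count, insert = delta
    return content[:offset] + insert + content[offset + delete_count:]


def checkout_latest(base: bytes, deltas: List[Delta]) -> bytes:
    """Rebuild the newest version by replaying every delta, oldest first."""
    content = base
    for delta in deltas:
        content = apply_delta(content, delta)
    return content


base = b"hello world\n"
deltas = [
    (0, 5, b"HELLO"),  # second commit: replace "hello" with "HELLO"
    (11, 0, b"!"),     # third commit: insert "!" just before the newline
]

latest = checkout_latest(base, deltas)
print(latest)
```

With two deltas the loop is trivial, but the work grows with the number of commits, which is the tradeoff described above.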

Git, however, takes the opposite approach. Rather than storing deltas, Git really does store the entire file contents for every version of that file. Using our earlier example, if we committed an initial file of size 100M, then made a tiny change and committed that, our Git repository would now contain two complete representations of that 100M file, which would be almost (but not quite) identical. (Git compresses each object it stores, but each one still represents the whole file.)
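As a rough illustration of what "storing the entire content" means, the Python sketch below computes the ID Git gives to a file's content (a "blob") and builds the corresponding compressed object. It assumes a repository using Git's default SHA-1 object format; the function names and sample data are invented for this example, and the sample file is far smaller than 100M.

```python
import hashlib
import zlib


def git_blob_id(content: bytes) -> str:
    """Return the ID Git assigns to a blob with this content (SHA-1 format)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def loose_object(content: bytes) -> bytes:
    """Return zlib-compressed bytes laid out like a loose blob object."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return zlib.compress(header + content)


# Two nearly identical versions of the same (small, made-up) file.
version_1 = b"some file content\n" * 1000
version_2 = version_1 + b"one tiny change\n"

id_1 = git_blob_id(version_1)
id_2 = git_blob_id(version_2)

# Each stored object decompresses to a header plus the *whole* file,
# not to a difference against the other version.
stored_1 = zlib.decompress(loose_object(version_1))
stored_2 = zlib.decompress(loose_object(version_2))

print(id_1 != id_2)
print(stored_1.endswith(version_1), stored_2.endswith(version_2))
```

The IDs can be checked against a real Git installation by writing the same bytes to a file and running `git hash-object` on it.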

Most users learning Git for the first time are shocked to learn that it keeps a full copy of every version, and immediately complain about how wasteful this has to be in terms of disk space. But this technique has its advantages, as you'll see in a couple of hands-on Git examples coming up in the imminent Part 2. So, as they say on "Sports Night," stick around ...
