CS 372 Spring 2016
Notes for January 20, 2016

Version Control: Background

Introduction

Version control, also called revision control (or sometimes source control), refers to frameworks for keeping track of the history of a project: what changes were made to files, when, who made them, and why. Version control is primarily associated with software development, but it is useful in any kind of project that primarily involves the creation and maintenance of computer files.

Benefits of version control include the following.

Software that allows for modification of files might include built-in version control, or version control might be a separate package.

For example, Mediawiki, the software behind Wikipedia, Wiktionary, and other wiki-style websites, has built-in version control. When viewing any Wikipedia page, click on the “View history” tab to see a list of all revisions, each with timestamp, user who made the revision, and a comment. Revisions can be viewed and compared. Wikipedia editors can fix some problems quickly by reverting a page to an earlier revision.

In this class, we are primarily interested in version control implemented as a separate package: a version-control system. A directory that is managed by such a package is a repository. A user who has changed files enters these changes into the repository history through a commit (or check in). Version-control systems are generally based on one of two models: centralized or distributed.

We will be using Git, a distributed version-control system.

Centralized Version Control

A centralized version-control system maintains a single repository. In order to change a file, a user first checks out the file—typically this temporarily marks the file as reserved for that user. After making changes, the user can check in the file.

The first widely used centralized version-control package was apparently the Revision Control System (RCS), released in 1982. This only allowed for local repositories; all work was done by users of a single machine.

In 1986, the Concurrent Versions System (CVS) was released. This allowed for remote repositories, accessed via a network. CVS became very popular, particularly for open-source projects.

In 2000, Subversion (SVN) was released by CollabNet, Inc.; it later became a project of the Apache Software Foundation. Subversion is the most popular of the successors to CVS. It is based on much the same model, but it includes a number of additional features, like support for binary files and links/aliases. It also includes strong safety features, guaranteeing that repository operations are atomic, that is, that each repository operation either completes fully or has no effect at all. Subversion continues to be actively maintained; it remains in common use today.

Distributed Version Control

In a distributed version-control system (DVCS), a repository can be cloned: copied in full. A clone of a repository is not in any way a second-class repository; it is just as good as the original. We typically refer to a clone as being downstream from the original; the original is upstream from the clone.

When we use a DVCS, we typically do all work on a local clone, perhaps sending changes to an upstream repository when we are satisfied with them. To update a repository with changes from upstream, we pull. To send changes to an upstream repository, we push.

We may also use cloning to begin a new project based on the work already done in some other project. Such a new project is a fork of the original. Often the changes made to a fork are never intended to be included in the original project.

Most DVCS packages support the idea of branching. A branch is a usually temporary fork of a project, with the files stored in the same repository, but in a different portion of the repository history. To update the master repository files with changes made in a branch, we do a merge operation.

A number of DVCS packages were released in the 1990s. One that became widely used was Bitkeeper, from Bitmover, Inc.

Bitkeeper was released under a proprietary-software license, which made its use in free/open-source projects problematic. In response to problems like this, the GNU Project released a DVCS called GNU arch in 2001. GNU arch was short-lived—there have been no releases since 2006, and it is now deprecated—but it seems to have initiated a wave of high-quality free/open-source DVCS packages, the first of which was Darcs, released in 2002.

2005 saw the initial release of three major DVCS packages: Bazaar (bzr), Git, and Mercurial (hg, from “Hg”, the chemical symbol for mercury).

Today, Git is clearly the most heavily used DVCS. Most of the others seem to be fading. However, I expect Mercurial, at least, to be actively maintained for some time to come, because it is used by a number of major software projects, including CPython (the reference implementation of the Python programming language) and the Mozilla Firefox web browser. Facebook, Inc. apparently also uses Mercurial in their internal development.

Version Control: Git

Introduction

In this class, we will do our version control with Git. As noted above, Git is a popular DVCS, initially released in 2005. Git was developed for use by the Linux kernel project, and it is still used by that project today. Git is available under a free/open-source license, specifically, the GNU GPL, version 2.

Because of its roots, Git was initially supported only on Unix-like systems. Its availability on Windows was poor for a number of years. However, today Git is well supported on all major operating systems.

Git can be used from the command line, or via various graphical front-ends. We use Git by giving it commands. I will give Git commands in command-line form, beginning with a dollar sign ($), representing the command-line prompt. Graphical front-ends will, of course, have different interfaces, but the commands should be the same.

Repository Creation & Cloning

When we first begin to work with Git, we need to configure it properly on the local machine. Two of the most important configuration options involve your identity as a Git user: name and email. Each commit you make is marked with this information, so be sure to set it up before you do anything else with Git. I cannot give a single recipe for this, as different Git front-ends will have their own interfaces. But any documentation or tutorial should cover this very early.
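For the command-line version of Git specifically, identity is normally set with the config command. The following is a minimal sketch, not the only way to do it. The name and email are placeholders, and the first two lines are an assumption added here so that trying the example does not overwrite real settings: they point Git at a scratch home directory. Omit them to configure Git for real.

```shell
# Demo only: use a scratch home directory so real settings stay untouched.
# Omit these two lines to configure your actual account.
export HOME="$(mktemp -d)"
unset XDG_CONFIG_HOME

# Placeholder identity: substitute your own name and email address.
git config --global user.name "Pat Example"
git config --global user.email "pat@example.com"

# Read the values back to confirm that they were stored.
git config --global user.name
git config --global user.email
```

The --global option stores the values once per user account, so they apply to every repository that user works with.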

We begin actually using Git by creating a repository. Typically, we will want to deal with a remote repository. If one does not yet exist, then it is created using whatever interface the site uses. (Use of the GitHub site was demonstrated in class.) A GitHub repository is not really fully functional until there is something in it, so if a site allows you to initialize a repository with a file—perhaps a README—then do so.

If you are using Git with an existing project that already has a remote repository, then there is generally no need to create another remote repository.

If, on the other hand, you want to use Git entirely locally—with no remote repository involved—then you can create a new local repository using the Git init command. Create a directory, go into it, and do an init.

$ git init
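The three steps just described (create a directory, go into it, do an init) might look like the following sketch. The directory name myproject is hypothetical, and a temporary directory stands in for wherever you keep your projects.

```shell
# Scratch location for the demo; any directory you can write to would do.
work="$(mktemp -d)"

# Create a directory, go into it, and initialize a repository there.
mkdir "$work/myproject"
cd "$work/myproject"
git init

# Git keeps its bookkeeping in a hidden .git subdirectory.
ls -d .git
```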

Once a master repository is created, we begin working with it by making a clone of it, using the clone command. A remote site that hosts Git repositories will give you a URL to use when cloning.

$ git clone REMOTE_URL

Above, REMOTE_URL should be replaced by the URL given by the site. This creates a local clone in an appropriately named directory. Git should tell you what the name of the directory is.

To clone a local repository, go outside the repository directory, and do a clone command with two parameters: the path of the original repository, and the path where you want the clone to be.

$ git clone ORIGINAL_PATH CLONE_PATH
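Here is a sketch of a local clone from start to finish. Everything named here is hypothetical: the directories original and copy, the README file, and the identity, which is supplied through Git's standard environment variables only so that the demo commit works on a machine that has not been configured yet.

```shell
# Hypothetical identity so the demo commit works on an unconfigured machine.
export GIT_AUTHOR_NAME="Pat Example" GIT_AUTHOR_EMAIL="pat@example.com"
export GIT_COMMITTER_NAME="Pat Example" GIT_COMMITTER_EMAIL="pat@example.com"

work="$(mktemp -d)"

# Build a small original repository containing one committed file.
git init -q "$work/original"
cd "$work/original"
echo "hello" > README
git add README
git commit -q -m "Add README"

# Go outside the repository directory, then clone:
# first the path of the original, then the path for the clone.
cd "$work"
git clone original copy

# The clone has the same files and the same history as the original.
ls copy
git -C copy log --oneline
```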

Once a clone exists, we generally work entirely with the clone, handling the upstream repository via pulls and pushes from the clone.

A very useful command is status.

$ git status

This gives lots of useful information: the current checked-out branch, which files have been changed but not staged, what changes have been staged, whether we have done any local commits since the last push, and how to undo recent commands.

Another useful command is log. This gives a history of the current branch, showing commits from newest to oldest.

$ git log

Note: The log command only works on a nonempty repository. If you have an empty repository, you can make it nonempty by committing to it (see Staging & Committing, below).
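To see status and log in action, here is a small sketch on a scratch repository. The file name and commit messages are made up, and the identity is a placeholder set through environment variables. The --oneline option, which is not covered above, is a standard log option that prints one line per commit.

```shell
# Hypothetical identity so the demo commits work on an unconfigured machine.
export GIT_AUTHOR_NAME="Pat Example" GIT_AUTHOR_EMAIL="pat@example.com"
export GIT_COMMITTER_NAME="Pat Example" GIT_COMMITTER_EMAIL="pat@example.com"

work="$(mktemp -d)"
git init -q "$work/demo"
cd "$work/demo"

# In a brand-new repository, status works even though there is no history.
git status

# Make the repository nonempty with two commits.
echo "first" > notes.txt
git add notes.txt
git commit -q -m "Add notes"
echo "second" >> notes.txt
git add notes.txt
git commit -q -m "Extend notes"

# Full history of the current branch, newest commit first.
git log

# The same history, one line per commit.
git log --oneline
```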

Staging & Committing

To store changes in a repository, we edit files in the repository directory as usual. A status command will tell us what files we have changed, if we forget. To mark a file for committing, stage the file, using the add command.

$ git add FILENAME

This stages the current version of the file for committing. Any changes made to the file later are not staged and will not be part of a commit, unless another add command is done.

If multiple files are changed, then we can stage them all by giving multiple filenames to a single add command, or we can do multiple add commands.
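The following sketch illustrates both points: staging several files with one add command, and the fact that edits made after staging are not themselves staged. The file names are hypothetical. The form "git show :FILENAME" prints the staged version of a file.

```shell
work="$(mktemp -d)"
git init -q "$work/demo"
cd "$work/demo"

echo "draft one" > report.txt
echo "todo" > tasks.txt

# Stage both files with a single add command.
git add report.txt tasks.txt

# Change report.txt again, after it was staged.
echo "draft two" >> report.txt

# The staged version of report.txt still holds only the first line.
git show :report.txt

# The later edit shows up as an unstaged change.
git diff -- report.txt
git status --short

# A second add would be needed to stage the new line as well.
```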

Once the desired changes are made, and the files staged, we commit all staged changes with a single commit command.

$ git commit

The commit command will generally start some kind of editor that will allow us to enter a commit message: a comment indicating what was done, and why. Note that there is no need to sign a commit with your name or the date, as every commit is automatically marked with the current user identity and a timestamp.
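When an editor is inconvenient, for instance in a script, the standard -m option supplies the commit message on the command line instead. The sketch below uses it, then asks log to print the identity and timestamp that Git attached automatically. The file name, message, and identity are placeholders.

```shell
# Hypothetical identity so the demo commit works on an unconfigured machine.
export GIT_AUTHOR_NAME="Pat Example" GIT_AUTHOR_EMAIL="pat@example.com"
export GIT_COMMITTER_NAME="Pat Example" GIT_COMMITTER_EMAIL="pat@example.com"

work="$(mktemp -d)"
git init -q "$work/demo"
cd "$work/demo"

echo "v1" > data.txt
git add data.txt

# -m gives the commit message directly, so no editor is started.
git commit -m "Add initial data file"

# Show the author, date, and message recorded for the latest commit.
git log -1 --format="%an <%ae> | %ad | %s"
```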

A single commit should always contain related changes that can be described simply. It is not a problem if a single change requires the modification of many files. A single commit should be simple to think about, even if the changes required were not simple to do.

Sometimes we edit a file, and then decide not to commit the changes. Or we stage a file, and then decide not to commit it. In such cases, the status command will tell us what to do.
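As a sketch of what status typically suggests: the two commands below are the long-standing forms for discarding an unstaged edit and for unstaging a file. Newer Git versions suggest the restore command instead, so treat this as one possibility and follow what your own status output says. The file name and identity are placeholders.

```shell
# Hypothetical identity so the demo commit works on an unconfigured machine.
export GIT_AUTHOR_NAME="Pat Example" GIT_AUTHOR_EMAIL="pat@example.com"
export GIT_COMMITTER_NAME="Pat Example" GIT_COMMITTER_EMAIL="pat@example.com"

work="$(mktemp -d)"
git init -q "$work/demo"
cd "$work/demo"
echo "original" > config.txt
git add config.txt
git commit -q -m "Add config"

# Case 1: we edit a file, then decide to throw the edit away.
echo "oops" >> config.txt
git checkout -- config.txt
cat config.txt

# Case 2: we stage a file, then decide not to commit it.
# The edit stays in the file on disk; it is just no longer staged.
echo "maybe" >> config.txt
git add config.txt
git reset -q HEAD config.txt
git status --short
```

Note that the first command permanently discards the edit, so it deserves some care.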

If we commit with an incorrect commit message, then we can fix that as well, by amending the commit. This is done by passing an option to the commit command.

$ git commit --amend
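A short sketch of amending, with a deliberately misspelled first message. The -m option is used here only to keep the example free of an editor; the file name, messages, and identity are placeholders.

```shell
# Hypothetical identity so the demo commit works on an unconfigured machine.
export GIT_AUTHOR_NAME="Pat Example" GIT_AUTHOR_EMAIL="pat@example.com"
export GIT_COMMITTER_NAME="Pat Example" GIT_COMMITTER_EMAIL="pat@example.com"

work="$(mktemp -d)"
git init -q "$work/demo"
cd "$work/demo"

echo "x" > file.txt
git add file.txt
git commit -q -m "Add flie"

# Replace the message of the most recent commit.
git commit --amend -q -m "Add file"

# There is still only one commit, now with the corrected message.
git log --oneline
```

Amending replaces the most recent commit with a new one, so it is best done before that commit has been pushed.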

However, we almost never undo a commit, even if it was done incorrectly. Rather, we edit the files so that they are the way we want them to be, and then we stage and commit the updated files.

Pulling & Pushing

As we have said, when we work on a project stored in a Git repository, we generally work only on a local clone of the repository. If no local clone has been created yet, then do a clone, as above.

If we already have a clone, we generally do not need to create another. However, it is possible that changes have been made in the upstream repository since our last work session. Therefore, we generally begin any new work session by bringing in changes from the upstream repository, using a pull command.

$ git pull

Changes we have committed to our local repository can be sent upstream using a push command.

$ git push

Note: A pull or push command may result in a conflict, if changes are sent to a repository that has had other, incompatible changes committed to it. We will discuss conflicts later (see Resolving Conflicts, below).

Just as we typically begin a work session with a pull, we typically end it with a push. There is generally nothing wrong with an occasional pull or push in the middle of a work session, but these are usually not required.
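The rhythm of pull, work, push can be tried without any hosting site. In the sketch below, a local bare repository named upstream.git stands in for the remote, and two clones named laptop and desktop stand in for two machines. All of these names, the files, and the identity are hypothetical.

```shell
# Hypothetical identity so the demo commits work on an unconfigured machine.
export GIT_AUTHOR_NAME="Pat Example" GIT_AUTHOR_EMAIL="pat@example.com"
export GIT_COMMITTER_NAME="Pat Example" GIT_COMMITTER_EMAIL="pat@example.com"

work="$(mktemp -d)"
cd "$work"

# Stand-in for a hosting site: a bare upstream repository with one commit.
git init -q seed
cd seed
echo "project" > README
git add README
git commit -q -m "Initial commit"
cd "$work"
git clone -q --bare seed upstream.git

# Two clones, as if on two different machines.
git clone -q upstream.git laptop
git clone -q upstream.git desktop

# Work session on the laptop: begin with a pull, end with a push.
cd "$work/laptop"
git pull
echo "new idea" > idea.txt
git add idea.txt
git commit -q -m "Add idea"
git push

# A later session on the desktop begins with a pull,
# which brings in the commit made on the laptop.
cd "$work/desktop"
git pull
ls
```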

Git is actually capable of pulling from and pushing to any other repository. The default for the pull and push commands is the origin repository; by default this is the upstream repository that the current repository was cloned from.

The origin is an example of a Git remote: a named reference to some other repository. List the names of all remotes using the remote command. Add “-v” to see their values as well.

$ git remote -v
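As a sketch of working with more than one remote: below, a clone starts with the usual origin remote, then gains a second remote. The remote name backup, the paths, and the identity are all hypothetical. Giving push a remote name and HEAD sends the current branch to that remote rather than to the origin.

```shell
# Hypothetical identity so the demo commit works on an unconfigured machine.
export GIT_AUTHOR_NAME="Pat Example" GIT_AUTHOR_EMAIL="pat@example.com"
export GIT_COMMITTER_NAME="Pat Example" GIT_COMMITTER_EMAIL="pat@example.com"

work="$(mktemp -d)"
cd "$work"

# An original repository with one commit, and a clone of it.
git init -q original
cd original
echo "hello" > README
git add README
git commit -q -m "Add README"
cd "$work"
git clone -q original copy
cd copy

# A fresh clone has a single remote, named origin.
git remote -v

# Add a second named remote that refers to another (empty, bare) repository.
git init -q --bare "$work/backup.git"
git remote add backup "$work/backup.git"
git remote -v

# Push the current branch to the new remote instead of the origin.
git push -q backup HEAD
```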

Branches & Merging

A Git branch is a named pointer to a particular commit. There is always one branch that is checked out; initially this is the master branch. When a branch is checked out, its pointer will move to each new commit we make, while the pointers for other branches will remain fixed.

List all currently defined branches using the branch command with no parameters.

$ git branch

The currently checked out branch should be noted in the output (in my version it is marked with an asterisk).

Create a new branch by doing a branch command with a parameter: the name of the new branch.

$ git branch BRANCHNAME

Typically we create a branch when we want to do some experimental changes that we are not yet ready to commit to the master branch. The new branch should be named accordingly.

Again, a branch is simply a named pointer. Creating a new branch makes a new pointer, with the given name, pointing at the current commit.

To check out a branch, use the checkout command, with the name of the branch (which should already exist) as a parameter.

$ git checkout BRANCHNAME

Having created and checked out a new branch, we can make commits without affecting the master branch. If we switch back to the master (by checking it out), then commits to the new branch will not show up in the log.
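The pointer behavior can be observed directly with rev-parse, a standard command that prints the commit a name points at. This is a sketch: the branch name experiment, the file, and the identity are hypothetical, and the symbolic-ref line is there only to make sure the first branch is called master as in these notes, since some installations are configured with a different default name.

```shell
# Hypothetical identity so the demo commits work on an unconfigured machine.
export GIT_AUTHOR_NAME="Pat Example" GIT_AUTHOR_EMAIL="pat@example.com"
export GIT_COMMITTER_NAME="Pat Example" GIT_COMMITTER_EMAIL="pat@example.com"

work="$(mktemp -d)"
git init -q "$work/demo"
cd "$work/demo"
# Make sure the initial branch is named master, as in these notes.
git symbolic-ref HEAD refs/heads/master

echo "one" > file.txt
git add file.txt
git commit -q -m "First"

# A new branch is a new pointer to the current commit:
# both names print the same commit ID.
git branch experiment
git rev-parse master experiment

# Commit while experiment is checked out.
git checkout -q experiment
echo "two" >> file.txt
git add file.txt
git commit -q -m "Second"

# Now the experiment pointer has moved, and the master pointer has not.
git rev-parse master experiment

# Back on master, the log does not show the new commit.
git checkout -q master
git log --oneline
git branch
```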

If we decide that the changes made in the new branch should not be part of the master branch, then we can simply abandon the new branch. If we do want these changes in the master branch, then we merge.

Merging our working branch into the master branch is done with a sequence of two commands: check out the master branch, then merge the working branch.

$ git checkout master
$ git merge BRANCHNAME

We can actually merge any branch into any other; substitute the name of another branch for “master”, above.
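Here is a sketch of the simplest kind of merge. The branch name feature, the file, and the identity are hypothetical. In this example the master branch gets no commits of its own while the working branch is in use, so Git can perform the merge just by moving the master pointer forward.

```shell
# Hypothetical identity so the demo commits work on an unconfigured machine.
export GIT_AUTHOR_NAME="Pat Example" GIT_AUTHOR_EMAIL="pat@example.com"
export GIT_COMMITTER_NAME="Pat Example" GIT_COMMITTER_EMAIL="pat@example.com"

work="$(mktemp -d)"
git init -q "$work/demo"
cd "$work/demo"
# Make sure the initial branch is named master, as in these notes.
git symbolic-ref HEAD refs/heads/master

echo "base" > main.txt
git add main.txt
git commit -q -m "Base"

# Do some work on a separate branch.
git branch feature
git checkout -q feature
echo "feature work" > feature.txt
git add feature.txt
git commit -q -m "Add feature"

# Merge it: check out master, then merge the working branch.
git checkout -q master
git merge feature

# The master branch now includes the commit made on the working branch.
git log --oneline
```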

If the merge is successful, then all is well. However, merging can create a conflict, if our new changes are incompatible with the master branch. We discuss how to deal with this situation next.

Resolving Conflicts

Sometimes a merge command may fail. Always read the output of a merge command, to determine whether or not it succeeded. The usual cause for failure is a conflict.

A conflict happens when we attempt to make changes to a repository in a way that is incompatible with the changes already in the repository. For example, suppose we create two working branches. In each working branch, we change the same line of the same file, but we make different changes. Merging the first working branch into the master branch will probably work fine. Merging the second will probably fail, since we are attempting to make a change in the master branch that has already been made in a different way.

When there is a conflict in our local repository, an operation has been stopped in the middle, before it is complete. We need to resolve the conflict before we do anything else.

The output of the merge command will indicate what files have conflicts. In the files, conflicts will be marked in something like the following manner.

<<<<<<< HEAD
lines from one
branch go here
=======
conflicting lines from the
other branch go here
>>>>>>> BRANCHNAME

Above, BRANCHNAME will be replaced by the source of the conflicting lines. And the lines themselves will be those from the two conflicting versions of the file.

We resolve a conflict in the local repository by editing all relevant files, staging all of them, and then doing a commit. Git will write a commit message for us; we may, if we wish, edit it to add more information.
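The two-branch scenario described earlier, and its resolution, can be reproduced with the following sketch. The branch names, file, contents, and identity are hypothetical. Overwriting the file with echo stands in for editing it by hand, and the standard --no-edit option accepts the commit message Git wrote, so that no editor is started.

```shell
# Hypothetical identity so the demo commits work on an unconfigured machine.
export GIT_AUTHOR_NAME="Pat Example" GIT_AUTHOR_EMAIL="pat@example.com"
export GIT_COMMITTER_NAME="Pat Example" GIT_COMMITTER_EMAIL="pat@example.com"

work="$(mktemp -d)"
git init -q "$work/demo"
cd "$work/demo"
# Make sure the initial branch is named master, as in these notes.
git symbolic-ref HEAD refs/heads/master

echo "color = red" > settings.txt
git add settings.txt
git commit -q -m "Base"

# Two working branches change the same line in different ways.
git branch blue
git branch green
git checkout -q blue
echo "color = blue" > settings.txt
git add settings.txt
git commit -q -m "Use blue"
git checkout -q green
echo "color = green" > settings.txt
git add settings.txt
git commit -q -m "Use green"

# Merging the first branch works; merging the second stops on a conflict.
git checkout -q master
git merge -q blue
if git merge green; then
    merge_result="clean"
else
    merge_result="conflict"
fi
echo "second merge: $merge_result"

# The file now contains conflict markers around both versions of the line.
cat settings.txt

# Resolve: edit the file, stage it, and commit.
echo "color = teal" > settings.txt
git add settings.txt
git commit -q --no-edit
git log --oneline
```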

A pull command can also result in a conflict, if changes made to the upstream repository are incompatible with changes in our downstream repository that are not yet pushed. We can prevent most such conflicts by pulling at the beginning of each work session. However, if such a conflict does occur, then the various files are marked as above, and we resolve the conflict in exactly the same way: edit the relevant files, stage them, commit.

A third cause of conflict is a push command. Such conflicts happen when changes have been made to the upstream repository between our most recent pull and the push, and we are attempting to push incompatible changes. This is a different kind of conflict, as it occurs in the upstream repository (which is often a remote repository) and not in our clone.

The solution to such remote conflicts is to do a pull from the upstream repository. This turns the remote conflict into a local conflict, and we resolve it exactly as above.
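The whole sequence (refused push, pull, local resolution, second push) is sketched below with a local bare repository standing in for the remote. The clone names alice and bob, the file, and the identity are hypothetical. One assumption to flag: the pull is given the standard --no-rebase option, because recent Git versions stop and ask how to combine the two histories unless told; --no-rebase requests an ordinary merge.

```shell
# Hypothetical identity so the demo commits work on an unconfigured machine.
export GIT_AUTHOR_NAME="Pat Example" GIT_AUTHOR_EMAIL="pat@example.com"
export GIT_COMMITTER_NAME="Pat Example" GIT_COMMITTER_EMAIL="pat@example.com"

work="$(mktemp -d)"
cd "$work"

# A bare upstream repository with one commit, and two clones of it.
git init -q seed
cd seed
echo "color = red" > settings.txt
git add settings.txt
git commit -q -m "Base"
cd "$work"
git clone -q --bare seed upstream.git
git clone -q upstream.git alice
git clone -q upstream.git bob

# In the first clone: change the line and push.
cd "$work/alice"
echo "color = blue" > settings.txt
git add settings.txt
git commit -q -m "Use blue"
git push -q

# In the second clone, without pulling first: change the same line, then push.
cd "$work/bob"
echo "color = green" > settings.txt
git add settings.txt
git commit -q -m "Use green"
if git push; then
    push_result="accepted"
else
    push_result="refused"
fi
echo "first push: $push_result"

# Pull, which turns the problem into an ordinary local conflict.
git pull --no-rebase || echo "pull stopped on a conflict"

# Resolve as usual (edit, stage, commit), then push again.
echo "color = teal" > settings.txt
git add settings.txt
git commit -q --no-edit
git push -q
```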