Recently we had a request from the customer to move our project, with all its history, from subversion to git. This type of migration seems quite common and there is plenty of information on the web regarding it. However, most of this advice seems to apply to just small projects and falls over (literally) when applied to a large project with a history of thousands of code commits that has been running for more than two years. This post covers some of the challenges of migrating large projects from svn to git and hopefully provides a few pointers.
We started by studying what was out there and found much good advice. GitHub themselves give instructions for migrating from svn to git and the steps you take are roughly as follows:
- Create a clean, empty git repo
- Pull the code commits from svn and clone them to the new git repository
- Tidy up branches and tags to be in git form
- Push the new git repo to GitHub
We experimented with a small test project that had a trunk, a few tags and a couple of branches and found the steps very quick and easy.
However, the project to actually be migrated has over 11 thousand code commits and a code and assets base that is roughly 35 GB (the svn dump file of the project alone is around 7 GB).
How we went about it
As we’re on Windows we installed Git for Windows (msysGit, https://code.google.com/p/msysgit/downloads/list?q=full+installer+official+git), which comes with a Git Bash shell. We also had a Linux server to play with, on which we installed the standard git package (http://git-scm.com/download/linux).
To do the migration we were using the git-svn commands. We initially ran the command:
git svn clone --stdlayout https://hosting-svn/penrillian-project
The --stdlayout flag tells git-svn that the subversion repository uses the standard subversion layout (branches, tags, and trunk). This worked fine on our small test projects, but we soon realised that the real migration would take far longer (our svn server is hosted externally, so network latency also had to be factored in). Typically we left the command running overnight and returned in the morning to find that the job had crashed, mostly due to the following errors:
- “RA layer request failed on .. could not read response body: Secure connection truncated ..” – I have read that this error may be due to trying to clone too large a file.
- “Can’t fork at /usr/share/perl5/Git.pm line 1260” – this error appeared on Linux; it appeared that the process crashed because it ran out of resources.
- “fatal: Unable to create ‘/path/index.lock’: File exists.” – most of the time deleting the index.lock file and continuing with the migration will get around this error.
At the time our team were git beginners and one thing I wish we’d known about from the start was that when the clone failed we could have continued where we left off by running:
git svn fetch
That should work fine when stopped by the first two error scenarios above (for the third error scenario you need to delete the index.lock file and then run git svn fetch).
We wrote a script to automatically re-start the migration if it failed and eventually managed to migrate our entire repository. This took two weeks. The messy truth is that we set off three migration tasks on three different machines, one of the tasks crashed irretrievably, and from the other two we used the first one that finished.
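A restart script along those lines might look like the following. This is a minimal sketch rather than the exact script from our migration: retry_fetch and RETRY_DELAY are hypothetical names, and removing every index.lock under .git assumes that no other git process is using the repository at the time.

```shell
# Sketch of a restart loop for an interrupted git-svn import.
# retry_fetch and RETRY_DELAY are hypothetical names used for illustration.

# Re-run a command inside a repository until it succeeds or the attempt
# limit is reached.
# $1 = repository directory, $2 = maximum attempts, remaining args = command
retry_fetch() {
  local repo_dir=$1 max_attempts=$2
  shift 2
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # A crashed run can leave a stale lock file behind; clear it first.
    # Assumption: nothing else is running git in this repository.
    find "$repo_dir/.git" -name index.lock -exec rm -f {} +
    if (cd "$repo_dir" && "$@"); then
      echo "succeeded on attempt $attempt"
      return 0
    fi
    echo "attempt $attempt failed, retrying" >&2
    attempt=$((attempt + 1))
    sleep "${RETRY_DELAY:-30}"
  done
  return 1
}
```

It would be called as, for example, retry_fetch /path/to/clone 100 git svn fetch. Capping the number of attempts means a permanent failure (bad credentials, a corrupt clone) does not loop forever.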
Making a large migration faster
To speed up matters, and get around any errors that could be due to network latency, you could try getting a dump file of your entire subversion repository, loading it into a local svn repository and cloning from that. We didn’t do this as we discovered that our hosting provider uses a different version of subversion to that of msysgit (see https://github.com/msysgit/msysgit/wiki/Frequently-Asked-Questions#subversion-is-too-old-for-git-svn-to-work-properly-cant-you-guys-keep-your-software-up-to-date). As we had no control over the svn version that our hosting provider uses we didn’t go down this route. However, if you can use the same svn versions or are on Linux then this approach may work for you.
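For anyone who does try the dump-file route, the steps would be roughly as sketched below. We did not run this ourselves, so treat it as an outline under assumptions: import_from_dump, run and DRY_RUN are hypothetical names, the paths are placeholders, and the file:// URL form shown assumes an absolute path on Linux.

```shell
# Sketch: load an svn dump into a local repository and clone from that.
# import_from_dump, run and DRY_RUN are hypothetical names for illustration.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# $1 = svn dump file
# $2 = directory for the local svn repository (absolute path)
# $3 = directory for the new git repository
import_from_dump() {
  local dump_file=$1 svn_repo=$2 git_repo=$3
  # Create an empty local svn repository to receive the dump.
  run svnadmin create "$svn_repo" || return 1
  # svnadmin load reads the dump from standard input.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "svnadmin load $svn_repo < $dump_file"
  else
    svnadmin load "$svn_repo" < "$dump_file" || return 1
  fi
  # Clone from the local repository, so no network round trips are involved.
  run git svn clone --stdlayout "file://$svn_repo" "$git_repo"
}
```

Running it with DRY_RUN=1 first, for example DRY_RUN=1 import_from_dump /data/project.dump /data/svn-local /data/project-git, prints the three commands without touching anything, which is a cheap way to check the paths before committing to a long import.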
Pushing to GitHub
After your repository is cloned you should have a working source directory containing the trunk, while your branches and tags will be remote-tracking branches, i.e., references under remotes/ that point to the branch and tag sources rather than local branches. If you then push this repository to GitHub you will only push the trunk, and when your teammates clone it they will ask where all the branches have gone. Thank goodness for Stack Overflow:
Basically to push the entire repo, including branches, you need to create local branches for each remote branch that you want to keep. You also need to tidy up the tags as svn tags are cloned as git branches; so you need to convert them to git tags by checking them out as local branches and then tagging them. We scripted what was suggested by Casey in the URL above – our script is shown below:
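# Assumes the script is run from the root of the git-svn clone
currDir=$(pwd)
# Pick up any remaining svn commits and bring master up to date with trunk
git svn fetch
git merge remotes/trunk
echo "Create local branches"
while read -r branch; do
  git checkout -b "$branch" "remotes/$branch"
done < branches.txt
echo "Convert cloned tags to proper git tags - checkout the cloned tag as a local branch, tag the branch, then delete the local branch"
while read -r tag; do
  git checkout -b "tag_$tag" "remotes/tags/build-$tag"
  git checkout master
  git tag "build-$tag" "tag_$tag"
  git branch -D "tag_$tag"
done < listTags.txt
echo "Create a bare repo - tidying up branches again"
# A plain clone drops the git-svn metadata; its branches appear under origin/
git clone "$currDir" "$currDir-clone"
cp branches.txt "$currDir-clone"
cd "$currDir-clone" || exit 1
while read -r branch; do
  git checkout -b "$branch" "origin/$branch"
done < branches.txt
# Make a bare copy and push every branch and tag from inside it
git clone --bare "$currDir-clone" "$currDir-bare.git"
cd "$currDir-bare.git" || exit 1
git push --mirror https://github.com/penrillian/repo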
Once the final “git svn fetch” has brought the clone up to date, the script checks out the remote branches as local branches (the branches that we want to keep are read from a branches.txt file). The script then tidies up the tags: git-svn imports svn tags as a kind of remote branch, so the script checks out each tag (listed in a listTags.txt file) and converts it to a proper git tag.
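The two input files can be typed by hand, but they could also be derived from the output of “git branch -r”. The sketch below shows one way to do that. It makes several assumptions: list_branches and list_tags are hypothetical helper names, the remote refs have no origin/ prefix (as in the script above), tags are named build-something, and refs containing “@” (which git-svn can create for branches that were deleted and recreated) are not wanted.

```shell
# Sketch: derive branches.txt and listTags.txt from `git branch -r` output.
# list_branches and list_tags are hypothetical helper names.

# Read `git branch -r` output on stdin and print the branch names,
# leaving out trunk, tags and git-svn's name@revision refs.
list_branches() {
  sed 's/^[ *]*//' | grep -v -e '^tags/' -e '^trunk$' -e '@'
}

# Read `git branch -r` output on stdin and print the part of each tag
# name that follows "tags/build-", again skipping name@revision refs.
list_tags() {
  sed -n 's|^[ *]*tags/build-||p' | grep -v '@'
}
```

Usage would be along the lines of git branch -r | list_branches > branches.txt and git branch -r | list_tags > listTags.txt, followed by a manual review of both files to drop any branches that are not worth keeping.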
(Note: I have read here (https://help.github.com/articles/importing-from-subversion) that if you use the tool svn2git to do the migration then it may do some of this tidying up of tags for you. In the end we didn’t use svn2git because of the incompatible svn versions issue explained above. Also, using svn2git, we were unsure how to restart the migration when the job inevitably crashed. I now believe you can do “svn2git --rebase” to restart the migration (see http://makandracards.com/jan0sch/16089-mirror-svn-repositories-with-git-and-svn2git). This makes sense as “With the rebase command, you can take all the changes that were committed on one branch and replay them on another one” (from http://git-scm.com/book/en/Git-Branching-Rebasing). So I take it that this should re-apply any new commits from svn to your cloned git repository, but I can’t say this for sure as we took a different approach.)
Finally, the script also creates a bare git repository, i.e., a repository with no source working directory, to push to GitHub. A bare repository is recommended when you're sharing a repository with other developers and pushing and pulling changes (see http://gitready.com/advanced/2009/02/01/push-to-only-bare-repositories.html).
When migrating a large repository expect some frustration and expect it to take a long time. A migration of this scale also involves looking at your existing workflow to see if it can be improved by using git, getting the team up to speed with git, getting the right developer tools, and so on. In this post I haven’t concentrated on these important aspects, just on the technicalities of doing the migration itself.