Archive for the ‘software development’ Category

Migrating from subversion to git (a git of a migration with a really big repo)

February 11, 2014

The Task

Recently we had a request from the customer to move our project, with all its history, from subversion to git. This type of migration seems quite common and there is plenty of information on the web regarding it. However, most of this advice seems to apply to just small projects and falls over (literally) when applied to a large project with a history of thousands of code commits that has been running for more than two years. This post covers some of the challenges of migrating large projects from svn to git and hopefully provides a few pointers.

We started by studying what was out there and found much good advice. GitHub themselves give instructions for migrating from svn to git and the steps you take are roughly as follows:

  • Create a clean, empty git repo
  • Pull the code commits from svn and clone them to the new git repository
  • Tidy up branches and tags to be in git form
  • Push the new git repo to GitHub

We experimented with a small test project that had a trunk, a few tags and a couple of branches and found the steps very quick and easy.

However, the project to actually be migrated has over 11 thousand code commits and a code and assets base that is roughly 35 GB (the svn dump file of the project alone is around 7 GB).

How we went about it

As we’re on Windows we installed Git for Windows (msysGit), which comes with a Git Bash shell. We also had a Linux server to play with, on which we installed the standard git package.

To do the migration we were using the git-svn commands. We initially ran the command:

git svn clone --stdlayout https://hosting-svn/penrillian-project

The --stdlayout flag tells git-svn that the subversion repository uses the standard subversion layout (branches, tags, and trunk). This worked fine on our small test projects but we soon realised that the real migration would take far longer (our svn server is hosted externally, so network latency also had to be factored in). Usually, after leaving the command running all night, we returned in the morning to find that the job had crashed, mostly due to the following errors:

  • “RA layer request failed on .. could not read response body: Secure connection truncated ..” – I have read that this error may be due to trying to clone too large a file.
  • “Can’t fork at /usr/share/perl5/ line 1260” – this error appeared on Linux; it appeared that the process crashed because it ran out of resources.
  • “fatal: Unable to create ‘/path/index.lock’: File exists.” – most of the time deleting the index.lock file and continuing with the migration will get around this error.

At the time our team were git beginners and one thing I wish we’d known about from the start was that when the clone failed we could have continued where we left off by running:

git svn fetch

That should work fine when stopped by the first two error scenarios above (for the third error scenario you need to delete the index.lock file and then run git svn fetch).

We wrote a script to automatically re-start the migration if it failed and eventually managed to migrate our entire repository. This took two weeks. The messy truth is that we set off three migration tasks on three different machines, one of the tasks crashed irretrievably, and from the other two we used the first one that finished.
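
A restart wrapper of this kind can be sketched as follows. This is a minimal illustration in Python, not the script we actually ran: the function name fetch_until_done, the retry limit and the pause between attempts are all assumptions made for the example. The only parts taken from the steps above are the `git svn fetch` command and the removal of a stale index.lock (the third error scenario) before each retry.

```python
import os
import subprocess
import time


def fetch_until_done(repo_dir, max_attempts=500, pause_seconds=30,
                     run=subprocess.call):
    """Re-run `git svn fetch` in repo_dir until it exits with status 0.

    Returns the number of attempts that were needed, or raises
    RuntimeError once max_attempts is exhausted. `run` can be replaced
    with a stand-in so the loop can be tried without a real repository.
    """
    lock_file = os.path.join(repo_dir, ".git", "index.lock")
    for attempt in range(1, max_attempts + 1):
        # A crashed run can leave a stale lock behind; clear it first.
        if os.path.exists(lock_file):
            os.remove(lock_file)
        if run(["git", "svn", "fetch"], cwd=repo_dir) == 0:
            return attempt
        # Give the network (or the svn server) a moment before retrying.
        time.sleep(pause_seconds)
    raise RuntimeError(
        "git svn fetch still failing after %d attempts" % max_attempts)
```

Calling fetch_until_done("/path/to/clone") from the machine doing the migration would keep resuming the fetch until it completes. Deleting the lock file is only safe when no other git process is working in the same clone.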

Making a large migration faster

To speed up matters, and get around any errors that could be due to network latency, you could try getting a dump file of your entire subversion repository and cloning that. We didn’t go down this route: we discovered that our hosting provider uses a different version of subversion from the one that comes with msysGit, and we had no control over the version the provider uses. However, if you can use the same svn versions or are on Linux then this approach may work for you.

Pushing to GitHub

After your repository is cloned you should have a working source directory containing the trunk, while your branches and tags will be git remotes, i.e., pointers to the branch and tag sources. If you then push this repository to GitHub you will only push the trunk, and when your teammates clone it they will ask where all the branches have gone. Thank goodness for Stack Overflow.

Basically, to push the entire repo, including branches, you need to create local branches for each remote branch that you want to keep. You also need to tidy up the tags, as svn tags are cloned as git branches; you convert them to git tags by checking them out as local branches and then tagging them. We scripted what was suggested by Casey on Stack Overflow; our script is shown below:

git svn fetch
git merge remotes/trunk

echo "Create local branches"
while read branch; do
git checkout -b $branch $branch
done < branches.txt

echo "Convert cloned tags to proper git tags - checkout the cloned tag as a local branch, tag the branch, then delete the local branch"

while read tag; do
git checkout -b tag_$tag remotes/tags/build-$tag
git checkout master
git tag build-$tag tag_$tag
git branch -D tag_$tag
done < listTags.txt

echo "Create a bare repo - tidying up branches again"

git clone $currDir $currDir-clone
cp branches.txt $currDir-clone
cd $currDir-clone

while read branch; do
git checkout -b $branch origin/$branch
done < branches.txt

cd ..
mkdir repository-clone-bare
cd repository-clone-bare
git clone --bare $currDir-clone
cd repository-clone.git
git push --mirror

After finishing the migration with “git svn fetch”, the script checks out the remote branches as local branches (branches that we want to keep are read from a branches.txt file). The script then tidies up the tags: git-svn clones tags as a kind of remote branch, so the script checks out the tags (read from a listTags.txt file) and converts them to proper git tags.

(Note: I have read that if you use the tool svn2git to do the migration then it may do some of this tidying up of tags for you. In the end we didn’t use svn2git because of the incompatible svn versions issue explained above. Also, using svn2git, we were unsure how to restart the migration when the job inevitably crashed. I now believe you can do “svn2git --rebase” to restart the migration. This makes sense as “With the rebase command, you can take all the changes that were committed on one branch and replay them on another one”. So, I take it that this should re-apply any new commits from svn to your cloned git repository, but I can’t say this for sure as we took a different approach.)

Finally, the script also creates a bare git repository, i.e., a repository with no working directory, to push to GitHub. A bare repository is recommended when you're sharing a repository with other developers and pushing and pulling changes.


When migrating a large repository expect some frustration and expect it to take a long time. A migration of this scale also involves looking at your existing workflow to see if it can be improved by using git, getting the team up to speed with git, getting the right developer tools and so on. In this post I haven’t concentrated on these important aspects but just on the technicalities of doing the migration itself.


Pair Programming Re-visited

November 3, 2012

Our technical lead (a wise colleague) has put his foot down and said we’re all to pair. We’ve got out of the habit recently; I don’t know why. He’s laid down the rules. We’re to pair every day and we decide who we’re pairing with at the morning standup. There should be one person who knows the task to do and one person who doesn’t know that task so well. At the end of the day’s pairing, both people should know the task well. The person at the keyboard is the code monkey who just types and learns; the person ‘navigating’ is the one who thinks ahead and who is thinking about the design. The one who is ahead in understanding should wait for the other to catch up – it is of the utmost importance that a shared understanding is arrived at, otherwise it’s a waste of time. The person who is ‘catching up’ should question all the time to ensure their own understanding. The next day you rotate pairs. Ideally, if you haven’t finished the task you were pairing on, then the person who started off knowing the work well should move on and the person left should continue the work with another partner.

We’ve been going through this procedure for 2 weeks now and I can see the benefit. Knowledge is being spread and as we rotate pairs every day I don’t feel I can get on people’s nerves too much. I also feel that it’s good to get comfortable working closely with all members of the team and to get to know their style and strengths and weaknesses – it makes you appreciate your colleagues more.  


Pair Programming

October 30, 2011

I’ve always maintained that I don’t like pair programming. My reasons are self-consciousness, resentment of partners who hog the keyboard, disagreements, tension etc. I have since moved to a company, and a team, where pairing is strongly encouraged, if not mandatory in some cases. What follows is how I cope with this uncomfortable (for me!) situation.

Regarding the feeling of resentment at someone hogging the keyboard – well if you feel that way just ask to have a go! This is often not easy if you feel that your partner has more experience of the system, more confidence in knowing what to do, or a greater understanding than you. In this situation you could suggest “ping pong” pairing where you take turns to write a failing test and the other makes it pass.

Some recent advice came from a wise colleague, who told me first of all that he didn’t care if he never touched the keyboard or mouse as long as he had an input into the design. To point out a missing semi-colon or something else the compiler will pick up is trivial – what is important is to listen, to contribute your own ideas and to come to a shared understanding of the design. So, according to my wise colleague, being the “navigator” is by no means subservient to being the “driver” but is a position of great responsibility. If you don’t understand what your partner is doing, and you don’t have any input into the design, then you are wasting your time and you are wasting the company’s time. “You have got to be strong”, he told me: if the other person is going ahead without you then you must always ask questions and make sure that you come to that shared understanding, otherwise there is no point. And being shy is no excuse!

I found all that a bit of an alleluia moment (and sadly I feel I should have realised this ages ago).

It is, of course, easier to pair with someone who has a similar style to yourself. For example, if you like to take time to understand things then to pair with someone quicker and sharper than you can be stressful. On the other hand to pair with some hot shot young coder who is very sharp, has a good memory, and uses different design patterns from the same tired old way you’ve been doing things for years may be a good thing. By taking you out of your comfort zone it could force you to raise your game a bit. That’s another way of looking at it. When faced with this situation you could try making some suggestions to improve code quality and if your partner is genuinely delighted with an idea that improves the code then this is a very good sign. In spite of their scary self possession it shows a commendable lack of ego, a willingness to learn, and a desire just to write good code and get things working in a nice way.

If you do make suggestions and your partner gets defensive and wants to do things their way then why not look at the way you are suggesting things – does it display a lack of tactfulness or skill on your part? Perhaps you haven’t thought the suggestion through enough to be able to explain it properly. Or maybe your partner has a point. Try not to get defensive, especially when pairing with someone you’re not used to. Remember you’ve got to take time to get to know people.

Regarding shyness and self-consciousness ask yourself could this be a form of “ego” on your part. Most people pairing with you aren’t going to test you or put you under the microscope, they’re more concerned with getting the job done. “People have better things to do than look at you you know” (as my Mum used to tell my younger self when I complained about feeling shy or self conscious).

So far these are the ways I’ve coped with pairing and I’ve got to the stage where I quite enjoy some pairing sessions. Also when I think “I don’t like pairing” I try and ask myself why, and is it me that’s the problem? Often it is.

Testing Asynchronous Code

January 16, 2011

A while back I had a project which involved the fitting of certain processes with their own lifecycle, i.e., being able to instruct a process to stop then have the process finish off whatever tasks it was running and set its status to stopped so that the machine could be shutdown. As well as the thread for the process, a task coordinator runs in a separate thread. This task coordinator listens for instructions and creates the tasks. Each task also runs in its own separate thread.

If the stop instruction is given to the process then:

  1. Process lifecycle is set to stopping status
  2. No new tasks are allowed to start
  3. Tasks already running are allowed to run to completion
  4. Last task finishes
  5. Process lifecycle is set to stopped status

The threads for an API test would be:

  1. Test thread which gives the Process lifecycle the instruction to stop
  2. The Process itself which has the lifecycle state (running, stopping, stopped)
  3. The task initiator which listens for instructions to initiate tasks and runs them
  4. The task threads

How can we test that the correct sequence of events occurs when a process is given the instruction to “stop”? For example, we need to ensure that a “stop” command doesn’t set the process lifecycle status to “stopped” while tasks are still running, or set the lifecycle status to “stopping” yet still pick up and initiate tasks. Also we need to do things in the test such as wait for tasks to start and then have the test give the “stop” instruction.

The test needs some mechanism to wait for all relevant events to take place then when it’s done waiting it needs some way of checking that the sequence of events happened in the right order, i.e., lifecycle stop happened after tasks had run to completion.

When I talked it over with someone in the team, he suggested I take inspiration from UI frameworks, e.g., Java Swing, and make use of an event listener/notifier pattern whereby listeners in separate threads can be registered to listen out for events of interest and other (notifier) threads can notify their registered listeners of certain events.

I fitted in a listener/notification pattern which meant that I could create a test helper class like this (all code examples in java):

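import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Listener, Event and LOG belong to the project under test
class LatchMonitor implements Listener {
	// Shared by all instances: event name -> time the notification was received
	public static final Map<String, Long> NOTIFICATIONS_LOGGED = new ConcurrentHashMap<String, Long>();
	private CountDownLatch myLatch;
	private String myNotification;
	private Long myLatchTimeout;

	public LatchMonitor(String notification, CountDownLatch latch, Long latchTimeout) {
		myLatch = latch;
		myNotification = notification;
		myLatchTimeout = latchTimeout;
	}

	// Blocks until the latch reaches zero; returns false if the timeout expires first
	public boolean setToWaitOnNotifier() {
		boolean countReachedZero = false;
		try {
			countReachedZero = myLatch.await(myLatchTimeout, TimeUnit.MILLISECONDS);
		} catch (InterruptedException e) {
			LOG.error("Interrupted", e);
		}
		return countReachedZero;
	}

	public void doNotify(Event event) {
		if (myNotification.equals(event.getOccurrence())) {
			synchronized(this) {
				NOTIFICATIONS_LOGGED.put(event.getOccurrence(), event.getTime());
				// One of the expected notifications has arrived
				myLatch.countDown();
			}
		}
	}
}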

The LatchMonitor uses a CountDownLatch, giving it the capacity to be notified of multiple events, for example, if 3 tasks are running 3 notifications of “task finished” should be received. It uses a HashMap class variable so all instances of LatchMonitor can log their events of interest and the time that they were notified of the event (ConcurrentHashMap is used because of multiple separate threads calling doNotify). Thus if the event sequence matters then NOTIFICATIONS_LOGGED can be used to make assertions as to the sequence of events.

Within the test you can use LatchMonitors to wait and listen for events of interest: in our case, when the lifecycle status has been set to “stopping”, when tasks have finished, and when the lifecycle status has been set to “stopped”. When these events have occurred the test can continue and go on to make assertions as to the timings of these events, i.e., lifecycle stopping time < tasks finished time < lifecycle stopped time.

The LatchMonitor class proved useful; however, it did sometimes have the snag of introducing race conditions in the test, for example:

  1. Test instructs the process to execute some tasks
  2. Test sets a LatchMonitor waiting for task start events
  3. Test sends the stop command to the process
  4. Tests sets a LatchMonitor waiting for the lifecycle status to be stopping
  5. Test sets a LatchMonitor to wait for the tasks to finish and the lifecycle status to be set to stopped

Between 1. and 2. the tasks could already have started before the LatchMonitor is set to wait, so the LatchMonitor is waiting for an event that has already occurred. The same thing can happen between 3. and 4., and between 3. and 5.

What I then did was to have the LatchMonitor implement Runnable and set the LatchMonitors running and waiting in separate threads (FutureTasks) right at the beginning of the test. So now the LatchMonitor is given a run method:

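public void run() {
	// Block in this thread until the expected notifications arrive or the timeout expires
	setToWaitOnNotifier();
}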

The LatchMonitor is set running and waiting at the beginning of the test before step 1.:

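FutureTask<?> futureStart = new FutureTask<Object>((Runnable) taskStartLatchMonitor, null);
Executor executor = Executors.newFixedThreadPool(1);
// Start the monitor waiting in its own thread before the test does anything else
executor.execute(futureStart);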

At step 2. the Test could check:

boolean tasksNotStarted = true;
while (tasksNotStarted) {
	tasksNotStarted = !futureStart.isDone();
}

The test will wait at step 2. for the event, or, if the event has already occurred, the test continues.

This of course makes the test complicated.

Meanwhile, someone on the team again suggested that I take a look at the awaitility package. This is a DSL for testing asynchronous code. Rather than blocking, it polls for events of interest. I played around with rewriting the tests that I’d done using the LatchMonitor approach to use awaitility. I think awaitility is great: I found it easy to use and thought it made the code less complex. However, it uses polling, which could add time to the tests, whereas with the LatchMonitor approach the tests block and continue as soon as there’s been a notification of the event.

Having looked at the awaitility code I used it as inspiration for a pattern I’ve started to use in both multithreaded unit and api tests. For example here is a Listener:

class TestListener implements Listener {
	// volatile: written by the notifying thread, read by the polling test thread
	private volatile boolean beenNotified = false;
	private String myNotification;

	public TestListener(String notification) {
		myNotification = notification;
	}

	public void doNotify(Event event) {
		if (myNotification.equals(event.getOccurrence())) {
			beenNotified = true;
		}
	}

	public boolean haveBeenNotified() {
		return beenNotified;
	}
}

The test then runs as follows:

  1. Set the Listener to listen out for the event of interest
  2. Test sets off something which will end up with the event of interest taking place.
  3. Test waits:
    int waitTime = 0; int waitTimeout = 5000; long pollingTime = 100L;
    while (!listener.haveBeenNotified() && !(waitTime > waitTimeout)) {
        Thread.sleep(pollingTime);
        waitTime += pollingTime;
    }
  4. Test continues with assertions.

This method uses polling rather than waiting, but has the advantage of being in the same thread as the test (thus making the test less complex than the LatchMonitor approach). It also has the advantage of not introducing race conditions, because if the event of interest has already taken place then the test falls through the while loop straight to the assertion. The waitTimeout of course stops the test polling indefinitely if the event never takes place. If you find you have Thread.sleeps in your tests which are causing intermittent failures then I’ve found this pattern helpful in making such tests deterministic (though of course it makes the tests more complicated and verbose).

One book I’ve recently acquired, and which I wish I’d had at the beginning of the project, is Growing Object-Oriented Software Guided by Tests by Steve Freeman and Nat Pryce. I can recommend chapters 26 and 27 for anyone who has to write tests for asynchronous code.

Royal Society Web Science

September 28, 2010

I’ve just spent the last couple of days at the Royal Society Web Science discussion meeting which I felt was a very special event for the following reasons.  Web Science (the internet/www as an object of scientific study) is emerging as a new interdisciplinary field of activity with collaborators from both science and the humanities.  This cross over of ideas from many different disciplines (physics, mathematics, computer science, politics, philosophy, sociology) could prove fruitful, and indeed there were speakers at the event from all these disciplines. All of the speakers were very good indeed, some excellent, and all with high calibre backgrounds and good credentials; people who have obviously paid their dues with years of hard work and good research.

Some common themes, ideas mentioned by more than one speaker, were as follows. More than one person mentioned Frigyes Karinthy and the 6 degrees of separation concept. Another theme was the value to researchers of having at their disposal unprecedentedly vast amounts of rich data, the “digital traces” (Kleinberg) of all of our interactions on the web. With this kind of data sociologists and other students of humanity have the ability to examine human behaviour, and may be able to prove and disprove theories by empirical studies at a scale not possible before. Another common theme was the value of the internet and the web: the value of maintaining the structure of the internet and ensuring its security and scalability, and the value of keeping the web democratic and open.

The presentations should be available to view from

Correlative Analytics (google’s way of doing science)

June 30, 2008

A friend sent me this link to an article describing a way of doing science (making predictions) without any theory or hypothetical model to explain the observed data. Instead, if the data is large enough (petabyte levels of data) then what are required are clever statistical algorithms to find correlations in the data and thereby make predictions. No theories or hypothetical models needed to see which one fits the data best.

These are powerful techniques opened up by having access to huge amounts of data and the writer of the article argues that these techniques will not involve discarding the scientific method but could complement it.

Python and Jython: they’re the main two

January 22, 2008

I’ve come across some differences between python and jython while amending a grinder test harness. Using python 2.5.1 I can do this to get a uri embedded in a string:

tup = text.partition('rdf:about="')
resourceTup = tup[2].partition('"')
print resourceTup[0]

partition returns a 3-tuple: the substring before the separator, the separator itself, and the substring after the separator.

Of course, when running under jython, it would error with:
‘string’ object has no attribute ‘partition’

But this works in jython:

tup = text.split('rdf:about="')
resourceTup = tup[1].split('"')
print resourceTup[0]
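
If the same harness has to run under both interpreters, one option is a small helper that relies only on split. The following is a sketch rather than the code from the harness: the name extract_between is hypothetical, and the example is written so that it also runs on a current Python. Unlike the bare tup[1] above, it returns None instead of raising IndexError when a marker is missing.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between start_marker and end_marker, or None.

    Only str.split is used, so nothing here depends on str.partition.
    """
    after_start = text.split(start_marker, 1)
    if len(after_start) < 2:
        return None  # start marker not present
    before_end = after_start[1].split(end_marker, 1)
    if len(before_end) < 2:
        return None  # closing marker not present
    return before_end[0]


# Illustrative input; the URI is made up for the example.
sample = '<rdf:Description rdf:about="http://example.org/resource/1">'
uri = extract_between(sample, 'rdf:about="', '"')
```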

I wondered if jython were using the java String class, as you can use split and join; however, there is no indexOf method etc. as there is in the java String class.

Here is a useful command to find out what methods an object supports


and can be used in jython to find out what methods a Java class supports
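
Python's built-in dir() is a command of this kind, and the sketch below assumes that is the sort of call meant here. The helper name public_methods is hypothetical; it simply filters the names that dir() returns down to the callable, non-underscore ones. Under jython the same built-in is expected to accept a Java class as its argument, though that is not exercised in this example.

```python
def public_methods(obj):
    """List the names of obj's callable attributes that don't start with '_'."""
    return sorted(
        name for name in dir(obj)
        if not name.startswith("_") and callable(getattr(obj, name))
    )


# Inspect a plain string to see which methods this interpreter offers.
print(public_methods("some text"))
```

Checking whether 'partition' appears in the printed list is a quick way to tell in advance whether the first snippet above will work on a given interpreter.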


There are of course far more fundamental differences between the two than itty bitty string handling.

Programming Collective Intelligence

December 30, 2007

It’s easy to get so involved in your day to day work that you don’t find the time to read around the wider areas of your profession (or at least I find this to be the case). Because of this, one of my colleagues suggested that we have a “geek book club” where we read articles and books that are related to software development, and through this I’ve encountered books such as Object Thinking and Pragmatic Programmer that I otherwise wouldn’t have heard of. For holiday reading over Christmas one of my colleagues suggested that we read Programming Collective Intelligence.

Programming Collective Intelligence
Programming Collective Intelligence by Toby Segaran

This is a book about machine learning and AI in relation to developing Web 2.0 applications so there are chapters about search engines, spam filtering and making recommendations a la Amazon. These chapters I haven’t read but, as I’d implemented a genetic algorithm at university, what I immediately did was to skip to chapter 11 entitled Evolving Intelligence which is about Genetic Programming.

Genetic Programming is a term I’d not heard of before but it is, apparently, an offshoot of Genetic Algorithms. The difference, as I understand it, is that Genetic Algorithms start with an initial population of data structures which represent the answers to a problem. These data structures are amended using the evolutionary concepts of crossover and mutation and a fitness function which chooses the fittest structures (answers) to go on to the next generation. However, as the author explains, Genetic Programming evolves the algorithm itself, not just the parameters or results of an algorithm. In Segaran’s example the algorithm is modelled as a parse tree, which is the way in which programs are often first broken down by a compiler or an interpreter. This tree representation of the algorithm is then subject to crossover and mutation to evolve “better” programs as defined by the fitness function.

This kind of programming, the author tells us, has been used in fields such as optics, gaming, evolving scientific inventions such as antennas for NASA, designing a concert hall shape that gives the best acoustics etc. Though this is only one chapter in a book it goes further than the basics, for example, it touches on how you can provide the algorithm with memory and the algorithmic population with shared memory to help it learn longer term strategies, and points you in the direction of implementing this. I was most impressed and wished that I’d had this book to hand when first learning about the subject. I’ve only read chapter 11 and a bit of chapter 5 but these have already given me a good overview of the subject of genetic algorithms/programming, refreshed my memory on stuff I’ve already learned, taught me new things as well as helped me brush up on the python language. If these chapters are anything to go by then the entire book is well worth reading.

hprof for diagnosing memory leaks

June 29, 2007

The last few days I’ve been trying to diagnose a possible memory leak in one of our java web services and have been using hprof, the built-in java profiler that comes with J2SE. hprof is easy to use; just enter

java -Xrunhprof:help

to see a list of options for usage. I wanted the heap profiling option so used

-Xrunhprof:heap=sites,depth=10

in the command to launch the web server.

heap=sites will break down memory usage according to the amount of memory allocated to particular objects and will also generate stack traces showing the methods which allocated this memory. The depth option sets the depth of the stack trace; I’ve set it to 10 but the default is 4.

hprof generates an output file, java.hprof.txt, on program exit, which starts with the stack traces and finishes with a breakdown of memory usage. Here is a snippet of the memory usage part of the file showing the objects using the most memory:

SITES BEGIN (ordered by live bytes) Tue Jun 26 16:04:27 2007
  	percent          live          alloc'ed  stack class
rank   self  accum     bytes objs     bytes  objs trace name
   1 30.27% 30.27% 123015224 1260358 123015224 1260358 331325 char[]
   2 11.11% 41.38%  45130056 667796  45130056 667796 331138 char[]
   3  7.97% 49.35%  32396784    1  32396784     1 331303 java.lang.String[]
   4  7.97% 57.32%  32396784    1  32396784     1 331302 int[]
   5  7.44% 64.77%  30248592 1260358  30248592 1260358 331324 java.lang.String
   6  5.98% 70.74%  24297624    3  24297624     3 331443 byte[]
   7  5.26% 76.00%  21369472 667796  21369472 667796 331140 org.apache.lucene.index.TermInfo
   8  3.94% 79.95%  16027104 667796  16027104 667796 331137 java.lang.String
   9  2.63% 82.58%  10684736 667796  10684736 667796 331139 org.apache.lucene.index.Term
  10  1.31% 83.89%   5342384    1   5342384     1 331134 long[]

There seems to be a fair amount of information regarding hprof on the web, but I haven’t managed to find a definitive explanation of exactly what all these columns mean. However, I have a fair idea by now, so here goes:

Sites – are particular stack traces.
rank – ranking is in order of amount of memory taken up by particular objects in a stack trace
self – this is the percentage of space allocated to particular objects
accum – the running total of the self column as you read down the ranking, e.g., rank 2 shows 41.38% = 30.27% + 11.11%
live bytes – number of live bytes taken up by currently live objects
live objs – number of currently live objects
alloc’ed bytes – I think this is the number of bytes allocated so far for particular objects
alloc’ed objs – likewise I think the number of objects of this type so far allocated
stack trace – stack trace number
class name – class of object
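
To make the column layout concrete, here is a small sketch in Python that splits rows of the SITES table into named fields and recomputes the accum column as a running total of self. The Site tuple and the function parse_site_row are names invented for this example, and the three rows are copied from the table above.

```python
from collections import namedtuple

# Field names are illustrative; they follow the column headings above.
Site = namedtuple(
    "Site",
    "rank self_pct accum_pct live_bytes live_objs "
    "alloc_bytes alloc_objs trace class_name",
)


def parse_site_row(line):
    """Split one data row of the SITES table into named fields."""
    f = line.split()
    return Site(
        int(f[0]), float(f[1].rstrip("%")), float(f[2].rstrip("%")),
        int(f[3]), int(f[4]), int(f[5]), int(f[6]), int(f[7]),
        " ".join(f[8:]),
    )


rows = [
    "   1 30.27% 30.27% 123015224 1260358 123015224 1260358 331325 char[]",
    "   2 11.11% 41.38%  45130056 667796  45130056 667796 331138 char[]",
    "   3  7.97% 49.35%  32396784    1  32396784     1 331303 java.lang.String[]",
]
sites = [parse_site_row(r) for r in rows]

# Recompute accum as the running total of self and show both side by side.
running_total = 0.0
for site in sites:
    running_total += site.self_pct
    print(site.rank, site.class_name, site.accum_pct, round(running_total, 2))
```

Over the full table the recomputed totals can differ from the printed accum values in the last decimal place, because the self percentages in the file are already rounded.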

As can be seen most memory is used by the char[] (live bytes = 123015224/objs = 1260358). Live bytes/objs will usually be less than alloc'ed bytes/objs due to garbage collection taking place, but as the web server was only running for a short time before this profile was taken it is likely that garbage collection had not happened by the time I stopped the server.

Where live bytes/objs = alloc'ed bytes/objs this could possibly signify a memory leak. One of the sources I looked at stated that low level objects such as char[] tend to float to the top and advised looking further down the ranking for heads of leaking data structures.
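
That advice can be turned into a rough filter over the SITES rows. The sketch below is only a heuristic illustration: the function name never_collected and the set of classes treated as "low level" are assumptions made for the example, not anything hprof itself defines. It reports rows where the live and alloc'ed figures are identical, skipping the low level types.

```python
# Classes treated as "low level" for this example; adjust to taste.
LOW_LEVEL = {
    "char[]", "byte[]", "int[]", "long[]",
    "java.lang.String", "java.lang.String[]",
}


def never_collected(rows, ignore=LOW_LEVEL):
    """Return (class_name, trace, live_objs) for rows where nothing was freed."""
    suspects = []
    for line in rows:
        f = line.split()
        live_bytes, live_objs, alloc_bytes, alloc_objs = map(int, f[3:7])
        class_name = " ".join(f[8:])
        same = (live_bytes, live_objs) == (alloc_bytes, alloc_objs)
        if same and class_name not in ignore:
            suspects.append((class_name, f[7], live_objs))
    return suspects


# Three rows copied from the table above.
rows = [
    "   1 30.27% 30.27% 123015224 1260358 123015224 1260358 331325 char[]",
    "   7  5.26% 76.00%  21369472 667796  21369472 667796 331140 org.apache.lucene.index.TermInfo",
    "   9  2.63% 82.58%  10684736 667796  10684736 667796 331139 org.apache.lucene.index.Term",
]
suspects = never_collected(rows)
```

On a profile taken after only a short run, as here, every row passes the "nothing freed" test, so the filter narrows down where to look rather than proving a leak.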

Here is stack trace 331325 for our highest ranking char[] objects:

TRACE 331325:

java.lang.String.<init>(<Unknown Source>:Unknown line) org.apache.lucene.index.TermBuffer.toTerm(

which were allocated by the section of the code that performs a lucene search with a request to sort the result set.

For the moment it looks as though our “memory leak” issue is due to the sorting functionality in lucene using up a lot of resources rather than a memory leak as such. We are sorting strings which the lucene docs state are the most expensive types to sort in terms of resources because each unique term is cached for each document. This could be a problem for us and we may have to rethink how we do our sorting, if so that will be another topic.

hprof links I found useful: