Archive for the ‘software development’ Category

Correlative Analytics (google’s way of doing science)

June 30, 2008

A friend sent me this link to an article describing a way of doing science (making predictions) without any theory or hypothetical model to explain the observed data. Instead, if the data is large enough (petabyte levels of data) then what are required are clever statistical algorithms to find correlations in the data and thereby make predictions. No theories or hypothetical models needed to see which one fits the data best.

These are powerful techniques opened up by having access to huge amounts of data and the writer of the article argues that these techniques will not involve discarding the scientific method but could complement it.

Python and Jython: they’re the main two

January 22, 2008

I’ve come across some differences between python and jython while amending a grinder test harness. Using python 2.5.1 I can do this to get a uri embedded in a string:

tup = text.partition('rdf:about="')
resourceTup = tup[2].partition('"')
print resourceTup[0]

partition returns an array of substring before the separator, the separator itself, and the substring after the separator.

Of course, when running under jython, it would error with:
’string’ object has no attribute ‘partition’

But this works in jython:

tup = text.split('rdf:about="')
resourceTup = tup[1].split('"')
print resourceTup[0]

I wondered if jython were using the java String class as you can use split and join, however, there is no indexOf method etc as there is in the java String class.

Here is a useful command to find out what methods an object supports

dir(str)

and can be used in jython to find out what methods a Java class supports

dir(String.class)

There are of course far more fundamental differences between the two than itty bitty string handling but there’s only one way to find out which one is best: fight!

Programming Collective Intelligence

December 30, 2007

It’s easy to get so involved in your day to day work that you don’t find the time to read around the wider areas of your profession (or at least I find this to be the case). Because of this, one of my colleagues suggested that we have a “geek book club” where we read articles and books that are related to software development, and through this I’ve encountered books such as Object Thinking and Pragmatic Programmer that I otherwise wouldn’t have heard of. For holiday reading over Christmas one of my colleagues suggested that we read Programming Collective Intelligence.

Programming Collective Intelligence
Programming Collective Intelligence by Toby Segaran

This is a book about machine learning and AI in relation to developing Web 2.0 applications so there are chapters about search engines, spam filtering and making recommendations a la Amazon. These chapters I haven’t read but, as I’d implemented a genetic algorithm at university, what I immediately did was to skip to chapter 11 entitled Evolving Intelligence which is about Genetic Programming.

Genetic Programming is a term I’d not heard of before but it is, apparently, an offshoot of Genetic Algorithms. The difference, as I understand it, is that Genetic Algorithms start with an initial population of data structures which represent the answers to a problem. These data structures are amended using the evolutionary concepts of crossover and mutation and a fitness function which chooses the fittest structures (answers) to go on to the next generation. However, as the author explains, Genetic Programming evolves the algorithm itself, not just the parameters or results of an algorithm. In Segaran’s example the algorithm is modelled as a parse tree, which is the way in which programs are often first broken down by a compiler or an interpreter. This tree representation of the algorithm is then subject to crossover and mutation to evolve “better” programs as defined by the fitness function.

This kind of programming, the author tells us, has been used in fields such as optics, gaming, evolving scientific inventions such as antennas for NASA, designing a concert hall shape that gives the best acoustics etc. Though this is only one chapter in a book it goes further than the basics, for example, it touches on how you can provide the algorithm with memory and the algorithmic population with shared memory to help it learn longer term strategies, and points you in the direction of implementing this. I was most impressed and wished that I’d had this book to hand when first learning about the subject. I’ve only read chapter 11 and a bit of chapter 5 but these have already given me a good overview of the subject of genetic algorithms/programming, refreshed my memory on stuff I’ve already learned, taught me new things as well as helped me brush up on the python language. If these chapters are anything to go by then the entire book is well worth reading.

hprof for diagnosing memory leaks

June 29, 2007

The last few days I’ve been trying to diagnose a possible memory leak in one of our java web services and have been using hprof the built in java profiler that comes with J2SE. hprof is easy to use just enter

java -Xrunhprof:help

to see a list of options for usage. I wanted the heap profiling option so used

-Xrunhprof:heap=sites,depth=10

in the command to launch the web server.

heap=sites will break down memory usage according to the amount of memory allocated to particular objects and will also generate stack traces showing the methods which allocated this memory. The depth option sets the depth of the stack trace; I’ve set it to 10 but the default is 4.

hprof generates an output file, java.hprof.txt, on program exit which starts with the stack traces and finishes with a breakdown of memory usage. Here is a snippet of the memory usage part of the file showing the objects using the most amount of memory:

SITES BEGIN (ordered by live bytes) Tue Jun 26 16:04:27 2007
  	percent          live          alloc'ed  stack class
rank   self  accum     bytes objs     bytes  objs trace name
   1 30.27% 30.27% 123015224 1260358 123015224 1260358 331325 char[]
   2 11.11% 41.38%  45130056 667796  45130056 667796 331138 char[]
   3  7.97% 49.35%  32396784    1  32396784     1 331303 java.lang.String[]
   4  7.97% 57.32%  32396784    1  32396784     1 331302 int[]
   5  7.44% 64.77%  30248592 1260358  30248592 1260358 331324 java.lang.String
   6  5.98% 70.74%  24297624    3  24297624     3 331443 byte[]
   7  5.26% 76.00%  21369472 667796  21369472 667796 331140 org.apache.lucene.index.TermInfo
   8  3.94% 79.95%  16027104 667796  16027104 667796 331137 java.lang.String
   9  2.63% 82.58%  10684736 667796  10684736 667796 331139 org.apache.lucene.index.Term
  10  1.31% 83.89%   5342384    1   5342384     1 331134 long[]

There seems to be a fair amount of information regarding hprof on the web, but I haven’t managed to find a definitive explanation of exactly what all these columns mean, however, I have a fair idea by now so here goes:

Sites - are particular stack traces.
rank - ranking is in order of amount of memory taken up by particular objects in a stack trace
self - this is the percentage of space allocated to particular objects
accum - not sure of this one, but guessing it could be the percentage of memory ever accumulated by these objects before garbage collection
live bytes – number of live bytes taken up by currently live objects
live objs – number of currently live objects
alloc’ed bytesI think this is the number of bytes allocated so far for particular objects
alloc’ed objs – likewise I think the number of objects of this type so far allocated
stack trace – stack trace number
class name – class of object

As can be seen most memory is used by the char[] (live bytes = 123015224/objs = 1260358). Live bytes/objs will usually be less than alloc'ed bytes/objs due to garbage collection taking place, but as the web server was only running for a short time before this profile was taken it is likely that garbage collection had not happened by the time I stopped the server.

Where live bytes/objs = alloc'ed bytes/objs this could possibly signify a memory leak. One of the sources I looked at stated that low level objects such as char[] tend to float to the top and advised looking further down the ranking for heads of leaking data structures. (See here).

Here is stack trace 331325 for our highest ranking char[] objects:

TRACE 331325:

java.lang.String.<init>(<Unknown Source>:Unknown line) org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:104)
org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:155)
org.apache.lucene.search.FieldCacheImpl$6.createValue(FieldCacheImpl.java:282)
org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)
org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:260)
org.apache.lucene.search.FieldCacheImpl$7.createValue(FieldCacheImpl.java:371)
org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)
org.apache.lucene.search.FieldCacheImpl.getAuto(FieldCacheImpl.java:334)
org.apache.lucene.search.FieldSortedHitQueue.comparatorAuto(FieldSortedHitQueue.java:338)

which were allocated by the section of the code that performs a lucene search with a request to sort the result set.

For the moment it looks as though our “memory leak” issue is due to the sorting functionality in lucene using up a lot of resources rather than a memory leak as such. We are sorting strings which the lucene docs state are the most expensive types to sort in terms of resources because each unique term is cached for each document. This could be a problem for us and we may have to rethink how we do our sorting, if so that will be another topic.

hprof links I found useful:

http://java.sun.com/developer/technicalArticles/Programming/HPROF.html

http://www.skywayradio.com/tech/WAS51/appserver_hangs.php
http://www.skywayradio.com/tech/WAS51/HProf.php
http://publib.boulder.ibm.com/infocenter/javasdk/v1r4m2/index.jsp?
topic=/com.ibm.java.doc.diagnostics.142/html/id1590.html

http://www.javalobby.org/java/forums/t19612.html
http://www.javaworld.com/javaworld/jw-12-2001/jw-1207-hprof.html