Monday, September 29, 2008

Results

So lots of things have happened since the last post.

Art of War - I had problems getting this to index in TML. Once the document reached a certain size it simply disregarded the input. This wasted some of my time before I figured out that the text contained the word "Bibliography" in the middle of it, which happens to be a termination condition in TML. Since I was going to have to start again anyway, I decided I might as well look for a new document.

I went to Project Gutenberg, a site full of books that are now in the public domain. The texts come in plain-text format, which makes them easy to use. I chose Alice in Wonderland by Lewis Carroll because of the simplicity of the text: it didn't seem to have any odd features like tables of results or lists of points or anything weird that could skew the results.

The first thing I had to do was pre-process the text. Unfortunately the texts had been pre-formatted and consisted of many extra newlines. What I needed to do was ensure each paragraph ended with a single newline, as this is how TML defines paragraphs. I noted that a real paragraph break consists of two newline characters in a row, so I used a search-and-replace trick: replace every '\n' with an unusual character like 'ð', then convert every 'ðð' to a single '\n', then convert the remaining 'ð' characters to spaces.
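For reference, a minimal sketch of that clean-up step (the file names and the 'ð' placeholder are just illustrative; anything that does the same replacements will do):

import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class CleanParagraphs {
    public static void main(String[] args) throws Exception {
        // Read the raw Gutenberg text (file name is illustrative).
        String text = new String(Files.readAllBytes(Paths.get("alice.txt")), StandardCharsets.UTF_8);
        text = text.replace("\r\n", "\n"); // normalise Windows line endings first
        text = text.replace("\n", "ð");    // 1. mark every newline with a placeholder
        text = text.replace("ðð", "\n");   // 2. a blank line (two newlines) becomes one paragraph break
        text = text.replace("ð", " ");     // 3. remaining single newlines become spaces
        Files.write(Paths.get("alice-clean.txt"), text.getBytes(StandardCharsets.UTF_8));
    }
}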

Next I created a small program to split the file up into a corpus of documents that increase incrementally in size, where the final document is the original document in its entirety.
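Something along these lines, where each output document is a prefix of the cleaned text (the step size and file names are just illustrative):

import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class BuildIncrementalCorpus {
    private static final int STEP = 10; // paragraphs added per document (illustrative)

    public static void main(String[] args) throws Exception {
        String text = new String(Files.readAllBytes(Paths.get("alice-clean.txt")), StandardCharsets.UTF_8);
        String[] paragraphs = text.split("\n");
        StringBuilder prefix = new StringBuilder();
        int fileNo = 0;
        for (int i = 0; i < paragraphs.length; i++) {
            prefix.append(paragraphs[i]).append('\n');       // keep accumulating paragraphs
            boolean last = (i == paragraphs.length - 1);
            if ((i + 1) % STEP == 0 || last) {               // emit a snapshot every STEP paragraphs
                Path out = Paths.get(String.format("corpus/alice-%03d.txt", ++fileNo));
                Files.createDirectories(out.getParent());
                Files.write(out, prefix.toString().getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}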

Next I ran all these texts through the benchmark test, which measured SVD generation time, SVD write time, write time with compression, SVD read time, SVD size and word count. This took a loooooooong time: over 24 hours!
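The "writing with compression" measurement is essentially the serialized SVD pushed through a GZIPOutputStream; something along these lines, where the SVD is treated as an opaque Serializable and timing uses System.currentTimeMillis():

import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class SvdSerialization {

    // Plain serialization; returns the elapsed write time in milliseconds.
    public static long writePlain(Serializable svd, File file) throws IOException {
        long start = System.currentTimeMillis();
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(svd);
        }
        return System.currentTimeMillis() - start;
    }

    // Same thing, but compressed on the way out.
    public static long writeGzipped(Serializable svd, File file) throws IOException {
        long start = System.currentTimeMillis();
        try (ObjectOutputStream out = new ObjectOutputStream(
                new GZIPOutputStream(new FileOutputStream(file)))) {
            out.writeObject(svd);
        }
        return System.currentTimeMillis() - start;
    }

    // Reads either variant back in.
    public static Object read(File file, boolean gzipped) throws IOException, ClassNotFoundException {
        InputStream in = new FileInputStream(file);
        if (gzipped) {
            in = new GZIPInputStream(in);
        }
        try (ObjectInputStream oin = new ObjectInputStream(in)) {
            return oin.readObject();
        }
    }
}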

The results showed many of the expected relationships. The SVD grew linearly with document size; SVD generation time grew polynomially, whilst reading the SVD grew somewhat linearly. SVD size also grew polynomially, and surprisingly the GZIP-compressed version grew with the same complexity and was only slightly smaller than the uncompressed SVD. Writing the GZIP file, however, was much more costly.

One thing of note: even though these relationships held, the word count had to be quite large before SVD generation became orders of magnitude more expensive than storing and restoring the SVD.


However, some of that was last week. Right now I've been analysing the results and writing a bit of the thesis. Since real-world documents, at least regular assignments, won't exceed 2000 words, I created a subset of the Alice in Wonderland text to see what the performance is like when the word count is small. Even at this document size, SVD generation starts to rapidly overtake the other methods after 100 words.

That's all for now. I'll be having another meeting tomorrow and will probably look into other documents.

Wednesday, September 17, 2008

Progress Update

So the semester is moving along and so is the thesis.

The main idea now is to take measurements to establish relationships between different variables, for example the time to generate the SVD as the length of the document increases.

I have found a good website called Project Gutenberg, which provides public-domain books for download. I'm planning to use these documents and vary their length to get measurements: for example, take the first 2 paragraphs and calculate the SVD / measure the time, then the first 4, then the first 6, and so forth.

As for code, there have been some changes. Jorge and Steven are moving towards Glosser 3 and have completely changed the layout of the code. However, the functionality remains the same, so this hasn't affected me that much.

However, some of the information I want to get, like the number of terms TML uses for the SVD, is encapsulated in classes. To get at it I had to add additional getter methods. Because these changes are to two classes not written by me, it was decided that for now I move my project into a branch and work on that, since the TML code might change in the meantime.
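The getters themselves are nothing fancy; just something along these lines (the class and field names here are made up, the real TML ones differ):

// Illustrative only: the real TML class and field names are different.
public class SemanticSpaceStats {
    private int termCount; // recorded while the term-sentence matrix is built

    /** Number of terms TML used when computing the SVD. */
    public int getTermCount() {
        return termCount;
    }
}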

Monday, September 8, 2008

Errors!

In a previous post I discussed the results of some initial testing comparing storing the SVD with creating it. I have since found that the test code had a bug: the temporary Lucene index data was not being cleared, so the amount of data grew incrementally larger each time it ran.

I fixed the code and also changed the LuceneSearch granularity from document to sentence, and the results were quite different.

For example:
Document: Diagnostic 04.txt, 3 KB
Original corpus with SVD: 47 ms
Writing: 0 ms
Reading: 15 ms
Corpus 2: 16 ms
SVD object: 24 KB

Just to see what it could handle, I also tried it with a large text:
Document: Sun Tzu, The Art of War
Size: plain text, 329 KB, approx. 55,000 words
Original corpus with SVD: 715,094 ms
Writing: 2,438 ms
Reading: 453 ms
Corpus 2 (with read SVD): 1,906 ms
SVD object: 36 MB

Looking at Task Manager, the process was consuming about 150 MB of memory whilst calculating the SVD.

So these results are obviously significantly different from the initial ones posted earlier, and they make the case that storing the SVD object may not always be the best option, especially when the document is short.

Draft Treatise Completed

Well, the draft is done and handed in. However, it's only a start; there is a lot more work and writing to do before I finish this project. One issue is that, since I only found some of my background papers late, I could not have a completed background section by the due date.

I was able to get about 2000 words down, mostly in the introduction. I feel this is a good start, but things might need to be revised, as the exact direction of my thesis won't be clear until the experiments to run are decided.

Wednesday, September 3, 2008

Performance Measurement in Java?

Yes, it's another post; I wanted to keep each post to a single topic/issue.

So in this post I've been thinking about performance timing in Java. What is the best way to measure time/performance in Java, or in other words, which methods are appropriate for analysis and comparisons?

The first method is the most basic and the one I've come across before, which is simply the current time from System, i.e.
long time = System.currentTimeMillis();
//Operation
time = System.currentTimeMillis() - time;
System.out.println("Operation took: " + time + "ms");
The method returns the time in milliseconds since Jan 1, 1970. From my understanding, though, this measures "elapsed time" (wall-clock time), so it can give variable results when multitasking, as the thread may get swapped out for something else between the two time calls.

From this discovery I found the other method, which measures the CPU time consumed by the current thread.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

ThreadMXBean mx = ManagementFactory.getThreadMXBean();
if (mx.isThreadCpuTimeSupported() && mx.isThreadCpuTimeEnabled()) {
    long time = mx.getCurrentThreadCpuTime(); // CPU time of the current thread, in nanoseconds
    //Operation
    time = mx.getCurrentThreadCpuTime() - time;
}
This method measures how long the thread was actually running on the CPU (in nanoseconds), and thus should give a value that is independent of whatever multitasking the system is doing (which can inflate elapsed time) while the test runs.

When I ran my earlier tests I used both methods of measurement; for some operations they returned similar or identical values, whilst for others the results were significantly different.
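For reference, this is roughly how both measures can be taken around the same operation (the loop in the middle is just placeholder work):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class Stopwatch {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long wallStart = System.currentTimeMillis();
        long cpuStart = mx.isThreadCpuTimeSupported() ? mx.getCurrentThreadCpuTime() : 0;

        // Operation being measured (placeholder work)
        double sum = 0;
        for (int i = 0; i < 10000000; i++) {
            sum += Math.sqrt(i);
        }

        long wallMs = System.currentTimeMillis() - wallStart;
        long cpuMs = mx.isThreadCpuTimeSupported()
                ? (mx.getCurrentThreadCpuTime() - cpuStart) / 1000000   // ns -> ms
                : -1;
        System.out.println("elapsed: " + wallMs + " ms, cpu: " + cpuMs + " ms (sum = " + sum + ")");
    }
}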

So which is best to use? I'm still not sure. I think the lower variability of the thread CPU time is an advantage, but I will probably need to look into it more to see whether it is the appropriate way to measure time.

Cost of SVD

I have also made some progress on possible experiments. After reviewing some of the existing sample code I wrote some code to test performance. Basically, in a few words, I'm doing the following (a rough sketch follows the list):

  1. Creating LuceneSearch object and setting up parameters and doing semanticspace.calculate()
  2. Grabbing SVD object semanticspace.getSvd() and storing it in binary file
  3. Reading from binary file and creating SVD object
  4. Creating another LuceneSearch object and doing the same thing as step 1, except replacing space.calculate() with space.setSVD("SVD object from file")
  5. Comparing the running times of creating the two LuceneSearch objects
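A rough skeleton of those five steps is below. The TmlSpace interface is only a stand-in for TML's LuceneSearch/SemanticSpace (calculate/getSvd/setSVD are the method names used above, but the real TML signatures differ), and the dummy implementation just produces a serializable array so the skeleton runs on its own:

import java.io.*;

public class SvdStorageTest {

    interface TmlSpace {                 // hypothetical stand-in for TML
        void calculate();                // compute the SVD from the corpus
        Serializable getSvd();           // the SVD object
        void setSVD(Serializable svd);   // reuse a previously computed SVD
    }

    static class DummySpace implements TmlSpace {
        private Serializable svd;
        public void calculate() {
            double[][] m = new double[500][500];    // placeholder "SVD"
            for (double[] row : m) java.util.Arrays.fill(row, Math.random());
            svd = m;
        }
        public Serializable getSvd() { return svd; }
        public void setSVD(Serializable s) { svd = s; }
    }

    public static void main(String[] args) throws Exception {
        File svdFile = new File("svd.bin");

        // 1. Create the first corpus and calculate the SVD from scratch.
        long t1 = System.currentTimeMillis();
        TmlSpace space1 = new DummySpace();
        space1.calculate();
        t1 = System.currentTimeMillis() - t1;

        // 2. Store the SVD object in a binary file (plain serialization).
        long tWrite = System.currentTimeMillis();
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(svdFile))) {
            out.writeObject(space1.getSvd());
        }
        tWrite = System.currentTimeMillis() - tWrite;

        // 3. Read the SVD object back in.
        long tRead = System.currentTimeMillis();
        Serializable svd = null;
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(svdFile))) {
            svd = (Serializable) in.readObject();
        }
        tRead = System.currentTimeMillis() - tRead;

        // 4. Create a second corpus, replacing calculate() with setSVD(...).
        long t2 = System.currentTimeMillis();
        TmlSpace space2 = new DummySpace();
        space2.setSVD(svd);
        t2 = System.currentTimeMillis() - t2;

        // 5. Compare the two creation times.
        System.out.println("calculate: " + t1 + " ms, write: " + tWrite
                + " ms, read: " + tRead + " ms, restore with setSVD: " + t2 + " ms");
    }
}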

Now, I'm not getting any errors, but that doesn't necessarily mean it's working :) so I assume this is correct, though I have not validated it.

To give a rough idea of the numbers:
Document size (Diagnostic1.txt): 3 KB
SVD object binary file: 1000 KB
Corpus 1 (original) creation time: 10,593 ms
Writing out SVD object: 78 ms
Reading in SVD object: 62 ms
Corpus 2 (using read-in SVD object): 234 ms

So these results raise some interesting points.

A 3 KB document produced a 1000 KB SVD. The SVD object here was stored with simple serialization (no compression), but it's still a huge difference.

The relationship between the document size and the SVD generated could be an area of interest. Since in the Glosser project the LSA model is essentially built from a term-sentence matrix, you could expect the number of terms to grow roughly logarithmically (there are only so many unique words you can use, no matter how long the document is), while the number of sentences is expected to grow linearly as the document grows.
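A quick way to eyeball that hunch on one of the Gutenberg texts (the sentence and token splitting here is deliberately crude, and the file name is illustrative):

import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.HashSet;
import java.util.Set;

public class TermGrowth {
    public static void main(String[] args) throws Exception {
        String text = new String(Files.readAllBytes(Paths.get("alice-clean.txt")),
                StandardCharsets.UTF_8).toLowerCase();
        String[] sentences = text.split("[.!?]+");      // crude sentence split
        Set<String> terms = new HashSet<>();
        for (int i = 0; i < sentences.length; i++) {
            for (String token : sentences[i].split("[^a-z]+")) {  // crude tokeniser
                if (!token.isEmpty()) terms.add(token);
            }
            if ((i + 1) % 100 == 0) {
                System.out.println((i + 1) + " sentences -> " + terms.size() + " unique terms");
            }
        }
    }
}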

The other key issue is the difference between generating a new LSA model and restoring an old one. The combined time of writing the SVD out to a file, reading it back in, and creating the corpus is still orders of magnitude less than creating the original SVD.

I guess some of these parameters could be varied to see if some loose relationships can be established, before creating formal tests to validate the results shown in this first run.

Progress

I have some updates on the project, and I will break them up into two posts to cover the two separate things.

So the work on the draft treatise continues; it is due this Friday. Earlier in the week I discussed with Rafael the lack of papers on testing the performance of implementations, and thus the lack of content to write about in the background chapter of the treatise.

Rafael suggested maybe changing the focus of the treatise to implementing an alternative SVD algorithm. But after discussing this with Jorge later on, he was able to get me a couple of papers that deal with the implementation issues of LSA.

Using Google Scholar and checking the citations, some other similar papers popped up. Now it looks like I will have sufficient papers and readings to be able to contextualise my project against what is already out there. Unfortunately, as these papers have only come now, I will probably be unable to finish my background chapter by the due date of the draft treatise.

This shouldn't be a problem, though, since I already have a fair amount of content in the draft, which should be a good platform going into the second half of the semester.