Saturday, November 15, 2008

Finished!

This will be the last post in this blog. I had my presentation last Tuesday, where I outlined my work in a 9-minute talk.

Basically, coming into this there were concerns over whether Glosser would scale if demand grew. LSA does not scale well with document length, and even with moderately sized documents, increasing the number of users can greatly degrade response time.

The treatise looked at the idea of storing the model. In Java this was simple, as implementing the Serializable interface allows objects to be written to binary files. I also looked into compressing these files using GZIP, but this was a poor time/space trade-off since there is little redundant information in the model.
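
For the record, the storage itself is nothing more than standard Java object serialization, with GZIP layered on top for the compression experiment. A minimal sketch of the idea (the class and method names are illustrative, not the actual Glosser/TML code):

import java.io.*;
import java.util.zip.GZIPOutputStream;

public class ModelWriter {

    // Plain serialization: this is all that is needed once the model implements Serializable.
    public static void save(Serializable model, File file) throws IOException {
        ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file));
        try {
            out.writeObject(model);
        } finally {
            out.close();
        }
    }

    // GZIP-compressed variant: costs extra time but saves little space,
    // since the SVD matrices contain little redundant information.
    public static void saveCompressed(Serializable model, File file) throws IOException {
        ObjectOutputStream out = new ObjectOutputStream(
                new GZIPOutputStream(new FileOutputStream(file)));
        try {
            out.writeObject(model);
        } finally {
            out.close();
        }
    }
}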

As the initial test results came in, storing the model provided large time savings over recalculation, and the size of the model was feasible: only 190 KB for 3000-word documents.

Looking to add further to the project, I looked at possibly changing the SVD to use the Lanczos algorithm, which is much faster than the standard method used by the Weka library. However, without a freely available Java implementation, there was simply not enough time in this short project to implement a solution. Lanczos is nonetheless important for performance and will eventually be integrated into Glosser, most likely by linking to a Matlab implementation of Lanczos (like PROPACK).

With the remaining time, I modified the Weka/Jama implementation of SVD so that it truncates the model after dimensionality reduction, keeping only the useful data.
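
The truncation itself is simple once you have the decomposition: after reducing to k dimensions, only the first k columns of U and V and the first k singular values carry useful data, so everything else can be dropped before the model is serialized. A rough Jama-style sketch of what I mean (not the actual Weka/Jama code I modified; it assumes k is no larger than the number of columns):

import Jama.Matrix;
import Jama.SingularValueDecomposition;

public class SvdTruncation {

    // Keep only the parts of the decomposition needed for a rank-k model.
    public static Matrix[] truncate(Matrix a, int k) {
        SingularValueDecomposition svd = a.svd();
        int m = a.getRowDimension();
        int n = a.getColumnDimension();

        // First k columns of U (m x k), top-left k x k block of S, first k columns of V (n x k).
        Matrix uk = svd.getU().getMatrix(0, m - 1, 0, k - 1);
        Matrix sk = svd.getS().getMatrix(0, k - 1, 0, k - 1);
        Matrix vk = svd.getV().getMatrix(0, n - 1, 0, k - 1);

        return new Matrix[] { uk, sk, vk };
    }
}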

So, summing up: storing the model was a good idea. It provides a good time/space trade-off. For example, with 200 concurrent users and 3000-word documents, even if only 30% of the requests were for previously calculated models, the system would still gain a 20% increase in performance by storing models. In addition, storing the model brings other advantages, such as being able to process and store models offline while the user does not wait; for example, a student emails a document and then gets a reply when the feedback is ready.
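
To spell out the arithmetic behind that claim (with illustrative numbers, not the exact figures from my results): if a fraction h of requests can be served from a stored model, the expected time per request is roughly h x t_restore + (1 - h) x t_generate. Assuming, say, t_generate = 1000 ms and t_restore = 350 ms, then at h = 0.3 the expected time is 0.3 x 350 + 0.7 x 1000 = 805 ms, i.e. roughly a 20% reduction compared with always regenerating.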

Overall this was an interesting project to be part of. Best of luck to the Glosser project and thanks to my Supervisors Dr. Rafael Calvo and Jorge Villalon.


Asela

Monday, November 3, 2008

Presentation

Now, with the written treatise over, my focus is on the presentation and the upcoming final exams.

I have come to realise how little 9 minutes is for a presentation, especially when you have a semester of stuff to talk about. I met with Jorge last week, who gave me a general template of what topics to talk about and what had to be cut out.

To meet the time constraint, I won't be discussing the GZIP compression or the Lanczos algorithm.

I have my draft presentation ready, though, and as per Rafael's request his thesis students, including me, will be giving rehearsal presentations tomorrow. This will give me a chance to get some feedback to refine the presentation.

This semester (which should be my last) is closing, and now there are only a couple of weeks until it's all over.

Monday, October 27, 2008

Treatise Printed!

I went to Officeworks today and got the treatise printed. It is now complete and ready to go (submission is on Wednesday). To give a bit of an idea of what I did, here is the abstract. I still have the presentation to work on, and I'll post some charts of the results later.


With the emergence of Web 2.0 and the increasing ubiquity of the internet, there has been a rise in the number of rich internet applications, and particularly collaborative web applications. One application area, relevant for Collaborative Work and Collaborative Learning, is Collaborative Writing (CW), in which a document is written synchronously by more than one author. CW as a learning activity is especially relevant for the teaching and learning of academic writing in higher education. Google Docs is a CW application that simulates a word processor within a web page; it is used by students in the School of EIE at the University of Sydney to write collaboratively.

Feedback on students' writing is an important source of learning for academic writing; however, providing more feedback is costly because it requires a lot of human time. Automated tools to provide feedback on writing have been proposed and tested. One of these, called Glosser, was implemented at the School of EIE; it uses Machine Learning (particularly Text Mining) techniques to provide automatic feedback on essays. Glosser works as a web application integrated with Google Docs.

Machine learning programs are able to extract useful information from data. The Glosser tool uses a text mining technique known as Latent Semantic Analysis (LSA), which has a high computational complexity. This poses a problem: given the high cost of creating the model and the amount of data produced by collaborative applications, it is particularly difficult to achieve a good response time for tools such as Glosser. Particularly in web-based applications, where response time is vital, there need to be ways to minimise the impact of ML model creation on the users' response time by managing these models in an intelligent way.

This treatise proposes a model for managing Machine Learning models in a collaborative web application. The proposed method is essentially to cache the Machine Learning model. By making the model persistent, further calls on the same document do not require recalculation; the model can simply be restored from storage, reducing response time. Experimental analysis showed that the proposed method was effective and greatly reduced recalculation time. File compression of the stored model was shown to be a bad time-space trade-off, whilst truncating the model to remove redundant data provided a significant reduction in model size. The data showed that in a situation with 200 concurrent users with ~3000-word documents, the new method would provide a 20% reduction in time even with a cache hit ratio of only 30%, with each model requiring 190 KB of space.

Sunday, October 19, 2008

Quick Update

Just a quick update to say that thesis writing continues. I hope to have most of it done by Tuesday to show Jorge so that I can get feedback, since I will most likely be finishing and getting it printed by the end of this week.


Back to work...

Monday, October 13, 2008

Countdown to the End

So there is only a little over two weeks left to get the written treatise finished. Finally, I have begun writing more of the core of the thesis.

Throughout the semester I was constantly refining exactly how and what I was testing. I still have small issues to deal with to get a full set of results, but the results I have are good enough to begin writing with, as they show all the logical relationships between the different variables.

I will leave it to a later post, once I have actually written the conclusion, to sum up what I have done and the findings I discovered in regard to Singular Value Decomposition (SVD).

Although I know what the results mean and the conclusions that can be drawn from them in regard to my initial aim, it is still by no means a trivial task to express this in the form of a written treatise. So it's looking like a busy two weeks to finish it all off.

Monday, October 6, 2008

Lanczos and Java

So the deadline is nearing. After a suggestion from Rafael, I decided to look into replacing the current SVD method with the more efficient Lanczos algorithm.

I figured out quickly that implementing this from scratch would be impossible for me within a two-week timeframe, especially since I would need to understand the algorithm in detail.

I started searching for things that would help:
  • Java matrix packages: there seem to be a lot of them out there; a popular one is Jama, which is in fact the matrix package used by Weka. Others include Matrix Toolkits for Java, Colt and JLAPACK. None of these had an implementation of SVD based on the Lanczos algorithm.
  • However, there is code out there that does implement Lanczos, unfortunately not in Java. This includes SVDPACK and PROPACK. It seems the scientific community still sticks to Fortran, C and Matlab for these matrix-based solutions.
  • I did look into calling a C or Matlab program from within Java; however, it seemed more difficult than I thought.
  • The Matlab option would require a Matlab runtime environment on the host machine, and in addition it would make the entire system dependent on Matlab.
  • As for the C program, the code was difficult to read and wasn't structured very well.
  • For the time being it doesn't look like this Lanczos implementation can be done. It's definitely doable, but with only a few weeks left I would need to wrap it up quickly so I can write the actual thesis.
With the remaining time, I looked into implementing my proposed changes to the actual TML code. One thing I noticed was that there are operations besides the SVD that cost time which I was not measuring. These added significant amounts of time, to the point where SVD generation only really became more expensive than reading the SVD at about the 4000-5000 word mark.

I have a meeting tomorrow as usual and will discuss these issues.

Monday, September 29, 2008

Results

So lots of things have happened since the last post.

Art of War: I had problems getting this to index in TML. After the document reached a certain size, it simply disregarded the input. This problem wasted some of my time before I figured out that the text contained the word "Bibliography" in the middle of it; as it happens, this is a termination condition in TML. I decided I might as well look for a new document if I'm going to start again.

I went to Project Gutenberg, a site filled with books that are now in the public domain. The texts come in plaintext format, which made them easier to use. I chose Alice in Wonderland by Lewis Carroll because of the simplicity of the text: it didn't seem to have any odd features like tables of results or lists of points or anything weird that could skew the results.

The first thing I had to do was pre-process the text. Unfortunately, the text had been preformatted and consisted of many extra newlines. What I needed to do was ensure each paragraph ended with a single newline, as this is how TML defines paragraphs. I noted that a real paragraph break consists of two newline characters in a row. I used a search-and-replace technique to do this: I replaced all '\n' with an unusual character like 'ð', then converted all 'ðð' to a single '\n', then converted the remaining 'ð' to ' ' (space).
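
For anyone curious, that search-and-replace boils down to something like this little Java method (a minimal sketch only; the placeholder is the 'ð' character mentioned above, and the exact code I used differs):

public class ParagraphNormaliser {

    // Collapse wrapped lines into single-line paragraphs, as TML expects.
    public static String normalise(String text) {
        String marker = "\u00F0";                      // 'ð' as a temporary placeholder
        text = text.replace("\n", marker);             // every newline -> placeholder
        text = text.replace(marker + marker, "\n");    // a double newline was a real paragraph break
        return text.replace(marker, " ");              // leftover placeholders were just line wraps
    }
}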

Next, I created a small program to split the file up to create a corpus of documents that increase incrementally in size, where the final document is the original document in its entirety.
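
The splitting itself is straightforward; roughly speaking the program did something like the following (a sketch only, with illustrative file names and a paragraph-based step size):

import java.io.*;

public class CorpusSplitter {

    // Writes part-1.txt, part-2.txt, ... where file n contains the first n*step
    // paragraphs of the text, so the final file is the original document in full.
    public static void split(String text, int step, File outDir) throws IOException {
        String[] paragraphs = text.split("\n");
        int fileNo = 1;
        for (int end = step; ; end += step, fileNo++) {
            int upTo = Math.min(end, paragraphs.length);
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < upTo; i++) {
                sb.append(paragraphs[i]).append('\n');
            }
            Writer w = new FileWriter(new File(outDir, "part-" + fileNo + ".txt"));
            w.write(sb.toString());
            w.close();
            if (upTo == paragraphs.length) {
                break;  // the last file contains the whole document
            }
        }
    }
}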

Next, I ran all these texts through the benchmark test, which measured the time to generate the SVD, write the SVD, write with compression, and read the SVD, as well as the size of the SVD and the word count. This took a loooooooong time: over 24 hours!

Results showed many of the expected relationships: the SVD grew with document size, with SVD generation time growing polynomially whilst reading the SVD grew somewhat linearly. The SVD size grew polynomially as well, but surprisingly the GZIP-compressed file grew with the same complexity and was only slightly smaller than the uncompressed SVD. However, writing the GZIP file was much more costly.

One thing of note was that, even though the relationships were shown, the word count had to be quite large before SVD generation was orders of magnitude more expensive than storing the SVD.


However, some of that was last week; right now I've been analysing the results and writing a bit of the thesis. Since real-world documents, at least regular assignments, won't exceed 2000 words, I created a subset of the Alice in Wonderland text to see what the performance is like when the word count is small. Even at this document size, SVD generation starts to rapidly overtake the other methods after 100 words.

That's all for now. I'll be having another meeting tomorrow and will probably look into other documents.

Wednesday, September 17, 2008

Progress Update

So the semester is moving along and so is the thesis.

The main idea now is to get measurements to establish various relationships between different variables, for example the time to generate the SVD as the length of the document increases.

I have found a good website called Project Gutenberg, which provides non-copyrighted books for download. I'm planning to use these documents and vary their length to help get measurements. For example, take the first 2 paragraphs and calculate the SVD/measure the time, then take the first 4, then the first 6, and so forth.

As for code, there have been some changes. Jorge and Steven are moving towards Glosser 3 and have completely changed the layout of the code; however, the functionality remains the same, so this hasn't affected me that much.

However, some of the information I want to get, like the number of terms used by TML for the SVD, is encapsulated in classes. As a result, for me to get this information I had to add additional getter methods. Because these changes are to classes not written by me, it was decided that for now I would move my project into a branch and work on that, since the TML code might change in the meantime.

Monday, September 8, 2008

Errors!

In reference to a previous post where I discussed the results of initial testing comparing storing the SVD as opposed to creating it: I found that the code had a bug in it that caused the temporary Lucene index data not to be cleared, so the amount of data got incrementally larger each time it ran.

I changed the code and also changed the LuceneSearch from document to sentence level, and the results were quite different.

For example:
Document: Diagnostic 04.txt (3 KB)
Original corpus with SVD: 47 ms
Writing: 0 ms
Reading: 15 ms
Corpus 2: 16 ms
SVD object: 24 KB

Just to see what it could handle, I also tried it with a large text:
Document: Sun Tzu, The Art of War
Size: plaintext, 329 KB, approx. 55,000 words
Original corpus with SVD: 715,094 ms
Writing: 2,438 ms
Reading: 453 ms
Corpus 2 (with read SVD): 1,906 ms
SVD object: 36 MB

Looking at Task Manager, it was consuming about 150 MB of memory whilst calculating the SVD.

So these results are obviously significantly different to the initial ones posted earlier, and they make the case that storing the SVD object may not always be the best option, especially when the document is short.

Draft Treatise Completed

Well, the draft is done and handed in. However, it's only a start; there is a lot more work and writing to do before I finish this project. One thing was that, since I only found some of my background papers late, I could not have a completed background section by the due date.

I was able to get about 2000 words down, mostly in the introduction. I feel this is a good start, but things might need to be revised, as the exact direction of my thesis won't be clear until the experiments to run are decided.

Wednesday, September 3, 2008

Performance Measurement in Java?

Yes, it's another post; I wanted to keep each individual post to only one topic/issue.

So in this post what I was thinking about is performance timing in Java. What is the best way to measure time/performance in Java, or in other words, which methods are appropriate when it comes to analysis and comparisons?

The first method I came across is the most basic and the one I've used before, which is simply the current time from System, i.e.
long time = System.currentTimeMillis();
//Operation
time = System.currentTimeMillis() - time;
System.out.println("Operation took: " + time + "ms");
The method returns the time in milliseconds since Jan 1, 1970. From my understanding, though, this measures "elapsed time" (wall-clock time), so it can give variable results when multitasking, as the thread may get swapped out for something else between the two time calls.

From this discovery I found another method, which measures the CPU time consumed by the current thread.

import java.lang.management.*;

ThreadMXBean mx = ManagementFactory.getThreadMXBean();
if (mx.isThreadCpuTimeSupported() && mx.isThreadCpuTimeEnabled()) {
    long time = mx.getCurrentThreadCpuTime();   // CPU time in nanoseconds
    //Operation
    time = mx.getCurrentThreadCpuTime() - time;
    System.out.println("Operation took: " + (time / 1000000) + "ms");
}
This method measures how long the thread was running on the CPU (note it returns nanoseconds), and thus should give a value that is independent of whatever multitasking your system is doing (which can vary the elapsed time) when the test runs.

When I ran my earlier tests I used both methods of measurement; for some operations they both returned similar or identical values, whilst for other operations the results were significantly different.

So which is the best to use? I'm still not sure. I think the lower variability of the thread CPU time is an advantage, but I will probably need to look into it more to see if this is the appropriate way to measure time.

Cost of SVD

I have also made some progress on possible experiments. After reviewing some of the existing sample code I wrote some code to test performance. Basically, in a few words, I'm:

  1. Creating a LuceneSearch object, setting up parameters, and calling semanticspace.calculate()
  2. Grabbing the SVD object with semanticspace.getSvd() and storing it in a binary file
  3. Reading from the binary file and recreating the SVD object
  4. Creating another LuceneSearch object and doing the same thing as step 1, except replacing space.calculate() with space.setSVD("SVD object from file")
  5. Comparing the running times of creating the two LuceneSearch objects (a rough sketch of the store/restore and timing parts is below)
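
For steps 3 to 5, the restore-and-time part is just standard deserialization wrapped in simple timing. A rough sketch of what I mean, with the TML-specific calls reduced to comments since the exact API isn't the point here (the file name is illustrative):

import java.io.*;

public class RestoreTiming {

    // Step 3: read a previously stored SVD object back from its binary file.
    public static Object readSvd(File file) throws IOException, ClassNotFoundException {
        ObjectInputStream in = new ObjectInputStream(new FileInputStream(file));
        try {
            return in.readObject();
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        // Step 1: build the first LuceneSearch/semantic space and call calculate() here.
        long corpus1Time = System.currentTimeMillis() - start;

        start = System.currentTimeMillis();
        Object svd = readSvd(new File("svd.bin"));
        // Step 4: build the second semantic space and call setSVD(svd) instead of calculate().
        long corpus2Time = System.currentTimeMillis() - start;

        // Step 5: compare the two creation times.
        System.out.println("Corpus 1: " + corpus1Time + " ms, Corpus 2: " + corpus2Time + " ms");
    }
}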

Now, I'm not getting any errors, but that doesn't necessarily mean it's working :), so I assume this is correct but I have not validated it.

To give a rough idea of the statistics:
Document size (Diagnostic1.txt): 3 KB
Object binary file: 1000 KB
Corpus 1 (original) creation time: 10,593 ms
Writing out SVD obj: 78 ms
Reading in SVD obj: 62 ms
Corpus 2 (using read-in SVD object): 234 ms

So these results show some interesting things.

A 3 KB document has created a 1000 KB SVD. The SVD object here was stored with simple serialization (no compression), but it's still a huge difference.

The relationship between the document size and the SVD generated could be an area of interest. Since in the Glosser project the LSA model is based on essentially a term-sentence matrix, you could expect the number of terms to grow logarithmically (there are only so many unique words you can use, no matter how long the document is), whilst on the other hand the number of sentences is expected to grow linearly as the document grows.

The other key issue is the difference between generating a new LSA model and restoring an old one. The combined time of writing the SVD out to a file, reading it back in, and creating the corpus is still orders of magnitude less than creating the original SVD.

I guess variations could be made to some of these parameters to see if some loose relationships can be established, before creating formal tests to validate the results shown in this first test.

Progress

I have some updates on the project, and I will break them up into two posts to explain the two separate things.

So work continues on the draft treatise, which is due this Friday. I discussed with Rafael earlier in the week the lack of papers regarding performance testing in implementations, and thus the lack of content to write about in the background chapter of the treatise.

Rafael suggested maybe changing the focus of the treatise to implementing an alternative SVD algorithm. But after I discussed this with Jorge later on, he was able to get me a couple of papers that deal with the implementation issues of LSA.

Using Google Scholar and checking the citations, some other similar papers popped up. Now it looks like I will have sufficient papers and readings to be able to contextualise my project with what is already out there. Unfortunately, as these papers have only just arrived, I will probably be unable to finish my background chapter by the due date of the draft treatise.

This shouldn't be a problem, though, since I already have a fair amount of content in the draft, which should be a good platform going into the second half of the semester.

Saturday, August 30, 2008

Draft Treatise

The draft treatise has been the focus of attention for the last couple of weeks, as it is due next week (5th of September).

I am trying to ensure all the required information is contained in the initial pages, as the later parts of the treatise can't be completed yet.

I want to describe and explain various aspects and give background information about the project so that a reader should be able to pick it up without having to read other texts. I have also thought about the experimental stages of the thesis; however, at this point nothing is set in concrete, so I'm not 100% sure exactly what types of experiments I will run to test the system.

Also, in terms of implementation, I have been looking at the code, and in the upcoming days I hope to continue working on it. I'm looking at how the system will work if I serialize and store the semantic space (after SVD reduction) and then restore it. Comparing this with the current approach will show what kind of speed difference there is.

Sunday, August 24, 2008

Measuring Performance

After the repository issues were solved, I checked out Corpus Segmenter, glosser-indexer and tml, as well as lib. However, this was not enough to get it all working. I had installed the m2 Maven plugin for Eclipse but did not install some of the optional components, which caused Maven to not work properly.

I met with Jorge and got some help setting Eclipse up so that everything could build. After that, I started looking at some of the example classes in Corpus Segmenter with the aim of getting a better idea of how it works. As the main aim of my project is to determine whether to store or recreate the index, I set about trying to measure the times of various actions as the index was created.

I created a new class and used bits of code from other classes to get me started. I'm still in the process of finishing this task.

In addition I have begun writing the Introduction for the partial treatise draft that is due in a couple of weeks. This will help clarify my project further and get me started on writing this big document.

Finally, I will be meeting Jorge again on Tuesday for the Glosser meeting, after which we will discuss the progress of the project.

Subclipse issues resolved

The source code of the project is stored in an SVN repository. I was attempting to use the same method as the others to access the code.

I downloaded Eclipse 3.3.1.1 and, using Find Software Updates, installed Subclipse.

However, when attempting to add the URL of the repository, it failed. The first bit of information I found was to change the SVN interface (Window -> Preferences -> Team -> SVN) from JavaHL to SVNKit. This helped, as now when I attempted to add a repository it prompted me for the username and password. However, I was met with a new error: the SSH connection simply failed and did not give any information in the error message.

The issue was finally solved by changing the password for the login, with the belief that the use of the "@" character in the password may have been causing problems. This worked, and I was finally able to access the repository.

Saturday, August 9, 2008

Meeting With Jorge

On Thursday afternoon I met with my co-supervisor Jorge Villalon to discuss the project in more detail and exactly where my thesis fits in.

The aim of the project is to create LSA models from students' documents and then carry out operations on these generated models to extract useful information, such as key text passages and topic identification.

Source
The process starts with grabbing the source document. This could be in a variety of formats, but it is converted into a plain text file and passed on to the next step.

Lucene Index
The next step is carried out by "Lucene". This stage can be referred to as indexing, and it carries out various transformations on the text to prepare it to be made into the LSA model. Firstly, the text document is tokenized. Next, stemming is carried out to reduce the number of unique terms. Finally, stop words are removed from the text; these words can be considered irrelevant to the overall meaning of sentences and text passages, so they are removed to reduce the number of terms.
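
To make those three transformations concrete, here is a tiny plain-Java sketch of the same idea (this is not the actual Lucene code used by TML/Glosser; the stop-word list and the crude suffix-stripping "stemmer" are just illustrative stand-ins, and a real system would use something like the Porter stemmer):

import java.util.*;

public class SimpleIndexingSketch {

    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("the", "a", "an", "and", "of", "to", "is"));

    public static List<String> process(String sentence) {
        List<String> terms = new ArrayList<String>();
        // Tokenize: lower-case and split on anything that isn't a letter.
        for (String token : sentence.toLowerCase().split("[^a-z]+")) {
            if (token.length() == 0 || STOP_WORDS.contains(token)) {
                continue;               // drop stop words (and empty tokens)
            }
            terms.add(stem(token));     // stem to merge variants of the same word
        }
        return terms;
    }

    // A deliberately crude stemmer, only for illustration.
    private static String stem(String token) {
        if (token.endsWith("ing") && token.length() > 5) {
            return token.substring(0, token.length() - 3);
        }
        if (token.endsWith("s") && token.length() > 3) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }
}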

TML
TML takes the text after it has been processed by Lucene and produces the LSA model. One thing to note is that the LSA models made here are based on a term-sentence or term-paragraph structure, so unlike other models, the entire LSA model can be made from one document. The resulting matrix is made of term-frequency vectors indicating the frequency of each term (row) in each sentence/paragraph (column).

TML also handles the computationally expensive process of Singular Value Decomposition (SVD), which breaks the term-sentence matrix into the product of three matrices.
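
For the curious, Jama (the matrix package used by Weka, which the SVD code here builds on) exposes that decomposition directly, so the step looks roughly like this sketch (the tiny hand-written term-sentence matrix is purely illustrative):

import Jama.Matrix;
import Jama.SingularValueDecomposition;

public class SvdSketch {
    public static void main(String[] args) {
        // A toy term-sentence matrix: rows = terms, columns = sentences,
        // each entry is the frequency of that term in that sentence.
        double[][] counts = {
            {1, 0, 2},
            {0, 1, 1},
            {3, 0, 0},
            {1, 1, 0}
        };
        Matrix a = new Matrix(counts);

        // A = U * S * V^T
        SingularValueDecomposition svd = a.svd();
        Matrix u = svd.getU();   // term-concept matrix
        Matrix s = svd.getS();   // diagonal matrix of singular values
        Matrix v = svd.getV();   // sentence-concept matrix

        // Multiplying the three factors back together should recover A (approximately).
        Matrix reconstructed = u.times(s).times(v.transpose());
        reconstructed.print(6, 2);
    }
}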

Once this is done, TML can carry out various operations to extract useful information. These include:
  • key text passages
  • key terms
  • topics
  • semantic

Where the thesis fits in
So, all that was described in the previous sections has already been implemented. Where exactly does my thesis fit in? Well, there is an issue regarding performance that needs to be addressed as this project moves forward.

The LSA models are generated for each revision of the document. However, the model itself is not stored, only the results of the operations carried out. This means that if another operation is later needed, the model needs to be generated again.

So the question arises: is it better to store the LSA model (much larger than the operation result sets) or simply generate the LSA model whenever it's needed?

Tests will need to be carried out to measure the performance of the system with the different methods to determine which is best. At first glance it looks like a time/space trade-off. However, it is unknown exactly what gain or loss will be achieved by storing the large models.

Thus the aim of the thesis will be to analyse the problem, review the two approaches, and make a recommendation as to which one is better based on the results obtained through testing.

Sunday, August 3, 2008

Project Plan

The initial project plan was written and submitted on Friday, August 1st.

It outlined my current view and understanding of the background knowledge about the topic and also of what the project is about.

The timeline I included is an estimate I made at this stage of the various activities that will be required to finish the treatise project.

As I stated before, my topic is "Managing Machine Learning Models in collaborative web applications", but what does this mean?

Well, in this project the machine learning model in question is Latent Semantic Analysis (LSA). Basically, LSA is built from a term-document matrix which counts the frequency of words (rows) in documents (columns). It is quite useful because it can find word relations such as synonymy and polysemy.

In brief, the aim of the project is to manage this model, which is being used to research and extract useful information from students' documents.

It will aim to address issues like:

* Storing and retrieving the large term-document matrix
* How to deal with documents as they change: do we need to recalculate the entire LSA model? (computationally expensive)
* What to do with the results of information-extracting operations on the LSA model: should they be stored to save time later? If stored, how do we deal with changes to the LSA model?


That's all for now. Hopefully I can refine the scope a bit more in the near future.

Tuesday, July 29, 2008

back!

After traveling during the mid-semester break I'm back for the start of the semester and to get things rolling on the project.

I aim to update the blog regularly throughout the semester as I complete my project.

I have read some papers relating to my topic and will post my summaries of them on this blog.

Also, I have to complete the initial project plan for the thesis by the end of this week (1st of August) and will post details about it when I have finished.

Monday, May 26, 2008

Blog Created!

This blog will be used to track what I do for my thesis for ELEC4707 Engineering Project at the University of Sydney.

My topic is "Managing machine learning models in collaborative web applications". It is being supervised by Rafael Calvo and co-supervised by Jorge Villalon.

As it is still Semester 1, I am busy with the end-of-semester assessments for my other subjects. However, Jorge has sent me some papers regarding the topic, and I will post some short summaries here when I get a chance to read them.

That's all for now.