Saturday, November 15, 2008

Finished!

This will be the last post in this blog. I had my presentation last Tuesday, where I outlined my work in a 9-minute talk.

Basically, coming into this there were concerns over whether Glosser would scale if demand grew. LSA does not scale well with document length, and even with moderately sized documents an increasing number of users can greatly degrade response time.

The idea of storing the model was what the treatise looked at. In Java this was simple, as implementing the Serializable interface allows objects to be stored in binary files. I looked into compressing these files using GZIP; however, this was a poor time/space trade-off since there is little redundant information.
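To make the idea concrete, here is a minimal sketch of the serialization approach. LsaModel and its fields are placeholders for whatever class actually holds the model in TML/Glosser, not the real class names.

import java.io.*;
import java.util.zip.GZIPOutputStream;

// Placeholder for the class that holds the SVD matrices; it (and its fields)
// just needs to implement Serializable.
class LsaModel implements Serializable {
    private static final long serialVersionUID = 1L;
    double[][] u, v;
    double[] singularValues;
}

public class ModelStore {
    // Plain binary serialization of the model.
    static void save(LsaModel model, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(model);
        }
    }

    // Same thing, but wrapping the stream in GZIP. In practice this was a poor
    // trade-off: the matrices have little redundancy, so the file barely shrinks
    // while writing takes noticeably longer.
    static void saveCompressed(LsaModel model, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new GZIPOutputStream(new FileOutputStream(file)))) {
            out.writeObject(model);
        }
    }

    // Restoring the model instead of recalculating it.
    static LsaModel load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (LsaModel) in.readObject();
        }
    }
}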

As the initial results came in from the tests, storing the model provided large savings in time over recalculation. The size of the model was also feasible: only 190KB for 3000-word documents.

Looking to add further to the project, I investigated changing the SVD to use the Lanczos algorithm, which is much faster than the standard method used by the Weka library. However, without a freely available Java implementation there was simply not enough time in this short project to implement a solution. Lanczos is nevertheless important for performance and will eventually be integrated into Glosser, most likely by linking to a Matlab implementation of Lanczos (like PROPACK).

With the remaining time I modified the Weka/Jama implementation of SVD so that it truncates the model after dimensionality reduction, keeping only the useful data.
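As a rough illustration (not the actual change to the Weka code), this is how the truncation idea looks using plain Jama. Here k stands for the reduced dimensionality the LSA step already uses, and the term-document matrix is assumed to have at least as many rows as columns.

import Jama.Matrix;
import Jama.SingularValueDecomposition;

public class SvdTruncation {
    // Keep only the first k dimensions of the decomposition so the stored
    // model contains just the useful data.
    public static Matrix[] truncate(Matrix termDoc, int k) {
        SingularValueDecomposition svd = termDoc.svd();
        Matrix u = svd.getU();   // terms x r
        Matrix s = svd.getS();   // r x r diagonal matrix of singular values
        Matrix v = svd.getV();   // docs  x r

        // First k columns of U and V, and the top-left k x k block of S.
        Matrix uk = u.getMatrix(0, u.getRowDimension() - 1, 0, k - 1);
        Matrix sk = s.getMatrix(0, k - 1, 0, k - 1);
        Matrix vk = v.getMatrix(0, v.getRowDimension() - 1, 0, k - 1);
        return new Matrix[] { uk, sk, vk };
    }
}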

So, summing up, storing the model was a good idea. It provides a good time/space trade-off. For example, with 200 concurrent users and 3000-word documents, even if only 30% of requests were for previously calculated models the system would still gain a 20% increase in performance by storing models. In addition, storing the model brings other advantages, such as being able to process and store models offline while the user does not wait: for example, a student emails a document and then gets a reply when the feedback is ready.
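As a back-of-the-envelope check of that figure, the expected cost per request is hit_ratio * t_read + (1 - hit_ratio) * t_recalc. The sketch below assumes a read cost of roughly a third of recalculation purely for illustration; the actual measured timings are in the treatise.

public class CacheSaving {
    public static void main(String[] args) {
        double tRecalc  = 1.0;   // cost of recomputing the SVD (normalised)
        double tRead    = 0.33;  // assumed cost of restoring a stored model
        double hitRatio = 0.30;  // share of requests for already-stored models

        // Expected cost per request when models are cached, relative to
        // always recalculating.
        double expected = hitRatio * tRead + (1 - hitRatio) * tRecalc;
        double saving = 1 - expected;
        System.out.printf("saving ~= %.0f%%%n", saving * 100);  // ~ 20%
    }
}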

Overall this was an interesting project to be part of. Best of luck to the Glosser project and thanks to my Supervisors Dr. Rafael Calvo and Jorge Villalon.


Asela

Monday, November 3, 2008

Presentation

Now, with the written treatise over, my focus is on the presentation and the upcoming final exams.

I have come to realise how little 9 minutes is for a presentation, especially when you have a semester of work to talk about. I met with Jorge last week, who gave me a general template of what topics to cover and what had to be cut out.

To help meet the time constraint, I won't be discussing the GZIP compression or the Lanczos algorithm.

I have my draft presentation ready though, and as per Rafael's request his thesis students, including me, will be giving a rehearsal presentation tomorrow. This will give me a chance to get some feedback to refine the presentation.

This semester (which should be my last) is closing, and now there are only a couple of weeks until it's all over.

Monday, October 27, 2008

Treatise Printed!

I went to Officeworks today and got the treatise printed. It is now complete and ready to go (submission is on Wednesday). To give a bit of an idea of what I did, here is the abstract. I still have the presentation to work on, and I'll post some charts of the results later.


With the emergence of Web 2.0 and the increasing ubiquity of the internet, there has been a rise in the number of rich internet applications, and particularly collaborative web applications. One application relevant for Collaborative Work and Collaborative Learning is Collaborative Writing (CW), in which a document is written synchronously by more than one author. CW as a learning activity is especially relevant for the teaching and learning of academic writing in higher education. Google Docs is a CW application that simulates a word processor within a web page; it is used by students in the School of EIE at the University of Sydney to write collaboratively.

Feedback on students' writing is an important source of learning for academic writing; however, providing more feedback is costly because it requires a lot of human time. Automated tools to provide feedback on writing have been proposed and tested. One of these, called Glosser, was implemented at the School of EIE; it uses Machine Learning (particularly Text Mining) techniques to provide automatic feedback on essays. Glosser works as a web application integrated with Google Docs.

Machine learning programs are able to extract useful information from data. The Glosser tool uses a text mining technique known as Latent Semantic Analysis (LSA), which has a high computational complexity. This poses a problem: given the high cost of creating the model and the amount of data produced by collaborative applications, it is particularly difficult to achieve a good response time for tools such as Glosser. Particularly in web-based applications, where response time is vital, there need to be ways to minimise the impact of ML model creation on the users' response time by managing these models in an intelligent way.

This treatise proposes a scheme for managing machine learning models in a collaborative web application. The proposed method is essentially to cache the machine learning model. By making the model persistent, further calls on the same document would not require recalculation; the model could simply be restored from storage, thus reducing response time. Experimental analysis showed that the proposed method was effective and greatly reduced recalculation time. File compression of the stored model was shown to be a bad time-space trade-off, whilst truncating the model to remove redundant data provided a significant reduction in model size. The data showed that in a situation with 200 concurrent users and ~3000-word documents, the new method would provide a 20% reduction in time even with a cache hit ratio of 30%, with each model requiring 190KB of space.

Sunday, October 19, 2008

Quick Update

Just a quick update to say that thesis writing continues. I hope to have most of it done by Tuesday to show Jorge so that I can get feedback, since I will most likely be finishing it and getting it printed by the end of this week.


Back to work...

Monday, October 13, 2008

Countdown to the End

So there is only a little over two weeks to get the written treatise finished. Finally, I have begun writing more of the core of the thesis.

Throughout the semester I was constantly refining exactly how and what I was testing. I still have small issues to deal with to get a full set of results, but the results I have are good enough to begin writing, as they show all the logical relationships between the different variables.

I will leave it to a later post, once I have actually written the conclusion, to sum up what I have done and the findings regarding Singular Value Decomposition (SVD).

Although I know what the results mean and the conclusions that can be drawn from them regarding my initial aim, it is still by no means a trivial task to express this in the form of a written treatise. So it's looking like a busy two weeks to finish it all off.

Monday, October 6, 2008

Lanczos and Java

So the deadline is nearing. After a suggestion from Rafael, I decided to look into replacing the current SVD method with the more efficient Lanczos algorithm.

I quickly figured out that implementing this from scratch would be impossible for me within a two-week timeframe, especially since I would need to understand the algorithm in detail.

I started searching for things that would help:
  • Java matrix packages: there seem to be a lot of them out there. A popular one is Jama, which is in fact the matrix package used by Weka (see the sketch of its standard SVD after this list). Others include Matrix Toolkits for Java, Colt and JLAPack. None of these had an implementation of SVD based on the Lanczos algorithm.
  • However, there is code out there that does implement Lanczos, unfortunately not in Java. Examples include SVDPack and PROPACK. It seems that the scientific community still sticks to Fortran, C and Matlab for these matrix-based solutions.
  • I did look into calling a C or Matlab program from within Java; however, it seemed more difficult than I thought.
  • The Matlab program would require a Matlab runtime environment on the host machine; in addition, it would make the entire system dependent on Matlab.
  • As for the C program, the code was difficult to read and wasn't structured very well.
  • For the time being it doesn't look like this Lanczos implementation can be done. It's definitely doable, but with only a few weeks left I would need to wrap it up quickly so I can write the actual thesis.
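For reference, this is roughly what the standard (non-Lanczos) SVD call via Jama looks like; the tiny matrix is made-up illustration data, not a real term-document matrix.

import Jama.Matrix;
import Jama.SingularValueDecomposition;

public class JamaSvdExample {
    public static void main(String[] args) {
        // Tiny term-by-document matrix, just for illustration.
        double[][] counts = {
            {1, 0, 2},
            {0, 1, 1},
            {3, 1, 0},
            {0, 2, 1}
        };
        Matrix a = new Matrix(counts);
        SingularValueDecomposition svd = a.svd();  // the standard dense SVD

        System.out.println("U is " + svd.getU().getRowDimension() + " x "
                + svd.getU().getColumnDimension());
        double[] sigma = svd.getSingularValues();
        System.out.println("largest singular value: " + sigma[0]);
    }
}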
With the remaining time I looked into implementing my proposed changes in the actual TML code. One thing I noticed was that operations besides the SVD also cost time that I was not measuring. Including these added a significant amount of time, to the point where SVD generation only really became more expensive than reading the SVD at about the 4000-5000 word mark.

I have a meeting tomorrow as usual and will discuss these issues.

Monday, September 29, 2008

Results

So lots of things have happened since the last post.

Art of War: I had problems getting this to index in TML. After the document reached a certain size it simply disregarded the input. This problem wasted some of my time before I figured out that the text contained the word "Bibliography" in the middle of it. As it happens, this is a termination condition in TML. I decided I might as well look for a new document if I'm going to start again.

I went to Project Gutenberg, a site filled with books that are now in the public domain; the texts come in plaintext format, which made them easier to use. I chose Alice in Wonderland by Lewis Carroll because of the simplicity of the text: it didn't seem to have any odd features like tables of results, lists of points or anything else weird that could skew the results.

The first thing I had to do was pre-process the text. Unfortunately the texts had been preformatted and consisted of many extra new lines. What I needed to do was ensure each paragraph ended with a newline, as this is how TML defines paragraphs. I noted that a real paragraph break would consist of two newline characters in a row. I used a search-and-replace technique to do this: I replaced all '\n' with an unusual character like 'ð', then converted all the 'ðð' to a single '\n', then converted the remaining 'ð' to ' ' (space).
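A minimal sketch of that search-and-replace step (the class name is just for illustration):

public class ParagraphCleaner {
    // Collapse hard-wrapped lines so that only real paragraph breaks
    // (blank lines, i.e. "\n\n") remain as single newlines, which is how
    // TML detects paragraphs.
    public static String clean(String text) {
        String marker = "\u00F0";                         // the 'ð' placeholder
        String result = text.replace("\n", marker);       // hide all newlines
        result = result.replace(marker + marker, "\n");   // real paragraph breaks
        result = result.replace(marker, " ");             // wrapped lines become spaces
        return result;
    }
}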

Next I created a small program to split the file up, creating a corpus of documents that increase incrementally in size, with the final document being the original document in its entirety.
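Something along these lines; the step size and file naming here are illustrative rather than the exact values I used.

import java.io.*;

public class CorpusSplitter {
    // Write out prefixes of the cleaned text in increasing sizes, the last
    // one being the whole document.
    public static void split(File source, File outDir, int stepChars) throws IOException {
        StringBuilder sb = new StringBuilder();
        BufferedReader reader = new BufferedReader(new FileReader(source));
        try {
            String line;
            while ((line = reader.readLine()) != null) sb.append(line).append('\n');
        } finally {
            reader.close();
        }
        String text = sb.toString();

        int n = 1;
        for (int end = stepChars; end < text.length(); end += stepChars, n++) {
            writeFile(new File(outDir, "part" + n + ".txt"), text.substring(0, end));
        }
        writeFile(new File(outDir, "part" + n + ".txt"), text);  // the full document
    }

    private static void writeFile(File f, String content) throws IOException {
        Writer w = new FileWriter(f);
        try {
            w.write(content);
        } finally {
            w.close();
        }
    }
}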

Next I ran all these texts through the benchmark test, which measured the time of generating the SVD, writing the SVD, writing with compression and reading the SVD, as well as the size of the SVD and the word count. This took a loooooooong time: over 24 hours!
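The timing pattern for each measurement was essentially the following; the buildSvd/save/load methods below are empty placeholders standing in for the real TML calls, not their actual names.

import java.io.File;

public class BenchmarkSketch {
    public static void main(String[] args) throws Exception {
        File modelFile = new File("model.bin");

        long t0 = System.nanoTime();
        Object model = buildSvd("document.txt");     // generate SVD
        long generateMs = (System.nanoTime() - t0) / 1000000L;

        t0 = System.nanoTime();
        save(model, modelFile);                      // write SVD
        long writeMs = (System.nanoTime() - t0) / 1000000L;

        t0 = System.nanoTime();
        load(modelFile);                             // read SVD
        long readMs = (System.nanoTime() - t0) / 1000000L;

        System.out.println("generate=" + generateMs + "ms write=" + writeMs
                + "ms read=" + readMs + "ms size=" + modelFile.length() + " bytes");
    }

    static Object buildSvd(String path) { return new Object(); }  // placeholder
    static void save(Object model, File f) { }                    // placeholder
    static Object load(File f) { return null; }                   // placeholder
}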

Results showed many of the expected relationships. The SVD itself grew linearly with document size; SVD generation time grew polynomially, whilst reading the SVD grew somewhat linearly. SVD file size grew polynomially as well, but surprisingly the GZIP-compressed file grew with the same complexity and was only slightly smaller than the uncompressed SVD. Writing the GZIP file, however, was much more costly.

One thing of note was that, even though the relationships were apparent, the word count had to be quite large before SVD generation was orders of magnitude more costly than the process of storing the SVD.


Some of that was last week, though; right now I've been analysing the results and writing a bit of the thesis. Since real-world documents, at least regular assignments, won't exceed 2000 words, I created a subset of the Alice in Wonderland text to see what the performance is like when the word count is small. Even at this document size, SVD generation starts to rapidly overtake the other methods after 100 words.

That's all for now. I'll be having another meeting tomorrow and will probably look into other documents.