Saturday, November 15, 2008

Finished!

This will be the last post in this blog. I had my presentation last Tuesday, where I outlined my work in 9 minutes.

Basically, coming into this there were concerns over whether Glosser would scale if demand grew. LSA does not scale well with document length, and even with moderately sized documents, a growing number of users can greatly degrade response time.

The treatise looked at the idea of storing the model. In Java this was simple, as implementing the Serializable interface allows objects to be stored in binary files. I looked into compressing these files using GZIP; however, this was a poor time/space trade-off since the files contain little redundant information.
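As a rough illustration of the storing step, here is a minimal sketch using standard Java serialization with optional GZIP. The ModelStore and LsaModel names are made up for this example and are not Glosser's actual classes, and it writes to a byte array rather than a file to stay self-contained (a FileOutputStream would slot in the same way).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ModelStore {

    /** Hypothetical stand-in for a stored LSA model (not Glosser's real class). */
    public static class LsaModel implements Serializable {
        private static final long serialVersionUID = 1L;
        public final double[][] termVectors;

        public LsaModel(double[][] termVectors) {
            this.termVectors = termVectors;
        }
    }

    /** Serializes the model to bytes, optionally wrapping the stream in GZIP. */
    public static byte[] toBytes(LsaModel model, boolean gzip) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        OutputStream sink = gzip ? new GZIPOutputStream(buffer) : buffer;
        // Closing the ObjectOutputStream also closes (and finishes) the GZIP stream.
        try (ObjectOutputStream out = new ObjectOutputStream(sink)) {
            out.writeObject(model);
        }
        return buffer.toByteArray();
    }

    /** Reads a model back; the gzip flag must match the one used when writing. */
    public static LsaModel fromBytes(byte[] data, boolean gzip)
            throws IOException, ClassNotFoundException {
        InputStream source = new ByteArrayInputStream(data);
        if (gzip) {
            source = new GZIPInputStream(source);
        }
        try (ObjectInputStream in = new ObjectInputStream(source)) {
            return (LsaModel) in.readObject();
        }
    }
}
```

Comparing the length of toBytes(model, false) with toBytes(model, true), and timing both calls, is one way to check the time/space trade-off on real data.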

As the initial test results came in, storing the model proved much faster than recalculating it. The size of the model was also feasible: only 190KB for 3000-word documents.

Looking to add further to the project, I looked at changing the SVD to use Lanczos, which is much faster than the standard method used by the Weka library. However, without a freely available Java implementation, there was simply not enough time in this short project to implement a solution. Lanczos is important for performance, though, and will eventually be integrated into Glosser, most likely by linking to a Matlab implementation of Lanczos (like PROPACK).

With the remaining time, I modified the Weka/Jama implementation of SVD so that it truncates the model after dimensionality reduction, keeping only the useful data.
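The truncation idea can be sketched in plain Java like this. It is not the actual change made to Weka/Jama; the SvdTruncation class and its method names are invented for the example, and it assumes the singular values are already sorted in descending order so that the first k columns are the ones worth keeping.

```java
import java.util.Arrays;

public class SvdTruncation {

    /** Keeps only the first k columns of a matrix stored as rows (e.g. U or V). */
    public static double[][] firstColumns(double[][] matrix, int k) {
        double[][] result = new double[matrix.length][];
        for (int i = 0; i < matrix.length; i++) {
            if (k < 0 || k > matrix[i].length) {
                throw new IllegalArgumentException("k out of range: " + k);
            }
            // Copy the leading k entries of each row and drop the rest.
            result[i] = Arrays.copyOf(matrix[i], k);
        }
        return result;
    }

    /** Keeps only the k largest singular values (assumed sorted descending). */
    public static double[] firstValues(double[] singularValues, int k) {
        if (k < 0 || k > singularValues.length) {
            throw new IllegalArgumentException("k out of range: " + k);
        }
        return Arrays.copyOf(singularValues, k);
    }
}
```

Storing only the truncated U, singular values and V means the saved model grows with the number of dimensions kept rather than with the full decomposition.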

So, summing up, storing the model was a good idea: it provides a good time/space trade-off. For example, with 200 concurrent users and 3000-word documents, even if only 30% of the requests were for previously calculated models, the system would still gain a 20% increase in performance by storing models. In addition, storing the model brings other advantages, such as being able to process and store models offline so the user does not have to wait. For example, a student emails a document and then gets a reply when the feedback is ready.

Overall, this was an interesting project to be part of. Best of luck to the Glosser project, and thanks to my supervisors Dr. Rafael Calvo and Jorge Villalon.


Asela
