Wednesday, September 3, 2008

Cost of SVD

I have also made some progress on possible experiments. After reviewing some of the existing sample code, I wrote some code to test performance. In a few words, I'm:

  1. Creating a LuceneSearch object, setting up its parameters and calling semanticspace.calculate()
  2. Grabbing the SVD object with semanticspace.getSvd() and storing it in a binary file
  3. Reading the binary file back in and recreating the SVD object
  4. Creating another LuceneSearch object and doing the same thing as in step 1, except replacing space.calculate() with space.setSVD("SVD object from file")
  5. Comparing the running times of creating the two LuceneSearch objects
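
The write, read and compare part of those steps can be sketched roughly as below. This is only a minimal sketch under stated assumptions: PlaceholderSvd and buildPlaceholderSvd are hypothetical stand-ins I made up for illustration (the real object comes from semanticspace.getSvd(), and its fields are not shown here), and it uses plain Java serialization with no compression. It also compares the restored object against the original, which is one cheap way to start validating the round trip.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class SvdCacheSketch {

    /** Hypothetical stand-in for the real SVD object; the field layout is an assumption. */
    public static class PlaceholderSvd implements Serializable {
        private static final long serialVersionUID = 1L;
        public final double[][] u;   // terms x rank
        public final double[] s;     // singular values
        public final double[][] v;   // sentences x rank

        public PlaceholderSvd(double[][] u, double[] s, double[][] v) {
            this.u = u;
            this.s = s;
            this.v = v;
        }
    }

    /** Fills the factors with deterministic dummy numbers. This is NOT a real decomposition. */
    public static PlaceholderSvd buildPlaceholderSvd(int terms, int sentences, int rank) {
        double[][] u = new double[terms][rank];
        double[][] v = new double[sentences][rank];
        double[] s = new double[rank];
        for (int i = 0; i < terms; i++)
            for (int j = 0; j < rank; j++) u[i][j] = Math.sin(i * rank + j);
        for (int i = 0; i < sentences; i++)
            for (int j = 0; j < rank; j++) v[i][j] = Math.cos(i * rank + j);
        for (int j = 0; j < rank; j++) s[j] = rank - j;
        return new PlaceholderSvd(u, s, v);
    }

    /** Step 2: store the object in a binary file using plain serialization. */
    public static void write(Serializable obj, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeObject(obj);
        }
    }

    /** Step 3: read the object back from the binary file. */
    public static Object read(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            return in.readObject();
        }
    }

    /** Validation helper: true when every stored factor is identical. */
    public static boolean sameFactors(PlaceholderSvd a, PlaceholderSvd b) {
        return Arrays.deepEquals(a.u, b.u)
                && Arrays.equals(a.s, b.s)
                && Arrays.deepEquals(a.v, b.v);
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("svd", ".bin");
        PlaceholderSvd original = buildPlaceholderSvd(300, 50, 50);

        long t0 = System.nanoTime();
        write(original, file);
        long t1 = System.nanoTime();
        PlaceholderSvd restored = (PlaceholderSvd) read(file);
        long t2 = System.nanoTime();

        System.out.printf("write: %d ms, read: %d ms, file: %d bytes%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, Files.size(file));
        System.out.println("round trip identical: " + sameFactors(original, restored));
        Files.delete(file);
    }
}
```

In the real experiment the placeholder would be replaced by the object from getSvd(), and the restored object handed to setSVD() before timing the second LuceneSearch creation.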

Now I'm not getting any errors, but that doesn't necessarily mean it's
working :), so I assume this is correct but I have not validated it.

To give a rough idea of the statistics:
Document size (Diagnostic1.txt): 3 KB
Object binary file: 1000 KB
Corpus 1 (original) creation time: 10,593 ms
Writing out SVD object: 78 ms
Reading in SVD object: 62 ms
Corpus 2 (using read-in SVD object) creation time: 234 ms

These results raise some interesting points.

A 3 KB document has produced a 1000 KB SVD. The SVD object here was stored with simple serialization (no compression), but it's still a huge difference: the file is over 300 times the size of the document.

The relationship between the document size and the size of the SVD generated could be an area of interest. Since the LSA model generated in the glosser project is essentially based on a term-sentence matrix, you could expect the number of terms to grow roughly logarithmically (there are only so many unique words you can use, no matter how long the document is). The number of sentences, on the other hand, is expected to grow linearly as the document grows.
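
To get a feel for how that could play out, here is a rough size estimate. It is a sketch under loud assumptions: it supposes the SVD object stores three dense factors as doubles (U of terms x rank, the singular values, and V of sentences x rank), which I have not checked against the actual SVD class, and the term and sentence counts are made-up numbers, not measurements from Diagnostic1.txt. Serialization overhead and anything else the object holds are ignored.

```java
public class SvdSizeEstimate {

    /**
     * Bytes needed for dense factors U (terms x rank), S (rank values)
     * and V (sentences x rank), each entry stored as an 8-byte double.
     * Assumed layout, not taken from the real SVD class.
     */
    public static long estimateBytes(long terms, long sentences, long rank) {
        long doubles = terms * rank + rank + sentences * rank;
        return doubles * Double.BYTES;
    }

    public static void main(String[] args) {
        long terms = 300; // hypothetical vocabulary size, held fixed
        for (long sentences : new long[] {25, 50, 100, 200}) {
            // Without truncation the rank is at most min(terms, sentences).
            long rank = Math.min(terms, sentences);
            System.out.printf("sentences=%d rank=%d approx bytes=%d%n",
                    sentences, rank, estimateBytes(terms, sentences, rank));
        }
    }
}
```

Under these assumptions, while sentences are fewer than terms the rank grows with the sentence count too, so the stored size grows faster than linearly in the number of sentences. Whether the real object behaves this way is exactly what the experiments would need to show.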

The other key issue is the difference between generating a new LSA model and restoring an old one. The combined time of writing the SVD out to a file, reading it back in and creating the corpus (78 + 62 + 234 = 374 ms) is still more than an order of magnitude less than creating the original corpus (10,593 ms), roughly 28 times faster.
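
The arithmetic behind that comparison, using only the timings listed in the statistics above, is simple enough to write down. The class and method names here are just for illustration.

```java
public class RestoreSpeedup {

    /** Total cost of the restore path: write the SVD, read it back, rebuild the corpus. */
    public static long restoreTotalMs(long writeMs, long readMs, long rebuildMs) {
        return writeMs + readMs + rebuildMs;
    }

    /** How many times faster the restore path is than computing from scratch. */
    public static double speedup(long originalMs, long restoreMs) {
        return (double) originalMs / restoreMs;
    }

    public static void main(String[] args) {
        long restore = restoreTotalMs(78, 62, 234);   // timings from the single test run
        double factor = speedup(10_593, restore);
        System.out.printf("restore path: %d ms, speedup: %.1fx%n", restore, factor);
    }
}
```

Keep in mind this is a single run on a single 3 KB document, so the factor is only indicative.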

I guess some of these parameters could be varied to see whether any loose relationships can be established, before creating formal tests to validate the results shown in this first test.
