Monday, September 29, 2008

Results

So lots of things have happened since the last post.

Art of War: I had problems getting this to index in TML. After the document reached a certain size it simply disregarded the input. This problem wasted some of my time before I figured out that the text contained the word "Bibliography" in the middle of it. As it happens, this is a termination condition in TML. I decided that if I was going to start again, I might as well look for a new document.

I went to Project Gutenberg, a site filled with books that are now in the public domain. The texts come in plain-text format, which made them easier to use. I chose Alice in Wonderland by Lewis Carroll because of the simplicity of the text: it didn't seem to have any odd features, like tables of results or lists of points, that could skew the results.

The first thing I had to do was pre-process the text. Unfortunately the texts had been preformatted and consisted of many extra newlines. What I needed to do was ensure each paragraph ended with a single newline, as this is how TML defines paragraphs. I noted that a real paragraph break consisted of two newline characters in a row, so I used a search-and-replace technique. I replaced every '\n' with an unusual character like 'ð', then converted every 'ðð' to a single '\n', then converted the remaining 'ð' to ' ' (space).
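For anyone wanting to repeat the three replacement steps outside a text editor, here is a minimal Python sketch of the same idea. The function name `normalise_paragraphs` and the guard against the placeholder already being in the text are my own assumptions, not part of TML or of the original tool used.

```python
def normalise_paragraphs(text: str, placeholder: str = "\u00f0") -> str:
    """Join hard-wrapped lines so each paragraph ends with one newline.

    Hypothetical helper: 'placeholder' defaults to the character 'ð'.
    """
    if placeholder in text:
        # The trick only works if the placeholder is otherwise unused.
        raise ValueError("placeholder already occurs in the text")
    # Step 1: mark every newline with the placeholder.
    marked = text.replace("\n", placeholder)
    # Step 2: two placeholders in a row were a blank line (paragraph break).
    marked = marked.replace(placeholder * 2, "\n")
    # Step 3: any placeholder left over was a line wrap inside a paragraph.
    return marked.replace(placeholder, " ")


sample = "Alice was beginning\nto get very tired\n\nSo she was\nconsidering"
print(normalise_paragraphs(sample))
```

Two caveats with this sketch: a run of three or more newlines leaves a stray leading space on the next paragraph, and files with Windows line endings ('\r\n') would need the '\r' characters removed first.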

Next I created a small program to split the file up into a corpus of documents that increase incrementally in size, where the final document is the original document in its entirety.
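The splitting step could look something like the sketch below. It assumes the split is done on paragraph boundaries and that the number of documents is chosen up front; both are my assumptions, and `incremental_corpus` is a hypothetical name rather than the actual program.

```python
def incremental_corpus(text: str, steps: int) -> list[str]:
    """Return growing prefixes of 'text'; the last one holds every paragraph.

    Assumes one paragraph per line, as produced by the pre-processing step.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    paragraphs = [p for p in text.split("\n") if p.strip()]
    if not paragraphs:
        return []
    # Never ask for more documents than there are paragraphs.
    steps = min(steps, len(paragraphs))
    corpus = []
    for i in range(1, steps + 1):
        # Integer arithmetic keeps the cut points strictly increasing.
        end = (i * len(paragraphs)) // steps
        corpus.append("\n".join(paragraphs[:end]) + "\n")
    return corpus


docs = incremental_corpus("first\nsecond\nthird\nfourth\n", steps=2)
for doc in docs:
    print(len(doc.split()), "words")
```

Each document is a prefix of the next, so any difference in timing between two documents comes from the extra text alone.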

Next I ran all these texts through the benchmark test, which measured the time taken to generate the SVD, write the SVD, write the SVD with compression and read the SVD, as well as the size of the SVD and the word count. This took a loooooooong time: over 24 hours!
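To give a rough idea of how the storage side of such a benchmark can be timed, here is a small Python sketch using only the standard library. It is not the TML benchmark: the payload is a stand-in byte string rather than a real SVD, and the names `timed` and `benchmark_storage` are hypothetical.

```python
import gzip
import os
import tempfile
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def write_plain(path, payload):
    with open(path, "wb") as handle:
        handle.write(payload)


def write_gzip(path, payload):
    with gzip.open(path, "wb") as handle:
        handle.write(payload)


def read_plain(path):
    with open(path, "rb") as handle:
        return handle.read()


def benchmark_storage(payload: bytes) -> dict:
    """Time plain write, gzip write and plain read for one payload."""
    with tempfile.TemporaryDirectory() as tmp:
        plain = os.path.join(tmp, "svd.bin")
        packed = os.path.join(tmp, "svd.bin.gz")
        _, write_s = timed(write_plain, plain, payload)
        _, gzip_s = timed(write_gzip, packed, payload)
        restored, read_s = timed(read_plain, plain)
        return {
            "write_seconds": write_s,
            "gzip_write_seconds": gzip_s,
            "read_seconds": read_s,
            "plain_bytes": os.path.getsize(plain),
            "gzip_bytes": os.path.getsize(packed),
            "round_trip_ok": restored == payload,
        }


# Stand-in payload only; a real run would serialise the SVD matrices.
print(benchmark_storage(b"0.125 " * 5000))
```

The SVD generation itself could be wrapped in the same `timed` helper, so that all measurements come from the same clock.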

The results showed many of the expected relationships. SVD grew linearly with document size, SVD generation grew polynomially, whilst reading the SVD grew somewhat linearly. SVD size grew polynomially as well, but surprisingly the GZIP-compressed size grew with the same complexity and was only slightly below the uncompressed SVD. However, writing the GZIP file was much more costly.
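One way to put a number on "linear" versus "polynomial" is to fit a straight line to the measurements on a log-log scale: if time is roughly c * n^k, the slope of log(time) against log(n) is k. The sketch below is only an illustration of that check. The function name `growth_exponent` is hypothetical and the numbers fed to it are synthetic, not my benchmark results.

```python
import math


def growth_exponent(sizes, timings):
    """Least-squares slope of log(timing) against log(size)."""
    if len(sizes) != len(timings) or len(sizes) < 2:
        raise ValueError("need at least two (size, timing) pairs")
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in timings]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("sizes must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic data that follows timing = 3 * size^2 exactly.
sizes = [100, 200, 400, 800]
timings = [3 * s ** 2 for s in sizes]
print(growth_exponent(sizes, timings))
```

A slope near 1 suggests linear growth, and a slope clearly above 1 suggests polynomial growth of that degree.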

One thing of note was that, even though these relationships showed up, the word count had to be significantly large before SVD generation was orders of magnitude more expensive than storing the SVD.


However, some of that was last week. Right now I've been analysing the results and writing a bit of the thesis. Since real-world documents, or at least regular assignments, won't exceed 2000 words, I created a subset of the Alice in Wonderland text to see what the performance is like when the word count is small. Even for documents this small, SVD generation starts to rapidly overtake the other operations after 100 words.

That's all for now. I'll be having another meeting tomorrow and will probably look into other documents.
