Saturday, August 9, 2008

Meeting With Jorge

On Thursday afternoon I met with my co-supervisor Jorge Villalon to discuss the project in more detail and exactly where my thesis fits in.

The aim of the project is to create LSA models from students' documents and then carry out operations on the generated models to extract useful information, such as key text passages and topic identification.

Source
The process starts with grabbing the source document. This could be in a variety of formats, but it is converted into a plain text file and passed on to the next step.
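In the simplest case, where the source is already plain text, this step is little more than reading the file; a converter for richer formats (Word, PDF, etc.) would slot in here instead. A minimal sketch of that simple case (the class name is just mine):

import java.io.*;

public class SourceStep {
    // Read a plain text file into a String to hand on to indexing.
    static String readPlainText(File f) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(f));
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            sb.append(line).append('\n');
        }
        reader.close();
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readPlainText(new File(args[0])));
    }
}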

Lucene Index
The next step is carried out by Lucene. This stage can be referred to as indexing and carries out various transformations on the text to prepare it to be made into the LSA model. First, the text is tokenized. Next, stemming is carried out to reduce the number of unique terms. Finally, stop words are removed from the text; these words can be considered irrelevant to the overall meaning of sentences and text passages, so removing them reduces the number of terms further. A rough sketch of such an analyzer chain is below.
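This is just my own sketch built from Lucene's stock tokenizer and filters, not necessarily the exact chain TML wires up; I've also assumed a LowerCaseFilter before stemming, since the Porter stemmer expects lowercase input:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class LsaPrepAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(reader);   // 1. tokenize
        stream = new LowerCaseFilter(stream);                 // normalise case (assumed)
        stream = new PorterStemFilter(stream);                // 2. stem
        stream = new StopFilter(stream,                       // 3. drop stop words
                StopAnalyzer.ENGLISH_STOP_WORDS);
        return stream;
    }
}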

TML
TML takes the text after it has been processed by Lucene and produces the LSA model. One thing to note is that the LSA models are built on a term-sentence or term-paragraph structure, so unlike other models an entire LSA model can be made from a single document. The resulting matrix is made up of term-frequency vectors indicating the frequency of each term (row) in each sentence/paragraph (column).
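To make the shape of that matrix concrete, here is a toy example that builds a term-by-sentence frequency matrix by hand (plain string splitting stands in for the Lucene analysis above):

import java.util.*;

public class TermSentenceMatrix {
    public static void main(String[] args) {
        String[] sentences = {
            "the cat sat on the mat",
            "the dog sat on the log",
            "cats and dogs"
        };
        // Collect the vocabulary and per-sentence term counts.
        SortedSet<String> vocab = new TreeSet<String>();
        List<Map<String, Integer>> counts = new ArrayList<Map<String, Integer>>();
        for (String s : sentences) {
            Map<String, Integer> c = new HashMap<String, Integer>();
            for (String t : s.split("\\s+")) {
                vocab.add(t);
                Integer n = c.get(t);
                c.put(t, n == null ? 1 : n + 1);
            }
            counts.add(c);
        }
        // Print the matrix: one row per term, one column per sentence.
        for (String term : vocab) {
            StringBuilder row = new StringBuilder(term);
            for (Map<String, Integer> c : counts) {
                Integer n = c.get(term);
                row.append('\t').append(n == null ? 0 : n);
            }
            System.out.println(row);
        }
    }
}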

TML also handles the computationally expensive process of Singular Value Decomposition (SVD), which breaks the term-sentence matrix into the product of three matrices.
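In symbols, the SVD factors the term-sentence matrix A as A = U * S * V', where S is a diagonal matrix of singular values; truncating the small singular values is what gives the reduced LSA space. A small demonstration using the JAMA linear algebra library (my choice for the sketch, not necessarily what TML uses internally):

import Jama.Matrix;
import Jama.SingularValueDecomposition;

public class SvdExample {
    public static void main(String[] args) {
        // A tiny 4-term x 3-sentence frequency matrix.
        double[][] a = {
            {1, 0, 0},
            {0, 1, 0},
            {2, 1, 0},
            {0, 0, 1}
        };
        Matrix A = new Matrix(a);
        SingularValueDecomposition svd = A.svd();
        Matrix U = svd.getU();   // term space
        Matrix S = svd.getS();   // diagonal matrix of singular values
        Matrix V = svd.getV();   // sentence space
        // A is recovered as U * S * V'.
        Matrix reconstructed = U.times(S).times(V.transpose());
        reconstructed.print(6, 2);
    }
}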

Once this is done, TML can carry out various operations on the model to extract useful information. These include:
  • key text passages
  • key terms
  • topics
  • semantic

Where the thesis fits in
Everything described in the previous sections has already been implemented. So where exactly does my thesis fit in? Well, there is a performance issue that needs to be addressed as the project moves forward.

An LSA model is generated for each revision of the document. However, the model itself is not stored, only the results of the operations carried out on it. This means that if another operation is needed later, the model has to be generated again.

So the question arises: is it better to store the LSA model (which is much larger than the operation result sets) or to simply regenerate it whenever it is needed?

Tests will need to be carried out to measure the performance of the system under each approach to determine which is better. At first glance it looks like a time/space tradeoff, but it is unknown exactly what gain or loss storing the large models will bring.
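To give an idea of how such a test could be set up, here is a toy benchmark sketch. The buildModel() method is a placeholder standing in for the real Lucene/TML pipeline, and plain Java serialization stands in for whatever storage scheme the system would actually use:

import java.io.*;

public class StorageBenchmark {

    // Placeholder: builds a large dummy term-sentence matrix instead of
    // running the real indexing and SVD pipeline.
    static double[][] buildModel() {
        double[][] m = new double[2000][500];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < m[i].length; j++)
                m[i][j] = (i * 31 + j) % 7;
        return m;
    }

    public static void main(String[] args) throws Exception {
        File file = new File("model.ser");

        // Approach 1: regenerate the model on demand.
        long t0 = System.currentTimeMillis();
        double[][] model = buildModel();
        long rebuildMs = System.currentTimeMillis() - t0;

        // Store it once (the up-front cost the "store" approach pays).
        ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file));
        out.writeObject(model);
        out.close();

        // Approach 2: load the stored model.
        t0 = System.currentTimeMillis();
        ObjectInputStream in = new ObjectInputStream(new FileInputStream(file));
        double[][] loaded = (double[][]) in.readObject();
        in.close();
        long loadMs = System.currentTimeMillis() - t0;

        System.out.println("rebuild: " + rebuildMs + " ms, load: " + loadMs
                + " ms, file size: " + file.length() + " bytes");
    }
}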

Thus the aim of the thesis will be to analyse the problem, review the two approaches, and recommend which one is better based on the results obtained through testing.
