Saturday, August 30, 2008

Draft Treatise

The draft treatise has been the focus of my attention for the last couple of weeks, as it is due next week (5th September).

I am attempting to ensure that all the required information is contained in the initial pages, since the later parts of the treatise can't be completed yet.

I want to describe and explain various aspects of the project and give enough background information that the reader should be able to pick it up without having to read other texts. I have also thought about the experimental stages of the thesis, however at this point in time nothing is set in concrete, so I'm not 100% sure exactly what types of experiments I will run to test the system.

Also, in terms of implementation, I have been looking at the code, and in the upcoming days I hope to continue working on it. I'm looking at how the system will behave if I serialize and store the semantic space (after SVD reduction) and restore it later. Comparing this with the current approach will show what kind of speed difference there is.
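As a first sketch of what storing the space might look like, the snippet below round-trips a reduced space through plain Java serialization. The SemanticSpace class and its fields are hypothetical stand-ins for whatever TML actually produces after the SVD, not real TML types:

```java
import java.io.*;

// Hypothetical container for the reduced space; the real TML classes
// and fields may look quite different.
class SemanticSpace implements Serializable {
    private static final long serialVersionUID = 1L;
    double[][] termVectors;     // e.g. U_k * Sigma_k, one row per term
    double[][] passageVectors;  // e.g. V_k * Sigma_k, one row per sentence/paragraph
    String[] terms;             // row labels
}

public class SpaceStore {
    // Write the reduced space to disk so it can be restored later.
    static void save(SemanticSpace space, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeObject(space);
        }
    }

    // Restore a previously stored space instead of recomputing the SVD.
    static SemanticSpace load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            return (SemanticSpace) in.readObject();
        }
    }
}
```

Timing load() against a full rebuild of the model should then give a first number for the comparison.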

Sunday, August 24, 2008

Measuring Performance

After the repository issues were solved, I checked out Corpus Segmenter, glosser-indexer and tml, as well as lib. However, this was not enough to get it all working. I had installed the M2 Maven plugin for Eclipse but did not install some of the optional components, which caused Maven to not work properly.

I met with Jorge and got some help setting Eclipse up so that everything could build. After that I started looking at some of the example classes in Corpus Segmenter with the aim of getting a better idea of how everything fits together. As the main aim of my project is to determine whether to store or recreate the index, I set about trying to measure the times of various actions as the index was created.

I created a new class and used bits of code from other classes to get me started. I'm still in the process of finishing this task.
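The measuring itself is simple wall-clock timing. Here is a minimal sketch of the kind of instrumentation I mean; the stage names and the empty Runnable bodies are placeholders, not real corpus segmenter calls:

```java
// Minimal wall-clock instrumentation for the indexing stages.
// Stage names and bodies are placeholders for the real pipeline calls.
public class StageTimer {

    static void time(String stage, Runnable work) {
        long start = System.nanoTime();
        work.run();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(stage + ": " + elapsedMs + " ms");
    }

    public static void main(String[] args) {
        time("tokenize + index",  () -> { /* build the Lucene index here */ });
        time("build term matrix", () -> { /* create the term-sentence matrix */ });
        time("svd",               () -> { /* run the SVD reduction */ });
    }
}
```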

In addition, I have begun writing the introduction for the partial treatise draft that is due in a couple of weeks. This will help clarify my project further and get me started on writing this big document.

Finally, I will be meeting Jorge again on Tuesday for the Glosser meeting, after which we will discuss the progress of the project.

Subclipse Issues Resolved

The source code of the project is stored in an SVN repository. I was attempting to use the same method as the others to access the code.

I downloaded Eclipse 3.3.1.1 and installed Subclipse through the software updates feature.

However, when I attempted to add the URL of the repository, it failed. The first bit of information I found suggested changing the SVN interface (Window -> Preferences -> Team -> SVN) from JavaHL to SVNKit. This helped, as when I then attempted to add a repository it prompted me for the username and password. However, I was met with a new error: the SSH connection simply failed and did not give any information in the error message.

The issue was finally solved by changing the password for the login, on the belief that the "@" character in the password may have been causing problems. This worked, and I was finally able to access the repository.

Saturday, August 9, 2008

Meeting With Jorge

On Thursday afternoon I met with my co-supervisor Jorge Villalon to discuss the project in more detail and exactly where my thesis fits in.

The aim of the project is to create LSA models from students' documents and then carry out operations on these generated models to extract useful information, such as key text passages and topic identification.

Source
The process starts with grabbing the source document. This could be in a variety of formats, but it is converted into a plain text file and passed on to the next step.

Lucene Index
The next step is carried out by Lucene. This stage can be referred to as indexing, and it carries out various transformations on the text to prepare it to be made into the LSA model. First, the text document is tokenized. Next, stemming is carried out to reduce the number of unique terms. Finally, stop words are removed from the text; these words can be considered irrelevant to the overall meaning of sentences and text passages, so they are removed to reduce the number of terms.
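For a concrete picture, here is what that chain looks like with a stock analyzer in a recent Lucene version. The Lucene the project uses is older and its API differs, so treat this as an illustration of the steps rather than the project's actual code:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalysisDemo {
    public static void main(String[] args) throws Exception {
        // EnglishAnalyzer chains roughly the steps described above:
        // standard tokenization, lower-casing, stop-word removal,
        // and Porter stemming.
        try (Analyzer analyzer = new EnglishAnalyzer();
             TokenStream ts = analyzer.tokenStream("body",
                     "The students were revising their essays")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term);   // prints roughly: student, revis, essai
            }
            ts.end();
        }
    }
}
```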

TML
TML takes the text after it has been processed by Lucene and produces the LSA model. One thing to note is that the LSA models made are based on a term-sentence or term-paragraph structure, so unlike other models the entire LSA model can be made from one document. The resulting matrix is made of term-frequency vectors indicating the frequency of each term (row) in each sentence/paragraph (column).
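To make that structure concrete, here is a toy version of the term-sentence counting in plain Java; it is nothing TML-specific and assumes the text has already been tokenized into sentences of terms:

```java
import java.util.*;

// Toy term-sentence frequency matrix: rows are terms, columns are sentences.
public class TermSentenceMatrix {
    public static void main(String[] args) {
        String[][] sentences = {
            {"lsa", "model", "sentence"},
            {"model", "store", "model"},
        };

        // Assign each distinct term a row index in first-seen order.
        Map<String, Integer> rowOf = new LinkedHashMap<>();
        for (String[] sentence : sentences)
            for (String term : sentence)
                rowOf.putIfAbsent(term, rowOf.size());

        // counts[row][col] = frequency of term `row` in sentence `col`.
        int[][] counts = new int[rowOf.size()][sentences.length];
        for (int col = 0; col < sentences.length; col++)
            for (String term : sentences[col])
                counts[rowOf.get(term)][col]++;

        for (Map.Entry<String, Integer> e : rowOf.entrySet())
            System.out.println(e.getKey() + "\t" + Arrays.toString(counts[e.getValue()]));
    }
}
```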

TML also handles the computationally expensive process of Singular Value Decomposition which breaks the term-sentence matrix into the product of three matrices.
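For reference, this is just the standard SVD used in LSA, nothing TML-specific: the term-sentence matrix A is factored as

    A = U Σ Vᵀ

where U holds the term vectors, V the sentence/paragraph vectors, and Σ is a diagonal matrix of singular values. Keeping only the k largest singular values gives the rank-k approximation A_k = U_k Σ_k V_kᵀ, which is the reduced semantic space.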

Once this is done, TML can carry out various operations to extract useful information. These include:
  • key text passages
  • key terms
  • topics
  • semantic

Where the thesis fits in
Everything described in the previous section has already been implemented, so where exactly does my thesis fit in? Well, there is an issue regarding performance that needs to be addressed as this project moves forward.

The LSA models are generated for each revision of the document. However, the model itself is not stored, only the results of the operations carried out on it. This means that if another operation is needed later, the model has to be generated again.

So the question arises: is it better to store the LSA model (much larger than the operation result sets) or simply generate the LSA model whenever it's needed?

Tests will need to be carried out to measure the performance of the system with the different methods to determine which is best. At first glance it looks like a time/space tradeoff; however, it is unknown exactly what gain or loss will be achieved by storing the large models.

Thus the aim of the thesis will be to analyse the problem, review the two approaches, and make a recommendation on which one is better based on the results obtained through testing.

Sunday, August 3, 2008

Project Plan

The initial project plan was written and submitted on Friday, August 1st.

It outlined my current view and understanding of the background to the topic, and also of what the project is about.

The timeline I included was an estimate I made at this stage of the various activities that will be required to finish the treatise project.

As I stated before, my topic is "Managing Machine Learning Models in collaborative web applications", but what does this mean?

Well, in this project the machine learning model in question is Latent Semantic Analysis (LSA). Basically, LSA is built from a term-document matrix that counts the frequency of words (rows) in documents (columns). It is quite useful because it can uncover word relations such as synonymy and polysemy.
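Those relations are typically read off the model with cosine similarity between term vectors. A minimal version, assuming the reduced term vectors are already available (the vectors below are made up for illustration):

```java
// Cosine similarity between two term vectors from the (reduced) LSA space.
// Values near 1 mean the terms occur in similar contexts, which is how
// LSA surfaces relations like synonymy.
public class Cosine {
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot   += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] car  = {0.8, 0.1, 0.3};   // made-up vectors for illustration
        double[] auto = {0.7, 0.2, 0.3};
        System.out.println(cosine(car, auto));   // close to 1.0
    }
}
```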

In brief, the aim of the project is to manage this model, which is being used to research and extract useful information from students' writing.

It will aim to address issues like:

* Storing and retrieving the large term-document matrix
* How to deal with documents as they change: do we need to recalculate the entire LSA model (computationally expensive)?
* What to do with the information-extraction operations on the LSA model: should their results be stored to save time later? If stored, how do we deal with changes to the LSA model?


That's all for now. Hopefully I can refine the scope a bit more in the near future.