edu.mit.nlp.segmenter.dp
Class DPSeg

java.lang.Object
  extended by edu.mit.nlp.segmenter.dp.DPSeg

public class DPSeg
extends Object

This class implements dynamic-programming Bayesian segmentation, for both the DCM and MAP language models.

Now with EM estimation of the priors. Note that we use log-priors everywhere. The reason is that the log of the prior ranges over (-inf, inf), while the prior itself must stay in (0, inf). Since my LBFGS engine doesn't take constraints, it's better to search in log space. This requires only a small modification to the gradient computation: by the chain rule, dL/d(log theta) = theta * dL/d(theta).
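
A minimal sketch of that chain-rule adjustment (a hypothetical helper, not part of DPSeg), assuming a gradient with respect to the raw priors is already in hand:

    // Hypothetical helper: converts a gradient taken with respect to the priors
    // into a gradient with respect to the log-priors.
    // Since theta = exp(log theta), dL/d(log theta) = theta * dL/d(theta).
    static double[] toLogSpaceGradient(double[] logPriors, double[] gradWrtPriors) {
        double[] grad = new double[logPriors.length];
        for (int i = 0; i < logPriors.length; i++) {
            grad[i] = Math.exp(logPriors[i]) * gradWrtPriors[i];
        }
        return grad;
    }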


Nested Class Summary
protected  class DPSeg.PriorOptimizer
          A class for LBFGS optimization of the priors
 
Field Summary
 boolean m_debug
           
 
Constructor Summary
DPSeg(DPDocument[][] docs, int[][] truths)
           
 
Method Summary
 double[] computeGradient(double[] logpriors)
          Computes the gradient of the log-likelihood with respect to the log-priors, across the whole dataset.
protected  double[] computePDur(int T, double edur, double log_dispersion)
           
 double computeTotalLL(double[] logpriors)
          Computes the log-likelihood for the whole dataset.
 double[] getParams()
           
 int[][] getResponses()
          Gets the segmentations.
 void printSegs()
           
 SegResult[] segEM(double[] init_params)
          segEM estimates the parameters using a form of hard EM: it computes the best segmentation given the current parameters, then does a gradient-based search for new parameters, and iterates.
 SegResult[] segment(double[] params)
          Segments each document in the dataset.
protected  SegResult[] segmentKnown(double[] params)
          Segments in the case that the number of segments per document is known; same arguments as segment(double[]).
protected  SegResult[] segmentUnknown(double[] params)
          Segments in the case of an unknown number of segments; same arguments as segment(double[]).
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

m_debug

public boolean m_debug
Constructor Detail

DPSeg

public DPSeg(DPDocument[][] docs,
             int[][] truths)
Parameters:
docs - The documents to segment. It's a 2D array for the multimodal segmentation case, but if you're just doing text then it will be [N][1].
truths - The ground-truth segmentations: [N][], with each row being another array of ints. I'd like to refactor so that this isn't necessary, but at the moment it is.
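
A minimal usage sketch; the DPDocument construction is assumed to happen elsewhere, and numDocs is a hypothetical document count:

    int numDocs = 10;                                 // hypothetical corpus size
    DPDocument[][] docs = new DPDocument[numDocs][1]; // text-only: one modality per document
    int[][] truths = new int[numDocs][];              // one array of segment boundaries per document
    // ... fill docs[i][0] and truths[i] from the corpus ...
    DPSeg segmenter = new DPSeg(docs, truths);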
Method Detail

segEM

public SegResult[] segEM(double[] init_params)
segEM estimates the parameters using a form of hard EM: it computes the best segmentation given the current parameters, then does a gradient-based search for new parameters, and iterates. As an argument it takes the initial settings, in log terms. One idea for speeding this up is to recompute the segmentation for only a subset of files, or to call segEM on a few files and then run the final segmentation on all of them. We could add a class member variable indicating "active" files, and then only apply segment(), computeGradient(), and computeLL() to those files; by default all files would be active.
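
A hedged usage sketch, assuming init_params follows the same layout as the params argument of segment(double[]): one log-prior per modality, followed by the log of the duration dispersion. The concrete values below are arbitrary.

    double[] initParams = { Math.log(0.1), Math.log(1.0) }; // text-only: one log-prior, then log-dispersion
    SegResult[] results = segmenter.segEM(initParams);      // segmenter as built in the constructor sketch
    int[][] segs = segmenter.getResponses();                // the final segmentations
    double[] learnedParams = segmenter.getParams();         // the learned (log) parameters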


printSegs

public void printSegs()

computePDur

protected double[] computePDur(int T,
                               double edur,
                               double log_dispersion)

segment

public SegResult[] segment(double[] params)
Segments each document in the dataset.

Parameters:
params - the (log) parameters. The last entry in the array is the log of the dispersion parameter for the duration distribution; the other entries are the logs of the priors (one per modality).
Returns:
the results for each document. (Kind of a bad design; it ought to just return the segmentation.)
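
An illustration of the layout; the modality count and values here are assumptions, not part of the API:

    // Two modalities plus the duration dispersion:
    // { log-prior for modality 0, log-prior for modality 1, log-dispersion }.
    double[] params = { Math.log(0.2), Math.log(0.05), Math.log(1.0) };
    SegResult[] results = segmenter.segment(params);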

segmentUnknown

protected SegResult[] segmentUnknown(double[] params)
Segments in the case of an unknown number of segments; same arguments as segment(double[]).


segmentKnown

protected SegResult[] segmentKnown(double[] params)
Segments in the case that the number of segments per document is known; same arguments as segment(double[]).


computeTotalLL

public double computeTotalLL(double[] logpriors)
Computes the log-likelihood for the whole dataset. Useful for re-estimating the priors.


computeGradient

public double[] computeGradient(double[] logpriors)
Computes the gradient of the log-likelihood with respect to the log-priors, across the whole dataset.
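
A small sketch, not part of the class, of checking this gradient against computeTotalLL(double[]) by central finite differences; it assumes getParams() returns the current log-parameters and that the gradient has one entry per parameter:

    double[] logpriors = segmenter.getParams();
    double[] analytic = segmenter.computeGradient(logpriors);
    double eps = 1e-5; // arbitrary step size
    for (int i = 0; i < logpriors.length; i++) {
        double[] plus = logpriors.clone();
        double[] minus = logpriors.clone();
        plus[i] += eps;
        minus[i] -= eps;
        double numeric =
            (segmenter.computeTotalLL(plus) - segmenter.computeTotalLL(minus)) / (2 * eps);
        System.out.printf("dim %d: analytic %.6f, numeric %.6f%n", i, analytic[i], numeric);
    }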


getResponses

public int[][] getResponses()
Gets the segmentations.


getParams

public double[] getParams()


Copyright © 2008 MIT. All Rights Reserved.