edu.mit.nlp.segmenter.mcmc
Class CuCoSeg

java.lang.Object
  extended by edu.mit.nlp.segmenter.mcmc.CuCoSeg
All Implemented Interfaces:
InitializableSegmenter, Segmenter

public class CuCoSeg
extends Object
implements InitializableSegmenter

CuCoSeg -- Cue-phrase + Cohesion Segmentation

Loads a second copy of all texts from which stop words are not removed, because stop words are sometimes good cues. Keeps a separate LexMap for each document, and therefore must also keep a separate fastdcm for each document.


Nested Class Summary
protected  class CuCoSeg.PriorOptimizer
          An LBFGS optimizer to search the parameter space
 
Constructor Summary
CuCoSeg()
           
 
Method Summary
 void addCountsForSentence(int doc, int t)
          adds the counts for sentence j of document i (the t and doc parameters), using the segs[] variable. Complexity: K[i] + N[i][j], where K[i] is the number of segments in document i and N[i][j] is the number of words in sentence j
protected  void changeCountsForSentence(int doc, int t, int sign)
           
protected  double computeCueLogProb()
          computeCueLogProb() computes the log-likelihood of the cue phrase counts
 double computeLogProb()
          computes the overall log probability
 double computeLogProb(int doc, int seg)
          computes the portion of the log-probability associated with a change to segment seg in doc. Considers the b-counts, o-counts, and i-counts for seg, seg-1, and seg+1 (where applicable)
 double computeXtraProb()
           
 Empirical getMoveProposal(int doc, int seg)
          generates an empirical distribution over moves of a given segmentation point
 edu.mit.nlp.segmenter.mcmc.CuCoSeg.Unigram[] getSortedUnigrams(LexMap lexmap, int[] b_counts, int[] non_b_counts)
           
 void initialize(String config_filename)
          Do whatever initialization you need from this config file
 void initSegs(String segfilename)
          initSegs -- load initial segmentation guesses from a file.
protected  double minkaApprox(int[] counts)
          sets the prior on the cue phrase language model, using the approximation proposed by Minka in "Estimating a Dirichlet Distribution" (eq 114)
protected  void printStatus(PrintStream out, int i)
          prints a status message.
 List[] segmentTexts(MyTextWrapper[] texts, int[] K)
          massively long method that segments all the texts
 void setDCMPrior(FastDCM dcm, double prior)
          Set the symmetric prior on the DCM language models
 void setDebug(boolean debug)
          tells the segmenter to set its debug flag
 void setPDurs()
          Since durations are discrete, we keep a cache of the probability of each duration length.
 void subCountsForSentence(int doc, int t)
           
 void updateCounts(int lambda_b)
          update the counts given a new lambda parameter
 void updateSegmentation(int doc, int seg, int amount)
          updates the segmentation: moves the segpt in the doc by the amount. Also updates segs[] and all the counts
static boolean validMove(List segpoints, int seg, int amount)
          assesses whether a given move is valid (doesn't cross segment boundaries)
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

CuCoSeg

public CuCoSeg()
Method Detail

initialize

public void initialize(String config_filename)
Description copied from interface: Segmenter
Do whatever initialization you need from this config file

Specified by:
initialize in interface Segmenter
Parameters:
config_filename - the path to the config file

setDebug

public void setDebug(boolean debug)
Description copied from interface: Segmenter
tells the segmenter to set its debug flag

Specified by:
setDebug in interface Segmenter

initSegs

public void initSegs(String segfilename)
initSegs -- load initial segmentation guesses from a file. This is handy when only the MCMC part is being varied, because it avoids the time-consuming initialization from the DP segmenter.

Specified by:
initSegs in interface InitializableSegmenter

segmentTexts

public List[] segmentTexts(MyTextWrapper[] texts,
                           int[] K)
massively long method that segments all the texts

Specified by:
segmentTexts in interface Segmenter
Parameters:
texts - all the texts in the dataset
K - number of segments per document
Returns:
an array of lists of segmentation points

printStatus

protected void printStatus(PrintStream out,
                           int i)
prints a status message. the format is
       iteration LL [A1 A2 A3] [theta0 phi_b0 dispersion] Pk WD

         A1 = num moves accepted since last message
         A2 = proportion of moves accepted since last message
         A3 = f(.5), where f() is the annealing function
         theta0 = symmetric dirichlet prior on language models
         phi_b0 = symmetric dirichlet prior on cue phrases
         dispersion = dispersion parameter on segment durations (not used) 
         Pk = segmentation error metric (lower is better)
         WD = WindowDiff, a second segmentation error metric (lower is better)
         

Parameters:
out - the printstream to write the message to
i - the iteration number
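
For readers unfamiliar with the Pk column, the following is a minimal sketch of how the metric is commonly computed. It is not taken from the CuCoSeg source: the class name PkSketch, the per-sentence segment-id representation, and the explicit window size k are assumptions made for this illustration.

```java
// Hypothetical sketch of the Pk metric; not the CuCoSeg implementation.
class PkSketch {
    /**
     * refSegIds[t] and hypSegIds[t] give the segment index of sentence t in
     * the reference and hypothesized segmentations. k is the window size
     * (commonly half the mean reference segment length).
     */
    static double pk(int[] refSegIds, int[] hypSegIds, int k) {
        int n = refSegIds.length;
        if (hypSegIds.length != n || k < 1 || k >= n) {
            throw new IllegalArgumentException("need equal lengths and 1 <= k < n");
        }
        int windows = n - k;
        int errors = 0;
        for (int i = 0; i < windows; i++) {
            // Do the two ends of the window fall in the same segment?
            boolean refSame = refSegIds[i] == refSegIds[i + k];
            boolean hypSame = hypSegIds[i] == hypSegIds[i + k];
            if (refSame != hypSame) {
                errors++; // the segmentations disagree on this window
            }
        }
        return (double) errors / windows;
    }
}
```

A value of 0 means the two segmentations agree on every window; larger values indicate more disagreement.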

getSortedUnigrams

public edu.mit.nlp.segmenter.mcmc.CuCoSeg.Unigram[] getSortedUnigrams(LexMap lexmap,
                                                                      int[] b_counts,
                                                                      int[] non_b_counts)

setPDurs

public void setPDurs()
Since durations are discrete, we keep a cache of the probability of each duration length. This fills the cache, given our parameters.
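
The caching idea can be sketched as follows. This is an illustration only: this page does not say which duration distribution CuCoSeg uses, so a Poisson with a hypothetical meanDuration parameter stands in, and the names DurationCacheSketch and buildLogProbCache are invented for the example.

```java
// Hypothetical sketch of a duration-probability cache; not the CuCoSeg code.
class DurationCacheSketch {
    /**
     * Returns logp where logp[d] is the log-probability of a segment lasting
     * d sentences, for d = 0..maxDuration. ASSUMPTION: durations follow a
     * Poisson distribution with mean meanDuration.
     */
    static double[] buildLogProbCache(int maxDuration, double meanDuration) {
        if (maxDuration < 0 || meanDuration <= 0.0) {
            throw new IllegalArgumentException("need maxDuration >= 0 and meanDuration > 0");
        }
        double[] logp = new double[maxDuration + 1];
        double logFactorial = 0.0; // running value of log(d!)
        for (int d = 0; d <= maxDuration; d++) {
            if (d > 0) {
                logFactorial += Math.log(d);
            }
            // Poisson log-pmf: d*log(mean) - mean - log(d!)
            logp[d] = d * Math.log(meanDuration) - meanDuration - logFactorial;
        }
        return logp;
    }
}
```

Because the table depends only on the distribution parameters, it has to be rebuilt whenever those parameters change, and lookups during sampling are then a single array access.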


setDCMPrior

public void setDCMPrior(FastDCM dcm,
                        double prior)
Set the symmetric prior on the DCM language models

Parameters:
dcm - the DCM cache
prior - the new prior
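
For reference, the standard Dirichlet compound multinomial (DCM) log-likelihood under a symmetric prior is shown below. This is the textbook form, given here as an aid to reading; the symbols are assumptions for this note and do not come from the CuCoSeg source: \theta_0 is the symmetric prior, W the vocabulary size, n_w the count of word w in a segment, and N = \sum_w n_w.

```latex
\log p(\mathbf{n} \mid \theta_0)
  = \log\Gamma(W\theta_0) - \log\Gamma(N + W\theta_0)
  + \sum_{w=1}^{W} \bigl[ \log\Gamma(n_w + \theta_0) - \log\Gamma(\theta_0) \bigr]
```

Changing the prior changes every term, which is consistent with the prior being set on the DCM cache as a whole.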

minkaApprox

protected double minkaApprox(int[] counts)
sets the prior on the cue phrase language model, using the approximation proposed by Minka in "Estimating a Dirichlet Distribution" (eq 114)


computeXtraProb

public double computeXtraProb()

computeLogProb

public double computeLogProb()
computes the overall log probability


computeLogProb

public double computeLogProb(int doc,
                             int seg)
computes the portion of the log-probability associated with a change to segment seg in doc. Considers the b-counts, o-counts, and i-counts for seg, seg-1, and seg+1 (where applicable)


computeCueLogProb

protected double computeCueLogProb()
computeCueLogProb() computes the log-likelihood of the cue phrase counts


validMove

public static boolean validMove(List segpoints,
                                int seg,
                                int amount)
assesses whether a given move is valid (doesn't cross segment boundaries)
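
One plausible reading of this check is sketched below. It is not the CuCoSeg source. The assumptions are that segpoints holds boundary positions as Integer values in ascending order, that the first boundary must stay above position 0, and that the last boundary has no upper limit because the document length is not part of the signature. The class name ValidMoveSketch is invented for the example.

```java
import java.util.List;

// Hypothetical sketch of a boundary-move validity check; not the CuCoSeg code.
class ValidMoveSketch {
    /**
     * Returns true when moving boundary number seg by amount leaves it
     * strictly between its neighbouring boundaries.
     */
    static boolean validMove(List<Integer> segpoints, int seg, int amount) {
        if (seg < 0 || seg >= segpoints.size()) {
            return false; // no such boundary
        }
        int moved = segpoints.get(seg) + amount;
        // Lower limit: the previous boundary, or 0 for the first boundary (assumption).
        int lower = (seg == 0) ? 0 : segpoints.get(seg - 1);
        if (moved <= lower) {
            return false;
        }
        // Upper limit: the next boundary, when there is one.
        if (seg + 1 < segpoints.size() && moved >= segpoints.get(seg + 1)) {
            return false;
        }
        return true;
    }
}
```

With boundaries at 5, 10 and 20, moving the middle boundary by +3 is accepted, while moving it by +10 or by -5 is rejected because it would land on a neighbouring boundary.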


getMoveProposal

public Empirical getMoveProposal(int doc,
                                 int seg)
generates an empirical distribution over moves of a given segmentation point


updateSegmentation

public void updateSegmentation(int doc,
                               int seg,
                               int amount)
updates the segmentation: moves the segpt in the doc by the amount. Also updates segs[] and all the counts
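
The bookkeeping this implies can be sketched for a single document as follows. This is an illustration under stated assumptions, not the CuCoSeg code: segs[k] is taken to be the index of the first sentence of segment k+1, counts are plain per-segment word counts, the move is assumed to have been validated already, and the class name BoundaryMoveSketch is invented.

```java
// Hypothetical single-document sketch of moving a boundary; not the CuCoSeg code.
class BoundaryMoveSketch {
    final int[] segs;        // segs[k] = index of the first sentence of segment k+1
    final int[][] sentences; // sentences[t] = word ids of sentence t
    final int[][] counts;    // counts[k][w] = occurrences of word w in segment k

    BoundaryMoveSketch(int[] segs, int[][] sentences, int vocabSize) {
        this.segs = segs.clone();
        this.sentences = sentences;
        this.counts = new int[segs.length + 1][vocabSize];
        int k = 0;
        for (int t = 0; t < sentences.length; t++) {
            while (k < this.segs.length && t >= this.segs[k]) {
                k++; // sentence t lies past boundary k
            }
            for (int w : sentences[t]) {
                counts[k][w]++;
            }
        }
    }

    /** Moves boundary seg by amount (assumed valid) and updates the counts. */
    void updateSegmentation(int seg, int amount) {
        int oldPt = segs[seg];
        int newPt = oldPt + amount;
        // Moving right hands sentences from segment seg+1 to seg; moving left does the reverse.
        int from = amount > 0 ? seg + 1 : seg;
        int to = amount > 0 ? seg : seg + 1;
        for (int t = Math.min(oldPt, newPt); t < Math.max(oldPt, newPt); t++) {
            for (int w : sentences[t]) {
                counts[from][w]--;
                counts[to][w]++;
            }
        }
        segs[seg] = newPt;
    }
}
```

Only the sentences between the old and new boundary positions are touched, so the cost of a move is proportional to the number of words that change segment.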


addCountsForSentence

public void addCountsForSentence(int doc,
                                 int t)
adds the counts for sentence j of document i (the t and doc parameters), using the segs[] variable. Complexity: K[i] + N[i][j], where K[i] is the number of segments in document i and N[i][j] is the number of words in sentence j
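
The stated K + N cost can be illustrated with a single-document sketch: a linear scan over the segment boundaries to find the sentence's segment, followed by one pass over its words. This is not the CuCoSeg code. The segEnds layout (exclusive end index of each segment) and the class name SentenceCountsSketch are assumptions, and the add/sub pair sharing one signed helper only mirrors the method names listed on this page.

```java
// Hypothetical single-document sketch of per-sentence count updates; not the CuCoSeg code.
class SentenceCountsSketch {
    final int[] segEnds;     // segEnds[k] = exclusive end sentence index of segment k, ascending
    final int[][] sentences; // sentences[t] = word ids of sentence t
    final int[][] counts;    // counts[k][w] = occurrences of word w in segment k

    SentenceCountsSketch(int[] segEnds, int[][] sentences, int vocabSize) {
        this.segEnds = segEnds;
        this.sentences = sentences;
        this.counts = new int[segEnds.length][vocabSize];
    }

    /** Linear scan over the K segments: the K term of the cost. */
    int segmentOf(int t) {
        for (int k = 0; k < segEnds.length; k++) {
            if (t < segEnds[k]) {
                return k;
            }
        }
        throw new IllegalArgumentException("sentence " + t + " is past the last segment");
    }

    /** One pass over the words of sentence t: the N term of the cost. */
    void changeCountsForSentence(int t, int sign) {
        int k = segmentOf(t);
        for (int w : sentences[t]) {
            counts[k][w] += sign;
        }
    }

    void addCountsForSentence(int t) {
        changeCountsForSentence(t, +1);
    }

    void subCountsForSentence(int t) {
        changeCountsForSentence(t, -1);
    }
}
```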


subCountsForSentence

public void subCountsForSentence(int doc,
                                 int t)

changeCountsForSentence

protected void changeCountsForSentence(int doc,
                                       int t,
                                       int sign)

updateCounts

public void updateCounts(int lambda_b)
update the counts given a new lambda parameter



Copyright © 2008 MIT. All Rights Reserved.