edu.mit.nlp.segmenter
Class SegTester

java.lang.Object
  extended by edu.mit.nlp.segmenter.SegTester

public class SegTester
extends Object

Provides a unified framework for evaluating and running segmenters.

Evaluation

To evaluate a segmenter on a dataset, run:
 SegTester -config config -dir dir -suff suff [-init init] [-debug]
      config   the configuration file for the experiment (see the config directory)
      dir      the directory where the data files are located
      suff     the suffix of the data files
      init     for initializable segmenters (e.g. CuCoSeg), 
                      this specifies the name of a file with the initial segmentations.
      debug    print debugging info                            

  Outputs: the configuration, the names of the files being read, any output from the segmenter itself,
  and the pk/wd scores for each file.
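
The pk/wd figures are presumably the Pk and WindowDiff segmentation error metrics. As a rough illustration of what Pk measures, here is a minimal sketch under that assumption. The names PkSketch, labelsFromLengths, pk and defaultK are hypothetical and are not part of SegTester, which uses Malioutov's evaluation code for the real scores.

```java
import java.util.Arrays;

/** Illustrative Pk computation; NOT the evaluation code used by SegTester. */
final class PkSketch {

    /** Expands segment lengths (e.g. {3, 3}) into one segment id per unit (0,0,0,1,1,1). */
    static int[] labelsFromLengths(int[] segmentLengths) {
        int total = Arrays.stream(segmentLengths).sum();
        int[] labels = new int[total];
        int pos = 0;
        for (int seg = 0; seg < segmentLengths.length; seg++) {
            for (int j = 0; j < segmentLengths[seg]; j++) {
                labels[pos++] = seg;
            }
        }
        return labels;
    }

    /**
     * Fraction of probe pairs (i, i + k) on which the reference and the hypothesis
     * disagree about whether the two units belong to the same segment.
     */
    static double pk(int[] ref, int[] hyp, int k) {
        if (ref.length != hyp.length) {
            throw new IllegalArgumentException("segmentations must cover the same units");
        }
        if (k < 1 || k >= ref.length) {
            throw new IllegalArgumentException("k must be in [1, length)");
        }
        int probes = ref.length - k;
        int errors = 0;
        for (int i = 0; i < probes; i++) {
            boolean sameInRef = ref[i] == ref[i + k];
            boolean sameInHyp = hyp[i] == hyp[i + k];
            if (sameInRef != sameInHyp) {
                errors++;
            }
        }
        return (double) errors / probes;
    }

    /** A common choice of k: half the mean reference segment length, at least 1. */
    static int defaultK(int[] ref) {
        int numSegments = ref[ref.length - 1] + 1;
        return Math.max(1, (int) Math.round(ref.length / (2.0 * numSegments)));
    }

    public static void main(String[] args) {
        int[] ref = labelsFromLengths(new int[] {3, 3});
        int[] hyp = labelsFromLengths(new int[] {2, 4});
        System.out.println("Pk = " + pk(ref, hyp, defaultK(ref)));
    }
}
```

Lower values are better: a hypothesis identical to the reference scores 0.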
  

Running

To run a segmenter on some text, pipe the text to standard input:

  cat file | SegTester -config config [-debug debug] [-num-segs num-segs]
      config   the configuration file for the experiment (see the config directory)
      debug    print debugging info
      num-segs the number of segments desired. If not provided, it is read from the file itself, unless
                      the configuration specifies that the number of segments is unknown
  
Outputs: the configuration, the line numbers of the segment endpoints

Proposed future functionality

Author:
Jacob Eisenstein

Field Summary
protected static String para_ending
           
protected  MyTextWrapper[] texts
           
 
Constructor Summary
SegTester(ml.options.OptionSet optset)
           
 
Method Summary
 void eval(Segmenter segmenter)
          Evaluate a segmenter.
static ParaData getParaData(String filename)
          Gets "paralinguistic" data, e.g. pause durations and prosodic markers.
protected  void loadFiles(ml.options.OptionSet optset)
           
 MyTextWrapper loadText(String fileName)
           
static void main(String[] args)
           
static void preprocessText(MyTextWrapper text, boolean use_choi, boolean is_windowing_enabled, boolean remove_stops, boolean use_stems, int window_size)
          Preprocesses the text: stemming, removing stop words, handling segment boundaries, and breaking the text into K-word blocks.
protected static List stemStopWords(List stopWords)
          When stemming is enabled, the stop words must also be stemmed; otherwise they won't match the stemmed text. This method does that.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

texts

protected MyTextWrapper[] texts

para_ending

protected static String para_ending
Constructor Detail

SegTester

public SegTester(ml.options.OptionSet optset)
          throws Exception
Throws:
Exception
Method Detail

main

public static void main(String[] args)

loadFiles

protected void loadFiles(ml.options.OptionSet optset)

loadText

public MyTextWrapper loadText(String fileName)

getParaData

public static ParaData getParaData(String filename)
Gets "paralinguistic" data, e.g. pause durations and prosodic markers. Not used in this implementation.


preprocessText

public static void preprocessText(MyTextWrapper text,
                                  boolean use_choi,
                                  boolean is_windowing_enabled,
                                  boolean remove_stops,
                                  boolean use_stems,
                                  int window_size)
Preprocesses the text: stemming, removing stop words, handling segment boundaries, and breaking the text into K-word blocks. Based on Malioutov's MinCutSeg.jar library.

Parameters:
text - the text file to preprocess
use_choi - use Choi-style segment boundaries
is_windowing_enabled - whether to break the text into fixed-length chunks (as opposed to using sentence breaks)
remove_stops - whether to remove stopwords
use_stems - whether to use stemming
window_size - the size of the fixed-length chunks
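
To make the windowing option concrete, here is a minimal sketch of breaking a token sequence into fixed-length K-word blocks. It only illustrates the idea behind is_windowing_enabled and window_size. The class WindowingSketch and its toBlocks method are hypothetical, and the actual logic in preprocessText (based on MinCutSeg.jar) may differ, for example in how it treats the final partial block.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative fixed-length chunking; NOT the code used by preprocessText. */
final class WindowingSketch {

    /**
     * Splits tokens into consecutive blocks of windowSize words.
     * The last block may be shorter (an assumption of this sketch).
     */
    static List<List<String>> toBlocks(List<String> tokens, int windowSize) {
        if (windowSize < 1) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        List<List<String>> blocks = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start += windowSize) {
            int end = Math.min(start + windowSize, tokens.size());
            // Copy the view so each block is independent of the input list.
            blocks.add(new ArrayList<>(tokens.subList(start, end)));
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(toBlocks(tokens, 2));
    }
}
```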

stemStopWords

protected static List stemStopWords(List stopWords)
When stemming is enabled, the stop words must also be stemmed; otherwise they won't match the stemmed text. This method does that.
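
The following sketch shows why the stop list has to be stemmed along with the text. It uses a deliberately crude toy stemmer (strip a trailing "s") as a stand-in, since the stemmer used by this class is not documented here. All names (StemStopWordsSketch, toyStem, stemAll, removeStops) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrates stop-word matching after stemming; NOT the code used by stemStopWords. */
final class StemStopWordsSketch {

    /** Toy stand-in for a real stemmer: drops a trailing "s" from words longer than 3 letters. */
    static String toyStem(String word) {
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    /** Applies the toy stemmer to every word, preserving order. */
    static List<String> stemAll(List<String> words) {
        List<String> stemmed = new ArrayList<>(words.size());
        for (String word : words) {
            stemmed.add(toyStem(word));
        }
        return stemmed;
    }

    /** Returns the tokens that are not in the stop set. */
    static List<String> removeStops(List<String> tokens, Set<String> stops) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stops.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> stopWords = Arrays.asList("this");
        List<String> stemmedText = stemAll(Arrays.asList("this", "segments", "text"));

        // The raw stop list misses "thi", the stemmed form of "this".
        System.out.println(removeStops(stemmedText, new HashSet<>(stopWords)));
        // The stemmed stop list matches it.
        System.out.println(removeStops(stemmedText, new HashSet<>(stemAll(stopWords))));
    }
}
```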


eval

public void eval(Segmenter segmenter)
Evaluate a segmenter. Returns nothing; the results are printed. Uses Malioutov's evaluation code.

Parameters:
segmenter - the segmenter that we're evaluating


Copyright © 2008 MIT. All Rights Reserved.