Re: SFC ideas
- To: Jonathan Wolfe <jwolfe@MIT.EDU>
- Subject: Re: SFC ideas
- From: Seth Teller <seth@graphics.lcs.mit.edu>
- Date: Thu, 17 Apr 2003 17:24:02 -0400
- CC: sfc <sfc@graphics.lcs.mit.edu>
- References: <002c01c304f9$03bd95f0$9700cb12@mit.edu>
- User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.0.0) Gecko/20020530
SFC/e-clerk'ers, welcome jonathan wolfe. he's going to join
the effort as a UROP for the rest of this term, and assuming
all goes well will stay on as an MEng.
jon sent a bunch of thoughtful comments to me, so i'm
forwarding them (along with my responses below) to the
group. please follow up by email to the group with your
own comments too of course.
[jon, present SFC members are:
Trevor Darrell <trevor@ai.mit.edu> vision
David Karger <karger@lcs.mit.edu> IR, haystack
Joe Polifroni <joe@sls.lcs.mit.edu> speech
Jim Glass <glass@mit.edu> speech
Seth Teller <teller@lcs.mit.edu>
Konrad Tollmar <konrad@ai.mit.edu> vision
Jonathan Wolfe <jwolfe@mit.edu>
]
continue reading for comments.
seth
Jonathan Wolfe wrote:
> Prof. Teller,
>
> Here are my preliminary thoughts on the SFC project.. they are a little
> haphazard as they were triggered by a lot of the different material on
> the site, but I tried to organize them into hardware-related,
> software-related, and some Next Steps:
>
>
> Hardware-related
>
> * A camera in combination with a scanner seems like the best option as
> it provides a quick backup visual from the camera and the scanner
> provides much higher resolution images for OCR and re-printing
> * What's important in the area of visual recognition in the first
> iteration? It seems that mono video would be fine for detecting the
> presence of an item to be dealt with, which, in combination with
> appropriate audio from the user, would supply enough information to the
> system for it to respond.. although I haven't seen what a stereo video
> system can do.
i think we want color (but low-res, probably NTSC is fine) video to
assure that we can match the video stream with the higher-res scanned
documents. stereo is probably not needed at this point, since we
envision a fixed "staging area" where the user can place each
document in turn. but in future perhaps stereo would help us
deal with documents offered in a more free-form way.
> * The directional mic seems like the way to go; it's the least obtrusive to
> the user as the user would most likely be sitting at a desk or in the
> same general vicinity (an office). How well do they work? I believe
> SLS uses them, they would know.
> * The CPU should depend on the computationally-intensive software
> running on the machine (speech recognition and potentially the backend
> like Haystack.. according to the hardware requirements listed at
> http://haystack.lcs.mit.edu/downloads.html) and how beneficial the dual
> processors would ACTUALLY be.. my guess is not very.
i think we should go with 2 or even 4 processors, but divide their
workloads into batch/background and interactive/foreground. so a
few could be chugging away, servicing the scanner, video frames, and
logging data streams to disk, whereas another one might run the
touch-screen and display and other interface. i think it's important
to maintain the "fluidity" of the interface as much as possible, no
matter what the (batch) workload of the clerk.
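a minimal sketch of that split, using threads and a job queue -- every function name below is a placeholder, not existing SFC code. batch workers drain scanner/video/logging jobs in the background while the interactive side only ever enqueues and returns immediately:

```python
# sketch of splitting batch work (scanning, video frames, logging)
# from the interactive path -- all names here are placeholders
import queue
import threading

def handle_job(kind, payload):
    # stand-in for the real scanner / video-capture / disk-logging code
    return (kind, payload)

def batch_worker(jobs, results):
    # chug through background jobs so the UI thread never blocks
    while True:
        job = jobs.get()
        if job is None:            # sentinel: shut down this worker
            break
        results.put(handle_job(*job))

def run_batch_pool(job_list, n_workers=3):
    jobs, results = queue.Queue(), queue.Queue()
    workers = [threading.Thread(target=batch_worker, args=(jobs, results))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    for job in job_list:           # the interactive side just enqueues
        jobs.put(job)
    for _ in workers:              # one sentinel per worker
        jobs.put(None)
    done = [results.get() for _ in job_list]
    for w in workers:
        w.join()
    return done
```

on the real box the workers would map onto the spare processors (or onto separate processes rather than threads, for CPU-bound work), but the queueing pattern is the same.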
> * The rest of the hardware specs scoped out (by Fred?) look good and
> have drivers for Linux.
> * How big of a queue does the sheet feeder have for documents waiting to
> be scanned? According to Epson's web site for the scanner on the SFC
> hardware page, 30. That seems like plenty.
>
>
this is probably fine, but we won't know for sure until we see
how the dataflow works, i.e., how quickly users offer and
describe new documents to the system, versus how quickly the
system ingests them.
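to get a rough feel for this before we have real numbers, here is a toy simulation of feeder depth; the offer and scan rates used with it are made up, not measurements:

```python
def max_feeder_depth(offer_interval_s, scan_time_s, n_docs):
    # toy model: the user offers a document every offer_interval_s
    # seconds; the scanner takes scan_time_s each, one at a time
    finish_times, finish = [], 0.0
    for i in range(n_docs):
        arrive = i * offer_interval_s
        finish = max(arrive, finish) + scan_time_s
        finish_times.append(finish)
    # depth at each arrival = offered so far minus already scanned
    # (counting the page currently in the scanner as still queued)
    depths = [(i + 1) - sum(1 for f in finish_times if f <= i * offer_interval_s)
              for i in range(n_docs)]
    return max(depths)
```

for example, offering a document every 10 s against a 30 s scan only reaches depth 7 over a 10-document session, but at that 3:1 rate mismatch the queue grows without bound and would overflow a 30-slot feeder after roughly 45 documents -- so the interesting number is really the sustained offer rate.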
> Software-related
>
> text/document binding:
> * How real-time is the speech recognition? If it showed up on the
> screen as it's spoken next to a low-res picture of the current document,
> how long to wait before moving on to the next text/document pair?
the speech demos i've seen have a latency of (say) 1-3 seconds,
but that was some time ago. jim, joe, comments?
> * Are they always pairs? Maybe there's more than one utterance or
> document per "item". We can try to parse through the text to see if the
> user makes reference to more than one document.
tough problem. we'll have to be as flexible as possible in somehow
segmenting utterances and binding them appropriately to documents.
we can always bring up a dialog with thumbnails, and a question,
"were you referring to this or that when you said ... X".
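one simple-minded way to start on that binding, sketched here with made-up timestamped records (the data format is an assumption): greedy nearest-in-time matching, with close calls punted to exactly that "this or that?" dialog.

```python
def bind_utterances(utterances, documents, margin_s=5.0):
    # utterances: [(time_s, text)], documents: [(time_s, doc_id)] --
    # timestamps would come from the staging area; format is assumed
    bindings, ambiguous = [], []
    for u_time, text in utterances:
        ranked = sorted(documents, key=lambda d: abs(d[0] - u_time))
        best = ranked[0]
        if (len(ranked) > 1
                and abs(ranked[1][0] - u_time) - abs(best[0] - u_time) < margin_s):
            # too close to call: queue up the confirmation dialog
            ambiguous.append((text, best, ranked[1]))
        else:
            bindings.append((text, best))
    return bindings, ambiguous
```

utterances mentioning more than one document would need the segmenter to split them first; this only handles the one-utterance-one-document case plus the ambiguity fallback.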
>
> speech recognition:
> * HP used a domain-specific solution that still required training.. what
> to use for this? (HP paper at
> http://www.hpl.hp.com/techreports/2001/HPL-2001-145.html)
> * HP paper mentions the GALAXY project - the '98 GALAXY paper
> (http://www.sls.lcs.mit.edu/sls/publications/1998/icslp98-galaxy.pdf)
> describes a system that could be used for a lot of the control flow
> associated with voice communications with SFC, although the speech
> recognition component appears to also be domain-specific
> * Section 3 of the '99 GALAXY-II paper
> (http://www.sls.lcs.mit.edu/sls/publications/1999/eurospeech99-seneff.pdf)
> talks about multimodal interactions and how GALAXY handles them (in
> this instance, binding mouse clicks to a user utterance). This is a
> great starting point for binding user speech to other actions for the
> SFC.
> * '00 paper on GALAXY-II
> (http://www.sls.lcs.mit.edu/sls/publications/2000/lrec-2000.pdf) focuses
> on using the system to evaluate different components, less on the system
> itself
>
> OCR and automatic "agent"-like actions:
> * JOCR/GOCR at jocr.sourceforge.net
> * Clara OCR at www.claraocr.org (This one looks more promising, they
> claim to be near production quality.)
> * Links can be read out of the document post-OCR; that doesn't seem like
> a problem at all.. links could be inferred from a document title search
> on Google and a quick content match, potentially easier to look for a
> company name or organization name at the top of the document and search
> for a link to that.
> * In order to take actions online from a document, a software "agent"
> could search through for keywords like "go to http://whatever" or
> "click" or something that resembles a list of things to do.
yes. for the prototype, though, we should focus on pulling out
plaintext, and segmenting out lineart and images, with some decent
success (say above 90% -- is that a reasonable starting target?).
link extraction and other fancy actions can come later.
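when we do get to link extraction, a first cut could be as simple as a regex over the OCR'd plaintext. the pattern below is a rough guess, and real OCR output would need fuzzier matching (l/1 and O/0 confusions, lines broken mid-URL):

```python
import re

# rough URL pattern for clean plaintext; OCR noise needs fuzzier matching
URL_RE = re.compile(r'https?://[^\s<>")]+|www\.[^\s<>")]+')

def extract_links(ocr_text):
    # pull candidate links out of OCR'd plaintext; each candidate
    # should still be verified (e.g. a quick fetch) before acting on it
    return URL_RE.findall(ocr_text)
```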
>
> Haystack (http://haystack.lcs.mit.edu):
> * Partial queries that get progressively more restrictive.. ease and
> efficiency in Haystack?
> * The screenshots look awesome. It's like a whole new FS where objects
> and relations are the most important.
> * The requirements call for a fairly powerful machine.. with everything
> else running for this system, will Haystack cause performance issues?
>
again, if we spec things out and partition the CPU loads
appropriately, we should be OK.
> comments on design issues:
> * Text searches in combination with specific "agents" (i.e. one for
> following online instructions for you) seems like the most general
> approach.
> * With the GALAXY system, it seems very possible to have speech queries
> run primarily via speech recognition, with text conversion as a backup
> or alternative approach.
yes, that's what we have in mind.
> * The repository should probably be based on some kind of server model,
> even if the local machine is actually storing the repository.. that way,
> the user could easily change the repository location for a certain
> subset of items to a removable storage medium (i.e. to take on a
> business trip) for use with a viewer (Haystack or just a system that can
> do queries on the database instead of a full SFC)
>
yes, good idea.
> Next Steps (in my opinion):
>
> * Hash out a lot more of the design issues with the people who are
> experts on the component systems (i.e. GALAXY, Haystack).
yep -- you should meet with each of us in turn. or we
could all get together, it's probably time to do that
anyway.
> * Create a concrete list of design principles to follow for the first
> iteration to check the design against before construction (i.e. ease for
> the user measured by speed/distance from SFC/etc, robustness for
> handling imperfect user queries through speech or touch input, etc)
yes.
> * I have experience using Debian and have 2 machines (1 desktop and 1
> laptop) both running Debian with kernel 2.4.20, so I can test software
> or drivers if necessary.
>
great.
>
> Hope that all made sense! =)
>
> -Jon
>
indeed it did. looking forward to getting going!
seth.