Most of the crises that occur in the lab, such as students wedging their
Scheme processes or paper jams in the printer, are of a rather routine
nature. You will get the hang of these rather quickly and will usually
have no problems dealing with them. If something extraordinary happens,
however--a server dies, for instance, or it looks like someone is about
to immanentize the Eschaton(12)--and you don't know what to do or need help in
dealing with the situation (when the server dies, for example, you will
probably appreciate having a few warm bodies in the lab who know the hacker
password and can recover people's work), use all the means known
to you to summon help. Send out e-mail, if you can, since quite a few
people on the staff are addicted enough that they will actually read it
within minutes after you send it. Don't hesitate to call the head TA,
head LA or lecturers at their offices or at home, if needed (you can
probably find out their phone numbers by fingering them). Get the
instrument desk people to help you out.
Whenever you encounter a hardware problem (such as a machine not
operating correctly) that you cannot fix yourself, do all of the
following: report the problem to the instrument desk, send mail to
6001-feedback, and put a blue tag on the machine, filling in the
problem, your name, and the date. The tags can usually be found either
on top of the filing cabinet or in its top drawer.
If a snake is showing no signs of life (specifically, the heartbeat
light, which is the first one on the left, is not blinking), the problem
is usually a tripped circuit breaker. Unfortunately, this happens more
often than it should, because the cheap circuit breakers in the lab are
flaky and unreliable. There are two circuit breaker boxes on the 001
side of the lab: one to the right of the Lambda Lounge, by the help Q
board (#1), and another to the left of the lounge, close to the blackboards
at the back of the room (#2). To recover a machine in case its circuit
breaker trips:
- Get under the table on which the machine is standing. Find one of the
  cables that lead from the machine to the power outlet, and read off the
  number written on the outlet corresponding to the cable (be sure to get
  the correct cable--you don't want to get the wrong number and make a
  student on a neighboring machine lose by playing with the wrong circuit
  breaker).
- Proceed to the appropriate circuit breaker box, according to the number
  you read off. For example, 2-23 means it's box #2. Open it and look
  for the breaker labeled 23. If the number is correct, the corresponding
  breaker should look different from the other ones: it will not be lined
  up with the others but will usually be in the dead center
  position.(13) To recover it, first push it all
  the way OFF (i.e., opposite the direction the other circuit
  breakers point in), and only then bring it back to line up with the
  rest. The machine should now make some noise and show signs of life.
  It will reboot and come around in a minute.
If you notice that a particular machine experiences the circuit breaker
problem on a regular basis, report the problem in the manner described
above.
If you cannot get a snake to respond (i.e., the screen is locked up so
nothing you type is being acknowledged and the mouse does not respond),
you need to reboot it. For a soft reboot, press
C-SHIFT-RESET; it should only take a few seconds for
the machine to come around. To powercycle the machine, press the white
power button, located at the bottom of the machine, below the disk
drive; a hard reboot takes a little longer, maybe a couple of minutes.
If the machine does not reboot properly, leave it alone and report the
problem (make sure to put a tag on it).
If one machine seems to be using the wrong version of a problem set or
any other file, such as `motd', chances are all the other machines
on that side of the room--i.e., connected to the same server--are
experiencing the same problem. Check this by looking at the file in
question on both sicp-00 and sicp-01 and comparing the two versions. If
the versions do differ, and only if you are absolutely sure
which of them is correct, copy the correct version to the
losing server (see the section on how to make software changes to both
servers in the previous chapter). If you are not sure what the correct
version is, don't try to guess; simply report the problem to the head
TA/LA. In any case, the head TA should be informed if you encounter two
different versions of a file.
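The comparison step can be sketched as a small shell function. This is only an illustration: the name same_version and its directory arguments are hypothetical, and it assumes that both servers' copies of the file are reachable as ordinary paths from wherever you run it, which this manual does not specify.

```shell
# Hypothetical helper: report whether FILE is byte-for-byte identical
# under two directory trees (for example, one tree per server).
# The tree paths are assumptions; substitute whatever applies locally.
same_version() {
    file=$1; tree_a=$2; tree_b=$3
    if cmp -s "$tree_a/$file" "$tree_b/$file"; then
        echo "same: $file"
    else
        # cmp also fails when either copy is missing or unreadable
        echo "DIFFERENT: $file"
        return 1
    fi
}
```

It only detects a mismatch. Deciding which version is the correct one is still up to you (or the head TA).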
If the problem seems to be something else and you cannot figure out how
to fix it, ask the student to switch to another machine and put a sign
on the machine asking people not to use it. In addition, report the
problem by following the procedure described above.
Speaking of immanentizing the Eschaton... Murphy's law says that the
first server crash must occur the night before a problem set is due, in
which case you will notice it immediately, for it will be accompanied by
the sound of 46 students screaming, sighing, taking a deep breath, and
doing whatever else people do in times of crisis. Don't panic. Try to
calm the students down, too. A server crash is a serious problem, but
their work can usually still be recovered. Call for help right away
(you will need it even if you can fix the problem itself, to recover
people's work as fast as possible), get into the server closet (it is
located in the front of the lab, behind the printers), and start
investigating.
One of the ways a server can crash is the same way any machine does, by
means of a circuit breaker tripping. The consequence of this, as you
might imagine, is that all the machines connected to that server panic
and crash as well. To fix the problem, do what you would normally do
when a circuit breaker trips (consult the section on circuit
breakers above). The circuit breaker box for the servers is located
inside the server closet.
If the server crashes for an unknown reason, and is completely
catatonic, the only thing you should try is powercycling it (see the
section on rebooting above): turn it off, wait a few seconds, turn it
back on. If this doesn't help, do not spend time trying to
figure out what the problem is; get help immediately. Do not even do
that much (i.e., don't powercycle the server) if it seems to be running,
since that is usually an indication of a different problem.
Powercycling a system with an active filesystem may cause data lossage,
and we only want that as a last resort.
It will take quite a while for the machines to come back to life. When
they do, start recovering students' work (see the software section below
to find out how). Be extremely careful. If you can find a student who
did M-x checkpoint-floppy recently and thus doesn't care if
his/her files get recovered, make his/her machine your first guinea pig.
If and when reinforcements arrive, the total recovery shouldn't take
long.
If you think there is a paper jam in one of the printers, lift the cover
of the printer by pressing the button on the left of the cover, and look
inside. If no paper is visible, try pulling out the paper trays and see
if any paper is stuck on the bottom. In addition, there is a side
access door; open it by pressing the left side of the printer in the
right magic place--the access door should pop open. In any case, you
will usually be able to pull the paper out somehow. If no
paper is visible anywhere inside the printer, a paper jam is probably not
the problem.
When the printer complains that the toner is low and does not print:
- If the number following "toner low" is 1 or 2, you don't need to
  replace the toner cartridge. Just lift the cover, pull the cartridge out,
  shake it a little, and put it back in. The printer should come back to life.
- If the number is 3, the toner cartridge needs to be replaced. Go to the
  instrument desk and ask for help.
If a printer goes catatonic with some weird error message, you can try
powercycling it.
If one of the printers dies mysteriously or experiences a problem you
cannot fix yourself, report the problem. Ask the students on the side
of the room with the broken printer as their default to switch printers,
by executing M-x set-variable RET lpr-switches RET
'("-Psicp-48") [or 49] in Edwin. If both printers die, we are in
trouble; get help fast.
One of the more common software problems is the lack of synchronicity
between the code on the two servers--consult the section on making
software changes on both servers in Chapter 3 Fnord! to find out how to
bring the wrong versions of the code up to date. The two others are
stray Scheme processes and people with wedged Schemes.
If a student complains that his/her machine is slow, check the network
lights. If the machine is swapping, that is usually a sign that there
are some extra Scheme processes running on it, left over from previous
users (often the machines will be thrashing to death and students won't
even notice it). Here is how to look for/get rid of them:
- Ask the student to save, Just In Case.
- Open an xterm.
- % ps -ef | fgrep scheme
- If there's more than one "scheme -edwin" process, look for the one that
  has the "xterm ... edwin ..." process as a parent (the two lines in the
  fgrep output should have matching PIDs: the xterm's PID is the scheme's
  parent PID).
- Kill all the other scheme processes (using kill without
  options--i.e., % kill PID--works fine if you're doing it from a
  student's xterm).
- % exit
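The parent-matching step above can be sketched with awk. This is a rough illustration, not the lab's killschemes script: the function name stray_schemes is hypothetical, and it assumes the usual ps -ef layout (PID in the second column, parent PID in the third) and that a leftover Scheme no longer has a live xterm as its parent.

```shell
# Hypothetical helper: read `ps -ef` output on stdin and print the PIDs
# of scheme processes whose parent is NOT an xterm. Those are the
# candidates for `kill`; nothing is killed here.
# Assumes PID is column 2 and parent PID is column 3.
stray_schemes() {
    awk '
        /grep/   { next }                     # skip the grep or fgrep itself
        /xterm/  { is_xterm[$2] = 1; next }   # remember every xterm PID
        /scheme/ { pid[++n] = $2; ppid[n] = $3 }
        END {
            for (i = 1; i <= n; i++)
                if (!(ppid[i] in is_xterm))
                    print pid[i]
        }
    '
}
```

A cautious way to use it is ps -ef | stray_schemes, then look at the listed PIDs in the full ps output before killing anything.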
You can also get rid of stale Scheme processes by typing
~u6001/bin/killschemes at the prompt; this will run the
appropriate script. However, doing so will kill all the Scheme
processes currently running on the machine which have u6001 as their
owner--including the student's Scheme. Needless to say, you
should not do this when logged in as u6001, unless you desire to kill
your current Scheme process and be logged out without even having a
chance to save your work on disk.
If the machine is slow but no extra Schemes are running, it may be the
case that the student is hosing it down by writing losing Scheme
procedures (for example, creating a lot of streams will slow down a
machine considerably). Tell the students to at least kill their
xclocks, xloads, xterms, etc. to speed it up a little. If the problem
seems to be something else, report it: it may be a hardware problem.
When a student's Scheme crashes, you get to (joy!) recover his/her
files. Naturally, you can only recover as much as they have saved, so
if the student doesn't Fnord! believe in C-x C-s, some of his/her
work is bound to be lost. Students should get into the habit of saving
often; beat it into their heads every time a crash occurs and
they call you to recover their work. The best way to save is by doing
M-x checkpoint-floppy after writing each significant chunk of
code, and especially before running code that is likely to
crash the machine (this happens a lot when dealing with streams, for
example, because it is easy to get into an infinite loop and run out of
memory)--this way, your help may not even be needed, since the versions
of the files read in from the floppy on login will be recent. If
M-x checkpoint-floppy seems like too much trouble, the students
should at least do C-x C-s as often as possible.
There is a fallback, though. The system creates and updates an
`.asv' (autosave) file for each of the `.scm' files in the
`work' directory. These `.asv' files prove to be pretty
useful; they get updated often enough that most of the time they contain
more recent versions of the files than the `.scm' files
themselves. Don't forget about them--sometimes they will save a
student a headache by recovering a good chunk of work that is otherwise
lost. After recovering someone's files, show him/her the `.asv'
versions and ask him/her if those are more recent. If that is the case,
and if the student would like the contents of the `.asv' files to be
copied to the `.scm' files, do the following from an xterm:
- % cd work (if you're not already there)
- % cp filename1.asv filename1.scm
- % cp filename2.asv filename2.scm (and so on)
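When there are many files, the copies above can be sketched as a loop. This is an illustration under assumptions: the name restore_autosaves is hypothetical, and it uses file modification times as a stand-in for asking the student which version is more recent, so check with the student first. The -nt test is not strictly POSIX, but the common shells (bash, ksh, dash) support it.

```shell
# Hypothetical helper: for each .asv file in a directory, copy it over
# the matching .scm file when the autosave is newer by modification
# time, or when the .scm file does not exist at all.
restore_autosaves() {
    dir=${1:-.}
    for asv in "$dir"/*.asv; do
        [ -e "$asv" ] || continue          # directory has no .asv files
        scm=${asv%.asv}.scm
        if [ ! -e "$scm" ] || [ "$asv" -nt "$scm" ]; then
            cp "$asv" "$scm" && echo "restored ${scm##*/}"
        fi
    done
}
```

Run from the work directory, restore_autosaves . prints one line per file it overwrites and leaves the `.asv' files in place.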
With regard to the machine crash itself, there are several kinds of
lossage that require recovery of students' files, and each dictates a
different kind of treatment. We discuss them all below. Note that this
is just one way of doing things; there are other ways to go about
recovering files, some riskier than others. You may like something else
better--use it at will, but please make sure it is safe, so as not to
make students lose. Here, I tried to come up with procedures which are
both fast and safe.
If the Scheme is hung up, but you can still manipulate your mouse, it is
possible (although not very likely) that you can bring the Scheme back
to life with the help of the Edwin xterm window. After de-iconizing it,
press C-c C-c (if not already in the REPL); at the prompt, type
(edwin) RET. If this doesn't liven up your Scheme, there's
not much you can do to recover the process itself (option `R',
"hard reset possibly killing Scheme in the process," though it looks
vaguely promising, seems to justify the warning every time, indeed
killing Scheme in the process).
Try at least recovering the files by typing (save-editor-files)
RET. This will save all your buffers whose files already
exist; it will also ask you whether you want to save other buffers, such
as `*scheme*', `*transcript*', and so on. That can prove
useful to students who would like to save the transcript of their
work--to figure out what went wrong, for example, or in order to try to
get the bug to repeat. (save-editor-files) is an excellent
command (when it works), because the students don't lose any work.
However, it doesn't solve the problem of writing the saved files out to
the floppy. You still have to follow the procedure outlined in the next
section to make sure the files which are now in the `work'
directory find their way to the student's disk.
If you can get no help from the Edwin xterm, but you still have control of
the screen--i.e., you can open an xterm--your best bet is to do the
following (the last eight steps of this procedure, 5 through 12, may also
be used to recover the work of a previous user of the machine who has
lost his/her floppy or some such thing; the only change is in step 8,
where the appropriate directory--`work.~2~' or
`work.~3~'--should be substituted for `work.~1~'):
1. Open an xterm.
2. % ps -ef | fgrep scheme
3. Find the PID of your scheme -edwin process.
4. % kill PID (this logs you out)
5. Now log back in as u6001 (not hacker!), with the disk in the drive.
6. Open another xterm.
7. % su hacker
8. % cd ../work.~1~
9. % mv * ../work
10. % chown u6001 ../work/* (important! the files are now in `work',
    so that is where the ownership has to be fixed)
11. % exit
12. In Edwin, do M-x checkpoint-floppy, so the student can go crash
    again as soon as (s)he re-evaluates the code, and not really care.
The advantage of the above method is that it minimizes disk accesses,
and leaves the machine in a state where the student can continue work as
soon as file recovery is complete, without having to log out. Logging in
as hacker in this kind of situation is unnecessary and will only
complicate things. Since you issue the kill command, the logout
should proceed as normal, in the sense that the `work' directories
will be rotated. Thus, you will be able to find the student's files in
`work.~1~'.
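The move-and-chown part of this procedure (steps 8 through 10) can be sketched as one function. This is an illustration, not a script that exists in the lab: the name recover_work and its three arguments are hypothetical, and it is meant to be run after su hacker. It applies chown to each file at its new location in the work directory.

```shell
# Hypothetical wrapper around steps 8-10: move everything from a
# rotated directory back into the live work directory, then set the
# owner of each moved file. All three arguments are assumptions.
recover_work() {
    src=$1; dest=$2; owner=$3
    [ -d "$src" ] && [ -d "$dest" ] || { echo "missing directory" >&2; return 1; }
    for f in "$src"/*; do
        [ -e "$f" ] || continue            # rotated directory is empty
        mv "$f" "$dest"/ && chown "$owner" "$dest/${f##*/}"
    done
}
```

For the case described above, the call would look like recover_work ../work.~1~ ../work u6001, with `work.~2~' or `work.~3~' substituted when recovering an earlier user's files.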
If the screen is locked up completely and you can't open an xterm or
even move your mouse, your only alternative is to reboot the machine and
log back in to recover files. It should still be safe to log in as
u6001. Presumably, if the logout was clean (i.e., if the directories
got rotated correctly), you can follow the procedure outlined in case 2
above (steps 5 through 12). If, on the other hand, the logout script
did not get run, this will be detected on login.
As mentioned earlier, the login script checks for unclean logouts and,
if one is found, explains the situation to the user. If the student
wishes to recover files, the system then prompts for the hacker
password. With the correct password entered, directories are left
unrotated, and the student can continue the login process by selecting
the N option (practice login), not the default (L)
option. Note that this means the student should use M-x
checkpoint-floppy before logging out, since M-x logout won't try
to save anything to disk. Besides that, however, everything should
appear just the way it was before the crash--in particular, the
student's files will still be in the `work' directory, and nothing
will need to be moved.
While logging in as hacker may be considered a safe thing to do in some
situations, it is a necessary action if the student's work is
already in the `work.~3~' directory, since logging in as
u6001 will hose it all. Here is the entire procedure, remarkably
similar to the one above (the major difference is that you need to
create a directory to temporarily store the student's files):
1. Log in as hacker.
2. Open an xterm.
3. % cd ../u6001
4. % mkdir joe (for example; the name of the directory you create should
   not be something obvious like `tmp' or `temp', in case another
   LA is doing something like this at the same time; that's why I use the
   student's name)
5. % cd work.~3~
6. % cp * ../joe
7. Log out; log back in as u6001.
8. Open an xterm.
9. % su hacker
10. % cd ../joe
11. % mv * ../work
12. % cd
13. % rmdir joe (clean up after yourself!)
14. % exit
15. In Edwin, do M-x checkpoint-floppy.
Don't forget to remind students that `.asv' files may be more
up-to-date and that they may recover some of their lost work that way.