
Crises, Big and Small

Most of the crises that occur in the lab, such as students wedging their Scheme processes or paper jams in the printer, are routine. You will get the hang of these quickly and will usually have no problems dealing with them. If something extraordinary happens, however--a server dies, for instance, or it looks like someone is about to immanentize the Eschaton(12)--and you don't know what to do or need help in dealing with the situation, use all the means known to you to summon help. (When a server dies, for example, you will probably appreciate having a few warm bodies in the lab who know the hacker password and can recover people's work.) Send out e-mail, if you can, since quite a few people on the staff are addicted enough that they will read it within minutes. Don't hesitate to call the head TA, head LA or lecturers at their offices or at home, if needed (you can probably find out their phone numbers by fingering them). Get the instrument desk people to help you out.

Hardware

Whenever you encounter a hardware problem (such as a machine not operating correctly) that you cannot fix yourself, do all of the following: report the problem to the instrument desk, send mail to 6001-feedback, and put a blue tag on the machine, filling in the problem, your name, and the date. The tags can usually be found either on top or in the top drawer of the filing cabinet.

One Machine Dies/Acts Funny

Circuit Breaker Problems

If a snake is showing no signs of life (specifically, the heartbeat light, which is the first one on the left, is not blinking), the problem is usually a tripped circuit breaker. Unfortunately, this happens more often than it should, because the cheap circuit breakers in the lab are flaky. There are two circuit breaker boxes on the 001 side of the lab: one (#1) to the right of the Lambda Lounge, by the help Q board, and another (#2) to the left of the lounge, close to the blackboards at the back of the room. To recover a machine whose circuit breaker has tripped:

  1. Get under the table on which the machine is standing. Find one of the cables that lead from the machine to the power outlet, and read off the number written on the outlet corresponding to the cable (be sure to get the correct cable--you don't want to get the wrong number and make a student on a neighboring machine lose by playing with the wrong circuit breaker).

  2. Proceed to the appropriate circuit breaker box, according to the number you read off. For example, 2-23 means it's box #2. Open it and look for the breaker labeled 23. If the number is correct, the corresponding breaker should look different from the other ones: it will not be lined up with the others but will usually be in the dead center position.(13) To recover it, first push it all the way OFF (i.e., opposite the direction the other circuit breakers point in), and only then bring it back to line up with the rest. The machine should now make some noise and show signs of life. It will reboot and come around in a minute.
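The outlet labels read as box number, dash, breaker number, so 2-23 splits into box #2 and breaker 23. As a small illustration only (a sketch in POSIX sh; the function name is invented for this example and is not one of the lab's tools):

```shell
# Hypothetical helper: split an outlet label such as "2-23"
# into its circuit breaker box and breaker number.
parse_outlet_label() {
    label=$1
    box=${label%%-*}       # text before the first "-"
    breaker=${label#*-}    # text after the first "-"
    echo "box #$box, breaker $breaker"
}

parse_outlet_label 2-23
```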

If you notice that a particular machine experiences the circuit breaker problem on a regular basis, report the problem in the manner described above.

Rebooting a Snake

If you cannot get a snake to respond (i.e., the screen is locked up so nothing you type is being acknowledged and the mouse does not respond), you need to reboot it. For a soft reboot, press C-SHIFT-RESET; it should only take a few seconds for the machine to come around. To powercycle the machine, press the white power button, located at the bottom of the machine, below the disk drive; a hard reboot takes a little longer, maybe a couple of minutes.

If the machine does not reboot properly, leave it alone and report the problem (make sure to put a tag on it).

A Machine Is Using a Wrong Version of a PS

If one machine seems to be using the wrong version of a problem set or any other file, such as `motd', chances are all the other machines on that side of the room--i.e., connected to the same server--are experiencing the same problem. Check this by looking at the file in question on both sicp-00 and sicp-01 and comparing the two versions. If they differ, and only if you are absolutely sure which of the versions is correct, copy the correct version to the losing server (see the section on how to make software changes to both servers in the previous chapter). If you are not sure what the correct version is, don't try to guess; simply report the problem to the head TA/LA. In any case, the head TA should be informed if you encounter two different versions of a file.

If the problem seems to be something else and you cannot figure out how to fix it, ask the student to switch to another machine and put a sign on the machine asking people not to use it. In addition, report the problem by following the procedure described above.
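The version check described above can be sketched with cmp, which exits quietly with status 0 when two files are byte-for-byte identical. This is only a sketch: the function name is invented, and the two arguments stand for whatever copies of the file you have obtained from sicp-00 and sicp-01 (how each server's files are reached from a given machine is not assumed here).

```shell
# Hypothetical helper: report whether two copies of a file match.
# $1 and $2 are paths to the copy from each server.
compare_versions() {
    if cmp -s "$1" "$2"; then
        echo "same"
    else
        echo "different"
    fi
}
```

If the answer is "different", the rule from the text still applies: copy nothing unless you are sure which version is correct, and tell the head TA either way.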

A Server Dies

Speaking of immanentizing the Eschaton... Murphy's law says that the first server crash must occur the night before a problem set is due, in which case you will notice it immediately, for it will be accompanied by the sound of 46 students screaming, sighing, taking a deep breath, and doing whatever else people do in times of crisis. Don't panic. Try to calm the students down, too. A server crash is a serious problem, but their work can usually still be recovered. Call for help right away (you will need it even if you can fix the problem itself, to recover people's work as fast as possible), get into the server closet (it is located in the front of the lab, behind the printers), and start investigating.

One of the ways a server can crash is the same way any machine does: a circuit breaker trips. The consequence of this, as you might imagine, is that all the machines connected to that server panic and crash as well. To fix the problem, do what you would normally do when a circuit breaker trips (consult the section on circuit breakers above). The circuit breaker box for the servers is located inside the server closet.

If the server crashes for an unknown reason, and is completely catatonic, the only thing you should try is powercycling it (see the section on rebooting above): turn it off, wait a few seconds, turn it back on. If this doesn't help, do not spend time trying to figure out what the problem is; get help immediately. Do not even do that much (i.e., don't powercycle the server) if it seems to be running, since that is usually an indication of a different problem. Powercycling a system with an active filesystem may cause data lossage, and we want to risk that only as a last resort.

It will take quite a while for the machines to come back to life. When they do, start recovering students' work (see the software section below to find out how). Be extremely careful. If you can find a student who did M-x checkpoint-floppy recently and thus doesn't care if his/her files get recovered, make his/her machine your first guinea pig. If and when reinforcements arrive, the total recovery shouldn't take long.

Printer Problems

If you think there is a paper jam in one of the printers, lift the cover of the printer by pressing the button on the left of the cover, and look inside. If no paper is visible, try pulling out the paper trays and see if any paper is stuck on the bottom. In addition, there is a side access door; open it by pressing the left side of the printer in the right magic place--the access door should pop open. In any case, you will usually be able to pull the paper out somehow. If no paper can be seen anywhere inside the printer, a paper jam may not be the problem.

When the printer complains that the toner is low and does not print:

If a printer goes catatonic with some weird error message, you can try powercycling it.

If one of the printers dies mysteriously or experiences a problem you cannot fix yourself, report the problem. Ask the students on the side of the room with the broken printer as their default to switch printers, by executing M-x set-variable RET lpr-switches RET '("-Psicp-48") [or 49] in Edwin. If both printers die, we are in trouble; get help fast.

Software

One of the more common software problems is the lack of synchronicity between the code on the two servers--consult the section on making software changes on both servers in Chapter 3 Fnord! to find out how to bring the wrong versions of the code up to date. The two others are stray Scheme processes and people with wedged Schemes.

Machine is Slow: Killing Extra Scheme Processes

If a student complains that his/her machine is slow, check the network lights. If the machine is swapping, that is usually a sign that there are some extra Scheme processes running on it, left over from previous users (often the machines will be thrashing to death and students won't even notice it). Here is how to look for/get rid of them:

  1. Ask the student to save, Just In Case.

  2. Open an xterm.

  3. % ps -ef | fgrep scheme

  4. If there's more than one "scheme -edwin" process, look for the one that has the "xterm ... edwin ..." process as a parent (the lines in the fgrep output should have matching magic PID's).

  5. Kill all the other scheme processes (using kill without options--i.e., % kill PID--works fine if you're doing it from a student's xterm).

  6. % exit
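The parent-matching in step 4 can be sketched as a filter over ps -ef output, where the second column is the PID and the third is the parent PID. This is a hedged sketch, not a lab tool: the function name is invented, and it assumes the Edwin xterm's command line contains both "xterm" and "edwin", as the step above describes. It only prints candidate PIDs; look at them before killing anything.

```shell
# Hypothetical helper: read `ps -ef` output on stdin and print the PIDs of
# scheme processes whose parent is NOT an "xterm ... edwin ..." process.
stray_scheme_pids() {
    awk '
        /xterm/ && /edwin/              { xterm[$2] = 1 }    # remember Edwin xterm PIDs
        /scheme/ && !/xterm/ && !/grep/ { parent[$2] = $3 }  # scheme PID -> its parent PID
        END {
            for (pid in parent)
                if (!(parent[pid] in xterm))
                    print pid
        }
    ' | sort -n
}
```

It would be used as % ps -ef | stray_scheme_pids from a Bourne-style shell in which the function has been defined.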

You can also get rid of stale Scheme processes by typing ~u6001/bin/killschemes at the prompt; this will run the appropriate script. However, doing so will kill all the Scheme processes currently running on the machine which have u6001 as their owner--including the student's Scheme. Needless to say, you should not do this when logged in as u6001, unless you desire to kill your current Scheme process and be logged out without even having a chance to save your work on disk.

If the machine is slow but no extra Schemes are running, it may be the case that the student is hosing it down by writing losing Scheme procedures (for example, creating a lot of streams will slow down a machine considerably). Tell the students to at least kill their xclocks, xloads, xterms, etc. to speed it up a little. If the problem seems to be something else, report it: it may be a hardware problem.

Unwedging People With Dead Schemes

When a student's Scheme crashes, you get to (joy!) recover his/her files. Naturally, you can only recover as much as they have saved, so if the student doesn't Fnord! believe in C-x C-s, some of his/her work is bound to be lost. Students should get into the habit of saving often; beat it into their heads every time a crash occurs and they call you to recover their work. The best way to save is by doing M-x checkpoint-floppy after writing each significant chunk of code, and especially before running code that is likely to crash the machine (this happens a lot when dealing with streams, for example, because it is easy to get into an infinite loop and run out of memory). This way, your help may not even be needed, since the versions of the files read in from the floppy on login will be recent. If M-x checkpoint-floppy seems like too much trouble, the students should at least do C-x C-s as often as possible.

There is a fallback, though. The system creates and updates an `.asv' (autosave) file for each of the `.scm' files in the `work' directory. These `.asv' files prove to be pretty useful; they get updated often enough that most of the time they contain more recent versions of the files than the `.scm' files themselves. Don't forget about them--sometimes they will save a student a headache by recovering a good chunk of work that is otherwise lost. After recovering someone's files, show him/her the `.asv' versions and ask him/her if those are more recent. If that is the case, and if the student would like the contents of the `.asv' files to be copied to the `.scm' files, do the following from an xterm:

  1. % cd work (if you're not already there)

  2. % cp filename1.asv filename1.scm

  3. % cp filename2.asv filename2.scm ; and so on
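To see which autosave files are worth showing to the student, one option is to list the `.asv' files that were modified more recently than their `.scm' counterparts. The following is a sketch in POSIX sh under that assumption (modification time as a stand-in for "more recent work"); the function name is invented, and it copies nothing, since the student should decide.

```shell
# Hypothetical helper: print each .asv file in a directory that is newer
# than the matching .scm file, or that has no .scm file at all.
newer_autosaves() {
    dir=${1:-.}
    for asv in "$dir"/*.asv; do
        [ -e "$asv" ] || continue        # no .asv files at all
        scm=${asv%.asv}.scm
        if [ ! -e "$scm" ] || [ -n "$(find "$asv" -newer "$scm")" ]; then
            echo "$asv"
        fi
    done
}
```

Run from the `work' directory with no argument, it checks the current directory.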

With regard to the machine crash itself, there are several kinds of lossage which require recovery of students' files, and each dictates a different kind of treatment. We discuss them all below. Note that this is just one way of doing things; there are other ways to go about recovering files, some riskier than others. You may like something else better--use it at will, but please make sure it is safe, so as not to make students lose. Here, I tried to come up with procedures which are both fast and safe.

Case 1: Edwin xterm Saves the Day.

If the Scheme is hung up, but you can still manipulate your mouse, it is possible (although not very likely) that you can bring the Scheme back to life with the help of the Edwin xterm window. After de-iconizing it, (if not already in REPL) press C-c C-c; at the prompt, type (edwin) RET. If this doesn't liven up your Scheme, there's not much you can do to recover the process itself (option `R', "hard reset possibly killing Scheme in the process," though it looks vaguely promising, seems to justify the warning every time, indeed killing Scheme in the process).

Try at least recovering the files by typing (save-editor-files) RET. This will save all the buffers whose files already exist; it will also ask you whether you want to save other buffers, such as `*scheme*', `*transcript*', and so on. That can prove useful to students who would like to save the transcript of their work--to figure out what went wrong, for example, or in order to try to get the bug to repeat. (save-editor-files) is an excellent command (when it works), because the students don't lose any work. However, it doesn't solve the problem of writing the saved files out to the floppy. You still have to follow the procedure outlined in the next section to make sure the files which are now in the `work' directory find their way to the student's disk.

Case 2: You Can Open an xterm.

If you can get no help from the Edwin xterm, but you still have control of the screen--i.e., you can open an xterm--your best bet is to do the following (the last eight steps of this procedure, 5 through 12, may also be used to recover the work of a previous user of the machine who has lost his/her floppy or some such thing; the only change is in step 8, where the appropriate directory--`work.~2~' or `work.~3~'--should be substituted for `work.~1~'):

  1. Open an xterm.

  2. % ps -ef | fgrep scheme

  3. Find the PID of your scheme -edwin process.

  4. % kill PID (this logs you out)

  5. Now log back in as u6001 (not hacker!), with the disk in drive.

  6. Open another xterm.

  7. % su hacker

  8. % cd ../work.~1~

  9. % mv * ../work

  10. % chown u6001 ../work/* (important! the files are in `work' now, since step 9 moved them out of the current directory)

  11. % exit

  12. In Edwin, do M-x checkpoint-floppy, so the student can go crash again as soon as (s)he re-evaluates the code, and not really care.

The advantage of the above method is that it minimizes disk accesses, and leaves the machine in a state where the student can continue work as soon as file recovery is complete, without having to log out. Logging in as hacker in this kind of situation is unnecessary and will only complicate things. Since you issue the kill command, the logout should proceed as normal, in the sense that the `work' directories will be rotated. Thus, you will be able to find the student's files in `work.~1~'.
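Since the mv in step 9 replaces any file of the same name already in `work', it can be worth previewing what the move will do before running it. The following is a minimal sketch in POSIX sh; the function name and the "replace"/"add" wording are invented for this example, and nothing is moved or changed.

```shell
# Hypothetical helper: show what moving every file from $1 into $2 would do.
# Prints "replace NAME" if NAME already exists in the destination,
# and "add NAME" otherwise.
preview_recovery() {
    src=$1
    dst=$2
    for f in "$src"/*; do
        [ -e "$f" ] || continue          # source directory is empty
        name=${f##*/}                    # strip the directory part
        if [ -e "$dst/$name" ]; then
            echo "replace $name"
        else
            echo "add $name"
        fi
    done
}
```

From inside `work.~1~', the call would be preview_recovery . ../work.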

Case 3: No Keys are Responding

If the screen is locked up completely and you can't open an xterm or even move your mouse, your only alternative is to reboot the machine and log back in to recover files. It should still be safe to log in as u6001. Presumably, if the logout was clean (i.e., if the directories got rotated correctly), you can follow the procedure outlined in case 2 above (steps 5 through 12). If, on the other hand, the logout script did not get run, this will be detected on login.

As mentioned earlier, the login script checks for unclean logouts and, if one is found, explains the situation to the user. If the student wishes to recover files, the system then prompts for the hacker password. With the correct password entered, directories are left unrotated, and the student can continue the login process by selecting the N option (practice login), not the default (L) option. Note that this means the student should use M-x checkpoint-floppy before logging out, since M-x logout won't try to save anything to disk. Besides that, however, everything should appear just the way it was before the crash--in particular, the student's files will still be in the `work' directory, and nothing will need to be moved.

Case 4: Student's Work is in `work.~3~'

While logging in as hacker may be considered a safe thing to do in some situations, it is a necessary action if the student's work is already in the `work.~3~' directory, since logging in as u6001 will hose it all. Here is the entire procedure, remarkably similar to the one above (the major difference is that you need to create a directory to temporarily store the student's files):

  1. Log in as hacker.

  2. Open an xterm.

  3. % cd ../u6001

  4. % mkdir joe (for example; the name of the directory you create should not be something obvious like `tmp' or `temp', in case another LA is doing something like this at the same time; that's why I use the student's name)

  5. % cd work.~3~

  6. % cp * ../joe

  7. Log out; log back in as u6001.

  8. Open an xterm.

  9. % su hacker

  10. % cd ../joe

  11. % mv * ../work

  12. % cd

  13. % rmdir joe (clean up after yourself!)

  14. % exit

  15. In Edwin, do M-x checkpoint-floppy.
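The advice in step 4 about directory names can be sketched as a small guard around mkdir. Plain mkdir (without -p) fails if the directory already exists, which is a cheap signal that another LA may be using the same name. This is a sketch only; the function name is invented, and the list of "obvious" names it refuses is just the two examples from the text.

```shell
# Hypothetical helper: create a temporary holding directory named $2 under $1.
# Refuses obvious names and fails if the directory already exists.
make_staging_dir() {
    parent=$1
    name=$2
    case $name in
        tmp|temp)
            echo "pick a less obvious name than $name" >&2
            return 1
            ;;
    esac
    # No -p here: an existing directory must be an error, not a success.
    mkdir "$parent/$name" || return 1
    echo "$parent/$name"
}
```

For instance, make_staging_dir . joe from ~u6001 plays the role of step 4.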

Don't forget to remind students that `.asv' files may be more up-to-date and that they may recover some of their lost work that way.
