Chapter 13: Case Studies

by Philip Greenspun, part of Database-backed Web Sites

Note: this chapter has been superseded to some extent by its equivalent in the new edition    

(some of the studies in this chapter aren't covered in the new book because there is really not point, esp. given that this older chapter remains online)


This chapter contains seven case studies, each with some actual source code. I present user interface, interaction design, and data modeling ideas that should be useful to you in many contexts. This chapter is not a step-by-step guide. If it were, it would be 800 pages long, absurdly boring, and extremely tool-specific. My classified ad system case (number 7) has some source code that works with the Illustra RDBMS and the AOLserver. But any of the ideas can be used with an object database and a Franz Common Lisp-based server. Most of the ideas are useful for building systems that don't have anything to do with classified ads.

Because I'm afraid that your eyes will glaze over and you'll skip this, I'll try to give you a hint of what is in store. Case 1 is straightforward and dull. It is intended to help you understand how AOLserver Tcl works with a simple mailing list registration application. Feel free to skip it. Case 2 shows how to generalize this application so that many static Web services can share a single RDBMS-backed service. It is a powerful idea that I have used at least a dozen times. Case 3, the birthday reminder system, introduces the idea that less can be more. It does less than a calendar management program but is easier to use. Case 3 also demonstrates how to build a back-end to loop through a database table and send e-mail when necessary. In doing so, it addresses in a practical way an important question about concurrency and RDBMS.

Case 4, the bulletin board system, hammers home the less is more theme, showing how a less powerful Q&A forum can be much more useful to readers than fully threaded discussions. Case 5, the bulletin board system extended with user-requested features, pulls together a lot of threads from previous cases: generalizing a service with distributed maintenance; using e-mail to pull a community together around a Web site; e-mail alerts sent by a nightly sweeping function. Case 6 gets nuts-and-bolty with the full-text search capabilities of modern RDBMS. Case 7 uses classified ads to show how most Web sites overdo their data modeling and search form. It also demonstrates the construction of an auction system and talks about which transactions are legal given referential integrity constraints.

I hope that you're inspired. If you don't have the patience to read the source code then please at least skim the text underneath each new case headline.

Case 1: The Mailing List

We went through this in the first chapter on RDBMS-backed sites, but now we'll do it over again with actual code. Remember that you want a mailing list system for your site. Users can add and remove themselves, supplying email addresses and real names. You want to be the only one who can view the list and the only one who can send mail to the list.

Step 1: The data model

create table mailing_list (
        email           text not null primary key,
        name            text
);

Step 2: Legal Transactions

Here are examples of the two types of transactions:

insert into mailing_list (email,name)
values ('philg@mit.edu','Philip Greenspun');
delete from mailing_list where email = 'philg@mit.edu';

Step 3: Mapping Transactions onto Web Forms

A form to add someone to the list ...

<html>
<head>
<title>Add yourself to the mailing list</title>
</head>
<body bgcolor=#ffffff text=#000000>
<h2>Add yourself to the mailing list</h2>
<form method=post action=add.tcl>
<table>
<tr><td>Name<td><input name=name type=text size=35>
<tr><td>email<td><input name=email type=text size=35>
</table>
<p>
<input type=submit value="Add Me">
</form>
</body>
</html>

... and a form so that someone can delete himself from the list ...

<html>
<head>
<title>Remove yourself from the mailing list</title>
</head>
<body bgcolor=#ffffff text=#000000>
<h2>Remove yourself from the mailing list</h2>
<form method=post action=remove.tcl>
<table>
<tr><td>email<td><input name=email type=text size=35>
</table>
<p>
<input type=submit value="Remove Me">
</form>
</body>
</html>

Figure 13-1: Add Me form rendered by Netscape Navigator

Figure 13-2: Delete Me form rendered by Netscape Navigator

Step 4: Writing Code to Process Those Forms

This is supposed to be the easy part and it really is. However, looking at someone else's source code is always confusing. Don't infer from your confusion that these are complex programs. The only real art to them is how they handle errors and ill-formed input.

Here is a AOLserver Tcl script to process the "add me" form. The script will

Here's the full program:

# call philg's magic functions to set local variables
# to what the user typed into the form
set_form_variables
set_form_variables_string_trim_DoubleAposQQ
# name, email, QQname, QQemail are now set
# get an open database connection from the AOLserver
set db [ns_db gethandle]
# Check for errors in user input before doing anything else
# we use the Tcl REGEXP command to see if the email variable
# has the following form:  1 or more ASCII characters (.+) followed
# by the "at sign" (@) then 1 or more ASCII characters (.+)
# followed by at least one period (\.) then 1 or more ASCII characters (.+)
if { ![regexp {.+@.+\..+} $email] } {
    # the REGEXP didn't match
    ns_return $conn 200 text/html "<html>
<head><title>Problem Adding $email</title></head>
<body bgcolor=#ffffff text=#000000>
<h2>Problem Adding $email</h2>
Your email addresss doesn't look right to us.  We need your full
Internet address, something like one of the following:
<code>
<ul>
<li>Joe.Smith@att.com
<li>student73@cs.stateu.edu
<li>francois@unique.fr
</ul>
<hr>
<address>webmaster@greedy.com</address>
</body>
</html>
"
    # RETURN terminates the AOLserver source.tcl command
    # so none of the code below this point will be executed
    # if the email address had an incorrect form
    return
}
# if we got here, that means the email address was OK
if { $name == "" } {
    # the variable NAME was an empty string
    ns_return $conn 200 text/html "<html>
<head><title>Problem Adding $email</title></head>
<body bgcolor=#ffffff text=#000000>
<h2>Problem Adding $email</h2>
You didn't give us your name.  Please back up using your
browser and make a more complete entry.
</body>
</html>
"
    # this terminates the AOLserver source.tcl command
    return
}
# Error checking complete; ready to do real work
# construct the SQL query using the versions of the form
# variables where apostrophes have already been doubled
# so that names like "O'Grady" don't cause SQL errors
set insert_sql "insert into mailing_list (email, name) 
                values ('$QQemail','$QQname')"
# we execute the insert inside the Tcl function CATCH
# if the database raises an SQL error, the AOLserver API
# call ns_db dml will raise a Tcl error that would result
# in a "Server Error" page being returned to the user.  We
# don't want that, so we catch the error ourselves and return
# a more specific message
if [catch { ns_db dml $db $insert_sql } errmsg] {
        # the insert went wrong; the error description
        # will be in the Tcl variable ERRMSG
        ns_return $conn 200 text/html "<html>
<head><title>Problem Adding $email</title></head>
<body bgcolor=#ffffff text=#000000>
<h2>Problem adding $email</h2>
<hr>
The database didn't accept your insert, most likely because your email
address is already on the mailing list.
<p>
Here was the message:
<pre>
$errmsg
</pre>
<hr>
<address>webmaster@greedy.com</address>
</body>
</html>
"
} else { 
   # the insert went fine; no error was raised
   ns_return $conn 200 text/html "<html><head><title>$email Added</title>
</head>
<body bgcolor=#ffffff text=#000000>
<h2>$email Added</h2>
<hr>
You have been added to the <a href=/index.html>www.greedy.com</a>
mailing list.
<hr>
<address>webmaster@greedy.com</address>
</body>
</html>
"

The AOLserver Tcl script to process the "remove me" form is much simpler.

set_form_variables
set_form_variables_string_trim_DoubleAposQQ
# now email and QQemail are set as local variables
# ask for a database connection
set db [ns_db gethandle]
# note that the dual calls to the SQL UPPER function
# ensure that the removal will be case insensitive
set delete_sql "delete from mailing_list 
                where upper(email) = upper('$QQemail')"
# execute the delete statement in the database
ns_db dml $db $delete_sql
# call the special AOLserver API call ns_ill resultrows
# to find out how many rows were affected by the delete
if { [ns_ill resultrows $db] == 0 } {
    # 0 rows were affected
    ns_return $conn 200 text/html "<title>Problem Deleting $email</title>
<h2>Problem deleting $email</h2>
<hr>
We could not find <code>\"$email\"</code> on the mailing list (and our search is
<em>not</em> case-sensitive).
<hr>
<address>webmaster@greedy.com</address>
"
} else { 
  # the delete affected at least one row so removal must
  # have been successful
  ns_return $conn 200 text/html "<html><head><title>$email Removed</title>
</head>
<body bgcolor=#ffffff text=#000000>
<h2>$email Removed</h2>
<hr>
You have been removed from the <a href=/index.html>www.greedy.com</a>
mailing list.
<hr>
<address>webmaster@greedy.com</address>
</body>
</html>
"
}

Case 2: The Mailing List

"Mailing List"? This sounds vaguely like Case 1. It is. Vaguely. It turns out that you need mailing lists for 4 other services that you offer. Also, ten of your friends want to run mailing lists for their sites. You'd be happy to give them your code, but you know that they aren't willing to endure the pain of maintaining a relational database management system just for this one feature. They are grateful, but "oh, while you're at it, would you mind allowing us to also store Snail Mail information?"

Could you make the code generic? You just need an extra table to store information about each of your and your friends' services.

create table spam_domains (
        domain          text not null primary key,
        backlink        text,   -- a URL pointing back to the  page
        backlink_title  text,   -- what to say for the link back
        blather         text,   -- arbitrary HTML text 
        challenge       text default 'Your mother''s maiden name',
        response        text,
        maintainer_name         text,
        maintainer_email        text,
        notify_of_additions_p   boolean default 't',
        list_type       text    -- 'basic', 'snail_too', 'snail_plus_demographics'
);

Each domain is identified with a string, e.g., "photonet" for the magazine photo.net. Then you store the URL, "http://photo.net/photo/" in the backlink column and the title, "photo.net" in backlink_title. You provide a space for some descriptive HTML for the top of the "add me" form, e.g., "you will get mail once every month describing new articles in photo.net".

Rather than a password, which your friends might forget and then bug you to manually retrieve from the database, you store a challenge question of their choice, e.g., "Mother's Maiden Name", and their response.

You keep track, per domain, of the name and email address of the list maintainer. If the notify_of_additions_p column is set to true then your "add me" script will send email to the maintainer when someone new adds himself to the list. Finally, you keep track of how much data is stored in the list_type column. A "basic" list is only name and email. A "snail_too" list also asks for and stores a physical mail address. A "snail_plus_demographics" list also asks for age and sex.

Once that new table is defined, you need to beef up the mailing list table as well. It would be possible to build this system so that it defined a separate table for each new domain, but I think it is cleaner to add a domain column to the mailing list table as long as we're adding all the extra columns for physical mail and demographics:

create table spam_list (
        domain          text not null references spam_domains,  -- which list is this entry for
        email           text not null,
        name            text,
        -- info for 'snail_too'
        line1           text,
        line2           text,
        city            text,
        state           text,
        postal_code     text,           -- ZIP+4 better, but five digits ok.  Canada, too!
        country         char(2),        -- ISO country code
        -- info for snail_plus_demographics
        birthday        date,
        sex             text check(sex in ('M','F')),
        primary key( domain, email )
);

Note that you have to remove the primary key constraint on the email column. There is no reason why "philg@mit.edu" can't be in the mailing list table 10 times, each time for a different domain. However, you don't want "philg@mit.edu" on the photo.net list 10 times. Thus the primary key( domain, email ) constraint at the end of the table definition.

How does this all work? If you visit you can see the whole system in action. Here's an example of how the add-me.html form has been replaced by a Tcl procedure:

set_form_variables_string_trim_DoubleAposQQ
# "domain" was the one form argument so QQdomain is set
set db [ns_db gethandle]
# ask for all the columns in the domains table for this domain
set selection [ns_db 1row $db "select unique * from spam_domains
                               where domain='$QQdomain'"]
# call philg's magic function to set local variables for all columns
set_variables_after_query
# these are the basic fields, "name" and "email", that every mailing
# list will have.  If this domain requires more, they will be appended
set form_fields "<tr><td>Name<td><input name=name type=text size=35>
<tr><td>email<td><input name=email type=text size=35>"
if { $list_type == "snail_too" || $list_type == "snail_plus_demographics" } {
    # this domain wants a more complicated list; add Snail Mail fields
    append form_fields "<tr><td>Address Line 1<td><input name=line1 type=text size=35>
<tr><td>Address Line 2<td><input name=line2 type=text size=35>
<tr><td>City, State, Postal Code<td><input name=city type=text size=12>
<input name=state type=text size=6>
<input name=postal_code type=text size=8>
<tr><td>Country<td><input name=country type=text size=3 limit=2> (ISO Code, e.g., \"us\", \"fr\", \"ca\", \"au\", etc.)"
} 
if { $list_type == "snail_plus_demographics" } {
    append form_fields "<tr><td>Birthday<td><input name=birthday type=text size=12> (YYYY-MM-DD format must be exact)
<tr><td>Sex<td><input name=sex type=radio value=M CHECKED> Male
<input name=sex type=radio value=F> Female
"
}
# finally return the page to the user
ns_return $conn 200 text/html "<head><title>Add Yourself to the Mailing list</title></head>
<body bgcolor=#ffffff text=#000000>
<h2>Add Yourself</h2>
to <a href=\"home.tcl?domain=[ns_urlencode $domain]\">the mailing list</a> 
for <a href=\"$backlink\">$backlink_title</a>
<hr>
<form method=post action=add-2.tcl>
<input type=hidden name=domain value=\"$domain\">
<table>
$form_fields
</table>
<input type=submit value=Submit>
</form>
<hr>
<address>
<a href=\"mailto:$maintainer_email\">
$maintainer_name ($maintainer_email)
</a>
</address>
"

Note how the final HTML page is strewn with values from the database, e.g., $maintainer_email, $backlink, and $backlink_title. Now it looks to all intents and purposes like it is part of your friend's service and you won't be getting e-mail from the confused. See Figure 13-3 for an example.

Figure 13-3: My mailing list system being used by another Web publisher (billg40@tiac.net). If users were to look carefully at the location box, they might notice that they were being bounced from www.tiac.net to www.greenspun.com after clicking the "join mailing list" link. Billg40 can keep his Secret Diary of Bill Gates at Tiac, where they don't run an RDBMS, and yet look just like a high-tech Web publisher with a staff of programmers and a database administrator.

Case 3: The Birthday Reminder System

Olin refused to give Alex (http://photo.net/photo/alex.html) a sample off his plate at brunch. "You're just a dog, Alex," Olin said. We pressed Olin as to his reason for feeling superior to Alex. Olin thought for a few minutes all he could come up with was "I have a Ph.D. and he doesn't."

Olin demonstrated the practical value of his Carnegie-Mellon computer science degree in 1994 by turning down Jim Clark's offer to become Employee #3 at a little start-up called Mosaic Communications (grab http://www.netscape.com if you want to see how Jim and Co. are doing now). Consequently, his resort to credentialism set off howls of laughter throughout the room.

"Let's see what value society places on your Ph.D., Dr. Shivers," I said. "We'll take both you and Alex to Harvard Square and hang signs around your necks. Alex's sign will read ‘Needs home.' Your sign can read ‘Needs home. Has Ph.D.' Which one of you do you think would have to sit out there longer?"

"Nooooo contest," opined Olin's girlfriend.

Anyway, Olin clung to his belief that his Ph.D. was worth something despite the fact that the marketplace was crushing him under a burden of poverty to correspond to his burden of ignorance of how to build an RDBMS-backed Web service.

I kept offering to show Olin but he was too busy writing papers for academic journals that even he didn't bother reading. We began to joke that Olin was "afraid to be rich." Then one night Olin came over to my house and said "Let's jack into this World Wide Internet thing."

We sat down to build a toy AOLserver/Illustra-backed birthday reminder system. Sure there are plenty of fancy calendar management systems that incorporate one-time events, recurring events, and reminders. But most of these calendar management programs require you to maintain them on a Macintosh or Windows machine. If you switch from computer to computer then they don't do you much good. We all read our e-mail no matter where we are, so why not build a system that feeds reminders into our e-mailbox? Again, it turns out that there are Web-based calendar management systems that will do just that. But these programs are very complicated. I don't have a job. I don't make appointments. I don't plan in advance. I don't want to invest in learning and using a calendar management program. I just want an email message a week before my friend's birthday so that I can send him a card.

Olin and I sat down at 9 pm to build RemindMe. We were finished by midnight. Then we showed the text-only system to Ulla Zang (http://www.ullazang.com) and asked her to do a spiffy graphic design. Now we have a nice public service to offer.

Step 1: The data model

Note: the syntax is for the Illustra 3.2 RDBMS.

--
-- this table has one row for each person using the system
-- the PRIMARY KEY constraint says that there can't be two
-- rows with the same value in the EMAIL column
--
create table bday_users (
 email  text not null primary key,
 password text not null
);
--
-- we index bday_users because of the integrity constraint
-- on bday_reminders (so that an insert to bday_reminders 
-- doesn't have to do a sequential scan of bday_users)
--
create index bday_users_by_email on bday_users using btree ( email );
create table bday_reminders (
 email   text references bday_users,
 event_description text,
 event_date  date,
 remind_week_before_p boolean default 'f',
 remind_day_before_p boolean default 'f',
 remind_day_of_p  boolean default 'f',
 last_reminded  date
);

The first item of interest in this data model is the integrity constraint that values in the EMAIL column of BDAY_REMINDERS must correspond to values in the EMAIL column of BDAY_USERS. That's what "references bday_users" tells the database. After a row in BDAY_REMINDERS is inserted or updated, the RDBMS will check to make sure that this integrity constraint is true. If not, the transaction will be aborted. Also, nobody will be able to delete a row from BDAY_USERS if there are still rows in BDAY_REMINDERS that contain the same e-mail address.

Integrity constraints are critical if you have users typing data into a shell database tool. But here users will only be able to access the RDBMS through our Web pages. Why can't we just write our forms-processing software so that it never allows bad data into the database? Well, we can and we will. But unless you are the rare programmer who always writes perfect code, it is nice to have the RDBMS's integrity constraint system as a last line of defense.

Thomas Jefferson did not say "Eternal vigilance is the price of liberty." That's because he wasn't John Philpot Curran (Irish statesman who never set foot in the United States). Nor did Jefferson say "Eternal sluggishness is the price of integrity." That's because he wasn't an RDBMS programmer.

Let's focus a bit on "values in the EMAIL column of BDAY_REMINDERS must correspond to values in the EMAIL column of BDAY_USERS." It is not magic fairies who maintain integrity but rather plodding stupid computer programs. During every attempt to insert a row into BDAY_REMINDERS, the plodding stupid relational database management system will grind through every row in BDAY_USERS to see if the proposed value for the EMAIL column matches one of those rows. This is an "order N" operation, requiring time proportional to the number of rows in he BDAY_USERS table. If, however, we create an index on the BDAY_USERS table

create index bday_users_by_email on bday_users using btree ( email );

then the plodding stupid RDBMS can check the index instead of the full table. For the binary tree index we just created, this takes "order log N" time. If you have 64,000 rows, "N" is 64,000 and "log N" is 32.

Step 2: Legal Transactions

You'd think that the most obvious legal transaction would be "add user to bday_users table." However, I decided to not make that one of the legal transactions. I don't want a table full of email addresses for people who aren't really using the system. Thus it is only legal to add a user atomically with at least one reminder:

begin transaction;
insert into bday_users 
(email, password)
values
('philip@greenspun.com','hairysamoyed');
insert into bday_reminders 
(email, event_description, 
event_date, remind_week_before_p, remind_day_before_p, remind_day_of_p)
values
('philip@greenspun.com','remember to finish PhD',
'1997-09-01','t','t','t');
end transaction;

This transaction inserts the user "philip@greenspun.com" with password "hairysamoyed" and a reminder to finish his PhD by September 1, 1997. Reminders will be sent a week before, a day before, and the day of.

Suppose the user doesn't want to be reminded the day of? That's another legal transaction:

update bday_reminders 
set remind_day_of_p = 'f'
where oid = '2056.2017'

OID? That's a non-standard Illustra hidden column, the "object identifier". It uniquely identifies a row. If I were a good database programmer, bday_reminders would have a unique key. Either I would demand that the triple of EMAIL, EVENT_DESCRIPTION, and EVENT_DATE be unique and then use those as the key. Or I would add a REMINDER_ID column to the table and then assign unique values to this column from a sequence generator.

It is possible to index the table by EMAIL, EVENT_DESCRIPTION, and EVENT_DATE then use triples as a key. However, it makes for rather unwieldy looking update statements. Had I been using Oracle, I would have used a REMINDER_ID column and then inserted using "reminder_id_sequence.NextVal" for the value. This takes advantage of a built-in Oracle sequence generation facility. Illustra doesn't have anything like this so I'd have to define a separate sequence generator table and complicate my transactions.

Where do I draw the line at vendor-specific kludges like this? On the client/server boundary. If other tables needed to refer to rows in BDAY_REMINDERS, I'd create a REMINDER_ID column. I would never use an OID as a pointer from one database row to another. However, in this case I'm just going to use it in my client Tcl program. If I move the system to another RDBMS, I'll add a column to the data model and rewrite the two Tcl pages in question.

A reminder system with only one reminder isn't anything to write home about. We ought to be able to add new events:

insert into bday_reminders
(email, event_description, 
event_date, remind_week_before_p, remind_day_before_p, remind_day_of_p)
values
('philip@greenspun.com','Wash dog whether he needs it or not',
'1997-12-01','t','f','t')

A week before December 1, I'll be reminded to buy shampoo for my Samoyed (he usually requires an entire bottle). I disabled the day before reminder so that's "f". Then I'll be reminded on December 1 itself.

One last legal transaction: deleting a reminder. Suppose that I finish my PhD thesis (to which supposition my friends invariably respond "suppose the sun falls out of the sky"):

delete from bday_reminders where oid = '2056.2017'

Again, we're using the OID but just for user interface.

Step 3: Mapping Transactions onto Web Forms

My general philosophy is to have as few pages as possible. This keeps the user interface simple. Figure 13-4 shows the interaction flow for the RemindMe system. Oftentimes, the system uses redirects to pages that reflect current status rather than separate confirmation pages. For example, after a user disables a "week before" alert, there is no confirmation page. Instead, after doing the database update, the Tcl script issues a 302 redirect back to the reminder summary page where the alert is showed to be disabled.

Figure 13-4: interaction for the RemindMe system. Note the extensive use of redirects when bad input is received and/or after a simple update.

Figure 13-5: the text-only welcome page, built by me and Olin

Figure 13-6: the welcome page, design by Ulla Zang; we saved a lot of time and heartache by completing our interaction design with a text-only site before consulting a graphic designer. Ulla turns out to be one of the few graphic designers I've worked with who is also an excellent interaction designer. Nonetheless, by having finished the programming beforehand, we were able to use Ulla's time to maximum advantage.

Figure 13-7: the reminders summary page, core of the user interface. Olin and I weren't satisfied with this design but decided to dump the user interface issue onto Ulla.

Figure 13-8: Ulla's reminders summary page. Much bigger than our text-only version, but also much cleaner. We'll be able to take her simplifications and translate them back into the text-only site as well. [That's another advantage of doing a full text-only site first; you always have something to satisfy the 28.8 modem crowd even if your graphic designer goes wild with huge images.]

Step 4: Writing Code to Process Those Forms

Here is the .tcl page that summarizes a user's reminders (see figures 13-7 and 13-8).

set_form_variables
set_form_variables_string_trim_DoubleAposQQ
# email, password
set db [ns_db gethandle]
if { [database_to_tcl_string $db "select unique upper(password)
from bday_users
where upper(email) = upper('$QQemail')"] != 
     [string toupper $password] } {
    ns_returnredirect $conn \
                      "bad-password.tcl?email=[ns_urlencode $email]"
    # we've returned a 302 redirect to the user, now exit this thread
    return
}
# if we got here it means that the password checked out OK
# we're going to use this a bunch of times so let's save some work
set emailpassword \
    "email=[ns_urlencode $email]&password=[ns_urlencode $password]"
# we return the headers and top of the page before doing a query
# into the BDAY_REMINDERS table; that way the user isn't staring at
# a blank screen
ReturnHeaders $conn
ns_write $conn "<html>
<head>
<title>Reminders for $email</title>
</head>
<body bgcolor=#ffffff text=#000000>
<h2>Reminders</h2>
for $email
<hr>
<ul>
"
set selection [ns_db select $db \
"select *,
        oid,
        extract(month from event_date) as event_month, 
        extract(day from event_date) as event_day
from bday_reminders 
where upper(email) = upper('$QQemail')
order by event_month, event_day"]
while {[ns_db getrow $db $selection]} {
    set_variables_after_query
    ns_write $conn "<li>$event_description : [util_IllustraDatetoPrettyDate $event_date] "
    # it would have been cleaner to think more and come up with
    # a general-purpose action.tcl function, but I think it is 
    # also OK to do what we've done, use a separate .tcl page for
    # each kind of action
    # for each reminder, we test to see if it is already set,
    # then present an appropriate current status hyperlinked to
    # a URL that will toggle the state of that reminder
    # note that we're using the Illustra OID as a key
    if { $remind_week_before_p == "t" } {
 ns_write $conn "\[Week Before: <a href=\"week-before-off.tcl?oid=[ns_urlencode $oid]&$emailpassword\">On</a>\]" } else {
 ns_write $conn "\[Week Before: <a href=\"week-before-on.tcl?oid=[ns_urlencode $oid]&$emailpassword\">Off</a>\]" }
    if { $remind_day_before_p == "t" } {
 ns_write $conn "\[Day Before: <a href=\"day-before-off.tcl?oid=[ns_urlencode $oid]&$emailpassword\">On</a>\]" } else {
 ns_write $conn "\[Day Before: <a href=\"day-before-on.tcl?oid=[ns_urlencode $oid]&$emailpassword\">Off</a>\]" }
    if { $remind_day_of_p == "t" } {
 ns_write $conn "\[Day Of: <a href=\"day-of-off.tcl?oid=[ns_urlencode $oid]&$emailpassword\">On</a>\]" } else {
 ns_write $conn "\[Day Of: <a href=\"day-of-on.tcl?oid=[ns_urlencode $oid]&$emailpassword\">Off</a>\]" }
    ns_write $conn "\[<a href=\"delete.tcl?oid=[ns_urlencode $oid]&$emailpassword\">DELETE</a>\]" 
} 
ns_write $conn "
<P>
<li>
<a href=\"add-reminder.tcl?$emailpassword\">Add new reminder</a>
</ul>
<hr>
<a href=\"mailto:[bday_system_owner]\">
<address>[bday_system_owner]</address>
</a>
</body>
</html>"

Most of the interesting points about this procedure are documented in the comments above. The only thing worth stressing is that this is the meat of the user interface. The links from this page mostly just update the RDBMS and then redirect back to this page. For example, here's week-before-off.tcl:

set_form_variables
# email, password, oid
set db [ns_db gethandle]
# do a password check here as in the above procedure (code omitted)
ns_db dml $db "update bday_reminders 
set remind_week_before_p = 'f'
where oid = '$oid'"
ns_returnredirect $conn "domain-top.tcl?email=[ns_urlencode $email]&password=[ns_urlencode $password]"

This procedure just checks the password, updates the BDAY_REMINDERS row, then redirects back to the reminder summary page.

Step 5: Step 5?

Yes, there is a Step 5 for this system: Making it work. If you are an Internet entrepreneur who has just raised $40 million for your WebJunkware System 2001 then Step 5 is optional. You only need a front end good enough to show to venture capitalists and then grab some screen shots for an initial public offering prospectus. If, however, you are a cringing little engineer whose ego is pathetically dependent upon producing a useful service, then you need to write a back end to send out reminders.

Almost all back ends require that a function run every day at a set hour. I like to write my back end code using the same tools as the rest of the system. In the old days, I would write the nightly sweeper or whatever as a dynamic Web page. Then I'd use the Unix cron facility to run a shell script every night (the Windows NT equivalent is the At command). The shell script would call htget (a Perl script) to grab this dynamic Web page.

AOLserver, however, has a built-in cron-like function. I prefer to use it rather than an operating system facility because it means less system administration when moving a service from one physical computer to another. Also, my code is portable across operating systems and I won't need to install Perl scripts like htget.

It should be easy:

ns_schedule_daily 5 0 bday_sweep_all

This tells the AOLserver to run the function BDAY_SWEEP_ALL at 5:00 am every day.

The first and most obvious problem with this statement is that the server might be down at 5 am. If we are careful to define BDAY_SWEEP_ALL so that it won't send out duplicates, we can just schedule it for a bunch of times during the day:

ns_schedule_daily 5 0 bday_sweep_all
ns_schedule_daily 9 0 bday_sweep_all
ns_schedule_daily 13 0 bday_sweep_all

The second problem is an AOLserver bug (my opinion) or feature (Doug McKee's opinion; he's one of the server's authors): if these statements are executed multiple times, the function will be multiply scheduled. For example, if you put these schedule requests in the Tcl directory that gets sourced on server startup then re-initialize Tcl five times (to test changes in other code), you will find that BDAY_SWEEP_ALL is called six times at 5 am.

Here's my workaround:

global bday_scheduled_p 
if { ![info exists bday_scheduled_p] } {
    ns_schedule_daily 5 0 bday_sweep_all
    ns_schedule_daily 9 0 bday_sweep_all
    ns_schedule_daily 13 0 bday_sweep_all
    set bday_scheduled_p 1    
}

I tell the Tcl interpreter that the variable BDAY_SCHEDULED_P is to be global among all the AOLserver threads. If it has not been set yet ("![info exists"), I schedule the sweep at 5 am, 9 am, and 1 pm. Then I set the flag to 1 so that subsequent loads of this Tcl file won't result in redundant scheduling.

All we have to do now is write the BDAY_SWEEP_ALL procedure. We can expect the algorithm to be more or less for all three reminder types, so we posit a basic BDAY_SWEEP procedure that takes the reminder type as an argument:

proc bday_sweep_all {} {
    bday_sweep "day_of"
    bday_sweep "day_before"
    bday_sweep "week_before"
}

Now we just have to write the sweeper per se:

proc bday_sweep {message_type} {
    # message_type can be "week_before", "day_before", "day_of"
    switch $message_type {
 week_before { set sql_clause "remind_week_before_p
and event_month=extract(month from (current_date + interval '7' day))
and event_day=extract(day from (current_date + interval '7' day))"
                      set subject_fragment "NEXT WEEK"
                      set body_fragment "next week"
                }
 day_before { set sql_clause "remind_day_before_p
and event_month=extract(month from (current_date + interval '1' day))
and event_day=extract(day from (current_date + interval '1' day))"
                      set subject_fragment "TOMORROW"
                      set body_fragment "tomorrow"
                }
 day_of { set sql_clause "remind_day_of_p
and event_month=extract(month from current_date)
and event_day=extract(day from current_date)"
                 set subject_fragment "TODAY"
                 set body_fragment "today"
                }
    }
    # grab two database connections, one to look for reminders,
    # and one for updates to mark reminders as having been done
    set db_connections [ns_db gethandle [philg_server_default_pool] 2]
    set db [lindex $db_connections 0]
    set db_sub [lindex $db_connections 1]
    # we're going to keep track of how many reminders we sent and when
    set email_count 0
    set start_stamp [database_to_tcl_string $db "return current_timestamp(0)"]
    # we use our first database connection to look for relevant reminders
    # note that $SQL_CLAUSE is set in the SWITCH statement above
    # note also the final clause that prevents us from sending two
    # reminders on the same day for the same event
    set selection [ns_db select $db "select oid,*,extract(month from event_date) as event_month, extract(day from event_date) as event_day
from bday_reminders 
using (isolation level read uncommitted)
where $sql_clause
and (last_reminded <> current_date or last_reminded is null)"]
    while {[ns_db getrow $db $selection]} {
 set_variables_after_query
       # now all the columns from the DB are accessible as Tcl vars
       # we wrap a Tcl CATCH around the call to sendmail; we don't
       # want one error (e.g., a really badly formed email address
       # stopping the entire sweep)
 if [catch { ns_sendmail $email [bday_system_owner] "$event_description is $subject_fragment" "Reminder:
$event_description ($event_date)
is $body_fragment.
This message brought to you by [bday_system_name].
If you don't want to receive these reminders in the future,
just visit [bday_system_url].
"
    } errmsg] {
 # failed to send email
       # this AOLserver API call writes an entry into the error log
 ns_log Error "[bday_system_name] failed sending to $email: $errmsg"
    } else {
 # succeeded sending email, mark the row as reminded
       # note that we are using the $DB_SUB database connection
       # so as not to disturb our sweep through the reminders
 ns_db dml $db_sub "update bday_reminders 
set last_reminded = current_date
where oid = '$oid'"
        incr email_count
    }
  }
  
  # we're all done, let's log the work
  ns_db dml $db "insert into bday_log (message_type, message_count, start_stamp, end_stamp) 
values 
('$message_type',$email_count,'$start_stamp',current_timestamp(0))"
  # we call these directly because bday_sweep_all calls this fcn
  # three times in succession and otherwise AOLserver won't allow
  # the ns_db gethandle to go through
  ns_db releasehandle $db
  ns_db releasehandle $db_sub
  
}

An interesting highlight of this sweeping function is that we add "using (isolation level read uncommitted)" to the first SELECT. Illustra uses a page locking system rather than the row level locking selectable (per table) in heavier duty RDBMSes such as Informix and Oracle. We want to make sure that we don't deadlock ourselves by trying to read from the table and update it simultaneously. The "read uncommitted" tells Illustra not to grab any locks on the table when doing the SELECT.

Oh yes, in case you were wondering, we do have to define the BDAY_LOG table:

create table bday_log (
 message_type text,
 message_count integer,
 start_stamp timestamp(0),
 end_stamp timestamp(0)
);

My Concurrency Question

Suppose I have two database connections open.

Connection 1 eventually reaches the 2000th row, maybe 1 minute later. Does connection 1 find that the FOOBAR column has the old value of 56 or the updated value of 15? And is this behavior required by the SQL standard?

Answer From My Friend Who Works at Oracle

Okay: when you execute a query, the first thing the kernel does is take note of the time you started the query. As the query progresses through the table, it will look at the SCN ("system change number") of each row, which indicates when it has last been updated. If it finds that the row was updated after the query began, the kernel goes to the rollback buffer to fetch the last value the row held _before_ the query began. So in your example, connection 1 will not see the update made by connection 2 if the update is done after the query begins.

In fact, even after the query finishes and you start another, it _still_ won't see the change...not until you explicitly commit the update. (I shouldn't say "explicitly" since closing the connection cleanly will also perform a commit.) But even if you commit while the query is running, it still won't get the new value. However, the next time you query, it will catch the change.

But wait, there's more. Let's say you don't do a "commit" in connection 2, but you enter the same query as before, this time in connection 2. While that's running, you also run the query in connection 1. The results will differ. Why? Because the query running in connection 2 KNOWS you've made an update, and sees those SCNs rather than the ones stored in the permanent table structure. [Technically: before a commit, the results of a write operation are stored in the ITL (Interested Transaction Layer). A session has access to the portion of those results performed in that session ONLY; these operations are "presumably committed" for the purposes of whatever you do in that session. It can not, however, see anything in other sessions, which is why connection 1 has no clue what you've done until you truly commit it.] This illustrates one of the challenges of concurrent programming with databases -- Lesson One: Always commit your changes as soon as you know they're permanent. (Corollary: Always have frequent backups for those plentiful occasions when they weren't and aren't reversible.)

Regarding locks [a subject which I know for a fact varies among database vendors], connection 1 will not lock the table, unless the person executing the query has explicitly done so beforehand by doing "select * from table_name for update". If so, it would be released after an explicit "commit is typed, and connection 2 would wait and/or time out while the lock is held.

Most likely, you would not want to lock an entire table for a query. The only reason I can dream of doing this is if: 1) you need to run a very long query (many hours), 2) there are tons of transactions going on simultaneously, 3) those transactions are less important than the query results, and 4) you have a VERY small amount of rollback space. Remember that point where it finds a newer SCN and fetches the old value from the rollback? If the rollback doesn't go back far enough, the query errors out with an ORA-1555 : snapshot too old" error. So as long as the entire table isn't locked, and no other connections are locking the rows you need to update, connection 2 will do its update right away.

When connection 2 does update, it will either lock the entire table or just the row it is trying to update, depending on the database's system and session parameters, e.g. if something like "ROW_LOCKING=ALWAYS" is in your oracle.ini file. I think the default is to lock the table, but don't quote me on that (it would be silly if we did, since row locking is a big thing for Oracle.)

The question of yours regarding the SQL standard: that I don't know. I've never read the ANSI standards for SQL or SQL92, but I do not believe there are any rules regarding the behavior of concurrent sessions. If there is, though, I'm fairly confident we're doing what it says. (If we weren't, we'd be fixing it, and I've heard of no intentions to do so.)

My Conclusion

Thank God I don't have to write my own RDBMS.

Case 4: The Bulletin Board

In 1982, I tried USENET news for the first time. I'd just graduated from MIT and took up my cubicle at HP Labs in Palo Alto. Though I had access to good operating systems (TOPS-20 and the Lisp Machine), for some reason I decided to ask for an account on the VAX, which was running Unix. I think it was my friend Luigi who signed me up to net.jokes. After two days, I counted 1 funny joke, 5 not funny jokes, 20 people complaining that Joke X was inappropriate for the list because it was racist/sexist/notfunny/whatever, and 80 people complaining that the complainers should not have complained.

I was so impressed with Unix and USENET that I went to talk to the sysadmin, a paunchy black-T-shirted guy who sat in his cube all day reading science fiction novels. He'd been saying for years that he would upgrade the DEC-20's operating system to Digital's latest release, but instead the stack of novels just grew larger. Even top managers didn't dare challenge his terrifying power. He knew PDP-10 assembler and they didn't. It took all of my courage to even approach the man, but I decided that conquering my terror would be worth it. "Would you please delete my Unix account?" I asked.

About a decade later, my 18th year on the Internet, my officemate at MIT showed me the alt.sex hierarchy. I had no idea . . .

With this deep affinity for and experience with bboard systems, then, who could have predicted that one day I would write my own?

Can 30 million USENET users really be wrong?

Given the high average quality of information on USENET, the obvious intelligence of the people posting, the high performance and reliability of news servers, and the quality user interface of such programs as rn (the incredibly primitive original Unix news reader), it is tough to believe that anyone would dare try improving on USENET. Nonetheless, there are a few things one can do with a database-backed bboard:

If you have enough money to buy and maintain an RDBMS, it is a fairly safe bet that you have enough money to buy Netscape News and Collabra Servers. These are reasonably full-featured products. However, they don't come with source code or much customization ability. If you want to learn how to design and build a collaboration environment that suits your users, keep reading...

Where's the Beef?

The only thing hard about building a threaded bboard system is sorting into threads. If all you have is a binary "response to" relation (i.e., that Message N is a response to Message J), then you need to calculate the transitive closure of this relation to find the thread. I like using the phrase "transitive closure" because it makes me feel that computer science graduate school really was the best 10 years of my life. Anyway, the transitive closure of a relation X is just the set of things that are related under relation X.

So what?

Well, transitive closure in the worst case (where all the messages are one thread) provably requires computing resources that grow as the cube of the number of messages ("order N-cubed" or O[n^3]). That's bad. It could be very expensive for the server and slow for the user if one had to calculate the threads every time the bboard was surfed.

My program cheats by calculating a sort key every time a message is added to the bboard. Then the thread sort is done with a simple SQL "order by" clause. This kind of sorting is "O[n log n]" or maybe 100 times faster for a moderate-sized bboard.

Step 1: The data model

create table bboard (
        msg_id          char(6) not null primary key,
        refers_to       char(6),
        email           text,
        name            text,
        one_line        text,
        message         text,
        notify          boolean default 'f',
        posting_time    timestamp(0),
        sort_key        text
);
create index bboard_index on bboard using btree ( msg_id );
create table msg_id_generator (
        last_msg_id     char(6)
);
insert into msg_id_generator values ('000000');

If you are using Oracle or Sybase, you'll need to use VARCHAR(2000) instead of TEXT for some fields and will have to decide if you want your message column to be LONG or VARCHAR (limited to 2000 bytes in Oracle).

Step 2: Legal Transactions

There is only one legal user transaction:

begin transaction;
select unique last_msg_id from msg_id_generator 
using ( lock = table, exclusive);
update msg_id_generator set last_msg_id = '00001n';
insert into bboard 
(msg_id,refers_to,email,
name,one_line,
message,
notify,sort_key,posting_time)
values
('00001n',NULL::char(6),'philg@mit.edu',
'Philip Greenspun','generating some queries for my book',
'I''m posting this so that I can look in the server.log
and harvest the queries for my book...',
't','00001n',current_timestamp(0));
end transaction;

The first portion of the transaction is a bit odd. For message IDs, I'm using 6-character strings. The next ID after "00001m" is "00001n". Each character cycles through the integers, then the capital letters, then the lowercase letters. So after "00001z" the next message is "000020", 9 messages later, "000029", then "00002A", and "00002Z" will be followed by "00002a".

This kind of sequence is rather tough to generate in standard SQL so I do it all in Tcl. However, that means that I have to read from the msg_id_generator table, do some computation, then write to the msg_id_generator table. That's the classic recipe for deadlock. We have to tell the database to lock the msg_id_generator table immediately and not start any other transactions using this table. That's why I added the "using ( lock = table, exclusive)" clause to the first SELECT. (Note: in Oracle, we'd just use "SELECT .. FOR UPDATE" to accomplish the same goal.)

Sort keys deserve some discussion. The sort key for a top level message like this is just the message ID, "00001n". For messages that respond to the top level, the sort keys are "00001n.00", "00001n.01", . . . , "00001n.09", "00001n.0A", . . . , "00001n.0Z", "00001n.0a", . . . , ""00001n.0z", "00001n.10", . . . There are 62 possibilities for each character and two characters per thread level. So any one message can have 3844 immediate responses. A response to a response will have a sort key with two extra characters, for example, "00001n.0900" would be the first response to the ninth response to top-level message 00001n.

This sort key enables me to get the messages out in thread-sorted order with

select msg_id, one_line, sort_key from bboard order by sort_key

and the level of a message can be calculated simply as a function of the length of the sort key.

The administrator is able to delete messages subject to the constraint that threads or subthreads must be removed together. In other words, it is not possible to remove a message if it has dependent messages. The user interface offers an option "remove this message and its 4 dependents?" That way we make sure that no message ever has a REFER_TO column pointing to a non-existent message.

Here's the kind of deletion that works:

delete from bboard b1
where msg_id in 
 (select b2.msg_id from bboard b2 where b2.sort_key like '0000e6.%');

Just to be on the safe side, we've used correlation names for the two different uses of the table BBOARD. Because of the way the sort keys are constructed, this query will remove the message with ID "0000e6" and the entire thread that it supports.

Step 3: Mapping Transactions onto Web Forms

I initially conceived my bboard system as a kind of runt USENET. I went to a lot of trouble to make fully threaded discussions work and was going to deliver all of that power to users. So I dragged out the big Web user interface iron: frames. I'd use the top frame to show subject lines and the bottom frame to show the full message, just like all the USENET clients (including the one built into Netscape Navigator itself). See Figures 13-9 and 13-10 for examples of USENET and my bboard software.

Figure 13-9: A view of the USENET group comp.infosystems.www.servers.unix, presented by GNUS, the news reader built into GNU Emacs (a good programmer never leaves Emacs for any reason!). Subject lines for about 450 messages are available in the top window, the full text of one posting in the bottom window. Here Alan McCoy of NASA's Langley Research Center has asked about the difference between NCSA 1.5 and Apache. Bob Steele, who apparently runs a commercial ISP, tells him that NCSA has shut down their HTTP server project and points McCoy to the FAQ at www.apache.org. Despite what you might hear from politicians, this kind of exchange constitutes the overwhelming of Network News Transfer Protocol (NNTP) traffic.

Figure 13-10: My bboard software in use at The American Society of Mechanical Engineers (http://www.asme.org/). ASME runs more than 40 public and private forums, mostly using the full threads interface. Casual inspection of the topics, though, reveals that most discussions have only two levels (question, answer, answer, answer, question, answer, answer, answer, . . .).

Then I realized that actually threads were annoying and complicated. A bboard anchored to some Web content was most likely useful as a question and answer forum. The questions would mostly be directed at the author but other readers might wish to answer as well. It would be a real user-interface convenience to be able to see the question and all the answers on one page. See Figure 13-11 for an example.

Figure 13-11: a typical question and answer in the http://photo.net/photo/ Q&A forum. You can see the question and all the responses on the same page. Note that I personally answered the question on November 4, 1996. One month later, a much more experienced photographer from Germany contributed a much better answer. Both answers were sent by e-mail to the woman who asked the question. This interoperates with the threaded discussion user interface against the same data model. That is, a posting in the threaded interface is visible in the Q&A interface and vice versa.

For a few minutes, I thought about starting from scratch to write a Q&A forum. Then I realized that I could drive the same data model from both threaded and Q&A user interfaces. Furthermore, I could make them interoperate. A posting from the threaded interface would be visible in the Q&A interface and vice versa!

How did I do it? An answer added from the Q&A interface has a subject line of "Response to $original_question_subject_line" automatically inserted. This is never displayed to users of the Q&A interface but threads users will see it in the subject frame. An answer added from the threads interface will have a custom subject line. This is displayed in bold face between answers in the Q&A interface.

Step 4: Writing Code to Process Those Forms

I'm reluctant to give you a tour of this software because it is available for download from http://demo.webho.com.

Case 5: The Bulletin Board (again)

The bboard project taught me something profound: When you build something in three days, you tend to leave out some important features. I released the bboard software to what I thought would be an adoring public. Instead they came back with

Step 1: The data model

In order to support multiple forums, I need to store some information about each forum. No longer can I assume that the person who installed the software is the moderator. I need to support distributed moderation. So we at least need to store a maintainer_name, maintainer_email, and admin_password for each group. Also, each forum needs to have a name and some links back to the static content with which it is associated. Finally there are some columns associated with Q&A presentation. Here's a new table in which information about bboard_forums is represented:

create table bboard_topics (
 topic  text not null primary key,
 backlink text, -- a URL pointing back to the relevant page
 backlink_title text, -- what to say for the link back
 blather  text, -- arbitrary HTML text for top of page
 admin_password text,
 user_password text, -- for restricted bboards
 ns_perm_group text, -- for using AOLserver's permissions system
 -- non-null if we added a custom group when creating this topic
 ns_perm_group_added_for_this_forum text,
 maintainer_name  text,
 maintainer_email text,
 subject_line_suffix text, -- e.g., 'name'
 -- send email to the maintainer when a message is added?
 notify_of_new_postings_p boolean default 't',
-- HTML encouraging user to search elsewhere before posting
 pre_post_caveat  text,
 -- present the Q&A or threads from the top level list?
 q_and_a_primary_p boolean default 'f'
 -- stuff just for Q&A use
 -- note that the sort_order has an integrity constraint that
 -- prevents it from containing values other than ‘asc' and ‘desc'
 q_and_a_sort_order text default 'asc' not null 
             check (q_and_a_sort_order in ('asc','desc')),
 q_and_a_categorized_p boolean default 'f',
 -- defines what is considered "new" and therefore displayed above
 -- that older, categorized, questions
 q_and_a_new_days integer default 7,
 -- at posting time, do we ask the user to suggest a category?
 q_and_a_solicit_category_from_user boolean default 't',
 -- do we allow users to type in new categories?
 q_and_a_categorization_user_extensible_p boolean default 'f'
);

Now that bboard messages can be categorized, we need to keep track of what the legal categories are:

create table bboard_q_and_a_categories (
 topic  text not null references bboard_topics, 
 category text not null,
 primary key (topic, category)
);

Note the integrity constraint that rows here must have a TOPIC column value that is present in BBOARD_TOPICS. Also note the primary key constraint that prevents two rows from having the same TOPIC and CATEGORY.

We'll have to add two columns, TOPIC and CATEGORY to the BBOARD table:

create table bboard (
 msg_id  char(6) not null primary key,
 refers_to char(6),
 topic  text not null references bboard_topics,
 category text, -- only used for categorized Q&A forums
 email  text,
 name  text,
 one_line text,
 message  text,
 notify  boolean default 'f',
 posting_time timestamp(0),
 sort_key text
);

Note that the CATEGORY column is only relevant for top level posts in Q&A forums.

It was always my intent that the bboard system support casual collaboration by sending posters e-mail copies of responses to their messages. Also, the forum maintainer can opt to receive immediate e-mail notification of new postings. However, a lot of users asked for periodic summaries of postings. Here's the table that registers their interest:

create table bboard_email_alerts (
 email text not null,
 topic text not null references bboard_topics,
 -- we set this to 'f' if we get bounces 
valid_p boolean default 't', 
 frequency text,
 keywords  text
);

Fundamentally this registers EMAIL's interest in TOPIC. Frequency is one of "instant", "daily", "Monday/Thursday", or "weekly". If I sign up for a "weekly" alert for the "photo.net" topic then I'll get a summary prepared early Monday morning. If I add a string to the KEYWORDS column, I'll get only those new messages that also match in the full-text index (see below). For example, I can restrict any frequency alert to only return messages related to "nikon". Instant alerts are e-mailed individually as new messages are posted.

We need a housekeeping table to implement the alerts. This contains the last time that alerts were mailed, used when querying for new messages, plus a count of all the messages that have been sent.

create table bboard_email_alerts_last_updates (
 weekly timestamp(0),
 weekly_total integer,
 daily timestamp(0),
 daily_total integer,
 monthu timestamp(0),
 monthu_total integer
);

Step 2: Legal Transactions

Not much has changed. If the user provides a category when posting a new question from the Q&A interface, we insert that into the BBOARD table. There are a bunch of new admin functions to manage categorization. There are a bunch of obvious transactions on the alerts table when users add and disable alerts.

Step 3: Mapping Transactions onto Web Forms

Nothing too surprising here. I'll let the figures do the talking.

Figure 13-12: Top of the main administration page for the photo.net forum. Note that the administration can choose whether or not the Q&A interface is primary. There is a separate administration section for items that are only relevant to the Q&A code (see Figure 13-13).

Figure 13-13: Q&A forum administration for photo.net. The top level threads are listed below the "forum personality" slots from the bboard_topics table. Each thread can be deleted or categorized by the administrator.

Figure 13-14: Forms to let users specify new e-mail alerts and disable old ones. Instant notification is done by a thread that stays alive following each new posting to the forum. Daily, weekly, and Monday/Thursday notifications are done by a procedure scheduled inside the AOLserver.

Step 4: Writing Code to Process Those Forms

I noticed that a lot of naïve users were posting the same message two or even three times. Hence I added a check to insert-msg.tcl to see if there was already a row with the same TOPIC, ONE_LINE, and MESSAGE:

if [catch { set n_previous \
                [database_to_tcl_string $db \
                                        "select count(*) from bboard
where topic = '$QQtopic'
and one_line = '$QQone_line'
and message = '[bboard_convert_plaintext_to_html $QQmessage]'"]} \
    errmsg] {
    ns_log Notice "failed trying to look up prev posting: $errmsg"
    } else {
    # lookup succeeded
    if { $n_previous > 0 } {
 incr exception_count
 append exception_text "<li>There are already $n_previous 
messages in the database with the same subject line and body.  
Perhaps you already posted this?  Here are the messages:
<ul>
"
        set selection [ns_db select $db \
                             "select name, email, posting_time 
from bboard 
where topic = '$QQtopic'
and one_line = '$QQone_line'
and message = '[bboard_convert_plaintext_to_html $QQmessage]'"]
        while {[ns_db getrow $db $selection]} {
     set_variables_after_query
     append exception_text "<li>$posting_time by $name ($email)\n"
 }
 append exception_text "</ul>
If you are sure that you also want to post this message, then back up 
and change at least one character in the subject or message area, 
then resubmit."
    }
}

Note that this check doesn't have too much confidence in itself. It offers the user a list of the previous messages and ultimately relies on human intelligence. Input validation is nice but there is always a risk that it will outsmart itself and squash legitimate postings.

Case 6: The Bulletin Board (full-text indexed)

When the LUSENET software is installed at a site with the Illustra PLS Blade for full-text indexing, I take advantage of it by adding a PLS index on the bboard table:

create index bboard_pls_index on bboard using pls
( one_line, message, email, name );

This means that all the words in all four of the named columns will be indexed. How do you query using the index? Here's an example:

select msg_id, one_line
from PlsQueryOrdered('bboard_pls_index', 
                     'Canon macro lenses')::setof("Table bboard")
where topic='photo.net';

This user is looking for a discussion of "Canon macro lenses" in the http://photo.net/photo/ bboard. The function PlsQueryOrdered returns a set of rows from the "bboard_pls_index" that match this query, ordered by relevance. For obscure reasons known only to Illustra, these must be explicit typecast to "set of rows from the bboard table" with

setof("Table bboard")

After the cast, we can do anything we want with these rows. We can restrict the SELECT with WHERE clauses, in this case to messages from the "photo.net" topic. We can restrict the columns retrieved with the FROM list. We can JOIN with another table.

Oracle ConText Instead

With Oracle ConText, the syntax is a bit hairier for creating the index:

execute ctx_ddl.create_policy('BBOARD_POLICY', BBOARD.MESSAGE); 
execute ctx_ddl.create_index('BBOARD_POLICY');

but then somewhat simpler for queries:

select * from BBOARD where contains(MESSAGE, 'Canon macro lenses')>0;

Although in general Illustra/PLS is clever enough that I don't have to even try to be clever, I did do one intelligent thing in my implementation of full text search for my bboard system. In the Q&A interface, when the user's query matches a response to a message, clicking on the search hit takes the user to the standard page with the full question plus all the responses. In the threads interface, the user gets the option of seeing the full thread or just the message in question.

Figure 13-15: The brilliant Illustra/PLS Blade finds all the messages relevant to the query string "Canon macro lens." My software makes each search hit a link to the full question and answer page so that users see answers in context. The subject line "Response to Canon 100/2.8 Macro?" appears several times because several of the six responses to this message are relevant to the query string. I could reformulate the query to eliminate duplicates but in some ways the repetition is good because it encourages the user to look at that thread.

Case 7: Classified Ads

Once you have an RDBMS that can index the strings, it would be a crime if you didn't build a good classified ad system. I've built about five classified systems and think I've learned one or two useful things.

First is a publishing philosophy point. People who come from the dead trees world always want to charge money for on-line classifieds. Their dead trees publication makes a huge profit selling classifieds to readers and they think that they can cup their palms then lap up similar profits on-line. In meetings with such folks, I'd point out the the existence of USENET. Anyone who is on-line has access to hundreds of free classified services, including USENET newsgroups. Why should they pay to be listed in a Web classifieds system that surely gets less traffic than ba.market.vehicles (1000 messages/month; ba = "Bay Area") or rec.photo.marketplace (5000 messages/month)?

It then occurred to me that there was a deeper point to be made. Given that Web publishers pay nothing for paper and printing, someone placing a classified ad is doing the publisher a big favor. The ad placer is contributing content. The publisher can use that content to build traffic, sell ads, and lure readers into other services. If anything, the publisher should be paying advertisers to post classifieds.

That's Useful Thing 1: Don't think about charging money unless you have some kind of monopoly.

My second epiphany is that full-text search engines work well enough that you don't need much structure in your data model. Typical RDBMS-backed Web sites appear to have been developed by 15-year-olds who unpacked their Oracle boxes and said "cool, I can have lots of different columns and then limit my searches based on values in those columns." The casual user of such a site is very quickly confronted with a 15-input search form. Fifteen inputs to fill out? All the user wanted to do was browse around and see what was in the system. After making those fifteen choices, the complicated query is submitted to a relational database management system that works heroically to produce . . . zero matches.

This design might have been great if there were 275,000 ads in the database, but not if there are only 300. For some kinds of classifieds systems, personals for example, part of the fun is browsing ads that aren't 100% relevant. Forms that encourage overconstrained searches rob users of their fun. Furthermore, the heavy structure of the data model imposes a burden on users when they are entering ads.

Useful Thing 2: Don't add columns to your data model if you won't be allowing users to constrain searches based on those values.

At this point, you're probably nodding your head and wondering how someone who belabors the obvious to the extent that I do can get published on real dead trees. Yet in February 1997 I was asked by a publisher, call them Naif News, to evaluate a bunch of commercial classified ad systems. Naif News publishes more than 100 small newspapers so they were being wooed heavily. Millions of dollars were about to change hands. The vendors already had already sold systems to dozens of newspapers and were on-line with graphics-heavy user interfaces.

What did I find? All of these vendors had fallen into the "15-year-old with Oracle" trap. It was very difficult to construct a query that would return any ads at all, even when they had thousands of ads in their database. Furthermore, their structured systems weren't able to use the structure to do simple things such as return Chevrolet and Chevy ads to someone querying for "Chevy" in the manufacturer column (any full-text search engine would have figured out for itself that Chevy and Chevrolet are related words after seeing a few hundred ads).

Enough about how stupid everyone else is . . . let's talk about how stupid I am.

Classifieds Idea: Categorize

My generic classified ads system depends first upon categorization of ads. User interface research back in the 1960s concluded that humans are very bad at formulating boolean queries, for example "Nikon AND NOT microscope AND NOT ‘integrated circuit mask stepper'". The Holy Grail of information retrieval was supposed to be the full-text search engine that is so common today. Just type in a string that reflects your curiosity, look at some results, then refine by picking a good result and asking for "more like this."

Why then categorize? Why bother to make users of my http://photo.net/photo/ classifieds say whether they are placing an ad for Underwater Equipment or Darkroom? Why can't the search engine direct people to appropriate ads?

It turns out that users have trouble coming up with that initial query string. It is a lot easier to browse than to think. It is also a lot more efficient for the RDBMS to grab all ads in the Underwater category rather than send PLS off looking for "flash for my Nikonos V". There are only about 10 ads in the Underwater section most of the time so why not let the user see all of them?

Classifieds Idea: Automatic Management

I don't want to spend the rest of my life managing my classified ad system. Users post ads using forms. Users edit and delete their obsolete ads by typing in a password. If the password isn't the same as what they typed when they posted the ad, my software offers to e-mail it to them.

How well does this work? My system has handled about 4100 ads so far. I've gotten about 10 e-mail messages asking me to edit ads. In all but one or two cases, the user was able to edit his own ad after I pointed him to the forms.

Classifieds Idea: E-mail Alerts

With only a few hundred current ads in my database, I have to do something to distinguish myself from rec.photo.marketplace with its 1000 current ads at any time. Starting a couple of years ago, I offered to mail out e-mail alerts to people who were interested in seeing all the ads, just ads in a particular category, or just ads that match a query string (according to PLS). There are 205 people currently requesting alerts (options are daily, Monday/Thursday, and weekly as with the RemindMe system).

Classifieds Idea: Show Off New Ads

One-line summaries of the newest ads should be displayed on the very first page of the system. Users shouldn't have to click to see ads.

Classifieds Idea: Auction

One thing that Web classifieds can do that newspaper classifieds can't is turn into an auction. Users of my system can choose at posting time whether or not they'd like Philip's Classifieds to record bids. A bid is instantly e-mailed to the seller and is also displayed underneath the ad for subsequent displays (see Figure 13-16).

Figure 13-16: I'm too lazy to add the figures from my hardcopy edition. So just go over to http://photo.net/photo/ and surf around. My generic classified ad system can be configured to run auctions for advertised items. Eric Zarakov is auctioning a delicious Canon EOS 300/2.8L lens. They cost about $4600 new. Mr. Zarakov is asking $3800 or best offer. Over 8 days, he has received six bids (the highest so far is $2750). [Note: this is not the most expensive item sold in the http://photo.net/photo classifieds; that distinction belongs to a $20,000 Zeiss 300mm lens. The seller was so grateful that he wanted to send me some money. I told him that the classifieds were a free service. He insisted. I asked him to write a check for whatever amount he deemed appropriate to Angell Memorial Animal Hospital. I didn't hear from him for a few weeks. Eventually I Angell got the check. It turned out that the seller had a millionaire relative who'd died and left all of her money to an animal charity, rather disappointing the family. My request was a painful reminder!

Steps 1-4

My classified system is a bit too complicated to describe in a book. However, I think it is worth showing off a few points, starting with the cover page (see Figures 13-17 and 13-18).

Figure 13-17: Top level page for a domain in my generic classifieds system. Note that the user gets a tremendous amount of information right on the first page: one-line summaries of recent ads, a listing of the available categories, a count of the number of ads in each category. See Figure 13-18 for the bottom of the page.

Figure 13-18: The bottom portion of the first page for a domain in my generic classified systems (see Figure 13-17 for the top). Note that only here is the full text search engine made available. Users can browse just by clicking the mouse on the links above. Note also that I provide a link to the AltaVista search engine. My form targets a little script of mine that tacks on additional syntax to restrict AltaVista's search to the rec.photo.marketplace newsgroup. So a user typing "Canon macro lens" on my page results in a 302 redirect to http://www.altavista.digital.com/cgi-bin/query?pg=aq&what=news&fmt=.&q=newsgroups:rec.photo.marketplace+AND+Canon+macro+lens

The Cover Page

First, here's a table that defines an ad domain. I run classifieds for photo.net, Web Tools Review, and Web Travel Review all from the same server (http://classifieds.photo.net/gc/) and from the same tables. Each of these publications is a domain as far as my generic classified system is concerned. Here's what I represent for each domain:

create table ad_domains (
 domain   text not null primary key,
 maintainer_name  text,
 maintainer_email text,
 maintainer_telephone text,
 display_telephone text, -- maybe a customer service 800 number
 backlink_url  text, -- if this is part of a bigger service
 backlink_title  text,
 blurb   text, -- top of the front page
 blurb_bottom  text, -- top of the front page
 insert_form_fragments text, -- HTML form INPUTs
 ad_deletion_blurb text, -- printed out after an ad is deleted
 default_expiration_days integer default 100,
 -- how many levels of categorization (if they just 
 -- use primary_category then it is 1)
 levels_of_categorization integer default 1,
 user_extensible_categorization_p boolean default 'f'
); 

Here's the Tcl script for the top level page that pulls data from the ad_domain table to generate the user interface, then pulls information from the ads table.

#
# parameters
# this probably should be a column in the ad_domains table but it
# isn't; so a change in the .tcl file will affect all the domains
set how_many_recent_ads_to_display 5
set_form_variables
set_form_variables_string_trim_DoubleAposQQ
# domain
set db [ns_db gethandle]
set selection [ns_db 1row $db "select * from ad_domains where domain = '$QQdomain'"]
set_variables_after_query
# now all the properties of the domain are available as Tcl local vars
# we're doing some expensive database queries; we don't want users
# staring at a blank window, so return headers and some text
ReturnHeaders $conn
ns_write $conn "<html>
<head>
<title>$backlink_title Classified Ads</title>
</head>
<body bgcolor=#ffffff text=#000000>
<h2>Classified Ads</h2>
for <a href=\"$backlink_url\">$backlink_title</a>
<hr>
$blurb
<hr>
\[ <a href=\"place-ad.tcl?domain=$domain\">Place An Ad</a> |
<a href=\"edit-ad.tcl?domain=$domain\">Edit Old Ad</a> |
<a href=\"add-alert.tcl?domain=$domain\">Add/Edit Alert</a>
\]
<h3>Recent Ads</h3>
<ul>
"
# we only want the first 5 ads, but there is no clean way to specify 
# that in SQL. So we just ask for all the unexpired ads in this domain
# ordered by descending classified_ad_id
set selection [ns_db select $db "select classified_ad_id,one_line
from classified_ads
where domain = '$QQdomain'
and (current_date <= expires or expires is null)
order by classified_ad_id desc"]
set ad_counter 0
while {[ns_db getrow $db $selection]} {
    incr ad_counter
    if { $ad_counter > $how_many_recent_ads_to_display } {
       # we've already displayed enough recent ads, so we shouldn't
       # read anymore rows from the preceding SELECT.  However, we 
       # can't just walk away from an "open cursor". That can leave
       # all kinds of garbage around in the RDBMS client and server.
       # We call the AOLserver function "ns_db flush" which will in
       # turn inform the RDBMS server that no more rows are wanted.
 ns_db flush $db
       # call a Tcl function to break out of the while loop
 break
    }
    # we still have some recent ads to display
    set_variables_after_query
    ns_write $conn "
<li><a href=\"view-one.tcl?classified_ad_id=$classified_ad_id\">
$one_line
</a>
"
}
# we're done displaying the recent ads
ns_write $conn "<p>
<li><a href=\"domain-all.tcl?domain=$domain&by_category_p=f&wtb_p=f\">
All Ads Chronologically</a>
(<a href=\"domain-all.tcl?domain=$domain&by_category_p=f&wtb_p=t\">
including wanted to buy</a>)
</ul>
<h3>Ads by Category</h3>
<ul>"
# this is a very expensive database query using the GROUP BY feature
# of SQL.  We are asking the RDBMS to gather up the ads, group them 
# by PRIMARY_CATEGORY, count the ads in each category, then return
# the count and the category name, ordering the returned rows by
# UPPER( PRIMARY_CATEGORY ) so that capitalization doesn't affect
# sorting. This query necessarily involves the RDBMS scanning the
# entire table.  It can still be very fast if you have a smart RDBMS
# on a computer with a lot of RAM. It is even tolerably fast with
# Illustra (a very stupid RDBMS when it comes to caching) because my
# table is pretty small and my RDBMS server box has 4 CPUs
# SQL nerd note 1: if you don't have OR EXPIRES IS NULL then you won't
# get any rows where EXPIRES is NULL.  A profound concept, to be sure,
# but one that escaped me at first.
# SQL nerd note 2: if you don't have a good name for something in the 
# FROM list, you can just refer to it by position.  Here ORDER BY 3 
# says "order by the 3rd thing in the FROM list" which is to say
# ORDER BY UPPER(PRIMARY_CATEGORY)
set selection [ns_db select $db "select count(*) as count,
primary_category as category,
upper(primary_category)
from classified_ads
where domain = '$QQdomain'
and (current_date <= expires or expires is null)
group by primary_category
order by 3"]
while {[ns_db getrow $db $selection]} {
    set_variables_after_query
    if { $count == 1 } {
 set pretty_count "1 Ad"
    } else {
 set pretty_count "$count Ads"
    }
    ns_write $conn "<li>
<a href=\"view-category.tcl?domain=[ns_urlencode $domain]&primary_category=[ns_urlencode $category]\">
$category
</a> 
($pretty_count)
"
}
ns_write $conn "<p>
<li>
<a href=\"domain-all.tcl?domain=$domain&by_category_p=t&wtb_p=t\">
All Ads by Category
</a>
</ul>
<form method=post action=search.tcl>
<input type=hidden name=domain value=\"$domain\">
or ask for a full text search:  
<input type=text size=30 name=query_string>
</form>
$blurb_bottom
<hr>
Note: this service is vaguely part of 
<a href=index.tcl>Philip's Classifieds</a>,
all of which share software and a relational database.
<hr>
<a href=\"mailto:$maintainer_email\">
<address>$maintainer_email</address>
</a>
</body>
</html>"

Note how the returned document will adopt the personality of the ad domain. The signature at the bottom is $maintainer_email (from the database). The links up at the top are to $backlink_url (from the database). There are blurbs at the top and bottom that come straight from the RDBMS. Go over to http://photo.net/photo/ right now and see how responsive the pages are. It is surprisingly fast to pull the ad_domain row out of the RDBMS. I've not yet been tempted to do any clever caching in Tcl global variables.

The Auction System

Recording bids is straightforward:

create table classified_auction_bids (
 classified_ad_id integer not null references classified_ads,
 bid   numeric(9,2),
 currency  text default 'US dollars',
 bid_time  timestamp(0),
 location  text, -- e.g., 'New York City'
 email   text,
 name   text
);

Note the integrity constraint that this table can't contain bids for ads that don't exist in the classified_ads table. Assuming the RDBMS is behaving itself, we won't be able to delete an ad without first deleting bids for that ad. Here is a legal transaction in the augmented data model:

begin transaction;
delete from classified_auction_bids where classified_ad_id = 17;
delete from classified_ads where classified_ad_id = 17;
end transaction;

Summary

Here's what you should have learned from reading this chapter:

If you didn't like staring at my software, you'll be pleased to learn that the next two chapters take us back to the tone and level of the beginning of the book.

Note: If you like this book you can move on to Chapter 14.


philg@mit.edu