Who's Been Peeking At My Clickstream?

Kevin Florey


Paper for MIT 6.805/STS085: Ethics and Law on the Electronic Frontier, Fall 1995
As anyone who has watched internet-related stocks rocket into the stratosphere this past year can attest, the WWW is big business. Modern-day prospectors line up shoulder-to-shoulder, anxious to strike gold on the Electronic Frontier. Yet as these businesses invest significant sums of money in both websites and sponsored links to these websites, they are left with many questions. Perhaps chief among these is: "Who is actually using these websites, and what are these users doing when they are there?" To answer these questions, these modern-day prospectors are panning for information in the "clickstreams" which people leave behind as they surf the WWW. As various tools and services are developed to increase the information gleaned from these "clickstreams", significant ethical, legal, and technical issues emerge.

This paper will first explore the general categories of commercial interests who are using the WWW and what information they desire to collect about WWW users. Next we will examine what information is currently being collected concerning an individual's WWW usage. We will also provide an overview of the software and services available to facilitate this collection. Next we will examine the collection of this information from a user's perspective: what are the potential benefits and risks of having this information available? We will then investigate different U.S. and EC laws and guidelines which are in place to regulate this information gathering. Finally, we will describe different protocols which are being proposed by the WWW Consortium and others to standardize this information collection--with a particular emphasis on their privacy protections. We will conclude by recommending an approach for responsible legislative and industry standards and principles to govern this information gathering.

COMMERCIAL USAGE OF THE WWW

Many businesses are setting up shop on the electronic frontier with the hope of striking gold. Hoffman, Novak, and Chatterjee identify six different categories of commercial websites: On-line Storefronts, Internet Presence (Flat Ad, Image and Information) Sites, Content (Fee-based, Sponsored, Searchable Database) Sites, Malls, Incentive Sites, and Search Agents. (2) Each of these categories has distinct interests in collecting information about WWW usage.

On-line Storefronts and Malls

On-line Storefronts are websites where users go in order to buy a product or service. Malls consist of collections of these On-line Storefronts. Operators of these sites are interested in both the demographics (income level, race, social class, hometown, etc.) and identification of the WWW users who visit their site, and in the behavior (which pages they visit, what paths they take, etc.) of these users when they visit. By understanding the demographics of the users, the website owner can adjust the products being offered as well as alter the content of the mall based on the needs of the particular demographic segment which the site is attracting. The owner of a storefront would also need to know the identity of its users; for example, it needs to know to whom to send the goods which have been purchased. By understanding the behavior of the users, the designer of the website can improve its layout; for example, the designer might eliminate pages which are not visited or which cause a user to leave the site. Finally, the operator of an on-line storefront would be very interested in how the user of the site found the storefront (the referring page).

Internet Presence (Flat Ad, Image and Information) and Incentive Sites

Internet Presence Sites are websites put in place to inform (excite) consumers about a company's products or services. These sites serve as interactive, virtual advertisements for a company or product. In addition to building brand image, some of these sites can offer detailed information on a product. Incentive Sites are similar to Internet Presence Sites, but are put in place to "pull users to a commercial site behind it." (2) Again, knowledge of the demographics of the site's users is critical to the operator of Internet Presence and Incentive Sites. Knowing the demographics of the user base would not only allow the operator of these sites to understand the site's effectiveness at reaching the targeted audience, but could also provide information to facilitate further targeted marketing efforts towards these users. These website owners, for the same reasons as the owners of On-line Storefronts, would also be extremely interested in tracking their users' behavior and identifying the referring page which users followed to get there.

Content (Fee-based, Sponsored, Searchable Databases) Sites

Content sites are sites which have been set up to provide users with content which will hopefully make them return again and again. Examples include on-line newspapers (Wall Street Journal, etc.) and on-line stock-quote services. These sites can be fee-based or sponsored. Fee-based sites charge their users based on how much they use the site. Sponsored content sites "sell advertising space to reduce or eliminate the necessity of charging fees to visitors." (2) Advertisements on a sponsored site generally consist of catchy links to the advertiser's presence site. Both Fee-based and Sponsored sites have needs for demographic, behavior and referring page information for similar reasons as mentioned above. Sponsored content sites, however, have important additional needs for demographic and behavior information: to encourage sponsoring companies to advertise on their site.

The Wall Street Journal reports that in the fourth quarter of 1995 "270 companies spent $12.4 million to advertise on 175 sites." (23) This web-based advertising revenue is expected to grow to $1.4 billion a year by 1998. Traditional measures of advertising effectiveness, however, have been difficult to adapt to cyberspace. Sponsored Content sites are trying to create new measures for this new medium which will map onto traditional advertising measures such as Gross Exposures (total number of times anyone sees an advertisement), Reach (estimated percentage of the target audience exposed to the advertisement) and Demographics (traditional demographic profiles of the advertisement's readership). They would then use these measures to price their service to those wishing to place advertisements. We will discuss these possible new information sources in detail below in "Panning for Information".
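The two countable measures named above can be illustrated with a small sketch. It assumes, purely for illustration, that each advertisement view is recorded with some viewer identifier and that the target audience is known as a set of such identifiers; the data and the function names are hypothetical and come from no cited source.

```python
def gross_exposures(ad_views):
    """Total number of ad views, counting repeat views by the same viewer."""
    return len(ad_views)


def reach(ad_views, target_audience):
    """Fraction of the target audience that saw the ad at least once."""
    if not target_audience:
        return 0.0
    exposed = {viewer for viewer in ad_views if viewer in target_audience}
    return len(exposed) / len(target_audience)


# Hypothetical data: one viewer identifier per recorded ad view.
ad_views = ["ann", "bob", "ann", "cho", "ann"]
target_audience = {"ann", "bob", "dee", "eve"}

print(gross_exposures(ad_views))         # 5
print(reach(ad_views, target_audience))  # 0.5
```

Note that Reach, unlike Gross Exposures, cannot be computed unless repeat views by the same person can be recognized, which is exactly the identification problem discussed in the sections that follow.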

Search Agent Sites

Search agent sites are sites which allow users to search the internet for websites which match their specific interests. Many of these sites are sponsored and thus have needs for demographic information similar to those of Content sites. By examining the most frequently executed search strings, however, search agent sites can potentially get a much better understanding of their user base. Clearly, such marketing profiles could become sources of income in and of themselves. (For example, how much would Playboy be willing to spend to get the e-mail addresses of all of those who searched for pornography on a daily basis for the past 6 months? Obviously, there are significant ethical and privacy concerns here. We will discuss these later.)

PANNING FOR INFORMATION IN THE CLICKSTREAM

We have seen that companies have specific business needs for various information concerning the usage of their website. The information required includes: demographics, behavior (tracking), identification, referring page, as well as specific measures for advertising effectiveness. Some of this information can be extracted from the date and time-stamped logs -- the "clickstream" -- which websites store in order to audit their users' mouse-clicks through the system. In general, however, these logs "are networking-centric, not marketing-centric. They are concerned with how many requests were made at what times, how many bytes were transferred and generally to where, and the like." (4) Marketers seek five measures from the clickstream: Hits, Pages, Visits, Users, and Identified Users.

Hits

Hits is an often used measure of how frequently a website is visited. "Hits equal the number of files accessed...'file accesses' has been defined...as 'all of the files available by hyperlink from the accessed page' ('clickable elements'), whether or not the user ever clicks on those hyperlinks and thereby opens those pages on his or her screen." (1) This measure has been widely derided as meaningless due to its counting of links which may or may not ever be followed. Nonetheless, some have used hits to establish advertising rates: "For now, the hit remains the main standard by which service providers gauge audience size and convey to advertisers the size of their market. Recently, for example, in a news release referring to its Web site, Time Warner Inc. said that 'last week, usage of Pathfinder service grew to more than one million hits per week, making it one of the most popular sites on the internet.'" (20) Hits are quite easy to obtain from most clickstreams.

Pages, Visits, Users and Identified Users

Getting an understanding of how many users visit a website, what their demographic (psychographic) profile is and how long they lingered over a given page is considered very important to most marketers. Pages is "defined as the gross number of pages downloaded from X site or domain during Y time." Pages, however, may be deceiving as well because: "downloaded pages may each occupy several screens, and there is no guarantee that the user scrolls down through the entirety of every page downloaded." (1)

Another measure, visits, is defined as the "gross number of occasions on which a user looked up X site or domain during Y time." The next measure, users, is defined as "the number of different people visiting X site or domain during Y time." Finally, identified users is defined as "demographic measures of visits or users relating to X site or domain during Y time." (1) Visits, users, and identified users all depend on the website owner being able to uniquely identify each website visitor. This is not easy to do with existing clickstream data.
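A rough sketch of how such measures might be derived from a server log follows. Several things here are assumptions rather than anything the cited guidelines specify: the log layout, the use of the requesting host as a stand-in for a "user", the 30-minute idle gap that separates one visit from the next, and the simplification that a hit is any file request while a page is any request for an .html file.

```python
from datetime import datetime, timedelta

# Hypothetical clickstream records: (visitor key, timestamp, requested file).
# The visitor key is whatever the server can observe, such as a host name,
# and may not correspond to a single person.
LOG = [
    ("host-a", "1995-12-01 09:00:00", "/index.html"),
    ("host-a", "1995-12-01 09:00:01", "/logo.gif"),
    ("host-a", "1995-12-01 09:05:00", "/news.html"),
    ("host-b", "1995-12-01 09:10:00", "/index.html"),
    ("host-a", "1995-12-01 14:00:00", "/index.html"),
]

VISIT_GAP = timedelta(minutes=30)  # assumed idle time that ends a visit


def summarize(log, visit_gap=VISIT_GAP):
    """Count hits, pages, visits, and users in a list of log records."""
    hits = len(log)  # every file request counts as one hit
    pages = sum(1 for _, _, path in log if path.endswith(".html"))
    last_seen = {}
    visits = 0
    # Timestamps in this format sort correctly as plain strings.
    for visitor, stamp, _ in sorted(log, key=lambda row: row[1]):
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        previous = last_seen.get(visitor)
        if previous is None or when - previous > visit_gap:
            visits += 1  # first request, or first after a long idle gap
        last_seen[visitor] = when
    return {"hits": hits, "pages": pages, "visits": visits, "users": len(last_seen)}


print(summarize(LOG))
# {'hits': 5, 'pages': 4, 'visits': 3, 'users': 2}
```

The "users" figure is only as good as the visitor key. If many people share one host, they collapse into a single user, and one person connecting from two hosts is counted twice.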

Identifying users is particularly difficult because users can log on anonymously from many different points of origin, including proxy servers such as AOL's. Many sites have tried to force users to register so as to be able to identify them uniquely; but on the whole "web marketers are apprehensive about forced registration because it alienates some users or at least deters them from taking a peek at new areas. (And some folks have trouble recalling their gym locker combinations, not to mention dozens of Web passwords they don't even use that often.)" (15)

The MPA Internet Guidelines (22) suggest some specific reports which could then be derived from the above metrics:

"1. General information for specific periods of time (by day, by week, or by month):

2. For Specific Ads

3. Demographics

Different software packages (at times combined with auditing services) have been developed in an attempt to address the issue of user-identification. These services use information concerning users, visits, and identified users to produce reports consistent with the above guidelines. Additionally, these services frequently offer the tracking and referring-page information which we identified earlier as being valuable to commercial website operators. These services utilize either active or passive approaches to information gathering.

COMMERCIAL SOFTWARE AND SERVICES

Conaghan suggests two categories of commercial services to assist website operators in understanding their site's traffic: active and passive tracking. "Active tracking proponents (Newshare [Clickshare], Next Century Media) point to the potential for obtaining detailed demographics on end-users. By offering users something of value in return for their voluntary exchange of demographics, commercial services on the web can fine tune their product and service offerings." (17) Active tracking has not yet made much progress because very few people are willing to register voluntarily. Passive tracking is more easily accomplished in that it does not involve user registration. Passive tracking solutions have been implemented using software on the web servers themselves or solely on the browsers. Passive tracking is accomplished through the use of Netscape "Cookies" (or other similar tokens) or the rewriting ("munging") of URL requests. (An overview of the different technical implementations appears in the "Standards for Usage Tracking and Demographic Reporting" section below.)

Active Tracking

Newshare's Clickshare product has information-gathering and reporting features which are representative of Active Tracking services. Businesses register to become Clickshare sites. A user then designates any one of these businesses as the "home-site". Users register with this home-site only when they first use the system, after which they are free to surf any of the other Clickshare sites. When they first register, significant demographic information is solicited (name, address, income, marital status, news preferences, etc.) and stored in the home-site's secure database. After registering, the user is given a userid and password which they then use to log into any other Clickshare site.

Clickshare then tracks the details of each user's travels across all the Clickshare-member websites. Each user's detailed clickstream across these member websites is stored and maintained in Clickshare's secure database. Clickshare then uses these clickstreams, combined with the demographic data which it retrieves (temporarily) from each home-site, to develop reports on the use of each Clickshare website. (Note: Clickshare does not know which Clickshare user-id corresponds to which real name; only the home-site maintains this link.) These reports contain the tracking, usage, and demographic information described above. These trails can be audited by third-party vendors to confirm site usage claims.
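The separation of identity from clickstream described above can be pictured with a short sketch. This is not Clickshare's implementation, whose internals are not described in the sources; the record layouts, field names, and functions are hypothetical and show only the general idea that the party holding the clickstream sees a user-id and demographic fields but never a name.

```python
from collections import Counter

# Hypothetical home-site database: only the home-site links a user-id to a name.
HOME_SITE = {
    "u1": {"name": "Alice Example", "income": "50-75k"},
    "u2": {"name": "Bob Example", "income": "25-50k"},
}

# Hypothetical central clickstream: (user-id, member site, page requested).
CLICKSTREAM = [
    ("u1", "site-x", "/sports"),
    ("u2", "site-x", "/sports"),
    ("u1", "site-y", "/news"),
]


def demographics_for(user_id):
    """What the home-site releases: demographic fields, never the name."""
    record = HOME_SITE[user_id]
    return {field: value for field, value in record.items() if field != "name"}


def income_report(site):
    """Count distinct visitors to a member site by income bracket."""
    visitors = {user_id for user_id, member, _ in CLICKSTREAM if member == site}
    return Counter(demographics_for(user_id)["income"] for user_id in visitors)


print(income_report("site-x"))
```

Even under this arrangement, a sufficiently detailed clickstream tied to a stable user-id may itself be revealing, which is part of the privacy concern raised later in this paper.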

Passive Tracking

Cortex Group's Sitetrack 1.0 is representative of passive tracking solutions. According to the Sitetrack literature,

"Sitetrack is an add-on to the Netscape server [of whichever commercial website buys the software] which allows it to track users as they move through the website...the core of SiteTrack is a Netscape API library which the Netscape web server loads at start-up. This library wraps a 'shell' around the server in order to provide it with the ability to track users and dynamically alter pages." (19)

The Sitetrack literature then continues its description of how it tracks users in the system:

"There are two different ways the server can track users, either by using tokens or by using client-side cookies. SiteTrack is configured to use cookies or tokens, depending on which is more appropriate for a given situation. Presently, if a client supports cookies then cookies are used. In all other situations, tokens are used. SiteTrack makes all of these decisions transparently...The primary difference between cookies and tokens is that cookies are completely invisible to the user. The user will never see strange URLs or marked up links in pages. In fact, unless your pages indicate it, they will never know that they are being tracked at all." [emphasis added] (19)

While Passive Tracking can track user sessions, it is difficult to identify users. Sitetrack literature suggests one approach to this issue:

"There are two sorts of users in the SiteTrack system: short term and long term. The short term user exists for only one session and is identified by his session identification in the form of a client-side cookie or a token. A long term user identity is identified by a 'name' which allows SiteTrack to keep information about users from session to session... If you wish to keep the values of the variables that a user has until the next time that they visit, you give them a name. This 'name' is not their actual name [though it could be], but is instead any variable or piece of information which the webmaster chooses as a unique key for that user. On future visits, should the user choose to reveal his name by revealing his unique key, the user's persistent variables are recalled." (19)

RISKS AND BENEFITS TO THE WWW SURFER

Potential Benefits

We have seen how commercial websites can benefit greatly by increasing the amount of information which is collected about their users. We have also seen that increasingly sophisticated tools are being developed to meet this need for information. If the privacy rights of all users are respected, WWW users could also benefit greatly.

As a result of website operators knowing exactly who is visiting their website as well as which pages they find useful and which they do not, websites will be continuously improved to match the interests of their users. Additionally, if site owners are able to learn the specific preferences of an individual user, they can present only those items of information which particularly interest the user. On-line newspapers, for example, could be tailored to the exact interests of each individual user, which could save the user considerable search time.

Furthermore, as operators of content sites gain a better understanding of their user base, they will be able to secure more advertising revenue. This revenue can then be invested in keeping the website content up-to-date. Access to websites with valuable information can be free to all users as a result of advertisements. Additionally, as advertisers increasingly understand their audience, they may be able to make targeted offers to specific individuals, which may result in lower costs for the user.

In short, as the WWW is increasingly used to facilitate commerce, the specific needs of its users will be increasingly considered: "When today's infotainment consumers click our way through the web, advertisers have to win and retain attention in new ways. Having captured and retained our attention, the web advertiser has to give us, customers who have an extraordinary power to switch to other vendors at any moment, a continuous stream of good reasons to come back...the web-based market of the future...is one that is very closely attuned to, judged by, and dependent upon the perceptions, beliefs and decisions of consumers." (7)

Privacy Risks

Personal privacy advocates, however, are becoming increasingly concerned about potential violations of an individual's right to privacy. Privacy "has been characterized by Supreme Court Justice Louis Brandeis as 'the right to be left alone, the most comprehensive of rights, and the right most valued by civilized men.'" (12) Privacy is a fundamental right which should not be bartered or sold. As more information about a specific individual is collected and gathered, significant abuses could occur.

Abuses range from the severe, such as blackmail (for example, threatening to make public the number of times a politician accessed a pornographic website), to the more benign (but no less real), such as unsolicited junk mail. In essence, individuals are concerned about having information which they consider sensitive or personal accessible to others. They are especially concerned when this information is pieced together to develop a portrait of their identity: "The real danger is the gradual erosion of individual liberties through the automation, integration, and interconnection of many small, separate record-keeping systems, each of which may seem innocuous, even benevolent and wholly justifiable." (12)

Individuals wish to control the amount of their personal information which is available to others. When individuals choose to release information about themselves they expect to know exactly how this information will be used as well as expect that it will be protected. Individuals expect not to have information about themselves surreptitiously collected without their knowledge or approval. Consumers are "frustrated by a lack of control they have over the use of their personal information and suffer from a lack of understanding about how information about them is collected, used, and distributed." (12) As tools are developed to make clickstreams map to increasingly personal data, these privacy concerns become increasingly justified.

LAWS REGULATING THE USE OF ELECTRONIC INFORMATION

Few U.S. Federal laws exist which directly protect a user's right to a private clickstream. For example, the U.S. Department of Commerce's report "Privacy and the NII" states that "besides the Electronic Communications Privacy Act of 1986 (ECPA), which forbids only divulging the contents of a communication, no federal privacy laws apply to telecommunication related privacy information collected by those who provide Internet access." (12) Legally, it is still unclear how much information a website operator can or cannot collect concerning its users.

The "Privacy and the NII" paper does, however, lay out a general framework for the recommended collection and use of sensitive information. The two underlying principles of this framework are Provider Notice and Customer Consent.

Provider Notice

The Provider Notice principle basically means that:

"Information users who collect personal information directly from the individual should provide adequate, relevant information about:

1. Why they are collecting the information;

2. What the information is expected to be used for; [and]

3. What steps will be taken to protect its confidentiality; integrity; and quality." (12)

Customer Consent

The Customer Consent principle means that the customers who have private data requested from them must have a "meaningful opportunity to accept or reject the terms offered." (12) The "Privacy and the NII" paper then debates whether express or implied consent must be granted to those seeking to collect personal information.

The NII paper identifies two alternative approaches to obtaining customer consent. In the first ("opt-out") approach, a person's private information may "be used in an ancillary manner unless the individual affirmatively opts-out of such practices within some allotted time." Under the second ("opt-in") alternative, an individual's express consent must be given before the information may be used: "an individual's silence means that the information cannot be used." The paper suggests that in an interactive, electronic environment, the "opt-in" approach (i.e., the data is used only when explicit consent is given) should be the one which is followed. (12)
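The practical difference between the two approaches lies in how silence is treated, which a few lines of code can make concrete. This is only a sketch: the names are hypothetical, and the "allotted time" within which an individual may opt out is ignored for simplicity.

```python
from enum import Enum


class Consent(Enum):
    GRANTED = "granted"
    REFUSED = "refused"
    SILENT = "silent"  # the individual has not responded either way


def may_use(consent, policy):
    """Return True if personal data may be used under the given policy."""
    if policy == "opt-in":
        # Only express consent permits use; silence means no.
        return consent is Consent.GRANTED
    if policy == "opt-out":
        # Use is permitted unless the individual has affirmatively refused.
        return consent is not Consent.REFUSED
    raise ValueError("unknown policy: %r" % policy)


for state in Consent:
    print(state.value, may_use(state, "opt-in"), may_use(state, "opt-out"))
# granted True True
# refused False False
# silent False True
```

The two policies disagree only for the silent individual, which is presumably the most common case when tracking is invisible to the user.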

European Community Privacy Laws

The European Community has drafted laws which attempt to address the protection of electronic personal data. For example, Common Position No/95 concerning directives 94 and 95 of the EC European Parliament, Article 6 states that:

"1. Member states shall provide that personal data must be:

(a) processed fairly and lawfully;

(b) collected for specified, explicit and legitimate purposes and not further processed in a way incompatible with those purposes...

(c) adequate, relevant and not excessive in relation to the purposes for which they are collected and/or for which they are processed;

(d) accurate, and where necessary, kept up-to-date...

(e) kept in a form which permits identification of data subjects for no longer than is necessary...[with] appropriate safeguards..." (3)

Article 7 of the EC Paper continues:

"Member States shall provide that personal data may be processed only if:

(a) the data subject has given his consent unambiguously;" (3)

Industry Privacy Guidelines

Many of the industry bodies which represent the interests of advertisers and business include guidelines regarding the privacy of information. These guidelines result from both a genuine concern for privacy and a fear of legislation. The Coalition for Advertising Supported Information and Entertainment, for example, has published guidelines for interactive media audience measurements which state as a principle that "Consumer identities must not be revealed by audience measurement providers except if necessitated for audit purposes. Every effort should be made to maintain consumers' privacy." (1) This guideline continues: "Many of the interactive media have the capacity or potential capacity to identify some or all of those accessing their medium. Particular care needs to be taken...that user identities are kept secret and the user and user/respondent's privacy be respected by no more contact than is necessary for proper maintenance of equipment and production of accurate estimates." (1)

Service providers such as Microsoft have introduced similar principles. The Microsoft Network principles state that "a content provider may be asked 'to specify the legitimate business purpose for gathering information from a Member and to provide that Member with the opportunity to opt-out of the processing or use of that information for direct marketing purposes.'" (12)

STANDARDS FOR USAGE TRACKING AND DEMOGRAPHIC REPORTING

The WWW Consortium is considering different proposals for the gathering of consumer demographic information during HTTP sessions. (8, 13) HTTP is currently stateless: a web server gets everything it needs to know about a request from the request itself. It does not need to know anything about any previous transactions which may have taken place. This makes it difficult to track and identify a user as they move through a website, because each click of the mouse is treated independently from the one before it. Netscape has introduced the concept of "Cookies" in order to work around this problem. Others have simply rewritten each URL request that comes into a web server, adding session-identification information to the request. Both "Cookies" and "munging" have significant implementation issues, and no standard has yet emerged for usage tracking. In order to address this, the WWW Consortium has developed a proposal (8, 13), as has Dave Kristol of AT&T Bell Laboratories (14).
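The two workarounds can be sketched side by side. The code below is an illustration under stated assumptions, not the Netscape specification or any vendor's implementation: the cookie and parameter name "sid", the function names, and the way the server learns whether a client supports cookies are all hypothetical. It uses only the Python standard library.

```python
import re
import uuid
from http.cookies import SimpleCookie
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

SESSION_KEY = "sid"  # hypothetical cookie and query-parameter name
HREF = re.compile(r'href="([^"]+)"', re.IGNORECASE)


def add_token(url, session_id):
    """Munge a URL: append the session id to its query string."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query) + [(SESSION_KEY, session_id)]
    return urlunsplit(parts._replace(query=urlencode(query)))


def start_session(client_supports_cookies):
    """Create a session id and, for cookie-capable clients, a Set-Cookie value."""
    session_id = uuid.uuid4().hex
    if not client_supports_cookies:
        return session_id, None  # caller must munge links instead
    jar = SimpleCookie()
    jar[SESSION_KEY] = session_id
    return session_id, jar[SESSION_KEY].OutputString()


def find_session(cookie_header, url):
    """Recover the session id from the Cookie header, else from the URL."""
    if cookie_header:
        jar = SimpleCookie()
        jar.load(cookie_header)
        if SESSION_KEY in jar:
            return jar[SESSION_KEY].value
    return dict(parse_qsl(urlsplit(url).query)).get(SESSION_KEY)


def munge_links(html, session_id):
    """Rewrite same-site links in a page so each one carries the session id."""
    def tag(match):
        parts = urlsplit(match.group(1))
        if parts.scheme or parts.netloc:  # leave links to other sites alone
            return match.group(0)
        return 'href="%s"' % add_token(match.group(1), session_id)
    return HREF.sub(tag, html)


page = '<a href="/news.html">News</a> <a href="http://example.com/">Out</a>'
print(munge_links(page, "abc123"))
# <a href="/news.html?sid=abc123">News</a> <a href="http://example.com/">Out</a>
```

The sketch also shows why the two techniques differ in visibility: the munged identifier appears in every link the user follows, whereas the cookie travels in request headers that a browser does not normally display.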

Both proposals would standardize the collection of demographic information about WWW users as well as facilitate tracking of these users throughout websites. This would give commercial website operators a significantly improved ability to understand who is using their websites and for what. On the other hand, by standardizing the way usage tracking is implemented as well as how demographic information is collected and stored, these proposals could lead to improved privacy protections, assuming privacy is a design criterion of these approaches. Both proposals do in fact emphasize privacy protections. In this respect they represent improvements over both Netscape's "Magic-Cookie" approach and URL-munging. In particular, the spirit of these proposals improves on the Netscape approach, which currently can be (and, according to the Sitetrack literature, is) used to track users through a website without their knowledge; URL-munging is at least apparent to the watchful eye.

CONCLUSION AND RECOMMENDATION

We have seen that there are legitimate business needs for information concerning commercial website usage. We have also seen that currently no true standards exist to facilitate the gathering of this information. On the other hand, we have seen that significant risks are posed to WWW users' privacy by abusive collection of usage and demographic information. Furthermore, we have seen that in the United States no Federal laws have yet been drafted which unambiguously protect these privacy rights. Some standards have been proposed by both industry and governmental working groups which outline the responsible collection of WWW usage and demographic information. Finally, technical standards are being proposed which would facilitate usage and demographic information collection while at the same time providing the disciplines and tools required to enforce users' privacy rights.

The WWW community needs to take an active stance on specifying what sort of usage monitoring is appropriate and to document this stance officially. It needs to develop and approve standards (such as Dave Kristol's) which support these beliefs. The principles must at a minimum include the "Privacy and the NII" notions of Notice and Consent. A user of the WWW should be made aware of when they are being tracked through a site, as well as of what information is being collected about their usage. The user of a site must be given the opportunity to opt out of usage tracking as well as the option of not providing certain information. Any other behavior must be considered at least unethical, if not illegal. Products which enable passive tracking must enforce these privacy guidelines, rather than leaving such enforcement to the discretion of the website operators.

BIBLIOGRAPHY

1) CASIE Guiding Principles of Interactive Media Audience Measurement

http://www.commercepark.com/AAAA/bc/casie/guide.html

2) Commercial Scenarios for the Web: Opportunities and Challenges, D. Hoffman, T. Novak, and P. Chatterjee

http://www2000.ogsm.vanderbilt.edu/patrali/jcmc.commercial.scenarios.html

3) ECO 291 CODEC 92, Common Position No /95 Adopted by the Council on 20 Feb. 95

http://cpsr.org/cpsr/privacy/privacy_international/international_laws/ec_data_protection_directive_1995.txt

4) Getting Real About Usage Statistics, T. Stehle, Knight-Ridder Inc.

http://www.infi.net/naa/stehle.html

5) Knowing Where you Browse?, Usenet Discussion Thread 9/21/95 - 10/05/95

6) Look Who's Surfing, A. Schurr,

http://www.zdnet.com/~pcweek/sr/1030/tsurf.html

7) The Medium Becomes a Market, H. Rheingold

http://www.well.com/user/hlr/tomorrow/mediamarket.html

8) More Proposals for Gathering Consumer Demographics, W3C, T. Berners-Lee

http://www.w3.org/pub/WWW/Demographics/Strawman.html

9) Persistent Client State HTTP Cookies

http://home.netscape.com/newsref/std/cookie_spec.html

10) Pricing Web Site Advertising, A. Wool

http://www.amic.com/amic_mem/research/hits.html

11) Privacy Guidelines for the National Information Infrastructure: A Review of Proposed Principles of the Privacy Working Group, EPIC

12) Privacy and the NII, U.S. Department of Commerce

http://ntiaunix2.ntia.doc.gov:70/0/policy/privwhitepaper.html

13) Proposals for Gathering Consumer Demographics, W3C, D. Connolly

http://www.w3.org/pub/WWW/Demographics/Proposals.html

14) Proposed HTTP State-Info Mechanism, D. Kristol, AT&T Bell Laboratories

ftp://ds.internic.net/internet-drafts/draft-kristol-http-state-info-01.txt

15) Research Firms Strive For Web Tracking That Counts, Interactive Marketing News No. 13 Volume 2

16) Security, Privacy, and Marketing on the Web, D. Maroney, Privacy Forum Digest Vol 04 Issue 20, Sep. 15, 1995

17) Tracking Audiences on the Web: The Conaghan Report,

http://www.infi.net/naa/webcount.html

18) Web Add-On Helps Identify Site Visitors, Communications Week, Oct. 30, 1995 pg. 27

19) Web User Tracking with the Sitetrack System, Group Cortex, November 1995

http://www.cortex.net/sitetrack/general/white/howitworks.html

20) What's in a Web Hit?, J. Chao, Austin-American Statesman page E2.

21) You get what you pay for? Advertising on the net not such a bargain. J. Ubois, Digital Media, Page 3

22) MPA Proposed Standards for Internet Advertising Measurement

http://www.infi.net/naa/mpa.html

23) Web Site Find Niche in Budgets Ruled by TV, Print Campaigns,

Wall Street Journal, Dec. 8, 1995