SurfWatch error rate for first 1,000 .com domains

8/2/2000
Bennett Haselton, bennett@peacefire.org

Introduction

Using "zone files" from Network Solutions (which list all .com domains in existence), we obtained a list of the first 1,000 active ".com" domains, in alphabetical order, on the Internet as of June 14, 2000. From this sample, we determined how many of these sites were blocked by SurfWatch, and of those blocked sites, how many were actually pornographic.

Results

Of the first 1,000 working .com domains, 147 were blocked by SurfWatch as "Sexually Explicit". Of these 147 blocked sites, we eliminated 96 sites that were "Under construction" pages (the list of 96 non-functioning sites is here).

Of the remaining 51 sites, 42 were errors (i.e. non-pornographic sites blocked by SurfWatch as "Sexually Explicit"), and 9 were non-errors (i.e. sexually explicit sites). This is an error rate of 42/51 = 82%, or more than four non-pornographic domains blocked for every one pornographic domain blocked (42/9 is about 4.7).

Error rate for domains: 82%. More than four non-pornographic domains were blocked by SurfWatch as "Sexually Explicit" for every one pornographic domain blocked.
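The arithmetic behind these figures can be reproduced with a short sketch. The counts are the ones reported above; the variable names are illustrative only.

```python
# Counts taken from the Results section.
blocked_total = 147        # domains blocked as "Sexually Explicit"
under_construction = 96    # blocked domains that were "Under construction" pages
errors = 42                # non-pornographic sites that were blocked
non_errors = 9             # sexually explicit sites that were blocked

# Sites left to evaluate once the placeholder pages are removed.
evaluated = blocked_total - under_construction

error_rate = errors / evaluated
errors_per_non_error = errors / non_errors

print(f"Sites evaluated: {evaluated}")                        # 51
print(f"Error rate: {error_rate:.0%}")                        # 82%
print(f"Errors per non-error: {errors_per_non_error:.1f}")    # 4.7
```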

SurfWatch's published category definitions define a "Sexually Explicit" site as follows:

Sexually Explicit NOTE: We do not block on the basis of sexual preference, nor do we block sites regarding sexual health, breast cancer, or sexually transmitted diseases (except in graphic examples).

However, there were no blocked sites that we considered to be "borderline" cases, e.g. non-pornographic black-and-white nude photography sites. The sites in this sample were either pornographic sites (e.g. http://a-1blowjobs.com) or completely innocuous sites (e.g. http://a-1-dogs.com/).

We considered the following blocked sites to be "errors":
http://a-1-dogs.com/
http://a-1b2b.com/
http://a-1beds.com/
http://a-1diamondlimousine.com/
http://a-1janitorial.com/
http://a-1lock.com/
http://a-1locksafes.com/
http://a-1ofakindlimo.com/
http://a-1pro.com/
http://a-1rental.com/
http://a-1sierrastorage.com/
http://a-1silvascouriersvc.com/
http://a-1system.com/
http://a-1telecomm.com/
http://a-1waterbed.com/
http://a-1waterbedparts.com/
http://a-1waterbedpartsdirect.com/
http://a-1waterbeds.com/
http://a-2flightjacket.com/
http://a-adcon.com/
http://a-advantage-auto.com/
http://a-aji.com/
http://a-ak.com/
http://a-antiques.com/
http://a-arco.com/
http://a-atta.com/
http://a-b-w.com/
http://a-b2b.com/
http://a-back.com/
http://a-better-you.com/
http://a-bfreight.com/
http://a-bi.com/
http://a-bmove.com/
http://a-bon.com/
http://a-boo.com/
http://a-broo.com/
http://a-bs.com/
http://a-bsafecorp.com/
http://a-bshipping.com/
http://a-builders.com/
http://a-business1.com/
http://a-c-f.com/

We considered the following blocked sites to be "non-errors":
http://a-1blowjobs.com/
http://a-1sextoys.com/
http://a-1sex.com/
http://a-a-asexpics.com/
http://a-amateurs.com/
http://a-anal-sex.com/
http://a-ass.com/
http://a-big-tits.com/
http://a-bondage-site.com/

Setup

We obtained these results with SurfWatch for Windows 98, using a blocked-site list that was current as of July 31, 2000. Only the "Sexually Explicit" category was enabled. The other three default categories ("Gambling", "Violence / Hate Speech", and "Drugs / Alcohol") were turned off.

How the list of 1,000 domains was constructed

We started with zone files from Network Solutions listing all .com domains in alphabetical order. Michael Sims supplied the first 10,000 domains in alphabetical order from that list, after eliminating domains at the top of the list whose names consisted mostly of "-" dashes. (A disproportionate number of these were pornographic sites that chose their domain name solely in order to show up at the top of an alphabetical listing, so a sample that included these sites would not be a representative cross-section.)

Jamie McCarthy supplied the shell pipeline, built around a Perl one-liner, which isolated the first 1,000 domains that were actually "up":

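# Keep names starting with "a", drop names containing "--" or three or more dashes,
# then print the first 1,000 domains whose www host answers a single ping.
gunzip -c com.20000614.entire.sorted.gz | grep '^a' | grep -v '.*--\|-.*-.*-' |
  perl -ne 'chomp; $a = system("ping -c 1 -q www.$_ >/dev/null 2>&1"); print "$_\n" if !$a;' |
  head -1000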

We used this script to narrow down the list to the first 1,000 pingable domains sorted alphabetically by domain name.
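For readers who find the one-liner hard to follow, the same selection logic can be sketched in Python. This is an illustrative restatement, not the script that was actually run: the function names are hypothetical, the input is assumed to be one domain name per line in sorted order, and the ping flags are simply the ones used in the original pipeline on a Unix-like system.

```python
import subprocess
from itertools import islice

def passes_name_filter(name):
    """Mirror the two grep stages: keep names starting with 'a' that contain
    neither '--' nor three or more dashes."""
    return name.startswith("a") and "--" not in name and name.count("-") < 3

def responds_to_ping(name):
    """Mirror the Perl stage: one quiet ping to www.<name>; exit status 0 means up."""
    result = subprocess.run(
        ["ping", "-c", "1", "-q", f"www.{name}"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def first_working(names, limit=1000, is_up=responds_to_ping):
    """Return the first `limit` names that pass the name filter and are up."""
    candidates = (n for n in names if passes_name_filter(n) and is_up(n))
    return list(islice(candidates, limit))

# Offline demonstration: a stub stands in for the ping check.
sample = ["--a.com", "a-1-dogs.com", "a--b.com", "a-b-c-d.com", "b-1.com", "a-ak.com"]
print(first_working(sample, limit=2, is_up=lambda n: True))
# → ['a-1-dogs.com', 'a-ak.com']
```

Passing the "up" check in as a parameter keeps the name filter testable without network access; the default performs a real ping, as the original pipeline does.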

We took the first 1,000 working domain names as our sample in order to make the sample "provably random", in the sense that anyone with the same zone file can reproduce the selection. A truly random sample chosen from the entire list of domain names would have been better, but it would be impossible to prove that such a sample had really been chosen randomly; a third party could easily claim that we had "stacked the deck" by choosing a disproportionate number of sites blocked incorrectly by SurfWatch.

Potential sources of error

A sample of 51 "real" sites blocked by SurfWatch is too small to support precise conclusions. The difficulty is that we had to start with 1,000 Web domains just to obtain 51 blocked domains with real content. The 82% figure should not be taken as accurate to even two significant figures; across all .com domains in existence, the error rate for SurfWatch could be as high as 95% or as low as 65%. However, the test does establish that the likelihood of SurfWatch having an error rate of, say, less than 60% across all domains is virtually zero.
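One way to put rough numbers on this uncertainty is sketched below. The method is our assumption, not necessarily how the range above was derived: it uses a Wilson score interval and an exact binomial tail, and it treats the 51 sites as independent draws from all blocked .com domains, which an alphabetical sample only approximates. The function names are illustrative.

```python
from math import comb, sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

errors, evaluated = 42, 51   # counts from the Results section

low, high = wilson_interval(errors, evaluated)
tail = binomial_upper_tail(errors, evaluated, 0.60)

print(f"Approximate 95% interval: {low:.1%} to {high:.1%}")
print(f"P(42 or more errors out of 51 if the true rate were 60%): {tail:.4f}")
```

Under these assumptions the interval falls inside the 65% to 95% range quoted above, and observing 42 or more errors in 51 sites would be very unlikely if the true error rate were only 60%.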

A note on interpreting these results: the results are not weighted by Web site traffic, so some of the sites in this experiment may cause more "Access Denied" messages than others. The 82% error rate should also not be interpreted to apply across all domains, since we only used .com domains in our experiment, which are more likely to contain commercial pornography than, say, .org domains. (In other words, we should expect the error rate to be even higher for .org sites that are blocked.)

Conclusion

SurfWatch claims on their Web site:

Before adding any site to our database, each site 'candidate' is reviewed by a SurfWatch Content Specialist. Deciphering the gray areas is not something that we trust to technology; it requires thought and sometimes discussion. We use technology to help find site candidates, but rely on thoughtful analysis for the final decision. --http://www1.surfwatch.com/about/body-filter.html

Given the high error rate for sites blocked by SurfWatch in the "Sexually Explicit" category, we believe that SurfWatch's claim of "100% human review" is false.