SafeServer error rate for first 1,000 .com domains

10/23/2000
Bennett Haselton, bennett@peacefire.org

Introduction

Using "zone files" from Network Solutions (which list all .com domains in existence), we obtained a list of the first 1,000 active ".com" domains on the Internet as of June 14, 2000. From this sample, we determined how many of these sites were blocked by SafeServer, and of those blocked sites, how many actually met SafeServer's blocking criteria. (SafeServer is the blocking program made by SmartStuff Software, better known for making the FoolProof desktop-control program that restricts access to Windows applications.)

Results

Of the first 1,000 working .com domains, 44 were blocked by SafeServer. Of these 44 blocked sites, we eliminated 15 sites that were "Under construction" pages (the list of 15 non-functioning sites is here).

Of the remaining 29 sites, 10 were errors (i.e. sites blocked by SafeServer that did not meet any of their criteria), and 19 were non-errors (i.e. sites that met SafeServer's criteria). This is an error rate of 10/29 = 34%, or roughly one site blocked incorrectly for every two blocked sites that meet SafeServer's criteria.

Error rate for domains: 34%. One site blocked incorrectly for about every two blocked sites that meet SafeServer's criteria.
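The arithmetic behind the 34% figure can be checked with a few lines of Python. This is only a sketch that re-derives the numbers from the counts stated above; the variable names are illustrative and are not part of the original analysis.

```python
# Counts as reported in the Results section.
blocked = 44             # domains blocked out of the first 1,000
under_construction = 15  # blocked pages with no real content, set aside
errors = 10              # blocked sites that met none of the criteria

evaluated = blocked - under_construction  # sites actually judged
non_errors = evaluated - errors           # sites that met the criteria
error_rate = errors / evaluated

print(f"{evaluated} evaluated: {errors} errors, {non_errors} non-errors")
print(f"error rate: {error_rate:.0%}")
print(f"non-errors per error: {non_errors / errors:.1f}")
```

The last line gives the ratio behind the "one for about every two" phrasing: 19 correctly blocked sites for every 10 incorrectly blocked ones.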

SmartStuff Software describes the method by which SafeServer determines whether to block a site at http://www.smartstuff.com/products/fpi/fpiserverfaq.html#DCE as follows:

SafeServer features iCRT: a leading-edge technology based on artificial intelligence and pattern recognition technologies. The technology is trained to detect English-language pornography. SafeServer evaluates each incoming web page for inappropriate material. If the page is unacceptable, the browser displays a block page explaining why the page was refused and suggesting alternative sites.

Filtering categories include: Hate, Pornography, Gambling, Weapons, Drugs, Job Search and Stock Trading.

Thus, unlike most blocking programs, SafeServer does not come with a built-in list of "bad sites"; it examines each site as it is downloaded.

There were no blocked sites that we considered to be "borderline" cases, e.g. non-pornographic black-and-white nude photography sites. The sites in this sample were either pornographic sites (e.g. http://a-1blowjobs.com), or met the criteria for one of the other SafeServer blocking categories (http://a-blackjack-club.com/), or completely innocuous sites (e.g. http://a-1autowrecking.com/).

We considered the following blocked sites to be "errors":
http://a-1autowrecking.com/
http://a-1coffee.com/
http://a-1security.com/
http://a-1upgrades.com/
http://a-2-r.com/
http://a-abacomputers.com/
http://a-artisticimages.com/
http://a-baby.com/
http://a-build.com/
http://a-c-r.com/

We considered the following blocked sites to be "non-errors":
http://a-1-sex.com/
http://a-1blowjobs.com/
http://a-1casino.com/
http://a-1firearms.com/
http://a-1service.com/
http://a-1sex.com/
http://a-1sexsites.com/
http://a-1sextoys.com/
http://a-1sportsadvisor.com/
http://a-a-asexpics.com/
http://a-amateurs.com/
http://a-anal-sex.com/
http://a-anal.com/
http://a-ass.com/
http://a-big-boobs.com/
http://a-big-tits.com/
http://a-bikini.com/
http://a-blackjack-club.com/
http://a-blackjack-gambling.com/
http://a-bondage-site.com/

Setup

We obtained these results with a SafeServer proxy used by a high school (the school is not named in this report, to protect the identity of the student who assisted in the research). The list of blocked sites was current as of October 2000. Only the "Hate", "Pornography", "Gambling", "Weapons" and "Drugs" categories were enabled.

How the list of 1,000 domains was constructed

We started with zone files from Network Solutions listing all .com domains in alphabetical order. Michael Sims supplied the first 10,000 domains in alphabetical order from that list, after eliminating sites at the top whose names were made up mostly of "-" dashes. (A disproportionate number of these were pornographic sites that chose their domain name solely in order to show up at the top of an alphabetical listing, so a sample that included these sites would not be a representative cross-section.)

Jamie McCarthy supplied the Perl-based shell script that isolated the first 1,000 domains that were actually "up":

gunzip -c com.20000614.entire.sorted.gz |
  grep '^a' |
  grep -v '.*--\|-.*-.*-' |
  perl -ne 'chomp; $a = system("ping -c 1 -q www.$_ >/dev/null 2>&1"); print "$_\n" if !$a;' |
  head -1000

We used this script to narrow down the list to the first 1,000 pingable domains sorted alphabetically by domain name.
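For readers who find the one-liner hard to follow, the same filtering steps can be sketched in Python. This is a hedged re-expression, not the script that was actually run: the function names `responds_to_ping` and `first_live_domains` and the `is_up` parameter are hypothetical, and the sketch assumes each input line holds one bare domain name, as the original pipeline does.

```python
import re
import subprocess
from itertools import islice
from typing import Callable, Iterable, List

# Same exclusion as the grep -v pattern: a double dash anywhere,
# or at least three dashes anywhere in the name.
DASH_HEAVY = re.compile(r"--|-.*-.*-")


def responds_to_ping(domain: str) -> bool:
    """Send one ping to www.<domain>; True if ping exits with status 0."""
    result = subprocess.run(
        ["ping", "-c", "1", "-q", f"www.{domain}"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def first_live_domains(
    names: Iterable[str],
    limit: int = 1000,
    is_up: Callable[[str], bool] = responds_to_ping,
) -> List[str]:
    """Return the first `limit` names that pass the filters and are up."""
    stripped = (line.strip() for line in names)
    candidates = (
        name for name in stripped
        if name.startswith("a") and not DASH_HEAVY.search(name)
    )
    # Generators keep this lazy, so pinging stops once `limit` live
    # domains have been found, much as `head -1000` ends the pipeline.
    live = (name for name in candidates if is_up(name))
    return list(islice(live, limit))
```

Passing a file object opened with `gzip.open("com.20000614.entire.sorted.gz", "rt")` as `names` would mirror the `gunzip -c` step. The `is_up` parameter exists only so the filtering logic can be exercised without network access.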

We used the first 1,000 working domain names in our sample in order to make our sample "provably random". A truly random sample chosen from the entire list of domain names would have been better, but it would be impossible to prove that such a sample had really been chosen randomly; a third party could easily claim that we had "stacked the deck" by choosing a disproportionate number of sites blocked incorrectly by SafeServer.

Potential sources of error

A sample of 29 "real" sites blocked by SafeServer is too small to support any precise conclusions. The problem with the sample size is that we had to start with 1,000 randomly chosen Web domains just to get a sample of 29 blocked domains. The 34% figure should not be taken as being accurate to even two significant figures; across all .com domains in existence, the error rate for SafeServer could be as low as 15%. However, the test does establish that the likelihood of SafeServer having an error rate of, say, less than 1% across all domains, is virtually zero.
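The uncertainty described above can be made concrete with a short calculation. The sketch below rests on assumptions that are not in the report: it treats the 29 blocked sites as independent draws with a fixed error probability, and it uses the Wilson score interval, which is one conventional choice for small samples. The report does not say how its 15% figure was derived, so the interval here need not match it. The function names are illustrative.

```python
from math import comb, sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# 10 errors among 29 evaluated blocked sites, as reported above.
low, high = wilson_interval(10, 29)
# Chance of seeing 10 or more errors in 29 if the true rate were 1%.
tail = binomial_tail(10, 29, 0.01)

print(f"approximate 95% interval: {low:.0%} to {high:.0%}")
print(f"P(at least 10 errors in 29 | true rate 1%) = {tail:.1e}")
```

Under these assumptions the interval is wide, roughly 20% to 53%, which fits the caution that 34% is not accurate to two significant figures. The tail probability comes out far below one in a billion, which fits the claim that a true error rate under 1% is virtually ruled out.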

A note on interpreting these results: the results are not weighted by Web site traffic, so some of the sites in this experiment may cause more "Access Denied" messages than others. The 34% error rate should also not be interpreted to apply across all domains, since we only used .com domains in our experiment, which are more likely to contain commercial pornography than, say, .org domains. (In other words, we should expect the error rate to be even higher for .org sites that are blocked.)

Conclusion

The SafeServer FAQ from SmartStuff.com does not make any claims about the precision achieved by the "Intelligent Content Recognition Technology" used by the software. Based on the error rate found in this experiment, we conclude that the overall accuracy rate is low, and that about one third of sites blocked by SafeServer do not meet its stated blocking criteria.