Web Traffic, Voodoo And What It Means To You

Rob · February 16, 2010, 11:53:40 PM

Today's article comes from Rob Tracy of Remedial Comics.

Recently the subject of metric sites came up on the forums, and in that discussion and in another topic there was some question about the value of traffic numbers and metric sites. This is something that used to concern me a great deal until I learned the truth about it, so I wanted to share what I have learned and maybe clear up some confusion on the subject.

One thing I want to say right up front is that this is something I did not understand myself until weeks of pestering Chadm1n, our resident admin, bore fruit in a long discussion of what is and what is not measurable on the internet. So while I'm writing this article, the only reason I can share this knowledge with you is because of HIS expertise, and I plan on having him look this article over before I post it. And hopefully he will field any questions that pop up on the forums.

Traffic numbers are VERY important to webcomic creators. As Howard Tayler pointed out in his amazing presentation on Open Source TV, if you have verifiable traffic numbers you can literally "take them to the bank." (Howard had done a survey of his readers, and that should ring some bells with you, because you might recall that almost every popular webcomic has been doing surveys for some time now.)

But why, you might ask yourself, would these large webcomics do surveys? Well, the practical use is to measure possible merchandise sales, set ad rates and try to establish some hard numbers on the size of their committed fan base (and probably a whole bunch of stuff some marketing people could tell us that I don't know and don't really want to know). But the REASON these surveys have to be done is because traffic on the internet is largely unmeasurable.

Let's talk about the different ways that traffic is measured on the web and then I can tell you why they don't really work. If you are really into the technical jargon of this subject this post at Wikipedia does an excellent job of explaining it all but doesn't draw the conclusions I will.

Most metric sites require the site being measured to install a small amount of hidden code on the site for the purposes of measurement. As a Google Analytics member myself I know that there is a tiny bit of code that goes off like a roman candle every time my home page gets hit. The same goes for my Project Wonderful ads. The code has to be there or the hits don't get measured.

The big problem with this code is that it requires the user and his/her browser to cooperate and allow themselves to be counted. Many people on the web do not use JavaScript, and for the most part those folks will be uncountable by this kind of web analytic collection. Zach Weiner recently made a Reddit post about Ad-Block and the implications it has for web-based businesses like webcomics that outlines some of the other potential problems with this. If he's right and 10-20% of his traffic is unrecordable because of Ad-Block, that means that whatever you are using to gauge your traffic is already off by that much. Would you rely on statistics that are only 80% correct? When you have 1,000 unique visitors to your site and your metric is telling you 800, that might not be so bad. But what happens when you have 10,000 and your metric says 8,000? How about when you hit 100,000 and your metric is only reporting 80,000?
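To put rough numbers on that, here is a minimal Python sketch of the arithmetic. It assumes a fixed share of visitors never runs the counting script at all; the function name and the flat 10% and 20% rates are just for illustration, taken from the range quoted above, and real rates will vary from site to site.

```python
def reported_visitors(true_visitors: int, uncounted_fraction: float) -> int:
    """Visitors a script-based counter would report if a fixed
    share of visitors never runs the script (assumed, not measured)."""
    return round(true_visitors * (1 - uncounted_fraction))

for true_visitors in (1_000, 10_000, 100_000):
    worst = reported_visitors(true_visitors, 0.20)  # 20% uncounted
    best = reported_visitors(true_visitors, 0.10)   # 10% uncounted
    print(
        f"true={true_visitors:>7}  reported={worst} to {best}  "
        f"missing={true_visitors - best} to {true_visitors - worst}"
    )
```

The percentage stays the same at every size, but the number of real readers you cannot see grows right along with your traffic.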

And that's only one instance of why that type of measurement may fail. Some measurement systems involve cookies. Many web browsers turn off cookies or manage which they accept and which they do not.

Then there are sub domains and sub sections of bigger sites. You know, the thissite.atthatsite.com or atthatsite.com/thissite kind of thing. For example, Gamespy is an incredibly popular gaming website with tons and tons of traffic. They also offer lesser-known features like webcomics in their humor section. If you were to copy the link to one of these webcomics and then do an Alexa search for traffic, you would think these comics were some of the most popular on the internet. In fact, the traffic numbers are high because Alexa does not differentiate between the main site (Gamespy) and the sub domain or section where those comics live.
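Here is a rough Python sketch of why lumping everything under one site name inflates the small sections. The bigsite.example URLs and the hit counts are made up for illustration, and this is not how Alexa actually computes anything; it only shows what happens when hits are grouped by host name instead of by section.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical (url, hits) pairs. None of these numbers are real.
hits = [
    ("http://www.bigsite.example/news/", 90_000),
    ("http://www.bigsite.example/reviews/", 60_000),
    ("http://www.bigsite.example/humor/some-comic/", 500),
]

by_host = Counter()
by_section = Counter()
for url, count in hits:
    parts = urlparse(url)
    by_host[parts.netloc] += count
    # First path segment, e.g. "humor" for /humor/some-comic/
    section = parts.path.strip("/").split("/")[0]
    by_section[(parts.netloc, section)] += count

print(by_host["www.bigsite.example"])                 # 150500
print(by_section[("www.bigsite.example", "humor")])   # 500
```

Looked up by host name, the comic appears to share in 150,500 hits. Looked up by section, it has 500.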

There is also the problem with IP tracking. How many universities, hotels, businesses and homes have multiple users on the same IP address? Think about that for a moment. Then consider all the times your internet service provider has changed your IP address. IP address issues have the potential either to radically reduce your real numbers or, in the case of a single unique user logging in twice under two different IP addresses, to slightly and falsely increase those numbers.

Not to mention multiple users on the same computer. Ever been in a university computer lab? Aren't college students some of our biggest readers? Two or more people loading the same page from the same computer might increase your page views. But it won't do a thing for your unique users.
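A small Python sketch makes the IP problem concrete. The addresses and names below are invented, and the "actual person" column is ground truth that a real server never gets to see; it is only there so the three counts can be compared.

```python
# Hypothetical page loads: (ip_address, actual_person).
page_loads = [
    ("203.0.113.5", "alice"),   # campus lab, one shared IP address
    ("203.0.113.5", "bob"),
    ("203.0.113.5", "carol"),
    ("198.51.100.7", "dave"),   # dave at home in the morning
    ("198.51.100.42", "dave"),  # dave again after his ISP changed his IP
]

page_views = len(page_loads)
unique_by_ip = len({ip for ip, _ in page_loads})
actual_people = len({person for _, person in page_loads})

print(page_views, unique_by_ip, actual_people)  # 5 3 4
```

Three readers collapse into one "unique visitor" at the campus lab, one reader splits into two at home, and the IP-based count of 3 matches neither the 5 page views nor the 4 real people.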

Finally, the type of web analytic that has been around the longest and is often trotted out as the most accurate (also the most complex and difficult to interpret) is the log file. By actually measuring the exact number of times certain elements are pulled from the server you can say with absolute confidence that your page was loaded X number of times by a web browser somewhere.
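For anyone who has never looked inside one, here is a minimal Python sketch of counting requests from a server log. It assumes the log is in the widely used Combined Log Format; the sample lines, addresses and paths are made up, and a real analyzer does far more than this.

```python
import re
from collections import Counter

# Combined Log Format:
# ip ident user [time] "METHOD path PROTOCOL" status bytes "referer" "user-agent"
LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

# Hypothetical log lines for illustration only.
sample_log = """\
203.0.113.5 - - [16/Feb/2010:10:00:01 -0500] "GET /comic/42 HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
203.0.113.5 - - [16/Feb/2010:10:00:02 -0500] "GET /img/42.png HTTP/1.1" 200 88211 "-" "Mozilla/5.0"
198.51.100.7 - - [16/Feb/2010:10:05:09 -0500] "GET /comic/42 HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
"""

requests_per_path = Counter()
for line in sample_log.splitlines():
    match = LINE.match(line)
    if match:  # skip anything that is not a well-formed log line
        requests_per_path[match["path"]] += 1

print(requests_per_path["/comic/42"])    # 2
print(requests_per_path["/img/42.png"])  # 1
```

Note that this only counts requests that actually reached the server, which is exactly what the log can vouch for and nothing more.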

Here's the problem. Caching of certain elements of your page will skew those numbers. So unless you know precisely what portion of your page is being cached by every computer that loaded your page initially you cannot say for sure how many times your page really got loaded. Then there are archive sites. Supposedly Google is now recording everything on the web at all times that isn't secured behind some kind of firewall. So people could be loading your older comics from an archive other than the one at your site. And lastly there are the bots.

We all know that the web is literally crawling with tons and tons of little computer programs that are running around recording, analyzing and indexing everything. I currently have a day in August of last year standing as my day with the most people on line at my forum. On that day 59 people were on line in my forum. On that day there was no comic update and no posts were made. I was the only user who logged in. The other 58 people were either guests who did not desire to post anything or bots. Most were undoubtedly bots.

Just measuring your server logs leaves you open to WILD traffic fluctuations based upon whether or not you say or do something that these little programs find interesting. And there is currently no easy way to tell the difference between man and machine. Look at how much trouble we all go through to keep these things out of our comments sections and forums. Thus far no one has figured out a way to stop their page loads from counting on the server logs.

There are some hybrid solutions, and some web analytics are better than others. But, and this really is the message that I want you to take away from all of this, for the most part traffic numbers are voodoo. The absolute best use for this type of traffic measuring is establishing a baseline so you can note increases and decreases in traffic, try to identify the patterns of behavior on your site that have contributed to these upticks or declines, and react accordingly. Because as an actual, factual, measurable indication of your true traffic numbers they are completely worthless.

Magister · February 18, 2010, 02:18:44 AM

If you really want to know your stats, caching aside, you need to run an old-fashioned log file analyzer (WebTrends, AWStats, etc.) on the actual log files, instead of using a JavaScript-enabled online traffic service. Mind you, if you're marketing to mobile devices, you're going to be largely untracked with anything involving JavaScript, but that's a whole 'nother discussion.

Rob · February 19, 2010, 05:13:38 PM

Actually there is no way to filter bots from a log file, as I said in the article, so no, a log file analysis will not give you accurate numbers either. :-\

Chadm1n · February 19, 2010, 05:58:17 PM

To be accurate... you can filter bots (you can filter anything, really). The question is: how much time do you want to spend trying to identify them all?

Once upon a time, bots and browsers played relatively nice and - for the most part - behaved themselves. They would report what they were (user agent) and all was (mostly) well. Today, the game has changed. For example: Firefox will allow a user to tweak the browser so it reports itself as Internet Explorer. Likewise, bots can be coded to emulate any browser that fits the developer's mood. And then there are all of the anonymous proxies. <sigh>
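As a rough sketch of the "play nice" case, here is what filtering on the user agent can look like in Python. The marker list and the function name are illustrative assumptions, not a complete or authoritative list, and the check only catches bots that announce themselves.

```python
# Substrings that self-identifying crawlers often include in their
# user agent string. Illustrative only, nowhere near exhaustive.
BOT_MARKERS = ("bot", "crawler", "spider", "slurp")

def looks_like_bot(user_agent: str) -> bool:
    """True if the user agent openly identifies itself as a crawler."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)

user_agents = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6",
    # Could be a real IE user, a browser set to pose as IE, or a bot:
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)",
]

not_flagged = [ua for ua in user_agents if not looks_like_bot(ua)]
print(len(not_flagged))  # 2
```

The third entry is the whole problem in one line: nothing in the string tells you whether a person or a program sent it.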

My advice is to use a range of tools to build a solid traffic baseline. Understand what your traffic "looks like" on a regular basis. Once you have a well-established baseline, it becomes much easier to measure the impact of an ad campaign.
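Here is a minimal Python sketch of that idea, assuming you already have daily visit counts from whatever tool you use. The numbers are invented, and the "more than two standard deviations" rule is just one common rule of thumb, not a recommendation from anyone in this thread.

```python
from statistics import mean, stdev

def baseline(daily_visits):
    """Return the average and the typical day-to-day spread."""
    return mean(daily_visits), stdev(daily_visits)

def percent_change(value, base):
    return 100.0 * (value - base) / base

# Hypothetical daily visit counts from the same tool, measured the same way.
normal_days = [980, 1010, 1005, 995, 1020, 990, 1000]
campaign_day = 1150

avg, spread = baseline(normal_days)
change = percent_change(campaign_day, avg)
unusual = abs(campaign_day - avg) > 2 * spread  # assumed rule of thumb

print(f"baseline={avg}, change={change:.1f}%, unusual={unusual}")
```

The absolute numbers may be off, but as long as they are off in the same way every day, the change against the baseline still tells you something.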

The most accurate method of monitoring traffic is to utilize a well-designed web application that requires a user to log in (forum software such as SMF doesn't meet my definition of such an app, FWIW). Each user gets a unique session identifier and their actions can be tracked using some basic page transition techniques. I might recommend this sort of thing for a banking website, but I would advise against it for most other sites.
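A bare-bones Python sketch of the session idea might look like the following. Everything here (the function names, the in-memory dictionaries, the page paths) is a hypothetical stand-in; a real application would handle authentication, expiry and storage properly.

```python
import secrets
from collections import defaultdict

sessions = {}                    # session id -> username
last_page = {}                   # session id -> most recent page
transitions = defaultdict(list)  # session id -> [(from_page, to_page), ...]

def login(username: str) -> str:
    """Create a unique, hard-to-guess session identifier for this user."""
    session_id = secrets.token_hex(16)
    sessions[session_id] = username
    return session_id

def record_page(session_id: str, page: str) -> None:
    """Log the move from the previous page to this one for a known session."""
    if session_id not in sessions:
        raise KeyError("unknown session")
    previous = last_page.get(session_id)
    if previous is not None:
        transitions[session_id].append((previous, page))
    last_page[session_id] = page

sid = login("reader1")
record_page(sid, "/home")
record_page(sid, "/comic/42")
print(transitions[sid])  # [('/home', '/comic/42')]
```

Because every action is tied to a logged-in account rather than an IP address or a browser, the count of distinct users is about as solid as it gets, at the cost of making everyone log in.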

Baseline, baseline, baseline! That's my motto.

Gibson · February 19, 2010, 06:10:30 PM

What I'm getting from this is that there is really no way to know, with any degree of accuracy, how many people are reading my work, that any site claiming to tell me accurately how many regular readers I have is making a false claim, and that everything of the sort is a best guess. That's very interesting.

Rob · February 19, 2010, 08:49:06 PM

That's pretty much true, although they usually couch their claims very carefully in their EULA. They do what they specifically say they will do, within the limits of what they can measure, and they are careful to allow for not being totally accurate. In other words, they may imply that they can tell you your real numbers, but they can't actually do that; what they do accomplish is exactly what their user agreements say they do.

As for your comment in the Project Wonderful thread...

I'll just answer it here. This article does actually account for the differences in site-to-site experience, in a sort of related way, because of the links I put in the article. You also have to do a bit of deductive reasoning.

When it comes to Project Wonderful, a HUGE skew in what one site reads in comparison to another is specifically related to where the ads on the site are placed and what type of ads are on the site.

For example, with Questionable Content, which I know you mentioned, Jeph often switches his PW ads to the archive page only and posts ads from IndieClick (which I believe is run by the Dumbrella guys but don't quote me on that). And then when the IndieClick ads start petering out he'll switch back to PW.

Also, some of the ads are below the fold and some are not (below the fold means that you cannot see the ad when the page first loads). Some metric sites can tell if the page is scrolled down so that the ad is actually seen.

And Jeph isn't the only one who does this. Some folks switch their ad boxes all the time from Palace in the Sky to IndieClick to PW to Google AdSense and so on. Stuff like that can really screw with your overall numbers.

Then you have the different kinds of ads. A Flash ad may not work for people who have Flash turned off. So if your site is running a Flash ad it may not register the hit.

Then if you look at the specific ways that specific metric sites count hits (and for this you have to have some computing knowledge of JavaScript), like say for example Google Analytics, you will find that it detects traffic in a different manner than even another site that also uses JavaScript code to detect hits.

The funny thing is, if you were to spend the time and invest the effort into understanding how the web works and all the computer languages required to make it go, you would understand why all the traffic measurement sites are only making their best guess and why the numbers are so different from site to site.

Before I learned this I was scampering to my computer to check my traffic numbers every day and I just became more and more confused by all the differing numbers as time went on.

Fortunately, I was saved all the effort of learning all the computer languages and taking net management classes because I have Chadm1n (and he already speaks all the computer languages and took all the classes). All I did was ask him to explain it all to me like I was a five year old in very general terms. Once I understood why none of it worked I was able to fill in the details with a small amount of research on my part.

And that was the point of the article: to pass that knowledge on to everyone else at the site. So you too can stop sweating over your actual numbers and concentrate on establishing a baseline and monitoring changes. Because that's all it's good for. ;)

Gibson · February 20, 2010, 01:36:09 PM

Quote from: Rob on February 19, 2010, 08:49:06 PM
explain it all to me like I was a five year old in very general terms

This is the secret to my success: friends who are smarter than I am, with patience.

I always knew, or suspected, that the numbers were vague and inaccurate, but I never figured they were so out there. Baseline is pretty much what I've always used, but comforted myself by thinking it was sort of accurate. Oh, internet, you've fooled me again!

Thanks for the perspective, though.

Magister · February 20, 2010, 03:50:58 PM

Some of the Dumbrella boys were using IndieClick, but they don't own it.
