For the last several months, life has been pretty good. I’ve been hanging out, relaxing, having fun and doing some good work. Except, unbeknownst to me, my website had been under attack the entire time.
The Call To Arms
I woke up one September morning to a mildly alarming email from my hosting provider, Site5.
Problem with Resource Usage on hypertransitory.com
Well, that didn’t look good.
We are contacting you to inform you that your account is currently consuming too many resources for shared web hosting and will either need to be optimized to reduce resources or will need dedicated hosting.
Correction, that definitely didn’t look good.
Site5 operates on a Resource Points system, which determines how much of the server’s resources each account is allowed to use on a daily basis. My account level is a shared-hosting plan called Site5 HostPro+Turbo for which I’m allotted 600 points, and I pay $13.95 per month.
The following screenshot shows the resource usage of my site over a 30-day period:
As you can see, I was way over pretty much every day.
Because of this, my site was using so much bandwidth and CPU power that Site5 wanted me to upgrade to one of their Managed Virtual Server Plans, which would cost me about $72.00 per month.
My initial response was: “HAHAHAHAHAHAHAHAHAAAAAAAAAAA!!”
Man, I’m not paying that. It’s just not gonna happen.
I love this site, and I’m proud of a lot of the things I’ve done over the past few years, but it’s not a $72/mo. site, and if that’s what it truly took to maintain it, I would shutter it. I would just have to catch up with all you guys on Twitter or something. Even though I knew I could play the hosting switcheroo game for a while, the problem would just follow me wherever I went. So before I went that route, I tried to work it out with Site5.
The Back And Forth
Of course, we got into it a bit at this point, with them telling me sometimes sites get so much traffic they outgrow their shared hosting plan and it’s just that time.
Then there was me saying there’s no way my site was bringing in that kind of traffic and it’s definitely not that time.
Just to be clear: I don’t have a problem paying my fair share. Ideally, you want to be bringing in a lot of traffic because that means you have something people want to read/see/listen to or whatever. However, I do have a problem paying if that’s not the case.
It turns out we were both right, and both wrong.
The War Of The Stats
Google Analytics to the rescue?
There was a time a couple of years back when I was bringing in what I would consider to be a lot of traffic to this site. That time is not now.
Back then, I was blogging almost every day, commenting everywhere, participating in blog contests, posting YouTube videos. All of that was driving quite a bit of traffic to the site. At the height of it, I almost reached 18,000 unique visitors per month.
That’s not a lot in the grand scheme of things, but it is to me.
Well, after I stopped blogging regularly and got more involved in launching my freelance career I let the blog take a back seat, and the traffic dropped quite a bit.
At this point I would be lucky to get 2,500 uniques per month, and a screenshot of my Google Analytics like the following was my first line of defense to Site5:
So as you can see, there were only 2,100 uniques in September. So how exactly is that too much traffic?
As I soon discovered, most hosts don’t care about Google Analytics.
AWStats FTW (but I lost)…
Site5 countered my screenshot by informing me that Google Analytics was not as accurate as the stats package supplied with my hosting account, called AWStats.
According to the stats in this program, my site was bringing in an average of almost 20,000 unique visitors per month. Holy discrepancies, Batman!
Here’s the AWStats screenshot of my site for September, 2013:
As far as Site5 was concerned, the red circled area of the screenshot shows about 17,000 people visited my site in September. Then, each one of those people came back almost 3 times, resulting in about 46,000 total visits.
That can’t be right…can it?
It is a mystery.
Ok, but even allowing for that – it’s a huge difference. I decided to do some digging.
Google themselves say that it’s common to see different values from other web analytics solutions, and some people believe that AWStats inflates the numbers, while Google Analytics deflates them – therefore neither is 100% accurate.
HostGator has a brief explanation of How AwStats and Google Analytics Operate, while others suggest that you don’t even think of comparing AWStats to Google Analytics.
Hmmmm. After doing my research, I figured I had no choice but to believe Site5, since they were using a log analyzer directly on the server itself. What that meant is that my site really was bringing in a lot of traffic – but that just didn’t seem right.
Something was running those numbers up and running up the resource usage on their server, but what? What, I say??
(Side note: If you’re on a host that’s using AWStats, make sure to take a look, if you haven’t already. If you’ve been working off Google Analytics numbers this whole time, I think you’ll be surprised by the AWStats numbers).
The Usual Suspects
Since I was now a believer when it came to my site’s resource usage, I had to figure out what was causing it.
I didn’t think it was true traffic, so I thought it might be a rogue script or some broken part of my site that was causing trouble.
Truthfully, I had noticed my site was getting slower and slower these days, so I attempted to clean up the performance issues and fix anything that was broken.
Long story short – I succeeded. Whereas my site was getting failing grades across the board in performance evaluators like YSlow and Google Pagespeed, I really dug into the guts of my site to bring the performance up, so now many pages get A’s and B’s and load very quickly.
However, this was not the problem.
Still, the lessons I learned while optimizing my site are worthwhile, so I’ll be writing up the specifics of how I cleaned it up next week in part 2 of this article.
Back to the matter of the day: Why was my site hogging server resources??
Dem Bots, Dem Bots, Dem…DAMN BOTS.
Final answer: Scrapers, rogue bots and over-aggressive search engine crawlers were responsible for constantly rummaging through my site and running up bandwidth and CPU power. The ultimate effect of this was to reach deep into my pockets and try to relieve me of my hard earned dollars, paper, scrilla, dead presidents, CHEDDAH, etc.
Oh, Hell no. This shit was not acceptable. Therefore, it was time for VENGEANCE…
…which I did not get, so I had to settle for blocking them.
The number one offender which was hammering my site without pity was none other than the infamous Yandex bot. Yandex is a Russian search engine that for some reason enjoyed sending its bot to go through my site multiple times per day and load everything. I found many threads complaining about Yandex, so I was obviously not alone.
The thread I linked to was from 2010, and several people complained about the bot not following the instructions of the robots.txt file. I hoped that since so much time had passed, I’d have better luck. After placing a disallow directive in my robots file, I discovered that hope was in vain. The bot just kept on doing what it was doing.
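For anyone who wants to check what a disallow directive actually tells a well-behaved crawler, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents and the `is_allowed` helper are assumptions made up for illustration, not the exact file used on this site. It only shows what the rules say; it can't make a bot obey them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut Yandex out everywhere, leave everyone else alone.
ROBOTS_TXT = """\
User-agent: Yandex
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, agent_token: str, path: str) -> bool:
    """Return True if the rules permit `agent_token` to fetch `path`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent_token, path)


if __name__ == "__main__":
    for token in ("Yandex", "Googlebot"):
        print(token, is_allowed(ROBOTS_TXT, token, "/some-post/"))
```

Note that the parser matches on the short robots.txt token (such as `Yandex`), not on the full browser-style user-agent string a bot sends in its requests.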
Now my only choice was to block it using my site’s htaccess file. I added in the code from this Search Engine Watch article on blocking bots and the resource usage dropped dramatically.
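The Search Engine Watch code isn't reproduced here, so the following is only a sketch of one common approach: mod_rewrite conditions on the `HTTP_USER_AGENT` header that answer matching requests with 403 Forbidden. The `build_user_agent_block` helper and the bot names passed to it are hypothetical, and the output assumes the host has mod_rewrite enabled.

```python
import re

# Keep fragments simple so they are safe to drop into a RewriteCond pattern.
SAFE_FRAGMENT = re.compile(r"^[A-Za-z0-9_-]+$")


def build_user_agent_block(bot_fragments):
    """Return .htaccess text that sends 403 to any matching user agent."""
    if not bot_fragments:
        raise ValueError("need at least one user-agent fragment")
    lines = ["RewriteEngine On"]
    last = len(bot_fragments) - 1
    for index, fragment in enumerate(bot_fragments):
        if not SAFE_FRAGMENT.match(fragment):
            raise ValueError(f"unsupported characters in {fragment!r}")
        # [NC] ignores case; [OR] chains every condition except the last one.
        flags = "[NC]" if index == last else "[NC,OR]"
        lines.append(f"RewriteCond %{{HTTP_USER_AGENT}} {fragment} {flags}")
    # "-" means no rewrite; [F] returns 403 Forbidden, [L] stops rule processing.
    lines.append("RewriteRule .* - [F,L]")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_user_agent_block(["Yandex", "ExampleBadBot"]))
```

Generating the rules from a list makes it easier to add or drop a bot later without hand-editing the `[OR]` flags, which are easy to get wrong.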
I did feel bad, in case any Russian people had found anything interesting on my site. Now I would no longer be appearing in Yandex, but Google Analytics showed Russia to be 18th in total visitors to my site, and AWStats showed Russia in 20th place, each one sending about 15-20 visitors for September.
I actually do want those visitors. If I can figure out how to slow down the bot, I’ll unblock it, but that’s something for a later date. I still wasn’t finished reducing my overall resource usage.
The number two bot offender was your friend and mine Microsoft Bing.
Between the Bingbot, Msnbot and Msn Media bot they were blasting my site back into the stone age. Incredible amounts of bandwidth being used up incessantly. I did a lot of reading, but I still don’t understand why.
Microsoft must know the damn things are over-aggressive, because they do have a way to slow the crawl rate for them. I read through it, but decided to block them instead using robots.txt. They did obey the directives and immediately the resource usage plummeted.
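The slow-down-versus-block choice can be compared side by side. This is a rough sketch that assumes the crawl-rate control is expressed as a `Crawl-delay` line in robots.txt (which may not be the exact mechanism described in Microsoft's instructions). The file contents, the 10-second value and the `crawl_policy` helper are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: slow bingbot down, block msnbot outright.
BING_RULES = """\
User-agent: bingbot
Crawl-delay: 10

User-agent: msnbot
Disallow: /
"""


def crawl_policy(robots_txt: str, agent_token: str):
    """Return (may_fetch_homepage, crawl_delay_in_seconds_or_None)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent_token, "/"), parser.crawl_delay(agent_token)


if __name__ == "__main__":
    for token in ("bingbot", "msnbot"):
        print(token, crawl_policy(BING_RULES, token))
```

A delay keeps the site in the index while asking for gentler crawling; a disallow drops the load fastest but also drops the listings.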
You might be wondering how wise it is to block Bing, though. Checking my stats, Bing is my third search engine referrer behind Yahoo (which itself is a distant second behind Google). Alas, Bing now actually powers Yahoo’s search results in some countries, so I’m not sure if that means I’m actually de-listing myself from Yahoo as well? The stats will tell the tale.
To their credit, once I told them I was banning Bing, Site5 tried to warn me about blocking them and suggested slowing the crawl rate, but right now I’m going to get a couple of weeks of “normal” traffic under my belt, then check the stats on it to see what difference there is, and if it’s perceptible without relying on stats.
Anonymous Bots And IP Addresses
I did a lot of other blocking of bots that refused to identify themselves, but were using up a lot of bandwidth. I did this by blocking the IP addresses.
You can add a “deny” listing to your htaccess, but I used the webhost’s “IP Ban Tool” which does the same thing, just from the cPanel.
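If you'd rather build the "deny" listing yourself, here is a small sketch that turns a list of addresses into htaccess lines. It assumes the older Apache 2.2-style `Order`/`Deny from` syntax (newer Apache versions prefer `Require` directives), and the `build_deny_block` helper and the sample addresses, taken from the reserved documentation ranges, are made up for illustration.

```python
import ipaddress


def build_deny_block(addresses):
    """Return Apache 2.2-style .htaccess text denying the given IPs or CIDR ranges."""
    lines = ["Order Allow,Deny", "Allow from all"]
    seen = set()
    for raw in addresses:
        # ip_network accepts single addresses and CIDR ranges alike,
        # and raises ValueError on anything malformed.
        network = ipaddress.ip_network(raw.strip(), strict=False)
        if network in seen:
            continue  # skip duplicates so the file stays short
        seen.add(network)
        if network.num_addresses == 1:
            text = str(network.network_address)  # single host, no prefix suffix
        else:
            text = str(network)
        lines.append(f"Deny from {text}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_deny_block(["192.0.2.10", "198.51.100.0/24", "192.0.2.10"]))
```

Validating with `ipaddress` first means a typo in the ban list fails loudly here instead of producing a broken htaccess file.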
In addition to this, I also use a couple of WordPress plugins to help out. One is called Stop Spammers. This one checks IP addresses and emails against the StopForumSpam.com, Project Honeypot and BotScout databases, and blocks them if they appear on there. So it cuts down on comment spam as well – meaning they don’t even make it into your spam list. Nice!
I can harvest bad IP addresses from this plugin and put them in my htaccess file. I also go a step further though.
Since I use the Cloudflare service for this site, I can throw those bad addresses into their threat management tools. I use a plugin called CloudFlare Threat Management to make this a bit easier. With the plugin I can add the IP’s right from my WordPress dashboard without having to go log into Cloudflare each time.
The plugin isn’t made by Cloudflare, but by developers using their API. I’ve found it to work exactly as advertised so far.
The benefit of adding the bad addresses to Cloudflare is that they should never hit your webhost’s server at all, but be blocked by Cloudflare before they even get there. I’ve only just now turned this service on, so we’ll see if it helps out with the performance issues.
In fact, it’s probably overkill to keep the same numbers in my htaccess files if Cloudflare is going to block them anyway, so I’ll likely remove those to keep that htaccess slim and trim.
The bottom line with these is I’ll have to remain vigilant and keep an eye on any bots that are using up more than the normal amount of bandwidth and CPU power.
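One way to keep that eye on things is to total up the bytes served per user agent straight from the raw access log. The sketch below assumes the log is in Apache's "combined" format, which may not match every host's setup; the regular expression, the `bandwidth_by_agent` helper and the sample lines are illustrative only.

```python
import re
from collections import Counter

# Assumed Apache "combined" format:
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) '
    r'(?P<size>\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)


def bandwidth_by_agent(lines):
    """Sum response bytes per user-agent string, skipping unparseable lines."""
    totals = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue
        size = match.group("size")
        # Apache logs "-" when no body was sent (for example on a 304).
        totals[match.group("agent")] += 0 if size == "-" else int(size)
    return totals


SAMPLE = [
    '192.0.2.10 - - [15/Sep/2013:10:00:00 +0000] "GET / HTTP/1.1" 200 5000 "-" '
    '"Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"',
    '192.0.2.10 - - [15/Sep/2013:10:00:01 +0000] "GET /about/ HTTP/1.1" 200 7000 "-" '
    '"Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"',
    '198.51.100.4 - - [15/Sep/2013:10:00:02 +0000] "GET / HTTP/1.1" 304 - "-" '
    '"Mozilla/5.0 (Windows NT 6.1) Firefox/24.0"',
]

if __name__ == "__main__":
    for agent, total in bandwidth_by_agent(SAMPLE).most_common(5):
        print(total, agent)
```

To run it against a real log, pass an open file handle instead of `SAMPLE`; the heaviest consumers come out at the top of `most_common()`.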
The Wrap Up
This whole experience has been extremely frustrating. Initially, I was upset with Site5 because I didn’t believe them. When I did some looking, I found other people who didn’t like Site5’s Resource Point system, and even some who claimed it was created with the sole purpose of pushing people to the higher tier hosting packages.
It took me about a week and a half to finally figure out the whole scenario, since I wasted the first week on optimizing my site. Although it did benefit from the work I did, the issues I fixed weren’t the actual cause of the problem.
Later, I calmed down and realized they actually gave me more slack than they could have. In reading up on this issue, I’ve found people at some hosts who have just had their site suspended without so much as a warning, and others who have had their accounts flat-out terminated with no refund. Imagine having to scramble to find a new host like that. Especially if you didn’t have a readily available backup. Ouch.
When Site5 initially contacted me, they gave me about a week to solve the problem, or else they were going to migrate my site to the higher tier VPS hosting solution that could handle that kind of traffic. Even then, they would give me a month free at that higher tier in case I could figure out the problem. If so, they were willing to migrate me back.
I have to admit, if someone else on my server were using up more than their fair share of resources and causing my site to have performance issues, I’d want them to be forced to move to a different server, too. Plus, looking at the screencap of my resource usage above, you can see they let me slide for a very long time.
So I do have to say that they were pretty fair in their dealings with me.
The only negative thing I will say is this: I’m disappointed that they let me twist in the wind like that for a week. With their level of access to the server, couldn’t they see that it was bots eating up all the bandwidth and running up the CPU?
Since their first message said “optimize the site”, that’s exactly what I did. I kept reporting in regularly with my efforts to improve the site, so why didn’t someone immediately say “you need to stop the Yandex bot and the Bing bot from hammering your site”?
Admittedly, about a week in, one of the many people I was in contact with suggested checking for malicious traffic. This was what made me concentrate on the bots, but wouldn’t you think they could tell what exactly was causing the trouble? Why did I have to play detective? Even if they don’t do the leg work for me, there are other people on the server who are being affected.
They went as far as figuring out that it was my “account”. I have several domains on my account, so I had to look into all of them.
Maybe on the shared hosting level, there are just too many people to devote that kind of attention to. So either I figure it out, in which case the problem is solved; or I don’t figure it out, in which case they move me to a higher-cost VPS plan (double win!); or I leave and the problem leaves with me.
Again, I’m only paying $13.95 a month, so I guess you have to pay top dollar for top service. I knew the name of the game when I started playing…
At the time of this writing, I’m still on a “probation” of sorts at Site5. They’re keeping an eye on me to see if the resource usage spikes again. So the threat of server migration is still looming for now. It does look like it’s under control, but who knows what evil bot is lurking in the heart of some internet ne’er-do-well out there?
On that note, I guess I’m outta here for now, guys. I’ll be back for part 2 next week, where I’ll go over some WordPress-specific optimization tips. I will see you all next time!