When you run a prosthetic limb, you can feel ghost itch and pain… but when you run a website on the Internet, nothing is off the table!
Table Of Contents (T.O.C.):
1. Prelude
2. Checking it out
3. Digging in
4. A Cloudflare Wrongly
5. Departure And Arrival
6. The opposite of a foreword (the solutions)
6.1. How to tell if your website is affected?
1. Prelude
It was a rainy autumn Sunday afternoon. R. Novovic was sitting at his computer desk, sipping some fine Italian caffè mocha he had smuggled that summer. Slowly recovering from the last punches that kung-flu threw at him, he was looking forward to spending a few hours with the Pathfinder: Wrath of the Righteous – Enhanced Edition game. It had been on the to-play list for a while, and this was a perfect time to see if a goody-two-shoes paladin could save the world, at least in fiction.
That’s when the email came. Emails get ignored on Sundays, unless they arrive at the ing-box, the only inbox that truly matters. This one was right there.
Apparently, a bandwidth-sucking daemon was spotted on a client’s website. The Medisite of all sites – the one worth helping with even if it were for free (which it wasn’t). “Fuck!” he cursed. “If it’s not a Friday night, it’s gotta be on a Sunday. Well, at least I’m sober… for now.”
2. Checking it out
There are three rules for solving any website-related problem:
- Don’t panic.
- Don’t panic.
- Don’t panic.
‘Cause it won’t help you. Server logs, and backups in the worst-case scenario, are what save the day.
The Medisite had a portfolio site under development on its subdomain, on the same cPanel account. Perhaps it was just a false alarm and the client was simply doing a ton of uploads and downloads, and playing with some poorly optimized WordPress plugins. That was a bit of a stretch, but not unheard of. It could also explain the email reports of enormous CPU usage.
One look at the cPanel bandwidth usage stats killed that theory:
It’s the main site alright. The Medisite itself was in trouble. “You weren’t expecting to get off so easily, were you?” thought R. Novovic.
3. Digging in
Now was the time to use the power of Cloudflare. All the traffic goes through its firewall, so that was as good a place as any to have a look:
“Hmm, the visitor count is not changing nearly as drastically as the data (bandwidth) usage. So it might not be a DDoS attack; it looks more like some data-scraping bot operation.”
Without much thinking, and forgetting the three rules for solving any website-related problem, R. Novovic went through his standard, run-of-the-mill WordPress hardening checklist, implementing the latest & greatest Cloudflare WordPress anti-spam WAF rules.
That… that did not make a dent. “All right, handsome devil, have a sip of that mocha, take a deep breath, and… y’know… try using your brain… for a change,” he said to himself.
After a shamefully long time spent thinking (“remember this the next time you wish to brag about your ‘above average IQ’!”), the obvious struck him right in the forehead: “I should check the damn logs already!!!”
This was about the time when he thought it a jolly good idea to message his favourite hosting provider on Discord, thinking “it’s Sunday, but he’ll read the message and log screenshots at his convenience – this looks like a damn interesting problem that I haven’t seen before, perhaps it might just be interesting to Mr M. D. as well. It’s definitely worth documenting… somewhere.”
That intuition to do the impolite thing paid off a bit later…
Now, Cloudflare in its free tier wasn’t very helpful for this, but that’s why it’s good to have some server access (and that is one of the reasons why SaaS hosting solutions can suck). cPanel has a nice tool called “X-Ray.”
One minute of recording requests revealed almost 500 requests for different Medisite page URLs, ending with (or, better said, followed by) either a “?_rstr_nocache=” or “?amp=” query field name, paired with apparently random values. All those requests were coming from the same IP address network: 66.249.69.0/24.
A simple WHOIS lookup confirmed the IPs are owned by… wait for it… Google! Yes. No botnet, no scrapers, definitely not a DDoS attack – it was just Google bots, doing their job!
4. A Cloudflare Wrongly
The quickest way to test and confirm this was to create a Cloudflare WAF rule to block the bots:
Fortunately, having more guts than brains, R. Novovic relayed this info to Mr M.D. Having read the Discord messages, M.D. replied:
“I wouldn’t block. I would redirect to index or something. But that’s just me. LOL.”
It took R. Novovic a couple of minutes to realize he had just learned a very patient, very polite way to say:
“You realize you’ve outright blocked the damned Google from accessing those client website pages… you dumb fuck!?!”
Talk about not seeing the forest for the trees.
Of course, redirecting it all to the home page or similar is not perfect. “There must be a better solution to just get rid of the damn URL queries, and permanently (301) redirect to the original page – the one that can be cached.” His idea was to greatly reduce the server load, allow Google to freely index all that should be indexed, and to remove all the “fake” pages after a short while, thanks to permanent 301 redirects.
To reiterate, the idea was to redirect all the:
medisite.com/anathomy/?amp=sdfgjosfad89wfu8934u
medisite.com/anathomy/?amp=48935gkjlfsdkjg444554
medisite.com/anathomy/?amp=gggkl,w203kjgkjdgfksd
to just:
medisite.com/anathomy/
And do the same for the countless “/?_rstr_nocache=” variants as well. There are many pages on the Medisite, but not millions.
The free Cloudflare tier was not perfect for this task. This was a job for .htaccess redirects. A few lines of code did the job quite elegantly:
For easier updates and reference, I’ve moved the .htaccess code to chapter 6 below.
– Author 🙂
5. Departure And Arrival
R. Novovic disabled the Cloudflare WAF rule he had hastily implemented, to test and confirm that the more elegant .htaccess redirects worked. It was fixed. For the most part.
Bandwidth and CPU usage were getting back to normal. But it was not over.
The “?amp=” stuff meant that some time in the past, the site might have had a Google AMP version. “But where did the ‘_rstr_’ stuff come from?!”
After some googling and searching, he concluded that the culprit was most probably the (Cyrillic to Latin – Serbian) Transliteration plugin.
As he was pouring some rakija before calling it a night, R. Novovic remembered his old shower thought: “shorter is better.”
6. The opposite of a foreword (the solutions)
The original solution to the problem described in this “story” (copy/pasting it here for easier reference and updates):
#BEGIN Redirect from AMP to non-AMP
RewriteCond %{QUERY_STRING} "amp=" [NC]
RewriteRule (.*) /$1? [R=301,L]
#END Redirect from AMP to non-AMP
#BEGIN Redirect _rstr_nocache query to its origin page
RewriteCond %{QUERY_STRING} "_rstr_nocache=" [NC]
RewriteRule (.*) /$1? [R=301,L]
#END Redirect _rstr_nocache query to its origin page
Update
According to this tweet by Lily Ray, this problem is bothering more than one website. The problem is shown in her screenshots:
This .htaccess redirect should help with that:
#BEGIN Redirect from msclkid to the canonical page
RewriteCond %{QUERY_STRING} "msclkid=" [NC]
RewriteRule (.*) /$1? [R=301,L]
#END Redirect from msclkid to the canonical page
It’s a good idea to test first with a temporary 302 redirect:
#BEGIN Redirect from msclkid to the canonical page
RewriteCond %{QUERY_STRING} "msclkid=" [NC]
RewriteRule (.*) /$1? [R=302,L]
#END Redirect from msclkid to the canonical page
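The separate rule blocks above can also be folded into one. What follows is a minimal sketch, not the configuration that ran on the Medisite. It assumes Apache 2.4 or newer (the QSD flag does not exist in older versions), mod_rewrite enabled, and, on a WordPress site, that the block is placed above the WordPress rewrite section. The (^|&) prefix is an addition that is not in the rules above: the plain "amp=" condition matches anywhere in the query string, so it would also catch a parameter such as "stamp=".

```apache
# Sketch only, under the assumptions stated above (Apache 2.4+, mod_rewrite enabled).
#BEGIN Redirect amp, _rstr_nocache and msclkid queries to the clean URL
RewriteEngine On
# (^|&) ties the name to the start of a parameter, so "stamp=" is left alone.
RewriteCond %{QUERY_STRING} (^|&)(amp|_rstr_nocache|msclkid)= [NC]
# QSD discards the entire query string, as the trailing "?" does in the rules above.
# R=302 is for testing; switch to R=301 once the redirects are confirmed.
RewriteRule ^ %{REQUEST_URI} [QSD,R=302,L]
#END Redirect amp, _rstr_nocache and msclkid queries to the clean URL
```

Like the rules above, this drops every query parameter on a matching URL, not only the offending one, so it is a poor fit for pages that depend on other query parameters.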
If you are dealing with search (or comment) spam, see my article “Stopping WordPress comment and search spam with CloudFlare.”
In this case, I would say that Google fucked up. Plain and simple.
Google had disregarded the properly set up canonical tags for every page, and started indexing each query parameter as a new, separate page. To add insult to injury, its bots hammered the hosting server with countless requests (though that was what alerted me to the problem, so there’s a little good with every bad 🙂 ).
It will take months for this indexing problem to get fixed in Google’s index. However, the configured 301 (permanent) redirects should make sure that the correct page version gets indexed eventually, disregarding all the other alternatives.
That redirect will also make sure that any visitors end up on the proper page (and URL).
6.1. How to tell if your website is affected?
Let’s say your website is “medisite.com” (you will use your own domain, of course).
Go to Google.com.
Search using a word that exists on many of your pages (for websites in English, I would search for the term “and”), telling Google to show only results for your website.
How do you do that?
Here:
site:medisite.com and
Of course, replace the “medisite.com” with your domain, and the term “and” with whichever word is most often used in your articles.
Here’s an example for searching my site (“bikegremlin.com”) for the term “chain”:
See if any search results show URLs (pages) with the strange sausage added at the end of URLs.
Then, do a normal search using the terms that your site ranks well for. Let us use Lily Ray’s example, and google the term “fried potatoes,” to see how the site she mentioned (“bakeitwithlove.com”) fares.
Use CTRL+F in your browser to find your domain quickly in the search results (in this case, I searched for “bakeitwithlove.com”).
A picture is worth a thousand words:
If your site is affected, section “6. The opposite of a foreword” above explains how to fix it.