Blocking bots/scrapers - OVH domain

User37935

Neophyte
Joined
May 4, 2011
Messages
0
Wasn't sure whether to post this specifically under XenForo (which I use) or put it out as a general question. I've noticed a huge spike in guests, and tonight they're all "viewing latest content" and nothing else, every minute on a new IP, all within a range like 54.36.150.* (resolves to OVH France). That adds something like 100-200 online guests.

Clearly not an actual user, and it happened last night on 54.36.149.* IPs, which I blocked; in that case they were looking at threads too.

I've banned the IPs using a wildcard as above, but this could cause collateral damage to legitimate users. I'm not a French forum and don't have any French users as such, but still, wildcards are by their nature indiscriminate.

First question: what the hell is this, some bot, scraper, etc.? Google turns up a number of forum posts in various places talking about it, and it seems to be bad news, though nothing dated that recently, so I thought I would ask again. Second question: is wildcard banning likely to be a bad idea (is this "whack-a-mole" and I might as well not bother)?

And lastly, is there a better way to ban, maybe by resolving the host and matching a.ahrefs.com, which is what it comes up as? If so, how do I go about this in a way that doesn't cripple the server? Is this an .htaccess job? Or is there a plugin for XF 1.5.x that screens out bots? They're not registering, so none of my anti-spammer plugins even get used here.
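
For illustration, the two wildcard bans and the hostname idea from this post can be sketched in Python. This is only a sketch under stated assumptions: the wildcards 54.36.149.* and 54.36.150.* are written as /24 networks, the function names are hypothetical, and the hostname is assumed to come from a reverse DNS lookup done elsewhere (for example with socket.gethostbyaddr), which is not performed here.

```python
import ipaddress

# Assumption: the two wildcard bans from the post, written as /24 networks.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("54.36.149.0/24"),
    ipaddress.ip_network("54.36.150.0/24"),
]


def is_blocked(ip: str) -> bool:
    """Return True if the address falls inside one of the banned networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETWORKS)


def is_ahrefs_host(hostname: str) -> bool:
    """Return True if a reverse-DNS name is ahrefs.com or a subdomain of it."""
    host = hostname.rstrip(".").lower()
    return host == "ahrefs.com" or host.endswith(".ahrefs.com")


if __name__ == "__main__":
    print(is_blocked("54.36.150.17"))      # inside a banned /24
    print(is_blocked("54.36.151.1"))       # outside both banned /24s
    print(is_ahrefs_host("a.ahrefs.com"))  # hostname mentioned in the post
```

Matching on the full domain suffix, rather than on the substring "ahrefs", avoids treating a name such as ahrefs.com.example.net as a match.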
 

we_are_borg

Tazmanian
Joined
Jan 25, 2011
Messages
5,964
Can you look in the log and see the user agent? That would help pinpoint the issue faster.
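
A rough way to pull user agents out of an access log is sketched below. It assumes the server writes the common "combined" log format; the sample lines, the ExampleBot agent string, and the function name are made up for illustration and are not taken from the poster's logs.

```python
import re
from collections import Counter

# Combined log format:
# IP ident user [time] "request" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def count_user_agents(lines, ip_prefix=""):
    """Count user-agent strings for requests whose IP starts with ip_prefix."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("ip").startswith(ip_prefix):
            counts[match.group("agent")] += 1
    return counts


# Hypothetical sample lines, for illustration only.
SAMPLE = [
    '54.36.150.12 - - [01/Jan/2020:00:00:01 +0000] "GET /find-new/posts HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"',
    '54.36.150.99 - - [01/Jan/2020:00:01:01 +0000] "GET /find-new/posts HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"',
    '203.0.113.5 - - [01/Jan/2020:00:01:30 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for agent, hits in count_user_agents(SAMPLE, "54.36.150.").most_common():
        print(hits, agent)
```

Passing the wildcard prefix (here "54.36.150.") limits the count to the suspicious range, so a single dominant user agent stands out quickly.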
 

User37935

Neophyte
Joined
May 4, 2011
Messages
0
Their site says "Every day we crawl 6 BILLION web pages". Great.

I've blocked the above ranges.

Now... on to this one from China/Mongolia, what would that be doing harvesting user profiles?
 

mysiteguy

Fanatic
Joined
Feb 20, 2007
Messages
3,619
I block all OVH ranges (that I know of) in all countries. There are no legit users coming from those server farms. Amazon ranges are as bad, or worse, but there are a few IPs and user agents I've carved out to allow their Silk browser cache system access to my servers. I have over 2400 IP ranges from various server farms blocked from my server, with no impact on legitimate users.

.htaccess is fine for a few rules, but not when you have a large number. At that point you want to use a combination of firewalling and moving some rules to conf files so they aren't reparsed with every fetch.
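
When a block list grows large, one way to keep the rule count down before loading it into a firewall or conf file is to merge adjacent and overlapping ranges first. The sketch below uses Python's standard ipaddress module; the function name is hypothetical, and apart from the two OVH /24s mentioned earlier in the thread, the input ranges are documentation-only example networks rather than real block-list entries.

```python
import ipaddress


def collapse_ranges(cidrs):
    """Merge adjacent or overlapping IPv4 CIDR ranges into the fewest networks."""
    networks = [ipaddress.ip_network(cidr) for cidr in cidrs]
    return [str(net) for net in ipaddress.collapse_addresses(networks)]


# Two ranges from the thread plus documentation-only example networks.
RANGES = [
    "54.36.150.0/24",
    "54.36.149.0/24",
    "192.0.2.0/25",       # adjacent halves of one /24
    "192.0.2.128/25",
    "198.51.100.0/24",
    "198.51.100.64/26",   # already covered by the /24 above
]

if __name__ == "__main__":
    for cidr in collapse_ranges(RANGES):
        print(cidr)
```

Note that 54.36.149.0/24 and 54.36.150.0/24 stay separate: they are neighbours numerically but do not share a /23, so they cannot be merged into a single network without also covering addresses outside the two ranges.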
 

Alpha1

Administrator
Joined
May 28, 2007
Messages
4,268
Alternatively: Cloudflare also stops a lot of bad bots in its tracks.
 

mysiteguy

Fanatic
Joined
Feb 20, 2007
Messages
3,619
When it's free, you are the product. And in this case, so are your visitors. Cloudflare doesn't give free SSL, CDN, and DNS out of the goodness of their heart.

Be aware that despite their claims of privacy, their founder was all too willing to sell data from ProjectHoneyPot, the service which he started prior to Cloudflare. I don't know for a fact, but I'd venture to say they are selling blocking and traffic pattern information there as well. After all, it's a giant honeypot.

From their terms:
"Cloudflare may aggregate data we acquire about our Customers and their End Users, including the Log Data described above." and "Non-personally identifiable, aggregated data may be shared with third parties." - This means they won't share identifying info. There is nothing in their terms which prevents them from sharing your traffic sources, bounce rates, time spent on site, number of visitors, what pages visitors view, etc.

Also, they have a 5-year agreement with the Asian registry APNIC to share DNS query information.

Basically, they are certainly able to sell information about your business or hobby web site. They aren't end user spyware, but essentially they are server spyware in my opinion.

I have avoided and will continue to avoid the Cloudflare bandwagon.
 

two50v

Neophyte
Joined
Apr 14, 2019
Messages
7
Ahrefs. I hate them and their competitors that leech off your bandwidth for no apparent reason... I suspect that anyone making a new site should block all spiders except perhaps Google and a few others via robots.txt as a first move. I noticed the bad bots seem to swarm in especially once Yandex knows of your site...

Several bad bots (I believe Ahrefs and MJ12 both fall into this category) probably do check robots.txt, and if they're not blocked at the time they initially check it, they don't bother checking it again.
 

Anton Chigurh

Ultimate Badass
Joined
Feb 22, 2015
Messages
1,393
two50v said:
"Ahrefs. I hate them and their competitors that leech off your bandwidth for no apparent reason... I suspect that anyone making a new site should block all spiders except perhaps Google and a few others via robots.txt as a first move. I noticed the bad bots seem to swarm in especially once Yandex knows of your site...

"Several bad bots (I believe Ahrefs and MJ12 both fall into this category) probably do check robots.txt, and if they're not blocked at the time they initially check it, they don't bother checking it again."

Robots.txt doesn't block anything, never has, never will.
 

two50v

Neophyte
Joined
Apr 14, 2019
Messages
7
I worded that incorrectly... I suspect that they abide by whatever is in robots.txt at their first crawl, then don't care if any alterations are made...
 

Anton Chigurh

Ultimate Badass
Joined
Feb 22, 2015
Messages
1,393
two50v said:
"I worded that incorrectly... I suspect that they abide by whatever is in robots.txt at their first crawl, then don't care if any alterations are made..."

Each "bot" has its own rules; typically they either obey robots.txt or not. Nothing in between.

Really BAD bots are programmed to use your robots.txt to find the sensitive areas of a site, since people often list their admincp and other backend areas there, which just identifies them for hackers. I keep that stuff out of robots.txt; if those areas are crawled, the result is either an error or a login form, and Google, Yahoo, Bing and other "friendly" crawlers won't index those anyway.

Robots.txt really isn't all it's cracked up to be and not really all that useful. It's just a signpost that says basically, "please don't look in the following rooms."
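
To make the "signpost" point concrete, here is a small sketch of how a crawler that chooses to comply would read robots.txt, using Python's standard urllib.robotparser. The robots.txt content, the forum.example URL, and the function name are assumptions for illustration; AhrefsBot and MJ12bot are the user-agent tokens of the two crawlers named earlier in the thread. Nothing here is enforced by the server: a bot that never runs a check like this is unaffected.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: turn away two crawlers, allow everyone else.
# Backend paths such as an admin area are deliberately not listed.
ROBOTS_TXT = """\
User-agent: AhrefsBot
Disallow: /

User-agent: MJ12bot
Disallow: /

User-agent: *
Disallow:
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return what a compliant crawler would conclude for this agent and URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://forum.example/threads/1"
    for agent in ("AhrefsBot", "MJ12bot", "Googlebot"):
        print(agent, allowed(agent, url))
```

The check happens entirely on the crawler's side, which is why the file works against bots that obey it and does nothing against those that do not.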
 

DigNap15

Habitué
Joined
Sep 14, 2019
Messages
1,115
All I want to do is stop a few bots like Yandex, Majestic and that Sogou one from reaching my XF forum.
As I am in New Zealand, my forum does not need members from many other countries.
I am sure that all those crawlers above do is use my bandwidth and help spammers find my forum.
Why is it so hard to block them?
There seem to be so many methods, and no one can agree which is the best and the easiest.
 

Xon

Developer
Joined
Feb 15, 2015
Messages
311
Alpha1 said:
"Alternatively: Cloudflare also stops a lot of bad bots in its tracks."

I use Cloudflare's firewall to inject JavaScript challenges for entire ISPs via the asnum matching rule. The actual IP doesn't matter; it blocks the entire ISP :)
 

DigNap15

Habitué
Joined
Sep 14, 2019
Messages
1,115
Gee what a mess the internet is.
To block bots etc.
Some of you say robots.txt is good and some say it's useless.
Some of you say that Cloudflare is good or bad.
 

MagicalAzareal

Magical Developer
Joined
Apr 25, 2019
Messages
758
DigNap15 said:
"Gee what a mess the internet is.
To block bots etc.
Some of you say robots.txt is good and some say it's useless.
Some of you say that Cloudflare is good or bad."

robots.txt does stop a number of bots, as does Cloudflare, although neither is perfect.
Cloudflare can block a lot of bots, but for me it often ends up blocking some legitimate users as well, so I turn the setting down, low enough that some vulnerability scanners get through.

P.S. Cloudflare looks like obvious NSA SIGINT infrastructure.
 

Paul M

Super Moderator
Joined
Jun 26, 2006
Messages
4,077
Never found the need to block crawlers; they are not hurting the site in any way I know of.
 

mysiteguy

Fanatic
Joined
Feb 20, 2007
Messages
3,619
DigNap15 said:
"Gee what a mess the internet is.
To block bots etc.
Some of you say robots.txt is good and some say it's useless.
Some of you say that Cloudflare is good or bad."

It is both good and useless. It's good against bots which obey it, and useless against those which do not.
 