[sf-lug] [on-list] site up, http[s] down: Re: Wierd problems trying to access linuxmafia.com

Michael Paoli Michael.Paoli at cal.berkeley.edu
Tue Dec 11 01:11:47 PST 2018


Taking it back on-list, because ... well, why not?  ;-)

> From: "Rick Moen" <rick at linuxmafia.com>
> Subject: Re: Wierd problems trying to access linuxmafia.com
> Date: Mon, 10 Dec 2018 22:55:15 -0800

> Quoting aaronco36 (aaronco36 at SDF.ORG):
>
>> Rick and Michael,
>>
>> Having been having some difficulty attempting to access
>> linuxmafia.com and the SF-LUG mailing-list this afternoon and
>> evening.

Yes, ... I noticed something of this on Sunday, when I was at
BerkeleyLUG.  And from what I recalled from being at CABAL on
Saturday*, just a day earlier, I didn't find this exceedingly
surprising.  First bit I noticed ... "connection refused" - and
from that, I then thought hmmm, ... connection actively refused (a TCP
RST), so host up, but nothing listening on TCP port 80 (and/or 443, but
I think I was mostly using and/or first noticed it on 80).  I seem to
recall ping looked fine, port 22 was open, ...
I have ssh login access, so logged in on TCP port 22.
A bit of looking around ... what TCP ports are listening - and
on Internet-addressable IPs (either explicitly, or wildcard) ...
yes, 22, ... 80 and 443 not showing (8080 was, though I didn't poke to
see if that was also a web server) - see the quick sketch of such a
check just below this paragraph.  And I also noticed 53 (DNS),
and perhaps others.  And then a quick use of:
$ who -HTu
And I noticed Rick was not only logged in, but had rather-to-quite-active
sessions - one with (as I recall) activity within the last 7 minutes,
and, peeking again a bit later, one with activity within the minute.
So I figured Rick was probably investigating/troubleshooting, or doing
some maintenance or whatever, and I wasn't going to worry about it, and
any more interesting bits would probably get some mention by Rick to
the conspire list and/or other applicable list(s) in near future.
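(For reference - a quick sketch of that sort of check; on a reasonably
current Linux box, either of these lists the listening TCP sockets and
the addresses they're bound to - ss(8) from iproute2 where present,
otherwise ye olde netstat:
$ ss -lnt
$ netstat -lnt
)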

*Most notably, that from the CABAL Wi-Fi, access to The Internet was,
uhm, "quite slow" ... sure, it's not a high-bandwidth connection,
but notably, latency was very high, while packets weren't being
dropped (at least for the most part).  This is commonly seen on a
saturated (or nearly saturated) connection - typically queues on the
sending and/or receiving side of the ISP's router get filled up, and, though
throughput is high (or about as high as it can be), latencies are
very high ... e.g. seeing ping times from over 3000ms to over 5000ms
(though not reaching 6000ms).  As an aside, I'll also mention that's
often a side effect of how ISPs "tune" their routers and such ... folks
most often simply look at the "speed" of ISPs - most such speed tests
look at throughput.  Optimizing for such a (simplistic) speed test
comes at some cost - latency - as what generally gives the highest
throughput involves large(r) queues, and, with those, when link(s) are
saturated, higher latencies.  Yeah, most "Internet Speed tests" don't
show what the ISP's latency is like when the link is saturated, so, most
folks don't know/care/compare ... well, until they run into it - but that's
usually after they've already selected and are using an ISP, ... not while
they're doing some simplistic "speed tests".
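(For the curious, that effect is easy enough to observe for oneself - a
rough sketch, with placeholder host/URL - run a ping in one terminal
while saturating the link from another, and watch the round-trip times
climb:
$ ping some.well-connected.host
$ wget -O /dev/null http://example.com/some/large/file
On an idle DSL link the ping times might sit around tens of milliseconds;
once the queues fill up, they can climb into the thousands, much as
described above.)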

I noticed a bit later, on conspire ... in at least the relevant parts:

] Date: Sun, 9 Dec 2018 14:52:45 -0800
] From: Rick Moen <rick at linuxmafia.com>
] To: conspire at linuxmafia.com
] Message-ID: <20181209225245.GM3915 at linuxmafia.com>
] Content-Type: text/plain; charset=utf-8
]
] Did some work this afternoon on what's gobbling all of my bandwidth
] lately.  It's pretty much all Apache httpd.  I've now removed some
] unneeded junk, done tidying up such as moving a tree of docs about exim4
] into a tarball, removed some stupidly problematic symlinks (don't
] symlink something to '.' in an HTML tree, people), and basically tried
] to find obvious sources of trouble.
]
] I'll have to spend some time doing logfile analysis to see what's
] getting all this traffic, maybe revamp the robots.txt file, and maybe
] spank some particularly egregious offenders with iptables blocks.

> Yeah, you know?  I've had Apache httpd stopped recently because
> extravagantly large levels of Web requests, which I tentatively
> guesstimate to be from Web-spidering bots and/or extremely inconsiderate
> individuals recursively fetching everything on the site, have so
> thoroughly clobbered my aDSL line that almost nothing else can get
> through.

And that, I don't find surprising.

> I'm studying remedies and doing logfile analysis, to determine whom to
> spank and how, but cannot devote vast amounts of consecutive time to
> that task because I need to also live the rest of my life.
>
> The sf-lug mailing list is perfectly fine, as you perhaps were able to
> determine from its ongoing discussion.  Mailman's Web pages are
> unreachable at the moment because Apache http is temporarily not
> running.

Yes, the lists also have lovely List-... headers.
If we look at, e.g., an SF-LUG list posting, and ignore the
http[s] bits, we find:
List-Unsubscribe: <mailto:sf-lug-request at linuxmafia.com?subject=unsubscribe>
List-Post: <mailto:sf-lug at linuxmafia.com>
List-Help: <mailto:sf-lug-request at linuxmafia.com?subject=help>
List-Subscribe: <mailto:sf-lug-request at linuxmafia.com?subject=subscribe>

That certainly covers at least the essential functionality.  The Help
address can of course tell one lots more about what can be done via
email and how.
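(E.g., a minimal sketch - assuming a working local mail(1) and MTA - this
should get Mailman's help reply back by email:
$ echo help | mail -s help sf-lug-request@linuxmafia.com
)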

> Persons wishing to help this situation could work out and send drop-in
> methods of throttling excessive Web traffic without penalising
> non-abusive traffic, or pointing me to good materials.  Otherwise, my
> figuring out how best to do that will just take as long as it takes.
>
> Michael, a copy of the logs is in /backup/tmp/ , if you wish to dig in.

And, what the heck, why not, peeking a bit ...
First I looked for the largest file in said directory,
then I stripped each line down to the IP address and User-Agent string (and
with the quotes (") around it, as Apache logs it ... just 'cause I was lazy
and that was simpler and faster:
:%s/ .*\("[^"]*"\)$/ \1/
)
Then, on those stripped lines, I did the following, all within a scratch
vi(1) session (after looking at filesystem free space and the relative
free RAM situation):
1G!Gsort | uniq -c | sort -bnr
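(Roughly the same could be done outside of vi with a small shell
pipeline - the first command just spots the largest file(s); access.log
below is a stand-in name for whichever file that turns out to be:
$ ls -lS /backup/tmp/ | head
$ sed 's/ .*\("[^"]*"\)$/ \1/' access.log | sort | uniq -c | sort -bnr | head -25
)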
And from that sorted output, we have some top counts, by IP address and User-Agent
string ... keep in mind that User-Agent is totally under control of the
client, so if, e.g., it randomly or sequentially changed it on every
request, that wouldn't help us much.  But I guesstimate that's probably
not likely to be an issue.  In my experience, most excessive http[s]
traffic to servers is overzealous/stupid bots/spiders crawling content,
and/or stupid bots trying to harvest email addresses or find some way(s)
to send spam (or do drive-by wiki spamvertising, etc.).  In general they're
more dumb/stupid/annoying than stealthy ... can't be very stealthy with
the stupid volumes of traffic they generate anyway.  So, we have, top o' the list:
   37799 198.144.195.190 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:60.0)
   22845 66.160.140.183 "The Knowledge AI"
   20277 66.160.140.182 "The Knowledge AI"
   14982 46.229.168.71 "Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.sem
   11990 64.62.252.164 "The Knowledge AI"
    4323 64.62.252.163 "The Knowledge AI"
    3770 52.23.177.140 "MauiBot (crawler.feedback+wc at gmail.com)"
    3763 34.204.61.93 "MauiBot (crawler.feedback+wc at gmail.com)"
    2604 141.8.143.129 "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.co
    2332 64.62.252.169 "The Knowledge AI"
    2297 46.229.168.75 "Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.sem
    2033 46.229.168.83 "Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.sem
    2028 46.229.168.78 "Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.sem
    1958 46.229.168.80 "Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.sem
    1937 46.229.168.84 "Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.sem
    1937 46.229.168.81 "Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.sem
    1935 46.229.168.79 "Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.sem
    1923 46.229.168.73 "Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.sem
    1915 46.229.168.82 "Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.sem
    1905 46.229.168.85 "Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.sem
    1897 46.229.168.69 "Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.sem
    1886 46.229.168.66 "Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.sem
    1883 46.229.168.74 "Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.sem
And I'm showing that as I see it in my terminal (emulator) window - I did:
:se nowrap
had it been nvi (which I prefer), that would've been:
:se leftright
(aside - nvi is highly similar to ye olde classic vi - it fixes some issues/
limitations/bugs, and adds about 2 or 3 minor but key improvements,
but other than that, it's (generally considered) keystroke-for-keystroke,
bug-for-bug compatible with ye olde classic vi ... which works dang great
for me, as my fingers are highly experienced at that, and vim slows me down
a lot as it doesn't behave quite the same way ... vim adds a few bazillion
"improvements" ... for certain definitions of "improvements" (and a whole
lot o' bloat along with that)).
Anyway, in the above, one is just seeing the first 80 columns - sufficient
for my immediate purposes, and much easier to look at with the lines not
wrapped (classic vi has no way to turn off line wrapping ... but of course
there's always:
1G!Gcut -c-80
)
Anyway, from our listing we see/notice ... the very top one is on the same
subnet; that's not really an issue, as that's not over the slow link, so we
can at least mostly ignore that, especially since it's not orders of
magnitude larger than the ones immediately after it that do traverse the
"slow" (DSL) link.  And a cursory review - including having earlier seen
the URLs/paths hit - looks like mostly dumb web crawlers that don't know
better.  The better-behaved ones don't hit a site hard.  They'll hit
one page/link ... then typically wait minutes or more (at least on
slower sites) before hitting another link/page.  That way the traffic
is quite minimal.  Trying to crawl everything at once, sequentially or
in whatever order, is generally considered, at best, discourteous,
and certainly not best practice, and for many sites, it can be rather to
quite problematic.

So ... what can be done about it?  Shut off the web server.  ;-)  But that's
a self-DoS (Denial-of-Service) ... typically not a useful long-term solution.
robots.txt?  That can help - tell the more problematic bots to bugger off.
If they follow the protocol, that stops the traffic problem for those
bots ... at the "cost" of not being indexed by search engines that ... well,
just about nobody cares about anyway.  Of course, too, that does nothing
against bad bots that ignore robots.txt ... that's a different problem -
one at least not immediately seen here in the sample data I looked at.
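(For instance - a minimal sketch, not necessarily what linuxmafia.com
should actually deploy - something along these lines in /robots.txt would
tell the heaviest crawlers from the listing above to go away entirely, and
ask the rest to slow down (Crawl-delay is honored by some, though not all,
crawlers):
User-agent: SemrushBot
Disallow: /

User-agent: MauiBot
Disallow: /

User-agent: *
Crawl-delay: 30
)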
Can also do "interesting" things with the web server ... e.g. based upon
client IP address(es)/range(s) and/or User-Agent ... oh, maybe give 'em a
web page that says something like "Sorry sucker, your crawler sucks way
too damn much bandwidth.  So now this is all you get to see."  There are
also traffic shaping tools/packages, ... those could be useful and more
generally solve the problem - including for lots of other types of
traffic.  The idea with something like that is to throttle traffic before
the queues fill up and latencies climb annoyingly high - that keeps
most of the bandwidth available while holding down latencies ... but in
a case such as this, that may not be how one wants to use most of that
bandwidth (notably to annoyingly overzealous crawlers).
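(A sketch of the web-server approach mentioned above - assuming Apache 2.4
with mod_setenvif; the User-Agent patterns are just taken from the listing
above.  Mark requests from the overly zealous crawlers, then refuse them
with a cheap 403 (or point ErrorDocument 403 at a tiny "Sorry sucker ..."
page):
BrowserMatchNoCase "SemrushBot|MauiBot|The Knowledge AI" overly_zealous_bot
<Location "/">
    <RequireAll>
        Require all granted
        Require not env overly_zealous_bot
    </RequireAll>
</Location>
On older Apache 2.2 the equivalent would be Order/Allow plus
"Deny from env=overly_zealous_bot".)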

I know that on, most notably, the various BALUG sites and the
(non-list) SF-LUG sites, I've had similar issues with such crawlers.
It's been "annoying enough" that I've certainly thought of taking more
explicit action ... but I've also noticed that over some time (days/weeks),
they seem to mostly gather up the content they want and back off ...
after a while, generally becoming less frequently so annoying.
Of course I've had some exceptions too (notably bad bots registering
wiki accounts in very large volumes).

So, various possible approaches.  One might also be able to code up
something that would (at least temporarily) block offenders.  Maybe there's
even package(s) to do that?  One could also contact the offending bots'
operators - most bots that aren't intending to be malicious have a URL with
contact info in the User-Agent string, or even provide contact info (e.g.
an email address) right there.  Don't think I've
quite contacted such yet, ... but I've been highly tempted to on occasion.
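(E.g., a minimal sketch of the temporary-block idea, via the iptables
approach Rick mentioned - run as root; the range here is one of the
heavier SemrushBot ranges from the listing above:
# iptables -I INPUT -s 46.229.168.0/24 -p tcp -m multiport --dports 80,443 -j DROP
Crude, and it penalises that whole range rather than just the bot, but it
does keep that traffic from tying up the web server and the DSL link until
something better is in place.)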

>> Might not make the most sense, but I also cc'd this msg to <rick at
>> linuxmafia.com>
>
> I appreciate your diligent investigation, but you might have noticed that
> I've been actively corresponding on sf-lug@ and conspire@, and that your
> pings and traceroutes were successful.



