My personal server doesn’t receive much email, and only hosts a few domains, as well as running as backup mail server for the SquirrelMail.org. It usually processes about 6500 emails on a daily basis, of that 4000 or so of that is rejected by RBL. After the Squeeze update, the amount of email rocketed to 230000, with 229000 rejected by RBL matches…
I get daily reports from my servers using LogWatch, and noticed the discrepancy when the daily report from my server came in. This is what my LogWatch report looked like on 02/24/2011, compared to 02/25/2011…
This is 02/24/2011…
2237 Accepted 34.74% 4202 Rejected 65.26% -------- ------------------------------------------------ 6439 Total 100.00% ======== ================================================ 60 5xx Reject relay denied 1.43% 16 5xx Reject HELO/EHLO 0.38% 510 5xx Reject unknown user 12.14% 260 5xx Reject recipient address 6.19% 29 5xx Reject client host 0.69% 3327 5xx Reject RBL 79.18% -------- ------------------------------------------------ 4202 Total 5xx Rejects 100.00% ======== ================================================
This is 02/25/2011
2372 Accepted 1.02% 229430 Rejected 98.98% -------- -------------------------------------------------- 231802 Total 100.00% ======== ================================================== 25 5xx Reject relay denied 0.01% 2381 5xx Reject HELO/EHLO 1.04% 1929 5xx Reject unknown user 0.84% 3805 5xx Reject recipient address 1.66% 8 5xx Reject client host 0.00% 221282 5xx Reject RBL 96.45% -------- -------------------------------------------------- 229430 Total 5xx Rejects 100.00% ======== ==================================================
As you can see, a substantial difference. So I started to look at the possible causes, did a Squeeze upgrade break something I didn’t see before? Did the Squeeze upgrade fix something that was broken before? Then I noticed the same spike in the primary mail server for SquirrelMail.org. This meant one thing, as I hadn’t upgraded the SquirrelMail.org server to Squeeze yet, this was a massive ramp up in spam. I also noticed that the number of accepted messages hadn’t really changed a great deal. 100 messages difference between 2 days is barely anything.
Tracking down the cause of the spam, or the reason things ramped up, is usually difficult to do, unless that’s all you do, and you have the resources, and knowhow. I don’t have much in the way of either, so I figured I’d do something else. Figure out where it was coming from.
This little project involved grabbing a few sources of information, such as the locations of IPs, and then some way to represent that data. IP to country matching is quite easy, there are quite a few sources that provide the data. I’ve used MaxMind in the past, and they provide a CSV file with the matches. As I only cared about country, I went with that product. I then imported that into a MySQL database, and normalized the data. Vincent de Lau has a great post on importing and normalizing the data here so I won’t go over that process.
The next step was to find the IPs and locations of all the attempts. As I figured out the ramp-up was only impacting the SquirrelMail.org domain, I could limit the search of the logs to just that domain name. I also only cared about the stuff that was not allowed to be delivered, and I only wanted the IP, so a little crafting with grep, sed, sort, uniq, and we got the information we wanted. I started by copying the log files to a different directory, and then looked at a single day to start with…
1 2 3 4
To break down what this is doing, the first part limits the search to just Feb 26, the second looks at messages that were marked as NOQUEUE, third limits to just squirrelmail.org. The sed statement takes the resulting line, and grabs just the connecting IP from the log file. The sort and uniq lines sort the resulting IP addresses, and then creates a count next to each IP for the number of times that it occurs. The output file ends up looking like this:
1 22.214.171.124 2 126.96.36.199 1 188.8.131.52 2 184.108.40.206
The first number is the counter, the second is the IP. As I wanted to get an idea over several days, I created a loop around the code, and I ended up with the following…
1 2 3 4 5 6 7 8 9
The seq command tells bash to loop from 20 through to 26. This gave me 6 days of output with IP addresses in it.
Now I had to figure out where each of these IP addresses was coming from. I also needed somewhere to store the information, so I created a new MySQL table. This table stored the country, the server affected, the date it happened, and the number of matches for that country…
1 2 3 4 5 6 7 8 9
I created a primary key on ci, svr, and rpt_date because there is a nice little trick you can do with an insert statement.
I used a quick, and dirty, PHP script to process the files, which you can download here. But here is a break down of the important parts.
The first function searches the database for country matches. It uses a function called inet_aton in MySQL. This takes the IP address, and turns it into a long value. The formula is actually pretty easy, a*2563 + b*2562 + c*256+d, where the IP address looks like this a.b.c.d. There is a similar function in PHP called ip2long, however the PHP function is impacted by being run on 32bit platforms which results in negative values, which causes problems with the above table. For example:
1 2 3
This returns –926365496 on 32bit systems, but 3368601800 on 64bit systems. The MySQL function inet_aton always returns the positive value.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Taking our 220.127.116.11 IP for an example, this is what the SQL looks like:
1 2 3 4 5 6 7
In my case, the ci value matches Brazil in my cc table.
1 2 3 4 5 6 7 8 9 10
This function uses the trick I mentioned a little earlier with the MySQL insert statement. Because I’d made ci, svr, and rpt_date part of the primary key, we’re not allowed to add duplicates of it, so the ON DUPLICATE part tells MySQL what to do if the key already exists. In our case, we’re adding the new value to the old value. The rest of the script is trivial file processing, and some string splitting work. Once I’d processed the log files, I did a quick check to see what the numbers looked like, for example, how was the spam level for the USA.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Okay, data looks good, now we get to have some fun with the data1. This is where I wanted to use heat maps. This is a graphical representation showing an increase in color representing an increase in data. This is a perfect type of representation for what we’re after as it’ll show which countries have the higher levels of spam sources. I recently saw an excellent example by Colin Walker on F5 DevCentral. He pretty much did all the hard work for me, all I needed to do was plug in my data. I bashed together another script to bundle all the data up, and craft the URL to download the images. However I had some issues with this, Google’s charting API would return a 400 error, but not explain what was wrong, so I poked around some more, and finally got a URL to work, and I ended up with a heat map that looks like the one to the right.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
As you can see by moving your mouse over the charts, China, and the USA, have the highest counts (represented by the darker color), with the USA at 2101 for both servers. Now compare the results against the data for the 25th Feburary here. The USA, and China, are still mostly darker than the rest of the world, but the shocking numbers are when you highlight Vietnam (44,000 hits), German (37,000 hits), and Saudi Arabia (43,700 hits).
While this is all pretty cool, and very nice visualization, I still don’t know why the spam host count increased so rapidly, nor do I know if it’ll stop (I’m hoping so). What I have discovered though, is some very cool visualization tools from Google, as well as playing around with GeoIP mapping.
What do you guys think? Why are we getting hammered with spam hosts? Any new viruses? How do you represent similar data?
As a side note, if you didn’t already know, I enjoy this kind of stuff. Back in 2007, I was trying to figure out if I fitted in where I worked (read it [here]) and I discussed data, information, and manipulation.↩