Spam Spike, and Google Visualizations…

My personal server doesn’t receive much email, and only hosts a few domains, as well as running as backup mail server for the SquirrelMail.org. It usually processes about 6500 emails on a daily basis, of that 4000 or so of that is rejected by RBL. After the Squeeze update, the amount of email rocketed to 230000, with 229000 rejected by RBL matches…

I get daily reports from my servers using LogWatch, and noticed the discrepancy when the daily report from my server came in. This is what my LogWatch report looked like on 02/24/2011, compared to 02/25/2011…

This is 02/24/2011…

     2237   Accepted                                  34.74%
     4202   Rejected                                  65.26%
 --------   ------------------------------------------------
     6439   Total                                    100.00%
 ========   ================================================
 
       60   5xx Reject relay denied                    1.43%
       16   5xx Reject HELO/EHLO                       0.38%
      510   5xx Reject unknown user                   12.14%
      260   5xx Reject recipient address               6.19%
       29   5xx Reject client host                     0.69%
     3327   5xx Reject RBL                            79.18%
 --------   ------------------------------------------------
     4202   Total 5xx Rejects                        100.00%
 ========   ================================================

This is 02/25/2011

     2372   Accepted                                     1.02%
   229430   Rejected                                    98.98%
 --------   --------------------------------------------------
   231802   Total                                      100.00%
 ========   ==================================================
 
       25   5xx Reject relay denied                      0.01%
     2381   5xx Reject HELO/EHLO                         1.04%
     1929   5xx Reject unknown user                      0.84%
     3805   5xx Reject recipient address                 1.66%
        8   5xx Reject client host                       0.00%
   221282   5xx Reject RBL                              96.45%
 --------   --------------------------------------------------
   229430   Total 5xx Rejects                          100.00%
 ========   ==================================================

As you can see, a substantial difference. So I started to look at the possible causes, did a Squeeze upgrade break something I didn’t see before? Did the Squeeze upgrade fix something that was broken before? Then I noticed the same spike in the primary mail server for SquirrelMail.org. This meant one thing, as I hadn’t upgraded the SquirrelMail.org server to Squeeze yet, this was a massive ramp up in spam. I also noticed that the number of accepted messages hadn’t really changed a great deal. 100 messages difference between 2 days is barely anything.

Tracking down the cause of the spam, or the reason things ramped up, is usually difficult to do, unless that’s all you do, and you have the resources, and knowhow. I don’t have much in the way of either, so I figured I’d do something else. Figure out where it was coming from.

This little project involved grabbing a few sources of information, such as the locations of IPs, and then some way to represent that data. IP to country matching is quite easy, there are quite a few sources that provide the data. I’ve used MaxMind in the past, and they provide a CSV file with the matches. As I only cared about country, I went with that product. I then imported that into a MySQL database, and normalized the data. Vincent de Lau has a great post on importing and normalizing the data here so I won’t go over that process.

The next step was to find the IPs and locations of all the attempts. As I figured out the ramp-up was only impacting the SquirrelMail.org domain, I could limit the search of the logs to just that domain name. I also only cared about the stuff that was not allowed to be delivered, and I only wanted the IP, so a little crafting with grep, sed, sort, uniq, and we got the information we wanted. I started by copying the log files to a different directory, and then looked at a single day to start with…

grep "Feb 26" mail.log* | grep "NOQUEUE" | \ 
 grep -i "squirrelmail.org" | \ 
 sed 's|.*\[\([0-9]\+\.[0-9]\+\.[0-9]\+\.\)[0-9]\+\]:.*|\1|g' | \ 
 sort | uniq -c > feb_26.log

To break down what this is doing, the first part limits the search to just Feb 26, the second looks at messages that were marked as NOQUEUE, third limits to just squirrelmail.org. The sed statement takes the resulting line, and grabs just the connecting IP from the log file. The sort and uniq lines sort the resulting IP addresses, and then creates a count next to each IP for the number of times that it occurs. The output file ends up looking like this:

  1 109.121.23.116 
  2 109.172.40.186 
  1 109.184.166.10 
  2 109.184.229.233

The first number is the counter, the second is the IP. As I wanted to get an idea over several days, I created a loop around the code, and I ended up with the following…

for i in `seq 20 26`; 
  do   
    grep "Feb ${i}" mail.log* | \ 
      grep "NOQUEUE" | \ 
      grep -i "squirrelmail.org" | \ 
      sed ‘s|.*\[\([0-9]\+\.[0-9]\+\.[0-9]\+\.\)[0-9]\+\]:.*|\1|g’ | \ 
      sort | \ 
      uniq -c > feb_${i}.log; 
done

The seq command tells bash to loop from 20 through to 26. This gave me 6 days of output with IP addresses in it.

Now I had to figure out where each of these IP addresses was coming from. I also needed somewhere to store the information, so I created a new MySQL table. This table stored the country, the server affected, the date it happened, and the number of matches for that country…

CREATE TABLE `mail_rpt` ( 
  `ci` tinyint(3) unsigned NOT NULL, 
  `svr` tinyint(1) unsigned NOT NULL, 
  `cnt` int(10) unsigned NOT NULL DEFAULT 0, 
  `rpt_date` date NOT NULL, 
  PRIMARY KEY (`ci`,`svr`,`rpt_date`), 
  KEY `ci` (`ci`), 
  KEY `rpt_date` (`rpt_date`) 
);

I created a primary key on ci, svr, and rpt_date because there is a nice little trick you can do with an insert statement.

I used a quick, and dirty, PHP script to process the files, which you can download here. But here is a break down of the important parts.

The first function searches the database for country matches. It uses a function called inet_aton in MySQL. This takes the IP address, and turns it into a long value. The formula is actually pretty easy, a*256^3 + b*256^2 + c*256+d, where the IP address looks like this a.b.c.d. There is a similar function in PHP called ip2long, however the PHP function is impacted by being run on 32bit platforms which results in negative values, which causes problems with the above table. For example:

<?php 
echo ip2long(’200.200.200.200'); 
?>

This returns –926365496 on 32bit systems, but 3368601800 on 64bit systems. The MySQL function inet_aton always returns the positive value.

function lookupIP($ip) { 
    global $db; 
    
    $return = false; 
    
    $query = 'select ci from ip where start < = inet_aton(\'%1$s\') and end >= inet_aton(\'%1$s\');'; 
    
    $res =& $db->query(sprintf($query, $ip)); 
    
    if (PEAR::isError($res)) { 
        return false; 
    } 
    
    while1) { 
        $return = $row['ci']; 
    } 
    
    return $return; 
    
}

Taking our 200.200.200.200 IP for an example, this is what the SQL looks like:

mysql> select ci from ip where start < = inet_aton('200.200.200.200') and end >= inet_aton('200.200.200.200'); 
+----+
| ci |
+----+
| 64 |
+----+
1 row in set (0.06 sec)

In my case, the ci value matches Brazil in my cc table.

function AddMailRecord($ci, $srv, $date, $count) {
	global $db;
	
	$query = 'INSERT INTO mail_rpt (ci, svr, rpt_date, cnt) values (%d, %d, \'%s\', %d) ON DUPLICATE KEY UPDATE cnt=cnt+%4$d';
	
	$res =& $db->query(sprintf($query,$ci, $srv, $date, $count));
	if (PEAR::isError($res)) {
		die($res->getMessage());
	}
}

This function uses the trick I mentioned a little earlier with the MySQL insert statement. Because I’d made ci, svr, and rpt_date part of the primary key, we’re not allowed to add duplicates of it, so the ON DUPLICATE part tells MySQL what to do if the key already exists. In our case, we’re adding the new value to the old value. The rest of the script is trivial file processing, and some string splitting work. Once I’d processed the log files, I did a quick check to see what the numbers looked like, for example, how was the spam level for the USA.

mysql> select * from mail_rpt natural join cc where cc = 'US';
+----+-----+------+------------+----+---------------+
| ci | svr | cnt  | rpt_date   | cc | cn            |
+----+-----+------+------------+----+---------------+
| 30 |   1 |  165 | 2011-02-20 | US | United States |
| 30 |   1 |  127 | 2011-02-21 | US | United States |
| 30 |   1 |  205 | 2011-02-22 | US | United States |
| 30 |   1 |  401 | 2011-02-23 | US | United States |
| 30 |   1 |  303 | 2011-02-24 | US | United States |
| 30 |   1 | 6915 | 2011-02-25 | US | United States |
| 30 |   1 | 3468 | 2011-02-26 | US | United States |
| 30 |   2 | 1001 | 2011-02-20 | US | United States |
| 30 |   2 | 1085 | 2011-02-21 | US | United States |
| 30 |   2 | 1482 | 2011-02-22 | US | United States |
| 30 |   2 | 1489 | 2011-02-23 | US | United States |
| 30 |   2 | 1798 | 2011-02-24 | US | United States |
| 30 |   2 | 3233 | 2011-02-25 | US | United States |
| 30 |   2 | 8517 | 2011-02-26 | US | United States |
+----+-----+------+------------+----+---------------+
14 rows in set (0.00 sec)

Okay, data looks good, now we get to have some fun with the data¹. This is where I wanted to use heat maps. This is a graphical representation showing an increase in color representing an increase in data. This is a perfect type of representation for what we’re after as it’ll show which countries have the higher levels of spam sources. I recently saw an excellent example by Colin Walker on F5 DevCentral. He pretty much did all the hard work for me, all I needed to do was plug in my data. I bashed together another script to bundle all the data up, and craft the URL to download the images. However I had some issues with this, Google’s charting API would return a 400 error, but not explain what was wrong, so I poked around some more, and finally got a URL to work, and I ended up with a heat map that looks like the one to the right.

If you look at the image, it’s not very descriptive, almost the entire map is the same color, with some slight variations. So I needed to tweak some more. After a few hours of playing, I couldn’t find a good combination of colors. After some more poking around, I stumbled across another Google visualization toolkit, which included GeoMaps, and it had some additional features that makes it more interesting, showing scale, and highlighted counters. This required calling a Javascript library to perform the visualization, but something that’s not too difficult to do either.

function createJS($country, $count, $file) {
	$html = < <<EHTML
<html>
<head>
  <script type='text/javascript' src='http://www.google.com/jsapi'></script>
  <script type='text/javascript'>
   google.load('visualization', '1', {'packages': ['geomap']});
   google.setOnLoadCallback(drawMap);

    function drawMap() {
      var data = new google.visualization.DataTable();
      data.addRows(%d);
      data.addColumn('string', 'Country');
      data.addColumn('number', 'Spam Count');
%s
      var options = {};
      options['dataMode'] = 'regions';

      var container = document.getElementById('map_canvas');
      var geomap = new google.visualization.GeoMap(container);
      geomap.draw(data, options);
  };
  </script>
</head>

<body>
  <div id='map_canvas'></div>
</body>

	
EHTML;

	$cnt = count($country);
	$str_data = < <<SD
      data.setValue(%d, %d, '%s');
SD;
	$str_value = <<<SD
      data.setValue(%d, %d, %d);
SD;

	$out_data = array();

	foreach($country as $idx=>$c_str) {
		$out_data[] = sprintf($str_data, $idx, 0, $c_str);
		$out_data[] = sprintf($str_value, $idx, 1, $count[$idx]);
	}

	$fh = fopen($file, 'w');

	fwrite($fh,sprintf($html, $cnt, implode("\n", array_values($out_data))));
	
	fclose($fh);
}

The above code is part of the script for creating chart files, which you can download here. It dynamically creates the HTML file, and the javascript required to create the chart below.

The sample screen shot of what the graph looks like is to the left, click it, and you can see the actual javascript, and the cool features of the charts. This chart, and data is for the 24th February.

As you can see by moving your mouse over the charts, China, and the USA, have the highest counts (represented by the darker color), with the USA at 2101 for both servers. Now compare the results against the data for the 25th Feburary here. The USA, and China, are still mostly darker than the rest of the world, but the shocking numbers are when you highlight Vietnam (44,000 hits), German (37,000 hits), and Saudi Arabia (43,700 hits).

While this is all pretty cool, and very nice visualization, I still don’t know why the spam host count increased so rapidly, nor do I know if it’ll stop (I’m hoping so). What I have discovered though, is some very cool visualization tools from Google, as well as playing around with GeoIP mapping.

What do you guys think? Why are we getting hammered with spam hosts? Any new viruses? How do you represent similar data?

As a side note, if you didn’t already know, I enjoy this kind of stuff. Back in 2007, I was trying to figure out if I fitted in where I worked (read it here) and I discussed data, information, and manipulation. ↩

TheGeekery

The Usual Tech Ramblings

Spam Spike, and Google Visualizations...

Comments