TheGeekery

The Usual Tech Ramblings

Bash Magic to Save You Time...

How do you turn a 12 (or more) hour job into 40 minutes? Script it…

Recently, I’ve come to use bash quite a bit for solving quick problems; or, more to the point, complex problems with simple solutions. Today’s project revolved around an issue with our product.

We regularly get “book data”1 from various vendors, and import it into our system as part of our appraisal tool. That part is great; the issue comes with historical information. We never update the old values, which has led to a minor concern regarding bookout values on vehicles. When reports run, they could show inventory having greater (or lesser) value than it really has, so to update the values, we simply have to touch the vehicles.

This “touching” of a vehicle simply involves clicking on it in our application, and we automagically update the information. All fairly simple, until you come to understand that for one of our books, there are 144 customers using it. Across those 144 customers, there are over 15,000 vehicles in stock that have to be touched, which averages out to about 105 vehicles per dealership.

This is where it starts to get complicated. A vehicle page load takes approximately 3-15 seconds, so for 15,000 vehicles that comes to 12 hours, 30 minutes at the low end of that range. For a single person clicking vehicles, that’d take a good chunk of time, and probably induce RSI.
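The arithmetic checks out; at the 3-second floor it works out like this:

```shell
# 15,000 page loads at 3 seconds each, expressed as hours and minutes
total=$((15000 * 3))                # 45,000 seconds
hours=$((total / 3600))             # 12
minutes=$(( (total % 3600) / 60 ))  # 30
echo "${hours}h ${minutes}m"        # prints "12h 30m"
```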

However, it’s a repetitive task. Something computers are incredibly good at… repetitive tasks.

The Script

The first thing I needed to do was figure out the best way to handle the page loads. I’d have to be able to call them from some script, multiple times, without doing anything manually. This is where cURL comes into play. cURL is an application, and library, that allows you to transfer files over multiple protocols. It’s command-line based, which makes it easy to script, unlike a browser.

Now that I had my candidate for fetching the URLs, I needed to figure out what the URL was going to be. That was easy: view the source of a page in the application, copy the link, and look at the arguments. It came down to three values: the VIN, the mileage, and the ID for that vehicle. The real problem is that our site requires authentication. That’s not too bad at all with cURL; combining two arguments, -c and -b, solves most of it.

-c will write any cookies the web server sends you into a file, while -b will take either a file name, or a name=value pair for cookies. These arguments are handy because when you log in to the site, we send back a session cookie containing the session ID, and some other unimportant stuff. So the first step is to get the session cookie established. This is done by posting the login information to the login form. Again, cURL can handle this too.

curl -c cookie.txt -d "LoginName=someuser" -d "password=somepass" https://oursite/index.asp

This writes all received cookies into a file called cookie.txt, which can later be passed back to curl with -b cookie.txt.
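For reference, the cookie jar written by -c uses curl’s Netscape cookie file format: one tab-separated line per cookie (domain, subdomain flag, path, secure flag, expiry, name, value). A session cookie line looks roughly like this (the cookie name and value here are made up; an expiry of 0 marks a session cookie):

```
# Netscape HTTP Cookie File
oursite	FALSE	/	TRUE	0	ASPSESSIONID	ABC123DEF456
```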

With the URL format known and the session cookie saved, I have the foundation for this script. All that’s left is the data, and the loop in bash.

Getting the data ended up being relatively easy. We have a web service interface with an API we can use to fetch the data we need. This was handy here, as it meant I could skip dealing with data imports and exports. So the first step was fetching the information from SQL:

<?php

include_once('config.php');  // defines $wsdl, $url, etc.

$conn = new SoapClient($wsdl);
$res  = $conn->ExecuteSQL('userkey', 'sql statement here');

// Group the records by company ID
$data = array();

foreach ($res as $record) {
    $data[$record['cid']][] = array(
                  'vin' => $record['vin'],
                  'vid' => $record['vid'],
                  'mileage' => $record['mileage']
              );
}

// Write one URL file per company
foreach ($data as $cid => $records) {
    $fh = fopen('urls_' . $cid . '.txt', 'w');
    foreach ($records as $vehicle) {
      fwrite($fh, sprintf($url, $vehicle['vin'], $vehicle['mileage'], $vehicle['vid']) . "\n");
    }
    fclose($fh);
}

?>

This leaves us with a handful (144 in this case) of text files, each containing URLs with the VIN, mileage, and VID built in.

Next, we have to process the URLs. This is a simple case of looping through the files, like I did when processing log files. It starts with two simple loops: one to get the file, and the other to get the content.

#!/bin/sh
for file in urls_*.txt;
do
  if [ -f "${file}" ]
  then
    # Rename .txt to .prc so another copy of this script skips the file
    NEWFILE="${file%%txt}prc"
    mv "${file}" "${NEWFILE}"

    while read url
    do
      curl -b cookie.txt "${url}" -o /dev/null
    done < "${NEWFILE}"
  fi
done

The above is the total script, and there are a couple of lines in it worth explaining. Because I realized I was going to be processing 144 companies, I thought I’d probably want to thread it. As with most of my emergency tasks, I really didn’t have time to figure out threading in a bash script. I do know you can append & to the end of any statement and it’ll run in the background, meaning the rest of the script keeps running. However, that isn’t entirely what I want: it’s uncontrolled, and I’d have no idea which files I’m working on, or how far along I am. So I cheated and built a simple test into the code.

if [ -f "${file}" ]
  then
    NEWFILE="${file%%txt}prc"
    mv "${file}" "${NEWFILE}"

  fi

This section of code checks that the file in the loop still exists, then renames it (.txt becomes .prc). I do this so I can cheat on the threading model: I can run the script multiple times concurrently, and no two copies will process the same file. This gives the effect of threading, without really threading.
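The launch side can be sketched like this. The worker function here is a stand-in for run.sh (and the thread count is an assumption), just to keep the example self-contained:

```shell
#!/bin/sh
# Poor-man's threading: start N copies of a job in the background,
# then wait for all of them. In the article the job is ./run.sh;
# here a stand-in function keeps the sketch self-contained.
worker() {
  sleep 1                       # stands in for the curl loop in run.sh
  echo "worker $1 finished"
}

THREADS=6
i=1
while [ "$i" -le "$THREADS" ]; do
  worker "$i" &                 # & backgrounds each copy
  i=$((i + 1))
done
wait                            # block until every background copy is done
```

Each backgrounded copy races to rename a .txt file before working on it, so the rename acts as a crude lock between the copies.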

The next part of the code is the curl execution:

curl -b cookie.txt ${url} -o /dev/null

This makes curl load cookies from the cookie.txt file (explained earlier), fetch the URL taken from the file, and send the output to /dev/null.

The Results

I wrote the above script into a file called run.sh so I could call it multiple times. Taking advantage of the time utility, I was able to watch the timings accurately:

#> time ./run.sh

  real    0m0.010s
  user    0m0.000s
  sys     0m0.008s

Calling the run.sh script 6 times, I got the following results:

  • Total Execution time for all scripts was 3 hours, 51 minutes.
  • Average execution time per thread was 38 minutes.
  • 15,197 vehicles across 144 companies.
  • Average of 105 vehicles per company.
  • Average execution per vehicle was 1 second.
  • Average execution per company was 1 minute, 37 seconds.
  • SQL server load remained at a nice 25%.

A very quick script, saving hours of work from several people.


  1. Data from third party vendors, containing recommended values for vehicles
