While working on a file cleanup project today, I had to work with a text file containing close to 2 million lines, and extract the file names that contained a specific string. I figured I could do it relatively quickly in PowerShell, but also realized it’d be a good opportunity to flex my Bash skills. To say the results were interesting is an understatement (to me at least), and I picked up some tuning tricks while tinkering that made some massive improvements.
So let’s set the ground rules for the battle:
- The script has to parse the file with ~2 million records
- It has to output a file containing the file names to remove (all files with _02_ in them)
- It has to output SQL statements to a file for cleaning a temporary table.
- The quickest processor wins.
An example of the data in the file:
Vehicle_ID,URL,DateDeleted,PhotoDeletedID
58331549,\3860\1111111174WC448544WC44854279176192.jpg,2012-03-03 11:37:35.080000000,224516659
58331549,\3860\1111111174WC448544WC44854279176194.jpg,2012-03-03 11:37:38.420000000,224516660
58331995,\2635\1111111199E1648779E164877279175843.jpg,2012-03-03 11:41:35.510000000,224516661
58050027,\5524\56840113_02_222222225WF265239_9.jpg,2012-03-03 12:42:41.537000000,224516931
So the basic idea is to loop through the file, and find all lines that have _02_ in them. I then want to extract that file name (URL column) and the PhotoDeletedID.
Before I started scripting this in PowerShell, I realized that using Get-Content was completely out of the question, because Get-Content reads the entire file into memory and creates objects for each line before the rest of the script continues. Import-CSV would run into the same issue. Likewise, I couldn’t just cat the entire file in Bash, as it’d do the same kind of thing. This meant reading the file line by line.
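For context, the memory-hungry version I was avoiding would have looked something like this (just a sketch to illustrate the approach being ruled out; I never actually ran or timed it):
# Sketch only: Get-Content pulls all ~2 million lines into memory as
# string objects before any filtering happens.
$lines = Get-Content "C:\Temp\photos\deleted_imgs.csv"
foreach ($line in $lines) {
    if ($line -like '*_02_*') {
        $data = $line.Split(',')
        Add-Content -Path "C:\Temp\photos\ps_file.txt" ("E:\Photos" + $data[1])
        Add-Content -Path "C:\Temp\photos\ps_sql.txt" ("delete from photo_deleted where PhotoDeletedID = {0}" -f $data[3])
    }
}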
So here are my initial PowerShell and Bash scripts…
$ck = [System.Diagnostics.Stopwatch]::StartNew()
$base = "E:\Photos"
$file = "C:\Temp\photos\deleted_imgs.csv"
$sql = "delete from photo_deleted where PhotoDeletedID = {0}"
$p_file = "C:\Temp\photos\ps_file.txt"
$p_sql = "C:\Temp\photos\ps_sql.txt"
$fh = [System.IO.File]::OpenText($file)
try {
    for (;;)
    {
        # read until we run out of lines
        $line = $fh.ReadLine()
        if ($line -eq $null) { break }
        # only process rows whose file name contains _02_
        if ($line -like '*_02_*') {
            $data = $line.Split(',')
            $file = $base + $data[1]
            $rem = $sql -f $data[3]
            Add-Content -Path $p_file $file
            Add-Content -Path $p_sql $rem
        }
    }
} finally {
    $fh.Close()
}
$ck.Stop()
$ck.Elapsed
I’m using the .NET [System.IO.File] class to open the file and loop through it one line at a time. For each line, I check for _02_ with the -like operator, split the line on commas, and write the file name and the SQL statement out to the two output files with Add-Content.
I used the following in bash…
#!/bin/bash
while read -r line
do
    if [[ "${line}" == *_02_* ]]
    then
        ID=`echo "${line}" | cut -d, -f4`
        URL=`echo "${line}" | cut -d, -f2`
        echo "delete from photo_deleted where PhotoDeletedID = ${ID}" >> bash_sql.txt
        echo "E:\images${URL}" >> bash_files.txt
    fi
done < "deleted_imgs.csv"
This does the same kind of thing: it reads the file line by line in a while loop, checks each line for _02_, uses the cut command to pull out the fields it needs, and appends them to the two output files.
I then did a head-to-head execution, kicking both scripts off before I headed to lunch. I was gone for about an hour, and on my return the PowerShell script was still running while the Bash script had just finished. The results of the Bash script:
$ time sh ./cleanup.sh
real 60m12.949s
user 8m55.905s
sys 6m22.320s
I gave the PowerShell script another 45 minutes, and it still hadn’t finished. In fact, comparing the output file sizes showed that the PowerShell script was well behind the Bash script. This shocked me; I was expecting better performance from PowerShell than this paltry showing. I stopped the script and started looking for ways to improve the speed.
One of the things I pointed out at the beginning (from experience) is that I just couldn’t use Get-Content or Import-CSV because of the memory required to load the nearly 2 million rows from the file. I wondered if a similar issue existed on the output side. Add-Content opens the file, seeks to the end of it, writes the new line, and closes it again… roughly 2 million times across the two output files. That is a very expensive way to write, so that part of the script needed tweaking. Falling back to some .NET, I decided to use [System.IO.File] again and create file handles to write to.
This is the modified PowerShell script…
$ck = [System.Diagnostics.Stopwatch]::StartNew()
$base = "E:\Photos"
$file = "C:\Temp\photos\deleted_imgs.csv"
$sql = "delete from photo_deleted where PhotoDeletedID = {0}"
$p_file = "C:\Temp\photos\ps_file.txt"
$p_sql = "C:\Temp\photos\ps_sql.txt"
$pf = [System.IO.File]::CreateText($p_file)
$ps = [System.IO.File]::CreateText($p_sql)
$fh = [System.IO.File]::OpenText($file)
try {
    for (;;)
    {
        $line = $fh.ReadLine()
        if ($line -eq $null) { break }
        if ($line -like '*_02_*') {
            $data = $line.Split(',')
            $file = $base + $data[1]
            $rem = $sql -f $data[3]
            $pf.WriteLine($file)
            $ps.WriteLine($rem)
        }
    }
} finally {
    $fh.Close()
}
$pf.Close()
$ps.Close()
$ck.Stop()
$ck.Elapsed
The changes I made were really simple: I open the two output files in write mode up front with [System.IO.File]::CreateText, swap the Add-Content calls for WriteLine on those open handles, and close the handles once the loop is done. I then re-executed the script…
Days : 0
Hours : 0
Minutes : 1
Seconds : 59
Milliseconds : 742
Ticks : 1197426621
TotalDays : 0.00138591044097222
TotalHours : 0.0332618505833333
TotalMinutes : 1.995711035
TotalSeconds : 119.7426621
TotalMilliseconds : 119742.6621
When I saw the results I practically fell off my chair; I couldn’t believe the numbers. To compare, I checked the output line counts against the Bash run…
$ wc -l ps_*
1018490 ps_file.txt
1018490 ps_sql.txt
$ wc -l bash_*
1018490 bash_files.txt
1018490 bash_sql.txt
Wow! The file sizes were the same, the row counts were the same, and the execution time was about 1/30th of the Bash run. Could we say PowerShell wins this? I think so!
But then I stepped back, looked at the Bash script, and wondered if it too could be optimized. Maybe it was having the same open/seek/write/close issue? How much of an impact was the echo/cut pipeline having on execution time? Was there a better way to split strings?
I did some hunting around and stumbled across 2 resources that gave me some ideas. The first was on string splitting. I’d completely forgotten about $IFS and how Bash handles strings, and this little post reminded me. $IFS is an internal variable in Bash that tells the shell which characters to treat as field delimiters. By default it’s whitespace (space, tab, and newline), but it can be changed so that Bash splits on a different delimiter instead. This is a really quick change.
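As a quick illustration (a sketch using one of the sample rows from earlier, not part of the actual script), setting $IFS to a comma is all it takes to get Bash to split a line into an array:
# sketch: split a single sample row on commas via $IFS
line='58050027,\5524\56840113_02_222222225WF265239_9.jpg,2012-03-03 12:42:41.537000000,224516931'
OIFS=${IFS}
IFS=","
DATA=(${line})      # word splitting now happens on commas only
IFS=${OIFS}
echo "${DATA[1]}"   # the URL column
echo "${DATA[3]}"   # the PhotoDeletedID column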
The second was file handles. This was a little harder to track down, but I stumbled across this gem. I’d never used file descriptors before, but the post pointed me in the right direction. So the tweaked script now looks like this:
#!/bin/bash
exec 5<> "bash_sql2.txt"
exec 6<> "bash_files2.txt"
OIFS=${IFS}
IFS=","
while read -r line
do
    # echo "${line}"
    if [[ "${line}" == *_02_* ]]
    then
        DATA=(${line})
        ID="${DATA[3]}"
        URL="${DATA[1]}"
        # ID=`echo "${line}" | cut -d, -f4`
        # URL=`echo "${line}" | cut -d, -f2`
        echo "delete from photo_deleted where PhotoDeletedID = ${ID}" >&5
        echo "E:\Images${URL}" >&6
    fi
done < "deleted_imgs.csv"
IFS=${OIFS}
exec 5>&-
exec 6>&-
I saved a copy of the existing $IFS value and changed it to a comma, and opened two new file descriptors (5 and 6) pointing at the files I want to write to. Inside the loop, the power of $IFS lets Bash split the line straight into an array, so I can grab the ID and URL by index. The echo statements now write to the open file descriptors, which are closed after the loop finishes, and $IFS is restored to its original value.
So what are the results?
$ time sh ./cleanup.sh
real 5m8.391s
user 4m42.090s
sys 0m18.369s
Another wow! Execution is down to about 8% of the original time, another massive improvement. Just to make sure I had the same results, I double-checked the numbers:
$ wc -l bash_*.txt
1018490 bash_files2.txt
1018490 bash_files.txt
1018490 bash_sql2.txt
1018490 bash_sql.txt
Same results, in about 8% of the execution time. I’d say those changes were a good improvement.
Conclusion
So who do we call the winner? Well, something I failed to disclose at the start was the platforms. Initially my plan was to use Cygwin for the Linux side of this. When I attempted to run the initial Linux script under Cygwin, execution was so slow I could have gone home, had dinner, come back, and it still wouldn’t have been done.
The initial PowerShell script had gotten through about 10x as much of the file in half the time the Cygwin execution had taken. This meant I needed to run the Linux script elsewhere, so I fell back to my personal Linux server. For comparison, my laptop (PowerShell) is running an Intel Core2 Duo T7520 (2x 2GHz) with 4GB RAM, while my server is running an older Intel Core2 6300 (2x 1.86GHz) with 1GB RAM.
So there is a bit of a difference. I’ll leave it up to you guys to decide who should be the winner, but I will probably call it a tie; the differences in hardware alone could account for one script executing faster than the other.
I don’t often spend a lot of time tinkering with the scripts I throw together, but this was a fun exercise in performance tweaking, and it shows how simple one- to five-line changes can make a substantial improvement to execution time. Next time you throw together a script, see what performance tweaks you can find.