Count dates with AWK

Someone on the public forum asked this question:

“Say I want to find out how many dates are there in 1000 lines of characters.Also date format is given (…in my case date format is of ‘DD-MON-YYYY’).”

When I read the question I thought, what a nice exercise for regex with awk. Here is part of the answer I gave:

#######################
# Countdates.awk script
#######################
BEGIN {i=0}
{
$0 = toupper($0)
i += gsub(/(0[1-9]|[1-2][0-9]|3[01])-(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)-((19[0-9][0-9])|(20[0-9][0-9]))/,"") 
}
END {print "Number of dates (DD-MON-YYYY) in the file: ", i}
#######################
# End Countdates.awk
#######################

Sample file with three lines (dates.txt):

xxxxx01-Jan-200002-Feb-2003 xxxx13-APR-1990xxxxx
xx32-OCT-2007---xxx00-dEC-1900 xxx 1-MAY-2003xxxxx
xxx2004-AUG-07 xxxxx-08-Aug-2007- 12-Jun-1890

Now, if you run the script:

G:\>gawk -f countdates.awk dates.txt
Number of dates (DD-MON-YYYY) in the file: 4</code>

It’ll catch three dates in the first line (01-Jan-2000, 02-Feb-2003,13-APR-1990), none in second line
and one in the third line (08-Aug-2007).

Short explanation of the script:

$0 = toupper($0) ... it's needed to catch Apr, APR, aPR

gsub() functions returns number of substitutions which is added to variable i

/ ...search pattern ... /

(0[1-9]|[1-2][0-9]|3[01]) ... trying to catch only valid days 01-31 (assuming leading zero)

(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC) .... month names abbreviation in uppercase to compare against input line ($0) which we uppercased at he beggining

((19[0-9][0-9])|(20[0-9][0-9])) .... assuming only years that starts with 19xx and 20xx are valid (that's why we didn't count 12-Jun-1890)

You can take this example and fine “tune” it for the data you actually have. This script lacks true date validation (e.g. it’ll count 29-FEB-2007 as a date despite the fact that it’s really wrong), for this, perl/python etc. are better suited (first catching the date candidates with the power of regex pattern search, then validating those strings as a true dates with some date/time module.).

Advertisement

Posted on 21.08.2007, in Scripting and tagged . Bookmark the permalink. Comments Off on Count dates with AWK.

Comments are closed.