Friday, January 16, 2009

Email sanitizer-extractor in Lua

Yesterday a friend asked me to write a little script that reads a file and writes every email address it finds to another file, discarding any duplicates. After I wrote the program he told me that he had searched the Internet and couldn't find anything similar, so I should upload it somewhere just in case someone else needs it.

To make it more interesting I made it a single-call program: everything is defined inside the arguments of a single call. Strictly speaking there are two calls: the first returns an object, and a method of that object is called immediately. Anyway, here is the code:

io.open("output.txt","w"):write((string.gsub(" "..io.open(((arg or {})[1] or "input.txt")):read("*a").." ",".-([%w%.%-_]+@[%w%.%-_]+).-",function (email) email=string.lower(email) print("EMAIL: '"..email.."'") emails=emails or {} for index,emailseen in ipairs(emails) do if emailseen==email then return "" end end table.insert(emails,email) return email.."\n" end)))

Now let's break that up and add some comments, shall we?

io.open("output.txt","w"):write( --opens a file in write mode and returns the file handle, which instead of being stored in a variable is used immediately by calling its write method.

(string.gsub( --string.gsub will eventually return two values: the email list (one per line, lowercase, without duplicates) and the number of replacements it made. The first value is our output. I can discard the second by wrapping the function call, and therefore its list of returned values, in parentheses. In Lua, select(1,5,"kostas","klapatsimpala") returns all three values, but print((select(1,5,"kostas","klapatsimpala"))) will just print 5.
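The effect of the extra parentheses can be checked with a tiny experiment. This is only a sketch; the variable names and the "aaa" sample are made up for illustration.

```lua
-- string.gsub returns two values: the new string and the replacement count.
local text, count = string.gsub("aaa", "a", "b")

-- Without extra parentheses both values end up in the table...
local both = { string.gsub("aaa", "a", "b") }
-- ...while wrapping the call in parentheses keeps only the first one.
local only_first = { (string.gsub("aaa", "a", "b")) }

print(text, count)
print(#both, #only_first)
```

Here `text` is "bbb" and `count` is 3, and the two tables hold two values and one value respectively.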

" "..io.open(((arg or {})[1] or "input.txt")):read("*a").." " --This is the first argument to string.gsub. As before, we open a file (this time in the default read mode) and immediately use the returned handle to read the whole file. The tricky part is "(arg or {})[1] or "input.txt"". When the standalone lua interpreter runs a script it stores the command-line arguments in the global arg table. If that table exists, "arg or {}" evaluates to arg (when the left side of an "or" is a true value, "or" simply results in that) and "(arg)[1]" returns the first argument, which is a custom input filename. That filename "ORed" with "input.txt" simply gives the filename back (any string is a true value in Lua, so "or" evaluates to its left operand). If you didn't call the script with any arguments then arg[1] is nil, so the expression results in "input.txt" (when "or" finds a false or nil value on its left it simply returns the value on its right). If the arg table does not exist at all, for example when the code is run by a host that does not define it, "(arg or {})" results in a newly created empty table. Of course if you index its first cell you'll find nothing, so "({})[1] or "input.txt"" again results in "input.txt". Finally I add two space characters: one at the start of the data read and one at the end. I added them so that the pattern would also apply to emails exactly at the beginning or exactly at the end of the data.

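The "(arg or {})[1] or "input.txt"" idiom can be tried in isolation. In this sketch `pick_input` is a hypothetical helper, and its `args` parameter stands in for the global arg table so that all three cases can be exercised from one script.

```lua
-- pick_input is a made-up name; args plays the role of the global arg table.
local function pick_input(args)
  return (args or {})[1] or "input.txt"
end

print(pick_input(nil))             -- no table at all: falls back to the default
print(pick_input({}))              -- table exists but has no first argument
print(pick_input({ "mail.txt" }))  -- a custom filename was given
```

The first two calls fall back to "input.txt"; the third returns the filename it was given.
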
,".-([%w%.%-_]+@[%w%.%-_]+).-" --The second argument is the pattern. I am breaking up the whole text in the following way: any number of any characters (as few as possible), followed by one or more email-allowed characters (as many as possible), followed by @, followed by one or more email-allowed characters (as many as possible), followed by any number of any characters (as few as possible, which at the very end of a pattern means none). The "email-allowed characters" are alphanumerics, dot, dash and underscore. From this pattern I want to capture just the email part.
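A quick way to see what the pattern matches is to run it over a short string. This is just a sketch: the sample text and the bracketed replacement "[%1]" are assumptions for illustration, not part of the script.

```lua
local pattern = ".-([%w%.%-_]+@[%w%.%-_]+).-"
local sample = " hi a@b.com and c@d.org bye "  -- made-up sample text

-- Replace every match with its capture wrapped in brackets.
local result, count = string.gsub(sample, pattern, "[%1]")

print(result, count)
-- result is "[a@b.com][c@d.org] bye " and count is 2
```

On this sample the text after the last email is left in place, because gsub only replaces what the pattern matches. It may be worth checking real input for the same effect.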

,function (email) --Now this is the best part. An anonymous function. It is created without being stored in a variable (which would give it a name) and immediately used as an argument to string.gsub. This function accepts a single argument: email. string.gsub will call it for every match with the capture (the email) as an argument.
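For readers who have not seen a function used as the replacement argument, here is a minimal sketch on unrelated data. The words and the angle brackets are made up for illustration.

```lua
-- The anonymous function is called once per match. Whatever string it
-- returns replaces that match; returning "" removes the match.
local result = string.gsub("eggs spam ham", "%a+", function (word)
  if word == "spam" then return "" end
  return "<" .. word .. ">"
end)

print(result)
```

The word "spam" disappears, while the spaces on either side of it stay because they were never part of a match.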

email=string.lower(email) --First of all we turn the email to lowercase

print("EMAIL: '"..email.."'") --Debugging message...

emails=emails or {} --Remember what we said: if the left operand is true (not false and not nil) then it is returned, so if the emails variable has already been defined nothing changes, because emails=emails is executed. If the emails variable is not defined (is nil), "or" returns its right operand, so emails={} is executed and emails is initialized as an empty table.

for index,emailseen in ipairs(emails) do --For every already seen email do:

if emailseen==email then return "" end --If this already seen email is the same as the new capture then just return "" so that the whole match is replaced by nothing. Remember that although the capture is just the email, the match also includes the characters preceding it.

end --end for.

table.insert(emails,email) --If we managed to get here then this email capture is seen for the first time. We insert it in the emails table.

return email.."\n" --and finally we return the email capture (lowercase) followed by a newline. This will replace the whole match.

end --end of the anonymous function
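The duplicate check can also be tried on its own, outside string.gsub. In this sketch `first_time` is a hypothetical name, and `emails` is a local variable shared by the function rather than the global used in the post.

```lua
local emails  -- starts as nil, like the undefined variable in the post

-- Returns true the first time an email is seen, false on repeats.
local function first_time(email)
  emails = emails or {}                  -- create the table on the first call
  for _, emailseen in ipairs(emails) do
    if emailseen == email then return false end
  end
  table.insert(emails, email)
  return true
end

print(first_time("a@b.com"))  -- first sighting
print(first_time("a@b.com"))  -- duplicate
print(first_time("c@d.org"))  -- first sighting
```

The table survives between calls because it lives outside the function, which is the same reason the post's version works across matches.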

) --closing of string.gsub

) --that's the closing of the extra parentheses around string.gsub (to discard the second returned value)

) --closing of write.
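For comparison, the same job can be written as several statements. This is a different technique from the one above, not a rewrite of it: it uses string.gmatch to visit only the emails, and a lookup table instead of a linear scan. `extract_emails` is a hypothetical name, and it works on a string so that it can be tried without any files.

```lua
-- A multi-statement sketch, assuming the same "email-allowed" characters.
local function extract_emails(text)
  local seen, lines = {}, {}
  for email in string.gmatch(text, "[%w%.%-_]+@[%w%.%-_]+") do
    email = string.lower(email)
    if not seen[email] then   -- table lookup instead of a loop
      seen[email] = true
      lines[#lines + 1] = email
    end
  end
  return table.concat(lines, "\n")
end

local sample = "Write to Bob@Example.com or bob@example.com, then cc alice@example.com today"
print(extract_emails(sample))
```

Because gmatch only yields the matches, nothing but the emails reaches the output. Reading the input file and writing the result would wrap around this function in the same way as in the one-liner.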

That's all. It seems to be working, but I haven't done any extensive testing.
