
Saturday, March 21, 2009

Quick script for extracting emails from unformatted text

Often we need to extract email addresses from unformatted text, such as text mixed with HTML tags. The following script can come in handy for extracting them into a simple text file, which can then be uploaded to the contact lists of mailman or other mailing-list software:
$ more
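# Split on commas and spaces so each token sits on its own line (\n in the
# replacement assumes GNU sed), keep tokens containing '@', strip the
# characters <>(); and the mailto: prefix, then sort and de-duplicate.
sed -e 's/,/\n/g' -e 's/ /\n/g' "$1" | \
grep '@' | \
sed -e 's/[<>();]//g' -e 's/mailto://g' \
| sort -u > "${1}.extracted.txt"

# Report how many unique entries were written.
wc -l "${1}.extracted.txt"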


$ ./ emails_unformatted.txt

131 emails_unformatted.txt.extracted.txt

I hope this is useful to others for extracting emails in a single shot; otherwise it takes a lot of time to make several passes while examining the post-processed output. Even with the heuristic rules above, the output may not consist entirely of valid email addresses, so some proofreading is still needed.
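To shorten the proofreading step, one option is to list only the lines that do not look like a bare address, for example tokens with leftover quotes or a trailing period, which the sed rules in the script do not strip. The sketch below is not part of the original script: the function name flag_suspect_lines is made up for illustration, and the pattern is only a loose shape check, not a real address validator.

```shell
# Hypothetical helper (not part of the original script): print every line of
# an extracted list that does not look like a bare email address.
flag_suspect_lines() {
    # -x: the pattern must match the whole line; -v: invert the match, so
    # only non-matching lines are printed; -E: extended regular expressions.
    # The pattern is a loose shape check, not a full address validator.
    grep -vxE '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' "$1"
}
```

After sourcing the function, it could be run on the script's output, for example as flag_suspect_lines emails_unformatted.txt.extracted.txt, and only the lines it prints would need to be checked by hand.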

If this is useful to you, please leave a comment here.
