Saturday, March 28, 2009

How to scrape emails for online marketing? -- for free

Prologue - Is harmless marketing always Spam?


Ok, so I had started a website and wanted to promote it. At some point, you might be in the same boat too.

Unfortunately, I did not have a whole lot of contacts to send emails to. Well, the website was not really spammy (it was actually a community website for open source audio resources), and I was not going to solicit money, so I did not have a lot of compunctions about telling people about it. If they wanted to listen to some audio, they could stick around; otherwise they could do whatever else they pleased.

I knew some owners of Yahoo groups that had large audiences. I thought maybe it was worth a shot explaining the website to them and asking if they might be willing to share member emails for targeted promotion. The results were quite disappointing, and somewhat expected.
After talking to a few of them, my general observation was that most group owners think that as long as they don't give out member emails to others, those emails will always be protected from spam. They tend to think of other groups/email distributions as spam! (as if those people WON'T get any other SPAM in the future at all, huh?)

Whatever, I thought. As long as my intentions were pure, I did not need to have second thoughts.

So what are my options?


Hmph! What a Bummer. How long could I depend on google to make the website more popular? It was basically a deadlock.

So one fine day, I started thinking about how ELSE I could approach this problem. For example, could I not use a unix utility like wget or perl to simulate web clicks or use some robot utility to do clicks, web navigation and download information?

Then another thought was -- could I not get email ids from specific Yahoo groups or Google groups that I was subscribed to, and whose target audience might be interested in knowing about the website? After further digging on Google, I found that there IS a Perl module called WWW::Yahoo::Groups that can programmatically fetch group messages to a local drive.

WWW::Yahoo::Groups -- Easier said than done..

I tried installing WWW::Yahoo::Groups on a unix box I had access to, but it was easier said than done. It had a ton of dependent modules that wouldn't install properly (even with the force option) -- I specifically had a lot of trouble with the Crypt::SSLeay module.

So what next?

This failure forced me to look for other options. And boy, there are quite a few of them available. fetchyahoo is one, then you have yahoo2mbox.
An interesting point about yahoo2mbox is that there are a couple of places where you can find it: TT-Solutions has its own version, and there is also a Debian version. I had problems with the Debian version on Ubuntu and tried the one from TT-Solutions, which worked better.

So while fetchyahoo can fetch the mail of a single Yahoo user, yahoo2mbox can do the same for a Yahoo group.

Fine, I thought, let's go for the jugular and get yahoo2mbox to work. The results were inconsistent: you can see the email ids in the downloaded messages only if you are a moderator of the group, or if the group has really loose security settings that do not mask email ids. Also, you get weird messages/errors if the group home page has a photo and text (I got some tag-related errors with the Yahoo group FHRS_USA).

Email masking on Yahoogroups and Google groups..

Anyway, I realized that both yahoogroups and google groups had elaborate email masking on. True, without it, the online world would NOT be a safe place to be in. Email harvesters would retrieve emails left, right and center and make big bucks.

This is an example from yahoo groups content fetched through yahoo2mbox:


From Sat Aug 05 14:41:35 2000
Return-Path: <>
Received: (qmail 6062 invoked from network); 5 Aug 2000 21:41:35 -0000



This is an example of detailed header fetched using a similar utility called
Received: by with SMTP id e11mr6036070agb.27.1237961013457;
Tue, 24 Mar 2009 23:03:33 -0700 (PDT)
Return-Path: <>
Received: from ( [])
by with ESMTP id 15si1362316gxk.4.2009.;
Tue, 24 Mar 2009 23:03:32 -0700 (PDT)

Again, what are my options?

So a couple days into the quest and no clear solution yet. Water, water everywhere and not a drop to drink. What an irony.

With these thoughts in mind, I was looking at individual emails (of the Yahoo group that I was interested in) in my Yahoo! inbox. Quite absent-mindedly, I opened the source view of one of the messages and realized that it did not have email masking turned on. In fact, I realized, it could NOT be on, because what would the reply-to address be then? In other words, the From email id HAD to be available in the mail header of a Yahoo group email in my inbox.



So then, the solution was plain and simple. I would fetch the individual email messages for a Yahoo group and then extract the email ids from them. Theoretically, it should work. In practice too, it did. Here is an example:
From Thu Feb 10 16:59:16 2005
Return-Path: <>
Received: (qmail 45652 invoked from network); 11 Feb 2005 00:59:15 -0000

So this meant that one would just need to be on the email distribution of the particular Yahoo group that one is interested in. It is easier to extract emails if all mail from that group is filtered into a specific folder (then you can use the --folder option of fetchyahoo).

The quirks of making 'fetchyahoo' work..

For using Perl, there are a couple of options available, but the best option, in my humble opinion, seems to be ActivePerl on Windows. The "other" options on Windows are Cygwin, or VirtualBox to simulate a unix environment.

If you want to use Virtualbox, the easiest option is to install Ubuntu. You can find a lot of Virtualbox How to articles here. Another option is Look at

While Cygwin mostly works for simple unix stuff, I found that most of the perl package dependencies do not work out and you end up getting frustrated (I found that Crypt::SSLeay module had problems).

Installing ActivePerl and required modules for fetchyahoo

Once the ActivePerl .msi is installed, the perl executable is automatically added to the Windows PATH:
c:\emails\fetchyahoo-2.13.3>perl -version

This is perl, v5.10.0 built for MSWin32-x86-multi-thread
(with 5 registered patches, see perl -V for more detail)

Also, we will assume that the latest version of fetchyahoo is downloaded and extracted to c:\emails\fetchyahoo-2.13.3 folder:
Volume in drive C has no label.
Volume Serial Number is DA07-A231

Directory of c:\emails\fetchyahoo-2.13.3

03/27/2009 09:48 PM <DIR> .
03/27/2009 09:48 PM <DIR> ..
03/09/2009 12:09 PM 15,182 ChangeLog
03/09/2009 12:09 PM 17,992 COPYING
03/09/2009 12:09 PM 2,747 Credits
03/09/2009 12:09 PM 107,289 fetchyahoo
03/09/2009 12:09 PM 5,359 fetchyahoo.1
03/09/2009 12:09 PM 2,287 fetchyahoo.spec
03/09/2009 12:09 PM 4,907 fetchyahoorc
03/09/2009 12:09 PM 6,314 index.html
03/09/2009 12:09 PM 19,380 INSTALL
03/09/2009 12:09 PM 966 TODO
10 File(s) 182,423 bytes
2 Dir(s) 147,889,139,712 bytes free

You should now try to run it. It gave me this error initially:
c:\emails\fetchyahoo-2.13.3>perl fetchyahoo
Can't locate MIME/ in @INC (@INC contains: C:/Perl/site/lib C:/Perl/lib .
) at fetchyahoo line 59.
BEGIN failed--compilation aborted at fetchyahoo line 59.

Basically, it needs the MIME::Head module installed. For this, we will use the CPAN module. You can also use the CPANPLUS module, which is more advanced and can un-install Perl modules too. The CPAN module cannot un-install modules.
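A small aside on reading that kind of error: Perl names the missing module as a file path, and the name to hand to CPAN is that path with `/` swapped for `::` and the `.pm` suffix dropped. Here is a minimal Python sketch of the mapping. The function name `module_from_error` and the sample message are hypothetical, made up for illustration; nothing here comes from fetchyahoo or CPAN.

```python
import re

# Perl's "Can't locate Foo/Bar.pm in @INC" message names the missing file.
_CANT_LOCATE = re.compile(r"Can't locate (\S+)\.pm in @INC")


def module_from_error(message):
    """Return the module name (e.g. 'Foo::Bar') named in a Perl
    "Can't locate ..." error, or None if the message has another shape."""
    match = _CANT_LOCATE.search(message)
    if match is None:
        return None
    # Path separators in the file name correspond to '::' in the module name.
    return match.group(1).replace("/", "::")


if __name__ == "__main__":
    # Hypothetical sample message, not copied from a real session.
    sample = "Can't locate Foo/Bar.pm in @INC (@INC contains: C:/Perl/lib .)"
    print(module_from_error(sample))
```

The returned name is what you would type after `install` at the cpan> prompt.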

This is how you invoke CPAN (you can also just type cpan at the prompt). Note that the message about overwriting the lockfile may appear if a previous session did not terminate properly:
c:\emails\fetchyahoo-2.13.3>perl -MCPAN -e shell

There seems to be running another CPAN process (pid 5220). Contacting...
Other job not responding. Shall I overwrite the lockfile 'C:\Perl\cpan\.lock'? (
Y/n) [y]

cpan shell -- CPAN exploration and modules installation (v1.9205)
ReadLine support enabled

cpan> install MIME::Head

Going to read C:\Perl\cpan\Metadata
Database was generated on Thu, 26 Mar 2009 10:26:54 GMT
Running install for module 'MIME::Head'
Running make for D/DO/DONEILL/MIME-tools-5.427.tar.gz
Fetching with LWP:
Fetching with LWP:
Checksum for C:\Perl\cpan\sources\authors\id\D\DO\DONEILL\MIME-tools-5.427.tar.g
z ok
Scanning cache C:\Perl/cpan/build for sizes
.. Going to build D/DO/DONEILL/MIME-tools-5.427.tar.gz

*** Module::AutoInstall version 1.03
*** Checking for Perl dependencies...
[Core Features]
- Test::More ...loaded. (0.72)
- Mail::Header ...missing. (would need 1.01)
- Mail::Internet ...missing. (would need 1.0203)
- Mail::Field ...missing. (would need 1.05)
- MIME::Base64 ...loaded. (3.07_01 >= 2.2)
- IO::File ...loaded. (1.14 >= 1.13)
- IO::Handle ...loaded. (1.27)
- IO::Stringy ...missing. (would need 2.11)
- File::Spec ...loaded. (3.2501 >= 0.6)
- File::Path ...loaded. (2.04 >= 1)
- File::Temp ...loaded. (0.18 >= 0.18)
==> Auto-install the 4 mandatory module(s) from CPAN? [y]
Appending installation info to C:\Perl\lib/perllocal.pod
nmake install -- OK
Running install for module 'Test::Pod'
Running make for P/PE/PETDANCE/Test-Pod-1.26.tar.gz
Fetching with LWP:
Fetching with LWP:
Checksum for C:\Perl\cpan\sources\authors\id\P\PE\PETDANCE\Test-Pod-1.26.tar.gz
All tests successful.
Files=8, Tests=127, 2 wallclock secs ( 0.00 cusr + 0.00 csys = 0.00 CPU)
nmake test -- OK
Running make install
Prepending C:\Perl\cpan\build\MailTools-2.04-fCc9N7/blib/arch C:\Perl\cpan\build
\MailTools-2.04-fCc9N7/blib/lib to PERL5LIB for 'install'
Module 'MIME::Head' installed successfully

No errors installing all modules

Interestingly, you can also check this using the Perl Package Manager GUI utility (invoked by typing ppm at the Windows command prompt). It just takes a lot of time to load the GUI's data.


Now, let us check if it works:
c:\emails\fetchyahoo-2.13.3>perl fetchyahoo --nodownload
No username specified.
Please enter your Yahoo! username: pooja33pandey
Please enter your Yahoo! password:
No mailbox or mailspool specified.
Please enter the path to and name of your mail spool or mailbox (eg /var/spool/m
ail/username): pooja33pandey.mbox
Logging in securely via SSL as poojagverma on Fri Mar 27 22:29:50 2009
Failed: Invalid ID or password entered (username: pooja33pandey )

If you are running Vista, you might see this infamous pop-up window, which you will need to unblock:

All right! So it works (the script runs and talks to Yahoo; the failure above is only a bad login). Now, let's try a more comprehensive example:
c:\emails\fetchyahoo-2.13.3>perl fetchyahoo --onlylistmessages --username=pooja1
3pandey --password=<pwd> --spoolfile=pooja.mbox --logout
Use of uninitialized value $ENV{"HOME"} in concatenation (.) or string at fetchy
ahoo line 1992.
Logging in securely via SSL as pooja13pandey on Fri Mar 27 22:45:49 2009
Country Code 'in' not found. We will try the translation for 'us'.
Country code : in FetchYahoo! Version: 2.13.3
Successfully logged in as pooja13pandey.
Marking messages read on the server

Fetching mail from folder: Inbox
Getting Message ID(s) for message(s) 1 - 25.
1. new "Public Records " - Locate anyone. Search public records. 7:46 AM 6KB
2. new "Pooja Pandey <p" - online skype number (678) 534-2725 2:47 AM 3KB
3. old "Nimesh Bhuva <b" - Re: [GHPCSB_MCA_2k] Re: Happy Holi 27/3/09 35KB
4. old "Nimesh Bhuva <b" - Re: [GHPCSB_MCA_2k] Happy Holi 27/3/09 32KB
5. old "Birthday Remind" - First Reminder for Vibha Deshmukh's Birthday 26/3/09
6. old "Sharma, Ashish " - RE: [LIKELY JUNK]RE: [LIKELY JUNK]Re: So it 25/3/09 6
Got 90 Message IDs
Not downloading messages
Messages have not been deleted.
Logged out.

Note that fetchyahoo limits the messages fetched to 90 by default, because Yahoo sets a download limit of 65 MB per hour per user per IP address. You can use the --safedownload option to leave a gap of 5-10 seconds between message fetches. This way, you can run a single command for a long time without hitting that limit.
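To put rough numbers on that limit, here is a back-of-the-envelope sketch in Python. It assumes the 65 MB figure means 65 * 1024 * 1024 bytes and ignores the time each transfer itself takes, so treat the results as upper bounds. The function names are my own, not part of fetchyahoo.

```python
# Assumption: "65 MB" is read as 65 * 1024 * 1024 bytes.
HOURLY_CAP_BYTES = 65 * 1024 * 1024


def messages_per_hour(delay_seconds):
    """Upper bound on fetches per hour when pausing delay_seconds between
    messages (transfer time is ignored, so real counts will be lower)."""
    if delay_seconds <= 0:
        raise ValueError("delay_seconds must be positive")
    # At least one message can always be fetched within the hour.
    return max(1, int(3600 // delay_seconds))


def max_average_message_bytes(delay_seconds, cap_bytes=HOURLY_CAP_BYTES):
    """Largest average message size that keeps an hour of fetching under the cap."""
    return cap_bytes / messages_per_hour(delay_seconds)


if __name__ == "__main__":
    for delay in (5, 10):
        count = messages_per_hour(delay)
        size_kb = max_average_message_bytes(delay) / 1024
        print(f"{delay}s gap: up to {count} messages/hour, "
              f"average size must stay under {size_kb:.0f} KB")
```

In other words, the longer the gap, the larger the average message can be before an hour of fetching would reach the cap.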

Note that once you download the messages locally, they will be marked as read. If you want to stop the download partway, you can do so and resume it later with the --newonly and --msgidarchivefile options. By default, the messages are appended to the archive/spool file:
D:\emails\fetchyahoo-2.13.3>perl fetchyahoo --folder=<foldername> ^
--username=<username> --password=<password> ^
--safedownload --spoolfile=<foldername>.mbox ^
--msgidarchivefile=<foldername>_msgids --newonly
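The idea behind resuming from a message-id archive can be sketched in a few lines of Python. This is NOT fetchyahoo's actual implementation, just my illustration of the general technique under a simple assumption: the archive is a text file with one message id per line. The names `load_seen_ids`, `fetch_new` and the `download` callback are hypothetical.

```python
import os


def load_seen_ids(archive_path):
    """Read previously recorded message ids, one per line.
    A missing archive file simply means nothing was fetched yet."""
    if not os.path.exists(archive_path):
        return set()
    with open(archive_path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def fetch_new(message_ids, archive_path, download):
    """Call download(msg_id) for every id not yet in the archive file.

    Each id is appended to the archive right after its download returns,
    so an interrupted run can pick up where it stopped.
    """
    seen = load_seen_ids(archive_path)
    fetched = []
    with open(archive_path, "a", encoding="utf-8") as archive:
        for msg_id in message_ids:
            if msg_id in seen:
                continue  # already downloaded in an earlier run
            download(msg_id)  # hypothetical callback doing the real work
            archive.write(msg_id + "\n")
            archive.flush()  # persist progress before moving on
            seen.add(msg_id)
            fetched.append(msg_id)
    return fetched
```

Running `fetch_new` twice over the same list of ids downloads each message only once; the second run skips everything already recorded in the archive.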

Conclusion: The strategy in a nutshell

So there you have it. A simple mechanism to get targeted email ids for making your online marketing campaign successful:

1) Identify the Yahoo Group that you are interested in. This is a strategic decision. You want to limit your focus to people who would be interested in your idea. The demographics are important for a high return on investment.

2) Become a member of the group and subscribe to individual emails.

3) Set up a filter to direct all Group emails to a specific folder. A free Yahoo account allows for 100 such filters now. Make sure the traffic is flowing in.

4) Sit on your ass for 6 months to 1 year to allow a significant volume of emails to build up. If it is a high activity/volume group, then your wait time will be shorter.

5) Fetch the Yahoo folder contents to your local PC. Now you are sitting on the goldmine.

6) Filter out the email ids using the simple shell script provided here. Feel free to extend it to your needs. Always manually check the email ids retrieved.

7) Last, but not least, input the contacts gathered into your mail broadcast software and reap the benefits by inviting them to your newsletter/broadcast.

Remember, the golden rule of thumb to retain the interest of your audience -- Do not send too many similar mails in too short a period of time. Start very moderately and hope that most of them would join your newsletter.

Conclusion - With great power comes great responsibility..

Well, I hope that this article was helpful to you, if your intentions are true and pure. I DO NOT support misuse of this method for spamming people's inboxes or for immoral or lucrative purposes (that is NOT the intention with which this article has been written).

If you find this article useful or would like to discuss it further, please leave a comment here. Have a great day and Good luck.
