I need a tool to extract email addresses and phone numbers from Craigslist ads.
Usage:
clextract STARTURL [-n MAXADS] [-d]
STARTURL is the URL to start getting ads from, e.g. <[login to view URL]>
MAXADS is the maximum number of ads to visit
The -d option skips duplicate entries (see below)
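A minimal sketch of how the usage line above could be parsed with Python's standard argparse module. The function name parse_args, the attribute names (starturl, maxads, dedupe) and the default of 100 for MAXADS are assumptions for illustration; the description above does not specify a default.

```python
import argparse


def parse_args(argv=None):
    """Parse: clextract STARTURL [-n MAXADS] [-d]"""
    parser = argparse.ArgumentParser(
        prog="clextract",
        description="Extract emails and phone numbers from ads.",
    )
    parser.add_argument("starturl", metavar="STARTURL",
                        help="URL to start getting ads from")
    # Assumption: 100 is a placeholder default, not part of the spec.
    parser.add_argument("-n", dest="maxads", metavar="MAXADS", type=int,
                        default=100, help="maximum number of ads to visit")
    parser.add_argument("-d", dest="dedupe", action="store_true",
                        help="skip ads that repeat an email or phone number")
    return parser.parse_args(argv)
```

Called with no argument, parse_args() reads sys.argv; passing a list makes it easy to exercise from tests.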
The tool shall visit up to MAXADS ads reachable from STARTURL (note that it may need to follow pagination if MAXADS is large, e.g. 1000) and internally build a list of ads, each ad containing the following fields, extracted directly from the ad:
- Ad URL
- Ad headline
- Ad text
- Ad location
- Ad date
In addition, for each visited ad, the script shall try to extract the following:
- Ad email (only real emails, not emails ending in @[login to view URL])
- Ad phone number
- Ad contact person's first name
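One possible in-memory shape for an ad, combining the fields read directly from the ad with the three extracted fields. The class name Ad and the attribute names are hypothetical; the extracted fields default to "" so an ad can be created before extraction runs.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    # Fields taken directly from the ad page.
    url: str
    headline: str
    text: str
    location: str
    date: str
    # Fields the script tries to extract; "" means "not found".
    email: str = ""
    phone: str = ""
    first_name: str = ""
```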
Some ads do not have an email or phone number. The script shall have two functions, one that finds a list of US phone numbers inside an ad and one that finds a list of emails, and it shall use them to search the ad text.
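The two search functions could be sketched with regular expressions as follows. This is an illustration under assumptions, not a complete validator: the patterns cover common US phone layouts and ordinary email addresses only, the function names are hypothetical, and the anonymized relay domain to exclude is passed in as a parameter because it is not spelled out above (relay.example.org below is a stand-in).

```python
import re

# Simplified email pattern; it does not cover every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# US phone: optional +1 or 1, then 3-3-4 digits with optional separators.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})(?!\d)"
)


def find_emails(text, excluded_domains=()):
    """Return unique lower-cased emails, skipping excluded domains."""
    found = []
    for match in EMAIL_RE.finditer(text):
        email = match.group(0).lower()
        domain = email.rsplit("@", 1)[1]
        if any(domain == d or domain.endswith("." + d)
               for d in excluded_domains):
            continue
        if email not in found:
            found.append(email)
    return found


def find_phones(text):
    """Return unique phone numbers normalized to NNN-NNN-NNNN."""
    found = []
    for match in PHONE_RE.finditer(text):
        phone = "-".join(match.groups())
        if phone not in found:
            found.append(phone)
    return found
```

For example, find_phones("Call (415) 555-0132") returns ["415-555-0132"]. Normalizing to a single layout is a design choice that makes later duplicate checks a plain string comparison.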
If the tool finds either an email or a phone number, the script shall add the ad to an internal list of found ads and attempt to find the contact person's first name. Initially this can be implemented by keeping a list of common US English first names (e.g. Matthew, Robert, Bill) and searching for them inside the ad text. If a name is found, the script shall store it; if not, it shall store "".
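A rough sketch of the name lookup described above. The tuple holds only the three example names; a real list would be much longer. Matching is whole-word and case-sensitive, which is an assumption chosen here so that "bill" in "pay the bill" is not mistaken for the name "Bill"; it will miss names written in all capitals.

```python
import re

# Stub list using only the example names; extend as needed.
COMMON_FIRST_NAMES = ("Matthew", "Robert", "Bill")


def find_first_name(text, names=COMMON_FIRST_NAMES):
    """Return the listed name that appears earliest in text, or ""."""
    best_name, best_pos = "", None
    for name in names:
        # \b keeps "Robert" from matching inside "Roberta".
        match = re.search(r"\b" + re.escape(name) + r"\b", text)
        if match and (best_pos is None or match.start() < best_pos):
            best_name, best_pos = name, match.start()
    return best_name
```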
If the -d option is set, the script shall first check whether the list already contains an ad with EITHER the same email OR the same phone number. If so, it shall skip the ad; otherwise it shall save the ad internally.
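The duplicate check could look roughly like this. Ads are represented as plain dictionaries with "email" and "phone" keys, and the function names are hypothetical. One assumption worth flagging: an empty email or phone is never treated as a match, otherwise every ad without a phone number would count as a duplicate of the first such ad.

```python
def is_duplicate(ad, saved_ads):
    """True if a saved ad shares a non-empty email OR phone with ad."""
    for saved in saved_ads:
        if ad["email"] and ad["email"] == saved["email"]:
            return True
        if ad["phone"] and ad["phone"] == saved["phone"]:
            return True
    return False


def save_ad(ad, saved_ads, dedupe):
    """Append ad to saved_ads unless -d is set and it is a duplicate."""
    if dedupe and is_duplicate(ad, saved_ads):
        return False
    saved_ads.append(ad)
    return True
```

For runs with a large MAXADS, keeping two sets of seen emails and phones would avoid rescanning the list for every ad.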
Each saved ad will be output as follows (CSV file):
adURL, adHeadline, adEmail, adPhone, adFirstName, AdLocation, AdDate
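A short sketch of the CSV output using Python's csv module, with the column names taken from the line above. Writing those names as a header row is an assumption, as is the helper name ads_to_csv; ads are assumed to be dictionaries keyed by column name, and extra keys such as the ad text are ignored.

```python
import csv
import io

CSV_FIELDS = ["adURL", "adHeadline", "adEmail", "adPhone",
              "adFirstName", "AdLocation", "AdDate"]


def ads_to_csv(ads):
    """Return the saved ads as CSV text, one row per ad."""
    buf = io.StringIO(newline="")
    # extrasaction="ignore" drops keys that are not output columns.
    writer = csv.DictWriter(buf, fieldnames=CSV_FIELDS,
                            extrasaction="ignore")
    writer.writeheader()  # Assumption: a header row is wanted.
    for ad in ads:
        writer.writerow(ad)
    return buf.getvalue()
```

The csv module quotes fields that contain commas, so a headline such as "Bike, barely used" stays in a single column.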
OPTIONALLY, the tool may be extended to output the information to a Google Docs spreadsheet instead of a CSV file, using the following API:
<[login to view URL]>