Write a Python script that downloads web crawling data (in ARC format) from the CommonCrawl.org project.
The script must accept at least three arguments: the AWS secret (private) key, the AWS access (public) key, and the file extension to extract from the links.
Example usage:
$ python [login to view URL] secret public pdf
[login to view URL]
... and so on.
The output of the script will be the links containing the file extension. The script must also keep state about which ARC file it is currently processing, so an interrupted run can resume from the same file.
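A minimal sketch of the argument handling and the state keeping, assuming Python 3 and a plain-text state file named state.txt (the state file name is an assumption, not part of the spec):

    import argparse

    def parse_args():
        # Three positional arguments, in the order shown in the example usage.
        p = argparse.ArgumentParser(
            description="Extract links with a given extension from CommonCrawl ARC files")
        p.add_argument("aws_secret", help="AWS secret (private) key")
        p.add_argument("aws_key", help="AWS access (public) key")
        p.add_argument("extension", help="file extension to look for, e.g. pdf")
        return p.parse_args()

    def save_state(arc_key, state_file="state.txt"):
        # Record which ARC file is currently being processed so an
        # interrupted run can resume from the same file.
        with open(state_file, "w") as f:
            f.write(arc_key)

    def load_state(state_file="state.txt"):
        try:
            with open(state_file) as f:
                return f.read().strip()
        except IOError:
            return None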
The script must use the Requester Pays S3 option and parse crawler data from the 2012 dataset.
Example file: s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/[login to view URL]
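A sketch of the Requester Pays download using boto3 (a 2012-era script would likely have used the older boto library, so treat this as an updated assumption). The bucket name comes from the example path above; the ARC file name is hidden in this listing, so <arc file name> below is a placeholder:

    import boto3

    def download_arc(aws_key, aws_secret, key, dest, bucket="aws-publicdatasets"):
        # RequestPayer="requester" bills the transfer to our own
        # credentials, which is what the Requester Pays option requires.
        s3 = boto3.client("s3",
                          aws_access_key_id=aws_key,
                          aws_secret_access_key=aws_secret)
        s3.download_file(bucket, key, dest,
                         ExtraArgs={"RequestPayer": "requester"})

    # e.g. download_arc(key, secret,
    #                   "common-crawl/parse-output/segment/1341690169105/<arc file name>",
    #                   "current.arc.gz")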
About the ARC file format: [login to view URL]
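Since the format description is behind the hidden URL, here is a parsing sketch based on the ARC v1 layout as I understand it: each record starts with a one-line header (URL, IP address, archive date, content type, archive length) followed by exactly archive-length bytes of payload, and the whole file is gzip-compressed:

    import gzip

    def arc_records(path):
        # ARC files are (multi-member) gzip streams; Python's gzip
        # module reads concatenated members transparently.
        with gzip.open(path, "rb") as f:
            while True:
                header = f.readline()
                if not header:           # end of file
                    break
                fields = header.strip().split()
                if len(fields) != 5:     # blank separator lines, stray data
                    continue
                url, ip, date, ctype, length = fields
                payload = f.read(int(length))
                yield url.decode("ascii", "replace"), payload

The first record of each file is the filedesc:// version block; the loop above yields it like any other record, and the link scan later simply finds nothing in it.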
Example flow (a sketch tying these steps together follows the list):
1. Download the segment list
2. Download the first ARC file in the segment and uncompress it
3. Parse the ARC file to find links with the user-selected extension
4. Print each link/URL
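A sketch of that flow end to end, reusing the helpers above. segment_arc_keys() is a hypothetical stand-in for fetching the segment list, since its exact location is hidden in this listing:

    import re

    def links_with_extension(payload, ext):
        # Crude href scan over the raw payload; fine for a first pass.
        pattern = re.compile(
            br'href=["\']([^"\']+\.' + re.escape(ext.encode()) + br')["\']',
            re.I)
        for m in pattern.finditer(payload):
            yield m.group(1).decode("utf-8", "replace")

    def main():
        args = parse_args()
        for arc_key in segment_arc_keys():               # 1. segment list (hypothetical)
            save_state(arc_key)                          # remember current ARC file
            download_arc(args.aws_key, args.aws_secret,  # 2. download the ARC file
                         arc_key, "current.arc.gz")
            for url, payload in arc_records("current.arc.gz"):  # 3. parse it
                for link in links_with_extension(payload, args.extension):
                    print(link)                          # 4. print matching links

    if __name__ == "__main__":
        main()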
Hi,
I have 10 years of experience implementing various interfaces in both Perl and Python.
Since you have already made your choice, I will deliver it in Python, as per the discussion we will have once you are free.
Thank you,
Vasundhar