Attached are two files with some dummy data for your reference.
Both are fixed-width files; the start and end position of each column can be obtained from the header.
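To show what "obtained from the header" could mean, here is a minimal Python sketch (not Pig) of deriving column boundaries without hardcoding them. It assumes the header is a single line of column names, each left-aligned at the start of its column and padded with spaces; the attached files may differ. The header, the sample line, and the names `column_spans` and `parse_line` are hypothetical.

```python
import re

def column_spans(header):
    """Return (name, start, end) per column; end is exclusive, None for the last column."""
    matches = list(re.finditer(r"\S+", header))
    spans = []
    for i, m in enumerate(matches):
        # Assumption: a column runs until the next column name begins.
        end = matches[i + 1].start() if i + 1 < len(matches) else None
        spans.append((m.group().lower(), m.start(), end))
    return spans

def parse_line(line, spans):
    """Slice one fixed-width data line into a dict keyed by column name."""
    return {name: line[start:end].strip() for name, start, end in spans}

# Hypothetical header and data line, built with ljust so the widths are explicit.
header = "ID".ljust(6) + "NAME".ljust(10) + "AMOUNT"
line = "00042".ljust(6) + "Alice".ljust(10) + "000150"

spans = column_spans(header)
record = parse_line(line, spans)
print(spans)
print(record)
```

Because the spans are computed from the header, the same code would work for a file with different columns or widths, as long as the header follows the assumed layout.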
Tasks -
1) Load the 25 Dec file into HDFS using Sqoop
2) Cleanse the data and save the file to HDFS using a Pig script
- Convert the text file to UTF-8 (*)
- Convert relevant fields to integers (*)
- Convert the entire data set to lowercase
- Remove leading zeros
3) Load the 26 Dec file into HDFS using Sqoop
4) Cleanse the data and save the file to HDFS using a Pig script
5) Diff the two files and save the new and updated records to HDFS using a Pig script (*)
- It should save only the new and updated records from the 26 Dec file. In that file, the second-to-last record is an update and the last record is a new one.
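The diff in task 5 can be sketched in Python (not Pig) to pin down the intended logic. This assumes each record has a unique key field and that a record counts as "updated" when its key exists in the 25 Dec file but any other field differs. The field names, the sample rows, and `diff_records` are hypothetical.

```python
def diff_records(old_rows, new_rows, key):
    """Return (status, row) pairs for rows in new_rows that are new or changed."""
    old_by_key = {row[key]: row for row in old_rows}
    changed = []
    for row in new_rows:
        previous = old_by_key.get(row[key])
        if previous is None:
            changed.append(("new", row))        # key not present in the older file
        elif previous != row:
            changed.append(("updated", row))    # same key, at least one field differs
    return changed

# Hypothetical cleansed records standing in for the 25 Dec and 26 Dec files.
dec25 = [
    {"id": 1, "name": "alice", "amount": 150},
    {"id": 2, "name": "bob", "amount": 200},
]
dec26 = [
    {"id": 1, "name": "alice", "amount": 150},
    {"id": 2, "name": "bob", "amount": 250},
    {"id": 3, "name": "carol", "amount": 75},
]

result = diff_records(dec25, dec26, key="id")
for status, row in result:
    print(status, row)
```

In Pig, the same idea would typically be a left outer join of the 26 Dec relation to the 25 Dec relation on the key, followed by a filter that keeps rows where the 25 Dec side is null or any field differs.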
Tasks marked with (*) are where I need your help.
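For the starred cleansing steps, the intended behaviour can be sketched in Python (not Pig). The sketch assumes the source files are Latin-1 encoded, which is only a guess, and treats any field made up solely of ASCII digits as an integer so that no column names are hardcoded. The names `to_utf8`, `clean_value`, and `clean_record` are hypothetical.

```python
def to_utf8(raw, source_encoding="latin-1"):
    """Re-encode raw bytes as UTF-8. The source encoding is an assumption."""
    return raw.decode(source_encoding).encode("utf-8")

def clean_value(value):
    """Lowercase a field; all-digit fields become int, which also drops leading zeros."""
    text = value.strip().lower()
    if text.isascii() and text.isdigit():
        return int(text)
    return text

def clean_record(record):
    """Apply clean_value to every field of a parsed record."""
    return {name: clean_value(value) for name, value in record.items()}

# Hypothetical parsed record and raw bytes.
cleaned = clean_record({"id": "00042", "name": "Alice", "amount": "000150"})
utf8_bytes = to_utf8("Café".encode("latin-1"))
print(cleaned)
print(utf8_bytes)
```

Detecting integers by content keeps the code generic, but it would also convert digit-only fields that should stay text (for example codes where leading zeros matter), and it does not handle signed or decimal values; those cases would need an explicit rule.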
Please review and advise.
-- The code should be generic, not hardcoded.
Thanks,
Mani