I want to categorize a document into one of several pre-defined classes, based on corpora I have already built. E.g., when someone gives me a document, I want to use Term Frequency-Inverse Document Frequency (tf-idf) to assign it to one of the pre-defined corpora, such as Networking or OS.
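To make the tf-idf idea concrete, here is a minimal sketch, not a definitive implementation. It assumes each category corpus has already been reduced to a list of tokens, uses a smoothed idf (log(N / (1 + df)) + 1, my assumption), and picks the category whose tf-idf vector has the highest cosine similarity to the document. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TfIdfSketch {

    // Relative term frequency: count of each token divided by the token total.
    static Map<String, Double> termFrequencies(List<String> tokens) {
        Map<String, Double> tf = new HashMap<>();
        for (String t : tokens) {
            tf.merge(t, 1.0, Double::sum);
        }
        for (Map.Entry<String, Double> e : tf.entrySet()) {
            e.setValue(e.getValue() / tokens.size());
        }
        return tf;
    }

    // Smoothed idf (an assumption): each category corpus counts as one "document".
    static double idf(String term, Map<String, List<String>> corpora) {
        int df = 0;
        for (List<String> tokens : corpora.values()) {
            if (tokens.contains(term)) df++;
        }
        return Math.log((double) corpora.size() / (1 + df)) + 1.0;
    }

    // tf-idf weight for every distinct token in the given token list.
    static Map<String, Double> tfIdf(List<String> tokens, Map<String, List<String>> corpora) {
        Map<String, Double> weights = termFrequencies(tokens);
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            e.setValue(e.getValue() * idf(e.getKey(), corpora));
        }
        return weights;
    }

    // Cosine similarity between two sparse vectors; 0 if either is empty.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // Returns the category name with the highest similarity, or null if there are no corpora.
    static String classify(List<String> docTokens, Map<String, List<String>> corpora) {
        Map<String, Double> docVector = tfIdf(docTokens, corpora);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, List<String>> c : corpora.entrySet()) {
            double score = cosine(docVector, tfIdf(c.getValue(), corpora));
            if (score > bestScore) {
                bestScore = score;
                best = c.getKey();
            }
        }
        return best;
    }
}
```

With only a handful of categories the idf term barely separates words, so in practice the idf would probably be computed over the individual documents inside each corpus rather than over whole corpora.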
Suppose someone submits papers at an [login to view URL] to categorize the papers according to domains is [login to view URL] thinking about automating it. We choose the document to be categorized and, since I have already made the data sets, check whether it is a Networking paper or an OS paper. First the stop words have to be removed, and then stemming has to be done. Then we have to count the number of times each word appears in the document and match the most frequent words against the data sets to determine which domain the document belongs to.
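The steps above (stop-word removal, stemming, counting, matching the top words) could be sketched roughly as follows. Everything here is an assumption for illustration: the stop-word list is tiny, the stemmer is a naive suffix stripper standing in for a real one such as Porter's, and the per-category keyword sets are hypothetical stand-ins for the prepared data sets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TermCountSketch {

    // Tiny illustrative stop-word list (assumption; a real list is much longer).
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "the", "a", "an", "is", "are", "of", "and", "to", "in", "on", "for"));

    // Naive suffix stripping, only a placeholder for a real stemming algorithm.
    static String stem(String word) {
        for (String suffix : new String[] {"ing", "ed", "es", "s"}) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Lower-cases, splits on non-letters, drops stop words, stems, and counts.
    static Map<String, Integer> countTerms(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String raw : text.toLowerCase().split("[^a-z]+")) {
            if (raw.isEmpty() || STOP_WORDS.contains(raw)) continue;
            counts.merge(stem(raw), 1, Integer::sum);
        }
        return counts;
    }

    // The n most frequent terms; ties are broken alphabetically.
    static List<String> topTerms(Map<String, Integer> counts, int n) {
        List<String> terms = new ArrayList<>(counts.keySet());
        terms.sort((x, y) -> {
            int byCount = Integer.compare(counts.get(y), counts.get(x));
            return byCount != 0 ? byCount : x.compareTo(y);
        });
        return new ArrayList<>(terms.subList(0, Math.min(n, terms.size())));
    }

    // Picks the category whose keyword set shares the most terms with the top terms.
    static String bestCategory(List<String> top, Map<String, Set<String>> categoryTerms) {
        String best = null;
        int bestOverlap = -1;
        for (Map.Entry<String, Set<String>> e : categoryTerms.entrySet()) {
            int overlap = 0;
            for (String t : top) {
                if (e.getValue().contains(t)) overlap++;
            }
            if (overlap > bestOverlap) {
                bestOverlap = overlap;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

For the match to work, the category keyword sets would need to be stemmed with the same stemmer as the document.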
I need a GUI or console-based Java program.
The input will be the path of the document to be categorized, and the output will be the computed category (such as OS, Networking, Database, etc.).
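A console version of that input/output contract might look like the sketch below. It assumes the document is a plain-text UTF-8 file, and the keyword lists are hypothetical placeholders for the prepared data sets; the scoring is a simple keyword hit count, not the full tf-idf method.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CategorizerCli {

    // Hypothetical keyword lists standing in for the real data sets.
    static final Map<String, List<String>> KEYWORDS = new LinkedHashMap<>();
    static {
        KEYWORDS.put("Networking", Arrays.asList("router", "packet", "tcp", "protocol"));
        KEYWORDS.put("OS", Arrays.asList("kernel", "process", "scheduler", "thread"));
        KEYWORDS.put("Database", Arrays.asList("query", "index", "transaction", "sql"));
    }

    // Reads the file as UTF-8 text and returns the category with the most keyword hits.
    static String classifyFile(Path path) throws IOException {
        String text = new String(Files.readAllBytes(path), StandardCharsets.UTF_8).toLowerCase();
        List<String> words = Arrays.asList(text.split("[^a-z]+"));
        String best = "Unknown"; // returned when no keyword matches at all
        int bestHits = 0;
        for (Map.Entry<String, List<String>> e : KEYWORDS.entrySet()) {
            int hits = 0;
            for (String w : words) {
                if (e.getValue().contains(w)) hits++;
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java CategorizerCli <path-to-document>");
            System.exit(1);
        }
        System.out.println(classifyFile(Paths.get(args[0])));
    }
}
```

Formats such as PDF or DOC would need a text-extraction step before this point.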
SCJP & OCPWCD certified professional here. Ready to start right now.
I have already built similar projects in which I fetch and store information in Excel files. I can guarantee working with docs.
Check your PMB for further contact.
We have experience with medical language interpretation and virtualization. I am currently a C# programmer, but I have a good command of SQL, Java, PHP, and JS.
My job is to categorize articles and distribute keywords into categories.
In my job, I have already tried and implemented other, better algorithms, but the implementation is in C# at the moment.
I can convert it to Java, so I think I can finish your job in 4 days.
Thanks, Tri.