Read .doc Files From Directory And Count Unique Words In Java

LEVEL: Intermediate

What's in this post?

  1. Code
  2. Background and Explanation of the code
  3. Output

Code




Background and Explanation of the code

Before compiling this program, add the following two jar files:

poi-scratchpad-3.7-20101029.jar

poi-3.7-20101029.jar

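Then make these libraries visible to the compiler and the JVM by adding them to the classpath, for example through My Computer -> Properties -> Environment Variables (edit the CLASSPATH variable and add the full path of each jar, e.g. under c:\test). These jar files belong to Apache POI, and their purpose is to read the .doc files.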

First, set the path of the directory where the files are located (for example c:, d: or e:) and assign it to the variable folder. Using folder, the program fetches the list of files in that directory and stores them in an array named ‘listofFiles’. It then prints the file names in a for loop.

Next, each file name is appended to the directory path and the result is assigned to the variable ‘file-name’. This variable is passed as an argument to the function readMyDocument, which is responsible for reading the .doc file from the given directory.
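As a rough illustration of these first two steps, the sketch below lists the .doc files in a folder and builds the full path for each one. It is not the original listing: the class name DocFileLister, the helper listDocFiles and the .doc filter are assumptions made for this example, and c:\test is only a placeholder folder.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DocFileLister {

    // Returns the names of the .doc files directly inside the given folder.
    // (Hypothetical helper, not part of the original program.)
    static List<String> listDocFiles(File folder) {
        List<String> names = new ArrayList<>();
        File[] listOfFiles = folder.listFiles(); // null if folder is not a readable directory
        if (listOfFiles == null) {
            return names;
        }
        Arrays.sort(listOfFiles); // listFiles() makes no promise about ordering
        for (File f : listOfFiles) {
            if (f.isFile() && f.getName().toLowerCase().endsWith(".doc")) {
                System.out.println("File " + f.getName());
                names.add(f.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Placeholder folder; pass a different one as the first argument.
        File folder = new File(args.length > 0 ? args[0] : "c:\\test");
        for (String name : listDocFiles(folder)) {
            // Directory path + file name; the post hands this value to readMyDocument.
            String fileName = new File(folder, name).getPath();
            System.out.println(fileName);
        }
    }
}
```

Checking the result of listFiles() for null matters in practice, because that is what the method returns when the path does not exist or is not a directory.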

readMyDocument in turn calls another function that reads the paragraphs of the file. A ‘word extractor’ object is created, which is responsible for retrieving the paragraphs from the document, and the number of paragraphs is then printed (the “Total Paragraphs” line in the output).

After this, a for loop runs over the paragraphs, and within the loop each paragraph is printed in full using the function paragraph[].getstring.

A tokenstream object is then created, which provides the tokens of the given paragraph. It is initialized with the new keyword by calling its one-argument constructor; the argument is the paragraph of the doc.

The while loop that follows tokenizes the paragraph and passes each word to a function that works with a TreeSet object. Before a word is added, the condition a.contains(word) checks whether the word is already in the TreeSet. If it is, count is not incremented, because count holds the number of unique words in the document. Otherwise, count is incremented and the word is added to the TreeSet.
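The counting step can be sketched with the standard library alone. This is a minimal example, not the original code: it assumes the tokenizer is java.util.StringTokenizer splitting on whitespace, and the class name UniqueWordCounter and method countUniqueWords are made up for illustration.

```java
import java.util.StringTokenizer;
import java.util.TreeSet;

public class UniqueWordCounter {

    // Counts the distinct whitespace-separated tokens in a paragraph.
    // (Hypothetical helper, not part of the original program.)
    static int countUniqueWords(String paragraph) {
        TreeSet<String> seen = new TreeSet<>();
        StringTokenizer tokens = new StringTokenizer(paragraph);
        int count = 0;
        while (tokens.hasMoreTokens()) {
            String word = tokens.nextToken();
            // Only a word that is not yet in the set counts as unique.
            if (!seen.contains(word)) {
                seen.add(word);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // "the" appears twice but is counted once.
        System.out.println("count=" + countUniqueWords("the cat and the hat")); // prints count=4
    }
}
```

With this kind of tokenizing, words are compared exactly as they appear, so "The" and "the" count as two words and punctuation stays attached to the word it follows. Since TreeSet.add already returns false for a duplicate, the contains check could be folded into the add call; it is kept here to mirror the description above.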

At the end, the program prints the number of unique words found in the document using the System.out.println() function.

Output:

run:
File 1.doc
Total Paragraphs: 1
Model system Building a good simulation model can be a lot of work. We need to specify and model the system to be simulated, implement the model, collect data on the corresponding real system (if any), verify and validate the simulation system and run the simulation. In order to test different ideas, to learn about the system behavior in new situations and to make efficient decisions, decision makers need ways to explore the simulation outputs easily and rapidly. Geosimulation users need tools to analyse the spatial output data. In this paper we present an approach which combines spatial on-line analytical processing (SOLAP) techniques with multiagent geosimulation techniques in order to improve the exploration of spatial and non-spatial data resulting from geosimulations. As an example, we present the application of our approach to the simulation of the shopping behavior of customers in a shopping mall.

//unique word count is
count=89
BUILD SUCCESSFUL (total time: 0 seconds)