
Read a Word document in docx format using Apache POI

Reading a Word document using Apache POI

Let's start with some brief background on working with the library, headers and footers, and paragraphs. A docx document read into memory is an instance of the class XWPFDocument, which we will parse into its components. To do this, we will need a few special classes:

  • Separate classes XWPFHeader and XWPFFooter: for working with (reading/creating) headers and footers. They can be accessed through a special provider class, XWPFHeaderFooterPolicy.
  • Class XWPFParagraph: for working with paragraphs.
  • Class XWPFWordExtractor: for extracting the contents of the entire docx document.

Apache POI contains many other useful classes for working with tables and media objects inside a Word document, but in this introductory article we will limit ourselves to parsing headers and footers and parsing text information.

Example of reading a Word document in docx format using Apache POI

Now let's add the Apache POI library to the project to work with Word documents in docx format. I'm using Maven, so I'll just add another dependency to the project.
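For reference, a dependency entry along these lines should pull in the XWPF classes. The docx support lives in the poi-ooxml artifact. The version number below is an assumption, so check for the current release before copying it.

```xml
<!-- Apache POI module with the XWPF (docx) classes.
     The version is an assumption; use the release you actually need. -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>5.2.5</version>
</dependency>
```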

If you're using Gradle or want to add the library to your project manually, you can find it here.

I will read the docx document created in the previous article, Creating a Word File. You can use your own file. The content of my document is as follows:

 

Now let's write a simple class for reading data from headers and footers and paragraphs of the document:
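A minimal sketch of such a class might look like the following. It is not necessarily the exact listing from this article: the class name DocxPartsReader, the method readParts and the default file name example.docx are assumptions made for illustration. The header and footer lookups are guarded with null checks as a defensive measure, since a document may have no header or footer at all.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.xwpf.model.XWPFHeaderFooterPolicy;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFFooter;
import org.apache.poi.xwpf.usermodel.XWPFHeader;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

// Hypothetical helper class; the names are illustrative only.
class DocxPartsReader {

    /** Collects the default header, the default footer and every paragraph, in that order. */
    static List<String> readParts(InputStream in) throws IOException {
        List<String> parts = new ArrayList<>();
        // XWPFDocument is Closeable, so try-with-resources releases it for us
        try (XWPFDocument document = new XWPFDocument(in)) {
            // The policy object gives access to the document's headers and footers
            XWPFHeaderFooterPolicy policy = document.getHeaderFooterPolicy();
            if (policy != null) {
                XWPFHeader header = policy.getDefaultHeader();
                if (header != null) {
                    parts.add("Header: " + header.getText());
                }
                XWPFFooter footer = policy.getDefaultFooter();
                if (footer != null) {
                    parts.add("Footer: " + footer.getText());
                }
            }
            // Body paragraphs, in document order
            for (XWPFParagraph paragraph : document.getParagraphs()) {
                parts.add("Paragraph: " + paragraph.getText());
            }
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        // "example.docx" is a placeholder; pass your own path as the first argument
        String path = args.length > 0 ? args[0] : "example.docx";
        try (InputStream in = new FileInputStream(path)) {
            readParts(in).forEach(System.out::println);
        }
    }
}
```

Taking an InputStream rather than a file path keeps the reading logic independent of where the document comes from, which also makes it easy to try out with a document built in memory.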

Let's run and look at the console:

Novice Java programmers, note that we used the try-with-resources construct, which was introduced in Java 7. Read more in the special section Features of Java 7.
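To see what try-with-resources does on its own, here is a small sketch that uses only the standard library. The TrackedResource class and the log it writes are hypothetical stand-ins for a real stream or document. The example shows that resources are closed automatically, in reverse order of opening, when the block ends.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; TrackedResource stands in for a stream or document.
class TryWithResourcesDemo {
    static final List<String> LOG = new ArrayList<>();

    static class TrackedResource implements AutoCloseable {
        private final String name;

        TrackedResource(String name) {
            this.name = name;
            LOG.add("open " + name);
        }

        void use() {
            LOG.add("use " + name);
        }

        @Override
        public void close() {
            LOG.add("close " + name);
        }
    }

    static List<String> run() {
        LOG.clear();
        // Both resources are closed automatically when the block exits,
        // even if an exception is thrown inside it
        try (TrackedResource first = new TrackedResource("first");
             TrackedResource second = new TrackedResource("second")) {
            first.use();
            second.use();
        }
        return new ArrayList<>(LOG);
    }

    public static void main(String[] args) {
        // Prints: open first, open second, use first, use second,
        // close second, close first (one entry per line)
        run().forEach(System.out::println);
    }
}
```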

Another way to read the contents of a Word file

The example above first parses individual parts of the document and then prints their contents to the console. But what if we just want to see the entire contents of the file at once? For this, Apache POI has a special class, XWPFWordExtractor, which does what we need in two lines.
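As a standalone sketch, the extractor approach could look like this. The class name DocxTextExtractor, the method extractText and the file name example.docx are assumptions for illustration. The two lines that matter are the creation of the extractor and the call to getText().

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

// Hypothetical helper class; the names are illustrative only.
class DocxTextExtractor {

    /** Returns the text of the whole document as a single string. */
    static String extractText(InputStream in) throws IOException {
        // Closing the extractor also releases the underlying document
        try (XWPFWordExtractor extractor = new XWPFWordExtractor(new XWPFDocument(in))) {
            return extractor.getText();
        }
    }

    public static void main(String[] args) throws IOException {
        // "example.docx" is a placeholder; pass your own path as the first argument
        String path = args.length > 0 ? args[0] : "example.docx";
        try (InputStream in = new FileInputStream(path)) {
            System.out.println(extractText(in));
        }
    }
}
```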

Just uncomment the corresponding code in the listing above and run the project again. The console then simply shows the same content a second time.

Read more about the Apache POI library here, and also see the examples of reading an Excel file and creating an Excel (xls) document, both using Apache POI.