com.nexwave.nquindexer
Class SaxHTMLIndex

java.lang.Object
  extended by org.xml.sax.helpers.DefaultHandler
      extended by com.nexwave.nquindexer.SaxDocFileParser
          extended by com.nexwave.nquindexer.SaxHTMLIndex
All Implemented Interfaces:
ContentHandler, DTDHandler, EntityResolver, ErrorHandler

public class SaxHTMLIndex
extends SaxDocFileParser

Parser for the HTML files generated by DITA-OT. Extracts the title, the short description (shortdesc), and the text within the div tag whose id is "content".

NOTE: Only the content under a tag with the ID "content" is indexed. To index the relevant parts of your page, wrap that HTML content in a div tag with id "content".
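
The filtering described in the note can be illustrated with a minimal, standalone sketch. It is not the SaxHTMLIndex implementation: the class ContentDivSketch, its nested handler, and the method extractContent are hypothetical names introduced here, and the sketch assumes well-formed XHTML input read by the JDK's default SAX parser. It only shows the general idea of collecting character data while inside the element whose id is "content".

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not part of com.nexwave.nquindexer.
public class ContentDivSketch {

    /** Collects character data found inside the element whose id is "content". */
    static class ContentHandlerSketch extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private int depth = 0; // 0 means we are outside the content element

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (depth > 0) {
                depth++; // nested element inside the content element
            } else if ("content".equals(atts.getValue("id"))) {
                depth = 1; // entering the content element
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (depth > 0) {
                depth--; // reaches 0 when the content element itself closes
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (depth > 0) {
                text.append(ch, start, length); // keep text only inside the content element
            }
        }

        String getText() {
            // Collapse runs of whitespace so the result is easy to compare
            return text.toString().trim().replaceAll("\\s+", " ");
        }
    }

    /** Parses a well-formed XHTML string and returns the text under id="content". */
    static String extractContent(String xhtml) {
        try {
            ContentHandlerSketch handler = new ContentHandlerSketch();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getText();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<div id=\"header\">Navigation links</div>"
                + "<div id=\"content\"><p>Run the setup script.</p></div>"
                + "<div id=\"footer\">Legal notice</div>"
                + "</body></html>";
        // Only the text of the content div is collected; header and footer are skipped
        System.out.println(extractContent(page));
    }
}
```

In this sketch, text in the header and footer div elements never reaches the collected output, which mirrors why page chrome left outside the "content" div is not indexed.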

Version:
2.0 2010
Author:
N. Quaine, Kasun Gajasinghe

Field Summary
 
Fields inherited from class com.nexwave.nquindexer.SaxDocFileParser
fileDesc, projectDir, strbf
 
Constructor Summary
SaxHTMLIndex()
          Constructor
SaxHTMLIndex(ArrayList<String> cleanUpStrings)
          Constructor
SaxHTMLIndex(ArrayList<String> cleanUpStrings, ArrayList<String> cleanUpChars)
          Constructor
 
Method Summary
 int init(Map<String,String> tempMap)
          Initializer
 DocFileInfo runExtractData(File file, String indexerLanguage)
          Parses the file to extract all the words for indexing and some data characterizing the file.
 
Methods inherited from class com.nexwave.nquindexer.SaxDocFileParser
characters, endElement, init, parseDocument, processingInstruction, RemoveValidationPI, resolveEntity, runExtractData, startElement
 
Methods inherited from class org.xml.sax.helpers.DefaultHandler
endDocument, endPrefixMapping, error, fatalError, ignorableWhitespace, notationDecl, setDocumentLocator, skippedEntity, startDocument, startPrefixMapping, unparsedEntityDecl, warning
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

SaxHTMLIndex

public SaxHTMLIndex()
Constructor


SaxHTMLIndex

public SaxHTMLIndex(ArrayList<String> cleanUpStrings)
Constructor


SaxHTMLIndex

public SaxHTMLIndex(ArrayList<String> cleanUpStrings,
                    ArrayList<String> cleanUpChars)
Constructor

Method Detail

init

public int init(Map<String,String> tempMap)
Initializer


runExtractData

public DocFileInfo runExtractData(File file,
                                  String indexerLanguage)
Parses the file to extract all the words for indexing and some data characterizing the file.

Parameters:
file - the full path of the document to parse
indexerLanguage - the indexer language, which determines which stemmer is used
Returns:
a DocFileInfo object filled with data describing the file


Copyright © 2013. All Rights Reserved.