Parser for the HTML files generated by DITA-OT.
Extracts the title, the shortdesc, and the text within the "content" div tag.
NOTE: Only the content under a tag with the ID "content" is indexed.
To index the relevant parts of your page, wrap that HTML content in a div tag with the id "content".
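The following is a minimal sketch of the idea described above, not the actual SaxHTMLIndex implementation. It assumes well-formed XHTML input, and the class name ContentDivSketch and its methods are hypothetical. It shows how a handler extending org.xml.sax.helpers.DefaultHandler might collect the title and only the text nested under the element with id "content", ignoring everything else on the page.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; it is not part of the documented API.
class ContentDivSketch extends DefaultHandler {
    private final StringBuilder title = new StringBuilder();
    private final StringBuilder content = new StringBuilder();
    private boolean inTitle = false;
    // Nesting depth inside the id="content" element; 0 means "outside".
    private int contentDepth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (contentDepth > 0) {
            contentDepth++;                       // a child of the content element
        } else if ("content".equals(atts.getValue("id"))) {
            contentDepth = 1;                     // entering the content element
        } else if ("title".equalsIgnoreCase(qName)) {
            inTitle = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (contentDepth > 0) {
            contentDepth--;                       // reaches 0 when the content element closes
        } else if ("title".equalsIgnoreCase(qName)) {
            inTitle = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inTitle) {
            title.append(ch, start, length);
        } else if (contentDepth > 0) {
            content.append(ch, start, length);    // text outside "content" is dropped
        }
    }

    String getTitle() {
        return title.toString().trim();
    }

    String getContent() {
        // Collapse runs of whitespace so the text is easier to tokenize.
        return content.toString().trim().replaceAll("\\s+", " ");
    }

    // Parses an XHTML string and returns the handler holding the extracted text.
    static ContentDivSketch parse(String xhtml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        ContentDivSketch handler = new ContentDivSketch();
        parser.parse(new InputSource(new StringReader(xhtml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><title>Install</title></head><body>"
                + "<div id=\"nav\">Home</div>"
                + "<div id=\"content\"><p>Run the <b>installer</b>.</p></div>"
                + "</body></html>";
        ContentDivSketch h = parse(page);
        System.out.println(h.getTitle());
        System.out.println(h.getContent());
    }
}
```

In this sketch the navigation text ("Home") sits outside the "content" div and is therefore never collected, which mirrors the note above: anything you want indexed has to be wrapped in that div.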
- Version:
- 2.0 2010
- Author:
- N. Quaine, Kasun Gajasinghe
Methods inherited from class org.xml.sax.helpers.DefaultHandler
endDocument, endPrefixMapping, error, fatalError, ignorableWhitespace, notationDecl, setDocumentLocator, skippedEntity, startDocument, startPrefixMapping, unparsedEntityDecl, warning
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
SaxHTMLIndex
public SaxHTMLIndex()
- Constructor
SaxHTMLIndex
public SaxHTMLIndex(ArrayList<String> cleanUpStrings)
- Constructor
SaxHTMLIndex
public SaxHTMLIndex(ArrayList<String> cleanUpStrings,
ArrayList<String> cleanUpChars)
- Constructor
init
public int init(Map<String,String> tempMap)
- Initializer
runExtractData
public DocFileInfo runExtractData(File file,
String indexerLanguage)
- Parses the file to extract all the words for indexing and
some data characterizing the file.
- Parameters:
file
- contains the full path of the document to parse
indexerLanguage
- tells the program which stemmer to use
- Returns:
- a DocFileInfo object filled with data describing the file
Copyright © 2013. All Rights Reserved.