org.biojavax.bio.seq.io
Class EMBLxmlFormat

java.lang.Object
  extended by org.biojavax.bio.seq.io.RichSequenceFormat.BasicFormat
      extended by org.biojavax.bio.seq.io.EMBLxmlFormat
All Implemented Interfaces:
SequenceFormat, RichSequenceFormat

public class EMBLxmlFormat
extends RichSequenceFormat.BasicFormat

Format reader for EMBLxml files. This version of EMBLxml format will generate and write RichSequence objects. Loosely based on code from the old, deprecated org.biojava.bio.seq.io.GenbankXmlFormat object. Understands http://www.ebi.ac.uk/embl/dtd/EMBL_Services_V1.1.dtd

Since:
1.5
Author:
Alan Li (code based on his work), Richard Holland, Mark Schreiber

Nested Class Summary
static class EMBLxmlFormat.Terms
          Implements some EMBLxml-specific terms.
 
Nested classes/interfaces inherited from interface org.biojavax.bio.seq.io.RichSequenceFormat
RichSequenceFormat.BasicFormat, RichSequenceFormat.HeaderlessFormat
 
Field Summary
protected static String AUTHOR_TAG
           
protected static String BASEPOSITION_TAG
           
protected static String BASEPOSITION_TYPE_ATTR
           
protected static String CITATION_DATE_ATTR
           
protected static String CITATION_FIRST_ATTR
           
protected static String CITATION_ID_ATTR
           
protected static String CITATION_INSTITUTE_ATTR
           
protected static String CITATION_ISSUE_ATTR
           
protected static String CITATION_LAST_ATTR
           
protected static String CITATION_LOCATION_TAG
           
protected static String CITATION_NAME_ATTR
           
protected static String CITATION_PATENT_ATTR
           
protected static String CITATION_PUB_ATTR
           
protected static String CITATION_TAG
           
protected static String CITATION_TYPE_ATTR
           
protected static String CITATION_VOL_ATTR
           
protected static String CITATION_YEAR_ATTR
           
protected static String COMMENT_TAG
           
protected static String COMNAME_TAG
           
protected static String CONSORTIUM_TAG
           
protected static String CONTIG_TAG
           
protected static String DBREF_DB_ATTR
           
protected static String DBREF_PRIMARY_ATTR
           
protected static String DBREF_SEC_ATTR
           
protected static String DBREFERENCE_TAG
           
protected static String DESC_TAG
           
protected static String EDITOR_TAG
           
static String EMBLXML_FORMAT
          The name of this format
protected static String ENTRY_ACCESSION_ATTR
           
protected static String ENTRY_CREATED_ATTR
           
protected static String ENTRY_DATACLASS_ATTR
           
protected static String ENTRY_GROUP_TAG
           
protected static String ENTRY_RELCREATED_ATTR
           
protected static String ENTRY_RELUPDATED_ATTR
           
protected static String ENTRY_STATUS_ATTR
           
protected static String ENTRY_STATUS_DATE_ATTR
           
protected static String ENTRY_SUBACC_ATTR
           
protected static String ENTRY_SUBVER_ATTR
           
protected static String ENTRY_SUBWGSVER_ATTR
           
protected static String ENTRY_TAG
           
protected static String ENTRY_TAX_DIVISION_ATTR
           
protected static String ENTRY_UPDATED_ATTR
           
protected static String ENTRY_VER_ATTR
           
protected static String FEATURE_NAME_ATTR
           
protected static String FEATURE_TAG
           
protected static String KEYWORD_TAG
           
protected static String LINEAGE_TAG
           
protected static String LOC_ELEMENT_ACC_ATTR
           
protected static String LOC_ELEMENT_COMPL_ATTR
           
protected static String LOC_ELEMENT_TYPE_ATTR
           
protected static String LOC_ELEMENT_VER_ATTR
           
protected static String LOCATION_COMPL_ATTR
           
protected static String LOCATION_ELEMENT_TAG
           
protected static String LOCATION_TAG
           
protected static String LOCATION_TYPE_ATTR
           
protected static String LOCATOR_TAG
           
protected static String ORGANELLE_TAG
           
protected static String ORGANISM_TAG
           
protected static String PATENT_TAG
           
protected static String PROJ_ACC_TAG
           
protected static String QUALIFIER_NAME_ATTR
           
protected static String QUALIFIER_TAG
           
protected static String REF_POS_BEGIN_ATTR
           
protected static String REF_POS_END_ATTR
           
protected static String REFERENCE_TAG
           
protected static String SCINAME_TAG
           
protected static String SEC_ACC_TAG
           
protected static String SEQUENCE_LENGTH_ATTR
           
protected static String SEQUENCE_TAG
           
protected static String SEQUENCE_TOPOLOGY_ATTR
           
protected static String SEQUENCE_TYPE_ATTR
           
protected static String SEQUENCE_VER_ATTR
           
protected static String TAXID_TAG
           
protected static String TAXON_TAG
           
protected static String TITLE_TAG
           
protected static Pattern xmlSchema
           
 
Constructor Summary
EMBLxmlFormat()
           
 
Method Summary
 void beginWriting()
          Informs the writer that we want to start writing.
 boolean canRead(BufferedInputStream stream)
          Check to see if a given stream is in our format.
 boolean canRead(File file)
          Check to see if a given file is in our format.
 void finishWriting()
          Informs the writer that we are done writing.
 String getDefaultFormat()
          getDefaultFormat returns the String identifier for the default sub-format written by a SequenceFormat implementation.
 SymbolTokenization guessSymbolTokenization(BufferedInputStream stream)
          On the assumption that the stream is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it.
 SymbolTokenization guessSymbolTokenization(File file)
          On the assumption that the file is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it.
 boolean readRichSequence(BufferedReader reader, SymbolTokenization symParser, RichSeqIOListener rlistener, Namespace ns)
          Reads a sequence from the given buffered reader using the given tokenizer to parse sequence symbols.
 boolean readSequence(BufferedReader reader, SymbolTokenization symParser, SeqIOListener listener)
          Read a sequence and pass data on to a SeqIOListener.
 void writeSequence(Sequence seq, Namespace ns)
          Writes a sequence out to the output stream given by beginWriting() using the default format of the implementing class.
 void writeSequence(Sequence seq, PrintStream os)
          writeSequence writes a sequence to the specified PrintStream, using the default format.
 void writeSequence(Sequence seq, String format, PrintStream os)
          writeSequence writes a sequence to the specified PrintStream, using the specified format.
 
Methods inherited from class org.biojavax.bio.seq.io.RichSequenceFormat.BasicFormat
getElideComments, getElideFeatures, getElideReferences, getElideSymbols, getLineWidth, getPrintStream, setElideComments, setElideFeatures, setElideReferences, setElideSymbols, setLineWidth, setPrintStream
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

EMBLXML_FORMAT

public static final String EMBLXML_FORMAT
The name of this format

See Also:
Constant Field Values

ENTRY_GROUP_TAG

protected static final String ENTRY_GROUP_TAG
See Also:
Constant Field Values

ENTRY_TAG

protected static final String ENTRY_TAG
See Also:
Constant Field Values

ENTRY_ACCESSION_ATTR

protected static final String ENTRY_ACCESSION_ATTR
See Also:
Constant Field Values

ENTRY_TAX_DIVISION_ATTR

protected static final String ENTRY_TAX_DIVISION_ATTR
See Also:
Constant Field Values

ENTRY_DATACLASS_ATTR

protected static final String ENTRY_DATACLASS_ATTR
See Also:
Constant Field Values

ENTRY_CREATED_ATTR

protected static final String ENTRY_CREATED_ATTR
See Also:
Constant Field Values

ENTRY_RELCREATED_ATTR

protected static final String ENTRY_RELCREATED_ATTR
See Also:
Constant Field Values

ENTRY_UPDATED_ATTR

protected static final String ENTRY_UPDATED_ATTR
See Also:
Constant Field Values

ENTRY_RELUPDATED_ATTR

protected static final String ENTRY_RELUPDATED_ATTR
See Also:
Constant Field Values

ENTRY_VER_ATTR

protected static final String ENTRY_VER_ATTR
See Also:
Constant Field Values

ENTRY_SUBACC_ATTR

protected static final String ENTRY_SUBACC_ATTR
See Also:
Constant Field Values

ENTRY_SUBVER_ATTR

protected static final String ENTRY_SUBVER_ATTR
See Also:
Constant Field Values

ENTRY_SUBWGSVER_ATTR

protected static final String ENTRY_SUBWGSVER_ATTR
See Also:
Constant Field Values

ENTRY_STATUS_ATTR

protected static final String ENTRY_STATUS_ATTR
See Also:
Constant Field Values

ENTRY_STATUS_DATE_ATTR

protected static final String ENTRY_STATUS_DATE_ATTR
See Also:
Constant Field Values

SEC_ACC_TAG

protected static final String SEC_ACC_TAG
See Also:
Constant Field Values

PROJ_ACC_TAG

protected static final String PROJ_ACC_TAG
See Also:
Constant Field Values

DESC_TAG

protected static final String DESC_TAG
See Also:
Constant Field Values

KEYWORD_TAG

protected static final String KEYWORD_TAG
See Also:
Constant Field Values

REFERENCE_TAG

protected static final String REFERENCE_TAG
See Also:
Constant Field Values

CITATION_TAG

protected static final String CITATION_TAG
See Also:
Constant Field Values

CITATION_ID_ATTR

protected static final String CITATION_ID_ATTR
See Also:
Constant Field Values

CITATION_TYPE_ATTR

protected static final String CITATION_TYPE_ATTR
See Also:
Constant Field Values

CITATION_DATE_ATTR

protected static final String CITATION_DATE_ATTR
See Also:
Constant Field Values

CITATION_NAME_ATTR

protected static final String CITATION_NAME_ATTR
See Also:
Constant Field Values

CITATION_VOL_ATTR

protected static final String CITATION_VOL_ATTR
See Also:
Constant Field Values

CITATION_ISSUE_ATTR

protected static final String CITATION_ISSUE_ATTR
See Also:
Constant Field Values

CITATION_FIRST_ATTR

protected static final String CITATION_FIRST_ATTR
See Also:
Constant Field Values

CITATION_LAST_ATTR

protected static final String CITATION_LAST_ATTR
See Also:
Constant Field Values

CITATION_PUB_ATTR

protected static final String CITATION_PUB_ATTR
See Also:
Constant Field Values

CITATION_PATENT_ATTR

protected static final String CITATION_PATENT_ATTR
See Also:
Constant Field Values

CITATION_INSTITUTE_ATTR

protected static final String CITATION_INSTITUTE_ATTR
See Also:
Constant Field Values

CITATION_YEAR_ATTR

protected static final String CITATION_YEAR_ATTR
See Also:
Constant Field Values

DBREFERENCE_TAG

protected static final String DBREFERENCE_TAG
See Also:
Constant Field Values

DBREF_DB_ATTR

protected static final String DBREF_DB_ATTR
See Also:
Constant Field Values

DBREF_PRIMARY_ATTR

protected static final String DBREF_PRIMARY_ATTR
See Also:
Constant Field Values

DBREF_SEC_ATTR

protected static final String DBREF_SEC_ATTR
See Also:
Constant Field Values

CONSORTIUM_TAG

protected static final String CONSORTIUM_TAG
See Also:
Constant Field Values

TITLE_TAG

protected static final String TITLE_TAG
See Also:
Constant Field Values

EDITOR_TAG

protected static final String EDITOR_TAG
See Also:
Constant Field Values

AUTHOR_TAG

protected static final String AUTHOR_TAG
See Also:
Constant Field Values

PATENT_TAG

protected static final String PATENT_TAG
See Also:
Constant Field Values

LOCATOR_TAG

protected static final String LOCATOR_TAG
See Also:
Constant Field Values

CITATION_LOCATION_TAG

protected static final String CITATION_LOCATION_TAG
See Also:
Constant Field Values

REF_POS_BEGIN_ATTR

protected static final String REF_POS_BEGIN_ATTR
See Also:
Constant Field Values

REF_POS_END_ATTR

protected static final String REF_POS_END_ATTR
See Also:
Constant Field Values

COMMENT_TAG

protected static final String COMMENT_TAG
See Also:
Constant Field Values

FEATURE_TAG

protected static final String FEATURE_TAG
See Also:
Constant Field Values

FEATURE_NAME_ATTR

protected static final String FEATURE_NAME_ATTR
See Also:
Constant Field Values

ORGANISM_TAG

protected static final String ORGANISM_TAG
See Also:
Constant Field Values

SCINAME_TAG

protected static final String SCINAME_TAG
See Also:
Constant Field Values

COMNAME_TAG

protected static final String COMNAME_TAG
See Also:
Constant Field Values

TAXID_TAG

protected static final String TAXID_TAG
See Also:
Constant Field Values

LINEAGE_TAG

protected static final String LINEAGE_TAG
See Also:
Constant Field Values

TAXON_TAG

protected static final String TAXON_TAG
See Also:
Constant Field Values

ORGANELLE_TAG

protected static final String ORGANELLE_TAG
See Also:
Constant Field Values

QUALIFIER_TAG

protected static final String QUALIFIER_TAG
See Also:
Constant Field Values

QUALIFIER_NAME_ATTR

protected static final String QUALIFIER_NAME_ATTR
See Also:
Constant Field Values

LOCATION_TAG

protected static final String LOCATION_TAG
See Also:
Constant Field Values

LOCATION_TYPE_ATTR

protected static final String LOCATION_TYPE_ATTR
See Also:
Constant Field Values

LOCATION_COMPL_ATTR

protected static final String LOCATION_COMPL_ATTR
See Also:
Constant Field Values

LOCATION_ELEMENT_TAG

protected static final String LOCATION_ELEMENT_TAG
See Also:
Constant Field Values

LOC_ELEMENT_TYPE_ATTR

protected static final String LOC_ELEMENT_TYPE_ATTR
See Also:
Constant Field Values

LOC_ELEMENT_ACC_ATTR

protected static final String LOC_ELEMENT_ACC_ATTR
See Also:
Constant Field Values

LOC_ELEMENT_VER_ATTR

protected static final String LOC_ELEMENT_VER_ATTR
See Also:
Constant Field Values

LOC_ELEMENT_COMPL_ATTR

protected static final String LOC_ELEMENT_COMPL_ATTR
See Also:
Constant Field Values

BASEPOSITION_TAG

protected static final String BASEPOSITION_TAG
See Also:
Constant Field Values

BASEPOSITION_TYPE_ATTR

protected static final String BASEPOSITION_TYPE_ATTR
See Also:
Constant Field Values

CONTIG_TAG

protected static final String CONTIG_TAG
See Also:
Constant Field Values

SEQUENCE_TAG

protected static final String SEQUENCE_TAG
See Also:
Constant Field Values

SEQUENCE_TYPE_ATTR

protected static final String SEQUENCE_TYPE_ATTR
See Also:
Constant Field Values

SEQUENCE_LENGTH_ATTR

protected static final String SEQUENCE_LENGTH_ATTR
See Also:
Constant Field Values

SEQUENCE_TOPOLOGY_ATTR

protected static final String SEQUENCE_TOPOLOGY_ATTR
See Also:
Constant Field Values

SEQUENCE_VER_ATTR

protected static final String SEQUENCE_VER_ATTR
See Also:
Constant Field Values

xmlSchema

protected static final Pattern xmlSchema

Constructor Detail

EMBLxmlFormat

public EMBLxmlFormat()

Method Detail

canRead

public boolean canRead(File file)
                throws IOException
Check to see if a given file is in our format. Some formats may be able to determine this by filename, whilst others may have to open the file and read it to see what format it is in. A file is in EMBLxml format if the second XML line contains the phrase "http://www.ebi.ac.uk/schema/EMBL_schema.xsd".

Specified by:
canRead in interface RichSequenceFormat
Overrides:
canRead in class RichSequenceFormat.BasicFormat
Parameters:
file - the File to check.
Returns:
true if the file is readable by this format, false if not.
Throws:
IOException - in case the file is inaccessible.

guessSymbolTokenization

public SymbolTokenization guessSymbolTokenization(File file)
                                           throws IOException
On the assumption that the file is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it. For formats that only accept one tokenization, just return it without checking the file. For formats that accept multiple tokenizations, it's up to you how you do it. This implementation always returns a DNA tokenizer.

Specified by:
guessSymbolTokenization in interface RichSequenceFormat
Overrides:
guessSymbolTokenization in class RichSequenceFormat.BasicFormat
Parameters:
file - the File object to guess the format of.
Returns:
a SymbolTokenization to read the file with.
Throws:
IOException - if the file is unrecognisable or inaccessible.

canRead

public boolean canRead(BufferedInputStream stream)
                throws IOException
Check to see if a given stream is in our format. A stream is in EMBLxml format if the second XML line contains the phrase "http://www.ebi.ac.uk/schema/EMBL_schema.xsd".

Parameters:
stream - the BufferedInputStream to check.
Returns:
true if the stream is readable by this format, false if not.
Throws:
IOException - in case the stream is inaccessible.
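
The check described above can be sketched with the standard library alone. This is a minimal illustration, not the BioJava implementation: the class name EmblXmlSniffer, the method secondLineHasSchema and the 2000-byte read-ahead limit are assumptions made for the example. It marks the stream, inspects the second line for the schema phrase, and resets the stream so a later parser still sees the input from the start.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of BioJava: mimics the documented check.
class EmblXmlSniffer {
    // Phrase quoted in the canRead documentation.
    static final String SCHEMA_PHRASE = "http://www.ebi.ac.uk/schema/EMBL_schema.xsd";
    // Assumed upper bound on the bytes occupied by the first two lines.
    static final int READ_AHEAD_LIMIT = 2000;

    static boolean secondLineHasSchema(BufferedInputStream stream) throws IOException {
        stream.mark(READ_AHEAD_LIMIT);
        try {
            // Read at most READ_AHEAD_LIMIT bytes so the mark stays valid.
            byte[] head = new byte[READ_AHEAD_LIMIT];
            int total = 0;
            int n;
            while (total < head.length
                    && (n = stream.read(head, total, head.length - total)) > 0) {
                total += n;
            }
            // ISO-8859-1 maps bytes one-to-one; the phrase itself is plain ASCII.
            String text = new String(head, 0, total, StandardCharsets.ISO_8859_1);
            String[] lines = text.split("\r?\n", 3);
            return lines.length >= 2 && lines[1].contains(SCHEMA_PHRASE);
        } finally {
            // Rewind so the caller can still parse the stream from the beginning.
            stream.reset();
        }
    }
}
```

If the first two lines are longer than the assumed limit, this sketch reports false even for a valid file, so the limit would need tuning for real data.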

guessSymbolTokenization

public SymbolTokenization guessSymbolTokenization(BufferedInputStream stream)
                                           throws IOException
On the assumption that the stream is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it. For formats that only accept one tokenization, just return it without checking the stream. For formats that accept multiple tokenizations, it's up to you how you do it. This implementation always returns a DNA tokenizer.

Parameters:
stream - the BufferedInputStream object to guess the format of.
Returns:
a SymbolTokenization to read the stream with.
Throws:
IOException - if the stream is unrecognisable or inaccessible.

readSequence

public boolean readSequence(BufferedReader reader,
                            SymbolTokenization symParser,
                            SeqIOListener listener)
                     throws IllegalSymbolException,
                            IOException,
                            ParseException
Read a sequence and pass data on to a SeqIOListener.

Parameters:
reader - The stream of data to parse.
symParser - A SymbolTokenization defining a mapping from character data to Symbols.
listener - A listener to notify when data is extracted from the stream.
Returns:
a boolean indicating whether or not the stream contains any more sequences.
Throws:
IllegalSymbolException - if it is not possible to translate character data from the stream into valid BioJava symbols.
IOException - if an error occurs while reading from the stream.
ParseException

readRichSequence

public boolean readRichSequence(BufferedReader reader,
                                SymbolTokenization symParser,
                                RichSeqIOListener rlistener,
                                Namespace ns)
                         throws IllegalSymbolException,
                                IOException,
                                ParseException
Reads a sequence from the given buffered reader using the given tokenizer to parse sequence symbols. Events are passed to the listener, and the namespace used for sequences read is the one given. If the namespace is null, then the default namespace for the parser is used, which may depend on individual implementations of this interface.

Parameters:
reader - the input source
symParser - the tokenizer which understands the sequence being read
rlistener - the listener to send sequence events to
ns - the namespace to read sequences into.
Returns:
true if there is more to read after this, false otherwise.
Throws:
IllegalSymbolException - if the tokenizer couldn't understand one of the sequence symbols in the file.
IOException - if there was a read error.
ParseException
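
The boolean return value lends itself to a do/while loop that keeps reading until the method reports no more input. The sketch below shows only that calling convention, using a toy stand-in built on the standard library: OneRecordPerCall, readRecord and readAll are hypothetical names, and each "record" is a single line rather than an EMBLxml entry.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a format's read method; not a BioJava class.
class OneRecordPerCall {
    // Reads one line as a "record", adds it to the sink, and returns true
    // if any input remains after it (the same contract as documented above).
    static boolean readRecord(BufferedReader reader, List<String> sink) throws IOException {
        String line = reader.readLine();
        if (line != null) {
            sink.add(line);
        }
        // Peek one character ahead without consuming it.
        reader.mark(1);
        boolean more = reader.read() != -1;
        reader.reset();
        return more;
    }

    // Caller-side loop: keep reading while the previous call reported more input.
    static List<String> readAll(BufferedReader reader) throws IOException {
        List<String> records = new ArrayList<>();
        boolean more;
        do {
            more = readRecord(reader, records);
        } while (more);
        return records;
    }
}
```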

beginWriting

public void beginWriting()
                  throws IOException
Informs the writer that we want to start writing. This will do any initialisation required, such as writing the opening tags of an XML file that groups sequences together.

Throws:
IOException - if writing fails.

finishWriting

public void finishWriting()
                   throws IOException
Informs the writer that we are done writing. This will do any finalisation required, such as writing the closing tags of an XML file that groups sequences together.

Throws:
IOException - if writing fails.
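
The pairing of beginWriting() and finishWriting() around individual writes can be pictured as follows. This is a simplified sketch under stated assumptions, not the BioJava code: GroupedXmlWriter and writeEntry are hypothetical, and the tag names entryGroup and entry are placeholders rather than the actual values of ENTRY_GROUP_TAG and ENTRY_TAG.

```java
import java.io.PrintStream;

// Hypothetical writer illustrating the begin/write/finish lifecycle.
class GroupedXmlWriter {
    private final PrintStream out;
    private boolean open = false;

    GroupedXmlWriter(PrintStream out) {
        this.out = out;
    }

    // Writes the XML declaration and the opening tag of the grouping element.
    void beginWriting() {
        out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        out.println("<entryGroup>");
        open = true;
    }

    // Writes one entry inside the group; attribute escaping is omitted for brevity.
    void writeEntry(String accession) {
        if (!open) {
            throw new IllegalStateException("call beginWriting() first");
        }
        out.println("  <entry accession=\"" + accession + "\"/>");
    }

    // Writes the closing tag of the grouping element and flushes the stream.
    void finishWriting() {
        out.println("</entryGroup>");
        out.flush();
        open = false;
    }
}
```

Skipping either call in this sketch would leave the output without a matching opening or closing tag, which is why the two methods are meant to bracket every batch of writes.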

writeSequence

public void writeSequence(Sequence seq,
                          PrintStream os)
                   throws IOException
writeSequence writes a sequence to the specified PrintStream, using the default format.

Parameters:
seq - the sequence to write out.
os - the printstream to write to.
Throws:
IOException

writeSequence

public void writeSequence(Sequence seq,
                          String format,
                          PrintStream os)
                   throws IOException
writeSequence writes a sequence to the specified PrintStream, using the specified format.

Parameters:
seq - a Sequence to write out.
format - a String indicating which sub-format of those available from a particular SequenceFormat implementation to use when writing.
os - a PrintStream object.
Throws:
IOException - if an error occurs.

writeSequence

public void writeSequence(Sequence seq,
                          Namespace ns)
                   throws IOException
Writes a sequence out to the output stream given by beginWriting() using the default format of the implementing class. If a namespace is given, sequences will be written with that namespace; otherwise they will be written with the default namespace of the implementing class (which is usually the namespace of the sequence itself). If you pass this method a sequence which is not a RichSequence, it will attempt to convert it using RichSequence.Tools.enrich(). This does not guarantee a perfect conversion, so it is better to use RichSequences to start with. This implementation ignores the namespace, as EMBLxml has no concept of it.

Parameters:
seq - the sequence to write
ns - the namespace to write it with
Throws:
IOException - in case it couldn't write something

getDefaultFormat

public String getDefaultFormat()
getDefaultFormat returns the String identifier for the default sub-format written by a SequenceFormat implementation.

Returns:
a String.