Detailed Description

An HTML5-capable lexer.

Inherits from Dom.StringReader.

Public Member Functions

 HtmlLexer (string str, Node context)
 
void CheckAfterBodyStack ()
 Checks if all elements on the stack are ok to be open in the AfterBody mode. More...
 
void Reset ()
 Resets the insertion mode. http://www.w3.org/html/wg/drafts/html/master/syntax.html#the-insertion-mode More...
 
void Parse ()
 Parses the whole string. More...
 
void Push (Element el, bool stack)
 Pushes a new open element. More...
 
void Process (Node node, string close)
 
void Process (Node node, string close, int mode)
 
void SkipNewline ()
 Used by e.g. pre; skips a newline if there is one. More...
 
void CloseParagraph ()
 
void CloseParagraphThenAdd (Element el)
 Closes a paragraph in button scope then pushes the given element. More...
 
void CloseParagraphButtonScope ()
 
void InTableElse (Node node, string close)
 The 'anything else' route when in the 'in table' mode. More...
 
void CloseTableZoneInCell (string close)
 Closes the cell if the given close tag is in scope, then reprocesses it. More...
 
void AfterHeadHeadTag (Node node)
 Handles head-favouring tags when in the 'after head' mode. Base, link, meta etc are examples of favouring tags; they prefer to be in the head. More...
 
void CloseToTableBodyIfBody (Node node, string close)
 Closes to table body context if tbody, thead or tfoot are in scope. More...
 
void CloseCaption (Node node, string close)
 Closes a caption (if it's in scope) and reprocesses the node in table mode. More...
 
void CloseIfThOrTr (Node node, string close)
 Triggers CloseCell if th or td are in scope. More...
 
void TableBodyIfTrInScope (Node node, string close)
 Closes to a table context and switches to table body if a tr is in scope. More...
 
void CloseSelect (bool skipScopeCheck, Node node, string close)
 
void CloseCell ()
 Closes a table cell. More...
 
void BeforeHtmlElse (Node node, string close)
 
void AdoptionAgencyAlgorithm (string tag)
 This attempts to recover mis-nested tags. For example, text such as Hi! wrapped in formatting tags that are closed in the wrong order is relatively common. This is also known as the Heisenberg algorithm, but it's named 'adoption agency' in HTML5. More...
 
Element FormattingCurrentlyOpen (string tagName)
 Checks if the named tag is currently open on the formatting stack. More...
 
void AddFormatting (Element element)
 Adds a formatting element. More...
 
void ClearFormatting ()
 Clears formatting info to the last marker. More...
 
void AddScopeMarker ()
 Adds a formatting scope marker. More...
 
void ReconstructFormatting ()
 Reconstruct the list of active formatting elements, if any. More...
 
void CloseMarkedFormattingElement (string close)
 Closes a marked formatting element like object or applet. More...
 
void AddMarkedFormattingElement (Element el)
 Adds a marked formatting element like object or applet. More...
 
void AddFormattingElement (Element el)
 
bool IsInListItemScope (string tagName)
 True if the given tag is in list item scope. More...
 
bool IsInScope (string tagName)
 True if the given tag is in element scope. More...
 
bool IsInButtonScope (string tagName)
 True if the given tag is in button scope. More...
 
bool IsInTableScope (string tagName)
 True if the given tag is in table scope. More...
 
bool IsInSelectScope (string tagName)
 True if the given tag is in select scope. More...
 
void CloseInclusive (string tag)
 
void CloseNodesFrom (int index)
 Closes all nodes from the given open element stack index. Inclusive. More...
 
void CloseToTableRowContext ()
 Close to a table row context. tr, html and template. More...
 
void CloseToTableBodyContext ()
 Close to a table body context. thead, tfoot, tbody, html and template. More...
 
void CloseToTableContext ()
 Close to a table context. More...
 
void InputOrTextareaInSelect (Element el)
 Input or textarea in select mode. More...
 
void RawTextOrRcDataAlgorithm (Element el, HtmlParseMode stateAfter)
 'Generic raw text element parsing algorithm'. Adds the current node then switches to the given state, whilst also changing the mode to Text. More...
 
void AfterHeadElse (Node node, string close)
 Anything else in the 'after head' mode. More...
 
void InHeadElse (Node node, string close)
 Anything else in the 'in head' mode. More...
 
void CombineInto (Element el, Element target)
 Combines the attribs of the given element into target. Adds the attributes to target if they don't exist (doesn't overwrite). More...
 
void BlockClose (string close)
 Attempts to close a block element. More...
 
bool TagCurrentlyOpen (string tagName)
 Checks if the named tag is currently open. More...
 
void TemplateStep (Node node, string close, int mode)
 Inserting something in the template. More...
 
void CloseTemplate ()
 Closes the template element. More...
 
void Finish ()
 Generate implicit end tags. More...
 
void GenerateImpliedEndTags ()
 Generate implicit end tags. More...
 
void GenerateImpliedEndTagsThorough ()
 Generate implicit end tags thoroughly. More...
 
void GenerateImpliedEndTagsExceptFor (string tagName)
 Generate implicit end tags, except for the given tag name. More...
 
void CloseNode (Element el)
 
void CloseCurrentNode ()
 Pops the last node from the stack of open nodes. More...
 
bool CallCloseMethod (string tag, int mode)
 Calls Element.OnLexerCloseNode. Note that it's an instance method but it can be called without an instance when the DOM isn't balanced. For example, a balanced DOM will have a 'div' on the open element stack, and we want to handle its /div tag when it shows up. This would directly invoke close on that open element. If we're not balanced, it obtains SupportedTagMeta.CloseMethod and invokes it with a null instance. See SupportedTagMeta.CloseMethod for more. More...
 
Element CreateTag (string tag, bool callLoad)
 Creates an element from the given namespace/tag name. More...
 
- Public Member Functions inherited from Dom.StringReader
 StringReader (byte[] str)
 Creates a new reader for the raw single-byte encoded string. Useful if you're talking to e.g. a webserver with a binary protocol. More...
 
 StringReader (string str)
 Creates a new reader for the given string. More...
 
bool More ()
 Checks if there is anything left to read. More...
 
bool Peek (string str)
 Checks if the given string is next. More...
 
bool PeekLower (string str)
 Checks if the given string is next; it checks by lowercasing the target character. More...
 
char Peek ()
 Takes a peek at the next character in the stream without reading it. More...
 
char Peek (int delta)
 Takes a peek at the character that is a number of characters away from the next one without actually reading it. Peek(0) is the next character, Peek(1) is the one after that etc. More...
 
void StepBack ()
 Steps back one place in the stream. More...
 
void Advance ()
 Steps forward one place in the stream. More...
 
void Advance (int places)
 Steps forward the given number of places in the stream. More...
 
int Length ()
 The length of the string. More...
 
string ReadString (int length)
 Reads a substring of the given length. Note that this does not do bounds checking. More...
 
virtual char Read ()
 Reads a character from the stream and advances the stream one place. More...
 
void ReadUntil (char character)
 Keeps reading the given character from the stream until it's no longer next. Used for e.g. stripping an unknown length block of whitespaces in the stream. More...
 
void ReadOff (char[] chars)
 Keeps reading from the stream until no characters in the given set are next. Used for e.g. stripping an unknown number of newlines (\n or \r) from this stream. More...
 
void ReadOff (char[] chars, out int count)
 Keeps reading from the stream until no characters in the given set are next. Used for e.g. stripping an unknown number of newlines (\n or \r) from this stream. More...
 
int NextIndexOf (char character)
 Gets the next index of the given character. The length is returned if it wasn't found at all. More...
 
int NextIndexOf (char character, int limit)
 Gets the next index of the given character, up to limit. Limit is returned if it wasn't found at all. More...
 
virtual int GetLineNumber ()
 Gets the line number that the pointer is currently at. More...
 
int GetLineNumber (out int charOnLine)
 Gets the line number and character number that the pointer is currently at. More...
 
string ReadLine (int lineNumber)
 Reads the numbered line from this stream. More...
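
The reading members listed above all share one cursor into the input. A minimal sketch of that behaviour follows, assuming a hypothetical MiniReader class (not part of Dom) that mirrors the documented More, Peek, Read and ReadOff semantics, including returning NULL beyond the end of the stream. Whether the real Read keeps advancing past the end is not stated here; this sketch assumes it does not.

```csharp
using System;

// Hypothetical stand-in for Dom.StringReader; only the names mirror the documented members.
public class MiniReader
{
    public const char NULL = '\0';
    public readonly string Input;
    public int Position;

    public MiniReader(string input) { Input = input; }

    // True while there is anything left to read.
    public bool More() { return Position < Input.Length; }

    // Peek(0) is the next character, Peek(1) the one after that, and so on.
    // Returns NULL when the requested index is outside the input.
    public char Peek(int delta = 0)
    {
        int index = Position + delta;
        return (index >= 0 && index < Input.Length) ? Input[index] : NULL;
    }

    // Reads the next character and steps forward one place.
    // Assumption: the cursor stays put once the end has been reached.
    public char Read()
    {
        char c = Peek();
        if (More()) Position++;
        return c;
    }

    // Keeps reading until no character in the given set is next.
    public void ReadOff(char[] chars)
    {
        while (More() && Array.IndexOf(chars, Peek()) >= 0) Position++;
    }
}
```

For example, calling ReadOff with a space and a newline on the input "  \n<p>" leaves the cursor on the '<' character.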
 

Static Public Member Functions

static bool IsAsciiLetter (char c)
 Determines if the given character is an uppercase or lowercase ASCII letter. More...
 
static bool IsSpaceCharacter (char c)
 True if the given char is any of the HTML5 space characters (includes newlines etc). More...
 
- Static Public Member Functions inherited from Dom.StringReader
static int NextIndexOf (int position, string input, char character, int limit)
 Gets the next index of the given character, up to limit. Limit is returned if it wasn't found at all. More...
 
static int NextIndexOf (int position, string input, char character)
 Gets the next index of the given character. The length is returned if it wasn't found at all. More...
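
The "not found" convention of the two static overloads is easy to miss: the search never returns -1. A minimal sketch, assuming a hypothetical IndexSketch class (not part of Dom) and assuming the search starts at position and stops before limit:

```csharp
using System;

// Hypothetical sketch of the documented return convention; not the library's code.
public static class IndexSketch
{
    // Returns the index of the next occurrence of character at or after position,
    // searching no further than limit. Returns limit when it is not found.
    public static int NextIndexOf(int position, string input, char character, int limit)
    {
        int end = Math.Min(limit, input.Length);
        for (int i = Math.Max(position, 0); i < end; i++)
        {
            if (input[i] == character) return i;
        }
        return limit;
    }

    // Same search with the input length as the limit, so the length means "not found".
    public static int NextIndexOf(int position, string input, char character)
    {
        return NextIndexOf(position, input, character, input.Length);
    }
}
```

Callers can therefore slice up to the returned index without a separate "found" check.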
 

Public Attributes

HtmlParseMode State
 Gets or sets the current parse mode. More...
 
MLNamespace Namespace
 Current namespace. Defaults to XHTML (for all our HTML tags). More...
 
Document Document
 Document we're adding to. More...
 
readonly List< Element > OpenElements
 
readonly Stack< int > TemplateModes
 
readonly List< Element > FormattingElements
 
int PreviousMode = HtmlTreeMode.Initial
 The previous tree mode. More...
 
int CurrentMode = HtmlTreeMode.Initial
 The current tree mode. More...
 
int TextBlockLength
 The length of the current text buffer. More...
 
System.Text.StringBuilder Builder =new System.Text.StringBuilder()
 A string builder used for constructing tokens. More...
 
Element head
 The head pointer. More...
 
Element form
 The form pointer. More...
 
TextNode PendingTableCharacters
 The pending table chars 'list' (we only ever add one to it). More...
 
string LastStartTag
 The last created start tag name (lowercase). More...
 
bool FramesetOk =true
 Frameset-ok flag More...
 
bool _foster =false
 Table foster parenting. Occurs when tables are mis-nested and affects how elements are added. More...
 
- Public Attributes inherited from Dom.StringReader
string Input
 The original string. More...
 
int Position
 The current position this reader is at in the string. More...
 
int InputLength
 The length of the input string. More...
 

Properties

static MLNamespace XHTMLNamespace [get]
 The XML namespace for XHTML. More...
 
string CurrentTag [get]
 The current tag on the top of the stack. More...
 
Node CurrentNode [get]
 The current open node. More...
 
Element CurrentElement [get]
 The current open element. More...
 

Private Member Functions

int GetAppropriateEnd (out bool closing)
 Keeps reading until </lastStartTag> is seen. More...
 
string ReadRawTag (bool open, bool withName)
 Reads the contents of an open/close tag. More...
 
void EndTag ()
 
void OpenPCTag ()
 
Comment FlushCommentNode (int positionDelta)
 
void LoadComment ()
 Reads a comment's body. More...
 
bool CommentDashEnd ()
 See 8.2.4.49 Comment end dash state More...
 
bool CommentEnd ()
 Checks if the comment has ended. More...
 
void OpenRCTag ()
 
bool CreateIfAppropriate (char c)
 Creates a close tag if one is appropriate. More...
 
void BogusComment ()
 See 8.2.4.44 Bogus comment state More...
 
void HandleText (bool stopAtTag, bool allowVars)
 Creates a text content block. More...
 
void AddElementWithFoster (Element element)
 
void InBodyEndTagElse (string close)
 Any other end tag has been found in the InBody state. More...
 
void FlushComment ()
 Writes out any pending text as a comment node. More...
 
TextNode FlushTextNode ()
 Writes out any pending text to a text element. More...
 
TextNode AppendText (TextNode node, string text)
 Appends text to the given node or creates a new node if it's null. More...
 
void AddVariable ()
 Reads out a variable (as used by PowerUI for localization purposes). More...
 

Private Attributes

TextNode text_
 The latest added text node. Gets cleared whenever Process is called. More...
 

Static Private Attributes

static MLNamespace _XHTMLNamespace
 Cached reference for the XHTML namespace. More...
 

Additional Inherited Members

- Static Public Attributes inherited from Dom.StringReader
static char NULL ='\0'
 The null character. This is returned when operations are working beyond the end of the stream. More...
 

Constructor & Destructor Documentation

Dom.HtmlLexer.HtmlLexer ( string  str,
Node  context 
)
inline

Member Function Documentation

void Dom.HtmlLexer.AddElementWithFoster ( Element  element)
inline private
void Dom.HtmlLexer.AddFormatting ( Element  element)
inline

Adds a formatting element.

void Dom.HtmlLexer.AddFormattingElement ( Element  el)
inline
void Dom.HtmlLexer.AddMarkedFormattingElement ( Element  el)
inline

Adds a marked formatting element like object or applet.

void Dom.HtmlLexer.AddScopeMarker ( )
inline

Adds a formatting scope marker.

void Dom.HtmlLexer.AddVariable ( )
inline private

Reads out a variable (as used by PowerUI for localization purposes).

void Dom.HtmlLexer.AdoptionAgencyAlgorithm ( string  tag)
inline

This attempts to recover mis-nested tags. For example, text such as Hi! wrapped in formatting tags that are closed in the wrong order is relatively common. This is also known as the Heisenberg algorithm, but it's named 'adoption agency' in HTML5.

Parameters
tag  The actual tag given.
void Dom.HtmlLexer.AfterHeadElse ( Node  node,
string  close 
)
inline

Anything else in the 'after head' mode.

void Dom.HtmlLexer.AfterHeadHeadTag ( Node  node)
inline

Handles head-favouring tags when in the 'after head' mode. Base, link, meta etc are examples of favouring tags; they prefer to be in the head.

TextNode Dom.HtmlLexer.AppendText ( TextNode  node,
string  text 
)
inline private

Appends text to the given node or creates a new node if it's null.

void Dom.HtmlLexer.BeforeHtmlElse ( Node  node,
string  close 
)
inline
void Dom.HtmlLexer.BlockClose ( string  close)
inline

Attempts to close a block element.

void Dom.HtmlLexer.BogusComment ( )
inline private

See 8.2.4.44 Bogus comment state

Parameters
c  The current character.
bool Dom.HtmlLexer.CallCloseMethod ( string  tag,
int  mode 
)
inline

Calls Element.OnLexerCloseNode. Note that it's an instance method but it can be called without an instance when the DOM isn't balanced. For example, a balanced DOM will have a 'div' on the open element stack, and we want to handle its /div tag when it shows up. This would directly invoke close on that open element. If we're not balanced, it obtains SupportedTagMeta.CloseMethod and invokes it with a null instance. See SupportedTagMeta.CloseMethod for more.

void Dom.HtmlLexer.CheckAfterBodyStack ( )
inline

Checks if all elements on the stack are ok to be open in the AfterBody mode.

void Dom.HtmlLexer.ClearFormatting ( )
inline

Clears formatting info to the last marker.
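
"Clearing to the last marker" can be sketched as follows. This is an illustration only: the formatting list is modelled as a list of tag names, the Marker sentinel and the FormattingSketch class are hypothetical, and it assumes the marker itself is removed along with the entries after it, as the HTML5 parsing algorithm describes.

```csharp
using System.Collections.Generic;

// Hypothetical sketch; the real list holds Element instances.
public static class FormattingSketch
{
    // Assumed sentinel standing in for a formatting scope marker.
    public const string Marker = "#marker";

    // Removes entries from the end of the list up to and including the last marker.
    // If there is no marker, the whole list is cleared.
    public static void ClearToLastMarker(List<string> formatting)
    {
        while (formatting.Count > 0)
        {
            int last = formatting.Count - 1;
            string entry = formatting[last];
            formatting.RemoveAt(last);
            if (entry == Marker) break;
        }
    }
}
```

With the list b, marker, i, em, only b remains afterwards.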

void Dom.HtmlLexer.CloseCaption ( Node  node,
string  close 
)
inline

Closes a caption (if it's in scope) and reprocesses the node in table mode.

void Dom.HtmlLexer.CloseCell ( )
inline

Closes a table cell.

void Dom.HtmlLexer.CloseCurrentNode ( )
inline

Pops the last node from the stack of open nodes.

void Dom.HtmlLexer.CloseIfThOrTr ( Node  node,
string  close 
)
inline

Triggers CloseCell if th or td are in scope.

void Dom.HtmlLexer.CloseInclusive ( string  tag)
inline
void Dom.HtmlLexer.CloseMarkedFormattingElement ( string  close)
inline

Closes a marked formatting element like object or applet.

void Dom.HtmlLexer.CloseNode ( Element  el)
inline
void Dom.HtmlLexer.CloseNodesFrom ( int  index)
inline

Closes all nodes from the given open element stack index. Inclusive.
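
A minimal sketch of the inclusive behaviour, assuming the open element stack is a list whose last entry is the most recently opened element. The StackSketch class is hypothetical, and any per-element close handling the real method performs is left out.

```csharp
using System.Collections.Generic;

// Hypothetical sketch; tag names stand in for Element instances.
public static class StackSketch
{
    // Pops every entry at or above the given index, newest first.
    public static void CloseNodesFrom(List<string> open, int index)
    {
        if (index < 0) index = 0;
        for (int i = open.Count - 1; i >= index; i--)
        {
            open.RemoveAt(i);
        }
    }
}
```

Closing from index 2 on the stack html, body, div, p leaves html, body.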

void Dom.HtmlLexer.CloseParagraph ( )
inline
void Dom.HtmlLexer.CloseParagraphButtonScope ( )
inline
void Dom.HtmlLexer.CloseParagraphThenAdd ( Element  el)
inline

Closes a paragraph in button scope then pushes the given element.

void Dom.HtmlLexer.CloseSelect ( bool  skipScopeCheck,
Node  node,
string  close 
)
inline
void Dom.HtmlLexer.CloseTableZoneInCell ( string  close)
inline

Closes the cell if the given close tag is in scope, then reprocesses it.

void Dom.HtmlLexer.CloseTemplate ( )
inline

Closes the template element.

void Dom.HtmlLexer.CloseToTableBodyContext ( )
inline

Close to a table body context. thead, tfoot, tbody, html and template.
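
"Closing to a context" means popping open elements until the current node is one of the listed tags. A minimal sketch under that reading, with a hypothetical ContextSketch class and tag names standing in for elements:

```csharp
using System.Collections.Generic;

// Hypothetical sketch of the stack-clearing step; not the library's code.
public static class ContextSketch
{
    // The table body context tags named in the description above.
    public static readonly HashSet<string> TableBody = new HashSet<string> {
        "thead", "tfoot", "tbody", "html", "template"
    };

    // Pops entries until the top of the stack is one of the context tags.
    public static void CloseTo(List<string> open, HashSet<string> context)
    {
        while (open.Count > 0 && !context.Contains(open[open.Count - 1]))
        {
            open.RemoveAt(open.Count - 1);
        }
    }
}
```

Applied to html, table, tbody, tr, td, this leaves html, table, tbody.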

void Dom.HtmlLexer.CloseToTableBodyIfBody ( Node  node,
string  close 
)
inline

Closes to table body context if tbody, thead or tfoot are in scope.

void Dom.HtmlLexer.CloseToTableContext ( )
inline

Close to a table context.

void Dom.HtmlLexer.CloseToTableRowContext ( )
inline

Close to a table row context. tr, html and template.

void Dom.HtmlLexer.CombineInto ( Element  el,
Element  target 
)
inline

Combines the attribs of the given element into target. Adds the attributes to target if they don't exist (doesn't overwrite).
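
The "doesn't overwrite" rule can be sketched with plain dictionaries. This assumes attributes behave like a name-to-value map; the AttributeSketch class is hypothetical and the real method works on Element instances.

```csharp
using System.Collections.Generic;

// Hypothetical sketch of a non-overwriting attribute merge.
public static class AttributeSketch
{
    // Copies each attribute from source into target unless target already has that name.
    public static void CombineInto(IDictionary<string, string> source, IDictionary<string, string> target)
    {
        foreach (KeyValuePair<string, string> pair in source)
        {
            if (!target.ContainsKey(pair.Key))
            {
                target[pair.Key] = pair.Value;
            }
        }
    }
}
```

If both elements carry a class attribute, the target keeps its own value and only the missing attributes are added.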

bool Dom.HtmlLexer.CommentDashEnd ( )
inline private

See 8.2.4.49 Comment end dash state

bool Dom.HtmlLexer.CommentEnd ( )
inline private

Checks if the comment has ended.

bool Dom.HtmlLexer.CreateIfAppropriate ( char  c)
inline private

Creates a close tag if one is appropriate.

Element Dom.HtmlLexer.CreateTag ( string  tag,
bool  callLoad 
)
inline

Creates an element from the given namespace/tag name.

void Dom.HtmlLexer.EndTag ( )
inline private
void Dom.HtmlLexer.Finish ( )
inline

Generate implicit end tags.

void Dom.HtmlLexer.FlushComment ( )
inline private

Writes out any pending text as a comment node.

Comment Dom.HtmlLexer.FlushCommentNode ( int  positionDelta)
inline private
TextNode Dom.HtmlLexer.FlushTextNode ( )
inline private

Writes out any pending text to a text element.

Element Dom.HtmlLexer.FormattingCurrentlyOpen ( string  tagName)
inline

Checks if the named tag is currently open on the formatting stack.

void Dom.HtmlLexer.GenerateImpliedEndTags ( )
inline

Generate implicit end tags.

void Dom.HtmlLexer.GenerateImpliedEndTagsExceptFor ( string  tagName)
inline

Generate implicit end tags, except for the given tag name.

void Dom.HtmlLexer.GenerateImpliedEndTagsThorough ( )
inline

Generate implicit end tags thoroughly.
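
The three variants differ only in which tags they are willing to pop. A minimal sketch, assuming the tag sets given by the HTML5 parsing algorithm; the ImpliedEndTagsSketch class is hypothetical, tag names stand in for elements, and the real methods may differ in detail.

```csharp
using System.Collections.Generic;

// Hypothetical sketch of implied end tag generation over a stack of tag names.
public static class ImpliedEndTagsSketch
{
    // Tags that may be closed implicitly.
    static readonly HashSet<string> Implied = new HashSet<string> {
        "dd", "dt", "li", "optgroup", "option", "p", "rb", "rp", "rt", "rtc"
    };

    // Extra tags closed by the "thorough" variant.
    static readonly HashSet<string> ThoroughExtra = new HashSet<string> {
        "caption", "colgroup", "tbody", "td", "tfoot", "th", "thead", "tr"
    };

    // Pops the current node while it is implicitly closable.
    // Stops early at the excluded tag name, if one is given.
    public static void Generate(List<string> open, string except = null, bool thorough = false)
    {
        while (open.Count > 0)
        {
            string current = open[open.Count - 1];
            if (current == except) break;
            bool closable = Implied.Contains(current) || (thorough && ThoroughExtra.Contains(current));
            if (!closable) break;
            open.RemoveAt(open.Count - 1);
        }
    }
}
```

On the stack html, body, ul, li, p, the plain variant pops p and li, while excluding li pops only p.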

int Dom.HtmlLexer.GetAppropriateEnd ( out bool  closing)
inline private

Keeps reading until </lastStartTag> is seen.

void Dom.HtmlLexer.HandleText ( bool  stopAtTag,
bool  allowVars 
)
inline private

Creates a text content block.

void Dom.HtmlLexer.InBodyEndTagElse ( string  close)
inline private

Any other end tag has been found in the InBody state.

Parameters
tag  The actual tag found.
void Dom.HtmlLexer.InHeadElse ( Node  node,
string  close 
)
inline

Anything else in the 'in head' mode.

void Dom.HtmlLexer.InputOrTextareaInSelect ( Element  el)
inline

Input or textarea in select mode.

void Dom.HtmlLexer.InTableElse ( Node  node,
string  close 
)
inline

The 'anything else' route when in the 'in table' mode.

static bool Dom.HtmlLexer.IsAsciiLetter ( char  c)
inline static

Determines if the given character is an uppercase or lowercase ASCII letter.

Parameters
c  The character to examine.
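
A minimal sketch of such a check, assuming only the ASCII ranges A-Z and a-z count as letters. The CharSketch class is hypothetical.

```csharp
// Hypothetical sketch; the real method lives on Dom.HtmlLexer.
public static class CharSketch
{
    // True for a-z and A-Z only; accented and non-Latin letters are rejected.
    public static bool IsAsciiLetter(char c)
    {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }
}
```

This differs from char.IsLetter, which also accepts non-ASCII letters such as 'é'.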
bool Dom.HtmlLexer.IsInButtonScope ( string  tagName)
inline

True if the given tag is in button scope.

bool Dom.HtmlLexer.IsInListItemScope ( string  tagName)
inline

True if the given tag is in list item scope.

bool Dom.HtmlLexer.IsInScope ( string  tagName)
inline

True if the given tag is in element scope.

bool Dom.HtmlLexer.IsInSelectScope ( string  tagName)
inline

True if the given tag is in select scope.

bool Dom.HtmlLexer.IsInTableScope ( string  tagName)
inline

True if the given tag is in table scope.
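
The scope checks above share one walk over the open element stack: search from the newest element downwards and give up when a boundary tag for that scope is reached first. A minimal sketch follows, assuming the boundary sets from the HTML5 parsing algorithm (HTML namespace only). The ScopeSketch class is hypothetical, tag names stand in for elements, and select scope is left out because it inverts the rule.

```csharp
using System.Collections.Generic;

// Hypothetical sketch; the stack lists lowercase tag names, newest last.
public static class ScopeSketch
{
    // Boundary tags for plain element scope.
    static readonly HashSet<string> Default = new HashSet<string> {
        "applet", "caption", "html", "table", "td", "th",
        "marquee", "object", "template"
    };
    // Button and list item scope extend the default set.
    static readonly HashSet<string> Button = new HashSet<string>(Default) { "button" };
    static readonly HashSet<string> ListItem = new HashSet<string>(Default) { "ol", "ul" };
    // Table scope uses its own, smaller set.
    static readonly HashSet<string> Table = new HashSet<string> { "html", "table", "template" };

    // Walks from the newest open element down; the target must be found
    // before any boundary tag for the scope in question.
    static bool InScope(IList<string> open, string tagName, HashSet<string> boundaries)
    {
        for (int i = open.Count - 1; i >= 0; i--)
        {
            if (open[i] == tagName) return true;
            if (boundaries.Contains(open[i])) return false;
        }
        return false;
    }

    public static bool IsInScope(IList<string> open, string tagName) { return InScope(open, tagName, Default); }
    public static bool IsInButtonScope(IList<string> open, string tagName) { return InScope(open, tagName, Button); }
    public static bool IsInListItemScope(IList<string> open, string tagName) { return InScope(open, tagName, ListItem); }
    public static bool IsInTableScope(IList<string> open, string tagName) { return InScope(open, tagName, Table); }
}
```

With the stack html, body, table, tr, td, p, the tag p is in element scope but body is not, because td is a boundary that is reached first.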

static bool Dom.HtmlLexer.IsSpaceCharacter ( char  c)
inline static

True if the given char is any of the HTML5 space characters (includes newlines etc).
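
For reference, HTML5 defines five space characters: space, tab, line feed, form feed and carriage return. A minimal sketch of the check, using a hypothetical SpaceSketch class:

```csharp
// Hypothetical sketch; the real method lives on Dom.HtmlLexer.
public static class SpaceSketch
{
    // True for U+0020, U+0009, U+000A, U+000C and U+000D.
    public static bool IsSpaceCharacter(char c)
    {
        return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
    }
}
```

This set is narrower than char.IsWhiteSpace, which also accepts characters such as the no-break space.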

void Dom.HtmlLexer.LoadComment ( )
inline private

Reads a comment's body.

void Dom.HtmlLexer.OpenPCTag ( )
inline private
void Dom.HtmlLexer.OpenRCTag ( )
inline private
void Dom.HtmlLexer.Parse ( )
inline

Parses the whole string.

void Dom.HtmlLexer.Process ( Node  node,
string  close 
)
inline
void Dom.HtmlLexer.Process ( Node  node,
string  close,
int  mode 
)
inline
void Dom.HtmlLexer.Push ( Element  el,
bool  stack 
)
inline

Pushes a new open element.

void Dom.HtmlLexer.RawTextOrRcDataAlgorithm ( Element  el,
HtmlParseMode  stateAfter 
)
inline

'Generic raw text element parsing algorithm'. Adds the current node then switches to the given state, whilst also changing the mode to Text.

string Dom.HtmlLexer.ReadRawTag ( bool  open,
bool  withName 
)
inline private

Reads the contents of an open/close tag.

void Dom.HtmlLexer.ReconstructFormatting ( )
inline

Reconstruct the list of active formatting elements, if any.

void Dom.HtmlLexer.Reset ( )
inline
void Dom.HtmlLexer.SkipNewline ( )
inline

Used by e.g. pre; skips a newline if there is one.
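
A minimal sketch of the skip, written as a pure function over a string and a position rather than the lexer's own cursor. The NewlineSketch class is hypothetical, and treating "\r\n" as a single newline is an assumption; the real method may only look for "\n".

```csharp
// Hypothetical sketch; returns the position after an optional leading newline.
public static class NewlineSketch
{
    public static int SkipNewline(string input, int position)
    {
        if (position < 0 || position >= input.Length) return position;

        // Assumption: a carriage return followed by a line feed counts as one newline.
        if (input[position] == '\r' && position + 1 < input.Length && input[position + 1] == '\n')
        {
            return position + 2;
        }
        if (input[position] == '\n') return position + 1;

        // No newline here, so nothing is skipped.
        return position;
    }
}
```

This matches the HTML rule that a newline immediately after a pre start tag is dropped rather than rendered.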

void Dom.HtmlLexer.TableBodyIfTrInScope ( Node  node,
string  close 
)
inline

Closes to a table context and switches to table body if a tr is in scope.

bool Dom.HtmlLexer.TagCurrentlyOpen ( string  tagName)
inline

Checks if the named tag is currently open.

void Dom.HtmlLexer.TemplateStep ( Node  node,
string  close,
int  mode 
)
inline

Inserting something in the template.

Parameters
token  The token to insert.
mode  The mode to push.

Member Data Documentation

bool Dom.HtmlLexer._foster =false

Table foster parenting. Occurs when tables are mis-nested and affects how elements are added.

MLNamespace Dom.HtmlLexer._XHTMLNamespace
static private

Cached reference for the XHTML namespace.

System.Text.StringBuilder Dom.HtmlLexer.Builder =new System.Text.StringBuilder()

A string builder used for constructing tokens.

int Dom.HtmlLexer.CurrentMode = HtmlTreeMode.Initial

The current tree mode.

Document Dom.HtmlLexer.Document

Document we're adding to.

Element Dom.HtmlLexer.form

The form pointer.

readonly List<Element> Dom.HtmlLexer.FormattingElements
bool Dom.HtmlLexer.FramesetOk =true

Frameset-ok flag

Element Dom.HtmlLexer.head

The head pointer.

string Dom.HtmlLexer.LastStartTag

The last created start tag name (lowercase).

MLNamespace Dom.HtmlLexer.Namespace

Current namespace. Defaults to XHTML (for all our HTML tags).

readonly List<Element> Dom.HtmlLexer.OpenElements
TextNode Dom.HtmlLexer.PendingTableCharacters

The pending table chars 'list' (we only ever add one to it).

int Dom.HtmlLexer.PreviousMode = HtmlTreeMode.Initial

The previous tree mode.

HtmlParseMode Dom.HtmlLexer.State

Gets or sets the current parse mode.

readonly Stack<int> Dom.HtmlLexer.TemplateModes
TextNode Dom.HtmlLexer.text_
private

The latest added text node. Gets cleared whenever Process is called.

int Dom.HtmlLexer.TextBlockLength

The length of the current text buffer.

Property Documentation

Element Dom.HtmlLexer.CurrentElement
get

The current open element.

Node Dom.HtmlLexer.CurrentNode
get

The current open node.

string Dom.HtmlLexer.CurrentTag
get

The current tag on the top of the stack.

MLNamespace Dom.HtmlLexer.XHTMLNamespace
static get

The XML namespace for XHTML.