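This is a classic library that helps you generate valid XHTML from HTML. It also supports tag and attribute filtering: you specify which tags and attributes are allowed in the output, and everything else is filtered out. You can use the library to clean up the bloated HTML that Microsoft Word produces when it converts a document to HTML. You can also clean HTML before posting it to a blog site, since blog engines such as WordPress and b2evolution may otherwise reject it.
The library contains two classes: HtmlReader and HtmlWriter.
HtmlReader extends the well-known SgmlReader developed by Chris Lovett. When it reads HTML, it skips every node that has a prefix. This filters out the hundreds of useless tags such as <o:p>, <o:Document> and <st1:personname>, so the HTML you read contains only core HTML tags.
HtmlWriter extends XmlTextWriter, a standard XmlWriter implementation that produces XML. XHTML is essentially HTML written as XML. Familiar tags that are left unclosed in HTML, such as <img>, <br> and <hr>, must be written as empty elements in XHTML: <img .. />, <br/> and <hr/>. Because XHTML is well-formed XML, you can read an XHTML document with any XML parser, which also makes XPath queries possible.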
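To illustrate the last point, here is a minimal sketch of querying a cleaned XHTML fragment with XPath, using only System.Xml. The class name XhtmlXPathSketch, the method ImageSources and the sample markup are made up for illustration and are not part of the library. The fragment is assumed to carry no XML namespace, so the unqualified name img matches.

```csharp
using System;
using System.Xml;

public static class XhtmlXPathSketch
{
    // Returns the src attribute of every <img> element in a
    // well-formed, namespace-free XHTML fragment.
    public static string[] ImageSources(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed

        XmlNodeList nodes = doc.SelectNodes("//img/@src");
        var result = new string[nodes.Count];
        for (int i = 0; i < nodes.Count; i++)
            result[i] = nodes[i].Value;
        return result;
    }

    public static void Main()
    {
        // Empty elements (<br/>, <hr/>, <img ... />) are what make this parseable as XML.
        string xhtml = "<div><p>Hello<br/></p><img src=\"a.png\" /><hr/><img src=\"b.png\" /></div>";
        foreach (string src in ImageSources(xhtml))
            Console.WriteLine(src);
    }
}
```

The same call on raw HTML with an unclosed <br> or <img> would fail in LoadXml, which is why the conversion to XHTML comes first.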
HtmlReader is very simple. Here is the complete class:
using System.IO;
using System.Xml;

/// <summary>
/// This class skips all nodes which have some
/// kind of prefix. This trick does the job
/// to clean up MS Word/Outlook HTML markups.
/// </summary>
public class HtmlReader : Sgml.SgmlReader
{
    public HtmlReader( TextReader reader ) : base( )
    {
        base.InputStream = reader;
        base.DocType = "HTML";
    }

    public HtmlReader( string content ) : base( )
    {
        base.InputStream = new StringReader( content );
        base.DocType = "HTML";
    }

    public override bool Read()
    {
        bool status = base.Read();
        if( status )
        {
            if( base.NodeType == XmlNodeType.Element )
            {
                // Got a node with prefix. This must be one
                // of those "<o:p>" or something else.
                // Skip this node entirely. We want prefix
                // less nodes so that the resultant XML
                // requires no namespace.
                if( base.Name.IndexOf(':') > 0 )
                    base.Skip();
            }
        }
        return status;
    }
}
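The prefix test can be tried without SgmlReader. The sketch below applies the same check (a colon in the element name, followed by Skip) with a plain XmlReader. This is only an approximation of HtmlReader: the names PrefixSkipSketch and UnprefixedElements are invented for the example, and unlike SgmlReader, XmlReader needs well-formed input with the prefix declared through xmlns:o.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class PrefixSkipSketch
{
    // Collects the names of all elements that have no prefix,
    // skipping every prefixed element together with its children.
    public static List<string> UnprefixedElements(string xml)
    {
        var names = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            bool more = reader.Read();
            while (more)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name.IndexOf(':') > 0)
                {
                    // Skip() leaves the reader on the node that follows the
                    // skipped subtree, so examine that node before reading on.
                    reader.Skip();
                    more = !reader.EOF;
                    continue;
                }
                if (reader.NodeType == XmlNodeType.Element)
                    names.Add(reader.Name);
                more = reader.Read();
            }
        }
        return names;
    }

    public static void Main()
    {
        string xml = "<div xmlns:o=\"urn:schemas-microsoft-com:office:office\">"
                   + "<o:p>junk</o:p><p>Hello</p><o:x/><b>hi</b></div>";
        Console.WriteLine(string.Join(", ", UnprefixedElements(xml)));
    }
}
```

The loop re-examines the node that Skip lands on instead of reading past it, so a prefixed element that immediately follows another one is also dropped.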
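HtmlWriter is a bit trickier. These are the tricks it uses:
Override WriteString to bypass the regular XML encoding; the text is encoded by hand in a way that suits HTML.
Override WriteStartElement to keep disallowed tags out of the output.
Override WriteAttributes to drop unwanted attributes.
Let's look at the class part by part.
You can configure HtmlWriter by modifying the following block: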
public class HtmlWriter : XmlTextWriter
{
    /// <summary>
    /// If set to true, it will filter the output
    /// by using tag and attribute filtering,
    /// space reduce etc
    /// </summary>
    public bool FilterOutput = false;

    /// <summary>
    /// If true, it will reduce consecutive &nbsp; with one instance
    /// </summary>
    public bool ReduceConsecutiveSpace = true;

    /// <summary>
    /// Set the tag names in lower case which are allowed to go to output
    /// </summary>
    public string [] AllowedTags = new string[]
    {
        "p", "b", "i", "u", "em", "big", "small",
        "div", "img", "span", "blockquote", "code", "pre", "br", "hr",
        "ul", "ol", "li", "del", "ins", "strong", "a", "font", "dd", "dt"
    };

    /// <summary>
    /// If any tag found which is not allowed, it is replaced by this tag.
    /// Specify a tag which has least impact on output
    /// </summary>
    public string ReplacementTag = "dd";

    /// <summary>
    /// New lines \r\n are replaced with space
    /// which saves space and makes the
    /// output compact
    /// </summary>
    public bool RemoveNewlines = true;

    /// <summary>
    /// Specify which attributes are allowed.
    /// Any other attribute will be discarded
    /// </summary>
    public string [] AllowedAttributes = new string[]
    {
        "class", "href", "target", "border", "src",
        "align", "width", "height", "color", "size"
    };

    // ... rest of the class omitted from this excerpt
}
/// <summary>
/// The reason why we are overriding
/// this method is, we do not want the output to be
/// encoded for texts inside attribute
/// and inside node elements. For example, all the &nbsp;
/// gets converted to &amp;nbsp; in output. But this does not
/// apply to HTML. In HTML, we need to have &nbsp; as it is.
/// </summary>
/// <param name="text"></param>
public override void WriteString(string text)
{
    // Write every non-breaking space character as the &nbsp; entity
    text = text.Replace( "\u00A0", "&nbsp;" );

    // When you are reading RSS feed and writing Html,
    // these lines help remove those CDATA tags
    text = text.Replace( "<![CDATA[", "" );
    text = text.Replace( "]]>", "" );

    // Do some encoding of our own because
    // we are going to use WriteRaw which won't
    // do any of the necessary encoding
    text = text.Replace( "<", "&lt;" );
    text = text.Replace( ">", "&gt;" );
    text = text.Replace( "'", "&apos;" );
    text = text.Replace( "\"", "&quot;" );

    if( this.FilterOutput )
    {
        text = text.Trim();

        // We want to replace consecutive spaces
        // to one space in order to save horizontal width
        if( this.ReduceConsecutiveSpace )
            text = text.Replace( "&nbsp;&nbsp;", "&nbsp;" );
        if( this.RemoveNewlines )
            text = text.Replace( Environment.NewLine, " " );

        base.WriteRaw( text );
    }
    else
    {
        base.WriteRaw( text );
    }
}
public override void WriteStartElement(string prefix,
    string localName, string ns)
{
    if( this.FilterOutput )
    {
        bool canWrite = false;
        string tagLocalName = localName.ToLower();
        foreach( string name in this.AllowedTags )
        {
            if( name == tagLocalName )
            {
                canWrite = true;
                break;
            }
        }

        // A tag that is not allowed is replaced, not dropped,
        // so the matching end tag still balances.
        if( !canWrite )
            localName = this.ReplacementTag;
    }
    base.WriteStartElement(prefix, localName, ns);
}
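The effect of this override is easier to see on a small input. The following self-contained sketch reduces the idea to the tag filter alone. TagFilterWriter and TagFilterDemo are hypothetical stand-ins and not the article's HtmlWriter; the shortened whitelist and the use of a plain XmlReader as the source are assumptions made to keep the example runnable without SgmlReader.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical stand-in for HtmlWriter, reduced to the tag filter only.
public class TagFilterWriter : XmlTextWriter
{
    private static readonly string[] AllowedTags = { "p", "b", "i", "br" };
    private const string ReplacementTag = "dd";

    public TagFilterWriter(TextWriter output) : base(output) { }

    public override void WriteStartElement(string prefix, string localName, string ns)
    {
        string tag = localName.ToLowerInvariant();
        // Anything outside the whitelist is renamed, so its content survives.
        if (Array.IndexOf(AllowedTags, tag) < 0)
            tag = ReplacementTag;
        base.WriteStartElement(prefix, tag, ns);
    }
}

public static class TagFilterDemo
{
    // Copies the root element of the input through the filtering writer.
    public static string Filter(string xml)
    {
        var output = new StringWriter();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        using (var writer = new TagFilterWriter(output))
        {
            reader.MoveToContent();
            // WriteNode calls the overridden WriteStartElement for each element.
            writer.WriteNode(reader, true);
        }
        return output.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Filter("<P><marquee>Hi</marquee><br/></P>"));
    }
}
```

In this sketch the marquee element is emitted as dd with its text intact, and the upper-case P comes out in lower case. Renaming instead of dropping the start tag is what keeps the end tags balanced, because XmlTextWriter closes whatever name it was given when the element was opened.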
// Excerpt from the WriteAttributes override;
// reader is positioned on an attribute here.
// Check if the attribute is allowed
bool canWrite = false;
string attributeLocalName = reader.LocalName.ToLower();
foreach( string name in this.AllowedAttributes )
{
    if( name == attributeLocalName )
    {
        canWrite = true;
        break;
    }
}

// If allowed, write the attribute
if( canWrite )
    this.WriteStartAttribute(reader.Prefix,
        attributeLocalName, reader.NamespaceURI);

// The value is always consumed, but only written when allowed
while (reader.ReadAttributeValue())
{
    if (reader.NodeType == XmlNodeType.EntityReference)
    {
        if( canWrite )
            this.WriteEntityRef(reader.Name);
        continue;
    }
    if( canWrite )
        this.WriteString(reader.Value);
}

if( canWrite )
    this.WriteEndAttribute();
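The sample application is a utility you can use right away to clean up HTML files. You can also use these classes in tools, such as blogging tools, that need to post HTML to a web service.
Original article: http://www.codeproject.com/Articles/10792/Convert-HTML-to-XHTML-and-Clean-Unnecessary-Tags-a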