Beautify HtmlAgilityPack

As can be read all over the internet, HtmlAgilityPack is not meant for writing beautiful, i.e. human-readable, HTML files.

“[…] it’s a ‘by design’ choice.” [https://stackoverflow.com/a/5969074]

So everyone redirects you to some other library.

Now, I am a bit stubborn. I want to use HtmlAgilityPack and I want to have indented, human-readable HTML files. The magic lies in the text nodes of the DOM: line breaks and indentation are nothing more than whitespace text nodes between the elements. So, I wrote two utility functions to help me out.
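To illustrate the point about text nodes, here is a minimal sketch. It is not part of the utility code; the class name TextNodeDemo, the helper Describe and the sample markup are made up for this example. It parses a small, hand-indented fragment and lists the child nodes of the outer element. The line breaks and spaces between the tags should show up as separate text nodes:

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class TextNodeDemo {
  // Hypothetical helper: describes each child node of the first top-level node.
  internal static List<string> Describe(string html) {
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    var result = new List<string>();
    foreach (HtmlNode child in doc.DocumentNode.FirstChild.ChildNodes) {
      // Make line breaks visible in the output.
      string shown = child.OuterHtml.Replace("\n", "\\n");
      result.Add(child.NodeType + ": \"" + shown + "\"");
    }
    return result;
  }

  static void Main() {
    string html = "<div>\n  <span>one</span>\n  <span>two</span>\n</div>";
    foreach (string line in Describe(html)) Console.WriteLine(line);
  }
}
```

Removing or inserting such text nodes changes the layout of the saved file without touching any element.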

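First, a function to get rid of all unwanted whitespace. It might be a bit aggressive, because it also removes whitespace-only text nodes where whitespace matters (for example inside <pre> or between inline elements), but it was OK for me: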

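// Needs "using System.Linq;" (for ToArray) and "using HtmlAgilityPack;".
// Recursively removes every text node that contains only whitespace.
private static void removeWhitespace(HtmlNode node) {
  // Iterate over a copy, because the loop removes children from the collection.
  foreach (HtmlNode n in node.ChildNodes.ToArray()) {
    if (n.NodeType == HtmlNodeType.Text) {
      if (string.IsNullOrWhiteSpace(n.InnerHtml)) {
        node.RemoveChild(n);
      }
    } else {
      removeWhitespace(n);
    }
  }
}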
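If removing every whitespace-only text node is too aggressive for your documents, one possible variation is to skip elements in which whitespace matters. The following is only a sketch: the class name, the method names RemoveWhitespaceExcept and Strip, and the list of preserved tags are assumptions made for this example, so adjust them to your needs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class WhitespaceDemo {
  // Assumption: elements whose content should be left exactly as it is.
  static readonly HashSet<string> Preserve =
    new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "pre", "textarea", "script", "style" };

  // Removes whitespace-only text nodes, but does not descend into preserved elements.
  static void RemoveWhitespaceExcept(HtmlNode node) {
    foreach (HtmlNode n in node.ChildNodes.ToArray()) {
      if (n.NodeType == HtmlNodeType.Text) {
        if (string.IsNullOrWhiteSpace(n.InnerHtml)) node.RemoveChild(n);
      } else if (!Preserve.Contains(n.Name)) {
        RemoveWhitespaceExcept(n);
      }
    }
  }

  // Hypothetical convenience wrapper: parses, strips, and returns the markup.
  internal static string Strip(string html) {
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    RemoveWhitespaceExcept(doc.DocumentNode);
    return doc.DocumentNode.OuterHtml;
  }

  static void Main() {
    Console.WriteLine(Strip("<div>\n  <span>text</span>\n  <pre>\n<b>x</b>\n</pre>\n</div>"));
  }
}
```

With this variation the whitespace between the <div>, <span> and <pre> tags is removed, while the line breaks inside <pre> stay in place. Any code that adds indentation afterwards would need the same check, otherwise it would insert whitespace into the preserved elements again.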

And second, a function to create the whitespace for line breaks and indentation:

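// Entry point: formats every top-level node of the document.
internal static void beautify(HtmlDocument doc) {
  foreach (var topNode in doc.DocumentNode.ChildNodes.ToArray()) {
    switch (topNode.NodeType) {
      case HtmlNodeType.Comment: {
          // Make sure a top-level comment is followed by a line break.
          HtmlCommentNode cn = (HtmlCommentNode)topNode;
          if (string.IsNullOrEmpty(cn.Comment)) continue;
          // Ordinal comparison avoids culture-dependent results (needs "using System;").
          if (!cn.Comment.EndsWith("\n", StringComparison.Ordinal)) cn.Comment += "\n";
        } break;
      case HtmlNodeType.Element: {
          beautify(topNode, 0);
          // Append a line break before the closing tag of the top-level element.
          topNode.AppendChild(doc.CreateTextNode("\n"));
          // Alternative: put the line break after the element instead.
          //doc.DocumentNode.InsertAfter(doc.CreateTextNode("\n"), topNode);
        } break;
      default:
        // Text nodes and anything else stay untouched.
        break;
    }
  }
}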

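// Puts every child of "node" on its own line, indented with "level" tabs.
// Returns true if line breaks were inserted, so that the caller knows
// it has to put the closing tag of "node" on a new line as well.
private static bool beautify(HtmlNode node, int level) {
  if (!node.HasChildNodes) return false;

  var children = node.ChildNodes.ToArray();
  // Elements that contain nothing but text stay on one line.
  if (children.All(c => c.NodeType == HtmlNodeType.Text)) return false;

  // Line break plus indentation for this level.
  string nli = "\n" + new string('\t', level);

  foreach (var c in children) {
    node.InsertBefore(node.OwnerDocument.CreateTextNode(nli), c);
    if (c.NodeType == HtmlNodeType.Element && beautify(c, level + 1)) {
      // The child was broken into several lines: indent its closing tag too.
      c.AppendChild(c.OwnerDocument.CreateTextNode(nli));
    }
  }
  return true;
}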

As you can see, the code is pretty hacky, but it works for me. Maybe it also works for you, or it can serve as a starting point.
