(English) Beautify HtmlAgilityPack

Leider ist der Eintrag nur auf Amerikanisches Englisch verfügbar. Der Inhalt wird unten in einer verfügbaren Sprache angezeigt. Klicken Sie auf den Link, um die aktuelle Sprache zu ändern.

HtmlAgilityPackAs can be read on the internet: HtmlAgilityPack is not for beautiful, aka human readable, html files.

“[…] it’s a ‘by design’ choice.” [https://stackoverflow.com/a/5969074]

So everyone redirects you to some other library.

Now, I am a bit stubborn. I want to use HtmlAgilityPack and I want to have indented, human-readable html files. The magic is within text nodes in the DOM. So, I wrote two utility functions to help me out.

First, to get rid of all unwanted whitespaces. This one might be a bit aggressiv, but it was ok for me:

static private void removeWhitespace(HtmlNode node) {
  foreach (HtmlNode n in node.ChildNodes.ToArray()) {
    if (n.NodeType == HtmlNodeType.Text) {
      if (string.IsNullOrWhiteSpace(n.InnerHtml)) {
        node.RemoveChild(n);
      }
    } else removeWhitespace(n);
  }
}

And, second, to create white spaces for line breaks and indentions:

internal static void beautify(HtmlDocument doc) {
  foreach (var topNode in doc.DocumentNode.ChildNodes.ToArray()) {
    switch (topNode.NodeType) {
      case HtmlNodeType.Comment: {
          HtmlCommentNode cn = (HtmlCommentNode)topNode;
          if (string.IsNullOrEmpty(cn.Comment)) continue;
          if (!cn.Comment.EndsWith("\n")) cn.Comment += "\n";
        } break;
      case HtmlNodeType.Element: {
          beautify(topNode, 0);
          topNode.AppendChild(doc.CreateTextNode("\n"));
          //doc.DocumentNode.InsertAfter(doc.CreateTextNode("\n"), topNode);
        } break;
      case HtmlNodeType.Text:
        break;
      default:
        break;
    }
  }
}

private static bool beautify(HtmlNode node, int level) {
  if (!node.HasChildNodes) return false;

  var children = node.ChildNodes.ToArray();
  bool onlyText = true;
  foreach (var c in children) {
    if (c.NodeType != HtmlNodeType.Text) onlyText = false;
  }
  if (onlyText) return false;

  string nli = "\n" + new string('\t', level);

  foreach (var c in children) {
    node.InsertBefore(node.OwnerDocument.CreateTextNode(nli), c);
    if (c.NodeType == HtmlNodeType.Element) {
      if (c.HasChildNodes) {
        if (beautify(c, level + 1)) {
          c.AppendChild(c.OwnerDocument.CreateTextNode(nli));
        }
      }
    }
  }
  return true;
}

As you might see, the code is pretty hacky. But, it works for me. Maybe, it also works for you, or it can be a starting point.

Schreibe einen Kommentar

Deine E-Mail-Adresse wird nicht veröffentlicht.

*

Diese Website verwendet Akismet, um Spam zu reduzieren. Erfahre mehr darüber, wie deine Kommentardaten verarbeitet werden.