Learn the Tidy extension library in PHP together

Many readers may not have heard of this extension. It has nothing to do with teddy bears: Tidy is an extension for working with HTML. It is mainly used to clean up, repair, and format content such as HTML, XHTML, and XML.

About Tidy Library

The Tidy extension ships with the PHP source. We can enable it by passing --with-tidy when compiling and installing PHP, or build it later from the tidy directory under the ext/ folder in the source package. The extension also depends on the libtidy library, which must be installed on the operating system. On CentOS, for example, just run yum install libtidy-devel.
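Before running the examples, it can help to confirm that the extension is actually loaded. The following is a minimal sketch that uses only the core functions extension_loaded() and phpversion(); the message wording is purely illustrative.

```php
<?php
// Check whether the tidy extension is available in this PHP build.
if (extension_loaded('tidy')) {
    // phpversion() with an extension name returns that extension's version string.
    $message = 'tidy is loaded, extension version: ' . phpversion('tidy');
} else {
    $message = 'tidy is not loaded; rebuild PHP with --with-tidy or install the extension';
}

echo $message, PHP_EOL;
```

On the command line, php -m lists the loaded modules and should include tidy once the installation has succeeded.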

Tidy formatting

First, let's take a look at how to format a piece of HTML code through this Tidy extension library.

$content = <<<EOF
<html><head><title>test</title></head> <body><p>error<br>another line</i></body>
</html>
EOF;

$tidy = new Tidy();
$config = [
        'indent'=>true,
        'output-xhtml'=>true,
];
$tidy->parseString($content, $config);
$tidy->cleanRepair();

echo $tidy, PHP_EOL;
// <html xmlns="http://www.w3.org/1999/xhtml">
//   <head>
//     <title>
//       test
//     </title>
//   </head>
//   <body>
//     <p>
//       error<br />
//       another line
//     </p>
//   </body>
// </html>

The HTML in $content is deliberately nonstandard and has no formatting at all. After instantiating a Tidy object, calling parseString() and cleanRepair(), and then printing the $tidy object directly, we get formatted HTML. The result now looks very standard, both in the added xmlns attribute and in the indentation.

We pass two parameters to parseString() here. The first is the string to be formatted. The second is the formatting configuration, an array whose keys must be options defined by the Tidy library. The full list of options can be found in the second link at the end of the article. Here we set only two: indent controls whether block-level elements are indented, and output-xhtml controls whether the output is written as XHTML. The method also accepts an optional third parameter that names the encoding.
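To make the role of the configuration array more concrete, here is a small sketch that wraps the same calls in a function. The function name tidyToXhtml() is made up for this example, and the wrap option and the utf8 encoding argument are my own additions rather than part of the test code above.

```php
<?php
// Hypothetical helper: format a piece of HTML as indented XHTML.
function tidyToXhtml(string $html): string
{
    $config = [
        'indent'       => true, // indent block-level elements
        'output-xhtml' => true, // write the result as XHTML
        'wrap'         => 0,    // assumption: 0 disables line wrapping
    ];

    $tidy = new tidy();
    // The optional third argument names the encoding of the document.
    $tidy->parseString($html, $config, 'utf8');
    $tidy->cleanRepair();

    // Casting the object to a string yields the formatted markup.
    return (string) $tidy;
}

if (extension_loaded('tidy')) {
    $xhtml = tidyToXhtml('<p>error<br>another line</i>');
    echo $xhtml, PHP_EOL;
}
```

Keeping the configuration in one place like this makes it easy to try other options from the quick reference one at a time.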

The cleanRepair() method cleans and repairs the parsed content, which in effect performs the formatting cleanup.

Note that the test code prints the Tidy object directly. This works because the object can be converted to a string, as if it implemented __toString(). The object itself actually looks like this:

var_dump($tidy);
// object(tidy)#1 (2) {
//     ["errorBuffer"]=>
//     string(112) "line 1 column 1 - Warning: missing <!DOCTYPE> declaration
//   line 1 column 70 - Warning: discarding unexpected </i>"
//     ["value"]=>
//     string(195) "<html xmlns="http://www.w3.org/1999/xhtml">
//     <head>
//       <title>
//         test
//       </title>
//     </head>
//     <body>
//       <p>
//         error<br />
//         another line
//       </p>
//     </body>
//   </html>"
//   }
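The dump shows that the formatted markup is kept in the value property. As a rough sketch, the same text can be read back in several ways; this assumes the tidy extension is loaded, and the helper name formattedOutputs() is invented for illustration.

```php
<?php
// Hypothetical helper: return the formatted markup obtained in three ways.
function formattedOutputs(string $html): array
{
    $tidy = new tidy();
    $tidy->parseString($html, ['indent' => true, 'output-xhtml' => true]);
    $tidy->cleanRepair();

    return [
        'cast'     => (string) $tidy,         // the conversion that echo relies on
        'function' => tidy_get_output($tidy), // procedural way to get the output
        'property' => $tidy->value,           // the property shown in the dump
    ];
}

if (extension_loaded('tidy')) {
    $outputs = formattedOutputs('<p>error<br>another line</i>');
    // Compare the string cast with the procedural function.
    var_dump($outputs['cast'] === $outputs['function']);
}
```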

Getting document and configuration information

var_dump($tidy->isXml()); // bool(false)

var_dump($tidy->isXhtml()); // bool(false)

var_dump($tidy->getStatus()); // int(1)

var_dump($tidy->getRelease());  // string(10) "2017/11/25"

var_dump($tidy->getHtmlVer()); // int(500)

We can get some information about the processed document through methods of the Tidy object, such as whether it is XML or XHTML content.

getStatus() returns the status of the Tidy object. The value 1 here means that warnings or accessibility errors were raised. In the dump of the Tidy object above, we can see the warnings in the object's errorBuffer property.
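According to the PHP manual, getStatus() returns 0 when nothing was raised, 1 for warnings or accessibility errors, and 2 for errors. A small sketch of turning that number into readable text follows; describeTidyStatus() is a hypothetical helper, not part of the extension.

```php
<?php
// Hypothetical helper: map the integer from getStatus() to a description.
function describeTidyStatus(int $status): string
{
    switch ($status) {
        case 0:
            return 'no errors or warnings';
        case 1:
            return 'warnings or accessibility errors';
        case 2:
            return 'errors';
        default:
            // Values outside 0..2 are not documented in the manual.
            return 'unknown status';
    }
}

echo describeTidyStatus(1), PHP_EOL; // prints "warnings or accessibility errors"
```

In real code the argument would come from $tidy->getStatus() after parsing.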

getRelease() returns the release date of the tidy library in use, that is, the libtidy you installed on the operating system. getHtmlVer() returns the detected HTML version. The documentation gives no further explanation of the value 500, so I am not sure what it means.

In addition to the above, we can also read the values of the options set in $config earlier, along with their documentation.

var_dump($tidy->getOpt('indent')); // int(1)

var_dump($tidy->getOptDoc('output-xhtml'));
// string(489) "This option specifies if Tidy should generate pretty printed output, writing it as extensible HTML. <br/>This option causes Tidy to set the DOCTYPE and default namespace as appropriate to XHTML, and will use the corrected value in output regardless of other sources. <br/>For XHTML, entities can be written as named or numeric entities according to the setting of <code>numeric-entities</code>. <br/>The original case of tags and attributes will be preserved, regardless of other options. "

The getOpt() method takes one parameter: the name of the configuration option to query. If we query an option that we did not set in $config, the returned value is the default. getOptDoc() is a thoughtful addition: it returns the documentation for an option.

Finally, there are some more practical methods that work with nodes directly.
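The point about default values is easy to check. The sketch below sets only indent and then queries a few options by name; tidyOptionReport() is a made-up helper, and wrap is simply one more real Tidy option chosen as an example of something left at its default.

```php
<?php
// Hypothetical helper: collect the effective values of the named options.
function tidyOptionReport(string $html, array $config, array $names): array
{
    $tidy = new tidy();
    $tidy->parseString($html, $config);

    $report = [];
    foreach ($names as $name) {
        // Options not present in $config come back with their default value.
        $report[$name] = $tidy->getOpt($name);
    }
    return $report;
}

if (extension_loaded('tidy')) {
    $report = tidyOptionReport(
        '<p>test</p>',
        ['indent' => true],
        ['indent', 'wrap', 'output-xhtml']
    );
    var_dump($report);
}
```

If you want every option at once rather than a chosen few, getConfig() returns the whole configuration as an array.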

echo $tidy->head(), PHP_EOL;
// <head>
//   <title>
//   test
// </title>
// </head>

$body = $tidy->body();

var_dump($body);
// object(tidyNode)#2 (9) {
//     ["value"]=>
//     string(60) "<body>
//     <p>
//       error<br />
//       another line
//     </p>
//   </body>"
//     ["name"]=>
//     string(4) "body"
//     ["type"]=>
//     int(5)
//     ["line"]=>
//     int(1)
//     ["column"]=>
//     int(40)
//     ["proprietary"]=>
//     bool(false)
//     ["id"]=>
//     int(16)
//     ["attribute"]=>
//     NULL
//     ["child"]=>
//     array(1) {
//       [0]=>
//       object(tidyNode)#3 (9) {
//         ["value"]=>
//         string(37) "<p>
// ..................
// ..................

echo $tidy->html(), PHP_EOL;
// <html xmlns="http://www.w3.org/1999/xhtml">
//   <head>
//     <title>
//       test
//     </title>
//   </head>
//   <body>
//     <p>
//       error<br />
//       another line
//     </p>
//   </body>
// </html>

echo $tidy->root(), PHP_EOL;
// <html xmlns="http://www.w3.org/1999/xhtml">
//   <head>
//     <title>
//       test
//     </title>
//   </head>
//   <body>
//     <p>
//       error<br />
//       another line
//     </p>
//   </body>
// </html>

It should be clear without much explanation: head() returns the content of the <head> tag, body() and html() return their corresponding tags, and root() returns the root node, which can be regarded as the content of the whole document.

Each of these methods actually returns a TidyNode object, which will be described in detail later.

Convert directly to string

The code above is based on the parseString() method. It does not return the formatted content; it only returns a Boolean indicating success or failure. To get the formatted content, we have to treat the object as a string or use root() to get the whole document. In fact, there is another method that returns the formatted string directly.

$tidy = new Tidy();
$repair = $tidy->repairString($content, $config);

echo $repair, PHP_EOL;
// <html xmlns="http://www.w3.org/1999/xhtml">
//   <head>
//     <title>
//       test
//     </title>
//   </head>
//   <body>
//     <p>
//       error<br />
//       another line
//     </p>
//   </body>
// </html>

The parameters of repairString() are exactly the same as those of parseString(). The only difference is that it returns a string instead of storing the result inside the Tidy object.
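Because repairString() returns the result directly, it is convenient for one-off cleanup of small fragments. The sketch below assumes you want only the repaired fragment and not a full document, so it adds the show-body-only option; that option and the helper name repairFragment() are my additions, not part of the test code.

```php
<?php
// Hypothetical helper: repair an HTML fragment and return only the body content.
function repairFragment(string $fragment): string
{
    $config = [
        'output-xhtml'   => true,
        'show-body-only' => true, // assumption: omit the html, head and body wrappers
    ];

    $tidy = new tidy();
    // repairString() may return false on failure; the cast turns that into ''.
    return (string) $tidy->repairString($fragment, $config, 'utf8');
}

if (extension_loaded('tidy')) {
    $fixed = repairFragment('<p>error<br>another line</i>');
    echo $fixed, PHP_EOL;
}
```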

Conversion error message

In the first test code, when we printed the Tidy object with var_dump(), we could see error messages in the errorBuffer property. This time we use another HTML fragment with more problems.

$html = <<<HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<p>paragraph</p>
HTML;
$tidy = new Tidy();
$tidy->parseString($html);
$tidy->cleanRepair();

echo $tidy->errorBuffer, PHP_EOL;
// line 4 column 1 - Warning: <p> isn't allowed in <head> elements
// line 4 column 1 - Info: <head> previously mentioned
// line 4 column 1 - Warning: inserting implicit <body>
// line 4 column 1 - Warning: inserting missing 'title' element

$tidy->diagnose();
echo $tidy->errorBuffer, PHP_EOL;
// line 4 column 1 - Warning: <p> isn't allowed in <head> elements
// line 4 column 1 - Info: <head> previously mentioned
// line 4 column 1 - Warning: inserting implicit <body>
// line 4 column 1 - Warning: inserting missing 'title' element
// Info: Doctype given is "-//W3C//DTD XHTML 1.0 Strict//EN"
// Info: Document content looks like XHTML 1.0 Strict
// Tidy found 3 warnings and 0 errors!

In this test code, we use a new method, diagnose(), which diagnoses the document and adds more information about it to the errorBuffer property.
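The errorBuffer is just one multi-line string, so to work with individual messages it has to be split up. Here is a rough sketch that assumes the "line N column M - Level: message" layout seen in the output above; other libtidy versions may format messages differently, and parseErrorBuffer() is a hypothetical helper.

```php
<?php
// Hypothetical helper: turn errorBuffer text into a list of structured entries.
function parseErrorBuffer(string $buffer): array
{
    $entries = [];
    // \R matches any line break sequence.
    foreach (preg_split('/\R/', $buffer) as $line) {
        // Only lines with a position prefix are kept; summary lines are skipped.
        if (preg_match('/^line (\d+) column (\d+) - (\w+): (.*)$/', $line, $m)) {
            $entries[] = [
                'line'    => (int) $m[1],
                'column'  => (int) $m[2],
                'level'   => $m[3],
                'message' => $m[4],
            ];
        }
    }
    return $entries;
}

// Sample text copied from the output shown above.
$sample = "line 4 column 1 - Warning: inserting implicit <body>\n"
        . "Info: Document content looks like XHTML 1.0 Strict";

$entries = parseErrorBuffer($sample);
print_r($entries);
```

In practice the argument would be $tidy->errorBuffer after parseString() or diagnose().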

TidyNode operation

As we mentioned earlier, the methods head(), html(), body(), and root() return a TidyNode object. Is there anything special about this object?

$html = <<<EOF
<html><head>
<?php echo '<title>title</title>'; ?>
<#
  /* JSTE code */
  alert('Hello World');
#>
</head>
<body>

<?php
  // PHP code
  echo 'hello world!';
?>

<%
  /* ASP code */
  response.write("Hello World!")
%>

<!-- Comments -->
Hello World
</body></html>
Outside HTML
EOF;

$tidy = new Tidy();
$tidy->parseString($html);

$tidyNode = $tidy->html();

showNodes($tidyNode);

function showNodes($node){

    if($node->isComment()){
        echo '========', PHP_EOL,'This is Comment Node :"', $node->value, '"', PHP_EOL;
    }
    if($node->isText()){
        echo '--------', PHP_EOL,'This is Text Node :"', $node->value, '"', PHP_EOL;
    }
    if($node->isAsp()){
        echo '++++++++', PHP_EOL,'This is Asp Script :"', $node->value, '"', PHP_EOL;
    }
    if($node->isHtml()){
        echo '********', PHP_EOL,'This is HTML Node :"', $node->value, '"', PHP_EOL;
    }
    if($node->isPhp()){
        echo '########', PHP_EOL,'This is PHP Script :"', $node->value, '"', PHP_EOL;
    }
    if($node->isJste()){
        echo '@@@@@@@@', PHP_EOL,'This is JSTE Script :"', $node->value, '"', PHP_EOL;
    }

    if($node->name){
        // getParent()
        if($node->getParent()){
            echo '&&&&&&&& ', $node->name ,' getParent is : ', $node->getParent()->name, PHP_EOL;
        }

        // hasSiblings
        echo '^^^^^^^^ ', $node->name, ' has siblings is : ';
        var_dump($node->hasSiblings());
        echo PHP_EOL;
    }

    if($node->hasChildren()){
        foreach($node->child as $child){
            showNodes($child);
        }
    }
}

// ..................
// ..................
// ********
// This is HTML Node :"<head>
// <?php echo '<title>title</title>'; ><#
//   /* JSTE code */
//   alert('Hello World');
// #>
// <title></title>
// </head>
// "
// &&&&&&&& head getParent is : html
// ^^^^^^^^ head has siblings is : bool(true)
// ..................
// ..................
// ++++++++
// This is Asp Script :"<%
//   /* ASP code */
//   response.write("Hello World!")
// %>" 
// ..................
// ..................

The individual steps of this code and each function are not explained in detail here. As the code shows, a TidyNode object can tell us about the structure around a node, such as whether it has child nodes and sibling nodes. It can also tell us the type of a node: whether it is a comment, text, JSTE code, PHP code, ASP code, and so on. I don't know how you feel at this point, but I think this is very interesting, especially the method for detecting PHP code.
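The same recursive walk can be used for something practical, such as pulling the plain text out of a document. The following is only a sketch built on the isText(), hasChildren(), value and child members used above; collectText() is a made-up helper, and the trim() call is my assumption about how to handle whitespace around text node values.

```php
<?php
// Hypothetical helper: gather the text of every text node below $node.
function collectText($node): array
{
    $texts = [];

    if ($node->isText()) {
        // Assumption: surrounding whitespace is not wanted.
        $texts[] = trim($node->value);
    }

    if ($node->hasChildren()) {
        foreach ($node->child as $child) {
            // Recurse into each child, same as showNodes() does.
            $texts = array_merge($texts, collectText($child));
        }
    }

    return $texts;
}

if (extension_loaded('tidy')) {
    $tidy = new tidy();
    $tidy->parseString('<html><head><title>test</title></head><body><h1>a</h1><p>b</p></body></html>');
    $texts = collectText($tidy->html());
    print_r($texts);
}
```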

Message count functions

Finally, let's take a look at some statistical functions in the Tidy extension library.

$html = <<<EOF
<p>test</i>
<bogustag>bogus</bogustag>
EOF;
$config = array('accessibility-check' => 3,'doctype'=>'bogus');
$tidy = new Tidy();
$tidy->parseString($html, $config);

echo 'tidy access count: ', tidy_access_count($tidy), PHP_EOL;
echo 'tidy config count: ', tidy_config_count($tidy), PHP_EOL;
echo 'tidy error count: ', tidy_error_count($tidy), PHP_EOL;
echo 'tidy warning count: ', tidy_warning_count($tidy), PHP_EOL;

// tidy access count: 4
// tidy config count: 2
// tidy error count: 1
// tidy warning count: 6

The numbers these functions return are counts of messages. tidy_access_count() returns the number of accessibility warnings encountered, and tidy_config_count() returns the number of configuration errors. The other two are clear from their names, so I don't need to say more.

Summary

In short, the Tidy extension is a less common but very interesting library. It is still useful in some scenarios, such as template development. It is worth studying in depth. Maybe it can solve the hard problem you are facing right now!
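If you need all four numbers together, they can be gathered into one array. This is a minimal sketch with a hypothetical helper named tidyMessageCounts(). It calls diagnose() first because the PHP manual notes that tidy_access_count() should be preceded by a diagnose call; the test code above got a nonzero value without it, so treat that step as a precaution.

```php
<?php
// Hypothetical helper: return all four message counts for a document.
function tidyMessageCounts(string $html, array $config = []): array
{
    $tidy = new tidy();
    $tidy->parseString($html, $config);
    // Precaution taken from the manual's note on tidy_access_count().
    $tidy->diagnose();

    return [
        'access'  => tidy_access_count($tidy),
        'config'  => tidy_config_count($tidy),
        'error'   => tidy_error_count($tidy),
        'warning' => tidy_warning_count($tidy),
    ];
}

if (extension_loaded('tidy')) {
    $counts = tidyMessageCounts(
        "<p>test</i>\n<bogustag>bogus</bogustag>",
        ['accessibility-check' => 3]
    );
    print_r($counts);
}
```

The exact numbers depend on the installed libtidy version, so they may not match the output shown above.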

Test code:

https://github.com/zhangyue0503/dev-blog/blob/master/php/2021/01/source/8. Learn the Tidy extension library.php in PHP

Reference documents:

https://www.php.net/manual/zh/book.tidy.php

http://tidy.sourceforge.net/docs/quickref.html

Posted by zushiba on Mon, 08 Nov 2021 22:35:20 -0800