
Php - Detect Html In String And Wrap With Code Tag

I'm having trouble handling HTML in text content. I'm thinking about a method that detects those tags and wraps each run of consecutive tags inside code tags.

Solution 1:

DOMDocument is hard to use in this specific case, since it automatically wraps text nodes in <p> tags (and adds doctype, head and html). One way is to build the pattern like a lexer, using the (?(DEFINE)...) feature and named subpatterns:

$html = <<<EOD
Don't wrap me <p>Hello</p><div class="text">wrap me please!</div><span class="title">wrap me either!</span> Don't wrap me <h1>End</h1>
EOD;

$pattern = <<<'EOD'
~
(?(DEFINE)
    (?<self>    < [^\W_]++ [^>]* > )
    (?<comment> <!-- (?>[^-]++|-(?!->))* --> )
    (?<cdata>   \Q<![CDATA[\E (?>[^]]++|](?!]>))* ]]> )
    (?<text>    [^<]++ )
    (?<tag>
        < ([^\W_]++) [^>]* >
        (?> \g<text> | \g<tag> | \g<self> | \g<comment> | \g<cdata> )*
        </ \g{-1} >
    )
)
# main pattern
(?: \g<tag> | \g<self> | \g<comment> | \g<cdata> )+
~x
EOD;

$html = preg_replace($pattern, '<code>$0</code>', $html);

echo htmlspecialchars($html);

The (?(DEFINE)...) feature lets you put a definition section inside a regex pattern. The definition section and the named subpatterns inside it don't match anything by themselves; they are there to be used later in the main pattern.

(?<abcd> ...) defines a subpattern you can reuse later with \g<abcd>. In the above pattern, subpatterns defined in this way are:

  • self: describes a self-closing tag (as written, any single <name ...> tag)
  • comment: for html comments
  • cdata: for cdata
  • text: for text (all that is not a tag, a comment, or cdata)
  • tag: for html tags that are not self-closed
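
To see the definition mechanism in isolation, here is a minimal sketch on an unrelated task (matching dates). The subpattern names year, month and day are hypothetical and only serve the illustration; nothing in the DEFINE block matches until the main pattern calls it with \g<name>.

```php
<?php
// Hypothetical example: "year", "month" and "day" are illustrative names only.
$pattern = <<<'EOD'
~
(?(DEFINE)
    (?<year>  \d{4} )
    (?<month> 0[1-9] | 1[0-2] )
    (?<day>   0[1-9] | [12]\d | 3[01] )
)
# main pattern: the definitions above are only used when called here
\b \g<year> - \g<month> - \g<day> \b
~x
EOD;

preg_match_all($pattern, 'Released 2014-03-09, patched 2014-13-01.', $matches);
print_r($matches[0]); // only 2014-03-09 is found, since 13 is not a valid month
```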

self: [^\W_] is a trick to obtain \w without the underscore. [^\W_]++ represents the tag name and is also used in the tag subpattern. [^>]* means anything that is not a >, zero or more times.
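
A quick way to check the [^\W_] trick, as a sketch: the two anchored patterns below are assumptions written for this test and are not part of the answer's pattern.

```php
<?php
// Anchored test patterns (hypothetical, for illustration only).
$alnumOnly = '~^[^\W_]++$~';   // letters and digits, no underscore
$wordChars = '~^\w++$~';       // letters, digits and underscore

var_dump(preg_match($alnumOnly, 'h1'));      // int(1)
var_dump(preg_match($alnumOnly, 'my_tag'));  // int(0)
var_dump(preg_match($wordChars, 'my_tag'));  // int(1)
```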

comment: (?>[^-]++|-(?!->))* describes all the possible content inside an html comment:

(?>          # open an atomic group
    [^-]++   # all that is not a literal -, one or more times (possessive)
  |          # OR
    -        # a literal -
    (?!->)   # not followed by -> (negative lookahead)
)*           # close and repeat the group zero or more times 
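
As a rough check, the comment subpattern can be tried on its own. The sample string and variable names below are made up for the test; the greedy pattern is included only for contrast.

```php
<?php
// The comment subpattern from above, written without the x flag.
$comment = '~<!--(?>[^-]++|-(?!->))*-->~';
// A naive greedy version, for comparison only.
$greedy  = '~<!--.*-->~';

$text = 'a <!-- first - one --> b <!-- second --> c';

preg_match_all($comment, $text, $m);
print_r($m[0]); // two separate comments

preg_match_all($greedy, $text, $g);
print_r($g[0]); // one match running from the first <!-- to the last -->
```

The lookahead lets a single - appear inside a comment while still stopping at the first -->.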

cdata: All characters between \Q..\E are treated as literal characters, so special characters like [ don't need to be escaped. (This is only a trick to make the pattern more readable.) The content allowed in CDATA is described in the same way as the content of HTML comments.

text: [^<]++ matches all characters up to an opening angle bracket or the end of the string.

tag: This is the most interesting subpattern. Lines 1 and 3 are the opening and the closing tag. Note that, in line 1, the tag name is captured with a capturing group. In line 3, \g{-1} refers to the content matched by the last defined capturing group ("-1" means "one on the left"). Line 2 describes the possible content between an opening and a closing tag. You can see that this description uses not only the subpatterns defined before but also the current subpattern itself, to allow nested tags.
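
The recursion and the \g{-1} backreference can be tried in a reduced form. This is a sketch that keeps only the text and tag subpatterns (no self-closing tags, comments or CDATA), so it is not the full pattern from the answer; the sample strings are assumptions.

```php
<?php
// Reduced version of the "tag" subpattern: content is limited to text and nested tags.
$pattern = <<<'EOD'
~
(?(DEFINE)
    (?<text> [^<]++ )
    (?<tag>
        < ([^\W_]++) [^>]* >             # opening tag, name captured
        (?> \g<text> | \g<tag> )*        # content: text or a nested tag (recursion)
        </ \g{-1} >                      # closing tag must repeat the captured name
    )
)
\g<tag>
~x
EOD;

// Properly nested tags: the whole outer element is matched.
$nested = preg_match($pattern, 'x <div><p>a</p></div> y', $m);
var_dump($m[0]);   // string(19) "<div><p>a</p></div>"

// Crossed closing tags: no element is balanced, so nothing matches.
$crossed = preg_match($pattern, '<div><p>a</div></p>');
var_dump($crossed); // int(0)
```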

Once all items have been set and the definition section closed, you can easily write the main pattern.

Solution 2:

I'm having trouble handling HTML in text content.

Then just escape that text:

echo htmlspecialchars($your_text_that_may_contain_html_code);
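
For instance, with a made-up input string (the variable names here are assumptions), the tags come out as visible text instead of markup:

```php
<?php
// Hypothetical input containing markup, an ampersand and double quotes.
$text = 'Use <b>bold</b> & "quotes"';

$escaped = htmlspecialchars($text);
echo $escaped;
// Use &lt;b&gt;bold&lt;/b&gt; &amp; &quot;quotes&quot;
```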

Parsing HTML with regex is a well-known big no-no!

Solution 3:

This will find tags along with their closing tags, and everything in between:

<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>

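You can then replace each match with the same text wrapped in code tags. Because the character classes only list uppercase letters, the pattern needs the case-insensitive flag (i) to match lowercase tags. It may not work in every case (nested tags of the same name, for instance), but you might find it sufficient for your needs if the HTML is fairly static.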
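
A minimal sketch of how this could be used in PHP, assuming the tag name is captured in group 1 so that \1 has something to refer to, and adding the i and s flags so that lowercase tags and multi-line content match. The sample strings are made up.

```php
<?php
// Group 1 captures the tag name; \1 requires the same name in the closing tag.
$pattern = '~<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>~is';

$simple = preg_replace($pattern, '<code>$0</code>', "Don't wrap me <p>Hello</p> Don't wrap me");
echo $simple, "\n";
// Don't wrap me <code><p>Hello</p></code> Don't wrap me

// Limitation: with nested tags of the same name, the lazy .*? stops at the first closing tag.
$nested = preg_replace($pattern, '<code>$0</code>', '<div><div>a</div></div>');
echo $nested, "\n";
// <code><div><div>a</div></code></div>
```

Unlike Solution 1, each element is wrapped separately rather than as one run of consecutive tags.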
