In the comments on "Do Not... DO NOT! Parse HTML with Regex's" and "Example of Hard to Parse HTML", a couple of people piped up to say that there are legitimate reasons to use regexes when your input is HTML.
Let me reiterate: if all you can say (in the biz we call this a "specification") about the input to your program is that "it will be HTML", there is no reason on God's green Earth that you should be using regular expressions to parse it. I can't think of a way to make that any clearer.
Patrick Walton was first up and offered these two regexes:
s/<\s*script.*?>//gis;
s/on[a-zA-Z]+\s*=//gis;
This will, of course, mutilate the contents of a page like:
<html> <body> onMyBirthday=FUN!! onHolidays=FUN!! ontology=FUN!! batons=FUN!! </body> </html>
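If you would rather see the damage than take my word for it, here is a minimal sketch that runs Patrick's two substitutions, verbatim, over the page above. The variable names `$page` and `$cleaned` are mine, not his, and the page is kept on one line for convenience.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The sample page from above: no scripts, no event handlers, just text.
my $page = '<html> <body> onMyBirthday=FUN!! onHolidays=FUN!! ontology=FUN!! batons=FUN!! </body> </html>';

# Patrick's two substitutions, applied to a copy of the page.
(my $cleaned = $page) =~ s/<\s*script.*?>//gis;
$cleaned =~ s/on[a-zA-Z]+\s*=//gis;

print "$cleaned\n";
# prints: <html> <body> FUN!! FUN!! FUN!! batFUN!! </body> </html>
```

Nothing in that page was JavaScript, yet three of the four words vanish and "batons" loses its tail, because `on[a-zA-Z]+\s*=` matches inside ordinary text as happily as it does inside a tag.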
Also, <script> tags and on* attributes are not the only places where one can hide JavaScript code (via "Integrating JavaScript into Stylesheets"):
<html>
<body>
<style>
background: url("
javascript:
document.body.onload = function(){
...custom js here...
}
");
</style>
...
...
</body>
</html>
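A rough check of what Patrick's substitutions do with a page like this one, assuming it is held in a string. The names `$page` and `$cleaned` are my own, and the heredoc is a trimmed copy of the example with the filler lines left out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Trimmed copy of the stylesheet example; single-quoted heredoc, so nothing is interpolated.
my $page = <<'HTML';
<html>
<body>
<style>
background: url("
javascript:
document.body.onload = function(){
...custom js here...
}
");
</style>
</body>
</html>
HTML

# The same two substitutions as before.
(my $cleaned = $page) =~ s/<\s*script.*?>//gis;
$cleaned =~ s/on[a-zA-Z]+\s*=//gis;

print "style block kept: ",       ($cleaned =~ /<style>/     ? "yes" : "no"), "\n";
print "javascript: URL kept: ",   ($cleaned =~ /javascript:/ ? "yes" : "no"), "\n";
print "onload assignment kept: ", ($cleaned =~ /onload\s*=/  ? "yes" : "no"), "\n";
```

The style block and the `javascript:` URL both come through untouched. The only thing the second substitution removes is the text `onload =`, and that is an accident of this particular payload: code that contains neither `<script` nor an `on...=` assignment gives these two patterns nothing to match.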
There are more ways to break Patrick's regular expressions, but I think that is enough for now.
The next guy to chime in was one d.furuta, who says:
Assuming you know what you're doing, it's fine to use regexes if appropriate. The problem is not experienced programmers using regexes; it's inexperienced programmers not knowing when regexes are acceptable and when a parser is needed.
I have heard this argument before. Usually, I hear it as justification for code like the following:
($table_data) = $html =~ /<td>(.*?)<\/td>/gis; # pull out data between <td> tags
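Feed that pattern a table whose cells carry attributes and it quietly comes up short. A minimal sketch, using a made-up table (the markup and the `@cells` name are my own, for illustration). I collect the matches into an array here, since assigning to `($table_data)` keeps only the first capture.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: three cells, two of which are not a bare <td>.
my $html = '<table><tr><td>one</td><td class="num">2</td><td >three</td></tr></table>';

# Same pattern as the one-liner, with every capture kept.
my @cells = $html =~ /<td>(.*?)<\/td>/gis;

print scalar(@cells), " of 3 cells found: @cells\n";
# prints: 1 of 3 cells found: one
```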
"But, it works!" they say.
"It's easy!"
"It's quick!"
"It will do the job just fine!"
I berate them for not being lazy. (I also berate them for using .* in a regex, see "Death to Dot Star!", but that's the subject of a different post.) You need to be lazy as a programmer. Parsing HTML is a solved problem. You do not need to solve it. You just need to be lazy. You have CPAN. You will never reach Perl-Guru status until you have mastered CPAN. Be lazy, use CPAN, and use HTML::Sanitizer. It will make your coding easier. It will leave your code more maintainable. You won't have to sit there hand-coding regular expressions. Your code will be more robust. You won't have to fix bugs every time the HTML breaks your crappy regex. This is true laziness:
#!/usr/bin/perl
use warnings;
use strict;
use HTML::Sanitizer;
use LWP::Simple;
my $html = get('http://alpha-geek.com/example/crazy_html.html')
    or die "could not fetch the example page\n";
my $sanity = HTML::Sanitizer->new;
# pick your poisons
# (the documentation also explains how to specify permissible attributes)
$sanity->permit_only(qw/ html head title body a p h1 div strong em /);
print $sanity->filter_html($html);
__END__
# the above code outputs (line breaks added for display)
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>Crazy HTML -- Can Your Regex Parse This?</title></head>
<body><h1>Did The Javascript Execute?</h1>
<div> I will execute here, too, if you mouse over me </div></body></html>
There now, doesn't that code seem so much nicer than 97 regexes and various special-case pre-regex and post-regex string manipulations? Isn't that nice, clean, clear, and concise? Aren't robust libraries wonderful?
One last time:
If you still need convincing, I can go on and on and on about this topic. Just let me know. And the best way to let me know is to continue posting regexes in the comments. You set 'em up. I'll knock 'em down.
Comments
To make no mistakes is not in the power of man; but from their errors and mistakes the wise and good learn wisdom for the future.
Yeah, I should point out that, a long time ago, I parsed HTML with regexes. Well, actually, I parsed everything with regexes. I used regexes everywhere I possibly could. I was a regex over-user.
But, I learned better. And, now, I can help out others by pointing out that it is better, easier, lazier, and nicer to use libraries for parsing HTML.
That was not my point. If you're extracting information, yes, you should use a parser. If you're trying to mangle something beyond browser recognition, regexes are ok, as long as you know the different forms the malicious code can take.
Furthermore, if you are a good enough programmer to know what you're doing (which I am not claiming to be), and if you want to parse HTML using substr, more power to you. Pedantry and insisting that there's only one way (TIMTOWTDI, eh?) serve no useful purpose.
I had no idea my quick regexes would be subjected to such scrutiny; I'm flattered! I apologize that I'm not going to play your game further, but those regexes weren't designed to be a perfect solution; for this, I agree entirely that a module is needed. In fact, I *use* a module in my web project for this (unless the module isn't available, in which case regexes are used as a fallback). But for day-to-day work, using a bulky module that builds up an entire parse tree is swatting a fly with a sledgehammer.
I don't dispute that modules are quite often the right tool for the job, but I do dispute that "there's no reason on God's green earth" to use regexes when dealing with HTML. There's actually a library on CPAN which uses regexes to destroy JavaScript: HTML::StripScripts::Regex. HTML::Sanitizer actually advocates using regexes to catch JavaScript embedded in URLs. Its sister module HTML::Scrubber, which is also based on parsing the document, recommends the same method.
(Speaking of which, wouldn't a module better suited to the job - HTML::StripScripts - be a better choice for removing JavaScript?)
What's your point in "mastering CPAN"? Is it the art of separating the junk modules from the good ones in it? It also always depends on the environment where you use your software! Do you know what happens to the startup time of your Perl interpreter after installing the twentieth CPAN module? Have you made benchmarks comparing the speed of a well-crafted regex and HTML::Sanitizer?
Those are just some pointers that programming is not always about style and code reuse.