
Sanitizing HTML


gidmeister2


I created a program that downloads web pages, parses the HTML, removes dangerous tags and attributes, and lets the user specify, to some extent, what to keep. There are "HTML sanitizers" out there, but the ones I tried (with ASP.NET pages) were defective.

Unfortunately, I wrote the program without a parsing guide to HTML. I just took the tags I knew, pushed them onto a stack, popped the stack, and so on.

I would think w3schools has a guide to all of HTML that I could just feed into a parser, perhaps indicating which tags are dangerous. So my question is: where is that guide?

Thanks


There's a reference of every single HTML tag here: http://www.w3schools.com/tags/default.asp

 

The rules for parsing HTML are pretty complicated because it allows so many unusual syntaxes: some elements have no closing tag, element and attribute names aren't case-sensitive, and attribute values may or may not be wrapped in quotation marks. Building an HTML parser from scratch would require a lot of work.
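
If you'd rather not write the parser yourself, one option is to let an existing parser deal with those quirks and only write the filtering step. Here is a minimal sketch in Python, assuming the standard library's html.parser copes with your input. The names ALLOWED_TAGS, ALLOWED_ATTRS, Sanitizer and sanitize are made up for illustration, and the allowlist itself is only an example. It keeps what is explicitly allowed instead of trying to list every dangerous tag. Treat it as a starting point, not a vetted security tool.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist for illustration; adjust to what you want to keep.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "a", "br", "ul", "ol", "li"}
ALLOWED_ATTRS = {"a": {"href"}}
VOID_TAGS = {"br", "hr", "img"}        # elements that never have a closing tag
DROP_CONTENT = {"script", "style"}     # drop the tag and everything inside it
SAFE_SCHEMES = ("http://", "https://", "mailto:")


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a DROP_CONTENT element

    def handle_starttag(self, tag, attrs):
        # HTMLParser has already lowercased tag and attribute names.
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, set()):
                continue  # drops onclick, style, and anything else not listed
            value = value or ""
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue  # drops javascript: and other unlisted URL schemes
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in ALLOWED_TAGS or tag in VOID_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is re-escaped so it cannot be read back as markup.
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = Sanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

For example, sanitize('<P onclick="x()">Hi</P>') gives '<p>Hi</p>': the uppercase tag is normalized and the event-handler attribute is dropped. Tags that are not on the list are removed but their text is kept, while script and style are removed together with their contents.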
