
sanitizing html


I created a program that downloads web pages, parses the HTML, removes dangerous tags and attributes, and lets the user specify, to some extent, what to keep. There are HTML sanitizers out there, but the ones I tried (with ASP.NET pages) were defective.

Unfortunately, I wrote the program without a parsing guide to HTML. I just took the tags I knew, pushed them onto a stack, then popped the stack, and so on.

I would think w3schools has a guide to all of HTML that I could just feed into a parser, perhaps indicating which tags are dangerous. So my question is: where is that guide?



There's a reference of every single HTML tag here: http://www.w3schools.com/tags/default.asp


The rules for parsing HTML are pretty complicated because it allows so many unusual syntaxes: some tags don't have closing tags, element and attribute names aren't case-sensitive, and attribute values may or may not be wrapped in quotation marks. Building an HTML parser from scratch would require a lot of work.
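
To illustrate, here is a rough sketch of letting an existing lenient parser deal with those quirks and only writing the filtering part yourself. It uses Python's standard `html.parser` module, which is just my choice for the example (the original program was not written in Python). The allowlists and the `Sanitizer` and `sanitize` names are made up for illustration, and this is nowhere near a vetted security policy.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlists for illustration only; a real policy needs review.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "a", "br", "ul", "ol", "li"}
ALLOWED_ATTRS = {"a": {"href"}}
VOID_TAGS = {"br"}                       # elements that never have a closing tag
DROP_WITH_CONTENT = {"script", "style"}  # discard the tag and everything inside


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names and strips
        # quotes from attribute values before calling this method.
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, set()):
                continue
            # Only plain web links survive; javascript: and friends are dropped.
            if name == "href" and not value.strip().lower().startswith(
                ("http://", "https://")
            ):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = Sanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


dirty = (
    "<P onclick=evil()>Hi <B>there</B><script>alert(1)</script>"
    '<A HREF=http://example.com>x</A><a href="javascript:alert(1)">y</a></P>'
)
print(sanitize(dirty))
# <p>Hi <b>there</b><a href="http://example.com">x</a><a>y</a></p>
```

Note that this keeps only what is explicitly allowed rather than trying to list everything dangerous, which tends to fail more safely when an unfamiliar tag or attribute shows up. It does not repair badly nested markup, so for anything facing real users a maintained sanitizer library is probably still the better option.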

