Regex Book?


davej

I'm beginning to think that I need to buy a book on regular expressions. The little overviews never really seem to cover it in enough depth and detail, and they always taint it with a particular language or application. My JavaScript book gives this example:

var pattern = /sherlock/;
pattern.ignoreCase = true;
pattern.global = true;

But isn't it possible to define case-insensitivity directly in the pattern, along with making it global? If you're reading about regex in JS it is presented one way, and if you are reading about regex and grep it is presented a different way. I'm thinking that regex is used to:

  • return strings that match a pattern
  • return strings that contain matches to a pattern
  • return strings that contain substitutions or reformatting defined by a pattern
  • return a boolean-equivalent when it finds a match to a pattern

What else? Thanks.

For the literal regex syntax, modifiers go after the pattern:

var pattern = /sherlock/gi;

Regular expressions are all about working with patterns. You can verify if a pattern exists in a string, return the matched parts, replace them, etc.
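
As a rough sketch of the uses listed in the question, the snippet below runs one pattern with the g and i flags through test, match, and replace. The sample string and the variable names are made up for illustration.

```javascript
// Sample text is invented for this illustration.
var text = "Sherlock met SHERLOCK; sherlock was not amused.";

// Flags follow the closing slash: g = global, i = ignore case.
var pattern = /sherlock/gi;

// The same pattern built with the RegExp constructor (flags as a string).
var samePattern = new RegExp("sherlock", "gi");

// Boolean check. A fresh non-global pattern avoids lastIndex carry-over.
var found = /sherlock/i.test(text);

// All matched substrings (an array, or null when nothing matches).
var matches = text.match(pattern);

// Substitution: every match is replaced because of the g flag.
var replaced = text.replace(pattern, "Holmes");

console.log(found, matches, replaced);
```

The boolean check uses a separate pattern without the g flag because a global pattern remembers where its last test() call stopped, which can surprise you when the same object is reused.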

Regular expressions are all about working with patterns. You can verify if a pattern exists in a string, return the matched parts, replace them, etc.
Well, the thing that annoys me is that the basic regex rules are presented as being fairly simple, but then when you try to use them they don't work on anything but the simplest possible situations. The real code examples are often very long and totally indecipherable (to me).

Regular expressions are complicated until you learn the syntax. As for the ignoreCase and global properties, they are read-only, so they're there for reading rather than writing:

if (pattern.ignoreCase) {
  alert("Case-insensitive search.");
} else {
  alert("Case-sensitive search.");
}
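
A quick way to check this for yourself: an assignment to one of the flag properties is either silently ignored or, in strict mode, rejected with a TypeError. The sketch below guards the assignment and then builds a new pattern with the wanted flag instead. The variable names are only for illustration.

```javascript
var pattern = /sherlock/;

// Try to switch the flag on after creation. In sloppy mode this is
// silently ignored; in strict mode it throws a TypeError.
try {
  pattern.ignoreCase = true;
} catch (e) {
  // Read-only property: nothing to do.
}

var flagAfterAssignment = pattern.ignoreCase;    // unchanged by the assignment
var matchesUpperCase = pattern.test("SHERLOCK"); // still a case-sensitive test

// To change flags, create a new pattern from the old source.
var rebuilt = new RegExp(pattern.source, "i");
var rebuiltMatches = rebuilt.test("SHERLOCK");

console.log(flagAfterAssignment, matchesUpperCase, rebuiltMatches);
```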

Well, the thing that annoys me is that the basic regex rules are presented as being fairly simple, but then when you try to use them they don't work on anything but the simplest possible situations. The real code examples are often very long and totally indecipherable (to me).
Well, that's true. The basic rules are simple, but regular expressions are typically not used for simple situations. You don't need the overhead of the regular expression engine if you're just trying to find a particular string, for example. The power of regular expressions comes from being able to match complex patterns, which are in fact made up of a series of simple rules. When you put all of the rules together the result looks complex, but you can break down any complex pattern into basic parts.

Look at this example: http://www.regular-e...uddy/email.html

They use this pattern as an example:

\b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b

It may look complex, but it's only a series of basic rules. On that page you can highlight any character to see the description of what it's for.

Here's a real-world example: I use this JavaScript function to clean the HTML that gets pasted when people paste text from Word. You can break down each pattern to figure out what it's looking for.
// The name cleanWordHtml is only illustrative; the function was posted without a name.
function cleanWordHtml(str) {
  str = String(str);
  // comments, head-level tags, scripts and styles
  str = str.replace(/<!--[\s\S]*?-->/gi, '');
  str = str.replace(/<META[^>]*>/gi, '');
  str = str.replace(/<LINK[^>]*>/gi, '');
  str = str.replace(/<script[^>]*>[\s\S]*?<\/SCRIPT>/gi, '');
  str = str.replace(/<STYLE[^>]*>[\s\S]*?<\/STYLE>/gi, '');
  str = str.replace(/<o:p>[\s\S]*?<\/o:p>/g, "");
  // Word-specific style declarations
  str = str.replace(/\s*mso-[^:]+:[^;"]+;?/gi, "");
  str = str.replace(/\s*MARGIN: 0cm 0cm 0pt\s*;/gi, "");
  str = str.replace(/\s*MARGIN: 0cm 0cm 0pt\s*"/gi, "\"");
  str = str.replace(/\s*TEXT-INDENT: 0cm\s*;/gi, "");
  str = str.replace(/\s*TEXT-INDENT: 0cm\s*"/gi, "\"");
  str = str.replace(/\s*PAGE-BREAK-BEFORE: [^\s;]+;?"/gi, "\"");
  str = str.replace(/\s*FONT-VARIANT: [^\s;]+;?"/gi, "\"");
  str = str.replace(/\s*tab-stops:[^;"]*;?/gi, "");
  str = str.replace(/\s*tab-stops:[^"]*/gi, "");
  str = str.replace(/\s*face="[^"]*"/gi, "");
  str = str.replace(/\s*face=[^ >]*/gi, "");
  str = str.replace(/\s*FONT-FAMILY:[^;"]*;?/gi, "");
  str = str.replace(/<(\w[^>]*) class=([^ |>]*)([^>]*)/gi, "<$1$3");
  // to accommodate ExtJS HTML editor, replace text-align style attributes of divs
  str = str.replace(/<div[\s]* style="(text-align:[\s]*[^;"]*)">/gi, '<div x-style="$1">');
  // remove style attributes
  str = str.replace(/<(\w[^>]*) style="([^\"]*)"([^>]*)/gi, "<$1$3");
  // switch divs back
  str = str.replace(/<div[\s]* x-style="(text-align:[\s]*[^;"]*)">/gi, '<div style="$1">');
  str = str.replace(/\s*style="\s*"/gi, '');
  str = str.replace(/<SPAN\s*[^>]*>\s* \s*<\/SPAN>/gi, ' ');
  str = str.replace(/<SPAN\s*[^>]*><\/SPAN>/gi, '');
  str = str.replace(/<(\w[^>]*) lang=([^ |>]*)([^>]*)/gi, "<$1$3");
  str = str.replace(/<\\?\?xml[^>]*>/gi, "");
  str = str.replace(/<\/?\w+:[^>]*>/gi, "");
  // headings: drop empty ones, strip the opening tags, turn closing tags into breaks
  str = str.replace(/<H\d>\s*<\/H\d>/gi, '');
  str = str.replace(/<H1([^>]*)>/gi, '');
  str = str.replace(/<H2([^>]*)>/gi, '');
  str = str.replace(/<H3([^>]*)>/gi, '');
  str = str.replace(/<H4([^>]*)>/gi, '');
  str = str.replace(/<H5([^>]*)>/gi, '');
  str = str.replace(/<H6([^>]*)>/gi, '');
  str = str.replace(/<\/H\d>/gi, '<br>'); // remove this to take out breaks where Heading tags were
  // unwrap SPAN and FONT tags; repeated to handle nesting
  str = str.replace(/<SPAN\s*>([\s\S]*?)<\/SPAN>/gi, '$1');
  str = str.replace(/<FONT[^>]*>([\s\S]*?)<\/FONT>/gi, '$1');
  str = str.replace(/<SPAN\s*>([\s\S]*?)<\/SPAN>/gi, '$1');
  str = str.replace(/<FONT[^>]*>([\s\S]*?)<\/FONT>/gi, '$1');
  str = str.replace(/<SPAN\s*>([\s\S]*?)<\/SPAN>/gi, '$1');
  str = str.replace(/<FONT[^>]*>([\s\S]*?)<\/FONT>/gi, '$1');
  // collapse formatting tags that wrap a single space, then drop empty tags
  str = str.replace(/<(U|I|STRIKE|B|P)> <\/\1>/gi, ' ');
  str = str.replace(/<(U|I|STRIKE|B|P)> <\/\1>/gi, ' ');
  str = str.replace(/<(U|I|STRIKE|B|P)> <\/\1>/gi, ' ');
  str = str.replace(/<([^\s>]+)[^>]*>\s*<\/\1>/gi, '');
  str = str.replace(/<([^\s>]+)[^>]*>\s*<\/\1>/gi, '');
  str = str.replace(/<([^\s>]+)[^>]*>\s*<\/\1>/gi, '');
  str = str.replace(/<(U|I|STRIKE|B|P)> <\/\1>/gi, ' ');
  str = str.replace(/<(U|I|STRIKE|B|P)> <\/\1>/gi, ' ');
  str = str.replace(/<(U|I|STRIKE|B|P)> <\/\1>/gi, ' ');
  str = str.replace(/<SPAN\s*>([\s\S]*?)<\/SPAN>/gi, '$1');
  str = str.replace(/<FONT[^>]*>([\s\S]*?)<\/FONT>/gi, '$1');
  str = str.replace(/size\s*=\s*([\d]{1})/gi, '');
  return str;
}
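
To make the "series of basic rules" point concrete, here is a small sketch that rebuilds the email pattern quoted earlier in this post from named pieces. The variable names and the sample address are invented for the illustration. The pattern is the example from the linked page, not a complete email validator (for one thing, {2,4} rejects longer top-level domains).

```javascript
// Each piece is one simple rule; the names are invented for this sketch.
var wordBoundary = "\\b";
var localPart    = "[A-Z0-9._%-]+"; // one or more letters, digits, . _ % -
var atSign       = "@";
var domain       = "[A-Z0-9.-]+";   // one or more letters, digits, . -
var dot          = "\\.";           // a literal dot
var topLevel     = "[A-Z]{2,4}";    // two to four letters

var emailPattern = new RegExp(
  wordBoundary + localPart + atSign + domain + dot + topLevel + wordBoundary,
  "i" // the character classes only list A-Z, so ignore case
);

var sample = "Write to john.doe@example.com for details."; // made-up address
var result = sample.match(emailPattern);

console.log(emailPattern.source);
console.log(result ? result[0] : "no match");
```

Read this way, the long pattern is just six short rules placed end to end, and each replace call in the function above can be taken apart in the same manner.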

If you really want to learn about what regular expressions actually are (including why they are called "regular expressions"), then you should look for a book on discrete structures or regular languages, such as "Introduction to the Theory of Computation" by Michael Sipser. This might be a bit of overkill for your purposes though...

P.S.: the technical way to make that regular expression case-insensitive is to do:

(s|S)(h|H)(e|E)(r|R)(l|L)(o|O)(c|C)(k|K)

Of course, JavaScript has features that make this easier for you :P.

P.P.S.: I doubt you'll find a whole book on regular expressions, though.
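
For the curious, a small sketch that checks the spelled-out pattern from the P.S. against the i flag on a few made-up inputs. The variable names are only for illustration.

```javascript
// The pattern from the P.S., written as a JavaScript literal.
var spelledOut = /(s|S)(h|H)(e|E)(r|R)(l|L)(o|O)(c|C)(k|K)/;

// The shorter form using the ignore-case flag.
var withFlag = /sherlock/i;

var samples = ["sherlock", "SHERLOCK", "ShErLoCk", "watson"]; // made-up inputs

// true when both patterns give the same answer for every sample
var agree = samples.every(function (s) {
  return spelledOut.test(s) === withFlag.test(s);
});

console.log(agree);
```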

Archived

This topic is now archived and is closed to further replies.
