Regular Expression

iwato · May 1, 2010

QUESTION: The following pair of Find and Replace commands is used to perform the transformation below. What I do not understand is how these two commands achieve their goal. Could someone provide a clear explanation of how the transformation takes place?

What I understand is that \2 refers to the second sub-pattern, and \1 refers to the first. I also understand that the first sub-pattern refers to the entirety of one line excepting the end-of-line marker. What I do not understand is how the second sub-pattern is used to obtain the transformation.

By the way, the transformation has been tested and clearly works.

REGULAR EXPRESSIONS: The set of find and replace commands used to achieve the task is the following:

FIND: (.+) ([^ ]+)$
REPLACE: \2, \1

THE DATA: Consider briefly the following set of data:

Junior X. Potter
Jill Safai
Dylan Schuyler Goode
Walter Wang

Assume that a line break appears at the end of each line except for the last line.

THE TRANSFORMED DATA: The goal is to produce the following reordering of each name.

Potter, Junior X.
Safai, Jill
Goode, Dylan Schuyler
Wang, Walter

Once again, it works.

Roddy

jeffman · May 1, 2010

FIND: (.+) ([^ ]+)$
REPLACE: \2, \1

FIND: Any character (the dot) repeated one or more times (the +). Capture that (the parens) and store it in back-reference 1. Those characters must be followed by a space. The space must be followed by, and the string must end with (that's the $), a class of characters (the []) that DOES NOT (^ when it's inside []) include a space, repeated one or more times (the +). Capture that last bit and store it in back-reference 2.

REPLACE: Replace the entire string we matched with back-reference 2, a comma, a space, and back-reference 1.
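The same find and replace can be tried outside the editor. The following is only a sketch in Python's re module, which is not necessarily the engine TextWrangler uses; the function name reorder_names and the re.MULTILINE flag (so that $ matches at the end of each line) are assumptions made for the illustration.

```python
import re

# Find pattern from the thread. re.MULTILINE is an assumption: it makes "$"
# match at the end of every line, as the editor appears to do.
FIND = re.compile(r"(.+) ([^ ]+)$", re.MULTILINE)

# Replace pattern from the thread: back-reference 2, comma, space, back-reference 1.
REPLACE = r"\2, \1"


def reorder_names(text: str) -> str:
    """Hypothetical helper: move the last word of each line to the front."""
    return FIND.sub(REPLACE, text)


data = "Junior X. Potter\nJill Safai\nDylan Schuyler Goode\nWalter Wang"
print(reorder_names(data))
```

Because the first group is greedy, it takes as much of the line as it can and then gives characters back until a space followed by a run of non-space characters reaches the end of the line, so back-reference 2 ends up holding only the last word.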

iwato · May 2, 2010

FOLLOW-UP QUESTIONS: Please review the following statements and state whether each is true or false. If it is false, please explain why.

1) The space between the two sets of parentheses in the Find expression is neither arbitrary nor convenient; rather, it is necessary, and the result would be different if there were, say, three spaces instead of one.

2) Although the dot refers to any character except line breaks, it does not refer to characters that create white space.

Roddy

P.S. I got this example from the TextWrangler user's manual and am now able to recommend TextWrangler to others who do a lot of work with highly structured data and code. Thank you for turning me on to it.

jeffman · May 2, 2010

1. Yes. When you see a character without some kind of modifier (like an escape, or a + or a ?, etc.) it is a character literal. One space = one space. It must be there, and there must be exactly one. The fact that it is a space changes nothing. In the guts of this thing, it's an ASCII 32, no different from ASCII 33 or ASCII 34.

2. Hmm. You are correct that a dot is defined as any character except a line break (\r or \n). White space can mean many things. Normally it is specified by \s, which covers spaces, tabs, and line breaks. Word boundaries, meaning the positions between a word character and a non-word character (white space, a comma, the start or end of the string), are specified by \b, which matches a position rather than a character. So there are a lot of variations to get just what you need. And of course a literal \r means a carriage return (ASCII 13) and no other character, \n is a linefeed (ASCII 10), and that applies generally. But my experience is that a dot will match a space (ASCII 32). So I guess that's a no?
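Both answers can be checked with a few one-line tests. This is a sketch in Python's re module, used here as a stand-in; other engines may treat line breaks slightly differently, and the variable names are made up for the example.

```python
import re

# Statement 2: the dot matches a space and a tab, but not a line feed
# (in Python's default mode).
dot_matches_space = re.fullmatch(r".", " ") is not None
dot_matches_tab = re.fullmatch(r".", "\t") is not None
dot_matches_newline = re.fullmatch(r".", "\n") is not None

# Statement 1: the literal space between the groups is required exactly as
# written. One space in the pattern matches a name with one space; three
# literal spaces in the pattern do not.
one_space = re.compile(r"(.+) ([^ ]+)$")
three_spaces = re.compile(r"(.+)   ([^ ]+)$")
name = "Jill Safai"

print(dot_matches_space, dot_matches_tab, dot_matches_newline)
print(one_space.search(name) is not None)
print(three_spaces.search(name) is not None)

# Because the dot matches spaces, group 1 can hold several words.
first_group = one_space.search("Dylan Schuyler Goode").group(1)
print(first_group)
```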

iwato · May 2, 2010

1. Yes.
2. So I guess that's a no?

OK. Now to question three, and my understanding should be complete.

QUESTION THREE: Does the $ symbol after the second sub-pattern indicate the end of the sub-pattern or the end of the line from which the matched sub-pattern is drawn? I ask this question because the replace operation yields very different results when the $ symbol is omitted. In effect, it appears as if the end-of-line marker were included in the second sub-pattern when the $ is omitted.

Roddy

boen_robot · May 2, 2010

3. It's implementation defined, unfortunately. Often, languages that let you use regular expressions provide settings for this. A "multi-line" mode controls what the "^" and "$" mean: with it enabled they mean "start of line" and "end of line", and with it disabled they mean "start of string" and "end of string". A separate setting, often called "single-line" or "dot-all" mode, controls whether the "." matches new lines as well.

If, in your experience, the "." doesn't match a new line, then most likely the "$" works on a per-line basis too. If you get a different result with and without it, that's likely because the class [^ ] excludes only the space, so it can match a line break as well. Without "$", nothing stops the second sub-pattern from running past the end of the line and into the first word of the next line; with it, the second sub-pattern has to stop where the line ends. In combination with the fact that the "." matches white space too, you're essentially matching "everything but the last" and "the last".
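To see what the "$" changes, here is a small check in Python's re module. It is only a sketch: Python is an assumed stand-in for the editor's engine, and re.MULTILINE is Python's name for the per-line behaviour of "^" and "$".

```python
import re

data = "Junior X. Potter\nJill Safai"

# With "$" in multi-line mode, group 2 must stop at the end of the line.
with_dollar = re.search(r"(.+) ([^ ]+)$", data, re.MULTILINE)

# Without "$", the class [^ ] excludes only the space, so it also matches
# the line feed and runs on into the first word of the next line.
without_dollar = re.search(r"(.+) ([^ ]+)", data)

# With "$" but without multi-line mode, "$" only matches at the end of the
# whole string, so the first match found is on the last line.
end_of_string = re.search(r"(.+) ([^ ]+)$", data)

print(repr(with_dollar.group(2)))     # 'Potter'
print(repr(without_dollar.group(2)))  # 'Potter\nJill'
print(repr(end_of_string.group(2)))   # 'Safai'
```

In this engine at least, omitting the "$" does pull the end-of-line marker into the second sub-pattern, which is consistent with the behaviour described in question three.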
