rain13 Posted July 10, 2013

Hello. I am trying to write a very basic syntax highlighter. It has to be able to highlight keywords and strings quoted in different ways. Here is my JavaScript code:

    var codes = document.getElementsByClassName('code');
    for (var i = 0; i < codes.length; ++i) {
        var item = codes[i];
        alert(decodeHTMLEntities(item.innerHTML));
        item.innerHTML = item.innerHTML.replace(new RegExp('"([^"]*)"', 'g'), '<span class="string">$&</span>');
        //item.innerHTML = item.innerHTML.replace(new RegExp(keywords, 'g'), '<span class="keyword">$&</span>');
    }

Here is how it currently highlights:

    a = "I am string for test if escaping \"can break it..";

And this is how it should be highlighted:

    a = "I am string for test if escaping \"can break it..";

I would also like to know how I could avoid highlighting keywords inside strings.

I know there are plenty of ready-to-use highlighters available, but my intention is to get better at JavaScript and regex by writing at least a basic one myself, with the help of this forum.

Would anyone be so kind as to tell me how I could write a regex that does what I want?
justsomeguy Posted July 10, 2013

You need to use a lookahead or lookbehind in the regular expression to make sure that the ending quote does not have a backslash before it.
rain13 Posted July 10, 2013 Author

Thanks for the info. I'll see what I can dig up. This is the first time I have heard the terms lookahead and lookbehind.
rain13 Posted July 15, 2013 Author

Could anyone say why this regex doesn't do what I am trying to do?

    item.innerHTML = item.innerHTML.replace(new RegExp('"([^"]*)(?<!\\)"', 'g'), '<span class="string">$&</span>');

Source: http://www.regular-expressions.info/lookaround.html

I also tried this, without any result:

    RegExp('"([^"]*)?"', 'g')

Edited July 15, 2013 by SoItBegins
justsomeguy Posted July 15, 2013

You're still telling it to look for anything that is not a quote, so it's not going to match if it finds an escaped quote.
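One common way to let a match run past escaped quotes, sketched here as an alternative that nobody in the thread proposed, is to make the pattern consume every backslash together with the character that follows it. The names `DOUBLE_QUOTED` and `findDoubleQuoted` are hypothetical, and the sketch assumes escapes never span a line break.

```javascript
// Matches a double-quoted string. Inside the quotes, either take one
// character that is neither a quote nor a backslash, or take a backslash
// plus whatever character follows it (so \" never ends the string).
var DOUBLE_QUOTED = /"(?:[^"\\]|\\.)*"/g;

// Hypothetical helper: returns every double-quoted string found in source.
function findDoubleQuoted(source) {
    return source.match(DOUBLE_QUOTED) || [];
}

var sample = 'a = "I am string for test if escaping \\"can break it.."; b = "x";';
console.log(findDoubleQuoted(sample));
```

Because the backslash branch swallows the escaped character, no lookbehind is needed to decide whether a quote closes the string.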
rain13 Posted July 15, 2013 Author

I am not really used to regex. I am sorry to say it, but I didn't get much help from what you said.

I tried to match anything that is not a quote or that is an escaped quote:

    "(([^"]|\")*)(?<!\\)"

but that didn't help. Meanwhile I also tried these: http://blog.stevenlevithan.com/archives/match-quoted-string but they didn't seem to work either.
justsomeguy Posted July 15, 2013

It looks like JavaScript doesn't support lookbehinds. There are some workarounds: http://blog.stevenlevithan.com/archives/mimic-lookbehind-javascript
rain13 Posted July 15, 2013 Author

Thanks. I still don't get what I am doing wrong.

    item.innerHTML = item.innerHTML.replace(new RegExp('"([^"]|(\")*)?"', 'g'), '<span class="string">$&</span>');

The code above is based on this example:

    // Mimic leading, negative lookbehind like replace(/(?<!es)t/g, 'x')
    var output = 'testt'.replace(/(es)?t/g, function($0, $1){
        return $1 ? $0 : 'x';
    });

Edit: Could you tell me if this part of the regex is correct: ([^"]|(\")*)?

Edited July 15, 2013 by SoItBegins
justsomeguy Posted July 15, 2013

The asterisk should be outside the parentheses. The parentheses hold the entire pattern that you want to match 0 or more times.
rain13 Posted July 16, 2013 Author

I tried

    "([^"]|(\"))*\\?"

but it still has no effect. Could you tell me what else is wrong with it?

I am also wondering how much slower or faster it would be if I built the highlighter manually, using indexOf() to find the first quote and then a while loop to find the next quote that is not escaped.

Edited July 16, 2013 by SoItBegins
justsomeguy Posted July 16, 2013

That pattern is saying that the backslash is optional; that's what a question mark quantifier means. It means 0 or 1 times, so you're telling it to look for 0 or 1 backslashes before the quote.

The author of that blog post points out that those workarounds only work with the replace method, and only in certain situations. You might want to use the reversal method, where you reverse the string, write the regular expression backwards, and then use a lookahead. For whatever reason, JavaScript regular expressions do support lookahead but not lookbehind.
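The reversal method could look roughly like the sketch below. It is only an illustration under stated assumptions: the names `reverse`, `REVERSED_STRING` and `findStringsByReversal` are hypothetical, only double-quoted strings are handled, and a string whose last character is an escaped backslash would be misread (the same weakness a plain one-character lookbehind has).

```javascript
// Reverse a string. Adequate for plain ASCII source code; it would split
// surrogate pairs, which this sketch ignores.
function reverse(text) {
    return text.split('').reverse().join('');
}

// In reversed text an escaped quote reads as a quote followed by a
// backslash, so "not preceded by a backslash" turns into the negative
// lookahead (?!\\). Quotes that ARE followed by a backslash stay inside.
var REVERSED_STRING = /"(?!\\)(?:[^"]|"(?=\\))*"(?!\\)/g;

// Hypothetical helper: returns the double-quoted strings in source order.
function findStringsByReversal(source) {
    var matches = reverse(source).match(REVERSED_STRING) || [];
    // Matches are found back to front, so un-reverse each one and the list.
    return matches.map(function (m) { return reverse(m); }).reverse();
}

console.log(findStringsByReversal('a = "x \\"y\\" z"; b = "w";'));
```

The extra reversing makes this harder to read than a forward pattern, which is part of why it is usually treated as a workaround rather than a first choice.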
rain13 Posted July 16, 2013 Author

Thanks for the tip. I finally got it working (almost), but it seems to be a little too greedy:

    item.innerHTML = item.innerHTML.replace(new RegExp('"([^"]|(\"))*"', 'g'), '<span class="string">$&</span>');

It highlights it like this:

    a = "I am string for test if escaping \"can 'break' it..";
    b = 'I am string for test if escaping \'can "break" it..';

So I found this:

    {x}  Repeat the previous character, set or group exactly x times.

but "([^"]|(\"))*"{1} doesn't tell it to stop at the first match.

Edited July 16, 2013 by SoItBegins
justsomeguy Posted July 16, 2013

No, it tells it to match exactly one quote character (which doesn't really add anything to the pattern; it's already matching one quote). You make a quantifier non-greedy by putting a question mark after it, e.g.:

    "([^"]|(\"))*?"

That means it will match as little as possible while still producing a match.
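The difference between a greedy and a non-greedy quantifier is easiest to see on a line with two strings. This is a minimal sketch with made-up sample text; it deliberately ignores escaped quotes.

```javascript
var text = 'x = "one"; y = "two";';

// Greedy: .* grabs as much as it can, so the match runs from the first
// quote to the last quote on the line.
var greedy = text.match(/".*"/g);

// Non-greedy: .*? grabs as little as it can, so each match stops at the
// first closing quote it reaches.
var lazy = text.match(/".*?"/g);

console.log(greedy); // [ '"one"; y = "two"' ]
console.log(lazy);   // [ '"one"', '"two"' ]
```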
rain13 Posted July 16, 2013 Author

I've tried that too (I just didn't post here that it didn't work). It stopped at \".

Edited July 16, 2013 by SoItBegins
justsomeguy Posted July 16, 2013

It does work to make the match non-greedy. It doesn't solve your issue because you still need a lookbehind, or a reverse/lookahead. You need to tell it to stop at a quote that does not have a backslash before it, which requires a lookbehind or a workaround.
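JavaScript engines that implement ES2018 do accept lookbehind assertions, unlike the browsers of 2013. Assuming such an engine, "stop at a quote that does not have a backslash before it" can be written directly. The names `UNESCAPED_CLOSE` and `findStringsWithLookbehind` are hypothetical, and the sketch still misreads a string that ends in an escaped backslash.

```javascript
// ASSUMPTION: the engine supports ES2018 lookbehind; older engines reject
// this regular expression with a syntax error.
// [\s\S]*? lazily takes any characters (including line breaks) until it
// reaches a quote that is not immediately preceded by a backslash.
var UNESCAPED_CLOSE = /"[\s\S]*?(?<!\\)"/g;

// Hypothetical helper: returns every double-quoted string found in source.
function findStringsWithLookbehind(source) {
    return source.match(UNESCAPED_CLOSE) || [];
}

console.log(findStringsWithLookbehind('a = "x \\"y\\" z"; b = "w";'));
```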
rain13 Posted July 17, 2013 Author

This is the workaround for lookbehind:

    "([^"]|(\"))*"

It goes past \" but continues until the last non-escaped quote. This:

    "([^"]|(\"))*?"

is the non-greedy version of it, which stops at \". How do I combine these two so that it stops at the first non-escaped quote?
justsomeguy Posted July 17, 2013

Quote (rain13): This is workaround for lookbehind: "([^"]|(\"))*"

What makes you think that pattern is a workaround for a negative lookbehind? It looks to me that the pattern will match a quote, then anything that is not a quote, or an escaped quote, then an escaped quote. That's not a negative lookbehind. Again, I would suggest the reversal method: reverse the target string, write the regular expression backwards, and use a lookahead.
rain13 Posted July 20, 2013 Author

Quote (justsomeguy): What makes you think that pattern is a workaround for a negative lookbehind?

I got it from some site that explained it for JS. That's the best I could come up with; I am no regex expert.

So instead I invented this:

    function Highlight() {
        var keywords = "if|then|else|end|endif|function|string|short|unsigned|int|double|float|char|include|for|while|goto|const|void|return";
        var codes = document.getElementsByClassName('code');
        for (var i = 0; i < codes.length; ++i) {
            var item = codes[i];
            //alert(decodeHTMLEntities(item.innerHTML));
            // decodeHTMLEntities() is a helper defined elsewhere (not shown in this post)
            var decoded = decodeHTMLEntities(item.innerHTML);
            //item.innerHTML = decoded.replace(new RegExp('"([^"])*?(?!"$)"', ''), '<span class="string">$&</span>');
            // split in front of every quote character, so each quote starts a token
            var tokens = decoded.split(/(?=["'])/g);
            //alert(tokens.length);
            var quote = "";
            for (var j = 0; j < tokens.length; ++j) {
                var first = tokens[j].charAt(0);
                if (first == '"' || first == "'") {
                    // ignore quotes escaped by a backslash at the end of the previous token
                    if (j > 0 && tokens[j-1].charAt(tokens[j-1].length-1) != "\\") {
                        if (quote == "") { // not in quote yet
                            quote = first;
                            tokens[j] = '<span class="string">' + tokens[j];
                        } else if (first == quote) { // found the same quote we are currently in
                            quote = "";
                            // close the span right after the closing quote character
                            tokens[j] = first + "</span>" + tokens[j].slice(1);
                        }
                    }
                }
                if (quote == "") {
                    // outside strings: split on punctuation and spaces, then mark keywords
                    var unquoted_tokens = tokens[j].split(/(?=[*()\[\]{} ])/g);
                    for (var k = 0; k < unquoted_tokens.length; ++k) {
                        unquoted_tokens[k] = unquoted_tokens[k].replace(new RegExp(keywords, 'g'), '<span class="keyword">$&</span>');
                    }
                    tokens[j] = unquoted_tokens.join("");
                }
            }
            item.innerHTML = tokens.join("");
        }
    }

I am just wondering what your opinion of it is. How is its speed compared to regex?

Here's the result:

    int compare (const void * a, const void * b)
    {
    a = "I am string for
    test if escaping \"can 'break' it..";
    b = 'I am string for
    test if escaping \'can "break" it..';
    return ( *(int*)a - *(int*)b );
    }

Edited July 20, 2013 by SoItBegins
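For comparison, a single regular expression can handle strings and keywords in one pass if the string alternative is tried first, which also answers the earlier question about keywords inside strings. This is only a sketch, not the method used in the thread: `TOKEN` and `highlight` are hypothetical names, the keyword list is shortened, and the input is assumed to be plain text with no HTML markup of its own.

```javascript
// Group 1: a quoted string in either quote style, with backslash escapes
// consumed as pairs. Group 2: a whole-word keyword.
// The string alternative comes first, so any keyword sitting inside a
// string is swallowed as part of the string and never matched on its own.
var TOKEN = /("(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*')|\b(if|else|for|while|return|int|const|void)\b/g;

// Hypothetical helper: wraps strings and keywords in <span> elements.
function highlight(source) {
    return source.replace(TOKEN, function (match, str) {
        // str is undefined when the keyword alternative matched instead
        var cls = str !== undefined ? 'string' : 'keyword';
        return '<span class="' + cls + '">' + match + '</span>';
    });
}

console.log(highlight('if (x) return "if \\"a\\"";'));
```

Whether this is faster than the token loop above would need to be measured; the sketch makes no claim about speed.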
justsomeguy Posted July 22, 2013

I don't know how fast that is relative to something else, but you can benchmark it if you want to test it. Language syntax parsing and highlighting is not an easy thing to do, and I doubt you're going to be able to find regular expressions that can encapsulate all of the rules for syntax parsing. Yours has issues, though. For example, a line break at the end of a string is not valid: JavaScript does not support multi-line strings unless the end of the line is escaped. It doesn't look like you're necessarily trying to parse JavaScript code, though. But like I said, parsing a language is not a trivial task.
rain13 Posted July 23, 2013 Author

I was just trying to make a generic highlighter that would work for any language. That's why I only have strings and keywords. This way the same code would work with C/C++, PHP, JS, BASIC, Lua and so on. All those languages have keywords such as for, while, do, if and so on. And string highlighting that supports line breaks and escaping would also work on languages that don't allow these.

But anyway, thanks for taking the time to read my posts and give hints about regex.

Edited July 23, 2013 by SoItBegins