Need some regex help with basic syntax highlighter.

rain13 · July 10, 2013

Hello.

I am trying to write a very basic syntax highlighter. It has to be able to highlight keywords and strings quoted in different ways.

Here is my JavaScript code:

	var codes = document.getElementsByClassName('code');
	for (var i = 0; i < codes.length; ++i) {
		var item = codes[i];
		alert(decodeHTMLEntities(item.innerHTML));
		item.innerHTML = item.innerHTML.replace(new RegExp('"([^"]*)"', 'g'), '<span class="string">$&</span>');
		//item.innerHTML = item.innerHTML.replace(new RegExp(keywords, 'g'), '<span class="keyword">$&</span>');
	}

Here is how it currently highlights:

	a = "I am string for test if escaping \"can break it..";

And this is how it should be highlighted:

	a = "I am string for test if escaping \"can break it..";

I would also like to know how I could avoid highlighting keywords inside strings.

I know there are plenty of ready-to-use highlighters available, but my intention is to get better at JavaScript and regex by writing at least a basic one myself, with the help of this forum. Would anyone be so kind as to tell me how I could write a regex that does what I want?

justsomeguy · July 10, 2013

You need to use a lookahead or lookbehind in the regular expression to make sure that the ending quote does not have a backslash before it.

rain13 · July 10, 2013

Thanks for the info. I'll see what I can dig up. This is the first time I have heard the terms lookahead and lookbehind.

rain13 · July 15, 2013

Could anyone say why this regex doesn't do what I am trying to do?

	item.innerHTML = item.innerHTML.replace(new RegExp('"([^"]*)(?<!\\)"', 'g'), '<span class="string">$&</span>');

Source: http://www.regular-expressions.info/lookaround.html

I also tried this, without any result:

RegExp('"([^"]*)?"', 'g')

Edited July 15, 2013 by SoItBegins

justsomeguy · July 15, 2013

You're still telling it to look for anything that is not a quote, so it's not going to match if it finds an escaped quote.

rain13 · July 15, 2013

I am not quite used to regex. I am sorry to say it, but I didn't really get much help from what you said.

I tried to match anything that is not a quote or that is an escaped quote: "(([^"]|\")*)(?<!\\)" but that didn't help. Meanwhile I also tried these: http://blog.stevenlevithan.com/archives/match-quoted-string but they didn't seem to work either.

justsomeguy · July 15, 2013

It looks like Javascript doesn't support lookbehinds. There are some workarounds:

http://blog.stevenlevithan.com/archives/mimic-lookbehind-javascript
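
A note for anyone reading this on a current engine: lookbehind assertions were added to the language in ECMAScript 2018, so where `(?<!...)` is available the "closing quote not preceded by a backslash" idea can be written directly. Below is a minimal sketch under that assumption. The name `findDoubleQuoted` is made up for illustration, and the pattern does not handle an escaped backslash directly before the closing quote.

```javascript
// Hypothetical helper, not from the thread. Assumes ES2018 lookbehind support.
function findDoubleQuoted(source) {
  // Opening quote, then as few characters as possible, then a quote
  // that is NOT immediately preceded by a backslash.
  var pattern = /"[\s\S]*?(?<!\\)"/g;
  return source.match(pattern) || [];
}

var sample = 'a = "I am string for test if escaping \\"can break it..";';
console.log(findDoubleQuoted(sample));
```

On engines without lookbehind the regex literal itself is a syntax error, so this only applies where ES2018 can be assumed.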

rain13 · July 15, 2013

Thanks. I still don't get what I am doing wrong.

	item.innerHTML = item.innerHTML.replace(new RegExp('"([^"]|(\")*)?"', 'g'), '<span class="string">$&</span>');

The code above is based on this example:

	// Mimic leading, negative lookbehind like replace(/(?<!es)t/g, 'x')
	var output = 'testt'.replace(/(es)?t/g, function($0, $1){
		return $1 ? $0 : 'x';
	});

Edit: Could you tell me if this part of the regex is correct: ([^"]|(\")*)?

Edited July 15, 2013 by SoItBegins

justsomeguy · July 15, 2013

The asterisk should be outside the parentheses, the parens hold the entire pattern that you want to match 0 or more times.

rain13 · July 16, 2013

I tried "([^"]|(\"))*?" but still no effect. Could you tell me what else is wrong with it?

I am also wondering how much slower or faster it would be if I built the highlighter manually, using indexOf() to find the first quote and then a while loop to find the next quote that is not escaped.

Edited July 16, 2013 by SoItBegins

justsomeguy · July 16, 2013

That pattern is saying that the backslash is optional; that's what a question mark quantifier means. It means 0 or 1 times, so you're telling it to look for 0 or 1 backslashes before the quote. The author of the linked article points out that those workarounds only work with the replace method, and only in certain situations. You might want to use the reversal method, where you reverse the string, write the regular expression backwards, and then use a lookahead. For whatever reason, Javascript regular expressions do support lookahead, but not lookbehind.
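
The reversal method described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `reverseText` and `findQuotedStrings` are invented here, only double-quoted strings are handled, and an escaped backslash before a quote (`\\"`) is not treated specially. In the reversed text, "a quote preceded by a backslash" becomes "a quote followed by a backslash", which a lookahead can test.

```javascript
// Hypothetical helpers, not from the thread.
function reverseText(s) {
  return s.split('').reverse().join('');
}

function findQuotedStrings(source) {
  // Written "backwards": a quote not followed by a backslash, then any
  // non-quote characters or quotes that ARE followed by a backslash
  // (escaped quotes in the original), then a quote not followed by a backslash.
  var pattern = /"(?!\\)(?:[^"]|"(?=\\))*"(?!\\)/g;
  var reversedMatches = reverseText(source).match(pattern) || [];
  // Reverse each match back, and restore the original left-to-right order.
  return reversedMatches.map(reverseText).reverse();
}

var sample = 'a = "x \\"y\\" z"; b = "w";';
console.log(findQuotedStrings(sample));
```

The match positions come out relative to the reversed text, so wrapping the matches in spans would need the same reverse-and-restore step applied to the whole replaced string.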

rain13 · July 16, 2013

Thanks for the tip. I finally got it working (almost), but it seems to be a little too greedy:

	item.innerHTML = item.innerHTML.replace(new RegExp('"([^"]|(\"))*"', 'g'), '<span class="string">$&</span>');

It highlights it like this:

	a = "I am string for test if escaping \"can 'break' it.."; b = 'I am string for test if escaping \'can "break" it..';

So I found this:

	{x} Repeat the previous character, set or group exactly x times.

but "([^"]|(\"))*"{1} doesn't tell it to stop at the first match.

Edited July 16, 2013 by SoItBegins

justsomeguy · July 16, 2013

No, it tells it to match exactly one quote character (which doesn't really add anything to the pattern, it's already matching one quote). You make a certain quantifier non-greedy by putting a question mark after it, e.g.:

"([^"]|(\"))*?"

That means it will match as little as possible to still have a match.
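
The difference between a greedy and a non-greedy quantifier can be seen in a small, self-contained comparison. The sample text and variable names are illustrative only and contain no escaped quotes, so this shows the quantifier behaviour and nothing else.

```javascript
var text = 'x = "one"; y = "two";';

// Greedy: .* grabs as much as possible, so the match runs from the
// first quote to the LAST quote on the line.
var greedy = text.match(/".*"/g);

// Non-greedy: .*? grabs as little as possible, so each match ends at
// the next quote it meets.
var lazy = text.match(/".*?"/g);

console.log(greedy); // [ '"one"; y = "two"' ]
console.log(lazy);   // [ '"one"', '"two"' ]
```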

rain13 · July 16, 2013

I've tried that too (I just didn't post here that it didn't work). It stopped at \"

Edited July 16, 2013 by SoItBegins

justsomeguy · July 16, 2013

It does work to make the match non-greedy. It doesn't work for your issue because you still need a lookbehind, or a reverse/lookahead. You need to tell it to stop at a quote that does not have a backslash before it, which requires a lookbehind or a workaround.
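
One common workaround that needs no lookaround at all is to let the pattern consume a backslash together with whatever character follows it, so an escaped quote can never be mistaken for the closing quote. The sketch below is an illustration, not code from the thread: `highlightStrings` is a made-up name, and it assumes the input is plain text rather than HTML that already contains entities or tags.

```javascript
// Hypothetical helper. Handles both "..." and '...' strings.
function highlightStrings(text) {
  // Inside a string, accept either a character that is neither the
  // delimiter nor a backslash, or a backslash plus the character after it.
  var pattern = /"(?:[^"\\]|\\[\s\S])*"|'(?:[^'\\]|\\[\s\S])*'/g;
  return text.replace(pattern, '<span class="string">$&</span>');
}

var sample = 'a = "I am string for test if escaping \\"can break it..";';
console.log(highlightStrings(sample));
```

Because the backslash and the escaped character are consumed as a pair, a doubled backslash before the closing quote is also read correctly.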

rain13 · July 17, 2013

This is the workaround for lookbehind: "([^"]|(\"))*" which makes it go past \" but continues until the last non-escaped quote. This one, "([^"]|(\"))*?", is the non-greedy version of it, which stops at \". How do I combine these two so that it stops at the first non-escaped quote?

justsomeguy · July 17, 2013

This is workaround for lookbehind: "([^"]|(\"))*"

What makes you think that pattern is a workaround for a negative lookbehind? It looks to me that the pattern will match a quote, then anything that is not a quote, or an escaped quote, then an escaped quote. That's not a negative lookbehind. Again, I would suggest the reversal method of reversing the target string, writing the regular expression backwards, and using a lookahead.

rain13 · July 20, 2013

What makes you think that pattern is a workaround for a negative lookbehind?

I got it from some site that explained it for JS. So that's the best I could come up with... I am not an expert in regex.

So instead I invented this:

	function Highlight() {
		var keywords = "if|then|else|end|endif|function|string|short|unsigned|int|double|float|char|include|for|while|goto|const|void|return";
		var codes = document.getElementsByClassName('code');
		for (var i = 0; i < codes.length; ++i) {
			var item = codes[i];
			//alert(decodeHTMLEntities(item.innerHTML));
			var decoded = decodeHTMLEntities(item.innerHTML);
			//item.innerHTML = decoded.replace(new RegExp('"([^"])*?(?!"$)"', ''), '<span class="string">$&</span>');
			// Split in front of every quote, so each token after the first starts with a quote
			var tokens = decoded.split(/(?=["'])/g);
			//alert(tokens.length);
			var quote = "";
			for (var j = 0; j < tokens.length; ++j) {
				if (tokens[j].charAt(0) == '"' || tokens[j].charAt(0) == "'") {
					// Skip quotes escaped by a backslash at the end of the previous token
					if (j > 0 && tokens[j-1].charAt(tokens[j-1].length-1) != "\\")
					{
						if (quote == "") {//not in quote yet
							quote = tokens[j].charAt(0);
							tokens[j] = '<span class="string">' + tokens[j];
						} else {//in quote
							if (tokens[j].charAt(0) == quote) {//found quote == quote we are currently in?
								quote = '';
								// insert() is not a built-in String method; assumed to be a helper defined elsewhere
								tokens[j] = tokens[j].insert(1, "</span>");
							}
						}
					}
				}

				if (quote == "") {
					// Outside strings: split in front of punctuation and spaces, then mark keywords
					var unquoted_tokens = tokens[j].split(/(?=[*()[\]{} ])/g);
					for (var k = 0; k < unquoted_tokens.length; ++k) {
						unquoted_tokens[k] = unquoted_tokens[k].replace(new RegExp(keywords, 'g'), '<span class="keyword">$&</span>');
					}
					tokens[j] = unquoted_tokens.join("");
				}
			}
			item.innerHTML = tokens.join("");
		}
	}

Just wondering what your opinion of it is. How is its speed compared to regex?

Here's the result.

	int compare (const void * a, const void * b)
	{
		a = "I am string for
	test if escaping \"can 'break' it..";
		b = 'I am string for
	test if escaping \'can "break" it..';
		return ( *(int*)a - *(int*)b );
	}

Edited July 20, 2013 by SoItBegins

justsomeguy · July 22, 2013

I don't know how fast that is relative to something else; you can benchmark it if you want to test it, though. Language syntax parsing and highlighting is not an easy thing to do, and I doubt you're going to be able to find regular expressions that can encapsulate all of the rules for syntax parsing. Yours has issues, though. For example, a line break at the end of a string is not valid: Javascript does not support multi-line strings unless the end of the line is escaped. It doesn't look like you're necessarily trying to parse Javascript code, though. But like I said, parsing a language is not a trivial task.
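
For the narrower goal in this thread (strings plus keywords, with no keyword highlighting inside strings), one possible approach is a single pass with one alternation: string literals are tried first and consumed whole, so keywords inside them are never visited. The sketch below is illustrative only. The name `highlightCode` and the short keyword list are assumptions, and the input is assumed to be plain text, not HTML with entities.

```javascript
// Hypothetical helper, not from the thread.
function highlightCode(code) {
  // Group 1: a double- or single-quoted string with backslash escapes.
  // Group 2: a whole-word keyword (\b keeps "if" from matching inside "endif").
  var token = /("(?:[^"\\]|\\[\s\S])*"|'(?:[^'\\]|\\[\s\S])*')|\b(if|else|for|while|return)\b/g;
  return code.replace(token, function (match, str, keyword) {
    // Exactly one of the two groups participates in each match.
    var cls = str !== undefined ? 'string' : 'keyword';
    return '<span class="' + cls + '">' + match + '</span>';
  });
}

console.log(highlightCode('if (x) { s = "if else"; }'));
```

This still says nothing about comments, numbers, or language-specific string rules, which is where a real tokenizer becomes necessary.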

rain13 · July 23, 2013

I was just trying to make a generic highlighter that would work for any language. That's why I only have strings and keywords. Now the same code would work with C/C++, PHP, JS, Basic, Lua and so on. All those languages have keywords such as for, while, do, if and so on. And string highlighting that supports line breaks and escaping would also work on languages that don't allow these.

But anyway, thanks for taking the time to read my posts and give hints about regex.

Edited July 23, 2013 by SoItBegins
