
Need some regex help with basic syntax highlighter.


rain13


Hello.

 

I am trying to write a very basic syntax highlighter. It has to be able to highlight keywords and strings quoted in different ways.

 

Here is my JavaScript code:

	var codes = document.getElementsByClassName('code');
	for (var i = 0; i < codes.length; ++i) {
		var item = codes[i];
		alert(decodeHTMLEntities(item.innerHTML)); // debug output; decodeHTMLEntities is a helper not shown in this post
		// Wrap every run between two double quotes in a span
		item.innerHTML = item.innerHTML.replace(new RegExp('"([^"]*)"', 'g'), "<span class=\"string\">$&</span>");
		//item.innerHTML = item.innerHTML.replace(new RegExp(keywords, 'g'), "<span class=\"keyword\">$&</span>");
	}

Here is how it currently highlights:

 

a = "I am string for test if escaping \"can break it..";

And this is how it should be highlighted:

a = "I am string for test if escaping \"can break it..";

I would also like to know how I could avoid highlighting keywords inside strings.

I know there are plenty of ready-to-use highlighters available, but my intention is to get better at JavaScript and regex by writing at least a basic one myself, with the help of this forum.

Would anyone be so kind as to tell me how I could write a regex that does what I want?
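One way to keep keywords inside strings from being highlighted is to match strings and keywords with a single alternation, so that text already consumed as a string is never looked at again. The sketch below assumes that approach. `escapeHtml`, `highlightSource`, the `TOKEN` pattern and the shortened keyword list are hypothetical names chosen for illustration, not code from this thread.

```javascript
// Hypothetical helper: escape text before it is written to innerHTML.
function escapeHtml(text) {
  return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

var KEYWORDS = ['if', 'else', 'for', 'while', 'return']; // illustrative subset

// String alternatives come first, so a keyword inside a string is consumed
// as part of the string and never matched on its own.
var TOKEN = new RegExp(
  '"(?:[^"\\\\]|\\\\.)*"' +                  // double-quoted string, \" allowed inside
  "|'(?:[^'\\\\]|\\\\.)*'" +                 // single-quoted string, \' allowed inside
  '|\\b(?:' + KEYWORDS.join('|') + ')\\b',   // whole-word keywords
  'g'
);

function highlightSource(source) {
  var out = '';
  var last = 0;
  var m;
  TOKEN.lastIndex = 0;
  while ((m = TOKEN.exec(source)) !== null) {
    var text = m[0];
    var first = text.charAt(0);
    var cls = (first === '"' || first === "'") ? 'string' : 'keyword';
    out += escapeHtml(source.slice(last, m.index));   // plain text before the token
    out += '<span class="' + cls + '">' + escapeHtml(text) + '</span>';
    last = m.index + text.length;
  }
  return out + escapeHtml(source.slice(last));        // plain text after the last token
}
```

Because the function works on plain text and returns markup, it would be fed the decoded source and its result assigned to `innerHTML` in one step, rather than running several replace calls over `innerHTML` in sequence.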

 


Thanks. I still don't get what I am doing wrong. :(

item.innerHTML = item.innerHTML.replace(new RegExp('"([^"]|(")*)?"', 'g'), "<span class=\"string\">$&</span>");

The code above is based on this example:

// Mimic leading, negative lookbehind like replace(/(?<!es)t/g, 'x')
var output = 'testt'.replace(/(es)?t/g, function($0, $1){
	return $1 ? $0 : 'x';
});

Edit: Could you tell me if this part of the regex is correct: ([^"]|(")*)?

Edited by SoItBegins

I tried "([^"]|("))*?" but it still has no effect. Could you tell me what else is wrong with it? I'm also wondering how much slower or faster it would be if I built the highlighter manually, using indexOf() to find the first quote and then a while loop to find the next quote that is not escaped.
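The manual approach described here could look roughly like the sketch below. `findClosingQuote` is a hypothetical name, and the rule that an odd number of backslashes escapes the quote is an assumption about the language being highlighted.

```javascript
// Returns the index of the quote that closes the string opened at `start`,
// or -1 if the string is never closed.
function findClosingQuote(text, start) {
  var quote = text.charAt(start);
  var pos = text.indexOf(quote, start + 1);
  while (pos !== -1) {
    // Count the backslashes directly in front of the candidate quote.
    var slashes = 0;
    while (pos - 1 - slashes > start && text.charAt(pos - 1 - slashes) === '\\') {
      slashes++;
    }
    // An even count means the backslashes escape each other, not the quote.
    if (slashes % 2 === 0) {
      return pos;
    }
    pos = text.indexOf(quote, pos + 1);
  }
  return -1;
}
```

For the text `a = "x \"y\" z";` the call `findClosingQuote(text, 4)` skips both escaped quotes and returns the index of the final one.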

Edited by SoItBegins

That pattern says the slash is optional; that is what a question mark quantifier means. It means 0 or 1 times, so you're telling it to look for 0 or 1 slashes before the quote. He points out that those workarounds only work with the replace method, and only in certain situations. You might want to use the reversal method instead: reverse the string, write the regular expression backwards, and use a lookahead. For whatever reason, JavaScript regular expressions support lookahead but, in engines older than ES2018, not lookbehind.
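A minimal sketch of the reversal method, under the assumption that a quote counts as escaped whenever a single backslash precedes it (so a quote after an escaped backslash, as in `\\"`, would be misread). `reverseText` and `highlightStringsReversed` are hypothetical names.

```javascript
function reverseText(text) {
  return text.split('').reverse().join('');
}

function highlightStringsReversed(source) {
  // In reversed text an escaped quote \" reads as "\ , so "not preceded by a
  // backslash" becomes the negative lookahead (?!\\), which JavaScript supports.
  var pattern = /"(?!\\)(?:[^"]|"(?=\\))*?"(?!\\)/g;
  var marked = reverseText(source).replace(pattern, function (match) {
    // Tags are inserted backwards and swapped, because the text is flipped back below.
    return reverseText('</span>') + match + reverseText('<span class="string">');
  });
  return reverseText(marked);
}
```

The pattern only handles double-quoted strings; a single-quoted variant would need the same expression with the quote character exchanged.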


Thanks for the tip. I finally got it working (almost), but it seems to be a little too greedy:

item.innerHTML = item.innerHTML.replace(new RegExp('"([^"]|("))*"', 'g'), "<span class=\"string\">$&</span>");

It highlights it like this:

a = "I am string for test if escaping \"can 'break' it..";
b = 'I am string for test if escaping \'can "break" it..';

So I found this:

{x} Repeat the previous character, set or group exactly x times.

But "([^"]|("))*"{1} doesn't tell it to stop at the first match.

Edited by SoItBegins

No, it tells it to match exactly one quote character (which doesn't really add anything to the pattern, it's already matching one quote). You make a certain quantifier non-greedy by putting a question mark after it, e.g.:

 

"([^"]|("))*?"

 

That means it will match as little as possible to still have a match.
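The difference can be checked directly with the two patterns from this thread. The sample line and variable names below are made up for illustration.

```javascript
var line = 'a = "one"; b = "two";';

// Greedy: the group also accepts quotes, so * runs to the last quote on the line.
var greedy = line.match(/"([^"]|("))*"/g);

// Adding {1} after the closing quote changes nothing: it repeats that one quote once.
var withCount = line.match(/"([^"]|("))*"{1}/g);

// Lazy: *? stops at the first quote that lets the whole pattern succeed.
var lazy = line.match(/"([^"]|("))*?"/g);

console.log(greedy);    // one match spanning both strings
console.log(withCount); // same as greedy
console.log(lazy);      // two separate matches
```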


It does work to make the match non-greedy. It doesn't work for your issue because you still need a lookbehind, or a reverse/lookahead. You need to tell it to stop at a quote that does not have a slash before it, which requires a lookbehind or a workaround.
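One commonly used workaround, which is not the reversal method suggested in this thread, avoids looking behind altogether: inside the quotes, a backslash is always consumed together with the character after it, so an escaped quote can never be taken for the closing one. A sketch, with `STRING_LITERAL` and `wrapStrings` as hypothetical names:

```javascript
// Inside the quotes, accept either an ordinary character or a backslash
// together with whatever follows it, so \" and \' never end the match.
var STRING_LITERAL = /"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*'/g;

function wrapStrings(source) {
  return source.replace(STRING_LITERAL, function (match) {
    return '<span class="string">' + match + '</span>';
  });
}
```

Because `.` does not match a line break, a backslash at the very end of a line is not treated as an escape here; whether that is wanted depends on the languages being highlighted.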


This is a workaround for lookbehind: "([^"]|("))*", which makes it go past \" but continues until the last non-escaped quote. This one, "([^"]|("))*?", is the non-greedy version of it, which stops at \". How do I combine these two so that it stops at the first non-escaped quote?


 

 

This is workaround for lookbehind: "([^"]|("))*"

What makes you think that pattern is a workaround for a negative lookbehind? It looks to me that the pattern will match a quote, then anything that is not a quote, or an escaped quote, then an escaped quote. That's not a negative lookbehind. Again, I would suggest the reversal method of reversing the target string, writing the regular expression backwards, and using a lookahead.


What makes you think that pattern is a workaround for a negative lookbehind?

I got it from some site that explained it for JS. That's the best I could come up with... I am no regex expert. So instead I invented this:

function Highlight(){
	var keywords = "if|then|else|end|endif|function|string|short|unsigned|int|double|float|char|include|for|while|goto|const|void|return";
	var codes = document.getElementsByClassName('code');
	for (var i = 0; i < codes.length; ++i) {
		var item = codes[i];
		//alert(decodeHTMLEntities(item.innerHTML));
		var decoded = decodeHTMLEntities(item.innerHTML); // helper not shown in this post
		//item.innerHTML =  decoded.replace(new RegExp('"([^"])*?(?!"$)"', ''), "<span class=\"string\">$&</span>");
		// Split in front of every quote, so each token after the first starts with ' or "
		var tokens = decoded.split(/(?=["'])/g);
		//alert(tokens.length);
		var quote = "";
		for (var j = 0; j < tokens.length; ++j) {
			if(tokens[j].charAt(0)=='"' || tokens[j].charAt(0)=='\''){
				// Skip quotes escaped by a backslash at the end of the previous token
				if(j > 0 && tokens[j-1].charAt(tokens[j-1].length-1) != "\\")
				{
					if(quote == ""){//not in quote yet
						quote = tokens[j].charAt(0);
						tokens[j] = "<span class=\"string\">"+tokens[j];
					}else{//in quote
						if(tokens[j].charAt(0) == quote){//found quote == quote we are currently in?
							quote = '';
							// Close the span right after the closing quote character
							tokens[j] = tokens[j].charAt(0) + "</span>" + tokens[j].slice(1);
						}
					}
				}
			}
			if(quote==""){
				// Outside of strings: split in front of delimiters and mark keywords
				var unquoted_tokens = tokens[j].split(/(?=[*()\[\]{} ])/g);
				for (var k = 0; k < unquoted_tokens.length; ++k) {
					unquoted_tokens[k] = unquoted_tokens[k].replace(new RegExp(keywords, 'g'), "<span class=\"keyword\">$&</span>");
				}
				tokens[j] = unquoted_tokens.join("");
			}
		}
		item.innerHTML =  tokens.join("");
	}
}

Just wondering what your opinion of it is. How is its speed compared to regex?

 

 

Here's the result.

int compare (const void * a, const void * b)
{
a = "I am string for
test if escaping \"can 'break' it..";
b = 'I am string for
test if escaping \'can "break" it..';
return ( *(int*)a - *(int*)b );
}

Edited by SoItBegins

I don't know how fast that is relative to something else; you can benchmark it if you want to find out. Language syntax parsing and highlighting is not an easy thing to do, and I doubt you're going to find regular expressions that can encapsulate all of the rules for syntax parsing. Yours has issues, though. For example, a line break at the end of a string is not valid: JavaScript does not support multi-line strings unless the end of the line is escaped. It doesn't look like you're necessarily trying to parse JavaScript code, though. But like I said, parsing a language is not a trivial task.
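A rough way to run such a benchmark is to time both variants over the same input. The sketch below is only an illustration: `timeIt`, the two quote-counting functions and the generated sample are made-up stand-ins for the real regex-based and loop-based highlighters, and `Date.now()` gives millisecond resolution at best.

```javascript
// Runs fn `runs` times and reports the elapsed wall-clock time in milliseconds.
function timeIt(label, fn, runs) {
  var start = Date.now();
  for (var i = 0; i < runs; i++) {
    fn();
  }
  return { label: label, ms: Date.now() - start };
}

// Stand-in workloads: count the quote characters in a text, once per approach.
function countQuotesRegex(text) {
  var found = text.match(/"/g);
  return found ? found.length : 0;
}

function countQuotesIndexOf(text) {
  var count = 0;
  var pos = text.indexOf('"');
  while (pos !== -1) {
    count++;
    pos = text.indexOf('"', pos + 1);
  }
  return count;
}

var benchSample = 'x = "some \\"quoted\\" text"; '.repeat(2000);
var regexTiming = timeIt('regex', function () { countQuotesRegex(benchSample); }, 200);
var loopTiming = timeIt('indexOf', function () { countQuotesIndexOf(benchSample); }, 200);
console.log(regexTiming, loopTiming);
```

The numbers vary between browsers and runs, so it is worth repeating the measurement several times and with realistic input before drawing conclusions.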


I was just trying to make a generic highlighter that would work for any language. That's why I only have strings and keywords. The same code would now work with C/C++, PHP, JS, BASIC, Lua and so on, since all those languages have keywords such as for, while, do and if. And string highlighting that supports line breaks and escaping also works for languages that don't allow these.

But anyway, thanks for taking the time to read my posts and give hints about regex :)

Edited by SoItBegins