Multiple regexp not working | Sololearn: Learn to code for FREE!

0

Multiple regexp not working

I'm building a compiler that transforms HTML into function calls that create virtual DOM nodes, or that can be used with any other function that modifies the HTML template. I'm using a technique I like to call destructured matching: the largest regexp matches a large chunk of the template (e.g. the HTML tags), then smaller regexps match the smaller chunks inside it (e.g. attributes/props). The compiler works when the template contains a single chunk, but not when it contains multiple chunks. Here's the compiler along with some examples: https://code.sololearn.com/c4Sfyait72Ea/?ref=app Thank you all for helping, and feel free to comment on my compiler or suggest improvements!

4/8/2021 11:40:51 PM

Ivan Rodriguez

3 Answers

+2

The problem is the '^' and '$' in the regexp strings returned from the getters of the Grammars class. Take Grammars.startTag() as an example. The regexp string is `^<(@?${this.#identifier})( ${this.#attrs})*>$` (because you add '^' and '$' before passing it to the RegExp constructor). Now take the string "<h1>Hello</h1>". The start tag "<h1>" on its own fits the pattern, but the '^' and '$' metacharacters require "<h1>" to span the whole string from start to end, which it does not in this case, so the match fails. The same issue applies to all the getters of the Grammars class. Also, in the regexp returned from Grammars.emptyTag(), you forgot to allow the '@' character before the tag name. Fixing these will make your current test cases pass, but you might also want to add "\s*" here and there in the regexps, as you can't predict the user's choice of whitespace.
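To make the anchoring issue concrete, here is a minimal sketch. The `identifier` and `attrs` patterns are simplified stand-ins I made up for the private `#identifier` and `#attrs` fields, so the real ones in the Grammars class may well differ; only the presence or absence of '^' and '$' matters here.

```javascript
// Hypothetical stand-ins for the private #identifier and #attrs patterns;
// the real ones in the Grammars class may differ.
const identifier = "[A-Za-z][\\w-]*";
const attrs = `${identifier}="[^"]*"`;

// Same shape as the start-tag pattern quoted above, without the anchors.
const startTag = `<(@?${identifier})( ${attrs})*>`;

// Anchored: the whole input must be exactly one start tag.
const anchored = new RegExp(`^${startTag}$`);
// Unanchored + global flag: finds every start tag anywhere in the input.
const unanchored = new RegExp(startTag, "g");

const template = '<h1>Hello</h1><p class="x">World</p>';

console.log(anchored.test("<h1>"));   // true
console.log(anchored.test(template)); // false

const found = [...template.matchAll(unanchored)].map((m) => m[0]);
console.log(found); // [ '<h1>', '<p class="x">' ]
```

Note that `matchAll` requires the "g" flag on the regexp, otherwise it throws a TypeError.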

+1

Also, I understand that this might just be code for practising regex, but using regex for HTML parsing is generally not a good idea. It is very hard to make a regex as forgiving as the HTML engines used by browsers. The usual approach is to write an HTML tokenizer that goes through the code character by character, deciding what to do with each character, and splits the HTML into tokens; a parser then goes token by token and constructs the DOM tree. The W3C has a detailed spec on how to tokenize and parse HTML documents. See https://www.w3.org/TR/2011/WD-html5-20110525/parsing.html
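As a rough illustration of the character-by-character idea, here is a toy tokenizer. It is only a sketch under heavy assumptions: the `tokenize` function and its token type names are made up for this example, and it ignores comments, '>' inside quoted attribute values, and nearly everything else the spec's tokenizer handles.

```javascript
// Toy tokenizer: walks the input one character at a time and emits
// tag tokens and text tokens. Not spec-compliant; for illustration only.
function tokenize(html) {
  const tokens = [];
  let buffer = "";
  let inTag = false;

  for (const ch of html) {
    if (ch === "<" && !inTag) {
      // A tag begins: flush any pending text first.
      if (buffer) tokens.push({ type: "text", value: buffer });
      buffer = "<";
      inTag = true;
    } else if (ch === ">" && inTag) {
      // A tag ends: classify it by how it starts or ends.
      buffer += ">";
      let type = "startTag";
      if (buffer.startsWith("</")) type = "endTag";
      else if (buffer.endsWith("/>")) type = "emptyTag";
      tokens.push({ type, value: buffer });
      buffer = "";
      inTag = false;
    } else {
      buffer += ch;
    }
  }

  // Leftover characters (including an unclosed tag) are emitted as text.
  if (buffer) tokens.push({ type: "text", value: buffer });
  return tokens;
}

console.log(tokenize("<h1>Hello</h1><br/>"));
```

A parser would then consume this token list, pushing a node for each start tag and popping it on the matching end tag to build the tree.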

+1

Thank you XXX, but I use the ${...} for interpolation, so the regexp isn't too long and is easier to maintain. About the whitespace issue: I will add a mechanism in the serialization section that replaces each run of whitespace with a single space, so the rest of the compiler has less work to do. I should have explained that in the source code. Sorry about that. If you think that changes anything, please let me know. Thank you.
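A minimal sketch of what such a whitespace-collapsing step could look like. The `normalizeWhitespace` name is hypothetical and not taken from the compiler's source.

```javascript
// Hypothetical helper: collapses every run of whitespace (spaces, tabs,
// newlines) into a single space and trims both ends of the template.
function normalizeWhitespace(template) {
  return template.replace(/\s+/g, " ").trim();
}

const raw = `<div
    id="app"   class="main">`;

console.log(normalizeWhitespace(raw)); // <div id="app" class="main">
```

One caveat with this approach: it also collapses whitespace inside text content and attribute values, and it does not remove a single space in places like "< h1 >", so a few "\s*" in the regexps may still be useful.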