Python Regex Doubt | SoloLearn: Learn to code for FREE!


Python Regex Doubt

Hello everyone, I recently attempted a Python regex challenge, looked at the solution, and found this:

    import re

    first_multiple_input = raw_input().rstrip().split()
    n = int(first_multiple_input[0])
    m = int(first_multiple_input[1])

    matrix = []
    for _ in xrange(n):
        matrix_item = raw_input()
        matrix.append(matrix_item)

    encoded_string = "".join([matrix[j][i] for i in range(m) for j in range(n)])
    pat = r'(?<=[a-zA-Z0-9])[^a-zA-Z0-9]+(?=[a-zA-Z0-9])'
    print(re.sub(pat,' ',encoded_string))

I wasn't able to understand the code after the for loop. Could you please explain it?

Reference:

Sample input:

    7 3
    Tsi
    h%x
    i #
    sM 
    $a 
    #t%
    ir!

(Each row is 3 characters wide, so the rows "sM " and "$a " end with a space.)

Sample output:

    This is Matrix#  %!

9/14/2020 9:15:31 AM


3 Answers



The encoded_string variable is built with a list comprehension that visits every element of the matrix column by column: the outer loop, for i in range(m), walks the columns, and the inner loop, for j in range(n), walks the rows within each column. I have created this small code so that you may compare that syntax with something I hope you are more familiar with - nested loops.

The regex consists of three parts. The middle part - [^a-zA-Z0-9]+ - is the only part that actually consumes characters, so it alone makes up the matched text. It matches one or more consecutive characters, none of which is a letter (lower or upper case) or a digit. The first part is a (positive) lookbehind, written (?<=...). It means the middle part only matches if the character immediately before it satisfies the lookbehind, which here is [a-zA-Z0-9]: any letter or digit. So the middle part only matches a run of non-letter/digit characters if that run directly follows a letter or a digit.
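To make the comparison with nested loops concrete, here is a minimal sketch. It is not the code linked in the answer above, which is not reproduced in this thread. It assumes Python 3 and hard-codes the sample matrix instead of reading input; the trailing spaces in "sM " and "$a " are inferred from the 3-column width.

```python
# Sample matrix from the question: 7 rows, 3 columns.
# Rows "sM " and "$a " end with a space (assumed from the 3-character width).
matrix = ["Tsi", "h%x", "i #", "sM ", "$a ", "#t%", "ir!"]
n, m = len(matrix), len(matrix[0])

# The comprehension from the question: the first "for" is the outer loop
# (columns), the second "for" is the inner loop (rows).
encoded_comprehension = "".join(
    [matrix[j][i] for i in range(m) for j in range(n)]
)

# The same traversal written as explicit nested loops.
chars = []
for i in range(m):          # column index, left to right
    for j in range(n):      # row index, top to bottom
        chars.append(matrix[j][i])
encoded_loops = "".join(chars)

print(encoded_loops)
print(encoded_loops == encoded_comprehension)
```

Both versions read the matrix one column at a time, so they produce the same string.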


The third part is a (positive) lookahead, written (?=...). It works the same way as the lookbehind, except that the middle part only matches if it is immediately followed by a letter or a digit. Put together, this regex matches any run of non-letter/digit characters that is immediately preceded by a letter or digit and immediately followed by a letter or digit. The second argument to re.sub() is the replacement: whatever the pattern matched is replaced with a single space.

So, the encoded_string is found by reading the columns of the matrix from top to bottom in turn (going from left to right), which gives "This$#is% Matrix#  %!". The regex matches "$#" because it is preceded by "s" and followed by "i", and changes it to a space. It does the same with "% " because it is between "s" and "M". But it doesn't change the last sequence of characters, "#  %!", because no letter or digit comes after it, so the lookahead fails. Hope that all makes sense!
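The walkthrough above can be checked with a short sketch. It assumes Python 3 and uses the encoded sample string directly (with the two spaces before "%!" that come from the trailing spaces in the sample rows). The second pattern, without lookarounds, is added here only for contrast and is not part of the original solution.

```python
import re

# Encoded string from the sample, read column by column.
encoded_string = "This$#is% Matrix#  %!"

# Pattern from the question: a run of non-alphanumerics that sits
# directly between two alphanumeric characters.
pat = r'(?<=[a-zA-Z0-9])[^a-zA-Z0-9]+(?=[a-zA-Z0-9])'

# The lookarounds only check context, so each match holds just the symbols.
matches = re.findall(pat, encoded_string)
print(matches)

# Each matched run is replaced by one space; the trailing run is left alone
# because nothing alphanumeric follows it.
decoded = re.sub(pat, ' ', encoded_string)
print(decoded)

# For contrast (hypothetical variant): without the lookarounds the trailing
# symbols would be replaced as well.
no_lookarounds = re.sub(r'[^a-zA-Z0-9]+', ' ', encoded_string)
print(repr(no_lookarounds))
```

Comparing the last two results shows what the lookbehind and lookahead contribute: they keep symbols at the end of the string from being touched.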


Thanks a lot @Russ. Your explanation was just wonderful.