18

I ran into a puzzling behaviour with Python regular expressions. Could you explain the difference between re.findall('(ab|cd)', string) and re.findall('(ab|cd)+', string)?

import re

string = 'abcdla'
result = re.findall('(ab|cd)', string)
result2 = re.findall('(ab|cd)+', string)
print(result)
print(result2)

Actual Output is:

['ab', 'cd']
['cd']

I'm confused as to why the second result doesn't contain 'ab' as well.

1
  • re.findall('(ab|cd)', string) gets ['ab', 'cd'] re.findall('(ab|cd)+', string) gets ['cd'] Commented Jan 7, 2020 at 8:33

3 Answers

16

+ is a repeat quantifier that matches one or more times. In the regex (ab|cd)+, you are repeating the capture group (ab|cd) using +. This will only capture the last iteration.

You can reason about this behaviour as follows:

Say your string is abcdla and the regex is (ab|cd)+. The regex engine first matches the group at positions 0 and 1 as ab and exits the capture group. It then sees the + quantifier, so it tries the group again and matches cd at positions 2 and 3, which overwrites the ab stored earlier. The next attempt fails at l, so the overall match is abcd and the group is left holding only cd.
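A minimal sketch of that reasoning, using re.match so that the whole match and the group's final capture can be inspected side by side. Note that Python's span() reports end-exclusive offsets, so "positions 2 and 3" shows up as (2, 4).

```python
import re

m = re.match(r'(ab|cd)+', 'abcdla')

# group(0) is the whole match: both iterations of the repeated group.
print(m.group(0), m.span(0))  # abcd (0, 4)

# group(1) is the capture group: only its last iteration survives.
print(m.group(1), m.span(1))  # cd (2, 4)
```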


If you want to capture all iterations, capture the repeated group as a whole instead, with ((ab|cd)+). Here the outer group captures abcd and the inner group captures cd. Since we don't care about the inner group's capture, you can make it non-capturing: ((?:ab|cd)+) captures just abcd.
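As a quick check, here is how the two patterns behave with re.findall on the string from the question. The variable names are only illustrative, and the last line is one possible way (an assumption, not something the answer prescribes) to split the captured run back into its individual iterations.

```python
import re

string = 'abcdla'

# Outer group captures the whole run, inner group keeps only the last iteration.
outer_and_inner = re.findall(r'((ab|cd)+)', string)
print(outer_and_inner)  # [('abcd', 'cd')]

# Inner group made non-capturing: only the whole run is reported.
outer_only = re.findall(r'((?:ab|cd)+)', string)
print(outer_only)  # ['abcd']

# Splitting the run into its iterations with a second, group-free pattern.
iterations = re.findall(r'ab|cd', outer_only[0])
print(iterations)  # ['ab', 'cd']
```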

https://www.regular-expressions.info/captureall.html

From the page linked above:

Let’s say you want to match a tag like !abc! or !123!. Only these two are possible, and you want to capture the abc or 123 to figure out which tag you got. That’s easy enough: !(abc|123)! will do the trick.

Now let’s say that the tag can contain multiple sequences of abc and 123, like !abc123! or !123abcabc!. The quick and easy solution is !(abc|123)+!. This regular expression will indeed match these tags. However, it no longer meets our requirement to capture the tag’s label into the capturing group. When this regex matches !abc123!, the capturing group stores only 123. When it matches !123abcabc!, it only stores abc.
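The quoted tag example can be reproduced in Python. This sketch uses re.fullmatch and applies the same fix as above (capturing the repetition as a whole); the `labels` dictionary is just a convenience for showing both results.

```python
import re

labels = {}
for tag in ('!abc123!', '!123abcabc!'):
    # Repeated capture group: only the last iteration is stored.
    last_only = re.fullmatch(r'!(abc|123)+!', tag).group(1)
    # Group around the repetition: the whole label is stored.
    whole_label = re.fullmatch(r'!((?:abc|123)+)!', tag).group(1)
    labels[tag] = (last_only, whole_label)

print(labels['!abc123!'])     # ('123', 'abc123')
print(labels['!123abcabc!'])  # ('abc', '123abcabc')
```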


4 Comments

can you link to some doc making clear the fact that + only captures the last iteration, and what is a capture group?
@Gulzar, updated the answer. You can read about capture groups here - regular-expressions.info/refcapture.html
@Shashank, thanks, your reply is exactly what I need. sincerely thanks
There's no need to surround the whole regex with brackets. Just '(?:ab|cd)+' will work.
6

I don't know if this will make things clearer, but let's try to picture what happens under the hood in a simple way. We are going to simulate it using match:

   import re

   # group(0) returns the whole matched string; the captured groups are available via groups(),
   # or individually via group(1), group(2), ... Here there is only one group, and a group
   # holds only one substring at a time.
   string = 'abcdla'
   print(re.match('(ab|cd)', string).group(0))   # 'ab': the match is 'ab' and the group captures 'ab'
   print(re.match('(ab|cd)+', string).group(0))  # 'abcd': the match is 'abcd', but the group holds only 'cd', the last iteration

findall matches and consumes the string as it goes. Let's imagine what happens with the regex '(ab|cd)' on a slightly longer string, 'abcdabla':

      'abcdabla' ---> 1:   match: 'ab'  |  capture: 'ab'  | left to process:  'cdabla'
      'cdabla'   ---> 2:   match: 'cd'  |  capture: 'cd'  | left to process:  'abla'
      'abla'     ---> 3:   match: 'ab'  |  capture: 'ab'  | left to process:  'la'
      'la'       ---> 4:   no match     |  capture: none  | left to process:  ''

      ---> final result:   ['ab', 'cd', 'ab']

Now the same thing with '(ab|cd)+'

      'abcdabla' ---> 1:   match: 'abcdab' |  capture: 'ab'  | left to process:  'la'
      'la'       ---> 2:   no match        |  capture: none  | left to process:  ''

      ---> final result:   ['ab']

I hope this clears things up a little.
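The walk-through above can be approximated in code with re.finditer. This is only a rough sketch: `findall_sketch` is a hypothetical helper, not how re.findall is actually implemented, and it handles only patterns with zero or one capture group in which the group always takes part in the match.

```python
import re

def findall_sketch(pattern, string):
    """Rough emulation of re.findall for patterns with zero or one group."""
    compiled = re.compile(pattern)
    results = []
    for m in compiled.finditer(string):
        # One group: report the group's final capture.
        # No groups: report the whole match.
        results.append(m.group(1) if compiled.groups == 1 else m.group(0))
    return results

single = findall_sketch('(ab|cd)', 'abcdabla')
repeated = findall_sketch('(ab|cd)+', 'abcdabla')
print(single)    # ['ab', 'cd', 'ab']
print(repeated)  # ['ab']
```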


0

So, for me the confusing part was this statement:

If one or more groups are present in the pattern, return a list of groups;

docs

So it returns not the full match but only what the group captured. If you make the group non-capturing, re.findall('(?:ab|cd)+', string), it returns ["abcd"], as I initially expected.
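If the capture group has to stay in the pattern, one alternative (a sketch, not the only option) is to use re.finditer and read group(0) from each match object, which is always the full match regardless of groups.

```python
import re

string = 'abcdla'

# findall with one capture group reports the group's capture, not the match.
groups_only = re.findall(r'(ab|cd)+', string)
print(groups_only)  # ['cd']

# finditer yields match objects, so the full match is still reachable.
full_matches = [m.group(0) for m in re.finditer(r'(ab|cd)+', string)]
print(full_matches)  # ['abcd']
```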
