re.findall('(ab|cd)', string) vs re.findall('(ab|cd)+', string)

Question

I ran into a puzzling behaviour with Python regular expressions. Could you explain the difference between re.findall('(ab|cd)', string) and re.findall('(ab|cd)+', string)?

import re

string = 'abcdla'
result = re.findall('(ab|cd)', string)
result2 = re.findall('(ab|cd)+', string)
print(result)
print(result2)

The actual output is:

['ab', 'cd']
['cd']

I'm confused about why the second result doesn't contain 'ab' as well.

re.findall('(ab|cd)', string) gets ['ab', 'cd']; re.findall('(ab|cd)+', string) gets ['cd']
– rock, Commented Jan 7, 2020 at 8:33

Shashank V · Accepted Answer · 2020-01-07 09:35:11Z

16

+ is a repeat quantifier that matches one or more times. In the regex (ab|cd)+, you are repeating the capture group (ab|cd) using +. A repeated capture group keeps only the text from its last iteration.

You can reason about this behaviour as follows:

Say your string is abcdla and the regex is (ab|cd)+. The regex engine first matches ab at positions 0 and 1 for the group and exits the capture group. It then sees the + quantifier, tries the group again, and captures cd at positions 2 and 3, which overwrites the earlier ab. The next attempt fails at l, so the overall match is abcd and the group holds only cd.
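To see that walk-through on a real match object, here is a minimal sketch. The variable names are only for illustration, and note that Python's span() reports an exclusive end index, so positions 2 and 3 show up as (2, 4).

```python
import re

string = 'abcdla'
m = re.match(r'(ab|cd)+', string)

whole_match = m.group(0)   # text consumed by the entire pattern
last_capture = m.group(1)  # what group 1 holds after its final iteration
capture_span = m.span(1)   # (start, end) of that final iteration, end exclusive

print(whole_match, last_capture, capture_span)  # abcd cd (2, 4)
```

The earlier ab iteration is part of group(0) but is no longer available through group(1).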

If you want to capture all iterations, capture the repeated group instead: with ((ab|cd)+), group 1 captures abcd and group 2 captures cd. Since we don't care about the inner group's matches, you can make it non-capturing: ((?:ab|cd)+) captures only abcd.
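A quick way to compare the three patterns side by side, as a small sketch using the string from the question (the variable names are mine):

```python
import re

string = 'abcdla'

# Repeated capturing group: only the last iteration survives.
repeated = re.findall(r'(ab|cd)+', string)

# Outer group around the repetition, inner group still capturing:
# findall returns one tuple per match, (outer, inner).
outer_and_inner = re.findall(r'((ab|cd)+)', string)

# Outer group only; the inner group is non-capturing.
outer_only = re.findall(r'((?:ab|cd)+)', string)

print(repeated)         # ['cd']
print(outer_and_inner)  # [('abcd', 'cd')]
print(outer_only)       # ['abcd']
```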

https://www.regular-expressions.info/captureall.html

From the linked page:

Let’s say you want to match a tag like !abc! or !123!. Only these two are possible, and you want to capture the abc or 123 to figure out which tag you got. That’s easy enough: !(abc|123)! will do the trick.

Now let’s say that the tag can contain multiple sequences of abc and 123, like !abc123! or !123abcabc!. The quick and easy solution is !(abc|123)+!. This regular expression will indeed match these tags. However, it no longer meets our requirement to capture the tag’s label into the capturing group. When this regex matches !abc123!, the capturing group stores only 123. When it matches !123abcabc!, it only stores abc.
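The tag example from the quote can be reproduced in Python roughly like this. It is only a sketch: the second pattern, !((?:abc|123)+)!, is one possible fix along the lines of the advice above, not something quoted from that page.

```python
import re

tags = ['!abc123!', '!123abcabc!']
results = []
for tag in tags:
    # Repeated capturing group: group 1 keeps only its final iteration.
    last_only = re.match(r'!(abc|123)+!', tag).group(1)
    # Capturing the repeated (non-capturing) group keeps the whole label.
    whole_label = re.match(r'!((?:abc|123)+)!', tag).group(1)
    results.append((tag, last_only, whole_label))

for row in results:
    print(row)
# ('!abc123!', '123', 'abc123')
# ('!123abcabc!', 'abc', '123abcabc')
```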

edited Jan 7, 2020 at 9:35

answered Jan 7, 2020 at 8:50

Shashank V

4 Comments

Gulzar Over a year ago

can you link to some doc making clear the fact that + only captures the last iteration, and what is a capture group?

Shashank V Over a year ago

@Gulzar, updated the answer. You can read about capture groups here - regular-expressions.info/refcapture.html

rock Over a year ago

@Shashank, thanks, your reply is exactly what I need. sincerely thanks

Bernhard Barker Over a year ago

There's no need to surround the whole regex with brackets. Just '(?:ab|cd)+' will work.

Charif DZ · Accepted Answer · 2020-01-07 09:58:15Z

I don't know if this will make things clearer, but let's try to imagine what happens under the hood in a simple way. We are going to simulate it using match:

   import re

   # group(0) returns the whole matched string; captured groups are available via
   # groups() or group(1), group(2), ... Here there is only one group, and a
   # single group can hold only one piece of text at a time.
   string = 'abcdla'
   print(re.match('(ab|cd)', string).group(0))   # 'ab': the match is 'ab' and the group captures 'ab'
   print(re.match('(ab|cd)+', string).group(0))  # 'abcd': the match is 'abcd', but the group keeps only 'cd', the last iteration

findall matches and consumes the string as it goes. Let's imagine what happens with the regex '(ab|cd)':

      'abcdabla' ---> 1:   match: 'ab' |  capture : ab  | left to process:  'cdabla'
      'cdabla'   ---> 2:   match: 'cd' |  capture : cd  | left to process:  'abla'
      'abla'     ---> 3:   match: 'ab' |  capture : ab  | left to process:  'la'
      'la'       ---> 4:   no match    |  capture : none | left to process:  ''

      --- final : result captured ['ab', 'cd', 'ab']

Now the same thing with '(ab|cd)+'

      'abcdabla' ---> 1:   match: 'abcdab' |  capture : 'ab'  | left to process:  'la'
      'la'       ---> 2:   no match    |  capture : none | left to process:  ''
      ---> final result :   ['ab']
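The two traces above can be reproduced with re.finditer, which yields one match object per step. This is only a sketch; trace is a hypothetical helper name, not part of the re module.

```python
import re

def trace(pattern, text):
    """Return (span, whole match, group 1) for each match, left to right."""
    return [(m.span(), m.group(0), m.group(1)) for m in re.finditer(pattern, text)]

text = 'abcdabla'
single = trace(r'(ab|cd)', text)
repeated = trace(r'(ab|cd)+', text)

print(single)    # [((0, 2), 'ab', 'ab'), ((2, 4), 'cd', 'cd'), ((4, 6), 'ab', 'ab')]
print(repeated)  # [((0, 6), 'abcdab', 'ab')]
```

For a pattern with a single group, the last element of each tuple is what findall puts in its result list.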

I hope this clears things up a little.

RiaD · Accepted Answer · 2020-01-07 17:29:54Z

0

So, for me, the confusing part was this statement from the docs:

If one or more groups are present in the pattern, return a list of groups;

So findall returns not the full match but only what the capture group matched. If you make the group non-capturing, re.findall('(?:ab|cd)+', string), it returns ['abcd'], as I initially expected.
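If you would rather keep the capturing group in the pattern, one option (a sketch, not the only way) is to use re.finditer and read group(0) from each match, since group(0) is always the full match regardless of groups:

```python
import re

string = 'abcdla'
pattern = r'(ab|cd)+'

# findall with one capturing group returns that group's content per match.
groups_only = re.findall(pattern, string)

# finditer gives match objects, so the full match is available as group(0).
full_matches = [m.group(0) for m in re.finditer(pattern, string)]

# Same result as making the group non-capturing.
non_capturing = re.findall(r'(?:ab|cd)+', string)

print(groups_only)    # ['cd']
print(full_matches)   # ['abcd']
print(non_capturing)  # ['abcd']
```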

answered Jan 7, 2020 at 17:29

RiaD
