Applying regex to pandas column based on different pos of same character

Question

I have a dataframe like the one shown below

import pandas as pd

tdf = pd.DataFrame({'text_1':['value: 1.25MG - OM - PO/TUBE - ashaf', 'value:2.5 MG - OM - PO/TUBE -test','value: 18 UNITS(S)','value: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS) -had', 'value: 75 MG - OM - PO/TUBE']})

I would like to apply a regex and create two columns based on the rules below:

Column val should store all text after value: and before the first hyphen.

Column Adm should store all text after the third hyphen.

I tried the below but it doesn't work accurately

tdf['text_1'].str.findall(r'[.0-9]+\s*[mgMG/lLcCUNIT]+')

RavinderSingh13 · Accepted Answer · 2021-04-12 12:38:08Z

10

With your shown samples, please try the following.

tdf[["val", "Adm"]] = tdf["text_1"].str.extract(r'^value:\s?(\S+(?:\s[^-]+)?)(?:\s-\s.*?-([^-]*)(?:-.*)?)?$', expand=True)
tdf

Online demo for above regex

Output will be as follows.

                                                    text_1          val                  Adm
0                     value: 1.25MG - OM - PO/TUBE - ashaf       1.25MG             PO/TUBE 
1                        value:2.5 MG - OM - PO/TUBE -test       2.5 MG             PO/TUBE 
2                                       value: 18 UNITS(S)  18 UNITS(S)                  NaN
3  value: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS) -had       850 MG   SC (SUBCUTANEOUS) 
4                              value: 75 MG - OM - PO/TUBE        75 MG              PO/TUBE
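Note that the second capture group keeps the spaces that surround the hyphens, which is why some Adm values in the output above carry padding. A minimal sketch of one way to tidy that up, assuming the same sample dataframe as in the question; the names `pattern` and `extracted` are only for illustration:

```python
import pandas as pd

tdf = pd.DataFrame({'text_1': [
    'value: 1.25MG - OM - PO/TUBE - ashaf',
    'value:2.5 MG - OM - PO/TUBE -test',
    'value: 18 UNITS(S)',
    'value: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS) -had',
    'value: 75 MG - OM - PO/TUBE',
]})

pattern = r'^value:\s?(\S+(?:\s[^-]+)?)(?:\s-\s.*?-([^-]*)(?:-.*)?)?$'
extracted = tdf['text_1'].str.extract(pattern, expand=True)
extracted.columns = ['val', 'Adm']

# Remove leading/trailing spaces from the captured text.
# Rows where the group did not participate stay missing (NaN).
extracted['Adm'] = extracted['Adm'].str.strip()
print(extracted)
```

Stripping after extraction keeps the regex unchanged, which may be easier to maintain than tightening the pattern itself.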

Explanation: a detailed breakdown of the regex above.

^value:\s?       ## Match "value:" at the start of the string; the whitespace after it is optional.
(\S+             ## Start the 1st capturing group and match a run of non-whitespace characters.
  (?:\s[^-]+)?   ## Optional non-capturing group: one whitespace character followed by characters other than -.
)                ## Close the 1st capturing group.
(?:\s-\s.*?-     ## Non-capturing group: whitespace, -, whitespace, then as little as possible up to the next -.
  ([^-]*)        ## 2nd capturing group: everything up to the next - (or the end of the string).
  (?:-.*)?       ## Optional non-capturing group: a - and the rest of the string.
)?$              ## Close the optional non-capturing group, then assert the end of the string.
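The same breakdown can be kept next to the pattern itself by compiling it with Python's re.VERBOSE flag, which ignores whitespace in the pattern and allows # comments. The following is only a sketch using the standard re module; `PATTERN` and `split_dose` are hypothetical names, not part of the answer above. pandas' str.extract also takes a flags argument, so the same flag could be passed there.

```python
import re

PATTERN = re.compile(r'''
    ^value:\s?        # "value:" at the start, optional whitespace after it
    (                 # group 1 (val)
      \S+             #   a run of non-whitespace characters
      (?:\s[^-]+)?    #   optionally whitespace plus non-hyphen characters
    )
    (?:               # optional tail
      \s-\s.*?-       #   space, hyphen, space, then lazily up to the next hyphen
      ([^-]*)         #   group 2 (Adm): characters up to the following hyphen
      (?:-.*)?        #   optionally that hyphen and the rest of the string
    )?$
''', re.VERBOSE)


def split_dose(text):
    """Hypothetical helper: return (val, Adm), or None if the text does not match."""
    match = PATTERN.match(text)
    if match is None:
        return None
    val, adm = match.groups()
    # Group 2 is None when the optional tail is absent.
    return val, adm.strip() if adm is not None else None


print(split_dose('value:2.5 MG - OM - PO/TUBE -test'))
print(split_dose('value: 18 UNITS(S)'))
```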

edited Apr 12, 2021 at 12:38

answered Apr 12, 2021 at 11:23

RavinderSingh13


4 Comments

RavinderSingh13 Over a year ago

@TheGreat, could you please let me know for which samples it doesn't work? With your shown samples it's working fine for me.

The Great Over a year ago

Oh sorry, I thought you read my comment on the previous answer. I updated my sample dataframe now.

RavinderSingh13 Over a year ago

@TheGreat, with your latest changes (df), when I test my bonus solution tdf[["val", "Adm"]] = tdf["text_1"].str.extract(r'^value:\s?(\d+(?:\.\d+)?\s?(?:\S+)?)(?:\s-[^-]*-(.*))?$', expand=True) it works fine for me. Please let me know if it works for you.

The Great Over a year ago

Sorry, bonus solution doesn't work with the updated sample

Shubham Sharma · Accepted Answer · 2021-04-12 16:35:39Z

`Series.str.extract`

tdf['text_1'].str.extract(r'^value:\s?([^-]+)(?:\s-.*?-\s)?([^-]*)(?:\s|$)')

             0                  1
0       1.25MG            PO/TUBE
1       2.5 MG            PO/TUBE
2  18 UNITS(S)                   
3       850 MG  SC (SUBCUTANEOUS)
4        75 MG            PO/TUBE

Regex details:

^ : Assert position at start of line
value: : Matches character sequence value:
\s?: Matches any whitespace character between zero and one time
([^-]+) : First capturing group matches any character except - one or more times
(?:\s-.*?-\s)? : Non capturing group match between zero and one time
- \s: Matches single whitespace character
- - : Matches character -
- .*? : Matches any character between zero and unlimited times but as few times as possible
- - : Matches character -
- \s : Matches single whitespace character
([^-]*) : Second capturing group matches any character except - zero or more times
(?:\s|$) : Non capturing group
- \s : Matches single whitespace character
- | : Or switch
- $ : Assert position at the end of line
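One detail worth noting: because the second capturing group uses `*`, it always participates in the match, so a row without hyphens yields an empty string rather than NaN (see row 2 in the output above). A small sketch of one way to turn those empty strings into missing values, assuming a cut-down version of the sample data; the column names are assigned here only for illustration:

```python
import pandas as pd

tdf = pd.DataFrame({'text_1': [
    'value: 1.25MG - OM - PO/TUBE - ashaf',
    'value: 18 UNITS(S)',
]})

pattern = r'^value:\s?([^-]+)(?:\s-.*?-\s)?([^-]*)(?:\s|$)'
extracted = tdf['text_1'].str.extract(pattern)
extracted.columns = ['val', 'Adm']

# Series.mask replaces entries where the condition is True with a missing value.
extracted['Adm'] = extracted['Adm'].mask(extracted['Adm'] == '')
print(extracted)
```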

See the online Regex demo

Wiktor Stribiżew · Accepted Answer · 2021-04-12 10:10:08Z

5

You can use

tdf[["val", "Adm"]] = tdf["text_1"].str.extract(r'^val:\s*([^-]*?)(?:\s*-[^-]*-\s*(.*))?$', expand=True)
# => >>> tdf
                                             text_1          val                Adm
0                        val: 1.25MG - OM - PO/TUBE       1.25MG            PO/TUBE
1                         val:2.5 MG - OM - PO/TUBE       2.5 MG            PO/TUBE
2                                  val: 18 UNITS(S)  18 UNITS(S)                NaN
3  val: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS)       850 MG  SC (SUBCUTANEOUS)
4                         val: 75 MG - OM - PO/TUBE        75 MG            PO/TUBE

See the regex demo.

Details:

^val: - val: at the start of string (if val: is not always at the start of the string, remove ^ anchor)
\s* - zero or more whitespaces
([^-]*?) - Group 1: any chars other than - as few as possible
(?:\s*-[^-]*-\s*(.*))? - an optional sequence of
- \s* - zero or more whitespaces
- -[^-]*- - a -, any zero or more chars other than -, and then a -
- \s* - zero or more whitespaces
- (.*) - Group 2: the rest of the line
$ - end of string.

answered Apr 12, 2021 at 10:10

Wiktor Stribiżew


5 Comments

The Great Over a year ago

one minor question. Let's say if I wish to reorder the columns. Meaning, Adm should come first and val as the last column. Would the regex remain the same?

The Great Over a year ago

Sorry, couldn't try as I am away from my desk

Wiktor Stribiżew Over a year ago

@TheGreat After extracting, you can reorder the columns, add the tdf = tdf[['text_1', 'Adm', 'val']] line.

The Great Over a year ago

Thanks, one last question. I am trying to change your regex to pick all text after 3rd hyphen but before 4th hyphen... So, I write the below tdf["text_1"].str.extract(r'^value:\s*([^-]*?)(?:\s*-[^-]*-\s*(.*))?[^-]*', expand=True) but that seems to give incorrect output. would you be able to help?

Wiktor Stribiżew Over a year ago

@TheGreat That will be ^val:\s*([^-]*?)(?:\s*-[^-]*-\s*([^-]*)), see demo.
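The two variants discussed in this thread can be compared side by side with the standard re module. This is only a sketch: it assumes the prefix is "value:" as in the question's dataframe (the answer's sample output uses "val:"), and the names `REST_OF_LINE` and `ONE_SEGMENT` are made up for illustration.

```python
import re

# Assumption: the prefix is "value:" (question's data), not "val:".
# Variant from the answer: group 2 takes the rest of the line.
REST_OF_LINE = re.compile(r'^value:\s*([^-]*?)(?:\s*-[^-]*-\s*(.*))?$')
# Variant from the last comment: group 2 stops at the next hyphen.
ONE_SEGMENT = re.compile(r'^value:\s*([^-]*?)(?:\s*-[^-]*-\s*([^-]*))')

sample = 'value: 1.25MG - OM - PO/TUBE - ashaf'
no_hyphens = 'value: 18 UNITS(S)'

print(REST_OF_LINE.match(sample).groups())
print(ONE_SEGMENT.match(sample).groups())

# The hyphen group is optional in the first pattern but required in the second,
# so a string without hyphens only matches the first one.
print(REST_OF_LINE.match(no_hyphens).groups())
print(ONE_SEGMENT.match(no_hyphens))
```

With the second variant the captured segment can keep a trailing space before the next hyphen, and rows without any hyphen do not match at all, so in pandas both columns would come back as NaN for such rows.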
