12

I have a dataframe like as shown below

tdf = pd.DataFrame({'text_1':['value: 1.25MG - OM - PO/TUBE - ashaf', 'value:2.5 MG - OM - PO/TUBE -test','value: 18 UNITS(S)','value: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS) -had', 'value: 75 MG - OM - PO/TUBE']})

I would like to apply regex and create two columns based on rules given below

col val should store all text after value: and before first hyphen

col Adm should store all text after third hyphen

I tried the below but it doesn't work accurately

tdf['text_1'].str.findall('[.0-9]+\s*[mgMG/lLcCUNIT]+')

enter image description here

3 Answers 3

10

With your shown samples, could you please try following.

tdf[["val", "Adm"]] = tdf["text_1"].str.extract(r'^value:\s?(\S+(?:\s[^-]+)?)(?:\s-\s.*?-([^-]*)(?:-.*)?)?$', expand=True)
tdf

Online demo for above regex

Output will be as follows.

                                                    text_1          val                  Adm
0                     value: 1.25MG - OM - PO/TUBE - ashaf       1.25MG             PO/TUBE 
1                        value:2.5 MG - OM - PO/TUBE -test       2.5 MG             PO/TUBE 
2                                       value: 18 UNITS(S)  18 UNITS(S)                  NaN
3  value: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS) -had       850 MG   SC (SUBCUTANEOUS) 
4                              value: 75 MG - OM - PO/TUBE        75 MG              PO/TUBE

Explanation: Adding detailed explanation for above.

^value:\s?       ##Checking if value starts from value: space is optional here.
(\S+             ##Starting 1st capturing group from here and matching all non space here.
  (?:\s[^-]+)?   ##In a non-capturing group matching space till - comes keeping it optional.
)                ##Closing 1st capturing group here.
(?:\s-\s.*?-     ##In a non-capturing group matching space-space till - first occurrence.
  ([^-]*)        ##Creating 2nd capturing group which has values till next - here.
  (?:-.*)?       ##In a non capturing group from - till end of value keeping it optional.
)?$              ##Closing non-capturing group at the end of the value here.
Sign up to request clarification or add additional context in comments.

4 Comments

@TheGreat, could you please do let me know for which samples it doesn't work. With your shown samples its working fine for me.
Oh sorry, I thought you read my comment on the previous answer. I updated my sample dataframe now.
@TheGreat, with your latest changes(df) when I test tdf[["val", "Adm"]] = tdf["text_1"].str.extract(r'^value:\s?(\d+(?:\.\d+)?\s?(?:\S+)?)(?:\s-[^-]*-(.*))?$', expand=True) my bonus solution it works fine for me, let me know on same please.
Sorry, bonus solution doesn't work with the updated sample
10

Series.str.extract

tdf['text_1'].str.extract(r'^value:\s?([^-]+)(?:\s-.*?-\s)?([^-]*)(?:\s|$)')

             0                  1
0       1.25MG            PO/TUBE
1       2.5 MG            PO/TUBE
2  18 UNITS(S)                   
3       850 MG  SC (SUBCUTANEOUS)
4        75 MG            PO/TUBE

Regex details:

  • ^ : Assert position at start of line
  • value: : Matches character sequence value:
  • \s?: Matches any whitespace character between zero and one time
  • ([^-]+) : First capturing group matches any character except - one or more times
  • (?:\s-.*?-\s)? : Non capturing group match between zero and one time
    • \s: Matches single whitespace character
    • - : Matches character -
    • .*? : Matches any character between zero and unlimited times but as few times as possible
    • - : Matches character -
    • \s : Matches single whitespace character
  • ([^-]*) : Second capturing group matches any character except - zero or more times
  • (?:\s|$) : Non capturing group
    • \s- : Matches single whitespace character
    • | : Or switch
    • $ : Assert position at the end of line

See the online Regex demo

Comments

5

You can use

tdf[["val", "Adm"]] = tdf["text_1"].str.extract(r'^val:\s*([^-]*?)(?:\s*-[^-]*-\s*(.*))?$', expand=True)
# => >>> tdf
                                             text_1          val  \
0                        val: 1.25MG - OM - PO/TUBE       1.25MG   
1                         val:2.5 MG - OM - PO/TUBE       2.5 MG   
2                                  val: 18 UNITS(S)  18 UNITS(S)   
3  val: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS)       850 MG   
4                         val: 75 MG - OM - PO/TUBE        75 MG   


0            PO/TUBE  
1            PO/TUBE  
2                NaN  
3  SC (SUBCUTANEOUS)  
4            PO/TUBE  

See the regex demo.

Details:

  • ^val: - val: at the start of string (if val: is not always at the start of the string, remove ^ anchor)
  • \s* - zero or more whitespaces
  • ([^-]*?) - Group 1: any chars other than - as few as possible
  • (?:\s*-[^-]*-\s*(.*))? - an optional sequence of
    • \s* - zero or more whitespaces
    • -[^-]*- - a -, any zero or more chars other than -, and then a -
    • \s* - zero or more whitespaces
    • (.*) - Group 2: the rest of the line
  • $ - end of string.

5 Comments

one minor question. Let's say if I wish to reorder the columns. Meaning, Adm should come first and val as the last column. Would the regex remain the same?
Sorry, couldn't try as I am away from my desk
@TheGreat After extracting, you can reorder the columns, add the tdf = tdf[['text_1', 'Adm', 'val']] line.
Thanks, one last question. I am trying to change your regex to pick all text after 3rd hyphen but before 4th hyphen... So, I write the below tdf["text_1"].str.extract(r'^value:\s*([^-]*?)(?:\s*-[^-]*-\s*(.*))?[^-]*', expand=True) but that seems to give incorrect output. would you be able to help?
@TheGreat That will be ^val:\s*([^-]*?)(?:\s*-[^-]*-\s*([^-]*)), see demo.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.