python - pandas - DataFrame expansion with outer join -


First of all, I'm new to pandas and trying to learn, so thorough answers are appreciated.

I want to generate a pandas DataFrame representing a map of Twitter tag subtoken -> poster, built from a table matching poster -> tweet. Here a tag subtoken means anything in the set {hashtagA} ∪ {i | i in split('_', hashtagA)}.
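To make the subtoken definition concrete, here is a minimal sketch in plain Python. The function name `subtokens` is hypothetical (it does not come from pandas), and it returns an ordered list rather than a set so that the full tag comes first.

```python
def subtokens(hashtag):
    """Return the hashtag itself followed by its '_'-separated parts."""
    parts = hashtag.split("_")
    # A tag without '_' has only itself as a subtoken, so avoid repeating it.
    return [hashtag] + parts if len(parts) > 1 else [hashtag]


print(subtokens("best_place_ever"))
print(subtokens("yolo"))
```

Under this definition, a tag such as best_place_ever contributes four rows for its poster: the full tag plus each of its three parts.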

For example:

    In [1]: df = pd.DataFrame([["jim", "i #yolo_omg her"], ["jack", "you #yes_omg #best_place_ever"], ["neil", "yo #rofl_so_funny"]])

    In [2]: df
    Out[2]:
          0                              1
    0   jim                i #yolo_omg her
    1  jack  you #yes_omg #best_place_ever
    2  neil              yo #rofl_so_funny

And from that I want to get something like:

           0          1
    0    jim          yolo_omg
    1    jim          yolo
    2    jim          omg
    3   jack          yes_omg
    4   jack          yes
    5   jack          omg
    6   jack          best_place_ever
    7   jack          best
    8   jack          place
    9   jack          ever
    10  neil          rofl_so_funny
    11  neil          rofl
    12  neil          so
    13  neil          funny

I managed to construct this monstrosity that does the job:

    In [143]: df[1].str.findall('#([^\s]+)') \
        .apply(pd.Series).stack() \
        .apply(lambda s: [s] + s.split('_') if '_' in s else [s]) \
        .apply(pd.Series).stack().to_frame().reset_index(level=0) \
        .join(df, on='level_0', how='right', lsuffix='_l')[['0','0_l']]

    Out[143]:
            0              0_l
    0 0   jim         yolo_omg
      1   jim             yolo
      2   jim              omg
      0  jack          yes_omg
      1  jack              yes
      2  jack              omg
    1 0  jack  best_place_ever
      1  jack             best
      2  jack            place
      3  jack             ever
    0 0  neil    rofl_so_funny
      1  neil             rofl
      2  neil               so
      3  neil            funny

But I have a strong feeling that there are better ways of doing this, especially given that the real dataset is huge.

pandas does indeed have a function for doing this natively. Series.str.findall() applies a regex to each element and captures the group(s) you specify in it.
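As a small illustration of that behaviour, the sketch below runs Series.str.findall() on a made-up two-element Series. The sample strings and the pattern `#(\w+)` are assumptions chosen for the demo. With a single capture group, each element becomes a list of the captured strings, and an element with no match becomes an empty list.

```python
import pandas as pd

# Hypothetical sample data: one tweet with two hashtags, one with none.
tweets = pd.Series(["you #yes_omg #best_place_ever", "no tags here"])

# The single capture group keeps the tag text and drops the leading '#'.
tags = tweets.str.findall(r"#(\w+)")

print(tags.tolist())
```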

So if you had your DataFrame:

    df = pd.DataFrame([["jim", "i #yolo_omg her"], ["jack", "you #yes_omg #best_place_ever"], ["neil", "yo #rofl_so_funny"]])

What I would do first is set the names of the columns, like this:

    df.columns = ['user', 'tweet']

Or do it on creation of the DataFrame:

    df = pd.DataFrame([["jim", "i #yolo_omg her"], ["jack", "you #yes_omg #best_place_ever"], ["neil", "yo #rofl_so_funny"]], columns=['user', 'tweet'])

Then apply findall with your regex:

    df['tag'] = df["tweet"].str.findall("(#[^ ]*)")

Note that I use a negated character group instead of a positive one, since it is more likely to survive special cases.
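The findall step leaves a list of tags per tweet, not yet the one-row-per-subtoken table the question asks for. One possible way to finish the job, sketched here under the assumption that pandas 0.25 or later is available (for DataFrame.explode), is to explode twice: once per tag, then once per subtoken. The helper name `expand` is hypothetical, and the pattern is the one from the question, which drops the leading '#'. This is not the approach given in the answer above, only one way to continue from it.

```python
import pandas as pd

df = pd.DataFrame(
    [["jim", "i #yolo_omg her"],
     ["jack", "you #yes_omg #best_place_ever"],
     ["neil", "yo #rofl_so_funny"]],
    columns=["user", "tweet"],
)


def expand(tag):
    """Return the tag followed by its '_'-separated parts (hypothetical helper)."""
    parts = tag.split("_")
    return [tag] + parts if len(parts) > 1 else [tag]


result = (
    df.assign(tag=df["tweet"].str.findall(r"#([^\s]+)"))  # list of tags per tweet
      .explode("tag")                                      # one row per tag
      .dropna(subset=["tag"])                              # drop tweets without tags
      .reset_index(drop=True)
      .assign(tag=lambda d: d["tag"].map(expand))          # list of subtokens per tag
      .explode("tag")[["user", "tag"]]                     # one row per subtoken
      .reset_index(drop=True)
)

print(result)
```

This avoids the two apply(pd.Series).stack() round trips and the join back onto the original frame, because explode repeats the user column alongside each tag.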

