python - pandas - DataFrame expansion with outer join
First of all, I am new to pandas and trying to learn, so thorough answers are appreciated.
I want to generate a pandas DataFrame representing a map Twitter tag subtoken -> poster, where "tag subtoken" means anything in the set {hashtagA} ∪ {i | i in split('_', hashtagA)}, from a table matching poster -> tweet.
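To make the set definition concrete, here is a minimal sketch of the subtoken rule for a single hashtag. The function name `subtokens` is hypothetical and used only for illustration, and it returns a list rather than a set so the order matches the desired output.

```python
def subtokens(hashtag):
    """Return the hashtag itself followed by its '_'-separated parts."""
    parts = hashtag.split("_")
    # A tag without an underscore splits into just itself, so avoid a duplicate.
    return [hashtag] + parts if len(parts) > 1 else [hashtag]


print(subtokens("best_place_ever"))  # ['best_place_ever', 'best', 'place', 'ever']
print(subtokens("yolo"))             # ['yolo']
```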
For example:
In [1]: df = pd.DataFrame([["jim", "i #yolo_omg her"], ["jack", "you #yes_omg #best_place_ever"], ["neil", "yo #rofl_so_funny"]])

In [2]: df
Out[2]:
      0                              1
0   jim                i #yolo_omg her
1  jack  you #yes_omg #best_place_ever
2  neil              yo #rofl_so_funny

And I want something like:
       0          1
0    jim          yolo_omg
1    jim          yolo
2    jim          omg
3   jack          yes_omg
4   jack          yes
5   jack          omg
6   jack          best_place_ever
7   jack          best
8   jack          place
9   jack          ever
10  neil          rofl_so_funny
11  neil          rofl
12  neil          so
13  neil          funny

I managed to construct this monstrosity that does the job:
In [143]: df[1].str.findall('#([^\s]+)') \
    .apply(pd.Series).stack() \
    .apply(lambda s: [s] + s.split('_') if '_' in s else [s]) \
    .apply(pd.Series).stack().to_frame().reset_index(level=0) \
    .join(df, on='level_0', how='right', lsuffix='_l')[['0','0_l']]

Out[143]:
        0              0_l
0 0   jim         yolo_omg
  1   jim             yolo
  2   jim              omg
  0  jack          yes_omg
  1  jack              yes
  2  jack              omg
1 0  jack  best_place_ever
  1  jack             best
  2  jack            place
  3  jack             ever
0 0  neil    rofl_so_funny
  1  neil             rofl
  2  neil               so
  3  neil            funny

But I have a strong feeling that there are better ways of doing this, especially given that the real dataset is huge.
pandas does indeed have a function for doing this natively. Series.str.findall() applies a regex and captures the group(s) you specify in it.
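As a small illustration of that behaviour, here is a sketch on a throwaway Series (the sample data and the names `tweets` and `tags` are made up for this example). With a single capture group, each element becomes the list of captured strings, so the leading `#` is dropped, and rows without a match give an empty list.

```python
import pandas as pd

# Hypothetical sample data, including one tweet without any hashtag.
tweets = pd.Series(["i #yolo_omg her", "you #yes_omg #best_place_ever", "no tags here"])

# One capture group: only the part inside the parentheses is returned.
tags = tweets.str.findall(r"#(\S+)")

print(tags.tolist())
# [['yolo_omg'], ['yes_omg', 'best_place_ever'], []]
```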
So if you had this DataFrame:

df = pd.DataFrame([["jim", "i #yolo_omg her"], ["jack", "you #yes_omg #best_place_ever"], ["neil", "yo #rofl_so_funny"]])

What I would do first is set the names of the columns, like this:

df.columns = ['user', 'tweet']

Or on creation of the DataFrame:

df = pd.DataFrame([["jim", "i #yolo_omg her"], ["jack", "you #yes_omg #best_place_ever"], ["neil", "yo #rofl_so_funny"]], columns=['user', 'tweet'])

Then apply the findall function with a regex:

df['tag'] = df["tweet"].str.findall("(#[^ ]*)")

And use a negative character group instead of a positive one; it is more likely to survive special cases.
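The findall step leaves a list of tags in each row, which is not yet the one-row-per-subtoken table the question asks for. One possible way to finish, sketched here under the assumption that pandas 0.25 or later is available (for `DataFrame.explode`), is to explode the lists twice. The regex below captures the tag without the `#`, as in the question, and the variable name `result` is just for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [["jim", "i #yolo_omg her"],
     ["jack", "you #yes_omg #best_place_ever"],
     ["neil", "yo #rofl_so_funny"]],
    columns=["user", "tweet"],
)

result = (
    df.assign(tag=df["tweet"].str.findall(r"#(\S+)"))
      .explode("tag")                 # one row per hashtag
      .dropna(subset=["tag"])         # tweets without hashtags become NaN
      .reset_index(drop=True)
      # Replace each tag by [tag, part1, part2, ...] (just [tag] if no '_').
      .assign(tag=lambda d: d["tag"].map(
          lambda t: [t] + t.split("_") if "_" in t else [t]))
      .explode("tag")                 # one row per subtoken
      .loc[:, ["user", "tag"]]
      .reset_index(drop=True)
)

print(result)
```

This avoids the apply(pd.Series).stack() round trips and the join back onto the original frame, since the user column is carried along through each explode.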