python - pandas - DataFrame expansion with outer join
First of all, I am new to pandas and trying to learn, so thorough answers would be appreciated.
I want to generate a pandas DataFrame representing a map of Twitter tag subtoken -> poster,
where a tag subtoken means anything in the set {hashtagA} U {i | i in split('_', hashtagA)},
starting from a table matching poster -> tweet.
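To make the subtoken rule concrete, here is a minimal plain-Python sketch of the set definition above. The function name `tag_subtokens` is hypothetical, and the regex assumes a hashtag runs until the next whitespace character.

```python
import re

def tag_subtokens(tweet):
    """Return each hashtag in `tweet` followed by its '_'-separated parts."""
    tokens = []
    for tag in re.findall(r"#(\S+)", tweet):
        tokens.append(tag)
        parts = tag.split("_")
        # Only add the parts when the tag really contains an underscore;
        # otherwise the single part would just repeat the tag itself.
        if len(parts) > 1:
            tokens.extend(parts)
    return tokens

print(tag_subtokens("you #yes_omg #best_place_ever"))
# ['yes_omg', 'yes', 'omg', 'best_place_ever', 'best', 'place', 'ever']
```

This returns a list rather than a set so that the order of the tags in the tweet is preserved.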
For example:
In [1]: df = pd.DataFrame([["jim", "i #yolo_omg her"], ["jack", "you #yes_omg #best_place_ever"], ["neil", "yo #rofl_so_funny"]])
In [2]: df
Out[2]:
      0                              1
0   jim                i #yolo_omg her
1  jack  you #yes_omg #best_place_ever
2  neil              yo #rofl_so_funny
And from that I want something like:
       0                1
0    jim         yolo_omg
1    jim             yolo
2    jim              omg
3   jack          yes_omg
4   jack              yes
5   jack              omg
6   jack  best_place_ever
7   jack             best
8   jack            place
9   jack             ever
10  neil    rofl_so_funny
11  neil             rofl
12  neil               so
13  neil            funny
I managed to construct this monstrosity that does the job:
In [143]: df[1].str.findall('#([^\s]+)') \
    .apply(pd.Series).stack() \
    .apply(lambda s: [s] + s.split('_') if '_' in s else [s]) \
    .apply(pd.Series).stack().to_frame().reset_index(level=0) \
    .join(df, on='level_0', how='right', lsuffix='_l')[['0','0_l']]
Out[143]:
        0              0_l
0 0   jim         yolo_omg
  1   jim             yolo
  2   jim              omg
  0  jack          yes_omg
  1  jack              yes
  2  jack              omg
1 0  jack  best_place_ever
  1  jack             best
  2  jack            place
  3  jack             ever
0 0  neil    rofl_so_funny
  1  neil             rofl
  2  neil               so
  3  neil            funny
But I have a strong feeling that there are better ways of doing this, especially given that the real dataset is huge.
pandas does indeed have a function for doing this natively. Series.str.findall() applies a regex to each element and returns the group(s) you capture in it.
So if you had this DataFrame:
df = pd.DataFrame([["jim", "i #yolo_omg her"], ["jack", "you #yes_omg #best_place_ever"], ["neil", "yo #rofl_so_funny"]])
What I would do first is set the names of the columns, like this:
df.columns = ['user', 'tweet']
or on creation of the DataFrame:
df = pd.DataFrame([["jim", "i #yolo_omg her"], ["jack", "you #yes_omg #best_place_ever"], ["neil", "yo #rofl_so_funny"]], columns=['user', 'tweet'])
Then apply the findall function with a regex:
df['tag'] = df["tweet"].str.findall("(#[^ ]*)")
Note that I use a negated character group ([^ ]) instead of a positive one, since it is more likely to survive special cases.
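One possible way to get from the list of tags to the one-row-per-subtoken table asked for in the question is sketched below. It assumes pandas 0.25 or later (for DataFrame.explode), moves the capture group inside the `#` so the tags come out without the leading hash, and uses a hypothetical helper named `with_parts`. Treat it as an illustration, not as a benchmarked solution for a huge dataset.

```python
import pandas as pd

df = pd.DataFrame(
    [["jim", "i #yolo_omg her"],
     ["jack", "you #yes_omg #best_place_ever"],
     ["neil", "yo #rofl_so_funny"]],
    columns=["user", "tweet"],
)

def with_parts(tag):
    """Hypothetical helper: the tag itself plus its '_'-separated parts."""
    parts = tag.split("_")
    return [tag] + parts if len(parts) > 1 else [tag]

# One list of tags per tweet; the capture group excludes the leading '#'.
tags = df["tweet"].str.findall(r"#([^ ]+)")

# Flatten each tweet's tags into a single list of subtokens.
subtokens = tags.map(lambda found: [tok for tag in found for tok in with_parts(tag)])

result = (
    df[["user"]]
    .assign(tag=subtokens)
    .explode("tag")           # one row per subtoken
    .dropna(subset=["tag"])   # tweets without any hashtag yield NaN
    .reset_index(drop=True)
)
print(result)
```

With the three sample tweets this yields 14 rows, from (jim, yolo_omg) down to (neil, funny), in the same order as the desired output in the question.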