hadoop - Hive & RegexSerDe returning just NULL
I'm trying to parse the following example line using RegexSerDe in Hive:
2011-07-22 20:34:51 808 8b1f27d094fb33ea - - - observed "unavailable" http://www.4shared.com/ 200 tcp_nc_miss text/javascript;charset=utf-8 http dc413.4shared.com 80 /network/search-suggest.jsp ?search=2 kfzhnit2lhyqa==&format=jsonp jsp "mozilla/5.0 (windows; u; windows nt 6.1; en-us; rv:1.9.2.18) gecko/20110614 firefox/3.6.18" 82.137.200.42 484 852 -
My table definition is this:
create external table browsing_data_ext(
  cdate string, ctime string, time_taken string, c_ip string, cs_username string,
  cs_auth_group string, x_exception_id string, sc_filter_result string,
  cs_categories string, cs_referer string, sc_status string, s_action string,
  cs_method string, rs_content_type string, cs_uri_scheme string, cs_host string,
  cs_uri_port string, cs_uri_path string, cs_uri_query string,
  cs_uri_extension string, cs_user_agent string, s_ip string, sc_bytes string,
  cs_bytes string, x_virus_id string
)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties (
  "input.regex" = "([\\-0-9]*) ([\\:0-9]*) ([\\d]*) ([\\.a-z0-9]*) ([\\-a-z0-9]*) ([\\-a-z0-9]*) ([\\-a-z0-9]*) ([\\w]*) (\\\"[\\w]*\\\") ([\\.\\-\\=\\&:\\/\\?a-z0-9]*) ([\\d]*) ([\\_\\w]*) ([\\w]*) ([\\/\\w]*) ([\\w]*) ([\\.\\w]*) ([\\d]*) ([\\.\\-\\=\\&:\\/\\?a-z0-9]*) ([\\.\\-\\=\\&:\\/\\?a-z0-9]*) ([\\.\\w]*) (\\\"[\\w\\w]*\\\") ([.:a-z0-9]*) ([\\d]*) ([\\d]*) ([\\-a-z0-9]*)",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s %12$s %13$s %14$s %15$s %16$s %17$s %18$s %19$s %20$s %21$s %22$s %23$s %24$s %25$s"
)
stored as textfile
location '/user/hdfs/data'
tblproperties ("skip.header.line.count"="6");
I've tested the regex in Rubular and a few other regex validation tools and it passes, but when I select from the table I receive only NULL values.
Thanks, Daniel
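I had to read a long log file, and the procedure that solved it was:
1) Create the regex at https://regex101.com/#java
2) Replace every single-backslash escape with a doubled one, e.g. "\w" with "\\w" and "\s" with "\\s".
Inside each pair of parentheses I used "+" and not "*", since "a+" means "one or more of a".
Without step 2) the result was a whole line of NULL values; after doubling the "\" on the special characters, the test line parsed.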
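The double-backslash step can be checked locally before touching the table. Below is a minimal Java sketch, assuming (as I understand it) that RegexSerDe needs the whole line to match and that a single backslash in the DDL string gets lost before the regex engine sees it. The class RegexLineCheck and the method fullMatchGroups are hypothetical helpers for testing, not part of Hive, and the patterns are shortened to the first four fields of the sample line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for local testing; it is not part of Hive.
class RegexLineCheck {

    // Returns the capture groups if the WHOLE line matches the regex,
    // or null otherwise (the case assumed to show up as a row of NULLs).
    static String[] fullMatchGroups(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] groups = new String[m.groupCount()];
        for (int i = 0; i < groups.length; i++) {
            groups[i] = m.group(i + 1);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Leading fields of the sample log line from the question.
        String line = "2011-07-22 20:34:51 808 8b1f27d094fb33ea";

        // The Java literal "\\d" is the regex \d, like a doubled backslash in the DDL.
        String escaped = "([-0-9]+) ([:0-9]+) (\\d+) ([a-z0-9]+)";
        // What the engine would see if the backslash were lost: a literal "d".
        String stripped = "([-0-9]+) ([:0-9]+) (d+) ([a-z0-9]+)";
        // Covers only the first two fields, so it can match only part of the line.
        String partial = "([-0-9]+) ([:0-9]+)";

        System.out.println(fullMatchGroups(escaped, line) != null);   // true
        System.out.println(fullMatchGroups(stripped, line) != null);  // false
        System.out.println(fullMatchGroups(partial, line) != null);   // false
        // find() accepts a partial match, as online testers often do.
        System.out.println(Pattern.compile(partial).matcher(line).find()); // true
    }
}
```

The last two lines show why a pattern can pass in a tester and still yield nothing here: a partial match is enough for find(), but not for matches().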