mysql - Sql Popularity algorithm with weighted score -
i'm implement algorithm returns popular posts @ moment, given likes , dislikes.
to this, each post add likes (1) , dislikes (-1) score each like/dislike weighted : latest, heaviest. example, @ moment user likes post, weights 1. after 1 day, weights 0.95 (or -0.95 if it's dislike), after 2 days, 0.90, , on... minimal of 0.01 reached after 21 days. (ps: theses totally approximate values)
here how tables made :
posts table
id | title | user_id | ... ------------------------------------------- 1 | random post | 10 | ... 2 | post | 36 | ... n | ... | n | ... likes table
id | vote | post_id | user_id | created ---------------------------------------- 1 | 1 | 2 | 10 | 2014-08-18 15:34:20 2 | -1 | 1 | 24 | 2014-08-15 18:54:12 3 | 1 | 2 | 54 | 2014-08-17 21:12:48 here sql query i'm using which job
select post.*, like.*, sum(like.vote * (1 - if((timestampdiff(minute, like.created, now()) / 60 / 24) / 21 > 0.99, 0.99, (timestampdiff(minute, like.created, now()) / 60 / 24) / 21)) ) score posts post left join likes on (post.id = like.post_id) group post.id order score desc ps: i'm using timestampdiff minute , not day directly because i'm calculating day myself otherwise returns me integrer , want float value, in order gradually decay overtime , not day per day. timestampdiff(minute, like.created, now())/60/24 gives me number of day passed since creation decimal part.
here questions :
- look @
if(expr1, expr2, expr3)part : necessary in order set minimal value like's weight, not go under 0.01 , become negative (and like, older still has little weight). i'm calculating 2 times same thing : expr1 same expr2. isn't there way avoid duplicate expression ? - i going cache query , update every 5 minutes, think pretty heavy on big
post,liketable. cache necessary or not ? i'm aiming run query on table 50 000 entries, , each 200 associated likes (that makes 10 000 000 entriesliketable). - should create index in
liketable post_id ? , created ?
thank !
edit: imagine post can have multiple tags, , each tag can belong multiple posts. if want populars posts given tag or multiple tag, can't cache each query ; there amount of possible queries. query still viable ?
edit final solution: did tests. created table post 30 000 entries , 250 000 entries. without index, query incredibly long (timed out > 10mn), indexes on post.id (primary), like.id(primary) , like.post_id took ~0.5s.
so i'm not caching data, neither using update every 5mn. if table keeps growing still possible solution (over 1s it's not acceptable).
2: going cache query , update every 5 minutes, think pretty heavy on big post , table. cache necessary or not ? i'm aiming run query on table 50 000 entries, , each 200 associated likes (that makes 10 000 000 entries table).
10000 , 50000 considered small on current hardware. table sizes won't need cache, unless query run several times per second. anyway, performance test before deciding have cache.
3: should create index in table post_id ? , created ?
i create index (post_id, created, vote). way query can information index , doesn't need read table @ all.
edit (response comments):
an index slow down inserts/updates slightly. in end, path choose dictate characteristics of need in terms of cpu/ram/disk i/o. if have enough ram db expect entire like table cached in ram might better off index on post_id.
in terms of total load need consider ratio between insert , select , relative cost of insert , select or without index. gut feeling total load lower index.
regarding question on concurrency (selecting , inserting simultaneously). happens depends on isolation level. general advice keep inserts/updates short possible. if don't unneccessary things between start of insert , commit should fine.
Comments
Post a Comment