Personalised Hacker News Weekly Updates based on your selected topics
4 points by MKBSP 2 hours ago | 3 comments





Sounds like it would be straightforward to do, particularly because a weekly cadence gives stories time to accumulate votes and comments, which would give you some stats for selection.

How are you categorizing articles?


Still a work in progress, and I'm a noob, so it will take a little while.

I'm scraping the posts every 30 minutes and categorizing each one according to whether it is just a link to another page (a) or actually contains content (b).

(a) => open the link and scrape that content
(b) => scrape the post's own text from the start, then open and scrape any links in it if possible.

This effectively gives me an "enriched" database, so each week I can use the "extra" data to do a "semantic search", e.g. submissions that talk about Beauty, Spain, and Beauty in Spain, or other combinations of topics (RAG: https://help.openai.com/en/articles/8868588-retrieval-augmen...).
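A minimal sketch of what that weekly semantic search could look like, assuming the enriched text is stored alongside each post and using SBERT-style embeddings (the model choice and field names here are my assumptions, not from the post):

    # Hypothetical sketch: embed enriched posts, then rank them against a
    # topic query by cosine similarity. Not the author's actual pipeline.
    from sentence_transformers import SentenceTransformer
    import numpy as np

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

    def semantic_search(posts, query, top_k=10):
        # posts: list of dicts with an "enriched_text" field
        # (title plus whatever was scraped from the link or post body)
        doc_vecs = model.encode([p["enriched_text"] for p in posts],
                                normalize_embeddings=True)
        q_vec = model.encode(query, normalize_embeddings=True)
        scores = doc_vecs @ q_vec  # cosine similarity on unit vectors
        top = np.argsort(scores)[::-1][:top_k]
        return [posts[i] for i in top]

    # e.g. semantic_search(posts, "Beauty in Spain")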

The problem with only doing this once per week is that niche content that didn't get enough upvotes gets lost. But I do want to factor in the "weight" of upvotes and comments.
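One simple way to fold that weight in, purely as an illustrative assumption (the blend and the log-damping are made-up knobs, not something from this thread):

    import math

    def ranking_score(similarity, points, num_comments, alpha=0.7):
        # Log-damp engagement so a niche post with few upvotes is not
        # drowned out by a single mega-thread; alpha trades relevance
        # against popularity. All values here are hypothetical.
        engagement = math.log1p(points) + 0.5 * math.log1p(num_comments)
        return alpha * similarity + (1 - alpha) * engagement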

What do you think?


I have my own personal recommender, YOShInOn, which is an RSS reader that shows me about 5% of what it ingests. If you look me up via my profile I could show you a demo.

My answer to the diversity problem is this: out of maybe 2,000-10,000 items I have the system make N=20 clusters with

https://scikit-learn.org/1.5/modules/generated/sklearn.clust...

and instead of picking out the 300 items with the highest score I pick the top 15 items in each cluster. Everything I post to HN was selected by YOShInOn once and by me twice and I think you can see the clusters at work if you note I post articles about programming, sports, environmental issues, advanced manufacturing, omics, energy technology, etc. If I pulled the top 300 it would all be arXiv papers about recommender systems with a few "circular economy" manufacturing topics.

If you found, say, 200 HN articles on a certain topic last week, you might want a smaller cluster count, maybe N=5. There are other approaches to the diversity problem in the literature too, but this one is easy.
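A sketch of that cluster-then-pick idea; KMeans is an assumption here, since the truncated link only shows that it points at sklearn.cluster:

    # Cluster the embeddings, then take the top-scoring items per cluster
    # instead of a global top-300. KMeans is assumed, not confirmed.
    import numpy as np
    from sklearn.cluster import KMeans

    def diverse_top(embeddings, scores, n_clusters=20, per_cluster=15):
        # embeddings: (n_items, dim) array; scores: np.ndarray of length n_items
        labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(embeddings)
        picked = []
        for c in range(n_clusters):
            idx = np.where(labels == c)[0]
            best = idx[np.argsort(scores[idx])[::-1][:per_cluster]]
            picked.extend(best.tolist())
        return picked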

I get amazing results with SBERT embeddings on HN titles and similar short texts. There is the trouble of ambiguous titles that nobody could classify, but if a title is clear enough for you to get the gist of it, SBERT probably does well on it. If you are crawling the stories you are increasing your data 1000x, but you are NOT going to get 1000x better results. Here is how I do on thumbs-up/thumbs-down classification with just titles and an obsolete algo:

https://ontology2.com/essays/ClassifyingHackerNewsArticles/

SBERT would add a few points of AUC, I could imagine into the low .8's, but the up/down classification is noisy.

You actually could make a decent prototype that just uses the titles and not face webcrawler problems, context windows that aren't big enough, etc.
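For what it's worth, a titles-only prototype along those lines could be as small as this; the embedding model and classifier are my assumptions, not what the linked essay used:

    # Titles-only up/down classifier: SBERT embeddings into a linear model.
    # Model name and classifier choice are assumptions for illustration.
    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    def train_title_classifier(titles, labels):
        # titles: list of strings; labels: 1 = thumbs up, 0 = thumbs down
        emb = SentenceTransformer("all-MiniLM-L6-v2").encode(
            titles, normalize_embeddings=True)
        X_tr, X_te, y_tr, y_te = train_test_split(
            emb, labels, test_size=0.2, random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
        return clf, auc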




