It won't ban them; it just limits simultaneous connections from the same IP.
Unless there are dozens of people behind the exact same IP, rather than an IP pool, it won't be a problem.
Worst case, they will see the initial page load in their browser delayed by half a second. But it helps the server tremendously, especially since HN seems to use Apache.
...and if they are all hitting HN at the exact same millisecond, then their connections should be delayed.
HN serves with connection-close, not keep-alive, so as soon as one request is done, the connection is freed for the next visitor on the same IP. This would just force them to be in single file on a very quickly moving line instead of requiring dozens of connections to be served all at the same time.
Think of a grocery store with one super-fast express lane vs. no express lane and a dozen very slow cashiers, with people pushing full carts ahead of you.
Don't knock connlimit until you try it. Again, it's not a ban, just backlogs the requests.
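If anyone wants to try it, the kind of rule I mean is a one-liner (the limit of 4 here is purely illustrative, not a recommendation for HN specifically):

    # Drop new connection attempts to port 80 once an IP already has 4 open.
    # DROP rather than REJECT means the client's TCP stack quietly retries,
    # so the visitor sees a short delay instead of an error page.
    iptables -A INPUT -p tcp --syn --dport 80 -m connlimit --connlimit-above 4 -j DROP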
That sounds better, but it feels like a band-aid solution to me. For example, I worry about whether it will actually fix the load problems if a bad network has lots of requests, resulting in a very long queue and lots of open connections. It sounds like it's worth trying, at least.
As it currently stands, they would simply be unable to use HN if they were all loading it at the same time, as the server would just ban them; do you feel that is really a better solution than the proposed delay?
I think that the proposed solution gives preferential treatment to users who were around long enough (or have enough money) to be on a network where they are assigned their very own personal IPv4 address. If IP addresses mapped 1:1 to users or machines, then I'd be all for using xt_connlimit to throttle users who perform excess requests.
Even if you add a proposed delay, a user behind one of these NATted networks could (unintentionally, I hope) cause a DoS by sending lots of requests to make the queue unreasonably long, which, to someone behind the NAT, is just as bad as a server ban.
Static objects are coming from Amazon, while dynamic content is coming from another server at theplanet.com
Apache/2.2.19 (FreeBSD)
So you are right, it's FreeBSD, but it's still Apache, which really needs connection throttling. But there might be a reverse proxy in place. You can also IP throttle with a module in nginx.
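If there is a reverse proxy, the nginx side would look roughly like this (the directives come from the stock limit_req module; the zone name, rate, and burst values are placeholders I made up):

    # fragment of nginx.conf (events block, logging, etc. omitted)
    http {
        # Track the request rate per client IP in a 10 MB shared zone.
        limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

        server {
            location / {
                # Queue (delay) up to 10 excess requests per IP instead of rejecting them.
                limit_req zone=perip burst=10;
                proxy_pass http://127.0.0.1:8080;
            }
        }
    }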
pg: I have a fair bit of Lisp dev experience. If, as a weekend project, I modified the HN src to use postgres and memcache, would you consider using it in production? Obviously, I don't expect carte blanche prior agreement, but I wouldn't want to invest the time unless I thought it was plausible the work could actually help.
I would expect it to solve most of your performance problems for the foreseeable future (at the very least, by letting you scale horizontally and move the DB, frontends, and memcaches to separate boxes - plus ending memory leaks/etc by moving most of the data off the MzScheme heap).
The obvious downside is that it would use your (or someone at YC's) time: first to merge the changes I make to http://ycombinator.com/arc/arc3.tar into the production code, then to buy/set up some extra boxes and do the migration. We're probably talking, roughly, a day. It also has the unfortunate side effect of costing HN's src some of its pedagogical value, since it adds external dependencies and loses 'purity'.
Been looking for an excuse to learn arc for a while now ...
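To make the offer concrete, the data-access change is essentially read-through caching. A rough Python sketch of the pattern (Python purely for illustration, since the real code would be Arc; the database name, table, and key scheme are all invented):

    import psycopg2
    from pymemcache.client.base import Client

    cache = Client(("127.0.0.1", 11211))
    db = psycopg2.connect("dbname=hn")  # hypothetical connection string

    def get_item(item_id):
        """Fetch an item body, hitting memcache first and Postgres on a miss."""
        key = "item:%d" % item_id
        cached = cache.get(key)
        if cached is not None:
            return cached.decode("utf-8")
        cur = db.cursor()
        cur.execute("SELECT body FROM items WHERE id = %s", (item_id,))
        row = cur.fetchone()
        cur.close()
        if row is None:
            return None
        cache.set(key, row[0].encode("utf-8"), expire=300)  # keep hot items out of the app heap
        return row[0]

The point is that item data lives in Postgres and the hot set lives in memcached, so the MzScheme process no longer has to hold everything in its own heap, and extra frontends can share the same cache and database.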
I suspect there's a good reason why HN is still using this old codebase. YC, after all, is not short of the cash needed for a complete revamp.
The site is very much hacked together, but works... In a lot of ways, this reflects the hacker ethos of getting something up and running quickly at low cost while still producing value.
A revamp might have a negative impact too, by attracting a wider, more mainstream audience which could dilute the purity of the community here.
Careful now :) It's not like there's anything stopping HN attracting a wider audience anyway; there's no restriction on who can register. Anyone can come and join in, which (in my opinion) is as it should be.
Of course. I'm not suggesting that there should be any limitations on who can join, but as the community moves more mainstream, quality will dilute. As the site is rather un-sexy right now, it seems to attract those who are genuinely interested. Remember what happened to Digg...
There's also the usual engineer estimation: "Oh, it will probably take a day to rewrite the code. We'll deploy it and it will probably work just fine in production."
Any engineer who has shipped live code has made this mistake before.
Very generous offer, but I would argue that HN's slow performance is a feature, not a bug. The average drive-by person, who is attracted to sensationalist articles and titles, simply doesn't have the patience for the slow load times on every page. The user seeking intelligent conversation, however, is more than willing to accept 5+ second waits knowing they will get valuable content. Couple that with load times being consistently slow rather than fluctuating, and I wouldn't put it past PG to build a delay into page loads to act as a sort of filter. Even if it's unintentional, I would argue it's still useful in driving out some of the riff-raff.
I also believe that Hacker News runs on a small stack of services developed by some past companies from Y Combinator.
I would agree that there is also little to no desire to make Hacker News "the news place" - where it supports thousands of posts a second and is extremely popular. In general Hacker News is used (and the hope is for it to stay that way) by startups and people interested in startups - it's slowly growing to include more types of people - marketers, companies, bloggers who just want a lot of hits, etc. - and not many people want to purposely support that.
The same thing every smart developer who ever committed or deployed a line of vulnerable code thought: "I'm just trying to get this feature done, not write a formal proof". You're in good company.
It makes me think that one non-negotiable feature of any webapp architecture is to detect situations when inbound strings are placed in any context where they can be interpreted as code, and either refuse to run or at least spit out a severe warning.
And there are no webapp architectures which do this.
Neat. Something like SafeBuffer is a practical way to approach the problem.
It seems like with the rise of 'zero copy' approaches we could do even better - simply designate a memory region as unsafe, and transform it into a safe version depending on which context it is used in. These transforms would want to add a little metadata pointing to the original unsafe region in case the transformed region is ever subsequently used in a different execution context.

Alas, from the perspective of one program the input to another always just looks like a string, which means that somehow our host program (and programmer) needs to signal the appropriate transform on, say, concatenation. The only way I can think of around this requirement is to force implementors of contexts to tag their interfaces as a context, and for callers to construct arguments to those functions such that constituents derived from unsafe regions are detectable. For example, we have a SQL context that takes an array of string pointers, where some of the pointers point to 'unsafe' regions, and we just concatenate the elements of the array to construct the context argument.
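A toy Python version of that tagging idea might look like the following (every class and function name here is invented; it is a sketch of the approach, not an existing library):

    import html

    class Unsafe(str):
        """A string that came from outside the program and has not been escaped."""
        pass

    def html_context(*parts):
        """Concatenate parts for an HTML context, escaping anything still tainted."""
        return "".join(html.escape(p) if isinstance(p, Unsafe) else p for p in parts)

    def sql_context(template, *parts):
        """Build a (query, params) pair; tainted parts only ever become bind parameters."""
        if not all(isinstance(p, Unsafe) for p in parts):
            raise ValueError("trusted literals belong in the template, not the arguments")
        return template, tuple(parts)

    name = Unsafe("<script>alert(1)</script>")
    page = html_context("<p>Hello, ", name, "</p>")                    # script tag comes out escaped
    query = sql_context("SELECT * FROM users WHERE name = %s", name)   # taint forced into a bind parameter

The interesting part is exactly what the parent describes: the caller has to route every piece of input through a context-aware combinator, because plain string concatenation erases the taint.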
The Play framework (Scala, Java) and Mojolicious (Perl) (and many other newer frameworks probably) escape output by default, so at least they make you think before allowing XSS.
Ah, the fun part of this is "interpreted as code". Which language? html, xml, js, css, json? Get that part wrong or slightly off, and what you sanitized for one isn't for the other. And sometimes there can be nested contexts.
While the idea of "taint" is useful, it is only half the battle. The other half is accounting for the context.
Do you have a rough set of guidelines for how fast we should request from HN? For a side project, I was thinking of writing something that scraped the HN frontpage and all the associated comment threads every 10 minutes or so, and I'd rather not cause performance issues or get banned. I'd be happy to rate-limit requests to whatever is convenient.
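For what it's worth, the polite version I had in mind is just a fixed delay between requests plus an identifiable User-Agent. A rough Python sketch (the 30-second delay and the contact address are placeholders I picked, not official guidance from HN):

    import time
    import requests

    DELAY_SECONDS = 30  # arbitrary; err on the slow side until told otherwise
    HEADERS = {"User-Agent": "hn-frontpage-scraper (contact: me@example.com)"}  # made-up contact

    def fetch(url):
        """GET a page, then pause so requests are never sent back-to-back."""
        resp = requests.get(url, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        time.sleep(DELAY_SECONDS)
        return resp.text

    front_page = fetch("http://news.ycombinator.com/")
    # ...parse out the item links, then fetch each thread through the same throttled helper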
Quote: "HNSearch was built by the team at ThriftDB to give back to the community and to test the capabilities of the ThriftDB flexible datastore with search built-in."
I'm curious as to why HN would be walking such a performance tightrope. I could speculate, but it would be uninformed rambling, so I'd love it if someone more knowledgeable than I could explain.
The last bit is key. HN is served off flat files, and caches state in-memory in global variables. That -- and not cost -- makes it hard to add a second machine.
It also makes it nearly impossible to slowly read through one's own comment history, as the "next" pagination links depend on session data that is garbage collected quite frequently.
This is, quite possibly, the worst webapp I use on a regular basis.
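My understanding - and this is a guess at the pattern, not HN's actual source - is that each "More" link points at a closure stashed in the server process's memory, roughly like this Python sketch, which is why the links expire and why a second box couldn't serve them:

    import secrets
    import time

    # In-process table of pending "More" links: id -> (expiry, closure).
    # Because it lives in this one process's memory, a request that lands on a
    # different machine, or that arrives after cleanup, gets "unknown or expired link".
    pending = {}

    def make_more_link(render_next_page, ttl=600):
        fnid = secrets.token_hex(8)
        pending[fnid] = (time.time() + ttl, render_next_page)
        return "/x?fnid=" + fnid

    def handle_more(fnid):
        entry = pending.get(fnid)
        if entry is None or entry[0] < time.time():
            return "Unknown or expired link."
        return entry[1]()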
The first and foremost reason for me to visit HN is that it is fast. I am in China, and I usually spend time on the web only on my phone, with a 3G connection.
HN's speed beats all other link aggregators, blogs, news sites, and even Google search, and -- most interestingly -- even fast Chinese sites.
I don't know why it is so fast (except when it is down, obviously); maybe it is because of this flat-file architecture, which would make sense. (Git is very fast too, right?)
And I think it is interesting that "make it fast" is a leitmotif that has been forgotten by so many people, Google first among them, but is still a reason for some (me, at least) to pick this site over others.
Thank you for explaining why the 'unknown link' error happens at all. It's terrible - the same goes for when you spend time thinking and formulating a response, only to see it disappear with the same error.
Awesome. I've gotten my IP banned several times after the browser crashed and I reopened the tabs (I had too many HN threads open prior to the crash, enough to trigger the ban).
Yeah... if I open Chrome I am pretty much guaranteed to be banned for days. :( The mechanism should really be changed to account for this: a burst of requests that lasts only a few seconds should not trigger a ban on its own; it should take a requests-per-second spike combined with sustained usage over the minute. I actually made modifications to Chrome to change how it loads tabs, mainly because of Hacker News' weird IP ban system, but I still got burned recently when I accidentally hit "undo close tab" one too many times, which reopened an entire window.
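Concretely, what I'm picturing is a two-window check along these lines (thresholds pulled out of thin air; this is a sketch of the heuristic I'd want, not HN's actual rules):

    import time
    from collections import deque

    BURST_WINDOW, BURST_LIMIT = 5, 50             # more than 50 requests in 5 seconds...
    SUSTAINED_WINDOW, SUSTAINED_LIMIT = 60, 200   # ...AND more than 200 in the last minute

    hits = {}  # ip -> deque of request timestamps

    def should_throttle(ip):
        now = time.time()
        q = hits.setdefault(ip, deque())
        q.append(now)
        while q and q[0] < now - SUSTAINED_WINDOW:
            q.popleft()
        burst = sum(1 for t in q if t > now - BURST_WINDOW)
        # Reopening a pile of tabs trips only the burst check, which is fine;
        # a burst *plus* sustained load over the whole minute is what gets throttled.
        return burst > BURST_LIMIT and len(q) > SUSTAINED_LIMIT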
Yeah: I ended up figuring out a way to add it. I now generally like having the feature, but it was a complete necessity due to the Hacker News IP ban rules (although, as I mentioned, still doesn't solve the underlying problem for this site, which is incredibly touchy).
It is so annoying that all the other browsers STILL have not implemented this little but very effective idea - please speak more loudly about this, as it seems most developers here did not even notice this feature...
My solution is to use a firewall with per-application rules and just turn off network access for Chrome before I launch it. On my laptop I just unplug the wired/wireless network during the launch. This was mainly because of HN, but it also has the added benefit of using fewer system resources, since a blank page is typically less resource-hungry than a real page.
Firefox has a better solution for this, but then again, I don't use Firefox.
Repost from "Show dead" that relates to this issue:
sunstone1, 10 hours ago [dead]:
Well, I never had my IP banned, but I did have my account hellbanned after about a dozen posts, as you can see. Oh, actually no, you can't see, because it's banned. No, I never bothered to get another account; now I'm just a taker, not a giver.
Most of the time it's clear why a user was banned, but looking at sunstone's history I don't really see a reason. While the algorithm will never be perfect, it would be nice if there was a clearer solution for misfires.
Great news! I was banned last week (http://news.ycombinator.com/item?id=4736919); the ban was lifted in the meantime. But this will come in handy the next time I'm developing an extension for HN and refreshing it all the time :)
Well, I might as well try striking while the code is hot..
It occurs to me that I would like to interact with noprocrast in a different manner. Currently, I leave noprocrast disabled most of the time. I like to use longish minaway times (~day), but this makes me feel as if my first visit to HN will start the clock ticking, and I'd better be sure to get my HN fill before the timer runs out (yes, this is kind of ridiculous). So I only enable noprocrast (with a short maxvisit) upon realizing I'm stuck in a web loop.
The mechanism that I envision is either a button that immediately starts a one-shot noprocrast ban, or a page-count based maxvisit. The latter might be better since it could always be left enabled.
Thanks Paul! I'm reluctant to try this in conjunction with developing any HN scrapers since I'm not sure what set it off in the first place and your language suggests it will only unban the IP once (I will, however, make sure the CMU IP I was using gets unbanned). It would be helpful to know what, precisely, that hair trigger is so we can make sure to avoid it.
Since there are so few images on HN, there is no reason to have more than a couple connections per IP on port 80.
It will radically reduce your server load and there will be no blacklists/whitelists to maintain.