Breaking the 4Chan CAPTCHA (www.nullpt.rs)
83 points by hazebooth 2 hours ago | 28 comments





The part about bad Keras<->Tensorflow.js interop is classic Tensorflow. Using TF always felt like using a bunch of vaguely related tools put under the same umbrella rather than an integrated, streamlined product.

Actually, I'll extend that to saying every open source Google library/tool feels like that.


something something Conway's law

Appropriate response by 4Chan to this: simplify the human work, given that it's simple to solve via NNs anyway. We are at a point where designing very hard captchas is highly likely to increase human annoyance without decreasing machine solvability.

This project also solves the 4chan captcha https://github.com/moffatman/chan

That’s like spending a few hours learning to take the lid off your septic tank.

Little bit, but at least you learned something :)

Following the links to the captcha solving service, you can read profiles of the humans doing the work, where it's pitched as more ethical than them working in hazardous factories!

It might be worth noting that these, including the harder version the OP encountered, are not the hardest captchas that 4chan can serve. There is a still harder version which is sent to less trustworthy IPs. I imagine it would still be tractable with computer vision. That partly misses the point though, since 4chan has been continuously altering their captcha since it was released, making it difficult to create a permanent solution that won't be broken down the road.

Datacenter IPs can’t even post at all, never mind needing to solve a CAPTCHA. That’s why the accusations of “VPN shill” are usually wrong, as is the assumption of anonymity – 4chan is in fact one of the least anonymous sites on the internet. The optional username feature gives it a veneer of anonymity, but the strict IP requirements ensure almost every post is attributable to a residential internet connection, and reliably associable with other posts from that same connection.

Some datacenter IPs can post fine, mostly just not those belonging to any large hosting company. I would mention a list of ones I know aren't blocked, but, well, that might get them blocked.

That’s surprising to me. I assumed they were using some service (like Cloudflare) with an updated list of non-residential IP addresses.

I’ve only ever tried to post through Cloudflare WARP (or Apple Private Relay, which is also Cloudflare but different exit IP range). Once I realized that didn’t work, I thought maybe it wasn’t worth posting at all :) I don’t like the idea of my ISP having any suspicion I posted to 4Chan (even if it’s technically https yadda yadda…)


What about users behind CGNAT, like mobile users?

That’s attributable with the right warrant and correlation with other data available to the ISP.

CGNAT is not an anonymity mechanism – at best it may be a very crude one, but the carriers will make extra effort to remove that anonymity through logging, retention, and segmentation.


"Attributable" means by law enforcement, and mobile carriers, like all ISPs, must keep logs. In this case, for who had which IP address when.

(Otherwise, it's akin to the usual confusion between anonymity and pseudonymity.)
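
To make "who had which IP address when" concrete, here is a toy sketch of the kind of lookup being described. Everything in it is an assumption for illustration: the record layout, the field names, the `attribute` helper, and the sample data are all made up, and real carrier logging will differ. It only shows why a shared CGNAT exit IP can still resolve to one subscriber when the source port and timestamp are known.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class NatLogEntry:
    """One hypothetical CGNAT log record: a port block leased to a subscriber."""
    public_ip: str
    port_start: int
    port_end: int
    lease_start: datetime
    lease_end: datetime
    subscriber_id: str


def attribute(logs: Iterable[NatLogEntry], public_ip: str, port: int,
              when: datetime) -> Optional[str]:
    """Return the subscriber whose lease covers (public_ip, port) at `when`."""
    for entry in logs:
        if (entry.public_ip == public_ip
                and entry.port_start <= port <= entry.port_end
                and entry.lease_start <= when < entry.lease_end):
            return entry.subscriber_id
    return None


# Made-up sample data: two subscribers share one exit IP during the same hour
# and are distinguished only by their port blocks.
LOGS = [
    NatLogEntry("203.0.113.7", 1024, 2047,
                datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 13, 0), "sub-A"),
    NatLogEntry("203.0.113.7", 2048, 3071,
                datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 13, 0), "sub-B"),
]
```

In this sketch, IP and time alone would match both subscribers; it is the port that disambiguates them, so a site that records only IP and timestamp gets a somewhat weaker association than the carrier does.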


That’s true, but to be fair my original comment also said posts would be reliably associable with other posts from the same IP. With CGNAT, that association will be slightly less reliable, but not meaningfully so. The segment of the population who posts on 4chan is so small that there is negligible chance of two 4chan users sharing an exit IP and time window. Even with non-overlapping time windows, the population will be small enough for stylometry (and other factors) to remove any remaining ambiguity.

Yeah, I encountered those as well in my data gathering. I threw them out from the training set, but I kept them for possible future experimentation.

Can you upload a few of these samples somewhere?

I need to manipulate the data a bit, because right now it's just raw, unaligned foreground/background images with solutions. I need to do the alignment and save them as images rather than JSON files. I'll do that when I have the time.

Jesus looking at both example captchas... as a human... i have no fucking clue the answer lol

I can only imagine how much worse they'll make the captcha after stuff like this picks up speed with the users all the while being ineffective against the bots.

I really doubt that they're the first to do this.

captchas are broken, forever. There is no way to prevent bots without also preventing a bottom tier of human users (visually impaired people, old people, or just impatient people). Like this xkcd [1] comic suggests, we need to just focus on rewarding and punishing specific behavior, regardless of whether the agent is human or not

[1] https://xkcd.com/810/


I mean at some point ... the average visitor is dumber than the AI and you're now just blocking dumb people

yes, we're creating websites that are gated by IQ tests. This isn't the way

> The official TensorFlow-to-TFJS model converter doesn't work on Python 3.12. This doesn't seem to really be documented, and the error messages thrown when you try to use it on Python 3.12 are non-obvious. I tried an older version of Python (3.10) on a hunch, using PyEnv, and it worked like a charm.

Amazing. And then people wonder why "just use python 2" is still a thing.
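
For anyone hitting the same converter problem, a minimal sketch of a version guard that fails with an obvious message instead of a non-obvious one. The only facts assumed are the ones in the quoted post: 3.10 worked and 3.12 did not. Other versions are not mentioned there, so this sketch treats them as unknown and rejects them. The function names and the `KNOWN_GOOD` constant are hypothetical, not part of any TensorFlow tooling.

```python
import sys

# Assumption from the quoted post: Python 3.10 worked, 3.12 did not.
# Any other version is untested there, so it is treated as not known-good.
KNOWN_GOOD = (3, 10)


def converter_python_ok(version=None):
    """Return True if major.minor of `version` matches the known-good version."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) == KNOWN_GOOD


def require_known_good_python():
    """Raise a readable error before attempting a conversion on an untested Python."""
    if not converter_python_ok():
        found = ".".join(str(part) for part in sys.version_info[:2])
        wanted = ".".join(str(part) for part in KNOWN_GOOD)
        raise RuntimeError(
            f"Running on Python {found}; the converter was only reported "
            f"to work on {wanted}. Switch interpreters (e.g. via pyenv)."
        )
```

Calling `require_known_good_python()` at the top of a conversion script would surface the version mismatch up front.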

Do you have examples of "just use python 2" still being a thing in 2024?

Yeah, whenever I need to write a quick script and have no time to suffer "$library needs python 3.x, where x must be > $value and <= $value2, and not a prime except when that ends in a 3, except on leap days"

2 is stable and does not change from under you. Which is what you want in a programming language.


Congratulations, now it will get upgraded and become more work for humans to solve, increasing the burden on every non-malicious user.

It's not like bots aren't already bypassing these CAPTCHAs. One author writing a blog post about how they accomplished what spammers and bots have been doing for ages isn't going to change anything.

I just opened 4chan and after the initial Cloudflare bot detection I was told to register an email or wait 15 minutes before I was allowed to even obtain a CAPTCHA. Looks like they're already taking a layered approach to combat bots.


(author here) Interestingly, the email registration/time-limit was added after I started this project, but before I told anyone about it.

There are already loads of extensions and scripts out there that can solve these captchas with a great success rate.


