Breaking the 4Chan CAPTCHA

cherryteastain · 50 minutes ago

The part about bad Keras<->TensorFlow.js interop is classic TensorFlow. Using TF always felt like using a bunch of vaguely related tools put under the same umbrella rather than an integrated, streamlined product.

Actually, I'll extend that to say that every open source Google library/tool feels like that.

Retr0id · 1 minute ago

something something Conway's law

antirez · 1 hour ago

Appropriate response by 4Chan to this: simplify the human work, given that it's simple to solve via NNs anyway. We are at a point where designing very hard captchas is highly likely to increase human annoyance without decreasing machine solvability.

makifoxgirl · 18 minutes ago

This project also solves the 4chan captcha https://github.com/moffatman/chan

ChrisMarshallNY · 51 minutes ago

That’s like spending a few hours, learning to take the lid off your septic tank.

blackjackfoe · 36 minutes ago

Little bit, but at least you learned something :)

morkalork · 50 minutes ago

Following the links to the captcha solving service, you can read profiles of the humans doing the work, where it's pitched as more ethical than them working in hazardous factories!

lofenfew · 1 hour ago

It might be worth noting that these, including the harder version the OP encountered, are not the hardest captchas that 4chan can serve. There is a still harder version which is sent to less trustworthy IPs. I imagine it would still be tractable with computer vision. This in part misses the point though, since 4chan has been continuously altering their captcha since it was released, making it difficult to create a permanent solution that won't be broken down the road.

chatmasta · 26 minutes ago

Datacenter IPs can’t even post at all, never mind needing to solve a CAPTCHA. That’s why the accusations of “VPN shill” are usually wrong, as is the assumption of anonymity – 4chan is in fact one of the least anonymous sites on the internet. The optional username feature gives it a veneer of anonymity, but the strict IP requirements ensure almost every post is attributable to a residential internet connection, and reliably associable with other posts from that same connection.

blackjackfoe · 18 minutes ago

Some datacenter IPs can post fine, mostly just not those belonging to any large hosting company. I would mention a list of ones I know aren't blocked, but, well, that might get them blocked.

chatmasta · 11 minutes ago

That’s surprising to me. I assumed they were using some service (like Cloudflare) with an updated list of non-residential IP addresses.

I’ve only ever tried to post through Cloudflare WARP (or Apple Private Relay, which is also Cloudflare but with a different exit IP range). Once I realized that didn’t work, I thought maybe it wasn’t worth posting at all :) I don’t like the idea of my ISP having any suspicion I posted to 4Chan (even if it’s technically https yadda yadda…)

gruez · 17 minutes ago

What about users behind CGNAT, like mobile users?

chatmasta · 15 minutes ago

That’s attributable with the right warrant and correlation with other data available to the ISP.

CGNAT is not an anonymity mechanism – at best it may be a very crude one, but the carriers will make extra effort to remove that anonymity through logging, retention, and segmentation.

BlueTemplar · 10 minutes ago

"Attributable" means by law enforcement, and mobile carriers, like all ISPs, must keep logs. In this case, for who had which IP address when.

(Otherwise, it's akin to the usual confusion between anonymity and pseudonymity.)

chatmasta · 6 minutes ago

That’s true, but to be fair my original comment also said posts would be reliably associable with other posts from the same IP. With CGNAT, that association will be slightly less reliable, but not meaningfully so. The segment of the population that posts on 4chan is so small that there is negligible chance of two 4chan users sharing an exit IP and time window. Even with non-overlapping time windows, the population will be small enough for stylometry (and other factors) to remove any remaining ambiguity.
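
For anyone who wants to put rough numbers on the "sharing an exit IP" argument, here is a minimal sketch. It assumes every subscriber behind one CGNAT exit IP posts independently with the same probability, which is a simplification. The function name and both inputs are hypothetical; the thread gives no real figures for either, so the values in the usage lines are placeholders only.

```python
def prob_shared_exit_ip(subscribers_per_ip: int, poster_fraction: float) -> float:
    """Chance that at least one OTHER subscriber behind the same exit IP
    also posts in the same time window.

    Assumption (not from the thread): subscribers post independently,
    each with probability `poster_fraction`.
    """
    if subscribers_per_ip < 1:
        raise ValueError("subscribers_per_ip must be at least 1")
    if not 0.0 <= poster_fraction <= 1.0:
        raise ValueError("poster_fraction must be between 0 and 1")
    others = subscribers_per_ip - 1
    # Complement of "none of the other subscribers post".
    return 1.0 - (1.0 - poster_fraction) ** others


if __name__ == "__main__":
    # Placeholder inputs for illustration, not measured values.
    for n in (10, 100, 1000):
        print(n, prob_shared_exit_ip(n, 0.001))
```

The result depends heavily on how many subscribers a carrier puts behind one address and how long the time window is, so plug in your own estimates before drawing conclusions.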

blackjackfoe · 58 minutes ago

Yeah, I encountered those as well in my data gathering. I threw them out from the training set, but I kept them for possible future experimentation.

Shank · 56 minutes ago

Can you upload a few of these samples somewhere?

blackjackfoe · 18 minutes ago

I need to manipulate the data a bit, because right now it's just raw, unaligned foreground/background images with solutions. I need to do the alignment and save them as images rather than JSON files. I'll do that when I have the time.

cchance · 50 minutes ago

Jesus, looking at both example captchas... as a human... I have no fucking clue what the answer is lol

tumsfestival · 1 hour ago

I can only imagine how much worse they'll make the captcha after stuff like this picks up speed with the users, all the while being ineffective against the bots.

rany_ · 1 hour ago

I really doubt that they're the first to do this.

OmarShehata · 51 minutes ago

Captchas are broken, forever. There is no way to prevent bots without also preventing a bottom tier of human users (visually impaired people, old people, or just impatient people). As this xkcd comic [1] suggests, we need to just focus on rewarding and punishing specific behavior, regardless of whether the agent is human or not.

[1] https://xkcd.com/810/

cchance · 53 minutes ago

I mean at some point ... the average visitor is dumber than the AI and you're now just blocking dumb people

OmarShehata · 51 minutes ago

yes, we're creating websites that are gated by IQ tests. This isn't the way

dmitrygr · 57 minutes ago

  > The official TensorFlow-to-TFJS model converter doesn't work on Python 3.12. This doesn't seem to really be documented, and the error messages thrown when you try to use it on Python 3.12 are non-obvious. I tried an older version of Python (3.10) on a hunch, using PyEnv, and it worked like a charm.

Amazing. And then people wonder why "just use python 2" is still a thing.
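
Since the failure mode described in the quote is a non-obvious error message, a small guard run before conversion might save some head-scratching. This is only a sketch: the function name and the version sets are made up for illustration, and the sets contain just the two data points from the quoted post (3.10 worked, 3.12 did not). Every other version is reported as untested rather than guessed at.

```python
import sys

# The only data points come from the quoted post: 3.10 worked, 3.12 failed.
# Nothing here is taken from the converter's own documentation.
KNOWN_GOOD = {(3, 10)}
KNOWN_BAD = {(3, 12)}


def converter_status(version=None):
    """Classify a (major, minor) Python version for the TFJS converter.

    Defaults to the running interpreter when no version is given.
    """
    if version is None:
        version = sys.version_info[:2]
    key = tuple(version[:2])
    if key in KNOWN_GOOD:
        return "known-good"
    if key in KNOWN_BAD:
        return "known-bad"
    return "untested"


if __name__ == "__main__":
    status = converter_status()
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
    if status == "known-bad":
        print("Consider a 3.10 interpreter (for example via pyenv) before converting.")
```

Running it at the top of a conversion script turns a confusing stack trace into a one-line hint about the interpreter.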

orhmeh09 · 55 minutes ago

Do you have examples of "just use python 2" still being a thing in 2024?

dmitrygr · 48 minutes ago

Yeah, whenever I need to write a quick script and have no time to suffer "$library needs python 3.x, where x must be > $value and <= $value2, and not a prime except when that ends in a 3, except on leap days"

2 is stable and does not change from under you. Which is what you want in a programming language.

anigbrowl · 1 hour ago

Congratulations, now it will get upgraded and become more work for humans to solve, increasing the burden on every non-malicious user.

jeroenhd · 1 hour ago

It's not like bots aren't already bypassing these CAPTCHAs. One author writing a blog post about how they accomplished what spammers and bots have been doing for ages isn't going to change anything.

I just opened 4chan and after the initial Cloudflare bot detection I was told to register an email or wait 15 minutes before I was allowed to even obtain a CAPTCHA. Looks like they're already taking a layered approach to combat bots.

blackjackfoe · 1 hour ago

(author here) Interestingly, the email registration/time-limit was added after I started this project, but before I told anyone about it.

sunaookami · 1 hour ago

There are already loads of extensions and scripts out there that can solve these captchas with a great success rate.
