Bluesky's open API means anyone can scrape your data for AI training (techcrunch.com)
50 points by bookofjoe 2 hours ago | 60 comments





Good? I'd rather it be available to everyone than just a handful of large corporations.

Crazy part is this researcher is a part of the Bluesky community, he thought he was giving back - this stuff is great for moderation purposes, spam rejection, crypto scam detection, and just fun "maps of supercluster" projects [0].

[0] https://nucleardiner.wordpress.com/2023/05/21/bluesky-is-for...


I feel like there's a fundamental disconnect between the hacker ethos of "everything online is fair game, let's scrape a ton of data and see what cool stuff we can do" and the more...let's say European ideas about data and privacy rights. Neither one is inherently right or wrong, but they both cannot coexist on the internet.

Then what's the point for those moving from X to Bluesky?

AFAIK many are moving because their content might be used to train AI models.


"They'll train an AI on my deeply-valuable thoughts" is the absolute least of my worries about having an account on Twitter.

It's more just about me not wanting to stay in the place I grew up after all my friends moved away, especially after a whole bunch of slightly terrible people moved in next door.


Very few people are switching to Bluesky so they don't have their data scraped for AI models. X has become a messy cesspool of spammy ads, toxic comments, and an owner intent on making it his personal echo chamber.

I don't know where you got that idea. People are leaving because they don't like Elon.

Here comes the flurry of scary articles looking to find that angle. Can’t call this one a foreign actor so now openness is “bad”.

The article takes that angle because Bluesky users are currently harassing and threatening a researcher. Read the replies to his original thread: https://bsky.app/profile/danielvanstrien.bsky.social/post/3l...

From the article:

>Per a report by 404 Media, Daniel van Strien, a machine learning librarian at AI firm Hugging Face, pulled 1 million public posts from Bluesky via its Firehose API for machine learning research, pushing the dataset to a public repository. Van Strien later removed the data due to the controversy that ensued; however, it serves as a timely reminder that everything you post publicly to Bluesky is, well, public.


So what was the license on the data?

HF certainly have a restrictive license on data they produce.


"Harassing and threatening" by asking him not to use their data illegally and without consent?

> By creating an account you agree to the Terms of Service and Privacy Policy.

> The Bluesky App is a microblogging service for public conversation, so any information you add to your public profile and the information you post on the Bluesky App is public.

What did they expect? I don't know how they could make it more clear. It doesn't say "information you post will only be used by people you like". It is public.

Anybody can write like 10 lines of Python and start consuming the firehose.
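For what it's worth, here is a rough sketch of what that could look like. It assumes a public Jetstream endpoint (a JSON re-serving of the firehose) and the third-party `websockets` package. The URL, the event layout, and the helper names `extract_post_text` and `consume` are my assumptions for illustration, not anything quoted in this thread, so treat it as a sketch rather than a reference client.

```python
import json
from typing import Optional

# Assumption: a public Jetstream endpoint that re-serves the firehose as JSON.
# Neither this URL nor the event layout below comes from the thread.
JETSTREAM_URL = (
    "wss://jetstream2.us-east.bsky.network/subscribe"
    "?wantedCollections=app.bsky.feed.post"
)


def extract_post_text(raw: str) -> Optional[str]:
    """Return the text of a newly created post, or None for any other event."""
    event = json.loads(raw)
    commit = event.get("commit") or {}
    # Only "commit" events that create a record are of interest here.
    if event.get("kind") != "commit" or commit.get("operation") != "create":
        return None
    if commit.get("collection") != "app.bsky.feed.post":
        return None
    return (commit.get("record") or {}).get("text")


async def consume(limit: int = 10) -> None:
    """Print the text of the first `limit` posts seen on the stream."""
    import websockets  # third-party: pip install websockets

    seen = 0
    async with websockets.connect(JETSTREAM_URL) as ws:
        async for raw in ws:
            text = extract_post_text(raw)
            if text is not None:
                print(text)
                seen += 1
                if seen >= limit:
                    break
```

Running it would be `import asyncio; asyncio.run(consume())`. No login or API key appears anywhere in the sketch, which is the point being made above.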


Copyright laws don't cease to be a thing the moment you post something on the internet. The words are still yours. If this was X/Reddit/Facebook or the like instead of Bluesky the researcher would have immediately found himself on the wrong end of a DMCA takedown request and maybe even a lawsuit.

A lawsuit such as hiQ Labs v. LinkedIn?

The one that concluded scraping publicly accessible data was not a violation of the CFAA, after which Twitter et al. went logged-in-users-only?

I don't know if copyright comes into this. While social media terms of service are very clear about licensing the comments of individual users, that's to protect them from what the law doesn't say implicitly. Is every comment posted to Twitter or Bluesky a "literary work"? An original, creative expression? I have my doubts, but I guess there's room for a lawsuit yet.


I didn't see a lot of asking in that thread. They should probably organize and sue if they want to clarify what is legal. As it is, the bulk of the comments I read are denigrating and insulting:

> Your job is making the world a worse place

> Hey man, fuck you

> fuck you nerd

> Take your AI horseshit out of here you horrible piece of human trash.

> have you considered killing yourself

> This is garbage, you're a thief, and you disgust me.

> Throw yourself off a building [thumbs up emoji]

> Somebody cyberbully this guy


See also the comments where Daniel apologizes. Bluesky hasn't changed anything about the shape of Twitter and the kind of harassment that occurs. A mob of people has some vague enemy (AI bros) and piles on the first target of opportunity, directing all their "kill yourself" energy at one individual at a time.

https://bsky.app/profile/danielvanstrien.bsky.social/post/3l...


The hate doesn't prevent the research taking place, but ensures it won't occur publicly where it's more useful.

There is literally no one that can be pleased anymore.

try drawing a straight line through 8+ billion scattered dots.

Bluesky themselves have made it abundantly clear they're with the artists on this one. The firehose isn't a license to use someone's art for commercial software development.

Is that worse than Meta, X or Google hoarding it up for themselves?

Or is it worse than Reddit selling to anyone with money?


Maybe someone is upset that they can’t get competitive advantage by throwing money.

Maybe it's just me, or I might be missing a relevant implication, but I'm having a hard time understanding why so many people have become alarmist about the fact that things they publish on the web can and will be scraped?

Might just be one of these: https://en.wikipedia.org/wiki/Availability_cascade

I do find PG's idea of "aggressively conventional-minded" people to be a useful concept: https://paulgraham.com/conformism.html


It seems to be mainly a reaction against AI (as opposed to scraping in-general, e.g. for a search engine).

I'm not saying it makes sense, but there is a large and growing idea of: I want my content out in the world, but I don't want companies to use it for training AIs, especially for profit.


I don't see how Bluesky is sustainable. Who's paying for hosting the main instance of Bluesky? Who's paying for the firehose? How long will this last?

>We believe that there must be better strategies to sustain social networks that don’t require selling user data for ads. Our first step in another direction is paid services, and we’re starting with custom domains...

>...

>We’re partnering with Namecheap, a popular domain registrar, to offer a service for easy domain purchasing and management.

https://bsky.social/about/blog/7-05-2023-business-plan



So that's an investor, but it's not a revenue or profit model. The article notes:

> The future for Bluesky includes expanding their go-to-market efforts, building out their product roadmap with new features like subscription-based profile customizations, and further engaging with their growing developer community, all while maintaining their commitment to a free and accessible platform.

Is "subscription-based profile customizations" a sufficient revenue model? The investor will at some point want a return.


It’s a solid business model just like the Underpants Gnomes from South Park…

1. Build Twitter clone

2. ???

3. Profit


People got angry on Bluesky and said it has to be forbidden, but you either want open or closed. If closed, your data is still used to train AI, but the owner of the network is the one making money from it. Better if anyone can get it; it's not like it's stoppable anymore anyway.

There's a lot that could be said for the behavior of Bluesky users and moderators, but aside from the practical matter that anyone can scrape your data, there seems to be some confusion on users' part about what it means that they have "ownership" of their data. You license it to Bluesky, but in the absence of other licensing agreements (or as Gen Z likes to say, "communication of consent preferences"), is there no way to prevent what you do in public from being absorbed by the facehuggers and monetized thereafter?

Does Bluesky's decentralised nature make it hard/impossible to apply a bot blocker^ like cloudflare?

^ More accurately, cloudflare is a bot-slower, as it (and services like it) makes it slower and more costly to scrape data, but not impossible.


Each relay could implement bot rejection as a means of saving bandwidth, but the whole point of the architecture is that the firehose can be mirrored.

That way they can claim it's not just them scarfing up your data to resell, it's a design feature, in case you try to sue. For the betterment of humanity and such.

Good. Just like the open web.

Does bsky have an actual plan for how they'll make money long term?

The users are all in a honeymoon phase about how it's so different from twitter, but it seems like it's only a matter of time until they're directly selling data for AI training, offering paid corporate accounts, intrusive ads, etc. In the end, I imagine it'll end up just like twitter but with a different CEO. I use it simply because lots of others who I follow migrated, but I'm under no delusion that it'll be any different long term.


Long long term is anyone's guess, but I think if the team stays mission driven and doesn't get distracted they won't have any problems with money. The company is still owned by employees, and the entire team is just 20 people. They don't seem to have a bloated stack or unnecessary features. They got a sizable initial war chest of $14 million from Twitter. When that runs out I bet adding just a "donate" button somewhere on the site will be enough to cover expenses.

14 million pays for a couple salaries and a few server bills. That burns away fast as the user count climbs higher and higher into the tens of millions.

Being mission driven and not getting distracted is nice, but that doesn't put money in their pockets. Firefox and other products that are/were beloved by a similar crowd have been asking for donations for years and it hasn't worked out. Wikipedia seems like the rare exception that pulls it off, and that's because they're aggressive with their campaigning for cash and they're basically an indispensable public service at this point.


Already raised another 15 million from Blockchain Capital. Not a donation, but a Series A. They could have walked the nonprofit donation route like Signal but they dance with VCs instead.

https://www.blockchaincapital.com/blog/bluesky-13m-users-and...


There's a statement here (July 2023) with some info,

https://bsky.social/about/blog/7-05-2023-business-plan

but selling domain services doesn't seem like it will go very far. I've seen some other rumors about paid accounts for extra features (posting longer or high-quality videos, etc).

Does anyone know if being a "public benefit corporation" is significant? Or will the same monetary pressures build up?


Well they recently published the AT protocol that they use.

It’s an open protocol, so they’re going for some kind of community angle maybe?

They could probably sell instances like mastodon, but I don’t think that will scale how they probably want.

I think they will eventually find a way to either monetize the data or add an ad extension to AT proto.


2 million bluesky posts if you want to use it for something:

https://x.com/AlpinDale/status/1861819574259192082


That's a very depressing and unpleasant-looking dataset. I don't know if that reflects more on Bluesky, or on the "uniform random sampling" aspect.

If we sampled HN data randomly, would *we* look that bad?


A book's open text interface means anyone can scrape your data for AI training.

An open and accessible API is good actually. "Bad" actors will always find a way around closed/restricted/no API access.

What's the point of "enable[ing] users to communicate their consent preferences", when none of the entities who receive those preferences are under any obligation or restraint to respect them?

This is a "close elevator" button. It's a placebo button your users can press that can make them feel—without any basis in reality—more safe, more private, more $whatever. It's deceptive. An ethical company should get rid of those no-op preferences settings altogether.


I think the goal is to enable users to avoid content they don’t wish to see, not prevent content from going to specific places.

No, in this case they are talking about consent about where their public posts are scraped to. I was quoting and replying to this part:

- "Bluesky said that it’s looking at ways to enable users to communicate their consent preferences externally, though it’s up to those parties whether they respect those preferences. The company posted: “Bluesky won’t be able to enforce this consent outside of our systems. It will be up to outside developers to respect these settings. We’re having ongoing conversations with engineers & lawyers and we hope to have more updates to share on this shortly!”"


Yes, this particular feature isn’t effective at stopping content from going somewhere unwanted, but I think the main design goal of bsky and the AT protocol isn’t that.

HF wouldn't exist if it was an ethical company.

says some 60+ npr guy who thinks AI wants his watermarked digital slr pic of a sunset so it can profit from his creative genius.

oh no :scream:

As soon as someone figures out how to DMCA the AI industry the whole industry will become enshittified. It all relies on copyright infringement and “generative” AI doesn’t generate as much as it originally promised, it’s more like an extremely advanced search engine with the ability to combine and edit the source data.

An analogy is music producers who sample other tracks, who most definitely have to pay royalties.

If it was as easy to detect trained AI data as it is to detect a music video or movie in a YouTube video, every AI company would be toast.

You’d basically have a category of application whose cost is so high it’s hard to justify. You’ve gotta run the world’s most expensive type of computing to do the AI training and you have to license a massive amount of work from copyright owners for it to have any use.

To swing this back to being more related to the article at hand, I think that being open to the public can be okay if BlueSky or people who publish content on the platform are better able to exert their rights under copyright law. When I post something online I shouldn’t be giving up my copyright rights just because it’s hard to enforce.

If there was a law that was truly progressive about online privacy it would protect individuals’ intellectual property rights more on social networks. A social media company shouldn’t magically get to own my content just because they said so in their EULA.


That's not how the scaling laws work. The number of samples required to reach a given quality level reduces exponentially over time. Most researchers use small datasets.

Interesting, because in The New York Times' lawsuit there is a very large block of text repeated verbatim. Page 30: https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...

How much of a copyrighted work do I have to copy and reproduce/redistribute to violate copyright law? Am I allowed to sell my handheld recording of the last two minutes of Gladiator 2 for $1.99 at the flea market?


You don't have to give your data to social media companies, you know. That's just part of the trade off in using their apps.

Me making a comment on social media: anyone can redistribute it, they could even charge money to read my comment if they'd like.

Disney playing Frozen 2 in the airplane in-flight entertainment system: I'm not allowed to copy, reproduce, distribute, or disseminate it.

See the double standard here?


A lot of creative professionals work on commission. Social media is basically a requirement if you want to make money in certain artistic spaces. If you can't show your work in public you won't get jobs. Catch 22.


