Meta admits using pirated books to train AI, but won't pay for it ( www.techspot.com )

mindbleach , 1 day ago

I don't care if the robot that speaks English read the entire library.

How else was it going to happen?

MylesRyden , 1 day ago

@Flatworm7591

And yet, I can't read a book that Internet Archive actually owns a copy of.

people_are_cute , 1 day ago

It'd be better if they went after literally every other AI corp than Meta in this case. Meta is the only one that's ironically releasing open-source models and leading the way for open-source LLMs. I don't want Meta to stop doing this.

MonkderDritte , 2 days ago

Of course, why would you pay for pirated media?

foremanguy92_ , 2 days ago

Meta trains open LLMs, and only big tech companies can train AI... Go pursue OpenAI or Google and let Meta (I'm really not a fan of Meta, but their "open" AIs are great examples of good work) do its work!

Th4tGuyII , 2 days ago

The Internet Archive is currently fighting in the courts to maintain free digital library access to over 500,000 books they own from their own collection, yet Meta uses a pirated dataset of nearly 200,000 books to train their proprietary AI and is just allowed to get away with that??

Publishers will go after a charity making fair use of their content, but not the corporation outright stealing from them. What utter bollocks.

MonkderDritte , 2 days ago

IA is the easier target. This system sucks.

chahk , 1 day ago

Easy solution. "The Internet Archive" should rebrand itself to "Archiving the Internet" to confuse everyone who talks about how "AI" should be able to steal books.

k110111 , 1 day ago

Harward: get this man over here!

0x0 , 1 day ago

MlT (MlT): please accept this honorary PhD

umbrella , 1 day ago (edited 6 hours ago)

piracy is the correct and moral thing to do here

if they don't give a fuck, they don't have the moral high ground to guilt-trip us into stopping it

Marin_Rider , 2 days ago

I just asked it about this and it denied it. Then I said, "Meta acknowledged it and you are lying," and it apologised and said it did use copyrighted material without permission. Fuck, I hate AI.

princessnorah , 2 days ago

https://lemmy.blahaj.zone/pictrs/image/44655a77-cf8b-4736-bd01-48063f369931.jpeg

For anyone else that was curious. This makes me feel sick. People are already treating AI as some unbiased font of all knowledge, training it to lie to people is surely not going to cause any issues at all (stares at HAL 9000).

ReversalHatchery , 2 days ago

> I apologize for the confusion

> Meta is working to address these concerns

Sure, they are working to solve these concerns by teaching their LLM to lie and obfuscate, and by becoming so big nobody sues them anymore. I'm sick of this.

Marin_Rider , 2 days ago

wow that is almost word for word what it wrote back to me too

princessnorah , 1 day ago

Yeah, I tried to use similar phrasing to you in case it jailbroke it at all. Creepy af

dev_null , 1 day ago

Internal documents on how the AI was trained were obviously not part of the training data; why would they be? So it doesn't know how it was trained and, as this tech always does, it just hallucinates an English-sounding answer. It's not "lying", it's just glorified autocomplete.
Saying things like "it's lying" is overselling what it is. Like any other thing that doesn't work, it isn't malicious; it just sucks.

princessnorah , 1 day ago

My car doesn't talk like a human. If you want to be technical, then it's proxying lies it was taught too.

dev_null , 1 day ago

Sure, then it's Meta that's lying. Saying the AI is lying is helping these corporations convince people that these models have any intent or agency in what they generate.

princessnorah , 1 day ago

And the bot, as an extension of its corporate overlords' wishes, is telling a mistruth. It is lying because it was made to lie. I am specifically saying that it lacks intent and agency; it is nothing but a slave to its masters. That is what concerns me.

veniasilente , 2 days ago

If Meta can pirate stuff, then the Internet Archive can pirate stuff and I can also pirate stuff. Fair is fair.

MindTraveller , 2 days ago

Ah, common mistake. The law is only for poor people, you see. Don't you feel silly now?

veniasilente , 14 hours ago

I feel so silly that I wouldn't even know how to describe it.

I know! I'll pirate hundreds of books from well-known authors so that I can easily find a useful metaphor.

archchan , 2 days ago

So the evil mega corp gets a free pass while the Internet Archive regularly has to fight for open access to knowledge. Fuck that and fuck Meta.

derpgon , 2 days ago

Welcome to Capitalism, please leave your cash by the front desk, and remember, no refunds!

Facebones , 2 days ago

Duh

Barzaria , 2 days ago

I'm no fan of megacorps, and I definitely know that they are breaking the law. However, copyright laws should change so that any schmuck can use any text to train any AI. I'm all for punishing mega corporations and I understand that they play by their own set of rules (that is unfair), but piracy is piracy even when mega corporations do it and I believe that piracy is the moral choice. Meta then choosing to make their model not fully open I definitely have a problem with and that does not meet my bar for okay, but I strongly believe that all information for all people or entities should be free to transfer without restriction.

chahk , 2 days ago

Agreed about changing the copyright law.

Until that happens though, they must not be allowed to have it both ways - call us "pirates" when we copy their shit without paying for it, and tell us that paying for shit they copy is "impossible".

Barzaria , 1 day ago

Indeed, completely agree. In this case they are the pirates.

people_are_cute , 1 day ago

Meta's llama models are generally open. In fact Meta is the main megacorp that's driving open-source AI right now. Everyone else keeps their models proprietary.

zaknenou , 2 days ago

it's okay when the bourgeois does it

halm , 2 days ago

Yeah, but we're not looking at the root cause here. Their purpose is to train energy-gluttonous, error-prone "AI", even if experience teaches us that those ML models fuck up more often than confirmation bias allows.

"AI" is a bourgeois and Capitalist tool and, same as with cryptocurrency, we cannot dismantle the master's house with the master's tools. Fuck AI down the drain. Make things with your own minds, your own hands.

uriel238 , 2 days ago

It's even more okay when the bourgeoisie does it in the interest of potential profit gain.

MindTraveller , 2 days ago

Actually, lots of indie games use AI to control enemies and NPCs. Like Hades and Ori. I agree that LLMs are crap though.

GoodEye8 , 1 day ago

Genuinely not sure if joking or actually dumb.

MindTraveller , 1 day ago

I'm making a rhetorical point that LLMs aren't the only kind of AI, and neither is AI what you see in movies.

GoodEye8 , 1 day ago

So you're mixing up two different meanings of AI to say that AI doesn't mean the same thing everywhere? When people are talking about bats, the flying mammals, do you also interject with "bats are used to hit a ball" to make some point? No, because deliberately mixing up homonyms is stupid.

It's pretty clear what kind of AI people are talking about here. Nobody was discussing game AI.

msage , 1 day ago

I never understood why people called enemy bots 'AI' in games.

MindTraveller , 1 day ago

No, it's the same meaning. AI is a constructed agent that solves problems using intelligence.

GoodEye8 , 21 hours ago

Maybe in some very broad strokes, but in very broad strokes legs and cars are also the same because they move you from point A to point B.

pelespirit , 2 days ago

> Meta has acknowledged using parts of the Books3 dataset but argued that its use of copyrighted works to train LLMs did not require "consent, credit, or compensation." The company refutes claims of infringing the plaintiffs' "alleged" copyrights, contending that any unauthorized copies of copyrighted works in Books3 should be considered fair use.

> Furthermore, Meta is disputing the validity of maintaining the legal action as a Class Action lawsuit, refusing to provide any monetary "relief" to the suing authors or others involved in the Books3 controversy. The dataset, which includes copyrighted material sourced from the pirate site Bibliotik, was targeted in 2023 by the Danish anti-piracy group Rights Alliance, demanding that digital archiving of the Books3 dataset should be banned and is using DMCA notices to enforce those takedowns.

Yet they'll ~~spend~~ waste billions on metaverse.

SeaJ , 2 days ago

What sort of crack are they on that they think unauthorized use of an entire work for commercial gain is fair use? I think copyright laws are ridiculous, but that is a pretty low bar they are trying to set.

They should have to pay for their usage or retrain the model without it. Going to guess they would prefer to pay up.

FaceDeer , 2 days ago

Training an AI does not involve copying anything so why would you think that fair use is even a factor here? It's outside of copyright altogether. You can't copyright concepts.

Downloading pirated books to your computer does involve copyright violation, sure, but it's a violation by the uploader. And look at what community we're in, are we going to get all high and mighty about that?

pelespirit , 2 days ago

Do you think the corporations like my art and is it fair? Apparently it is if I run it through AI is what you're saying.

https://imgur.com/a/these-are-new-niki-mice-drawings-phone-company-chainsaws-merms-donut-logos-burger-mc-winfruit-computers-republunch-political-party-logos-Rhgi0OC

Why do you think that the AI companies want to hoover up everyone's art? Because it's valuable or they wouldn't take the risk of all of this backlash.

uriel238 , 2 days ago

Actually it does. It involves making use of a copy that is not the original. Fair use is about experiencing media for sake of dialog (criticism or parody) or for edification. That means someone is reading the book or watching the movie, or using it for transformative art or science.

AI training should qualify for fair use.

best_username_ever , 2 days ago

> a violation by the uploader

Most countries disagree with you. The standard is to sue both people, the one who sends and the one who receives.

Natanael , 1 day ago (edited 1 day ago)

Remember when media companies tried to sue switch manufacturers because their routers held copies of packets in RAM and argued they needed licensing for that?

https://www.eff.org/deeplinks/2006/06/yes-slashdotters-sira-really-bad

Training an AI can end up leaving copies of copyrightable segments of the originals, look up sample recover attacks. If it had worked as advertised then it would be transformative derivative works with fair use protection, but in reality it often doesn't work that way

See also

https://curia.europa.eu/juris/liste.jsf?nat=or&mat=or&pcs=Oor&jur=C%2CT%2CF&for=&jge=&dates=&language=en&pro=&cit=none%252CC%252CCJ%252CR%252C2008E%252C%252C%252C%252C%252C%252C%252C%252C%252C%252Ctrue%252Cfalse%252Cfalse&oqp=&td=%3BALL&avg=&lgrec=en&parties=Football%2BAssociation%2BPremier%2BLeague&lg=&page=1&cid=10711513

FaceDeer , 1 day ago

Remember when piracy communities thought that the media companies were wrong to sue switch manufacturers because of that?

It baffles me that there's such an anti-AI sentiment going around that it would cause even folks here to go "you know, maybe those litigious copyright cartels had the right idea after all."

We should be cheering that we've got Meta on the side of fair use for once.

> look up sample recover attacks.

Look up "overfitting." It's a flaw in generative AI training that modern AI trainers have done a great deal to resolve, and even in the cases of overfitting it's not all of the training data that gets "memorized." Only the stuff that got hammered into the AI thousands of times in error.

Natanael , 1 day ago

Yes, but should big companies with business models designed to be exploitative be allowed to act hypocritically?

My problem isn't with ML as such, or with learning over such large sets of works, etc., but these companies are designing their services specifically to push the people whose works they rely on out of work.

The irony of overfitting is that having numerous copies of common works is a problem AND removing the duplicates would be a problem too. They need an understanding of what's representative for language, etc., but the training algorithms can't learn that on their own, it's not feasible to have humans teach it that, and the training algorithm can't effectively detect duplicates and "tune down" their influence to stop replicating them exactly. Also, trying to do that latter thing algorithmically will ALSO break things, as it would break its understanding of stuff like standard legalese and boilerplate language, etc.

The current generation of generative ML doesn't do what it says on the box, AND the companies running them deserve to get screwed over.

And yes, I understand the risk of screwing up fair use, which is why my suggestion is not to hinder learning, but to require the companies to track copyright status of samples and inform end users of licensing status when the system detects a sample is substantially replicated in the output. This will not hurt anybody training on public domain or fairly licensed works, nor hurt anybody who tracks authorship when crawling for samples, and will also not hurt anybody who has designed their ML system to be sufficiently transformative that it never replicates copyrighted samples. It just hurts exploitative companies.
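
For illustration only, here is a minimal sketch of what such a replication check could look like. It assumes each training sample is stored with a licensing label and treats "substantially replicated" as a high share of the sample's word 5-grams reappearing verbatim in the output. The `Sample` type, the `license_status` field, the n-gram size and the threshold are all hypothetical choices, not anything Meta or any other vendor is known to use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical training sample carrying tracked provenance."""
    title: str
    license_status: str  # e.g. "public-domain" or "all-rights-reserved"
    text: str


def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams found in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def replicated_samples(output: str, samples: list[Sample], n: int = 5,
                       threshold: float = 0.5) -> list[tuple[Sample, float]]:
    """Flag samples whose n-grams mostly reappear verbatim in output."""
    out_grams = word_ngrams(output, n)
    flagged = []
    for sample in samples:
        grams = word_ngrams(sample.text, n)
        if not grams:
            continue  # sample has fewer than n words; nothing to compare
        overlap = len(grams & out_grams) / len(grams)
        if overlap >= threshold:
            flagged.append((sample, overlap))
    return flagged


samples = [
    Sample("Toy novel", "all-rights-reserved",
           "the quick brown fox jumps over the lazy dog near the quiet river bank"),
    Sample("Toy essay", "public-domain",
           "copyright law was never meant to be a law of physics at all"),
]
output = ("As the story goes, the quick brown fox jumps over the lazy dog "
          "near the quiet river bank today.")

flagged = replicated_samples(output, samples)
for sample, overlap in flagged:
    print(f"{sample.title}: {overlap:.0%} replicated, license: {sample.license_status}")
```

A check like this only catches verbatim reuse of short passages; paraphrase, or a small excerpt from a very long sample, would slip under the threshold.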

FaceDeer , 1 day ago

There actually isn't a downside to de-duplicating data sets, overfitting is simply a flaw. Generative models aren't supposed to "memorize" stuff - if you really want a copy of an existing picture there are far easier and more reliable ways to accomplish that than giant GPU server farms. These models don't derive any benefit from drilling on the same subset of data over and over. It makes them less creative.
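
As a rough sketch of the simplest form of de-duplication (not a description of any company's actual pipeline), the snippet below drops documents that are identical after lowercasing and collapsing whitespace. The `normalize` and `deduplicate` helpers are hypothetical names; catching near-duplicates would need fuzzier matching, for example MinHash-style similarity, which this does not attempt.

```python
import hashlib


def normalize(text: str) -> str:
    """Collapse case and whitespace so trivially different copies hash alike."""
    return " ".join(text.lower().split())


def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, dropping normalized duplicates."""
    seen: set[str] = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


corpus = [
    "A cat sat on the mat.",
    "a cat  sat on the mat.",  # same text, different case and spacing
    "A different cat slept on the sofa.",
]
deduped = deduplicate(corpus)
print(deduped)
```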

I want to normalize the notion that copyright isn't an all-powerful fundamental law of physics like so many people seem to assume these days, and if I can get big companies like Meta to throw their resources behind me in that argument then all the better.

Natanael , 21 hours ago

Humans learn a lot through repetition, no reason to believe that LLMs wouldn't benefit from reinforcement of higher quality information. Especially because seeing the same information in different contexts helps mapping the links between the different contexts. But like I said, the only viable method they have for this kind of emphasis at scale is incidental replication of more popular works in its samples. And when something is duplicated too much it overfits instead.

They need to fundamentally change big parts of how learning happens and how the algorithm learns to fix this conflict. In particular it will need a lot more "introspective" training stages to refine what it has learned, and pretty much nobody does anything even slightly similar on large models because they don't know how, and it would be insanely expensive anyway.

FaceDeer , 20 hours ago

> Especially because seeing the same information in different contexts helps mapping the links between the different contexts and helps dispel incorrect assumptions.

Yes, but this is exactly the point of deduplication - you don't want identical inputs, you want variety. If you want the AI to understand the concept of cats you don't keep showing it the same picture of a cat over and over, all that tells it is that you want exactly that picture. You show it a whole bunch of different pictures whose only commonality is that there's a cat in it, and then the AI can figure out what "cat" means.

> They need to fundamentally change big parts of how learning happens and how the algorithm learns to fix this conflict.

Why do you think this?

steal_your_face , 2 days ago

Wonder how they feel about someone else using scraped Facebook posts to train an LLM

uriel238 , 2 days ago

Cranky enough to demand satisfaction (in the courts if not the dueling field), but no one in the company will think their own ire warrants empathy for those from whom they pirate.

sunzu , 2 days ago

Ain't it great!!!

The law protects mega corps but not peasants.

Flatworm7591 OP Mod , 2 days ago

Aye exactly mate, down with the mega corps!

onlinepersona , 2 days ago

In I2P we trust 🙏 Can't sue what you can't find.

Anti Commercial-AI license

sunzu , 2 days ago

wtf is i2p

refurbishedrefurbisher , 2 days ago

It's like Tor, but different.

sunzu , 2 days ago

go on...

refurbishedrefurbisher , 2 days ago

Well it wasn't made by the US Navy, it doesn't allow for clearnet traffic, it allows torrenting over the protocol. I'm sure there are other differences too.

gwen , 2 days ago

tor was made by the us navy???

refurbishedrefurbisher , 1 day ago

Originally, yes. It was made to help people in countries with censorship get around censorship.

Nowadays it's maintained by the Tor Project.

Andromxda , 1 day ago

!i2p

HouseWolf , 2 days ago

Rules for thee but not for me

YourPrivatHater , 2 days ago

Who doesn't?
