AI Scraping Bots Are Breaking Open Libraries, Archives, and Museums

sabreW4K3@lazysoci.al · 1 day ago

LandedGentry@lemmy.zip · edit-2 18 hours ago

Certainly far less objectionable than taking down public resources, though there’s more to it than that - again, it puts the onus on everyone else to protect themselves from companies that are essentially acting like malicious hackers, companies that should be the ones responsible for not tearing down public resources. But I don’t really get what you’re trying to prove, because your proposal is not what they’re doing. They’re just doing whatever the fuck they want and don’t care who it impacts. They never do.

I don’t feel like this is very complicated. I’m not allowed to block public roads with my car. I’m not allowed to cut the power to a library and bar the doors. You can’t just deny people public resources like that as a private entity, unless of course you are an AI slop company, in which case states literally aren’t even allowed to make rules about you for the next decade due to our corrupt commander-in-chief. These AI companies are allowed to steamroll any private or public entity they want so long as they convince the right people that they will make them a lot of money. It is wildly unethical and the fact that I have to spend so much time convincing you they deserve a little more scrutiny is kind of baffling.

Aaron Swartz didn’t do anything like the above and your insistence that he is somehow critical to proving some perceived hypocrisy or inconsistency on my part is…well, I’m not sure what the word is, but it’s just not accurate at all.

FaceDeer@fedia.io · 18 hours ago

That suggestion is exactly the same as what I started with when I said “IMO the ideal solution would be the one Wikimedia uses, which is to make the information available in an easily-downloadable archive file.” It just cuts out the Aaron-Swartz-style external middleman, so it’s easier and more efficient to create the downloadable data.

LandedGentry@lemmy.zip · edit-2 18 hours ago

I have said it twice already, but I will do it a third time:

It is not right to expect everyone else to accommodate private, venture capital fueled AI companies. This is their problem, they are the ones who have to train their models, so they are the ones who have to figure out how to get the data without fucking everyone else in the process. They are not entitled to breaking everything and going “whoopsies!”

I don’t understand why the burden is on the victims here. You are telling libraries et al that it is their responsibility to keep corporations from breaking into their homes, scattering everything everywhere, and forcing them to clean it up themselves.

FaceDeer@fedia.io · 17 hours ago

“I don’t understand why the burden is on the victims here.”

They put the website up. Load balancing, rate limiting, and such go with the turf. It’s their responsibility to make the site easy to use and hard to break. Putting up an archive of the content that the scrapers want is an easy and straightforward thing to do to accomplish this goal.

I think what’s really going on here is that your concern isn’t about ensuring that the site is up, and it’s certainly not about ensuring that the data it’s providing is readily available. It’s that there are these specific companies you don’t like and you just want to forbid them from accessing otherwise freely accessible data.

LandedGentry@lemmy.zip · edit-2 16 hours ago

That is absolutely ridiculous. The pressure AI scraping puts on sites vastly outstrips anything people built for, as evidenced by the fact that the systems are going down. Building out for that kind of assault costs a lot of money and time. I’m honestly wondering if you understand what it takes.

You know these systems are underfunded. You know these people are underpaid and underappreciated. This is absolutely ass backwards and I don’t understand why you’re defending these companies that are getting unholy amounts of money to inflict this upon our public resources. Do you know what happens if I do this to websites? It’s called a DDoS attack and I get a visit from the feds.

And for the record, I am all about the data being readily available. Readily available to all of us. I don’t care if AI companies use the data or acquire it, but if it comes at the cost of our access, then yes I am opposed. You should be too! Yet here you are misrepresenting the situation and drawing incredibly crooked parallels.

AI evangelists are all the same, they can’t see beyond the religion they’ve built. Anyone who remotely questions the process these companies feel entitled to is branded a Luddite and shouted down. This can’t be emphasized enough. Do not buy into their cult.

Marketing firms have to pay for surveys or data acquired by other organizations. Political analysts have to pay for polling data or pay to conduct their own.

Let me ask you this: Why are AI companies special? Why do they get to take out my public library without warning and profit off of the act? You keep trying to reestablish what I am saying, so how about you actually express what you believe.

FaceDeer@fedia.io · 15 hours ago

“That is absolutely ridiculous. The pressure AI scraping puts on sites vastly outstrips anything people built for, as evidenced by the fact that the systems are going down.”

Yes. Which is why I’m suggesting providing an approach that doesn’t require scraping the site.

LandedGentry@lemmy.zip · edit-2 15 hours ago

At this point I’m tired of you ignoring whole swaths of what I’m writing. I responded to that particular thing several times. I made it a point to tell you I was doing it for the third time. Clearly you are not reading what I am writing. You are skimming it at best.

You have tunnel vision on this issue. It’s sad that you’re on the wrong side of it, but at this point I’ve wasted enough of my time. Have a good one dude

FaceDeer@fedia.io · 15 hours ago

Perhaps be more succinct? You’re really flooding the zone here.

“You have tunnel vision on this issue.”

No, I’m staying focused.

LandedGentry@lemmy.zip · edit-2 1 hour ago

deleted by creator

Hedgeknife@beehaw.org · 10 hours ago

“be more succinct”?

maybe have AI summarize it for you 🙄