AI companies are reportedly still scraping websites despite protocols meant to block them


Perplexity, which describes its product as a “free artificial intelligence search engine”, has come under fire in the past few days. Shortly after Forbes accused it of stealing its story and republishing it across multiple platforms, Wired reported that Perplexity has been ignoring the Robots Exclusion Protocol, or robots.txt, and scraping its website and those of other Condé Nast publications. Technology site The Shortcut also accused the company of scraping its articles. Now, Reuters has reported that Perplexity is not the only AI company bypassing robots.txt files and scraping websites to obtain content used to train their technologies.

Reuters said it has seen a letter to publishers from TollBit, a company that connects them with AI firms so they can reach licensing deals, warning them that “AI agents from multiple sources (not just one company) are opting to bypass the robots.txt protocol to retrieve content from sites.” The robots.txt file contains instructions for web crawlers about which pages they can and cannot visit. Web developers have been using the protocol since 1994, but compliance is entirely voluntary.
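
To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard urllib.robotparser module. The robots.txt contents, the crawler names ExampleAIBot and OtherBot, and the example.com URLs are all invented for illustration; they are not taken from any publisher or AI company mentioned here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for this illustration.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# ExampleAIBot is barred from the whole site.
print(may_fetch(parser, "ExampleAIBot", "https://example.com/news/story"))
# Any other crawler falls under the "*" rules and may read public pages...
print(may_fetch(parser, "OtherBot", "https://example.com/news/story"))
# ...but not anything under /private/.
print(may_fetch(parser, "OtherBot", "https://example.com/private/draft"))
```

The check runs entirely on the crawler's side. Nothing in the protocol stops a crawler from skipping it, or from identifying itself under a different user agent, which is why compliance comes down to good faith.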

TollBit’s letter names no company, but Business Insider says it has learned that OpenAI and Anthropic, creators of the chatbots ChatGPT and Claude, respectively, are also bypassing robots.txt signals. Both companies have previously stated that they respect the “do not crawl” instructions that websites place in their robots.txt files.

During its investigation, Wired discovered that a machine on an Amazon server “certainly operated by Perplexity” was bypassing its website’s robots.txt instructions. To confirm whether Perplexity was scraping its content, Wired gave the company’s tool headlines from its articles or short prompts describing its stories. The tool reportedly produced results that closely paraphrased the articles “with minimal attribution.” At times it even generated inaccurate summaries of the stories: Wired says that in one instance the chatbot falsely claimed the publication had reported that a specific California police officer committed a crime.

In an interview with Fast Company, Perplexity CEO Aravind Srinivas told the publication that his company “is not ignoring the Robot Exclusions Protocol and then lying about it.” That does not mean, however, that it isn’t benefiting from crawlers that do ignore the protocol. Srinivas explained that the company uses third-party web crawlers in addition to its own, and that the crawler Wired identified was one of them. When Fast Company asked whether Perplexity had told the crawler provider to stop scraping Wired’s website, he only replied, “it’s complicated.”

Srinivas defended his company’s practices, telling the publication that the Robots Exclusion Protocol is “not a legal framework” and suggesting that publishers and companies like his may have to forge a new kind of relationship. He also reportedly insinuated that Wired deliberately used prompts designed to make Perplexity’s chatbot behave the way it did, so ordinary users would not get the same results. As for the inaccurate summaries produced by the tool, Srinivas said, “We have never said that we have never hallucinated.”


