

Cloudflare, news publishers claim AI firm bypassed anti-scraping protocols
In a developing controversy within the generative AI industry, the search startup Perplexity AI has come under fire for allegedly scraping content from websites that explicitly prohibit such activity. Reports from TechCrunch, PCMag, and other outlets suggest that Perplexity's bots may have bypassed standard anti-scraping measures such as robots.txt directives and employed stealth techniques to extract data for its AI-generated responses.
The allegations, if proven, could have major implications for the ethics and legality of data acquisition practices in the competitive landscape of AI search engines.
Cloudflare Speaks Out
One of the most prominent voices in the controversy is Cloudflare, a web infrastructure and security company that provides protection and performance services to websites worldwide. According to a statement from Cloudflare spokesperson Carolyn Vadino, Perplexity's AI tools circumvented website restrictions using methods “reminiscent of tactics used by advanced persistent threat groups, like those from North Korea.”
Cloudflare accused Perplexity of masking the identity of its scrapers, making it harder for website operators to detect or block them. Cloudflare's report states that Perplexity used Amazon Web Services (AWS) infrastructure to reroute its requests, avoiding identification by IP address or bot behavior and effectively bypassing site restrictions intended to limit AI scraping.
Publisher Backlash Grows
In addition to Cloudflare's claims, multiple online publishers, including prominent news organizations, have raised concerns about Perplexity sourcing their content without permission. The crux of the issue lies in whether Perplexity adequately attributes and respects copyright policies when displaying AI-generated summaries of articles originally published by others.
“Publishers are putting up walls for a reason. If those are being knocked down by AI models, it sets a dangerous precedent,” said an unnamed executive quoted by TechCrunch.
While Perplexity’s user interface often includes citations and sources, critics argue that it still lets users skip visiting the original websites, reducing referral traffic and undermining publishers' monetization models.
Perplexity’s Response
In response to the allegations, Perplexity co-founder Aravind Srinivas denied intentional wrongdoing. “We’re actively refining our systems and complying with industry standards,” he said in a statement. Srinivas emphasized that Perplexity strives to be “a responsible AI platform” and is willing to collaborate with publishers to address concerns.
However, he stopped short of admitting that the startup had knowingly circumvented protections like robots.txt, which webmasters use to signal that certain parts of their sites should not be crawled or indexed by bots.
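To illustrate how the robots.txt convention works in practice, the short sketch below uses Python's standard-library robots.txt parser. The specific rules and the "PerplexityBot" user-agent string shown here are hypothetical examples for illustration, not Perplexity's actual configuration or any publisher's real policy; the point is simply that a compliant crawler consults these rules before fetching a page, while the alleged behavior would amount to ignoring them.

```python
# Illustrative sketch of robots.txt compliance checking.
# The rules and bot names below are hypothetical examples.
from urllib.robotparser import RobotFileParser

# A typical robots.txt that blocks one AI crawler entirely
# and keeps all bots out of a private section.
robots_txt = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved crawler asks before every request:
print(parser.can_fetch("PerplexityBot", "https://example.com/article"))   # blocked
print(parser.can_fetch("SomeOtherBot", "https://example.com/article"))    # allowed
print(parser.can_fetch("SomeOtherBot", "https://example.com/private/x"))  # blocked
```

Crucially, robots.txt is an honor system: nothing technically prevents a crawler from fetching disallowed pages, which is why the allegations center on whether Perplexity's bots deliberately ignored or evaded these signals.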
Legal and Ethical Implications
The broader debate centers on whether AI companies are ethically or legally entitled to train their models on publicly accessible—but not necessarily opt-in—web data. While U.S. law remains murky on the matter, ongoing lawsuits involving companies like OpenAI and The New York Times show the growing tension between data use and copyright.
Web scraping itself is not inherently illegal. However, bypassing technical barriers such as robots.txt, authentication, or geo-fencing measures can raise legal concerns under the Computer Fraud and Abuse Act (CFAA) or other data protection laws.
Cloudflare’s involvement escalates the issue by framing the actions not merely as competitive overreach but as a security risk that could invite regulatory scrutiny.
Industry at a Crossroads
This controversy comes amid growing scrutiny of how large language models and AI agents are trained. As demand for generative AI accelerates, the need for large-scale, diverse datasets has led to increasing reliance on publicly available web content. But many publishers are pushing back, either by implementing technical safeguards or suing over copyright infringement.
With generative AI tools like Perplexity, ChatGPT, and Gemini becoming more widely adopted, the methods used to source training data are drawing more attention than ever.
“If AI models continue to extract and repurpose content from publishers without fair use agreements or licenses, the industry could face a major backlash,” said a legal analyst covering tech law.
Looking Ahead
While the long-term impact of these allegations remains uncertain, the incident highlights the urgent need for clearer legal frameworks, standardized AI data licensing practices, and industry self-regulation. It also puts companies like Perplexity in the spotlight, underscoring the delicate balance between innovation and ethical responsibility.
As more AI companies enter the space, the standards they set now will likely shape the future of content ownership, access, and attribution in the digital economy.