- Регистрация
- 17 Февраль 2018
- Сообщения
- 39 339
- Лучшие ответы
- 0
- Реакции
- 0
- Баллы
- 2 093
Offline
Scraper accused of stealing Reddit content “shocked” by lawsuit.
Credit: SOPA Images / Contributor | LightRocket
In a lawsuit filed on Wednesday, Reddit accused an AI search engine, Perplexity, of conspiring with several companies to illegally scrape Reddit content from Google search results, allegedly dodging anti-scraping methods that require substantial investments from both Google and Reddit.
Reddit alleged that Perplexity feeds off Reddit and Google, claiming to be “the world’s first answer engine” but really doing “nothing groundbreaking.”
“Its answer engine simply uses a different company’s” large language model “to parse through a massive number of Google search results to see if it can answer a user’s question based on those results,” the lawsuit said. “But Perplexity can only run its ‘answer engine’ by wrongfully accessing and scraping Reddit content appearing in Google’s own search results from Google’s own search engine.”
Likening companies involved in the alleged conspiracy to “bank robbers,” Reddit claimed it caught Perplexity “red-handed” stealing content that its “answer engine” should not have had access to.
Baiting Perplexity with “the digital equivalent of marked bills,” Reddit tested out posting content that could only be found in Google search engine results pages (SERPs) and “within hours, queries to Perplexity’s ‘answer engine’ produced the contents of that test post.”
“The only way that Perplexity could have obtained that Reddit content and then used it in its ‘answer engine’ is if it and/or its Co-Defendants scraped Google SERPs for that Reddit content and Perplexity then quickly incorporated that data into its answer engine,” Reddit’s lawsuit said.
In a Reddit post, Perplexity denied any wrongdoing, describing its answer engine as summarizing Reddit discussions and citing Reddit threads in answers, just like anyone who shares links or posts on Reddit might do. Perplexity suggested that Reddit was attacking the open Internet by trying to extort licensing fees for Reddit content, despite knowing that Perplexity doesn’t train foundational models. Reddit’s endgame, Perplexity alleged, was to use the Perplexity lawsuit as a “show of force in Reddit’s training data negotiations with Google and OpenAI.”
“We won’t be extorted, and we won’t help Reddit extort Google, even if they’re our (huge) competitor,” Perplexity wrote. “Perplexity will play fair, but we won’t cave. And we won’t let bigger companies use us in shell games. ”
Reddit likely anticipated Perplexity’s defense of the “open Internet,” noting in its complaint that “Reddit’s current Robots Exclusion Protocol file (‘robots.txt’) says, ‘Reddit believes in an open Internet, but not the misuse of public content.'”
Google reveals how scrapers steal from search results
To block scraping, Reddit uses various measures, such as “registered user-identification limits, IP-rate limits, captcha bot protection, and anomaly-detection tools,” the complaint said.
Similarly, Google relies on “anti-scraping systems and teams dedicated to preventing unauthorized access to its products and services,” Reddit said, noting Google prohibits “unauthorized automated access” to its SERPs.
To back its claims, Reddit subpoenaed Google to find out more about how the search giant blocks AI scrapers from accessing content on SERPs. Google confirmed it relies on “a technological access control system called ‘SearchGuard,’ which is designed to prevent automated systems from accessing and obtaining wholesale search results and indexed data while allowing individual users—i.e., humans—access to Google’s search results, including results that feature Reddit data.”
“SearchGuard prevents unauthorized access to Google’s search data by imposing a barrier challenge that cannot be solved in the ordinary course by automated systems unless they take affirmative actions to circumvent the SearchGuard system,” Reddit’s complaint explained.
Bypassing these anti-scraping systems violates the Digital Millennium Copyright Act, Reddit alleged, as well as laws against unfair trade and unjust enrichment. Seemingly, Google’s SearchGuard may currently be the easiest to bypass for alleged conspirators who supposedly pivoted to looting Google SERPs after realizing they couldn’t access Reddit content directly on the platform.
Scrapers shocked by Reddit lawsuit
Reddit accused three companies of conspiring with Perplexity—”a Lithuanian data scraper” called Oxylabs UAB, “a former Russian botnet” known as AWMProxy, and SerpApi, a Texas company that sells services for scraping search engines.
Oxylabs “is explicit that its scraping service is meant to circumvent Google’s technological measures,” Reddit alleged, pointing to an Oxylabs’ website called “How to Scrape Google Search Results.”
SerpApi touts the same service, including some options to scrape SERPs at “ludicrous speeds.” To trick browsers, SerpApi’s fastest option uses “a server-swarm to hide from, avoid, or simply overwhelm by brute force effective measures Google has put in place to ward off automated access to search engine results,” Reddit alleged. SerpApi also allegedly provides users “with tips to reduce the chance of being blocked while web scraping, such as by sending ‘fake user-agent string,’ shifting IP addresses to avoid multiple requests from the same address, and using proxies ‘to make traffic look like regular user traffic’ and thereby ‘impersonate’ user traffic.”
According to Reddit, the three companies disguise “their web scrapers as regular people (among other techniques) to circumvent or bypass the security restrictions meant to stop them.” During a two-week span in July, they scraped “almost three billion” SERPs containing Reddit text, URLs, images, and videos, a subpoena requesting information from Google revealed.
Ars could not immediately reach AWMProxy for comment. However, the other companies were surprised by Reddit’s lawsuit, while vowing to defend their business models.
SerpApi’s spokesperson told Ars that Reddit did not notify the company before filing the lawsuit.
“We strongly disagree with Reddit’s allegations and intend to vigorously defend ourselves in court,” SerpApi’s spokesperson said. “In the eight years we’ve been in business, SerpApi has always operated on the right side of the law. As stated on our website, ‘The crawling and parsing of public data is protected by the First Amendment of the United States Constitution. We value freedom of speech tremendously.’”
Additionally, SerpAPI works “closely with our attorneys to ensure that our services comply with all applicable laws and fair use principles. SerpApi stands firmly behind its business model and conduct, and we will continue to defend our rights to the fullest extent,” the spokesperson said.
Oxylabs’ chief governance strategy officer, Denas Grybauskas, told Ars that Reddit’s complaint seemed baffling since the other companies involved in the litigation are “unrelated and unaffiliated.”
“We are shocked and disappointed by this news, as Reddit has made no attempt to speak with us directly or communicate any potential concerns,” Grybauskas said. “Oxylabs has always been and will continue to be a pioneer and an industry leader in public data collection, and it will not hesitate to defend itself against these allegations. Oxylabs’ position is that no company should claim ownership of public data that does not belong to them. It is possible that it is just an attempt to sell the same public data at an inflated price.”
Grybauskas defended Oxylabs’ business as creating “real-world value for thousands of businesses and researchers, such as those driving open-source investigations, disinformation tackling, or environmental monitoring.”
“We strongly believe that our core business principles make the Internet a better place and serve the public good,” Grybauskas said. “Oxylabs provides infrastructure for compliant access to publicly available information, and we demand every customer to use our services lawfully. ”
Reddit cited threats to licensing deals
Apparently, Reddit caught on to the alleged scheme after sending cease-and-desist letters to Perplexity to stop scraping Reddit content that its answer engine was citing. Rather than ending the scraping, Reddit claimed Perplexity’s citations increased “forty-fold.” Since Perplexity is a customer listed on SerpApi’s website, Reddit hypothesized the two were conspiring to skirt Google’s anti-circumvention tools, the complaint said, along with the other companies.
In a statement provided to Ars, Ben Lee, chief legal officer at Reddit, said that Oxylabs, AWMProxy, and SerpApi were “textbook examples” of scrapers that “bypass technological protections to steal data, then sell it to clients hungry for training material.”
“Unable to scrape Reddit directly, they mask their identities, hide their locations, and disguise their web scrapers to steal Reddit content from Google Search,” Lee said. “Perplexity is a willing customer of at least one of these scrapers, choosing to buy stolen data rather than enter into a lawful agreement with Reddit itself.”
On Reddit, Perplexity pushed back on Reddit’s claims that Perplexity ignored requests to license Reddit content.
“Untrue. Whenever anyone asks us about content licensing, we explain that Perplexity, as an application-layer company, does not train AI models on content,” Perplexity said. “Never has. So, it is impossible for us to sign a license agreement to do so.”
Reddit supposedly “insisted we pay anyway, despite lawfully accessing Reddit data,” Perplexity said. “Bowing to strong arm tactics just isn’t how we do business.”
Perplexity’s spokesperson, Jesse Dwyer, told Ars the company chose to post its statement on Reddit “to illustrate a simple point.”
“It is a public Reddit link accessible to anyone, yet by the logic of Reddit’s lawsuit, if you mention it or cite it in any way (which is your job as a reporter), they might just sue you,” Dwyer said.
But Reddit claimed that its business and reputation have been “damaged” by “misappropriation of Reddit data and circumvention of technological control measures.” Without a licensing deal ensuring that Perplexity and others are respecting Reddit policies, Reddit cannot control who has access to data, how they’re using data, and if data use conflicts with Reddit’s privacy policy and user agreement, the complaint said.
Further, Reddit’s worried that Perplexity’s workaround could catch on, potentially messing up Reddit’s other licensing deals. All the while, Reddit noted, it has to invest “significant resources” in anti-scraping technology, with Reddit ultimately suffering damages, including “lost profits and business opportunities, reputational harm, and loss of user trust.”
Reddit’s hoping the court will grant an injunction barring companies from scraping Reddit content from Google SERPs. It also wants companies blocked from both selling Reddit data and “developing or distributing any technology or product that is used for the unauthorized circumvention of technological control measures and scraping of Reddit data.”
If Reddit wins, companies could be required to pay substantial damages or to disgorge profits from the sale of Reddit content.
Advance Publications, which owns Ars Technica parent Condé Nast, is the largest shareholder in Reddit.
Credit: SOPA Images / Contributor | LightRocket
In a lawsuit filed on Wednesday, Reddit accused an AI search engine, Perplexity, of conspiring with several companies to illegally scrape Reddit content from Google search results, allegedly dodging anti-scraping methods that require substantial investments from both Google and Reddit.
Reddit alleged that Perplexity feeds off Reddit and Google, claiming to be “the world’s first answer engine” but really doing “nothing groundbreaking.”
“Its answer engine simply uses a different company’s” large language model “to parse through a massive number of Google search results to see if it can answer a user’s question based on those results,” the lawsuit said. “But Perplexity can only run its ‘answer engine’ by wrongfully accessing and scraping Reddit content appearing in Google’s own search results from Google’s own search engine.”
Likening companies involved in the alleged conspiracy to “bank robbers,” Reddit claimed it caught Perplexity “red-handed” stealing content that its “answer engine” should not have had access to.
Baiting Perplexity with “the digital equivalent of marked bills,” Reddit tested out posting content that could only be found in Google search engine results pages (SERPs) and “within hours, queries to Perplexity’s ‘answer engine’ produced the contents of that test post.”
“The only way that Perplexity could have obtained that Reddit content and then used it in its ‘answer engine’ is if it and/or its Co-Defendants scraped Google SERPs for that Reddit content and Perplexity then quickly incorporated that data into its answer engine,” Reddit’s lawsuit said.
In a Reddit post, Perplexity denied any wrongdoing, describing its answer engine as summarizing Reddit discussions and citing Reddit threads in answers, just like anyone who shares links or posts on Reddit might do. Perplexity suggested that Reddit was attacking the open Internet by trying to extort licensing fees for Reddit content, despite knowing that Perplexity doesn’t train foundational models. Reddit’s endgame, Perplexity alleged, was to use the Perplexity lawsuit as a “show of force in Reddit’s training data negotiations with Google and OpenAI.”
“We won’t be extorted, and we won’t help Reddit extort Google, even if they’re our (huge) competitor,” Perplexity wrote. “Perplexity will play fair, but we won’t cave. And we won’t let bigger companies use us in shell games. ”
Reddit likely anticipated Perplexity’s defense of the “open Internet,” noting in its complaint that “Reddit’s current Robots Exclusion Protocol file (‘robots.txt’) says, ‘Reddit believes in an open Internet, but not the misuse of public content.'”
Google reveals how scrapers steal from search results
To block scraping, Reddit uses various measures, such as “registered user-identification limits, IP-rate limits, captcha bot protection, and anomaly-detection tools,” the complaint said.
Similarly, Google relies on “anti-scraping systems and teams dedicated to preventing unauthorized access to its products and services,” Reddit said, noting Google prohibits “unauthorized automated access” to its SERPs.
To back its claims, Reddit subpoenaed Google to find out more about how the search giant blocks AI scrapers from accessing content on SERPs. Google confirmed it relies on “a technological access control system called ‘SearchGuard,’ which is designed to prevent automated systems from accessing and obtaining wholesale search results and indexed data while allowing individual users—i.e., humans—access to Google’s search results, including results that feature Reddit data.”
“SearchGuard prevents unauthorized access to Google’s search data by imposing a barrier challenge that cannot be solved in the ordinary course by automated systems unless they take affirmative actions to circumvent the SearchGuard system,” Reddit’s complaint explained.
Bypassing these anti-scraping systems violates the Digital Millennium Copyright Act, Reddit alleged, as well as laws against unfair trade and unjust enrichment. Seemingly, Google’s SearchGuard may currently be the easiest to bypass for alleged conspirators who supposedly pivoted to looting Google SERPs after realizing they couldn’t access Reddit content directly on the platform.
Scrapers shocked by Reddit lawsuit
Reddit accused three companies of conspiring with Perplexity—”a Lithuanian data scraper” called Oxylabs UAB, “a former Russian botnet” known as AWMProxy, and SerpApi, a Texas company that sells services for scraping search engines.
Oxylabs “is explicit that its scraping service is meant to circumvent Google’s technological measures,” Reddit alleged, pointing to an Oxylabs’ website called “How to Scrape Google Search Results.”
SerpApi touts the same service, including some options to scrape SERPs at “ludicrous speeds.” To trick browsers, SerpApi’s fastest option uses “a server-swarm to hide from, avoid, or simply overwhelm by brute force effective measures Google has put in place to ward off automated access to search engine results,” Reddit alleged. SerpApi also allegedly provides users “with tips to reduce the chance of being blocked while web scraping, such as by sending ‘fake user-agent string,’ shifting IP addresses to avoid multiple requests from the same address, and using proxies ‘to make traffic look like regular user traffic’ and thereby ‘impersonate’ user traffic.”
According to Reddit, the three companies disguise “their web scrapers as regular people (among other techniques) to circumvent or bypass the security restrictions meant to stop them.” During a two-week span in July, they scraped “almost three billion” SERPs containing Reddit text, URLs, images, and videos, a subpoena requesting information from Google revealed.
Ars could not immediately reach AWMProxy for comment. However, the other companies were surprised by Reddit’s lawsuit, while vowing to defend their business models.
SerpApi’s spokesperson told Ars that Reddit did not notify the company before filing the lawsuit.
“We strongly disagree with Reddit’s allegations and intend to vigorously defend ourselves in court,” SerpApi’s spokesperson said. “In the eight years we’ve been in business, SerpApi has always operated on the right side of the law. As stated on our website, ‘The crawling and parsing of public data is protected by the First Amendment of the United States Constitution. We value freedom of speech tremendously.’”
Additionally, SerpAPI works “closely with our attorneys to ensure that our services comply with all applicable laws and fair use principles. SerpApi stands firmly behind its business model and conduct, and we will continue to defend our rights to the fullest extent,” the spokesperson said.
Oxylabs’ chief governance strategy officer, Denas Grybauskas, told Ars that Reddit’s complaint seemed baffling since the other companies involved in the litigation are “unrelated and unaffiliated.”
“We are shocked and disappointed by this news, as Reddit has made no attempt to speak with us directly or communicate any potential concerns,” Grybauskas said. “Oxylabs has always been and will continue to be a pioneer and an industry leader in public data collection, and it will not hesitate to defend itself against these allegations. Oxylabs’ position is that no company should claim ownership of public data that does not belong to them. It is possible that it is just an attempt to sell the same public data at an inflated price.”
Grybauskas defended Oxylabs’ business as creating “real-world value for thousands of businesses and researchers, such as those driving open-source investigations, disinformation tackling, or environmental monitoring.”
“We strongly believe that our core business principles make the Internet a better place and serve the public good,” Grybauskas said. “Oxylabs provides infrastructure for compliant access to publicly available information, and we demand every customer to use our services lawfully. ”
Reddit cited threats to licensing deals
Apparently, Reddit caught on to the alleged scheme after sending cease-and-desist letters to Perplexity to stop scraping Reddit content that its answer engine was citing. Rather than ending the scraping, Reddit claimed Perplexity’s citations increased “forty-fold.” Since Perplexity is a customer listed on SerpApi’s website, Reddit hypothesized the two were conspiring to skirt Google’s anti-circumvention tools, the complaint said, along with the other companies.
In a statement provided to Ars, Ben Lee, chief legal officer at Reddit, said that Oxylabs, AWMProxy, and SerpApi were “textbook examples” of scrapers that “bypass technological protections to steal data, then sell it to clients hungry for training material.”
“Unable to scrape Reddit directly, they mask their identities, hide their locations, and disguise their web scrapers to steal Reddit content from Google Search,” Lee said. “Perplexity is a willing customer of at least one of these scrapers, choosing to buy stolen data rather than enter into a lawful agreement with Reddit itself.”
On Reddit, Perplexity pushed back on Reddit’s claims that Perplexity ignored requests to license Reddit content.
“Untrue. Whenever anyone asks us about content licensing, we explain that Perplexity, as an application-layer company, does not train AI models on content,” Perplexity said. “Never has. So, it is impossible for us to sign a license agreement to do so.”
Reddit supposedly “insisted we pay anyway, despite lawfully accessing Reddit data,” Perplexity said. “Bowing to strong arm tactics just isn’t how we do business.”
Perplexity’s spokesperson, Jesse Dwyer, told Ars the company chose to post its statement on Reddit “to illustrate a simple point.”
“It is a public Reddit link accessible to anyone, yet by the logic of Reddit’s lawsuit, if you mention it or cite it in any way (which is your job as a reporter), they might just sue you,” Dwyer said.
But Reddit claimed that its business and reputation have been “damaged” by “misappropriation of Reddit data and circumvention of technological control measures.” Without a licensing deal ensuring that Perplexity and others are respecting Reddit policies, Reddit cannot control who has access to data, how they’re using data, and if data use conflicts with Reddit’s privacy policy and user agreement, the complaint said.
Further, Reddit’s worried that Perplexity’s workaround could catch on, potentially messing up Reddit’s other licensing deals. All the while, Reddit noted, it has to invest “significant resources” in anti-scraping technology, with Reddit ultimately suffering damages, including “lost profits and business opportunities, reputational harm, and loss of user trust.”
Reddit’s hoping the court will grant an injunction barring companies from scraping Reddit content from Google SERPs. It also wants companies blocked from both selling Reddit data and “developing or distributing any technology or product that is used for the unauthorized circumvention of technological control measures and scraping of Reddit data.”
If Reddit wins, companies could be required to pay substantial damages or to disgorge profits from the sale of Reddit content.
Advance Publications, which owns Ars Technica parent Condé Nast, is the largest shareholder in Reddit.