Google search alternatives face a big hurdle
Before a new search engine can hope to beat Google, it has to crawl.
But indexing the web by “crawling” sites with automated software doesn’t just mean spanning the web’s vast reach, although that is a big challenge in itself. Individual sites are under no obligation to host a new search robot, and some post the digital equivalent of no-trespassing signs to discourage automated traffic that could slow their performance.
“The web contains billions of documents,” says Vivek Raghunathan, co-founder of the subscription-based, ad-free search startup Neeva. “And the web is much more difficult to explore than it was a few years ago.”
An October 2020 report on digital competition from the House Judiciary Committee’s antitrust subcommittee put a government spotlight on this problem.
“The high cost of maintaining a fresh index, combined with the decision of many large websites to block most crawlers, significantly limits new search entrants,” the report says. “Today, the only English-language search engines that maintain their own comprehensive index of the web are Google and Bing.”
That leaves many of Google’s competitors leaning on the index that Microsoft maintains for its Bing search, which holds 6.4% of the US market to Google’s 87.3% in Statcounter’s metrics. Bing’s index works well for many queries, but sites that rely on it cede a vital way to differentiate themselves.
This is a problem for Neeva as well as two other privacy-focused search engines, DuckDuckGo and Brave. All three rely on Bing for some of the results they show users. It’s an ingredient rather than their entire technology, but still: it would be easier to do without if building a new web index weren’t so difficult.
Robots are not welcome here
Websites control automated access to their pages with standardized “robots.txt” files that list where crawlers may and may not go. Crawlers can ignore those instructions, as the Internet Archive began doing in 2017 to improve its backup of the web, but sites can punish an ill-mannered bot by blocking its access entirely.
DuckDuckGo and Neeva both cited Facebook as an example. Its robots.txt file takes a guest-list approach, admitting Google and Bing as well as less obvious crawlers such as “Applebot,” which collects data for Apple’s Siri and Spotlight, but shutting out any bot not mentioned by name.
Jason Grosse, spokesperson for Facebook’s parent company, Meta, said in an email: “In general, our robots.txt policy is not out of step with other major platforms.”
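For illustration, an allowlist-style robots.txt in this spirit might look like the following hypothetical fragment (not Facebook’s actual file, which is far longer and names more agents):

```
User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow:

User-agent: Applebot
Disallow:

User-agent: *
Disallow: /
```

An empty Disallow line permits a named agent to fetch everything, while “Disallow: /” bars all other agents from the entire site.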
Indexing sites that don’t appreciate the attention of a new crawler can require discretion and diplomacy.
“A lot of the work we’ve done over the past year, year and a half, is building a crawling system that behaves well,” said Neeva’s Raghunathan. “We’re doing things like a smart algorithmic estimate of how much we can crawl on this site so that it looks like a rounding error.”
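The “rounding error” idea can be caricatured in a few lines: cap the crawler’s request volume at a tiny share of a site’s own estimated traffic. This is a toy sketch of the concept, not Neeva’s actual algorithm; the function name and the 0.1% share are made up for illustration.

```python
def daily_crawl_budget(estimated_daily_requests: int,
                       max_share: float = 0.001) -> int:
    """Cap crawl requests at a small fraction of the site's own traffic,
    so the crawler's load 'looks like a rounding error' to the operator.
    (Hypothetical helper; the 0.1% default share is an assumption.)"""
    return max(1, int(estimated_daily_requests * max_share))

# A site serving five million requests a day would see at most 5,000
# crawl requests, i.e. 0.1% of its traffic.
print(daily_crawl_budget(5_000_000))  # 5000
```

A real system would also spread that budget over the day and back off when the site responds slowly, but the core idea is the same: size the crawl to the site, not to the crawler’s appetite.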
Sometimes, however, Neeva has to ask for help. From whom? “I would say it’s whoever we know first, and often the first person we know is the CEO or the head of engineering.”
Brave, meanwhile, operates somewhat stealthily, varying its crawler’s identification and respecting only the restrictions that a robots.txt file places on Google’s crawler. Josep M. Pujol, chief research officer at Brave, the company founded by Mozilla co-founder Brendan Eich and best known for its privacy-focused browser, said in an email that this approach requires caution.
“We respect the spirit of the law but not the letter,” he said. “To date, the data centers that host our crawlers have received a very small number of complaints.”
Pujol said it was impractical to ask permission from individual sites: “How do you scale human interaction to thousands of businesses?”
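The policy of consulting the rules a site publishes for Google’s crawler, rather than for one’s own, can be sketched with Python’s standard-library robots.txt parser. The robots.txt content and URLs here are invented for illustration:

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that gives Googlebot broad access but bars
# unnamed crawlers from the whole site.
robots_txt = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""

rules = RobotFileParser()
rules.parse(robots_txt.splitlines())

def may_crawl(url: str) -> bool:
    # The crawler identifies itself under its own name, but checks the
    # rules the site wrote for Googlebot (a sketch of the policy
    # described above, not Brave's actual implementation).
    return rules.can_fetch("Googlebot", url)

print(may_crawl("https://example.com/articles/1"))  # True
print(may_crawl("https://example.com/private/x"))   # False
```

Under this policy, a new crawler inherits whatever access the site has already granted Google, which is exactly why Pujol frames it as following the spirit rather than the letter of robots.txt.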
Google, meanwhile, gets another head start from its non-search lines of business, starting with display ads but also including services like Google Analytics, which give it access to sites that competitors can only ask for, said Zack Maril, a software engineer and founder of a pro-competition research group called the Knuckleheads’ Club.
These other businesses, he wrote in an email, “can all benefit Google’s search business in ways that competitors offering only search simply cannot match.”
Search sites without Google’s or Bing’s traffic also lack large-scale data about which sites people find more or less useful. Google and Bing “can look at what people clicked and prioritize from there,” says Raghunathan. “When you’re starting out, it’s much more difficult.”
A digital-competition report released in July 2020 by the UK Competition and Markets Authority suggested requiring Google to provide some of that data. As DuckDuckGo’s vice president of communications, Kamyl Bazbaz, put it approvingly, Google would have to “share a certain amount of click and query data that other search engines could use to level the playing field.”
Brave opts into a form of this sharing when it asks its users to allow “Google fallback mixing,” in which Brave sends a query to Google and then analyzes the results to improve its own index.
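At its simplest, this kind of mixing means serving results from one’s own index and consulting another engine only when those results are too thin. The sketch below is a deliberate simplification of the idea, not Brave’s actual logic; the function names and threshold are invented:

```python
def mixed_results(query, own_search, fallback_search, min_results=5):
    """Hypothetical fallback mixing: serve results from our own index,
    and only when they are too thin, query a fallback engine and append
    any new results it returns."""
    results = own_search(query)
    if len(results) < min_results:
        seen = set(results)
        results += [r for r in fallback_search(query) if r not in seen]
    return results

# Toy demonstration with canned results standing in for real engines.
own = lambda q: ["a.com", "b.com"]
fallback = lambda q: ["b.com", "c.com", "d.com"]
print(mixed_results("test", own, fallback))  # ['a.com', 'b.com', 'c.com', 'd.com']
```

In practice the interesting part is the second step the article mentions: analyzing the fallback results to improve the engine’s own index, so fewer future queries need the fallback at all.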
Even a search site that is good at delivering web results will struggle to match Google’s full spectrum of search features. For example, I’ve had DuckDuckGo set as the default on my iPad mini for years, but its map results only cover driving and walking directions, so I keep turning to Apple Maps and Google Maps.
Despite the challenges inherent in competing with Google in search, new companies keep stepping up to try, which says a lot about the persistence these newcomers will need.
“We love that there are a lot of other search competitors out there now,” said DuckDuckGo’s Bazbaz. “It’s a market that historically people have been really afraid of, and for good reason, because of how Google has dominated it.”