Every Wednesday, Daniel Russell, a researcher working with Google, posts a search question on his search & research blog. The search question for 26 September 2012 related to differences between the coastlines on the East and West coasts of the USA. Attempting to answer the question, I typed [Atlantic islands] into Google. Instead of the usual list I’d expected, I got this:
The images at the top of my search were a surprise. Clicking on the arrows gave me further images – totalling 55 island pictures. I tried a few other searches [Pacific Islands], [Indian Ocean Islands], etc. and found similar results. Yet most searches such as [Scottish Islands] gave me the normal type of listing.
Intrigued, I contacted a couple of colleagues – Karen Blakeman of RBA Information Services and Marydee Ojala, Editor of Online magazine (and the Online Insider blog). Both Karen and Marydee are also members of the Association of Independent Information Professionals and like so many AIIP members, are expert searchers. (All three of us are presenting at the forthcoming Internet Librarian Conference in London and led the London Websearch Academy in 2011).
Marydee admitted to being bemused but guessed it was connected to Google’s Knowledge Graph initiative – the new service that puts details on a search topic to the right of the search results – as with this example search for [Albert Einstein].
Knowledge Graph was launched by Google in May 2012 and aims to give instant answers to many encyclopedia type search queries. However this didn’t explain what I’d found. Marydee looked a bit further and found that the TechCrunch blog had discovered this earlier in September.
I mentioned that I’d found it because of Dan Russell’s blog and Marydee asked him about the new feature. Dan responded that the “carousel” of images is triggered whenever Google knows about a collection or group of connected items such as “Atlantic Islands”. The group is then summarised and made available at the top of the results list – allowing searchers to quickly recognise the collection and the other group members.
So that’s it then! It’s a new feature giving a “carousel” of images. If you search for [knowledge graph carousel] you get the above TechCrunch link and also Google’s own search blog on the topic. (There’s a lesson here – always check Google’s own blog posts if you spot what looks like odd Google behaviour.) A search for [Knowledge graph] gives Google’s own description of the feature, including a YouTube video explaining it.
Dan Russell’s reply however said more:
What it triggers on is a bit more problematic. Answer: only collections we know about, which can be a bit odd. [moons of Saturn] but not [U.S. presidents]. [famous jazz composers] works, but not [cities in UAE]
This seems to explain why not all searches show the carousel. [Atlantic Islands] does. So does [Pacific Islands] but [Islands] doesn’t. [Greek Islands] is mentioned as an example in the YouTube video – but the less touristy [Scottish Islands] fails to show the carousel. It’s not just islands that give oddly inconsistent results. [Famous Jazz composers] results in the carousel appearing but [famous composers] gives a normal display. [20th century composers] works as does [19th century composers]. Bizarrely [18th century composers] doesn’t work and nor does [20th century artists] or [19th century artists]. Yet [impressionist artists] and [surrealist artists] do work. The results definitely seem surreal!
The TechCrunch blog tested the feature looking at rides at the Cedar Point theme park in Northern Ohio. I decided to ride the carousel on Disney parks. Again the results were odd – but a pattern seemed to emerge. [Disneyland rides], [Epcot rides], [Magic Kingdom Rides] all worked but [Disneyworld rides] didn’t. I then tried [Disney Paris Rides]. That works. So does [Disney California Rides]. However [Disney Florida Rides], [Disney Tokyo Rides] and [Disney Hong Kong Rides] all failed to work.
It seems as if there are two factors playing out here. The first is whether Google knows enough about the topic to create a set of common images. My guess is that Disney Hong Kong and Tokyo fail on that count – and possibly this explains why 18th century composers also fails. That can’t however explain the difference between Disney California and Paris, compared to Disney Florida. That brings in the second factor: the number of items in the collection. There are several Disney World theme parks for Disney Florida – Epcot, Magic Kingdom and more. I suspect that there are too many rides to be displayed in a meaningful manner. The aim of the Carousel is to encourage exploration – and a never-ending list tends to do the opposite: like a carousel that goes too fast, there is a risk that people may fall off.
I’ve been playing with a new data search engine called Zanran – that focuses on finding numerical and graphical data. The site is in an early beta. Nevertheless my initial tests brought up material that would only have been found using an advanced search on Google – if you were lucky. As such, Zanran promises to be a great addition for advanced data searching.
Zanran focuses on finding what it calls ‘semi-structured’ data on the web. This is defined as numerical data presented as graphs, tables and charts – and these could be held in a graph image or table in an HTML file, as part of a PDF report, or in an Excel spreadsheet. This is the key differentiator – essentially, Zanran is not looking for text but for formatted numerical data.
When I first started looking at the site I was expecting something similar to Wolfram Alpha – or perhaps something from Google (e.g. Google Squared or Google Public Data). Zanran is nothing like these – and so brings something new to search. Rather than take data and structure or tabulate it (as with Wolfram Alpha and Google Squared), Zanran searches for data that is already in tables or charts and uses this in its results listing.
The site has a nice touch in that hovering the cursor over results gives you the relevant data page – whether a table, a chart or a mix of text, tables or charts.
The advanced search options allow country searching (based on server location), document date and file type, each selectable from a drop-down box, as well as searches on specified web-sites. At the moment only English speaking countries can be selected (Australia, Canada, Ireland, India, UK, New Zealand, USA and South Africa). The date selections allow for the last 6, 12 or 24 months and the file type allows for selection based on PDF; Excel; images in HTML files; tables in HTML files; PDF, Excel and dynamic data; and dynamic data alone. PowerPoint and Word files are promised as future options. There are currently no field search options (e.g. title searches).
My main dislike was that the site doesn’t give the full URLs for the data presented. The top-level domain is given, but not the actual URL which makes the site difficult to use when full attribution is required for any data found (especially if data gets downloaded, rather than opening up in a new page or tab).
Zanran.com has been in development since at least 2009 when it was a finalist in the London Technology Fund Competition. The technology behind Zanran is patented and based on open-source software, and cloud storage. Rather than searching for text, Zanran searches for numerical content, and then classifies it by whether it’s a table or a chart.
Atypically, Zanran is not a Californian Silicon Valley Startup, but is based in the Islington area of London, in a quiet residential side-street made up of a mixture of small mostly home-based businesses and flats/apartments. Zanran was founded by two chemists, Jonathan Goldhill and Yves Dassas, who had previously run telecom businesses (High Track Communications Ltd and Bikebug Radio Technologies) from the same address. Funding has come from the London Development Agency and First Capital among other investors.
Zanran views competitors as Wolfram Alpha, Google Public Data and also Infochimps (a database repository – enabling users to search for and download a wide variety of databases). The competitor list comes from Google’s cache of Zanran’s Wikipedia page as unfortunately, Wikipedia has deleted the actual page – claiming that the site is “too new to know if it will or will not ever be notable“.
I hope that Wikipedia is wrong and that Zanran will become “notable” as I think the company offers a new approach to searching the web for data. It will never replace Google or Bing – but that’s not its aim. Zanran aims to be a niche tool that will probably only ever be used by search experts. However as such, it deserves a chance, and if its revenue model (I’m assuming that there is one) works, it deserves success.
Search experts regularly emphasise that to get the best search results it is important to use more than one search engine. The main reason for this is that each search engine uses a different relevancy ranking leading to different search results pages. Using Google will give a results page with the sites that Google thinks are the most relevant for the search query, while using Bing is supposed to give a results page where the top hits are based on a different relevancy ranking. This alternative may give better results for some searches and so a comprehensive search needs to use multiple search engines.
You may have noticed that I highlighted the word supposed when mentioning Bing. This is because it appears that Bing is cheating, and is using some of Google’s results in their search lists. Plagiarising Google’s results may be Bing’s way of saying that Google is better. However it leaves a bad taste as it means that one of the main reasons for using Microsoft’s search engine can be questioned, i.e. that the results are different and that all are generated independently, using different relevancy rankings.
Bing is Microsoft’s third attempt at a market-leading, Google bashing, search engine – replacing Live.com which in turn had replaced MSN Search. Bing has been successful and is truly a good alternative to Google. It is the default search engine on Facebook (i.e. when doing a search on Facebook, you get Bing results) and is also used to supply results to other search utilities – most notably Yahoo! From a marketing perspective, however, it appears that the adage “differentiate or die” hasn’t been fully understood by Bing. Companies that fail to fully differentiate their product offerings from competitors are likely to fail.
The story that Bing was copying Google’s results dates back to Summer 2010, when Google noticed an odd similarity to a highly specialist search on the two search engines. This, in itself wouldn’t be a problem. You’d expect similar results for very targeted search terms – the main difference will be the sort order. However in this case, the same top results were being generated when spelling mistakes were used as the search term. Google started to look more closely – and found that this wasn’t just a one-off. However to prove that Bing was stealing Google’s results needed more than just observation. To test the hypothesis, Google set up 100 dummy and nonsense queries that led to web-sites that had no relationship at all to the query. They then gave their testers laptops with a new Windows install – running Microsoft’s Internet Explorer 8 and with the Bing Toolbar installed. The install process included the “Suggested Sites” feature of Internet Explorer and the toolbar’s default options.
Within a few weeks, Bing started returning the fake results for the same Google searches. For example, a search for hiybbprqag gave the seating plan for a Los Angeles theatre, while delhipublicschool40 chdjob returned an Ohio Credit Union as the top result. This proved that the source for the results was not Bing’s own search algorithm but that the result had been taken from Google.
What was happening was that the searches and search results on Google were being passed back to Microsoft – via some feature of Internet Explorer 8, Windows or the Bing Toolbar.
As Google states in their Blog article on the discovery (which is illustrated with screenshots of the findings):
At Google we strongly believe in innovation and are proud of our search quality. We’ve invested thousands of person-years into developing our search algorithms because we want our users to get the right answer every time they search, and that’s not easy. We look forward to competing with genuinely new search algorithms out there—algorithms built on core innovation, and not on recycled search results from a competitor. So to all the users out there looking for the most authentic, relevant search results, we encourage you to come directly to Google. And to those who have asked what we want out of all this, the answer is simple: we’d like for this practice to stop.
Interestingly, Bing doesn’t even try to deny the claim – perhaps because they realise that they were caught red-handed. Instead they have tried to justify using the data on customer computers as a way of improving search experiences – even when the searching was being done via a competitor. In fact, Harry Shum, a Bing VP, believes that this is actually good practice, stating in Bing’s response to a blog post by Danny Sullivan that exposed the practice:
“We have been very clear. We use the customer data to help improve the search experience…. We all learn from our collective customers, and we all should.”
It is well known that companies collect data on customer usage of their own web-sites – that is one purpose of cookies generated when visiting a site. It is less well known that some companies also collect data on what users do on other sites (which is why Yauba boasts about its privacy credentials). I’m sure that the majority of users of the Bing toolbar and other Internet Explorer and Windows features that seem to pass back data to Microsoft would be less happy if they knew how much data was collected and where from. Microsoft has been collecting such data for several years, but ethically the practice is highly questionable, even though Microsoft users may have originally agreed to the company collecting data to “help improve the online experience“.
What the story also shows is how much care and pride Google take in their results – and how they have an effective competitive intelligence (and counter-intelligence) programme, actively comparing their results with competitors. Microsoft even recognised this by falsely accusing Google of spying via their sting operation that exposed Microsoft’s practices – with Shum commenting (my italics):
What we saw in today’s story was a spy-novelesque stunt to generate extreme outliers in tail query ranking. It was a creative tactic by a competitor, and we’ll take it as a back-handed compliment. But it doesn’t accurately portray how we use opt-in customer data as one of many inputs to help improve our user experience.
To me, this sounds like sour grapes. How can copying a competitor’s results improve the user experience? If it doesn’t accurately portray how customer data IS used, maybe now would be the time for Microsoft to reassure customers regarding their data privacy. And rather than view Google’s exposure of Bing’s practices as a back-handed compliment, I’d see it as a slap in the face with the front of the hand. However what else could Microsoft & Bing say, other than Mea Culpa?
Update – Wednesday 2 February 2011:
The war of words between Google and Bing continues. Bing has now denied copying Google’s results, and moreover accused Google of click-fraud:
Google engaged in a “honeypot” attack to trick Bing. In simple terms, Google’s “experiment” was rigged to manipulate Bing search results through a type of attack also known as “click fraud.” That’s right, the same type of attack employed by spammers on the web to trick consumers and produce bogus search results. What does all this cloak and dagger click fraud prove? Nothing anyone in the industry doesn’t already know. As we have said before and again in this post, we use click stream optionally provided by consumers in an anonymous fashion as one of 1,000 signals to try and determine whether a site might make sense to be in our index.
Bing seems to have ignored the fact that Google’s experiment resulted from their observation that certain genuine searches seemed to be copied by Bing – including misspellings, and also some mistakes in their algorithm that resulted in odd results. The accusation of click fraud is bizarre as the searches Google used in its test were completely artificial. There is no way that a normal searcher would have made such searches, and so the fact that the results bore no resemblance to the actual search terms is completely different to the spam practice where a dummy site appears for certain searches.
Bing can accuse Google of cloak and dagger behaviour. However sometimes, counter-intelligence requires such behaviour to catch miscreants red-handed. It’s a practice carried out by law enforcement globally where a crime is suspected but where there is insufficient evidence to catch the culprit. As an Internet example, one technique used to catch paedophiles is for a police officer to pretend to be a vulnerable child on an Internet chat-room. Is this fraud – when the paedophile subsequently arranges to meet up – and is caught? In some senses it is. However saying such practices are wrong gives carte-blanche to criminals to continue their illegal practices. Bing appears to be putting themselves in the same camp – by saying that using “honeypot” attacks is wrong.
They also have not recognised the points I’ve stressed about the ethical use of data. There is a big difference between using anonymous data tracking user behaviour on your own search engine and tracking that of a competitor. Using your competitor’s data to improve your own product, when the intelligence was gained by technology that effectively hacks into usage made by your competitor’s customers is espionage. The company guilty of spying is Bing – not Google. Google just used competitive intelligence to identify the problem, and a creative approach to counter-intelligence to prove it.
I’d planned to write this post on business culture, working as part of a team and leadership. Meanwhile I’m still tinkering with WordPress – trying to get to know it better. There’s a couple of things I liked about Google’s Blogger tool that I’ve not yet managed to work out how to do on WordPress. Actually that’s not completely true. If you download WordPress and blog on your own server it’s fairly easy. However there are also things against doing that – for example some technical details, security & spam, etc. Conversely WordPress.com won’t let me download some of the plugins I wanted. Despite this, I’m pretty happy with WordPress as a blog platform.
Matt Cutts is well known as not only a Google expert (naturally) but also as an expert on search engine optimisation – in other words, how to get found on the web. There is so much in this one presentation that I think it should be compulsory viewing for everybody who writes for the web. Although I try and do most of what was said – there’s still more for me to do, and he had some great examples. The focus was on blogging using WordPress but in fact much of the content was much wider – with explanations on what search engines (and specifically Google) look for when indexing the web.
As not everybody will spare 45 minutes to watch the video, I’ll summarise some of the content – and the slides can be found at Matt’s web-site.
Matt starts by asking why write a blog in the first place, but quickly moves onto optimising sites for the web and how to increase your chances of being found. He gives a simple explanation for Google’s PageRank (named after Google co-founder Larry Page, rather than because it ranks web pages – although it does measure a web page’s importance / popularity based on the links to the page). Around half way through the presentation, he starts emphasising the most important thing about writing for the web (whether for a general site or for a blog). The writing has to be relevant and reputable. Good and interesting writing gets read. Boring, trite, repetitive writing doesn’t. In other words, if you don’t love what you are writing about, and don’t know or have anything to say, then don’t say anything. (For more on good writing, read the Write Way – my brother’s blog – covering how to produce technical documentation that’s understandable).
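To make the PageRank idea a little more concrete, here is a minimal sketch of the textbook version of the calculation. It is not Google's production ranking and it is not taken from Matt's slides. The page names are made up for illustration, and the damping factor of 0.85 is simply the value commonly quoted in descriptions of the algorithm. The point it shows is that a page's score depends on how many pages link to it and on how important those linking pages are themselves.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of scores.

    links maps each page to the list of pages it links to.
    All page names used below are hypothetical.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal scores

    for _ in range(iterations):
        # Every page gets a small baseline score...
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # ...plus an equal share of the score of each page linking to it.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page site: everything links to "home",
# and nothing links to "orphan".
example_links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home"],
    "orphan": ["home"],
}
example_rank = pagerank(example_links)
```

In this toy site, "home" ends up with the highest score because every other page links to it, while "orphan" ends up with the lowest because nothing links to it.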
Then we get to the bits on SEO (Search Engine Optimisation – i.e. writing web-sites so that they can be found). When I take training courses on finding competitive intelligence on the web I always emphasise the need to understand how sites get to the top spots. If you understand this, then it becomes easier to think of ways of finding sites that aren’t found on the first page – and often these are the pages that hold the hidden gems that the competitor analyst has to find.
One key skill is to think of alternative terms. As a portable back-up device I tend to use a memory stick. However other terms for the same device are “flash drive“, “USB drive” and a few others. Searching for only one of these risks missing out sites not using that term but one of its synonyms. Cutts gives an example of searches for iPod car for connecting an iPod to a car’s radio / entertainment system. There is an alternative less costly technology called iTrip that also allows an iPod to be connected to the car radio. For every two searches using the term iPod Car, there was one that used the key word iTrip. This means that excluding the latter term from sites selling the former will result in them missing out on a third of the potential Internet traffic. From a competitive intelligence perspective, it would also mean missing out information on a competing technology. Just because it’s not exactly the same, using a different technological approach and costing less, doesn’t mean it’s not also a competitor – so searching for one and not the other would mean missing out on what customers are actually looking to purchase.
Other SEO techniques covered include web-page naming, establishing a reputation, monitoring visitors via analysis of log files / Google Analytics and how not to spam (and scam).
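The "third of the traffic" figure follows directly from the two-to-one ratio, and the same synonym thinking can be applied when building a search. The sketch below is only an illustration: the function names are my own, and it assumes a search engine that accepts an upper-case OR between alternative terms and double quotes around phrases, as Google does.

```python
def missed_share(covered_searches, missed_searches):
    """Fraction of all searches lost by ignoring the alternative term."""
    return missed_searches / (covered_searches + missed_searches)


def or_query(terms):
    """Join synonyms into a single OR query, quoting multi-word phrases."""
    return " OR ".join(f'"{t}"' if " " in t else t for t in terms)


# Two searches for iPod Car for every one search for iTrip.
itrip_share = missed_share(covered_searches=2, missed_searches=1)

# One query that covers the synonyms for a memory stick.
backup_query = or_query(["memory stick", "flash drive", "USB drive"])
```

With two searches for one term and one for the other, the neglected term accounts for 1 / (2 + 1) of the total, which is the one third mentioned above.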
Many searchers depend on their bookmark list but what happens when a key site disappears: if you don’t know how to search you are stuck.
Searching isn’t just going to Google and typing your query in the search box. Expert searching demands that you consider where the information you are looking for is likely to be held, and in what format. It requires the searcher to understand the search tools they use – how they work and their strengths and weaknesses. Such skills are crucial when key sites disappear as happened in January with the small French meta-search engine, Kartoo.
Kartoo was innovative and presented results graphically. It enabled you to see links between terms and was brilliant for concept searching where you didn’t really know where to start. Unfortunately it’s now gone to cyber-heaven, or wherever dead web-sites disappear to. It will be missed – at least until something similar appears. Already Google’s wonderwheel (found from the “options” link just above the search results) offers some of the functionality and graphic feel, and there are other sites that offer similar capabilities (e.g. Touchgraph). Kartoo however was special – it was simple, free and showed that Europeans can still come up with good search ideas.
Of course Kartoo isn’t the first innovative site to disappear. Over the years, many great search tools have gone. Greg Notess lists some in his SearchEngineShowdown blog – and an article in Online magazine. There are more. How many people remember IBM’s Infomarket service – an early online news aggregator from 1995, or Transium?
In fact, it was learning that sites are mortal that led to my approach to searching: don’t depend on a limited selection of sites but rather know how to find sites and databases that lead you to the information wanted. That’s a key skill for all researchers and is as valid today in the Google generation as it was in the days before Google.
I’ve just been pointed to a new Google Labs initiative – the Google Public Data Explorer. This promises to be a useful tool for finding public data in one place. (It’s always worth keeping an eye on Google Labs as they often bring out new ideas and products. These are kept together until ready to launch – and can be found from http://www.googlelabs.com).