How to spot web-site plagiarism and copyright theft with CopyScape

March 11, 2012

AWARE has had a web-site since 1995 and our current domain (www.marketing-intelligence.co.uk) has been active since 1997. When we started there were fewer than 100,000 companies on the web. Google’s founders had not yet met each other, and even venerable search engines such as AltaVista had not yet started.

Over the years, we’ve made an effort to ensure that our web content was not copied and used on other sites without our permission.

Although manual checks – searching for distinctive key phrases – are one way of detecting plagiarism and copyright theft, there are also a number of dedicated plagiarism-checking sites. One example is Plagium. Plagium’s drawback, shared with several similar services, is that you have to paste in the text you want to test rather than simply enter a URL. Such services are generally aimed at helping teachers and college professors detect student cheating.
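To semi-automate the manual approach before paying for a service, a few lines of code can pull out a page’s most distinctive sentences and turn them into exact-phrase queries to check by hand. Below is a minimal sketch in Python; the sentence-length threshold, phrase count and sample text are arbitrary choices of mine, not anything prescribed by the services above.

```python
# A minimal sketch: pick the most distinctive (longest) sentences on a page
# and build exact-phrase Google queries that can be checked manually.
# The length threshold and phrase count are arbitrary choices.
import re
from urllib.parse import quote_plus

def phrase_queries(page_text: str, max_phrases: int = 5) -> list:
    """Return exact-phrase search URLs for the longest sentences on a page."""
    sentences = re.split(r"(?<=[.!?])\s+", page_text)
    # Longer sentences are less likely to match another site by coincidence.
    candidates = sorted(sentences, key=len, reverse=True)[:max_phrases]
    return [f"https://www.google.com/search?q=%22{quote_plus(s.strip())}%22"
            for s in candidates if len(s) > 40]

if __name__ == "__main__":
    sample = ("Competitive intelligence is the process of monitoring the "
              "competitive environment. It enables managers to make informed "
              "decisions. Short sentences are skipped.")
    for url in phrase_queries(sample):
        print(url)
```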

Although some services (such as Plagium) are free, most are not, and some involve downloading dedicated software. Others only check a limited number of known “essay” sites where students can download essays written by others. (We’ve found some of our content on such sites – evidently students who use them don’t care where they steal their content from. Having used an essay successfully, they then upload their A+ essay to the site for others to reuse.)

CopyScape

Of all the plagiarism-detection websites, probably the easiest and best is CopyScape. CopyScape’s aim is not only to help academics detect student cheating: it also allows webmasters to search for copied content in general. It doesn’t require users to paste in the suspect text. Instead, web-site owners simply enter their URLs and get a report on other sites that use similar or identical wording. It’s sufficiently powerful that there is even a flippant web-page on ContentBoss’s website giving advice on how to bypass CopyScape and copy with impunity. (ContentBoss promises to provide unique content for a low monthly fee. Their bypass-CopyScape tool converts content into HTML that is guaranteed not to be picked up by plagiarism detectors. The catch, as ContentBoss itself points out, is that using such content also guarantees that the site will be banned by search engines for spam content.)
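For site owners who want to run these checks programmatically, CopyScape also offers a paid Premium API. The sketch below shows roughly what a URL check looks like; the endpoint and parameter names (u, k, o=csearch, q) are my recollection of CopyScape’s published API and should be treated as assumptions to verify against the current documentation.

```python
# A hedged sketch of querying CopyScape's paid Premium API for pages that
# duplicate the content at a given URL. The endpoint and parameter names
# are assumptions from memory; check CopyScape's current API docs.
import requests

API_ENDPOINT = "https://www.copyscape.com/api/"  # assumed endpoint

def check_url(username: str, api_key: str, url: str) -> str:
    """Ask CopyScape for sites that duplicate the content at `url`."""
    params = {
        "u": username,   # account username (assumption)
        "k": api_key,    # API key (assumption)
        "o": "csearch",  # operation: internet search by URL (assumption)
        "q": url,        # the page to check
    }
    response = requests.get(API_ENDPOINT, params=params, timeout=30)
    response.raise_for_status()
    return response.text  # results are returned as XML

# Example with placeholder credentials:
# print(check_url("myuser", "mykey", "https://www.marketing-intelligence.co.uk/"))
```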

We’ve used CopyScape periodically over the years, and miscreants have included a competitor site that copied multiple pages from our site. We asked the site owner to change his pages and were ignored; we then took stronger action, and within a couple of days the site was taken down. Another example involved an article published in a professional journal that took, almost verbatim, the content of our brief guide to competitive intelligence. We notified the publisher, who ensured that the payment made to the “author” was recovered and an apology published. The author said he thought that material published on the web was copyright-free. He was shown to be wrong.

Our most recent trawl for copyright theft from AWARE’s pages turned up further examples where wording we’ve used has been stolen. The following images should show how effective the tool is – while at the same time naming and shaming the companies that are too weak, lazy or incompetent to produce their own copy and so have to steal from others. (I’ve named them – but won’t give them the satisfaction of a link, as this could help their search engine optimisation efforts – if they have any!)

The first example shows how text that appears in the footer of most of our pages has been plagiarised.

This is the original text.

CopyScape found that several sites had copied this text almost verbatim – for example Green Oasis Associates, based in Nigeria:

or ICM Research from Italy and Pearlex from Virginia in the USA.

The ICM Research example is in fact the worst of these three, as their site has taken content from several other AWARE web-site pages.

The problem is that a company willing to steal content from other businesses is unethical – it breaks the rule against misrepresenting who you are. A firm that steals content from others may also take short-cuts in the services it provides, and as a result should not be trusted to deliver a competent service.

The page that is most often plagiarised is the Brief Guide to Competitive Intelligence page, mentioned above. Clicking on a link found by CopyScape highlights the copied portions, as seen in the following examples from AGResearch, Emisol and Wordsfinder.

Generally, sites do not copy whole pages (although this does happen) but integrate chunks of stolen text into their own pages – as seen in the AGResearch example below, where 12% of the page is copied, and the Wordsfinder example, where 13% has been copied.

The Emisol example below stole less, although it copied key parts of the guide page.
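The “percentage copied” figures in these reports can be approximated with a simple word-shingle overlap: split both pages into overlapping five-word sequences and measure what share of the suspect page’s sequences also occur in the original. A minimal sketch follows; the shingle length of five is an arbitrary choice, and CopyScape’s actual method is not public.

```python
# A minimal sketch estimating what percentage of a suspect page's text
# also appears in the original, using overlapping word 5-grams ("shingles").
# The shingle length is arbitrary; CopyScape's real method is not public.
def shingles(text: str, n: int = 5) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def percent_copied(original: str, suspect: str, n: int = 5) -> float:
    """Share of the suspect page's shingles that also occur in the original."""
    suspect_shingles = shingles(suspect, n)
    if not suspect_shingles:
        return 0.0
    overlap = suspect_shingles & shingles(original, n)
    return 100.0 * len(overlap) / len(suspect_shingles)

if __name__ == "__main__":
    # Made-up example: a page lifting about half its wording from an original.
    original = ("competitive intelligence is the process of monitoring the "
                "competitive environment to support decision making")
    suspect = ("our firm believes competitive intelligence is the process of "
               "monitoring the competitive environment and more")
    print(f"{percent_copied(original, suspect):.0f}% copied")
```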

Conclusion

Copyright theft is a compliment to the author of the original web-page, as it shows that the plagiarising site views its competitor as top quality. However, the purpose of writing good copy is to stand out and show one’s own capabilities. Sites that steal other sites’ work throw away this advantage, as they make the claims seem anodyne and commonplace. They devalue both the copier – who cannot come up with their own material (and so is unlikely to be able to provide a competent service anyway) – and the originator, as most people won’t be able to tell who came first. Fortunately, search engines can, and when they detect duplication they are likely to downplay the duplicated material, meaning that such sites are less likely to appear high up in search engine rankings. The danger is that both the originator of the material and the plagiariser may get penalised by search engines – which is another reason to ensure that copyright thieves are caught and stopped. CopyScape is one tool that really works in protecting authors from such plagiarism.

Google versus Bing – a competitive intelligence case study

February 2, 2011

Search experts regularly emphasise that, to get the best search results, it is important to use more than one search engine. The main reason is that each search engine uses a different relevancy ranking, leading to different results pages. Using Google gives a results page with the sites that Google thinks are most relevant for the query, while using Bing is ‘supposed’ to give a results page where the top hits are based on a different relevancy ranking. This alternative may give better results for some searches, and so a comprehensive search needs to use multiple search engines.

You may have noticed that I highlighted the word ‘supposed’ when mentioning Bing. This is because it appears that Bing has been cheating, using some of Google’s results in its own search lists. Plagiarising Google’s results may be Bing’s way of saying that Google is better. However, it leaves a bad taste, as it undermines one of the main reasons for using Microsoft’s search engine: that the results are different and are generated independently, using a different relevancy ranking.
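A simple way to test whether two engines’ rankings really are independent is to measure how much their top-ten results overlap for the same query, across many queries. The toy sketch below shows the idea; the result lists are placeholders standing in for URLs collected from the two results pages.

```python
# A toy sketch comparing two engines' top-10 results for one query.
# The URL lists are placeholders; in practice you would collect them
# from the two results pages for each query in a test set.
def overlap_at_10(results_a: list, results_b: list) -> float:
    """Fraction of top-10 URLs shared by the two result lists (order ignored)."""
    return len(set(results_a[:10]) & set(results_b[:10])) / 10

google_top10 = [f"https://example{i}.com/" for i in range(10)]   # placeholder
bing_top10 = [f"https://example{i}.com/" for i in range(5, 15)]  # placeholder

print(f"Top-10 overlap: {overlap_at_10(google_top10, bing_top10):.0%}")
```

Consistently high overlap on obscure or misspelled queries – where independent algorithms would be expected to diverge – is exactly the kind of signal that prompted Google’s investigation.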

Bing is Microsoft’s third attempt at a market-leading, Google-bashing search engine – replacing Live.com, which in turn had replaced MSN Search. Bing has been successful and is truly a good alternative to Google. It is the default search engine on Facebook (i.e. when doing a search on Facebook, you get Bing results) and also supplies results to other search utilities – most notably Yahoo! From a marketing perspective, however, it appears that the adage “differentiate or die” hasn’t been fully understood by Bing. Companies that fail to differentiate their product offerings from those of competitors are likely to fail.

The story that Bing was copying Google’s results dates back to summer 2010, when Google noticed an odd similarity between the two search engines’ results for a highly specialist search. This, in itself, wouldn’t be a problem: you’d expect similar results for very targeted search terms, with the main difference being the sort order. However, in this case the same top results were being generated even when spelling mistakes were used as the search term. Google started to look more closely – and found that this wasn’t just a one-off. Proving that Bing was stealing Google’s results, however, needed more than observation. To test the hypothesis, Google set up 100 dummy and nonsense queries, each leading to a web-site that had no relationship at all to the query. They then gave their testers laptops with a fresh Windows install – running Microsoft’s Internet Explorer 8 with the Bing Toolbar installed. The install process enabled Internet Explorer’s “Suggested Sites” feature and the toolbar’s default options.
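The essence of the trap was that the query strings could not match anything by accident – terms like hiybbprqag matched nothing on the web at the time – so any engine returning the planted page must have obtained it from Google. Generating such markers is trivial, as the toy illustration below shows; the planted URLs are invented for the sketch.

```python
# A toy illustration of generating honeypot query strings like Google's
# "hiybbprqag": random letter sequences that match nothing on the real
# web, each mapped to a deliberately unrelated "planted" result URL
# (the URLs below are invented for this sketch).
import random
import string

def honeypot_queries(count: int = 100, length: int = 10) -> dict:
    """Map nonsense query strings to unrelated planted result URLs."""
    rng = random.Random(42)  # fixed seed so the sketch is reproducible
    queries = {}
    for i in range(count):
        term = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        queries[term] = f"https://unrelated-planted-result-{i}.example/"
    return queries

for term, url in list(honeypot_queries().items())[:3]:
    print(term, "->", url)
```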

Within a few weeks, Bing started returning the fake results for the same searches. For example, a search for hiybbprqag gave the seating plan for a Los Angeles theatre, while delhipublicschool40 chdjob returned an Ohio credit union as the top result. This proved that the results were not coming from Bing’s own search algorithm but had been taken from Google.

What was happening was that the searches and search results on Google were being passed back to Microsoft – via some feature of Internet Explorer 8, Windows or the Bing Toolbar.

As Google states in its blog article on the discovery (which is illustrated with screenshots of the findings):

At Google we strongly believe in innovation and are proud of our search quality. We’ve invested thousands of person-years into developing our search algorithms because we want our users to get the right answer every time they search, and that’s not easy. We look forward to competing with genuinely new search algorithms out there—algorithms built on core innovation, and not on recycled search results from a competitor. So to all the users out there looking for the most authentic, relevant search results, we encourage you to come directly to Google. And to those who have asked what we want out of all this, the answer is simple: we’d like for this practice to stop.

Interestingly, Bing didn’t initially even try to deny the claim – perhaps because they realised they had been caught red-handed. Instead they tried to justify using the data from customers’ computers as a way of improving the search experience – even when the searching was being done via a competitor. In fact, Harry Shum, a Bing VP, believes that this is actually good practice, stating in Bing’s response to the blog post by Danny Sullivan that exposed the practice:

“We have been very clear. We use the customer data to help improve the search experience…. We all learn from our collective customers, and we all should.”

It is well known that companies collect data on how customers use their own web-sites – that is one purpose of the cookies generated when visiting a site. It is less well known that some companies also collect data on what users do on other sites (which is why Yauba boasts about its privacy credentials). I’m sure that the majority of users of the Bing Toolbar, and of the other Internet Explorer and Windows features that seem to pass data back to Microsoft, would be less happy if they knew how much data was collected and where from. Microsoft has been collecting such data for several years, but ethically the practice is highly questionable, even though Microsoft users may have originally agreed to the company collecting data to “help improve the online experience”.

What the story also shows is how much care and pride Google takes in its results – and how it runs an effective competitive intelligence (and counter-intelligence) programme, actively comparing its results with competitors’. Microsoft even recognised this by falsely accusing Google of spying via the sting operation that exposed Microsoft’s practices – with Shum commenting (my italics):

What we saw in today’s story was a spy-novelesque stunt to generate extreme outliers in tail query ranking. It was a creative tactic by a competitor, and we’ll take it as a back-handed compliment. But it doesn’t accurately portray how we use opt-in customer data as one of many inputs to help improve our user experience.

To me, this sounds like sour grapes. How can copying a competitor’s results improve the user experience? If it doesn’t accurately portray how customer data IS used, maybe now would be the time for Microsoft to reassure customers regarding their data privacy. And rather than viewing Google’s exposure of Bing’s practices as a back-handed compliment, I’d see it as a slap in the face with the front of the hand. However, what else could Microsoft and Bing say, other than mea culpa?

Update – Wednesday 2 February 2011:

The war of words between Google and Bing continues. Bing has now denied copying Google’s results and has, moreover, accused Google of click fraud:

Google engaged in a “honeypot” attack to trick Bing. In simple terms, Google’s “experiment” was rigged to manipulate Bing search results through a type of attack also known as “click fraud.” That’s right, the same type of attack employed by spammers on the web to trick consumers and produce bogus search results.  What does all this cloak and dagger click fraud prove? Nothing anyone in the industry doesn’t already know. As we have said before and again in this post, we use click stream optionally provided by consumers in an anonymous fashion as one of 1,000 signals to try and determine whether a site might make sense to be in our index.

Bing seems to have ignored the fact that Google’s experiment resulted from its observation that certain genuine searches – including misspellings, and queries where mistakes in Google’s own algorithm produced odd results – appeared to be copied by Bing. The accusation of click fraud is bizarre, as the searches Google used were completely artificial. No normal searcher would have made such searches, so the fact that the results bore no resemblance to the actual search terms is completely different from the spam practice of making a dummy site appear for certain genuine searches.

Bing can accuse Google of cloak-and-dagger behaviour. However, counter-intelligence sometimes requires such behaviour to catch miscreants red-handed. It’s a practice carried out by law enforcement globally where a crime is suspected but there is insufficient evidence to catch the culprit. As an Internet example, one technique used to catch paedophiles is for a police officer to pretend to be a vulnerable child in an Internet chat-room. Is this fraud – when the paedophile subsequently arranges to meet up and is caught? In some senses it is. However, saying such practices are wrong gives carte blanche to criminals to continue their illegal activities. Bing appears to be putting itself in the same camp by claiming that “honeypot” attacks are wrong.

Bing also has not recognised the points I’ve stressed about the ethical use of data. There is a big difference between using anonymous data to track user behaviour on your own search engine and tracking behaviour on a competitor’s. Using your competitor’s data to improve your own product, when the intelligence was gained by technology that effectively taps into what your competitor’s customers are doing, is espionage. The company guilty of spying is Bing – not Google. Google just used competitive intelligence to identify the problem, and a creative approach to counter-intelligence to prove it.