Archive
How to spot web-site plagiarism and copyright theft with CopyScape
AWARE has had a web-site since 1995 and our current domain (www.marketing-intelligence.co.uk) has been active since 1997. When we started there were less than 100,000 companies on the web. Google’s founders had not yet met each other, and even venerable search engines such as AltaVista had not yet started.
Over the years, we’ve made an effort to ensure that our web content was not copied and used on other sites without our permission.
Although doing manual checks by searching for key phrases is one way of checking for plagiarism and copyright theft, there are a number of dedicated plagiarism checking sites. One example is Plagium. Plagium’s drawback, shared with several similar services, is that you have to paste in the text you want to test rather than just enter the URL. Such services are generally aimed at helping teachers and college professors detect student cheating.
Although some services (such as Plagium) are free, most are not and may involve downloading dedicated software. Others only check a limited number of known “essay” type sites where students can download essays written by others. (We’ve found some of our content on such sites – evidently students who use them don’t care where they steal their content from. Once used successfully they then try to reuse it by uploading their A+ essay to the site).
CopyScape
Of all plagiarism detection websites probably the easiest and best is CopyScape. Copyscape’s aim is not only to help academics detect student cheating. It also allows webmasters to search for copied content in general. It doesn’t require users to paste in the suspect text. Instead web-site owners simply need to enter their URLs and get a report on other sites that use similar or identical wording. It’s sufficiently powerful that there is even a flippant web-page on ContentBoss’s website giving advice on how to bypass CopyScape and copy with impunity. (ContentBoss promises to provide unique content at a low monthly fee. Their bypass CopyScape tool uses a technique that will convert content into HTML guaranteed not to be picked up by plagiarism detectors. The catch is, as pointed out by ContentBoss, that using such content is also a guarantee that the site will be banned by search engines for spam content).
We’ve used CopyScape periodically over the years and miscreants included a competitor site that copied multiple pages from our site. We asked the site owner to change his pages and were ignored. We then took stronger action and within a couple of days the site was taken down. Another example involved an article published in a professional journal that took, almost verbatim, the content of our brief guide to competitive intelligence. We notified the publisher who ensured that the payment made to the “author” was recovered, and an apology published. The author said that he thought that material published on the web was copyright free. He was shown to be wrong.
Our most recent trawl for examples of copyright theft from AWARE’s pages turned up further examples where wording we’ve used has been stolen. The following images should show how effective the tool is – while at the same time naming and shaming the companies that are too weak, lazy or incompetent to produce their own copy and have to steal from others. (I’ve named them – but won’t give them the satisfaction of a link as this could help their search engine optimisation efforts – if they have any!)
The first example shows how text that appears on the footer of most of our pages is plagiarised.
This is the orignal text.

CopyScape found several sites had copied this text almost verbatim – for example Green Oasis Associates based in Nigeria:

or ICM Research from Italy and Pearlex from Virginia in the USA.

The ICM Research example is in fact the worst of these three, as their site has taken content from several other AWARE web-site pages.
The problem is that a company that is willing to steal content from other businesses is unethical – breaking the rule against misrepresenting who you are. If they are willing to steal content from others, they may also take short-cuts in the services provided and as a result should not be trusted to provide a competent service.
The page that is most often plagiarised is the Brief Guide to Competitive Intelligence Page, mentioned above. Clicking on a link found by CopyScape highlights the copied portions as seen in the following examples from AGResearch, Emisol and Wordsfinder.
Generally sites do not copy whole pages (although this does happen) but integrate chunks of stolen text into their pages – as seen in the AGResearch example, below – where 12% of the page is copied, and Wordsfinder where 13% has been copied.


The Emisol example below stole less – although copied key parts of the guide page:

Conclusion
Copyright theft is a compliment to the author of the original web-page, as it shows that the plagiarizing site views their competitor as top quality. However the purpose of writing good copy is to stand out and show one’s own capabilities. Sites that steal other site’s work remove this advantage as they make the claims seem anodyne and commonplace. They devalue both the copier – who cannot come up with their own material (and so are unlikely to be able to provide a competent service anyway) and the originator, as most people won’t be able to tell who came first. Fortunately search engines can, and when they detect duplication, they are likely to downplay the duplicated material meaning that such sites are less likely to appear high-up in search engine rankings. The danger is that both the originator of the material and the plagiarizer may get penalised by search engines – which is another reason to ensure that copyright thieves are caught and stopped. CopyScape is one tool that really works in protecting authors from such plagiarism.
Analysing weak signals for competitive & marketing intelligence
I’ve just read an interesting blog post by Philippe Silberzahn and Milo Jones. The post “Competitive intelligence and strategic surprises: Why monitoring weak signals is not the right approach” looked at the problems of weak signals in competitive intelligence and how even though an organisation may have lots of intelligence, they still get surprised.
Silberzahn and Jones point out that it’s not usually the intelligence that is the problem, but the interpretation of the gathered intelligence. This echoed a statement by Issur Harel, the former head of Mossad responsible for capturing the Nazi war criminal Eichmann. Harel was quoted as saying “We do not deal with certainties. The world of intelligence is the world of probabilities. Getting the information is not usually the most difficult task. What is difficult is putting upon it the right interpretation. Analysis is everything.”
In their post, Silberzahn and Jones argue that more important than monitoring for weak signals, is the need to monitor one’s own assumptions and hypotheses about what is happening in the environment. They give several examples where weak signals were available but still resulted in intelligence failures. Three different types of failure are mentioned:
- Too much information: the problem faced by the US who had lots of information prior to the Pearl Harbour attack of 7 December 1941,
- Disinformation, as put out by Osama bin Laden to keep people in a high-state of alert – by dropping clues that “something was about to happen“, when nothing was (and of course keeping silent when it was),
- “Warning fatigue” (the crying wolf syndrome) where constant repetition of weak signals leads to reinterpretation and discounting of threats, as happened prior the Yom Kippur war.
Their conclusion is that with too much data, you can’t sort the wheat from the chaff, and with too little you make analytical errors. Their solution is that rather than collect data and subsequently analyse it to uncover its meaning you should first come up with hypotheses and use that to drive data collection. They quote Peter Drucker (Management: Tasks, Responsibilities, Practices, 1973) who wrote: “Executives who make effective decisions know that one does not start with facts. One starts with opinions… To get the facts first is impossible. There are no facts unless one has a criterion of relevance.” and emphasise that “it is hypotheses that must drive data collection”.
Essentially this is part of the philosophy behind the “Key Intelligence Topic” or KIT process – as articulated by Jan Herring and viewed as a key CI technique by many Competitive Intelligence Professionals.
I believe that KITs are an important part of CI, and it is important to come up with hypotheses on what is happening in the competitive environment, and then test these hypotheses through data collection. However this should not detract from general competitive monitoring, including the collection of weak signals.
The problem is how to interpret and analyse weak signals. Ignoring them or even downplaying them is NOT the solution in my view – and is in fact highly dangerous. Companies with effective intelligence do not get beaten or lose out through known problems but from unknown ones. It’s the unknown that catches the company by surprise, and often it is the weak signals that, in hindsight, give clues to the unknown. In hindsight, their interpretation is obvious. However at the time, the interpretation is often missed, misunderstood, or ignored as unimportant.
There is an approach to analysing weak signals that can help sort the wheat from the chaff. When you have a collection of weak signals don’t treat them all the same. Categorise them.
- Are they about a known target’s capabilities? Put these in box 1.
- Are they relating to a target’s strategy? These go into box 2.
- Do they give clues to a target’s goals or drivers? Place these in box 3.
- Can the weak signal be linked to assumptions about the environment held by the target? These go into box 4.
Anything else goes into box 5. Box 5 holds the real unknowns – unknown target or topic or subject. You have a signal but don’t know what to link it to.
First look at boxes 1-4 and compare each bit of intelligence to other information.
- Does it fit in? If so good. You’ve added to the picture.
- If it doesn’t, why not?
Consider the source of the information you have. What’s the chronology? Does the new information suggest a change? If so, what could have caused that change? For this, compare the other 3 boxes to see if there’s any information that backs up the new signal – using the competitor analysis approach sometimes known as 4-corners analysis, to see if other information would help create a picture or hypothesis of what is happening.
If you find nothing, go back and look at the source.
- Is it old information masquerading as new? If so, you can probably discount it.
- Is it a complete anomaly – not fitting in with anything else at all? Think why the information became available. Essentially this sort of information is similar to what goes into box 5.
- Could it be disinformation? If so, what is likely to be the truth? Knowing it may be disinformation may lead to what is being hidden?
- Or is it misinformation – which can probably be discounted?
- What about if you can’t tell? Then it suggests another task – to try and identify other intelligence that would provide further detail and help you evaluate the anomaly. Such weak signals then become leads for future intelligence gathering.
With box 5 – try and work out why it is box 5. (It may be that you have information but no target to pin it to, for example – so can’t do the above). As with anomalies, think why the information became available. You may need to come up with a number of hypotheses to explain meaning behind the information. These can sometimes (but not always) be tested.
Silberzahn and Jones mention a problem from Nassim Taleb’s brilliant book “The Black Swan: The Impact of the Highly Improbable“. The problem is how do you stop being like a turkey before Thanksgiving. Prior to Thanksgiving the turkey is regularly fed and given lots and lots of food. Life seems good, until the fateful day, just before Thanksgiving, when the food stops and the slaughterer enters to prepare the turkey for the Thanksgiving meal. For the turkey this is a complete surprise as all the evidence prior to this suggests that everything is going well. Taleb poses the question as to whether a turkey can learn from the events of yesterday what is about to happen tomorrow. Can an unknown future be predicted – and in this case, the answer seems to be no.
For an organisation, this is a major problem as if they are like turkeys, then weak signals become irrelevant. The unknown can destroy them however much information they hold prior to the unforeseen event. As Harel said, the problem is not information but analysis. The wrong analysis means death!
This is where a hypothesis approach comes in – and why hypotheses are needed for competitive intelligence gathering. In the Thanksgiving case, the turkey has lots of consistent information coming in saying “humans provide food”. The key is to look at the source of the information and try to understand it. In other words:
Information: Humans provide food.
Source: observation that humans give food every day – obtained from multiple reliable sources.
You now need to question the reason or look at the objectives behind this observation. Why was this observation available? Come up with hypotheses that can be used to test the observations and see what matches. Then choose a strategy based on an assessment of risk. In the case of the turkey there are two potential hypotheses:
- “humans like me and so feed me” (i.e. humans are nice)
- “humans feed me for some other reason” (i.e. humans may not be nice).
Until other information comes in to justify hypothesis 1, hypothesis 2 is the safer one to adopt as even if hypothesis 1 is true, you won’t get hurt by adopting a strategy predicated on hypothesis 2. (You may not eat so much and be called skinny by all the other turkeys near you. However you are less likely to be killed).
This approach can be taken with anomalous information in general, and used to handle weak signals. The problem then becomes not the analysis of information but the quantity. Too much information and you start to drown and can’t categorise it – it’s not a computer job, but a human job. In this case one approach is to do the above with a random sample of information – depending on your confidence needs and the quantity of information. This gets into concepts of sampling theory – which is another topic.





