How a search engine finds your personal data
Mikey
This evening I walked in while a story was being broadcast, and I put all better judgement aside to actually sit down and watch, only because I heard the word 'telemarketing' spoken, and that subject happens to be a pet hate of mine. Before I continue, let it be known I wouldn't watch that show on purpose otherwise.
The story centred around shady telemarketing practices, or in this case sloppy practices. Briefly, some people were contacted by a telemarketing company and offered phone deals. At the beginning of the conversation they were advised that the phone call would be recorded for training purposes, and assured that anything they said would be private. A reasonable assumption.
But what followed was surprising indeed. One customer had Googled his name and was horrified to learn that all the personal details he had given to the marketer, and even a recording of the conversation itself, were available for the world to see.
The reporter had the opportunity to explain how this could have happened, but instead decided to play the 'they put all their details onto Google' sensationalist card. To the layman that might sound feasible, but the reality is no-one 'put the details onto Google', and instead this incident is the result of the sloppy practices I mentioned earlier.
This now gives me an excuse to educate anyone who has wondered how Google (and less relevant search engines) index information. I will also demonstrate how the company in question were ignorant of the facts I am about to explain.
With this information you can protect against your personal details turning up among search results. For the purpose of this article I will use Google as the example.
Google indexes data in one of two ways. The first, and the one most people would know (at least vaguely, in some way or another), is to give the data to Google: you have a web site and you want people to find it, so you submit its address to Google yourself.
But the web site submission form is only really necessary for a couple of reasons. 1: you can't find your web site using a Google search and you want to be sure it has been indexed, and 2: you just launched your new web site and you want it to be indexed immediately and not have to wait for the search engine spiders to stop by. Search engine spiders? Read on.
A spider (also known as a web crawler or web robot) is a program written by a search engine company which simply cruises around the Internet looking for web sites and indexes everything it finds on them. Spiders will index content and also create a site map based on the links contained in your web site. So when someone performs a Google search, the query is compared against the search engine index, and relevant results are returned.
This is the crucial part to remember: unless a link to a page exists on your web site, or any other web site for that matter, it cannot possibly be returned among search engine results, because the spider doesn't know it exists.
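To make that concrete, here is a minimal sketch of what a spider does, using a handful of hypothetical in-memory pages in place of real HTTP fetches (the page paths and content are made up for illustration):

```python
# A toy spider: start at the homepage, index each page found, and follow
# every hyperlink on it. Pages with no inbound link are never discovered.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page, as a spider would."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical site contents, keyed by path.
SITE = {
    "/": '<a href="/reviews/polyview-v396">Polyview v396 review</a>',
    "/reviews/polyview-v396": "Review text, with no further links.",
    "/secret/customer-data": "Linked from nowhere, so never found.",
}

def crawl(start):
    """Breadth-first crawl: index a page, then queue every link on it."""
    index, queue = set(), [start]
    while queue:
        page = queue.pop(0)
        if page in index or page not in SITE:
            continue
        index.add(page)
        parser = LinkExtractor()
        parser.feed(SITE[page])
        queue.extend(parser.links)
    return index

print(sorted(crawl("/")))  # → ['/', '/reviews/polyview-v396']
```

Note that the unlinked page never appears in the index, no matter how long the spider runs: it simply has no path leading to it.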
Let's take this site as an example. I wrote a review of the Polyview v396 monitor. At some stage I had a link on our homepage to this article. A search engine spider (which returns to this site at regular intervals) found the link and indexed the content of the article. Do a search for 'Polyview v396' on pretty much any search engine and this site will be first in the results.
This leads me back to the telemarketing company I mentioned at the start. In all likelihood the company wasn't submitting individual pages of customer details to Google; rather, it was clueless about how web applications get exposed.
Telemarketers, like a lot of companies, use a web-based application for recording data. It is convenient because you don't have to be at the office to work: all you need is an internet connection and a browser, and you can work anywhere in the world at any time.
In order for the telemarketing company to have shared the personal information of all those people, they would have had to do something really stupid, like putting a link to their unprotected web application on the homepage of their web site. That is all a search engine spider needs to get in and start indexing everything it finds. It's as simple as that. If they were smart about it, they could easily have prevented the spider from indexing by adding a small file to their web directory (robots.txt, a subject for another article) which instructs crawlers to visit only certain directories and/or to ignore others.
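For the curious, a robots.txt that asks crawlers to stay away from an application might look something like this (the directory names here are hypothetical):

```
# Served from the site root, e.g. example.com/robots.txt
User-agent: *          # applies to every crawler
Disallow: /crm/        # keep spiders out of the web application
Disallow: /recordings/ # and away from the call recordings
```

Bear in mind robots.txt is a polite request, not access control: a crawler (or a person) that ignores it can still fetch anything that's linked, so the real fix is to put the application behind a login.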
So let it also serve as a lesson to any amateur web site builders. Sure, that site you built with FrontPage is handy for showing photos of your dirty weekend to your buddies on the other side of the country, but if your site has a hyperlink to those photos, so does the rest of the world.
Read on if you want a frightening real world example.
I have even seen this happen on a professional company web site, whose name I won't mention. I often use a program called wget. Wget acts very much like a search engine spider in that you give it a web site address and it looks for all the links and downloads them all: web pages, images, and files.
On the particular web site I decided to leech back in 2000, I was very surprised to see it weigh in at around 500 MB when it finished downloading. Upon closer inspection I realised I had downloaded original company branding media, internal company documents, and confidential pay advice information. I was able to trace it all back to a link on one of the pages, one that someone evidently thought wouldn't be found because it was a white text link against a white background. It led to a page with several links to all the aforementioned content, obviously meant for internal eyes only. I was able to bring up the same information from an Internet search.
For the record, I did advise them and deleted the files I had. The entire site was pulled down within minutes and presumably somebody got fired.