Place of design

SEO – Working with search engine robots

Robots.txt

A robots.txt file is a plain text file that tells search engine spiders
which parts of your website not to index. Search engines “spider” your
website, and well-behaved ones follow the instructions left in this file.

Scenario – you have a website, but you do not want the search engine to
list your customers’ directories (wedding images, for example).

Scenario – you have a series of pictures that make up one picture as
a whole. You don’t want the individual pictures being indexed, as they
will look silly on Google image search.

This is achievable using robots META tags (on every page), or just by editing one file, called robots.txt.

You might also have a page that has a whopping great big list on it
(a genuine one) that might be considered SPAM by a search engine. The
robots.txt file is a good way to prevent this page from getting
spidered by a search engine. In this way you can keep the list, but
not get blacklisted.

You can only have one robots.txt file per domain. This means you can only use it if you own the domain, e.g. you can’t use it for www.mywebpage.aol.net/robots.txt, but you can use it for www.dizzyblonde.co.uk/robots.txt. The file needs to be placed in the root HTML folder on your server (usually public_html).
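Because the file always sits at the root of the domain, its address can be worked out from any page on the site. Here is a minimal Python sketch of that idea. The helper name robots_url is made up for illustration, and the page address is only a sample:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Return the robots.txt address for the domain that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, swap the path for /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

# Sample page address (hypothetical path on the domain used above).
print(robots_url("http://www.dizzyblonde.co.uk/usergallery/gallery1/index.html"))
# prints http://www.dizzyblonde.co.uk/robots.txt
```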

The actual file is called robots.txt, and is a simple text file (you can make it with Notepad).

To stop all search engine spiders from indexing your website, insert this into the robots.txt file:

User-agent: *
Disallow: /

The * in the user agent line means “all” search engines

The / in the disallow line means “my whole site”
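If you want to check a rule set before uploading it, Python ships a robots.txt parser in its standard library (urllib.robotparser). The sketch below is only a rough check, and it assumes a made-up spider name (AnyBot) and a placeholder domain (www.example.com). The two rule lines are kept together with no blank line between them, because this parser treats a blank line as the end of a record.

```python
from urllib.robotparser import RobotFileParser

# The "block everything" file, one list item per line of robots.txt.
BLOCK_ALL = [
    "User-agent: *",
    "Disallow: /",
]

def is_allowed(rules, agent, url):
    """Return True if the spider `agent` may fetch `url` under `rules`."""
    parser = RobotFileParser()
    parser.parse(rules)
    return parser.can_fetch(agent, url)

# "AnyBot" and www.example.com are placeholders, not real names.
print(is_allowed(BLOCK_ALL, "AnyBot", "http://www.example.com/"))
print(is_allowed(BLOCK_ALL, "AnyBot", "http://www.example.com/usergallery/pic1.jpg"))
# both lines print False
```

This only tells you how Python's parser reads the file. Individual search engines may interpret edge cases differently.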

To stop all spiders from indexing specific directories in your site, include:

User-agent: *
Disallow: /webstats1/
Disallow: /usergallery/gallery1


The * in the user agent line means “all” search engines

Disallow: /webstats1/ – specifies the directory “www.yourdomain/webstats1”

Disallow: /usergallery/gallery1 – specifies that the folders at www.yourdomain/usergallery/gallery1 will not be spidered, but the rest of www.yourdomain/usergallery still will be
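One thing worth seeing in action is that a Disallow line matches by prefix, so leaving off the trailing slash blocks more than you might expect. A small sketch using Python's standard urllib.robotparser, with a placeholder domain (www.example.com), a made-up spider name (AnyBot) and invented file names:

```python
from urllib.robotparser import RobotFileParser

# The directory rules, one list item per line of robots.txt.
DIRECTORY_RULES = [
    "User-agent: *",
    "Disallow: /webstats1/",
    "Disallow: /usergallery/gallery1",
]

parser = RobotFileParser()
parser.parse(DIRECTORY_RULES)

def allowed(path):
    """Return True if a spider may fetch `path` on the placeholder domain."""
    return parser.can_fetch("AnyBot", "http://www.example.com" + path)

print(allowed("/webstats1/report.html"))         # False: inside /webstats1/
print(allowed("/usergallery/gallery1/pic.jpg"))  # False: inside gallery1
print(allowed("/usergallery/gallery2/pic.jpg"))  # True: a different gallery
print(allowed("/usergallery/gallery10/pic.jpg")) # False: also starts with ".../gallery1"
```

If you only want to block gallery1 and not gallery10, writing the rule as /usergallery/gallery1/ (with the trailing slash) is the safer choice.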

Or to stop a specific file being spidered:

User-agent: *
Disallow: /usergallery/pic1.jpg
Disallow: /private/my_document.htm

If you want to target one specific search engine:

User-agent: googlebot
Disallow: /
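To confirm that a search-engine-specific rule only affects the spider it names, you can run it through Python's standard urllib.robotparser. This is a rough sketch: OtherBot is a made-up spider name and www.example.com is a placeholder domain.

```python
from urllib.robotparser import RobotFileParser

# Block only Google's spider; no rule is given for anyone else.
GOOGLEBOT_ONLY = [
    "User-agent: googlebot",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(GOOGLEBOT_ONLY)

page = "http://www.example.com/index.html"
googlebot_allowed = parser.can_fetch("Googlebot", page)
otherbot_allowed = parser.can_fetch("OtherBot", page)

print(googlebot_allowed)  # False: the rule names this spider
print(otherbot_allowed)   # True: no rule applies, so access is allowed
```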

Here is a typical robots.txt file:

User-agent: *
Disallow: /cgi-bin
Disallow: /cgi-perl
Disallow: /cgi-store
Disallow: /images
Disallow: /includes/
Disallow: /print/
Disallow: /606/
Disallow: /messageboards/
Disallow: /apps/

What it does:

This robots.txt file basically stops search engines indexing all of
the “code bits” and “dynamic bits” of a site, whilst also stopping the
images in the images folder from being listed on their own.

This one is more useful for most sites with a normal file structure:

User-agent: *
Disallow: /cgi-bin
Disallow: /images
Disallow: /includes/
Disallow: /user_galleries/
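Before putting a file like this live, it can help to run a handful of your own page addresses through it and see which ones get blocked. The sketch below uses Python's standard urllib.robotparser. The helper name blocked_paths, the spider name AnyBot and the sample paths are all invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# The simpler rule set, one list item per line of robots.txt.
SITE_RULES = [
    "User-agent: *",
    "Disallow: /cgi-bin",
    "Disallow: /images",
    "Disallow: /includes/",
    "Disallow: /user_galleries/",
]

def blocked_paths(rules, paths, agent="AnyBot"):
    """Return the paths from `paths` that `agent` may not fetch."""
    parser = RobotFileParser()
    parser.parse(rules)
    return [p for p in paths if not parser.can_fetch(agent, p)]

# Hypothetical paths; swap in pages from your own site.
sample = [
    "/index.html",
    "/images/logo.gif",
    "/images.html",
    "/includes.html",
    "/user_galleries/smith/1.jpg",
]
print(blocked_paths(SITE_RULES, sample))
# prints ['/images/logo.gif', '/images.html', '/user_galleries/smith/1.jpg']
```

Note that /images.html is blocked because the /images rule has no trailing slash, while /includes.html gets through because that rule ends in a slash.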

In summary

The robots.txt file is a text file placed in the root folder of your
web space, and is used to prevent certain folders or files being
indexed by a search engine (or by all search engines). Usually this is
used to prevent search engines trawling through content that is
partial, or unsuitable for direct linking from a search engine. It is
also useful if you are subject to a high volume of spidering from a
particular problem search engine (which can slow your site down).

