This article was originally published on the Spectrum blog.
They go by several titles. Generically, they’re referred to as spiders, crawlers, or bots – specifically, Google calls them Googlebots. These are programs that scan, or crawl, the Internet searching for new or updated pages to add to the Google index. In fact, every major search engine has its own equivalent of Googlebot – essentially large fleets of computers that browse billions of webpages every day. They use an algorithmic process to determine which sites to scan, how often to scan them and how many pages to index from each.
Of course, website owners can restrict Googlebots from accessing virtually any page on their site – the benefit being that information you don’t want disclosed about yourself or your business won’t appear in major search engines. The file used to do so is called robots.txt. Keep in mind that robots.txt is a request rather than an enforcement mechanism: reputable crawlers honor it, but it doesn’t make a page private.
How to Check What Web Pages Google Sees:
Googlebots automatically check whether they have access to a page before they crawl it. Robots.txt files serve as digital gatekeepers, if you will – communicating which pages, if any, a site has blocked access to. To see which pages your website is currently blocking, simply enter “/robots.txt” after your URL. For example, for www.Spectruminc.com, one would type www.Spectruminc.com/robots.txt into their browser.
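If you want to test this programmatically, Python’s standard library ships a robots.txt parser that performs the same access check a well-behaved crawler does. This is a minimal sketch; the rules and the example.com URLs below are hypothetical, not taken from any real site:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# can_fetch(user_agent, url) answers: may this bot crawl this URL?
print(rp.can_fetch("Googlebot", "https://example.com/"))            # allowed
print(rp.can_fetch("Googlebot", "https://example.com/private/x"))   # blocked
```

In practice, `RobotFileParser` can also fetch a live file directly with `set_url(...)` and `read()`, which is how a crawler would consume it.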
If you’re allowing Google and other search engines to index all of your website’s pages, it’ll bring you to a white page with two lines of text in the upper left-hand corner. It looks like this:
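Reproduced as text (based on the two directives described below), those two lines are:

```
User-agent: *
Disallow:
```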
The above display indicates that every page of your site is fully disclosed to Google and Googlebots are regularly scanning it for new information to index.
“User-agent” specifies which bots the rules apply to – be they from Google, Bing, Yahoo! or elsewhere. The asterisk signifies that all bots are permitted to view the website.
“Disallow” lists the pages that are restricted. If such pages exist on your site, the name of each page would appear between two forward slashes, like this:
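For example, a site blocking a single section might serve a robots.txt like the following – the page name here is hypothetical:

```
User-agent: *
Disallow: /example-page/
```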