Site analysis configuration

Site analysis

4SEO works by crawling your pages just like a search engine does. We call it Site analysis and it's how we gather information about page content, allowing us to, for instance, add metadata and structured data to your pages, or detect errors.

This analysis can be done in 2 ways, which are not mutually exclusive:

  • From the Pages page, using the Analyze now toolbar button
  • Automatically, in the background, as visitors view pages of your site

Manual analysis

You will use manual analysis when:

  • your site is not public yet and so there are no visitors to trigger background analysis
  • you are developing the site on your own computer, not online
  • you want to speed up the process to test things out

A manual analysis will run until you either stop it or all pages on the site have been analyzed.

Automatic background analysis

This is the usual operation mode for live, public sites with visitors. In this mode, 4SEO will not only analyze your site until all pages have been processed, but also constantly monitor content for changes.

When a change is detected in a page, or a new page is seen, it will be put on the list of pages to crawl and analyzed later, in the background.

Info

Background analysis does use some server resources, but it will never affect the speed at which a page is shown to your visitors or to search engines. The pages viewed by visitors only serve as a trigger to start a background process that works on pages pending analysis.

Settings

Enable/Disable

If disabled, both automatic background analysis and manual analysis will stop until you set it back on.

Reset analysis

4SEO maintains a list of all the pages it found on your site, with data it gathered about them. Likewise, it also records errors and broken links.

By using the Reset analysis now button, you can clear all this data. This may be required, for instance, after changing a lot of content or the site architecture, adding or removing extensions, or fixing errors.

After being reset, analysis will restart automatically, provided it's enabled (see above setting).

Exclude/Include pages

Not all pages on your site need to be fully analyzed for SEO purposes, so you can tell 4SEO to skip some of them using the following settings.

This both saves server resources and makes the Pages page easier to read, with only important pages listed.

Apply robots.txt exclusions

If enabled, pages that are excluded in your robots.txt file will not be analyzed. As pages excluded through robots.txt will not be used by search engines, it makes sense to not waste time and energy on them.
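
For illustration, a robots.txt rule like the following would cause the matching pages to be skipped when this option is enabled (the /staging/ path is just an example):

User-agent: *
Disallow: /staging/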

You may need to temporarily disable this option on rare occasions. For instance, when developing your site online, it is common to entirely disable search engine crawling until the site is made public. This would prevent 4SEO from analyzing any page, and in that case you'll need to disable the application of robots.txt rules.

When developing your site online, it is usually more effective to set a password to completely block access than to use a robots.txt exclusion. Some robots may not respect robots.txt, while with a password you are guaranteed that no one can access the site.

Apply meta noindex exclusions

Just like for pages excluded through your robots.txt file, 4SEO will by default exclude pages that have a noindex meta tag from being analyzed and stored as a page.
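
For reference, the meta tag in question is the standard robots meta tag placed in a page's head section, for example:

<meta name="robots" content="noindex">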

Just as with search engines, excluding a page with a noindex meta tag does not prevent 4SEO from looking for links inside it. These links will in turn be crawled and possibly stored as regular pages. Such links are excluded using the nofollow meta tag (see next paragraph). Also, we do not recommend using a global noindex,nofollow option during development to prevent search engines from indexing the site's unfinished content. Bad actors will likely still scrape that content, which may end up in Google anyway. Password protection is the recommended approach.

Apply meta nofollow exclusions

By default, 4SEO does not consider links found on pages that have a nofollow meta tag for collection and analysis. By changing this option to No, you can force it to bypass the meta tag. This can be useful if, for instance, you temporarily enabled nofollow globally on your site during development.

As mentioned in the previous paragraph, globally using nofollow to protect your site from search engines during development is not recommended as it only blocks "nice" bots. Using a password protection is a better option to effectively block search engines and bad actors scraping your content.

If enabled, links found in your pages with a rel attribute containing nofollow will not be analyzed.
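
For reference, these are the standard HTML forms of both exclusions, the page-level meta tag and a link-level rel attribute (the URL is just an example):

<meta name="robots" content="nofollow">
<a href="/members-area" rel="nofollow">Members area</a>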

Using nofollow to tell search engines not to crawl some links found in your pages is common practice to preserve crawl budget (although this is often better done with robots.txt exclusions). It is also useful when linking to low-quality pages or non-HTML documents such as PDFs, if you don't want them indexed.

As this usually signals your intent to keep search engines from crawling those pages, 4SEO will do the same and not analyze them at all. You can disable this feature if you nofollow links for a different purpose and still want pages linked that way to be analyzed and listed on the Pages page.

If you have multiple links on your site to the same page, but only some of them have a nofollow attribute, 4SEO will still see and analyze that page through the links that do not have a nofollow attribute.

Excluded pages

As the name suggests, any page request whose URL is listed here will not be stored or analyzed.

As usual, you can use wildcard characters ({*} and {?}) to exclude not just a specific URL but a group of them.

For instance:

/forum/{*}

will disregard any page from your forum, assuming all of your forum's page addresses start with /forum (this may vary depending on your site setup and forum extension).
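
Assuming {?} matches exactly one character, as is common for this kind of wildcard, a pattern such as:

/archive-200{?}

would match /archive-2001 or /archive-2005, but not /archive-2010.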

Included pages

If you excluded some pages from analysis as described in the previous paragraph, it is sometimes useful to include some of them back.

Assuming you have excluded your entire forum as above, you may still want one or a few of its pages to be included. You can do so with this setting.

For instance:

/forum/terms
/forum/code-of-conduct

By entering these 2 lines, those two specific pages will be included back and analyzed, even though all other forum pages remain excluded.

Excluded domains

As 4SEO can find links to other websites on your pages, it will by default perform a basic analysis of each of them: namely, it checks whether the linked page returns a valid response or an error.

You can exclude links to one or more domains from this process, to save time and resources, by listing them here.
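
For instance, you might list domains like these (purely illustrative examples):

example.com
cdn.example.net

Links pointing to those domains would then be skipped during analysis.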

Collect Incoming 404s

All sites exposed to the public will face a lot of (failed) attacks by robots trying to break into them. This is expected and not a problem at all assuming you maintain your site - Joomla and all extensions - fully updated at all times.

Most of these attacks just check to see if a known vulnerability works on your site. When it does not, they'll just move on to another type of attack or another website.

These random "test" attacks are why you can see requests for WordPress pages made to your Joomla site.

The attackers don't know if your site is a Joomla or WordPress or Drupal site. They just try everything they know and see if something works.

The only inconvenience is when you try to record 404 errors on your site, in case there's a bad link somewhere you want to fix: you get a lot of these garbage attack requests cluttering your error logs.

The logs become hard to use as there are so many random attacks listed.

That's why you can disable the recording of these incoming 404s and only keep actual errors happening on your site, such as PHP errors or any error other than a 404.

These errors are listed on the Errors | Recorded errors page.

No matter what, 4SEO will always record 404 errors caused by broken links on your site. These errors are listed under Errors | Broken links.

Restricted access

If your site is still in development, you may be protecting it from external access with a password. This will prevent 4SEO from crawling your site pages.

If you use the HTTP Basic Auth password protection method, the most common one, you can enter your chosen username and password here. 4SEO will use them to gain access when crawling your site, without you having to remove this protection.
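
For reference, HTTP Basic Auth protection is often set up on Apache with an .htaccess file along these lines (the realm name and path are just examples):

AuthType Basic
AuthName "Development site"
AuthUserFile /home/example/.htpasswd
Require valid-user

The username and password stored in the matching .htpasswd file are the credentials to enter in this setting.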

Crawler configuration

Validate TLS certificates

If you develop your site on your local machine, you may be using a temporary, self-signed TLS (also called SSL) certificate.

Such a certificate is usually considered invalid by PHP, as it does not come from a known certification authority. 4SEO is therefore blocked from analyzing your site as long as you are working locally.

To work around this, you can disable certificate verification here, at least for as long as you work locally.

External cache bypass

In some rare cases, you may be using an external (non-Joomla) full-page caching system that breaks 4SEO's ability to analyze your website pages. This may be what is happening if, for instance, the number of analyzed pages stays stuck at zero.

Such external caching may be a CDN, or FastCGI caching enabled on your server.

It is perfectly fine to use a CDN such as Cloudflare on your website, and a standard setup does not require configuring "External cache bypass": Cloudflare and other CDNs normally only cache assets (images, JavaScript, CSS). It is only if you configured Cloudflare or your CDN to cache entire HTML pages that 4SEO analysis is blocked by the CDN. For instance, Cloudflare requires you to create Page Rules to enable full-page caching; it is not part of a standard Cloudflare configuration. If in doubt, please consult with us before making any change here.

When External cache bypass is enabled, 4SEO will add a query variable with a random value to all its analysis requests.

The query variable name is: x-wblr-crawler-cdn-bust.

In this case, you will need to configure your CDN or caching system to take the query string into account when caching.
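
As an illustration, an analysis request for one of your pages would then look something like this (domain, path and random value are examples only):

https://www.example.com/some-page?x-wblr-crawler-cdn-bust=1689345678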

The 4SEO crawler will always set a specific request header, independently of this setting. If your caching system or CDN allows it, you can bypass caching when that request header is present. The header looks like this: x-wblr-crawler: XXXXXX, where XXXXXX is your secret cron key (found under 4SEO system configuration).
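
For example, if your server uses nginx FastCGI caching, a bypass rule based on that header could be sketched like this (assuming an existing fastcgi_cache setup; adapt to your own configuration):

# Skip the FastCGI cache whenever the 4SEO crawler header is present
fastcgi_cache_bypass $http_x_wblr_crawler;
fastcgi_no_cache $http_x_wblr_crawler;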