
SEO and AngularJS: crawling, indexing and rendering

Index:

  • Premise
  • The basics of SEO with AngularJS
  • Fundamentals: What we know about GoogleBot and JavaScript
  • JavaScript redirects, links and runtimes
  • The old AJAX Scanning scheme (deprecated)
  • Server side rendering is a must
  • 4 different approaches to SEO with AngularJS
  • Phantom.js and Prerender.io
  • Hashbang or not?
  • Canonical tag
  • Set up the AngularJS environment
  • Test and validate implementations
  • Super secret
  • Resources and insights

Premise

In September, while I was at the beach in total relaxation, I got a call from Roberto, a friend I collaborate with:

– Roberto: Hi Giovanni, I have a new project for you!
– Me: Excellent Roberto, what is it about?
– Roberto: It's an important client …
– Me: Very well, continue …
– Roberto: The work is practically already yours …
– Me: Fantastic, tell me more …
– Roberto: He has a very large site …
– Me: Interesting, what's the catch?
– Roberto: It is developed in AngularJS…
– Me: …
– Roberto: Giovanni ?! Are you still there?
– Me: Did you say Angular?
– Roberto: Unfortunately yes! They have indexing problems …
– Me: Come on?! Who would have thought …

This was my first reaction to AngularJS :D

My approach was very cautious; I didn't feel prepared. The infrastructure I had to analyze was complex, and the client required a higher level of advice on AngularJS than I was able to offer at the time (as the saying goes, the more you know, the more you realize you don't know). With passion and dedication I started reading to resolve every doubt, and I spent two months filling the gaps by studying implementation cases and optimizations with AngularJS. Once I had built up the skills I needed, I carried out an SEO audit of the website and, together with the team of developers, we corrected and solved all the technical problems that prevented the site from being indexed correctly.

The following image shows the increase in impressions on Google after the technical optimization: the section of the site developed in AngularJS went from 1,800 impressions per day to almost 14,000. With good work you can achieve significant results with AngularJS too.

Increase of impressions on Google

This guide is the result of the experience gained working on this AngularJS 1.0 project and complements the talk I gave at Search Marketing Connect 2016.

Application and website development technologies such as Angular (backed by Google), React (backed by Facebook) and Angular 2.0 are steadily gaining ground. From an SEO's point of view, understanding how search engines interact with these technologies allows you to work on new and stimulating projects.

Nearly half a million websites developed in AngularJS as of September 2016. Source: https://trends.builtwith.com/javascript/Angular-JS

Are you an SEO consultant? Do you have to optimize a website developed in AngularJS and don't know where to start? There are several ways to obtain an indexable site. Keep reading this guide and, by the end, you should have somewhat clearer ideas.

The basics of SEO with AngularJS

AngularJS renders pages with client-side JavaScript, i.e. in the browser of the user who requested the page. AngularJS provides a lot of practical features to developers and is a very powerful tool for quickly creating web applications. AngularJS is defined as a templating language that offloads the rendering work almost completely to the client – your browser.

Look at this example:
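
A minimal, hypothetical AngularJS 1.x template of the kind we are talking about (the app and controller names are made up, not taken from a real site):

<html ng-app="demoApp">
  <body ng-controller="GreetingCtrl">
    <!-- The curly braces are an AngularJS expression: the browser receives
         them as-is and AngularJS replaces them with data at runtime -->
    <h1>{{ title }}</h1>
    <script src="angular.js"></script>
    <script>
      // Hypothetical controller: it fills the template in the client
      angular.module('demoApp', [])
        .controller('GreetingCtrl', function ($scope) {
          $scope.title = 'Hello from AngularJS';
        });
    </script>
  </body>
</html>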

The advice you used to find in Google's Webmaster guidelines was to make content crawlable and accessible to search engines, avoiding JavaScript-only content. This view has changed: Google now tries to render websites that make heavy use of JavaScript. However, the results are not always the desired ones.

What changes compared to a normal HTML page?

Think of a classic HTML web page like the one you are reading right now. The HTML code you are viewing is a template, built and populated by PHP functions and MySQL database queries. The HTML was compiled on the web server when you requested the page, and then served to you via HTTP(S).

If there are caching layers in place and someone else requested this page before you, you are probably reading a cached copy, built by the CDN before you landed here. Right now you are reading a web page that is essentially an HTML file served by a web server. It was delivered after you sent an HTTP GET request, and now it's on your PC. If you wanted to see another web page, my web server would be happy to render it and send it to you. If you wanted to interact with a page, perhaps by filling out a form, you would send a POST request. This is how the Internet works.

This is not quite what happens when you land on a web page built on a JS framework like AngularJS. Basically, when you request a page from a site in AngularJS, the content you see is the result of DOM manipulation performed by JavaScript in your browser. Sure, there are several HTTP calls between client and server (using AngularJS's $http service), but it's the client that does most of the heavy lifting.
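
To make the idea concrete, here is a minimal, hypothetical example of that pattern (the URL and names are illustrative): the server only returns JSON, and it is the browser that builds the HTML.

// Hypothetical AngularJS 1.x controller: the browser asks the server
// for raw data via the $http service, then Angular fills the DOM client-side
angular.module('shopApp', [])
  .controller('ProductCtrl', function ($scope, $http) {
    $http.get('/api/products')            // asynchronous HTTP GET returning JSON
      .then(function (response) {
        $scope.products = response.data;  // the HTML is built in the browser, not on the server
      });
  });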

Differences between an HTML page and a page in JS

Simple HTML page call

  • Browser sends HTTP request
  • Web server contacts Database
  • Web server compiles HTML
  • Web server provides HTML + CSS

AJAX page call

  • Browser sends Javascript call
  • Ajax engine interprets the call and sends HTTP request
  • Web server contacts Database
  • Web server provides JavaScript
  • Ajax Engine interprets JavaScript response
  • Ajax engine builds HTML + CSS
  • Ajax engine updates the DOM

Client-side rendering of pages, asynchronous data exchange, content updates without a page refresh, building HTML from templates – these are the useful features that made JS frameworks popular. For this reason many developers favor the MEAN stack (Mongo, Express, Angular, Node): it is relatively simple and fast to develop advanced application prototypes with it. However, if you want to receive organic traffic, it is important to build a structure that search engine spiders can access with these technologies.

What amazes me is that some web developers insist on developing websites in AngularJS even when it is not necessary, for example for brochureware and single-page websites (FAQs, landing pages, etc.).

Always remember that, in any case, if you create a website with a JavaScript framework you will have to budget for extra time and effort, and you will need real experts on the design and development team to compete in the SERPs with classic websites – at least for now.

Look at this site, or this one, both developed in AngularJS. The content you see is rendered in JavaScript, and for this reason, if you look at the source code (CTRL + U in Google Chrome) you will see an unusually small amount of HTML, much less than you would expect when looking at the rendered result. Here is an example:
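
The source of a page like that typically boils down to a skeleton along these lines (a hypothetical sketch, not the actual code of those sites):

<!doctype html>
<html ng-app="app">
  <head>
    <title>{{ page.title }}</title>
  </head>
  <body>
    <!-- Almost no content: everything visible is injected by JavaScript -->
    <div ng-view></div>
    <script src="angular.js"></script>
    <script src="app.js"></script>
  </body>
</html>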

This is all the content that a site in AngularJS provides to spiders; sometimes you will find even less. This is why you get a blank page when you visit the site with JavaScript disabled in your browser. The ng-app attribute you find at the opening of the HTML page creates the magic: it tells AngularJS to manipulate the DOM.

ng-app is a directive in the AngularJS ng module that tells the framework which element of the DOM it should use as the root of our application. In the case shown above, the ng-app attribute acts on the entire page (the html tag) but, more generally, it can also be placed on an inner element, even a single div.
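
For example (illustrative markup, assuming a module named widgetApp exists), the directive can scope AngularJS to just one block of the page, leaving the rest as plain HTML:

<!-- Only this div is managed by AngularJS; the surrounding page stays static -->
<div ng-app="widgetApp" ng-controller="WidgetCtrl">
  {{ message }}
</div>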

Fundamentals: What we know about GoogleBot and JavaScript

Over the past year, Google has made progress in its ability to crawl web pages and sites developed in JavaScript. Essentially, Googlebot renders web pages as if it were a browser.

Below is an April 2016 statement from Paul Haahr, a Google engineer, clearly saying that the direction of the search engine is to interpret JavaScript content better and better.

And then, we also do content rendering. And this is something that Google has talked about a lot in the past couple of years. It’s new for us over the past few years. It’s a big deal for us that we are much closer to taking into account the JavaScript and CSS [28, 29, 30, 31, 32, 33] on your pages.

How Google Works – Paul Haahr (Software engineer at Google) at SMX April 2016

Google Webmaster Trends Analyst Gary Illyes, Google Software Engineer Paul Haahr and Search Engine Land Editor Danny Sullivan at SMX West 2016

The quote links to several in-depth articles; in link #29 Adam Audette describes indexing tests performed on Google with redirects and links in JavaScript. In the projects I have come across I have been able to confirm his findings, obtaining the same results that I describe in the next paragraph.

JavaScript redirects, links and runtimes

Googlebot's progress in handling redirects, links and other elements in JavaScript is rapid and will keep improving day by day. Taking stock of the situation to date, we can say the following.

1. Googlebot follows redirects in JavaScript

Googlebot treats JavaScript redirects performed with window.location in the same way as a 301 redirect from an indexing perspective, with both absolute and relative URLs.
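
For example (hypothetical URLs), both of these forms are treated like a 301 from an indexing standpoint:

// Absolute URL
window.location = 'https://www.example.com/new-page/';
// Relative URL
window.location = '/new-page/';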

2. Googlebot follows links in JavaScript

This includes URLs associated with the href attribute and inserted inside the classic “a” tag, and in some cases even outside the “a” tag.

Also in in-depth article #29, the author reports that Googlebot crawled URLs that were only generated by an event handler, in this case mouse events (onmousedown and onmouseout). Still in the same test, the author claims to have seen Googlebot follow URLs generated by executing JavaScript variables: in his example, a concatenated string of characters produced a URL only when executed.

Personally I have not run tests of this type; in general I have noticed that Googlebot does not execute events that are not links.

3. Dynamically generated content

Googlebot is able to index meta tags (title and description), images and textual content inserted dynamically into an HTML page (for example via document.write), whether the code resides on the same HTML page or in an external JavaScript file. It is important not to block these files with robots.txt.
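
A hypothetical example of the kind of dynamically inserted content Googlebot can pick up, whether it lives inline or in an external .js file:

// Content injected at page load, yet indexable according to the tests above
document.title = 'Red running shoes - Example Shop';
document.write('<p>Lightweight running shoes with a breathable mesh upper.</p>');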

4. Googlebot does not run all events

Despite Google’s obvious improvements in crawling pages in JavaScript, URLs still need to be manipulated to become actual links inside the “a” tag. In fact, Google is able to interpret layout elements and common elements of a web page, but it does not try to execute events in JavaScript to see what happens. To manage URLs, you need to provide Googlebot with a “classic” link to follow. Using other methods, you risk that Googlebot will ignore those elements, or even stop crawling the page altogether.

5. Googlebot has a maximum wait time to execute JavaScript

It seems that Googlebot does not index content that takes more than 4 seconds to render, or content that requires the execution of an event outside an a-tag link (see the previous point).

The old AJAX Scanning scheme (deprecated)

On October 14, 2015, Google announced that the crawling scheme Googlebot used with AJAX had become obsolete: they were deprecating the old "AJAX Crawl Directive". It has since been noted that, despite this announcement, Google still respects the old directive today.

In practice, the old 2009 directive can be summarized as follows:

Source: https://webmasters.googleblog.com/2009/10/proposal-for-making-ajax-crawlable.html

Process and explanation (with example URLs):

1. Stateful URL: the URL format generated by the JS framework without any optimization. According to HTML standards, however, the hash (#) indicates an internal anchor.
   Example URL: http://example.it/chi-siamo.html#state
2. Google and Bing propose adding the fragment token (!) after the hash (#) to mark JS content without violating the HTML standard. The resulting URL is called the Pretty URL.
   Example URL: http://example.it/chi-siamo.html#!AJAX
3. Search engines map Pretty URLs in the index, but in the crawling phase they request Ugly URLs from the web server, i.e. they replace the hashbang #! with ?_escaped_fragment_= in order to obtain the pre-rendered page.
   Example URL: http://example.it/chi-siamo.html?_escaped_fragment_=AJAX
4. When the web server receives a request for an Ugly URL it activates a headless browser, which renders the JS page and creates a complete HTML snapshot to be served to the spiders (with the same DOM a page rendered in a client browser would have). Search engines index the content received from Ugly URLs under the Pretty URL.
   Example URL: http://example.it/chi-siamo.html#!AJAX

Important: If you are working on a site that uses the ?_escaped_fragment_= parameter, make sure the rendering functionality is working fine.

As you can see from the diagram, the old crawling directive required pages to be pre-rendered. Now Google basically says that we can also do without it.

Can we relax then? No, I wouldn't say so. Pre-rendering, in my opinion, is something we still need, for several reasons.

Server side rendering is a must

Providing search engine spiders with HTML snapshots by pre-rendering the pages is, in my opinion, the best choice at the time of writing. I stress today – December 2016 – because tomorrow search engines will be much more efficient in this field.

Can Google render and crawl JS sites? Yes, or at least it tries. This, however, is no guarantee that the result is SEO friendly, i.e. a perfectly optimized website. The technical expertise of an SEO is required, especially during implementation tests, to ensure that the site is crawled correctly by Googlebot, has a correct structure and that the content targets the right keywords. Your budget will thank you: the costs will be lower than those of a last-minute scramble to fix problems.

Can other search engines render and crawl JS sites? I could say "yes and no", but I prefer to say NO. Is your target market Russia and do you live on traffic from Yandex? Render or die. Is traffic from Baidu valuable to you? Render or die. Bing? Better, but still not as good as Google: render or die.

Should I render my pages or not? To avoid indexing problems, I think it is better to keep total control over what is "rendered" by serving Google HTML snapshots, that is, a pre-rendered, compiled version of your pages. This way you can apply all the classic rules of SEO, and it will be easier to detect and diagnose problems during testing. Also remember that in this article we are talking about Google, the most advanced search engine; the other search engines achieve significantly poorer results when rendering JS.

And then… I often look at how "the good ones" do it, and most of the major websites developed with AngularJS (I won't say all of them so as not to sound glib, but pretend I did) use a server-side render function. Let's ask ourselves why, and draw our own conclusions.

Advantages of pre-rendering:

  • It is better to have total control over what is "rendered", serving HTML snapshots to Google
  • On the HTML snapshots we can work on SEO in the classic way
  • It is easier to detect and diagnose problems during SEO tests
  • Remember: Google has the ability to render JS, but the other search engines struggle

4 different approaches to SEO with AngularJS

To optimize the crawling and indexing of a website developed with AngularJS there are 4 alternatives; I explain them in order from the most elegant to the simplest.

1. Pre-rendering with PhantomJS – the most elegant method

  • Generate snapshots of your pages using PhantomJS (a headless browser) and create a custom cache level
  • Do not use #! in URLs
  • Do not use escaped_fragment in URLs
  • Make sure each page has a Friendly URL using the HTML5 History API (so without #!)
  • Enter all Friendly URLs in a sitemap.xml and submit it to GSC
  • Instead of serving snapshots when the ?_escaped_fragment_= parameter is included in the requested URL, serve the HTML snapshot when the Friendly (canonical) URL of the page is requested by a search engine user agent, such as Googlebot (see the sketch after this list)
  • Users (browsers) receive the page without pre-rendering
  • Implement the canonical rel tag correctly
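
A minimal sketch of that user-agent check in plain Node.js JavaScript; the bot list, port and snapshot directory are hypothetical assumptions, not the author's implementation:

var http = require('http');
var fs = require('fs');

// Hypothetical list of crawler user agents that should receive the snapshot
var BOT_PATTERN = /googlebot|bingbot|yandex|baiduspider/i;

http.createServer(function (req, res) {
  var userAgent = req.headers['user-agent'] || '';
  if (BOT_PATTERN.test(userAgent)) {
    // Crawler: serve the pre-rendered HTML snapshot produced by the headless browser
    fs.readFile('./snapshots' + req.url + '/index.html', function (err, html) {
      res.writeHead(err ? 404 : 200, { 'Content-Type': 'text/html' });
      res.end(err ? 'Snapshot not found' : html);
    });
  } else {
    // Human visitor: serve the normal client-rendered AngularJS shell
    fs.readFile('./app/index.html', function (err, html) {
      res.writeHead(err ? 404 : 200, { 'Content-Type': 'text/html' });
      res.end(err ? 'Not found' : html);
    });
  }
}).listen(8080);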

The redbullsoundselect.com website follows this technique as of the date of this writing.

2. Continue with the old crawling policy

  • Keep the ?_escaped_fragment_= parameter to provide rendered snapshots. Googlebot is still processing requests for this parameter
  • Users (browsers) receive the page without pre-rendering on the Pretty URL (with hashbang)
  • Provide Google Search Console with a sitemap.xml containing all the Pretty URLs (with hashbang)
  • Implement the canonical rel tag correctly (with hashbang)

3. Let Google render – the “hope in GOD” method

  • Let Google render the AngularJS site, without any kind of pre-rendering, and see what happens. The other search engines will be out of luck
  • Use the HTML5 History API to remove the hashbang from the URL visible in the browser, and provide Google Search Console with a sitemap.xml containing all the Friendly (canonical) URLs. Most developers agree that having hashbangs in URLs is not ideal: it adds complexity to a site's SEO.
  • If you don't want to use hashbanged URLs but want to inform Google that your site contains AJAX, then you need to use the ng-app AngularJS directive

4. Angular 2.0

  • Don’t develop anything in AngularJS 1.x
  • Go directly to Angular 2.0. Angular 2.0 natively includes server rendering functions, with results similar to those obtained with React
  • To learn more, read this guide

How to make a site SEO friendly in React: to get started you can read this guide. The choices remain more or less the same: deliver pre-rendered content to search engines and users. Personally I haven't tried it, but I've read that webpack or Browserify is used to package the JS into npm modules and run it on both server and client. The alternative is to let Googlebot index the site on its own.

Whether you decide to use ?_escaped_fragment_= or not, there are many other things to take into account and the work will be long and complex. For this reason I recommend using ReactJS or Angular 2.0 Universal, as they already include server rendering features. Watch the presentation video.

Phantom.js and Prerender.io

PhantomJS allows you to perform operations that are normally done with a browser, without showing the browser itself. It is a so-called headless browser, that is, a tool that allows you to manipulate the DOM, CSS, JSON, Ajax and other client-side Web technologies via JavaScript from the command line, without rendering anything on screen.

Based on WebKit, PhantomJS is a cross-platform tool and can be used in all those contexts where you need to automate the typical activities of a Web browser, but not only. To give some examples, it can be used for Web scraping activities, for the automation of tests on Web sites and applications, for network monitoring, but also for SVG rendering, interfacing with Web services and even for the creation of a simple Web server.

The PhantomJS server must be installed on the same machine that hosts the web server; it works on a different port. Alternatively, there are external rendering services based on PhantomJS, such as Prerender.io. This paid service (free up to a maximum of 250 cached pages) is particularly useful when the web server cannot handle the rendering of many pages with acceptable response times.

Hashbang or not?

It is very important for an SEO to understand how it can help search engines crawl JavaScript dependent websites. If you achieve this, you are a good technical SEO. As we saw at the beginning, both Google and Bing support a directive that allows web developers to provide HTML snapshots of the content in JS, through a modified URL structure.

Specifically, I'm referring to the hashbang parameter that search engines replace with ?_escaped_fragment_= in a URL.

I repeat the process because it is critical to understand it.

Imagine you have hashbanged URLs on your site in Angular:
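
http://esempio.com/#!/1/2/3/products/content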

The spider of a search engine recognizes the hashbang and automatically requests the URL from the web server with ?_escaped_fragment_=:

http://esempio.com/?_escaped_fragment_=/1/2/3/products/content

The web server recognizes the request as coming from a search engine, since only search engines request URLs with ?_escaped_fragment_=, and serves the pre-rendered snapshot of the page, the pre-compiled HTML file. Of course you have to make sure that pre-rendering is working correctly on your web server, and that usually doesn't happen by accident ;)

Most of the developers I have dealt with use PhantomJS to render pages in AngularJS.

If you can successfully render your pages as HTML, the only thing left to check is that all requests containing ?_escaped_fragment_= are redirected to your web server's cache directory.

I don't want to use the hashbang on my Angular site
Look at the head tag of the Redbull site I linked above – https://www.redbullsoundselect.com/ – and you will see a new meta tag, unknown to anyone who has never worked with Angular, but crucial if you don't use the hashbang.

<meta name="fragment" content="!" />

Scroll through the other meta tags: wouldn't you think that, with all those curly brackets, it's more of a mess than anything else? You'd be wrong: those curly brackets will be filled with HTML by JS. The developer of this site did a good job: he set up the URLs without the hashbang and, to warn the search engines, put the "meta fragment" in the head tag. The fragment meta tag invites search engine spiders to request the HTML snapshot at the URL with the ?_escaped_fragment_= parameter.

Note: To generate Friendly URLs (URLs without the hashbang), use the AngularJS $location service and the HTML5 History API.
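
A minimal sketch of that configuration (the module name is hypothetical):

// Enable HTML5 History API URLs (no #!) in AngularJS 1.x
angular.module('app').config(function ($locationProvider) {
  $locationProvider.html5Mode(true);
});
// Remember to also declare a base URL in index.html, e.g. <base href="/">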

Canonical tag

The Canonical rel tag should always be inserted, whatever technical setup you choose. The important thing is that if you decide to use this tag, make sure it is correct. Specifying the wrong URL in the Canonical tag could affect the indexing of the entire website.

In the Canonical rel tag enter the Pretty URLs, i.e. the URLs used by users and visible on the site.
Remember: Pretty URLs are addresses with:

  • Hashbang
  • Friendly URLs, in case you used the HTML5 History API to rewrite URLs without #!

Never put URLs with ?_escaped_fragment_= in the Canonical rel tag or in the sitemap.xml. Another common mistake to avoid is pointing the Canonical tag of every page to the homepage.
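
For example, reusing the sample URLs from the crawling-scheme table above, a page served on a Friendly URL would carry:

<link rel="canonical" href="http://example.it/chi-siamo.html" />

while the Ugly URL http://example.it/chi-siamo.html?_escaped_fragment_=AJAX must never appear there.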

Set up the AngularJS environment

To configure the development environment you can use Yeoman, an application scaffolding tool, which is an all-in-one solution for developing and testing AngularJS applications. Open the terminal and type:
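
The commands are roughly the following (a sketch; exact package names depend on the Yeoman generator you choose):

npm install -g yo generator-angular   # Yeoman plus the AngularJS generator
mkdir testangular && cd testangular
yo angular                            # scaffolds the application in the folder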

Note: I used the name “testangular” for the folder, but you can give it whatever name you like.

Wait for the setup process to begin and answer the various questions the system asks you, choosing whether to include Bootstrap and any of the other AngularJS modules. Once the process is done, you will have a development environment set up. From the terminal type grunt server and the browser should open with the default Yeoman template. From here you can start creating a Single Page Application (SPA), but let's go further and create the HTML snapshot of the AngularJS page.

You need a method to tell search engine spiders to request pre-rendered pages, for this purpose you can use a pre-developed SEO package for AngularJS.

git clone https://github.com/steeve/angular-seo.git

In the folder you will find two important files:

  • angular-seo.js, which must be placed in the "/testangular/app" folder
  • angular-seo-server.js, which must be placed in the "/testangular" folder or in the root of your application (the folder that contains the Gruntfile.js file)

Note: You will find complete instructions in the GitHub repository.

The setup must make sure that the system uses two ports:

  • Application port: to manage the application
  • Snapshot port: to manage the application instance on PhantomJS

Requests from non-bots are served by the Application port (no matter which), while requests from bots and search engines are served, in pre-rendered HTML, by the Snapshot port. The next steps are:

  • tell the application to enable indexing by spiders
  • include the SEO module
  • tell the application to warn you when it has finished rendering
  • install and run PhantomJS
  • notify the search engines by inserting the fragment meta tag in the index.html page (see above); the spiders will then look for the HTML snapshot at the URLs with ?_escaped_fragment_=

Now open the app.js file and find the module declaration. Whichever way you decide to do it, you need to include the SEO module defined in the angular-seo.js file (which we placed in the testangular/app folder). For example, the declaration looks like this:
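
Something along these lines, assuming the Yeoman-generated module name "testangularApp" (yours may differ):

angular.module('testangularApp', [
  'ngRoute',  // modules generated by Yeoman, if any
  'seo'       // the module defined in angular-seo.js
]);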

Now you need to signal when the HTML has finished rendering by calling the $scope.htmlReady() function, at a point where you are certain that the HTML page has finished loading. It depends on how you've organized your controllers, but it's typically done at the end of the main controller. For example, with the controller included in Yeoman, the main.js file looks like this:
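
A sketch of what that looks like with the default Yeoman controller (content shortened):

angular.module('testangularApp')
  .controller('MainCtrl', function ($scope) {
    $scope.awesomeThings = ['HTML5 Boilerplate', 'AngularJS', 'Karma'];

    // Tell angular-seo that the page has finished rendering,
    // so PhantomJS can take the snapshot
    $scope.htmlReady();
  });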

Finally, you have to include the angular-seo.js file in your index.html, towards the bottom, where the controllers are also included:
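
For example (the exact path depends on where you copied the file):

<script src="angular-seo.js"></script>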

You have finished the application setup, now install and set up PhantomJS .

From the command line, type npm install phantomjs. Once the installation process is finished, navigate to the root directory (the folder that contains the angular-seo.js and angular-seo-server.js files) and run this command:
phantomjs --disk-cache=no angular-seo-server.js 9090 http://127.0.0.1:9000

The command starts the PhantomJS server without disk caching (for now, make do with that ^^), active on port 9090. The port used by PhantomJS must necessarily be different from the port used by your application. That port is set in Yeoman's native Grunt file: in other words, yo angular offers the option of running grunt server, which sets up a web server on localhost to test the application on port 9000.

  • PhantomJS listens on port 9090 and renders the application running on port 9000
  • when requests contain ?_escaped_fragment_= instead of the hashbang, PhantomJS pre-renders the page and serves it to the requester, which is a crawler
  • when requests contain hashbanged URLs, the requester is a human (browser), so they bypass PhantomJS

Now that PhantomJS is up, launch the development server with grunt server.

  • The development server has address 127.0.0.1 (or localhost, whichever you prefer to call it) and listens on port 9000
  • a second web server on port 9090 (PhantomJS) handles the requests identified as coming from crawlers, rendering the application that runs on port 9000

Before going live, make sure that requests for URLs with ?_escaped_fragment_= are identified correctly and routed to PhantomJS. URLs with ?_escaped_fragment_= must not be served by the main server listening on port 80. On Nginx you can use this rule:
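
A minimal sketch of such a rule, assuming PhantomJS listens on port 9090 as in the example above (adapt it to your own server block):

location / {
    # Requests coming from crawlers carry ?_escaped_fragment_=
    if ($args ~ "_escaped_fragment_") {
        proxy_pass http://127.0.0.1:9090;
    }
    # Everyone else gets the normal AngularJS application
    try_files $uri $uri/ /index.html;
}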

Test and validate implementations

Once you have tested and validated the pre-rendering with a crawler, you can switch to “classic SEO” mode, optimizing the usual aspects of an HTML site that we already know. But first you need to check that the pre-rendering works perfectly. For the test you can use the command line and run (if you are local):
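
For example (a hypothetical local test; adjust host and port to your setup, 9090 being the PhantomJS port used above):

curl 'http://127.0.0.1:9090/?_escaped_fragment_='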

The web server should show you the HTML snapshot of the homepage (the "/" address). Alternatively, equip yourself with a crawler and Google Search Console. The necessary tools are few but fundamental:

  • Use Fetch As Google
  • Check the Google cache
  • Use Google Search Console
  • Scan the site with a JS friendly crawler

As I always say, "luckily there is Screaming Frog", a crawler capable of scanning even AJAX pages. The AJAX crawler respects the AJAX directives, which means that whenever it finds a hashbang it will request the URL with ?_escaped_fragment_= – from Pretty URL to Ugly URL – to get the rendered page from the web server. To scan a site with Screaming Frog, download the sitemap.xml (how do I find it?) and enter it in list mode. The crawler emulates the behavior of search engine spiders: it finds the hashbang and requests the Ugly URL version to get the HTML page pre-rendered by the web server.

For a complete guide to Screaming Frog in Italian, I refer you here.

Make sure that URLs with ?_escaped_fragment_= respond with status code 200. As mentioned earlier, the main problem with this framework is handling the pre-rendering functionality.

Super secret

Technical SEO isn’t everything. Now that you have a website developed in AngularJS that is perfectly optimized for search engines, you need to fill it with quality unique content. A technically perfect site without content is like a racing car without gasoline!

Resources and insights

  • The Basics of JavaScript Framework SEO in AngularJS
  • Angular JS and SEO
  • How do search engines deal with AngularJS applications?
  • Warning: You’re Killing Your SEO Efforts by Using Angular JS
  • angularjsseo.com
  • AngularJS SEO with Prerender.io
  • SEO for Universal Angular 2.0
  • Building Search Friendly Javascript Applications with React.js
  • Deprecating our AJAX crawling scheme
  • AngularJS Developer Guide
  • Getting Started with Angular
  • PhantomJS Documentation
  • Google: One Day We Will Deprecate Our AJAX-Crawling Proposal

This guide took a long time to write and is here for you, free of charge. If you think it might be useful to your colleagues or friends, share it on social media and leave a comment. The author thanks you ;)
