Automated Link Checking on an AJAX Website

Posted by Steve Green on 13 February 2013.

Using an automated link checker is a standard part of pretty much any website testing project, and our tool of choice is Xenu's Link Sleuth.

On a recent project, I knew that the website (let's call it http://www.abc.com) contained about 1000 pages. I started Xenu, but it finished spidering the website almost immediately, having found precisely one page.

This is not uncommon; it happens, for instance, when the starting URL redirects to a different domain. But I knew that this one didn't do that.

So what's going on?

From the source code, I could see that the home page contained many links to other pages, but Xenu had not listed any of them. Inspection of the URLs showed that they were all of the form http://www.abc.com/#!/animals/walrus. The URLs were all on the same domain as the starting URL, but everything after the # symbol is a fragment identifier, which normally means the link points to somewhere within the current page. The problem is that Xenu completely ignores this type of link.
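To illustrate why a crawler might see only one page here, the following minimal Python sketch strips the fragment from a few links of this form. The /animals/seal URL is a made-up example, and this only shows how fragments are usually treated; it is not a description of Xenu's internal logic.

```python
from urllib.parse import urldefrag

# Example links in the style found on the home page.
# The "seal" URL is hypothetical and used only for illustration.
links = [
    "http://www.abc.com/#!/animals/walrus",
    "http://www.abc.com/#!/animals/seal",
    "http://www.abc.com/",
]

# urldefrag() splits a URL into the part before '#' and the fragment after it.
unique_pages = {urldefrag(link).url for link in links}

# Once the fragments are dropped, every link collapses to the same address.
print(unique_pages)
```

Seen this way, every link on the home page points back at the home page itself, which is consistent with the tool reporting a single page.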

Our first attempt

For problematic websites, we generally use the 'wildcard' version of Xenu, which supports wildcards in the inclusion and exclusion lists.

We tried adding http://www.abc.com/* as an internal URL on Xenu's Starting Point dialog, but it still found only the one page.

Some time later...

Through experimentation, we found that if we cut the string #!/ out of the URL of a page inside the website, the resulting URL would redirect to the original one. For example, http://www.abc.com/#!/animals/walrus would be cut down to http://www.abc.com/animals/walrus.
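If you need to convert many such URLs, the edit can be scripted. This is a minimal sketch in Python; the function name strip_hashbang is our own invention, and it assumes the site redirects the shortened URL as described above, which will not be true of every AJAX website.

```python
def strip_hashbang(url: str) -> str:
    """Remove the first '#!/' from a URL, leaving the rest untouched."""
    return url.replace("#!/", "", 1)


original = "http://www.abc.com/#!/animals/walrus"
shortened = strip_hashbang(original)
print(shortened)

# A URL without the '#!/' marker is returned unchanged.
print(strip_hashbang("http://www.abc.com/"))
```

The shortened form is the one to paste into the link checker as a starting URL.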

Crucially, this meant we could put the shortened URL into Xenu, and it now found all the links on that page. However, it would not spider beyond that page: all the other URLs were outside the /animals/ folder, so Xenu treated them as external.

The final part of the puzzle was to add the wildcard that we had tried initially. This tells Xenu to treat all URLs on the http://www.abc.com/ domain as internal.
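The difference the wildcard makes can be sketched in Python. This is an illustration only: it uses shell-style matching from the standard library, which may not behave exactly like Xenu's own wildcard rules, and the /plants/oak URL and both function names are hypothetical.

```python
from fnmatch import fnmatchcase

START_FOLDER = "http://www.abc.com/animals/"   # folder of the starting URL
INTERNAL_PATTERN = "http://www.abc.com/*"      # wildcard added as an internal URL


def internal_by_folder(url: str) -> bool:
    """Treat a URL as internal only if it sits under the starting folder."""
    return url.startswith(START_FOLDER)


def internal_by_wildcard(url: str) -> bool:
    """Treat a URL as internal if it matches the wildcard pattern."""
    return fnmatchcase(url, INTERNAL_PATTERN)


for url in [
    "http://www.abc.com/animals/walrus",
    "http://www.abc.com/plants/oak",       # hypothetical page in another folder
    "http://www.example.org/page",         # genuinely external
]:
    print(url, internal_by_folder(url), internal_by_wildcard(url))
```

With the folder rule alone, a page outside /animals/ counts as external and is not followed; with the wildcard, anything on the domain counts as internal while other domains are still excluded.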

In conclusion

This technique won't work on all AJAX websites and may not even work with other link checking tools. However, this experience is a reminder that you cannot take the results of a link checker at face value, and it shows that it may be possible to get the tool to work with websites that initially appear un-spiderable.