Website optimization [03] Caching and compressing

Updated: 20/03/2019.

After a post about measuring web-site performance and discovering “slow-downs”, followed by a post about eliminating multiple redirects, this post deals with website caching. It briefly explains what caching is and why it is important, then does the same for compression. Finally, I’ll explain how it has all been implemented on this (relatively small and simple) WordPress web-site. A separate post deals with WordPress plugins, in terms of speed and stability.

Contents:
1. Caching
…1.1. What is caching?
…1.2. Website caching
2. Website caching policy
…2.1. Validity and cache validation
…2.2. Caching scope
3. Caching implementation
…3.1. Caching implementation on the server
…3.2. Web site design caching implementation
…3.3. CDN cache implementation
4. Content compression – zip and gZip
5. Minimization
6. What did I do on BikeGremlin web-site?
7. Testing
Sources

1. Caching

1.1. What is caching?

Caching in general

Check what time it is now, the moment you are reading this. Ready? What is the capital of the United States of America? You have probably just used something analogous to caching. Most people know the capital of this country by heart, which is analogous to RAM memory (pleonasm, author’s remark) caching. In contrast, try answering: what is the capital of Rhodesia? Most people would have to use Google (and learn something about both history and geography in the process).

These two examples show the importance of caching for speed. Once you have Googled Rhodesia, that information stays in your “RAM” and you will answer the capital of Rhodesia question quickly next time – until you forget it, which is analogous to clearing cache memory.

Caching dynamic resources

Another example is important for understanding the subject. Imagine you are preparing for a quiz about capital cities. You are most likely to look up and make a list of all the countries and their capitals, sorted in whichever way is easiest for you to memorize (store in your “RAM”). Making the list takes a lot more time than just reading a completed list. This is similar to something I’d call “a (server) process cache”: once you make a list, it stays written and ready to be used next time. No need to make the same list over and over again. Having said that, a written-down list is like (hard) disk caching, while learning all the capitals by heart is like a RAM-stored cache.

Cache expiration

What time is it? Did you check the time again, or rely on the (now inaccurate) data you checked when starting to read this chapter? This demonstrates another important characteristic of caching: the more quickly and often something changes, the less it makes sense to cache it.
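To make the expiration idea concrete, here is a minimal Python sketch of a cache whose entries go stale after a set number of seconds. The class name TTLCache and its parameters are my own illustration, not something any particular server uses.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after max_age seconds."""

    def __init__(self, max_age: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_age = max_age
        self.clock = clock  # injectable, so the cache can be tested without sleeping
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str, compute: Callable[[], Any]) -> Any:
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.max_age:
            return entry[1]            # still fresh: answer "by heart"
        value = compute()              # missing or stale: look it up again
        self._store[key] = (now, value)
        return value
```

A capital city can safely get a max_age of weeks, while the current time would need a max_age so short that caching it stops paying off.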

Caching policy

The previous examples have shown the importance of caching some processes and data, as well as the pointlessness of caching the current time. Setting this all up and defining what is to be cached, where the caches will be stored and how often they will be refreshed is called “caching policy”. If the caching policy is defined and implemented properly, a web site will run fast and without errors. Set it up wrongly, and you have a website that tells the correct time only once every 24 hours. 🙂 Caching policy is explained in more detail in chapter 2.

There are a lot of different caching types and methods, but that exceeds the scope of this text. Here I will stick to caching in terms of web-site performance improvement.

1.2. Website caching

When a visitor comes to a site, unless it’s a prehistoric type of site with static html pages, the server needs to run some code in order to generate the page for the visitor to see (generating content, drop-down menus etc.). This takes time. Then the visitor needs to download all that to their computer. This too takes some time. Finally, some scripts (programs) often need to be run on the visitor’s computer. Only after all that is done can a page be seen and function properly on the visitor’s screen.

There are two basic types of “website caching”: server-side caching and web-browser (visitor-side) caching.

Browser caching

Most browsers have the ability to store once downloaded data for future use, so they don’t have to download it again. For example, the background picture of this site (a nice fixie bike 🙂 ) was downloaded only during the first visit to this site, afterwards being shown from your browser’s cache (unless you’ve set your browser to not save/use cache).

Browsers can also cache JavaScript files, used to help customize web page design and final look. Drop-down menus at the top of the page you are looking at are one example. Apart from this, it is possible to “tell” the scripts to run a bit later, after most of the page content has been loaded and shown. This enables visitors to start reading/viewing a page right away, before all of its functionality is enabled (“in the background”, or “asynchronously run scripts”).

All these elements are cached locally, on visitor’s hard disk/SSD. Each visitor will download them at least once (during the first visit to the web-site), and not have to download them again (for a long time). There are several different ways to tell a visitor’s browser what to cache and how long that cache is “valid”.

For example: it is good that visitors don’t always have to download the BikeGremlin logo and background image, but it makes no sense for a logged in user to write a comment on this page and then be served an old, cached version of the page, without their comment shown. If a browser caching policy is not explicitly defined on the host-server, browsers will use their default caching policy to cache certain data. This can create some web-site functionality problems, so it is best to provide correct caching instructions to browsers on your server.

Proxy caching

Proxy cache can be simply explained as a shared browser cache. It is set up by Internet service providers, or large companies, for computers in their networks (usually along with a firewall). To put it simply – if a user visits bikegremlin.com, their “proxy neighbours” will be able to quickly download most content directly from their proxy, when they visit bikegremlin.com – with no need to download it from the host-server.

Server caching

On the hosting server, instructions (caching policy) for others are set up. That is, the server defines what browsers may cache, what proxies may cache, etc. This part is very important, as I’ve tried to explain above.

In addition to this, an important part is setting up caching on the server itself: which pages, or process results (usually results of running PHP code), the server keeps on its hard disks (or SSD-s), in its RAM, etc., and for how long.

For example, while you are reading this, there is no bikegremlin.com/caching.html page containing this post, but the server runs PHP code using database data in order to show all this driveling. 🙂  Since posts are written once, seldom updated/changed, they are perfect candidates for caching – saving the results of once run code with a finished page ready to be displayed.
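The idea of saving the finished page can be sketched in a few lines of Python. This is only an illustration under my own assumptions: PageCache and the render callback are hypothetical names, standing in for whatever the server (or a caching plugin) really does with PHP and the database.

```python
from typing import Callable, Dict


class PageCache:
    """Stores finished HTML per URL so the rendering code runs once per post."""

    def __init__(self, render: Callable[[str], str]) -> None:
        self._render = render            # stands in for the PHP + database work
        self._pages: Dict[str, str] = {}

    def get(self, url: str) -> str:
        if url not in self._pages:       # cache miss: build the page and keep it
            self._pages[url] = self._render(url)
        return self._pages[url]

    def invalidate(self, url: str) -> None:
        """Call when a post is edited, so the next visitor gets a fresh render."""
        self._pages.pop(url, None)
```

The invalidate step is what keeps a seldom-edited post from being served in its old version after an update.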

Gateway cache (CDN)

Gateway cache is a sort of “reverse proxy” cache, made to spare servers load and bandwidth. Just like a proxy cache spares a user from downloading data, a gateway cache spares the server from uploading it. That is, if content requested by the user is stored in the gateway cache, it is served from there, not from the server.

CDN (Content Delivery Network) services provide this function, along with other functions (protection, optimization etc). Main advantage is that CDN-s store cache copies at several different geo-locations, routing visitors to the gateway cache location that they have the best/fastest connection to. So you can place a hosting-server in Novi Sad for example, and let the visitors living on the wrong side of the Atlantic ocean 🙂 visit a Los Angeles based gateway cache (copy) provided by the CDN service your site is using.


2. Website caching policy

For cache implementation it is crucial to make a good caching policy first. Good caching candidates are:

  • Content that is not changed often and even serving old versions is not problematic – background image for example.
  • Content that doesn’t change often, but it must be confirmed that the cached version is still relevant (stale content mustn’t be served) – comments and posts for example.

2.1. Validity and cache validation

Best before

A cache entry can be given a preset expiry date (the Expires header), or a validity period in seconds (the max-age directive of the Cache-Control header, written as Max-Age below). In both cases, the visitor’s browser checks the expiry info and, if the file is still considered good, it won’t even contact the host-server to check for changes. This is super-fast, but can be tricky if the data changes and the change matters for the visitor’s experience. For example: I could change the background image to show a coupon for a bicycle-shop discount. Returning visitors would still see the old, cached image of a fixed gear bicycle until their cached copy expires. The only way of solving this is renaming. I could upload the new background image as background2.jpg, instead of the old background1.jpg, and point the page at it. A returning visitor’s browser would then see that it needs background2.jpg. Since that image didn’t exist before, the browser can’t have a cached version of it, so it is certain to download and show the new background.

This is the main downside of both “best-before” (Expires) and “good for x seconds” (Max-Age) cache implementation.

The downside of a set expiry date (Expires) is that the date needs to be updated. Otherwise, once the date has passed, the content will practically be non-cached, i.e. a “fresh” version will always be downloaded from the server. In those terms, setting a duration period (Max-Age) is the simpler option, unless there is a good reason for a set expiry date/time.
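Here is a rough Python sketch of how a browser-like cache could work out until when a stored copy counts as fresh. The function name fresh_until is my own, the header lookup is simplified (exact-case keys, no other directives considered), and it relies on the HTTP rule that max-age wins over Expires when both are present.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def fresh_until(headers: Mapping[str, str],
                fetched_at: datetime) -> Optional[datetime]:
    """Return the moment a cached response stops being fresh, or None if unknown."""
    # Cache-Control: max-age takes precedence over Expires.
    for directive in headers.get("Cache-Control", "").split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            return fetched_at + timedelta(seconds=int(value))
    expires = headers.get("Expires")
    if expires:
        try:
            return parsedate_to_datetime(expires)
        except (TypeError, ValueError):
            return fetched_at  # an unparseable date is treated as already expired
    return None


# Example: a post fetched now, with a two week validity period.
fetched = datetime(2019, 3, 20, 12, 0, tzinfo=timezone.utc)
print(fresh_until({"Cache-Control": "public, max-age=1209600"}, fetched))
```

Note how the max-age version never needs its date updated on the server, which is the practical advantage described above.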

Changes

The other method, a bit slower since it requires checking with the server, but safer in terms of being up to date, is setting the cache to check whether there have been any modifications. This means that the cached copy is saved with a “last modified” date (Last-Modified), or with a tag that identifies a particular version of the file (ETag) and changes whenever the source changes. The browser asks the host-server whether the Last-Modified/ETag differs from the one it has cached. If not, there’s no need to download a new version. These tags are much smaller than the files, so they are quickly exchanged and checked, but the exchange still takes some time. That is the main downside of this caching method.
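The ETag check can be sketched like this in Python. It is a simplified illustration: make_etag and respond are hypothetical names, the tag is derived from a content hash (one common choice, servers are free to build it differently), and real servers also handle lists of tags and weak validators in If-None-Match.

```python
import hashlib
from typing import Dict, Mapping, Tuple


def make_etag(body: bytes) -> str:
    """Derive a validator from the content: it changes whenever the content does."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(request_headers: Mapping[str, str],
            body: bytes) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, payload) for a GET, honouring If-None-Match."""
    etag = make_etag(body)
    if request_headers.get("If-None-Match") == etag:
        return 304, {"ETag": etag}, b""   # browser's copy is current: send no body
    return 200, {"ETag": etag}, body      # first visit, or the content has changed
```

The 304 response carries only headers, which is why this check is cheap compared to downloading the whole file again.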

Public/Private

In addition to aforementioned validity controls, it is possible to set cache as:

  • Public (Cache-control: public) – cache that can be saved on public proxies etc.
  • Private (Cache-control: private) – cache that is only saved in unique visitor’s browser cache. Good solution for content that differs for each (logged in) visitor.
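  • No caching (Cache-control: no-cache) – despite the name, the browser may still store a copy, but it must check with the host-server before every reuse. Good for data that changes often while keeping the same name. To forbid storing a copy altogether, use Cache-control: no-store.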
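A small Python sketch of how a cache could interpret these directives. The function names are mine, and the parsing is deliberately simplified. It follows the HTTP caching specification, in which no-cache means “store, but revalidate before reuse”, while no-store is the directive that forbids keeping a copy at all.

```python
from typing import Set


def directives(cache_control: str) -> Set[str]:
    """Split a Cache-Control value into lowercase directive names (values dropped)."""
    return {
        part.strip().split("=", 1)[0].lower()
        for part in cache_control.split(",")
        if part.strip()
    }


def may_store(cache_control: str, shared_cache: bool) -> bool:
    """Whether a cache may keep a copy; shared_cache=True means a proxy or CDN."""
    found = directives(cache_control)
    if "no-store" in found:
        return False                 # nobody may keep a copy
    if shared_cache and "private" in found:
        return False                 # only the visitor's own browser may keep it
    return True


def must_revalidate_first(cache_control: str) -> bool:
    """no-cache allows storing, but the copy must be checked before every reuse."""
    return "no-cache" in directives(cache_control)
```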

2.2. Caching scope

Chapter 2.1. explained the reason why it is important, when creating a caching policy, to choose the correct method for files/objects. Web sites often have different kinds of content and the same caching method is often not optimal for an entire website. This is a very important consideration.

When planning a caching policy, separate rules can be defined that are applied to:

  • entire website – images, for example, can be cached using the same rule for all the images on the site.
  • some pages – a certain group of pages (or objects), based on directory, or URL address, can have a separate caching method defined. Posts, for example, that are not updated very often, can have a caching period of a week, or two (Expires, Max-Age), while comments should be kept up to date (Last-Modified, ETag).
  • a particular page/object – for example, if all the pages are set with a two week cache validity, but a page with open-hours, or price list must always be up-to-date.
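The three scopes above can be expressed as an ordered rule list, sketched here in Python. All the paths, patterns and durations are made-up examples, not the actual BikeGremlin setup. The only real design point is the order: the most specific rule comes first and the first match wins.

```python
from fnmatch import fnmatchcase
from typing import List, Tuple

# Hypothetical policy, most specific pattern first; the first match wins.
RULES: List[Tuple[str, str]] = [
    ("/price-list/", "no-cache"),             # one particular page: always revalidate
    ("/comments/*", "private, no-cache"),     # a group of URLs
    ("*.jpg", "public, max-age=31536000"),    # site-wide rule for images (one year)
    ("*.png", "public, max-age=31536000"),
    ("*", "public, max-age=1209600"),         # default for everything else: two weeks
]


def cache_control_for(path: str) -> str:
    """Pick the Cache-Control value for a URL path from the rule list."""
    for pattern, value in RULES:
        if fnmatchcase(path, pattern):
            return value
    return "no-cache"  # only reached if the catch-all "*" rule is removed
```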


3. Caching implementation

3.1. Caching implementation on the server

Once the caching policy is properly defined, what remains is to implement it (and test it, which is very important). This should first be done on the host-server of the web-site. The technique depends on server type (Apache, Microsoft etc.), and explaining that is beyond the scope of this post.

3.2. Web site design caching implementation

During the web-site design, attention should be paid to caching as well. Some general rules:

  • URL consistency – the same content should always have the same URL address, unless there is a very good particular reason not to. There is not much sense in making two pages, about_the_author.html and about-the-author.html (if they contain the same info), and linking to them alternately across the site. One will do. The same goes for other objects and images. Put them all into one library and refer to them at the same location wherever they are needed.
  • If an image, or a downloadable file, changes, alter its name – adding a date, or a version number, to the file name is a good idea.
  • Don’t change files without a good reason. If updating a site via FTP for example, upload only changed files/content, not all of them. Otherwise, file last-modified date will be “forcibly” changed, so even the unchanged files will most probably be downloaded again, even though they have been cached.
  • Use cookies only when they are necessary. They are difficult to cache and are generally needed only on dynamic pages.
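For the renaming rule, one way to avoid inventing version numbers by hand is to put a short hash of the file’s content into its name. This is a swapped-in variation on the date/version number idea above, sketched in Python with a hypothetical helper name.

```python
import hashlib
from pathlib import PurePosixPath


def versioned_name(filename: str, content: bytes, digits: int = 8) -> str:
    """Insert a short content hash before the extension.

    background.jpg becomes background.<hash>.jpg, and the hash (so also the
    name) changes only when the file's content changes.
    """
    path = PurePosixPath(filename)
    digest = hashlib.sha256(content).hexdigest()[:digits]
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))
```

An unchanged file keeps its name, so it stays cached, while a changed file gets a name no browser has seen before.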

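Cache control can also be defined per page, in the HTTP response headers sent with that page (HTML meta tags are a less reliable option, since proxies generally don’t read them). This is useful if cache methods differ for certain pages; otherwise it can be implemented once, on the server.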

If a script produces the same output for the same/similar input, it is a good candidate for caching, and caching should be implemented. This can be done by saving the script’s output into a plain file, so that the server can “see” when it has changed and refresh the cache, by setting a Max-Age, or by having the script validate changes itself (reading If-Modified-Since and responding with 304 Not Modified when appropriate).
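The change-check validator mentioned last could look roughly like this in Python. The function name is hypothetical, and I assume the script knows when its underlying data last changed and passes that in as a UTC datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Dict, Mapping, Tuple


def respond_if_modified(request_headers: Mapping[str, str], body: bytes,
                        last_modified: datetime) -> Tuple[int, Dict[str, str], bytes]:
    """Answer 304 when the visitor's copy is at least as new as the content.

    last_modified must be a timezone-aware UTC datetime.
    """
    last_modified = last_modified.replace(microsecond=0)  # HTTP dates have 1 s resolution
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    since = request_headers.get("If-Modified-Since")
    if since:
        try:
            if last_modified <= parsedate_to_datetime(since):
                return 304, headers, b""   # nothing new: no body is sent
        except (TypeError, ValueError):
            pass  # unusable date in the request: fall through to a full response
    return 200, headers, body


# Example: a post last edited on 20 March 2019, requested without any cached copy.
edited = datetime(2019, 3, 20, 12, 0, tzinfo=timezone.utc)
status, headers, payload = respond_if_modified({}, b"<html>post</html>", edited)
```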

A special category, related to WordPress web-sites, is caching plug-ins. They can make changes in header files and the .htaccess file in the WordPress install directory, taking care of caching “automatically” (some require/allow more customization, some don’t).

3.3. CDN cache implementation

CDN-s will respect cache setup from cache control headers on the host-server. They can allow for change of server-set cache expiry rules – increasing them using CDN-s’ caching set up panel. Ideally, CDN should enforce a (properly) set up caching policy of the host-server, spreading data to servers across the world, closer to visitors from around the globe.

Setting a longer cache expiration time on a CDN (than on the host-server) means that caching policy on the host-server isn’t optimally set up (otherwise there would be no need to “prolong” cache on the CDN). Similar goes for shorter CDN cache expiration (than on the host-server).


4. Content compression – zip and gZip

While images and videos are usually already compressed, text leaves a lot of room for “zipping”. The server can be configured to send compressed content to all the browsers that can handle such content. It makes sense to set this for text, html, javascript, css, xml files and the like: basically all the plain-text files.

Compressed files are a lot smaller and can be downloaded faster. Process of compression (server side) and decompression (client side) does put some load on the processor, but modern CPU-s can handle this relatively easily, while Internet connection speed (and bandwidth) is still often a bottleneck. So, the bottom line is that using compression increases website speed in most cases.
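To show the principle (not any particular server’s configuration), here is a Python sketch that compresses a response only when the browser announces gzip support and the content is text-like. The function name and the list of content types are my assumptions, and the Accept-Encoding check is simplified (it ignores quality values such as gzip;q=0).

```python
import gzip
from typing import Dict, Mapping, Tuple

# Text-like content types worth compressing; images and video are skipped.
COMPRESSIBLE = ("text/", "application/javascript", "application/xml")


def maybe_compress(request_headers: Mapping[str, str], content_type: str,
                   body: bytes) -> Tuple[Dict[str, str], bytes]:
    """Return (extra response headers, payload), gzip-compressed when appropriate."""
    accepts_gzip = "gzip" in request_headers.get("Accept-Encoding", "").lower()
    if accepts_gzip and content_type.startswith(COMPRESSIBLE):
        extra = {"Content-Encoding": "gzip", "Vary": "Accept-Encoding"}
        return extra, gzip.compress(body)
    return {}, body   # send unchanged: browser can't unzip, or not worth zipping
```

The Vary header is there so that proxies and CDN-s keep the compressed and uncompressed copies apart.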

Commands for setup depend on server type, so I won’t go into that. Links in the Sources chapter provide a more detailed explanation.


5. Minimization

All the code written on a web-site (html, css, JavaScript etc.) is written so that humans writing/reading/editing it can understand it more easily.  That looks something like this:

<style type='text/css'>
			#gallery-2 {
				margin: auto;
			}
			#gallery-2 .gallery-item {
				float: left;
				margin-top: 10px;
				text-align: center;
				width: 50%;
			}
			#gallery-2 img {
				border: 2px solid #cfcfcf;
			}
			#gallery-2 .gallery-caption {
				margin-left: 0;
			}
			/* see gallery_shortcode() in wp-includes/media.php */
		</style>

However, to a computer, comments, new lines, spacing, or indentation are needless (the last item is needless to some programmers as well, but that’s another topic 🙂 ).

Minimization (often called minification) is the removal of all the needless text (needless to a computer) and serving minimized copies of scripts to web-browsers. The minimized version of the code above looks like this:

<style type='text/css'>#gallery-2{margin:auto;}#gallery-2 .gallery-item{float:left;margin-top:10px;text-align:center;width:50%;}#gallery-2 img{border:2px solid #cfcfcf;}#gallery-2 .gallery-caption{margin-left:0;}</style>

For a computer, it is all the same, all clear. Minimized code is smaller, hence sent and downloaded faster. Minimization can be done after all the scripts are written (changing source files on the server), or by setting things up so that minimized cached copies are served, with the source code intact (for easier later modifications and troubleshooting).
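Just to illustrate what a minimizer does, here is a deliberately naive Python sketch for CSS. It is not what Cloudflare or any plugin actually runs, and it will mangle unusual CSS (for example strings that contain braces or colons), so treat it as a demonstration only.

```python
import re


def minify_css(css: str) -> str:
    """Naive CSS minimizer: strips comments and needless whitespace."""
    css = re.sub(r"/\*.*?\*/", "", css, flags=re.DOTALL)  # drop /* comments */
    css = re.sub(r"\s+", " ", css)                        # collapse whitespace runs
    css = re.sub(r"\s*([{};:,])\s*", r"\1", css)          # no spaces around punctuation
    return css.strip()
```

Spaces that carry meaning, such as the ones in “#gallery-2 img” or “2px solid #cfcfcf”, are kept, since they don’t sit next to any of the listed punctuation characters.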

Again, I won’t deal with implementation in this post.


6. What did I do on BikeGremlin web-site?

Since BikeGremlin is currently on a not very good shared hosting server and I’m planning to move it soon, I didn’t bother setting up the server.

I have used the free Cloudflare service, which provides a limited CDN and website protection. Cloudflare also handles code minimization, so that is taken care of that way for now. Also, protection from other sites embedding images from my site (using my server) was activated (“Hotlink protection”).

Caching, along with compression, was set up using the Hyper Cache Extended plugin. I’ll write a separate post about various WordPress site caching plugins. For now it suffices to say that this one works best for a relatively simple web-site on a relatively slow server, as is the case for me now. It implements a caching policy such that all the pages get a defined Max-Age duration (so they are saved in the visitor’s browser cache for that period), but with the addition of a Last-Modified field, so that in case a post is changed (and only then), the browser will download the updated version.

EDIT 2019:
A separate post explaining WordPress caching plugins and which one I am currently using: Caching a WordPress website.


7. Testing

The safest way to do the testing is locally, to make sure everything works before uploading it to a “live” web-site, changing one setting at a time, so that potential problems can easily be identified and solved.

GTmetrix tool gives data on caching, at least some basic ones (whether it is implemented, which objects aren’t cached, or have a very short cache expiration etc.). This is what a GTmetrix test result of a typical bikegremlin.com page looks like:

Test using GTmetrix tool

Google Analytics stats from the first two weeks in May and the first two weeks in September (after sorting out redirections, implementing caching and weeding out plugins – more on the last in a separate post):

May (first) and September (second) page load time stats from Google Analytics

– Relja Gonzales Novović


Sources
