Creating a highly available file server cluster for a web farm using Ubuntu 12.04 LTS

Following on from my previous post about setting up a highly available web server cluster, this post covers the next step: setting up a highly available file server cluster. After some research I’ve decided to use GlusterFS. GlusterFS is an open source, distributed file system capable of scaling to several petabytes (actually, 72 brontobytes!) and handling thousands of clients. GlusterFS clusters together storage building blocks over Infiniband RDMA or TCP/IP interconnect, aggregating disk and memory resources and managing data in a single global namespace. GlusterFS is based on a stackable user space design and can deliver exceptional performance for diverse workloads.

So, using Ubuntu 12.04 LTS server again, ensure that both servers (in my set-up these are named storage1 and storage2) can resolve each other’s hostnames, either via DNS or by adding them to their /etc/hosts files (like I did at the start of my last post).
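For example, assuming the storage servers sit at 172.25.87.190 and 172.25.87.191 (these addresses are just placeholders, substitute your own), the /etc/hosts entries on each storage server would look something like this:

172.25.87.190   storage1
172.25.87.191   storage2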

I’ve personally logged into my servers as ‘root’, but if you are using a standard user instead, ensure that you prefix all of the commands below with sudo so that you have the appropriate permissions to make the necessary changes.

Setting up the Gluster servers

So the first step is to actually install the GlusterFS server software onto both of your storage servers. To do this, simply run the following command on each of them:-

apt-get install glusterfs-server

Once that is complete, check that GlusterFS Server is installed and confirm the version number like so:-

glusterfsd --version

It should display some version information; at the time of writing, the version that I have installed is 3.2.5.

Now, on storage1, run the following command; this will add storage2 to the trusted storage pool:

gluster peer probe storage2

Now let’s take a look at the status of the storage pool:-

gluster peer status

It should look as follows:-

Number of Peers: 1

Hostname: storage2
Uuid: 7cd93007-fccb-4fcb-8063-133e6ba81cd9
State: Peer in Cluster (Connected)

Let’s now create the storage share; we’ll name this clusterdata and we’ll configure it to have 2 replicas (please note that the number of replicas is equal to the number of servers in this case because we want to set up mirroring) on storage1 and storage2 in the /data directory (this will be created if it doesn’t exist):

gluster volume create clusterdata replica 2 transport tcp storage1:/data storage2:/data

After executing the above command, you should see confirmation that the volume has been created. Now start the volume using the following command:

 gluster volume start clusterdata

If the above command does not report that it started successfully, make sure you restart the GlusterFS server on both of your storage servers (storage1 and storage2) and then try the above command again. The command to restart the GlusterFS server is as follows:-

/etc/init.d/glusterfs-server restart

You should now have the GlusterFS server up and running, and both servers should be replicating to each other; you can use the following command to check the status of your storage bricks:-

gluster volume info

It should show the details of the replication as follows:-

 Volume Name: clusterdata
 Type: Replicate
 Status: Started
 Number of Bricks: 2
 Transport-type: tcp
 Bricks:
 Brick1: storage1:/data
 Brick2: storage2:/data

Fantastic, we’re nearly there! – Now we just need to configure the web servers to use the share so that they serve data from a shared storage pool.

Securing access to the cluster share

By default, all clients can connect to the volume. If you want to grant access only to the web servers in our web server farm, e.g. web1, web2 and web3 (with IP addresses 172.25.87.192, 172.25.87.193 and 172.25.87.194), we need to run:

gluster volume set clusterdata auth.allow 172.25.87.192,172.25.87.193,172.25.87.194

We only need to run the above command on storage1, as the settings are automatically replicated to our second storage server too! (…yeah pretty neat huh :))

Please note that it is possible to use wildcards for the IP addresses (like 172.25.*).
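For example, if you wanted to allow anything in the 172.25 range rather than listing each web server individually, you could run something like this instead (we’ll stick with the explicit list for the rest of this tutorial):

gluster volume set clusterdata auth.allow 172.25.*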

The volume info (running gluster volume info) should now show the updated status (see list of allowed servers below):

Volume Name: clusterdata
 Type: Replicate
 Status: Started
 Number of Bricks: 2
 Transport-type: tcp
 Bricks:
 Brick1: storage1:/data
 Brick2: storage2:/data
 Options Reconfigured:
 auth.allow: 172.25.87.192,172.25.87.193,172.25.87.194

Great, we are now ready to configure our web servers to mount and serve data via our shared, highly available storage cluster…

Configuring our Web Servers to connect to the shared storage cluster

Now, before we do the configuration on the web servers, ensure that you add entries to /etc/hosts on your web servers so that they can resolve the storage servers (you obviously don’t need to do this if you’re using DNS and the names resolve to the server IP addresses without any issues).

On each of the web servers (as set up in my previous tutorial), we need to install the GlusterFS client by running the following command:

apt-get install glusterfs-client

Then we create the following directory:

mkdir /mnt/storagecluster

That’s it! Now we can mount the GlusterFS filesystem to /mnt/storagecluster with the following command:

mount.glusterfs storage1:/clusterdata /mnt/storagecluster

It’s worth noting that instead of storage1 you can just as well use storage2 in the above command; either will work fine.

You should now see the new share in the output of…

mount

The output of running the ‘mount’ command should look something like this:

/dev/mapper/web1-root on / type ext4 (rw,errors=remount-ro)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
fusectl on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
udev on /dev type devtmpfs (rw,mode=0755)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
none on /run/shm type tmpfs (rw,nosuid,nodev)
/dev/sda1 on /boot type ext2 (rw)
storage1:/clusterdata on /mnt/storagecluster type fuse.glusterfs (rw,allow_other,default_permissions,max_read=131072)

and running…

df -h

should show something like:-

Filesystem Size Used Avail Use% Mounted on
/dev/mapper/web1-root
 29G 1.1G 27G 4% /
udev 238M 4.0K 238M 1% /dev
tmpfs 99M 212K 99M 1% /run
none 5.0M 0 5.0M 0% /run/lock
none 247M 0 247M 0% /run/shm
/dev/sda1 228M 24M 193M 11% /boot
storage1:/clusterdata
 29G 1.1G 27G 4% /mnt/storagecluster

Instead of mounting the GlusterFS share manually on each of the web servers when they are restarted, you could modify /etc/fstab so that the share gets mounted automatically when the client boots.

So now we need to open up ‘/etc/fstab’ with a text editor (I generally use Nano):-

nano /etc/fstab

Now add the following line to the end of the file:

storage1:/clusterdata /mnt/storagecluster glusterfs defaults,_netdev 0 0

Now save the file and restart the web server. Once it has restarted, log in and try to list the contents of /mnt/storagecluster like so:-

ls -l /mnt/storagecluster
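Alternatively, if you’d rather verify the fstab entry without a full reboot, you should be able to unmount the share and remount everything listed in fstab like so:

umount /mnt/storagecluster
mount -a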

Great! – The last thing to do is to set Nginx to serve content from the shared cluster. For this simple test, I’ve decided to symlink the clusterdata mount to a subfolder in /usr/share/nginx/www/:

cd /usr/share/nginx/www/
ln -s /mnt/storagecluster clustertest

Quickly create a new file in /usr/share/nginx/www/clustertest/ named ‘index.html’ with the following content (as this directory now lives on the shared cluster, you only need to do this on one of the web servers):-

<h1>This is served from the shared storage cluster</h1>
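If you’d rather create the file from the command line, something like the following (run on any one of the web servers) should do the trick:

echo '<h1>This is served from the shared storage cluster</h1>' > /usr/share/nginx/www/clustertest/index.html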

Now load up your web browser and let’s access our web server using: http://192.168.87.180/clustertest/

You should see a page saying ‘This is served from the shared storage cluster’ – and if you have all three of your web servers online, each one (when refreshing the page) should be serving the file from the storage cluster!

Wooohooo! – Now we can test the high availability of our storage cluster.

Testing the high availability of our new storage cluster

To be completed!

Other GlusterFS configurations

Although in this tutorial I set up two servers in a ‘mirror’ (basically a RAID 1 configuration), there are other configuration types you could use, and you can easily add other servers to your cluster too! You can also ‘stripe’ data across your servers, but for best results you should use striped volumes only in high-concurrency environments accessing very large files.
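As a rough sketch (the volume name ‘stripeddata’ and the brick paths here are just examples, not something we set up above), a two-way striped volume across the same pair of servers could be created along these lines:

gluster volume create stripeddata stripe 2 transport tcp storage1:/stripedata storage2:/stripedata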

See: http://gluster.org/community/documentation/index.php/Gluster_3.1:_Creating_New_Volumes and http://gluster.org/community/documentation/index.php/GlusterFS_Concepts

Firewall ports

If your servers are behind a firewall, it’s worth noting that GlusterFS needs the following ports open on both servers to work correctly (see the example after this list):-

  • TCP 111
  • TCP 24007
  • TCP 24008
  • TCP 24009 (+ number of bricks across all volumes)
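
As a rough sketch, assuming ufw is the firewall in use (adjust accordingly if you are using iptables or a hardware firewall), opening these ports on both storage servers would look something like this; 24007:24010 covers the two bricks in this set-up, so widen the range if you add more bricks:

ufw allow 111/tcp
ufw allow 24007:24010/tcp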