Tag Archives: zfs

an independent review of ZFS

Some of the best reviews I’ve found online have been at AnandTech, covering everything from SSDs to displays and anything else computer related. Now AnandTech have done a review of ZFS.

From a quick read this morning they’ve done a thorough review of using ZFS both from OpenSolaris and Nexenta compared to their current Promise storage system.

I’m always impressed when folks take the time to understand, investigate, review and learn from testing. They even set up a website to document their progress and learnings: http://www.zfsbuild.com/ As they say: “a friendly guide for building ZFS based SAN/NAS solutions”.

Personally, I can only hope that now the ZFS lawsuit between NetApp and Sun/Oracle is settled, ZFS finds its way back into Mac OS X 😉

essbase on solaris – are you mad?

Several years ago Sun began a project to update and consolidate the business intelligence tools used internally, and we decided on Hyperion as we already had a variety of Hyperion tools: Brio and Essbase. This was a few years before the Oracle acquisition of Hyperion, and we wanted to run on Solaris.

This meant we were one of the first customers to run the Hyperion suite on Solaris. I frequently had conversations with other Hyperion experts along the lines of the title of this post, and was also told that Essbase was designed on Windows and would therefore run best on Windows.

While the original implementation had its difficulties, being one of the first on Sun hardware, software and operating system, it undoubtedly laid the groundwork for the recent Hyperion Essbase ASO World Record on the Sun SPARC Enterprise M5000.

The Oracle Essbase Aggregate Storage application employed for the benchmark was based on a real customer application in the financial industry with 13 million members in the customer dimension.

The benchmark system was an M5000 server with 4 x SPARC64 VII 2.53 GHz (quad core) processors and 64 GB RAM, running Solaris 10 update 8 and Oracle Essbase 11.1.1.3 (64-bit), combined with a Sun Storage F5100 Flash Array consisting of 40 x 22 GB flash modules in a single zpool.
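The whitepaper has the full storage configuration, but as a hedged sketch (device names here are invented; each F5100 flash module shows up as its own SAS target), a single striped pool over the modules is just one command with all forty devices listed:

# zpool create essbase c1t0d0 c1t1d0 c1t2d0 c1t3d0

with the remaining modules simply appended to the same command line.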

The benchmark compared 500 and 20,000 users and showed that usage-based aggregation improved response times, while adding the extra users showed similar performance with no signs of degradation in query timings.

What is interesting in the benchmark is that this seems to be one of the first to combine a variety of Oracle technologies and provide a benchmark for John Fowler to beat: Solaris ZFS, the SPARC Enterprise M5000 server, the Storage F5100 Flash Array and Essbase.

For more information check out the whitepaper here (PDF). Note: the BestPerf blog has an incorrect link since the recent update to the Oracle Technology Network. More details on Hyperion applications here.

focusing on the strengths

One thing I’ve noticed from the Oracle acquisition is the re-focusing on Sun’s strengths around the engineering talent that Sun had: Solaris, SPARC, servers, and technology integration and innovation. This talent also developed such cool things as ZFS, DTrace, the F5100 storage array, hybrid storage pools and unified storage.

As my MBA tutor tells me, one way to harness and move a company forward is to focus on the key strengths or core capabilities that an organization has. There can be a problem if you rely on core capabilities too much: they become core rigidities, which can be evidenced in Sun’s past, focusing too much on SPARC and proprietary servers, and at one point even dropping Solaris on x86.

It’s taken a while for some information to flow out, but in the last week two items have come out which show the ongoing work and strategies are there:
Oracle Solaris Podcasts
This is a new monthly podcast series hosted by Dan Roberts, giving a general update on Oracle Solaris including industry news, events and technology highlights. This episode features Bill Nesheim and Chris Armes, and provides an update of what’s been happening over the last few months and details on why Oracle Solaris is the best OS for x86-based servers: scalability, reliability and security. It also includes a brief overview of the new support offering for Oracle Solaris on third party x86 hardware.

Strategy for Oracle’s Sun Servers, Storage and Complete Systems: 9AM Tuesday, August 10, 2010. Join John Fowler, Executive Vice President, Systems, for a live update on the strategy and roadmap for Oracle’s Sun servers, storage and complete systems including Oracle Solaris.

Sign up here.

With some of these developments and others, the technology future certainly looks bright at Oracle.

SSD experiment

After some of my recent blogs on SSDs, I was excited to have in my hand an OCZ Vertex 60GB SSD. However, I neglected to remember that in my other hand I needed both SATA data and SATA power cables. After a trip to the local store, I came back with the required cables!

Rather than dive straight in, as is my normal habit, I knew I should do some planning and put some thought into the installation/upgrade process:

  • Remove old/unwanted programs and files.
  • Schedule the required downtime for installation, re-installation and copying of files and applications.
  • Read the manuals for the SSD, BIOS and HDD.

So after the required planning I was ready to go:

  1. Open up the workstation and move the SATA cables from the HDD to the SSD.
  2. Add new SATA cables and reconnect the HDD.
  3. Power up the workstation, enter the BIOS menu and turn off auto-detection of the HDD (this means the workstation will no longer boot from the HDD but rather choose the SSD).
  4. Install Windows: 20 minutes and 5 reboots later I had a working machine, ready to download the required updates.
  5. Install OpenSolaris: 20 minutes and 1 reboot later I had a working machine; as I had pulled the latest /dev image there were no updates and I was ready to work (a quick sanity check is shown below).
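A hedged sanity check for step 5 (pool and device names are from my machine and will differ on yours) is to confirm that the OpenSolaris root pool really lives on the new SSD and that both disks are visible:

# zpool status rpool
# format

zpool status rpool lists the device backing the root pool, and format prints every disk the system can see (just quit at the “Specify disk” prompt), so the OCZ drive should show up alongside the old HDD.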

First Impressions:

  • Wow, this is fast! No really, I mean this SSD is REALLY fast.
  • Boot times are now significantly faster: Windows = 30 seconds, OpenSolaris = 35 seconds
  • Applications are 30% to 80% faster to open.
  • Benchmarking results to follow, it seems there’s lots to consider from a Solaris perspective. Thanks to Lisa for her post.

UPDATE: Just a note to say that Windows automatically recognised the HDD and created drive letters E: and F:. On OpenSolaris, as I had previously created pools, it was a simple matter of entering the command:

zpool import -R /mnt tank

This mounted the pool and I was able to copy and use files as required. I love it when a plan comes together. You can also just enter the command “zpool import” without any options to discover all pools available for import.
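One extra (hedged) note from my own tidy-up: when the copying is finished, the pool can be cleanly detached again before unplugging the old HDD or rebooting:

# zpool export tank

zpool export flushes any outstanding data and marks the pool as exported, ready to be imported again later on this or another machine.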

These are very noticeable differences, although given the age of the workstation (October 2005) and its components, users with newer machines should expect even bigger performance increases:

Given the old nature of the components, the workstation is also limited to SATA v1, so 150 MB/s, or in reality around 130 MB/s. So I’m not really reaching the full capability of the OCZ Vertex SSD, which is rated at up to 230 MB/s read and up to 135 MB/s write.
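If you want to check whether the SATA link really is the ceiling, a rough and hedged test on OpenSolaris is to time a large raw sequential read from the SSD (c0t0d0 below is just a placeholder for the SSD’s device name; run as root, and note that reading the raw device avoids the ZFS cache):

# ptime dd if=/dev/rdsk/c0t0d0s2 of=/dev/null bs=1024k count=1000

That reads 1000 MB; dividing by the elapsed “real” time ptime reports gives an approximate MB/s figure, which, if the link is the limit, should land near the SATA v1 ceiling rather than the drive’s rated read speed.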

If you think SSDs could help your desktops, servers or applications, look at the following sites for more info:

As with everything in the computing world, SSDs are not standing still. OCZ announced at CES that their Vertex 2 Pro SSDs (2nd generation) are on schedule for Q1 2010 with the new SandForce controller, and have preliminary specs of 270 MB/s for both read and write. AnandTech have a preview here.

OCZ have also produced the first 1TB SSD, under the “Colossus” moniker. Other manufacturers are sure to push the limits too.

SSDs to the forefront

Following up my recent posts concerning SSDs and flash-based disks, there seems to be a growing understanding of the power of SSDs, and also some confusion over pricing and whether some are faster than others. I’ve compiled a summary of some other posts and info:

Are some SSDs faster/better than others? YES, and it starts with the cell memory: single-level cell (SLC) flash memory is better (and hence more expensive) than multi-level cell (MLC) flash memory. Then there are the other components that make up the SSD. From some recent reviews/blogs, Intel, Samsung, OCZ and RunCore seem to make some fast ones.

Check out this comprehensive AnandTech article and another recent article; both are very detailed.

Last.fm: Installed SSDs into a Sun Fire X4170 to massively increase the number of streaming customers served, from around 300 for a 7200 rpm SATA disk to 7000 for a 64 GB Intel X25-E SSD.

ZFS super charging: the L2ARC for random reads, and the ZIL for writes. OpenSolaris 2009.06 and Solaris 10 U6 with ZFS have super capabilities for very intelligent use of fast storage technology, especially when serving files. Thanks again to Brendan. Correction: while some items for ZFS were added to Solaris 10 update 6, it was only in the delivery of ZFS version 13 that it was complete; those changes made it into Solaris 10 update 8.
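As a concrete, hedged sketch of what that looks like (pool name tank and device names invented for illustration): the flash devices are simply added to an existing pool as cache and log vdevs.

# zpool add tank cache c1t2d0
# zpool add tank log c1t3d0

The cache device becomes the L2ARC and serves random reads, while the log device holds the ZIL and absorbs synchronous writes; zpool status tank then shows both listed under the pool.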

Setting up a mirrored SSD OpenSolaris system: A very comprehensive how-to guide for migrating from an existing system.

Making the most of your SSD: Easy steps to set up SSD on OpenSolaris –  thanks Arnaud.

Future of Flash: How flash storage is a disruptive technology for enterprises. Hal Stern, VP Global Systems Engineering @ Sun, hosts this very informative podcast.

Seeing the performance upgrades that others are getting out of flash makes me want to try it out to see the impact on my 4 year old desktop, which is still going strong (AMD 4200 Dual Core, 4GB Memory and 200GB HDD). Alternatively, if anyone has a 64GB SSD they’d like me to test I’d certainly appreciate it 😉

more innovation – ZFS Deduplication

When asked about Sun Microsystems, one word will always spring to the top of my mind: innovation

There is such a fantastic DNA in this company that looks to push boundaries and make things better – ok, we often do not get the message across well, but the effort and dedication shown by employees always makes me proud.

To emphasize this point again, there is great news as told by Jeff Bonwick earlier this week: "ZFS now has built-in deduplication".

Deduplication is a process to remove duplicate copies of data, whether it’s files, blocks or bytes.

It’s probably easier to explain with an example: suppose you have a database with company addresses. The location ‘London’ will exist for quite a few customers, so instead of storing this entry 100 times, there will be one entry and the other 99 will be references to the original. This saves space, and it saves lookup time as it’s likely that the referenced entry will already be loaded in cache.

How easy is it to set up?

Assuming you have a storage pool named ‘tank’ and you want to use dedup,
just type this:

zfs set dedup=on tank

There is more to it, so read Jeff’s blog for the whole story.

I’m guessing this should appear shortly in the OpenSolaris /Dev builds, which will feed into the next OpenSolaris release (2010.03) and possibly into a later Solaris 10 update. Once it’s released, I’ll try and run some tests to see the savings I get.
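Assuming dedup lands as Jeff describes, checking the savings should be as simple as asking the pool for its dedup ratio once some data has been written, something along these lines:

# zpool get dedupratio tank

dedupratio is a read-only pool property, reported as a multiplier (for example, 3.00x means the data would otherwise have taken three times the space).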

This should also feed into the FreeBSD project. Such a shame that OS X has dumped their ZFS project.

The secret flash sauce

After the announcements from Oracle OpenWorld and the new TPC benchmark, a lot of focus has been on Sun and the innovation DNA that drives the company. The announcements focus on flash and its increasing use in computing:

So what is the secret sauce in these? They are essentially data caches, made up of 96 GB (4 x 24 GB modules) of single-level cell NAND flash in the F20 card, and a staggering 1.92 TB (80 modules) in the F5100 flash array.

The F5100 Flash Array has 64 SAS lanes (16 x 4-wide ports), 4 domains and SAS zoning. It can perform 1.6M read IOPS and 1.2M write IOPS, with a bandwidth of 12.8 GB/sec.

This read IOPS figure is equivalent to 3,000 hard drives in 14 rack cabinets. The F5100 uses 1/100th of the space and power of such a collection of hard drives.

This is an amazing database accelerator for Oracle and MySQL. The unit can be zoned into 16 partitions, one for each of up to 16 hosts. The device can form part of a Sun ZFS hybrid storage pool, embracing solid state and hard disk drives.

Further Notes: Sequential Read = 9.7GB/sec; Read/Write Latency (1M transfers) = 0.41ms/0.28ms; Average Power 300 watts (Idle = 213W ; 100% = 386W).  More spec info here.

So if you have need to speed up your Databases, Storage grids, HPC computing or Financial modeling look at what flash SSDs can offer.

Download the Sun Flash Analyzer, install it on your server and see where SSDs can help accelerate system performance today.

It won’t be long before all computers come with flash as standard as either a separate or hybrid disk to speed up response times . . . OpenSolaris can already do this today with ZFS Storage Pools.

ZFS: Hybrid Storage Pools

There have been a few announcements recently (and more to come), and here’s one that can really be a game changer and enabler for future tech advances:

Hybrid Storage Pools (HSP) are a new innovation designed to provide superior storage through the integration of flash with disk and DRAM. Sun and Intel have teamed up to combine their technologies of ZFS and high performance, flash-based solid state drives (SSDs) to offer enterprises cutting-edge HSP innovation that can reduce the risk, cost, complexity, and deployment time of multitiered storage environments.

Sun’s ZFS

Sun’s ZFS file system transparently manages data placement, holding copies of frequently used data in fast SSDs while less-frequently used data is stored in slower, less expensive mechanical disks. The application data set can be completely isolated from slower mechanical disk drives, unlocking new levels of performance and higher ROI. This ‘Hybrid Storage Pool’ approach provides the benefits of high performance SSDs while still saving money with low cost high capacity disk drives.

Solaris ZFS can easily be combined with Intel’s SSDs by simply adding Intel Enterprise SSDs into the server’s disk bays. ZFS is designed to dynamically recognize and add new drives, so SSDs can be configured as a cache disk without dismounting a file system that is in use. Once this is done, ZFS automatically optimizes the file system to use the SSDs as high-speed disks that improve read and write throughput for frequently accessed data, and safely cache data that will ultimately be written out to mechanical disk drives.
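As a small, hedged illustration (pool name assumed to be tank): once an SSD has been added as a cache device, its use can be watched with the standard pool tools.

# zpool status tank
# zpool iostat -v tank 5

zpool status lists the SSD under its own cache heading, and zpool iostat -v reports capacity and bandwidth per vdev every 5 seconds, so you can watch read traffic shift onto the SSD as the cache warms up.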

Intel’s SSDs

Intel’s SSDs provide 100x I/O performance improvement over mechanical disk drives with twice the reliability:

  • One Intel Extreme SATA SSD (X25-E) can provide the same IOPS as up to 50 high-RPM hard disk drives (HDDs) — handling the same server workload in less space, with no cooling requirements and lower power consumption.
  • Intel High-Performance SATA SSDs deliver higher IOPS and throughput performance than other SSDs while drastically outperforming traditional hard disk drives. Intel SATA SSDs feature the latest-generation native SATA interface with an advanced architecture employing 10 parallel NAND Flash channels equipped with the latest generation (50nm) of NAND Flash memory. With powerful Native Command Queuing to enable up to 32 concurrent operations, Intel SATA SSDs deliver the performance needed for multicore, multi-socket servers while minimizing acquisition and operating costs.
  • Intel High-Performance SATA SSDs feature sophisticated “wear leveling” algorithms that maximize SSD lifespan, evening out write activity to avoid flash memory hot-spot failures. These Intel drives also feature low write amplification and a unique wear-leveling design for higher reliability, meaning Intel drives not only perform better, they last longer. The result translates to a tangible reduction in your TCO and dramatic improvements to system performance.

Benefits of HSP

Architectures based on HSP can consume 1/5 the power and cost 1/3 as much as standard monolithic storage pools while providing maximum performance.

For example, if an application environment with a 350 GB working set needs 30,000 IOPS to meet service level agreements, 100 15K RPM HDDs would be needed. If the drives are 300GB, consume 17.5 watts, and cost $750 each, this traditional environment provides the IOPS needed, has 30TB capacity, costs $75,000 to buy, and consumes 1.75 kW of power.

Using a Hybrid Storage Pool, six 64 GB SSDs (at $1,000 each) provide the 30,000 IOPS required, and hold the 350GB working set. Lower cost, high-capacity drives can be used to store the rest of the data; 30 1TB 7200 RPM drives, at $689 each ($20,670) and consuming 13 watts each, provide cost-effective HDD storage. The savings are dramatic:

  • Purchase cost is $26,670, a 64-percent savings
  • Electricity consumed is 0.392 kW, a 77-percent savings
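For anyone checking the arithmetic on the purchase figure: 6 SSDs x $1,000 = $6,000, plus 30 HDDs x $689 = $20,670, for $26,670 in total, and ($75,000 - $26,670) / $75,000 is roughly 64 percent, matching the savings quoted above.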

Link to docs:

Solaris ZFS Enables Hybrid Storage Pools – Shatters Economic and Performance Barriers

UPDATE: Brendan from the Fishworks team has posted some speed and performance notes here

upgrade to all zfs root

Now that I’m successfully running a zfs root, I don’t need my old ufs root anymore, so it should be a simple matter of removing the old ufs boot environment and increasing the size of the new zfs root pool.

Right? Well no, actually there seems to be a bug or three.

# lustatus
Boot Environment           Is       Active Active    Can    Copy     
Name                       Complete Now    On Reboot Delete Status   
-------------------------- -------- ------ --------- ------ ----------
snv_98                     yes      no     no        yes    -
snv_102                    yes      yes    yes       no     -
# ludelete snv_98
System has findroot enabled GRUB
Checking if last BE on any disk…
BE <snv_98> is not the last BE on any disk.
Updating GRUB menu default setting
Changing GRUB menu default setting to <3>
ERROR: Failed to copy file </boot/grub/menu.lst> to top level dataset for BE <snv_98>
ERROR: Unable to delete GRUB menu entry for deleted boot environment <snv_98>.
Unable to delete boot environment.

This is CR6718038/CR6715220/CR6743529. A quick workaround would be to edit /usr/lib/lu/lulib and replace the following in line 2937:
lulib_copy_to_top_dataset "$BE_NAME" "$ldme_menu" "/${BOOT_MENU}"
with
lulib_copy_to_top_dataset `/usr/sbin/lucurr` "$ldme_menu" "/${BOOT_MENU}"
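If you’d rather not open the file in an editor, an untested (use at your own risk) shell equivalent is to back the file up and let sed make the substitution; note it replaces every occurrence of the pattern, so check the result before rerunning ludelete:

# cp /usr/lib/lu/lulib /usr/lib/lu/lulib.orig
# sed 's|lulib_copy_to_top_dataset "$BE_NAME"|lulib_copy_to_top_dataset `/usr/sbin/lucurr`|' /usr/lib/lu/lulib.orig > /usr/lib/lu/lulib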

then rerun the ludelete:

# ludelete snv_98
System has findroot enabled GRUB
Checking if last BE on any disk…
BE <snv_98> is not the last BE on any disk.
Updating GRUB menu default setting
Changing GRUB menu default setting to <3>
Saving existing file </boot/grub/menu.lst> in top level dataset for BE <snv_102> as <mount-point>//boot/grub/menu.lst.prev.
File </boot/grub/menu.lst> propagation successful
File </etc/lu/GRUB_backup_menu> propagation successful
Successfully deleted entry from GRUB menu
Determining the devices to be marked free.
Updating boot environment configuration database.
Updating boot environment description database on all BEs.
Updating all boot environment configuration databases.
Boot environment <snv_98> deleted.
#

Then I needed to remove the old ufs boot and swap slices; old and new layouts below:

partition> print
Current partition table (original):
Total disk cylinders available: 12047 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders         Size            Blocks
  0       root    wm       3 -  1962       15.01GB    (1960/0/0)   31487400
  1       swap    wu    1963 -  2224        2.01GB    (262/0/0)     4209030
  2     backup    wm       0 - 12046       92.28GB    (12047/0/0) 193535055
  3 unassigned    wm    2225 -  4182       15.00GB    (1958/0/0)   31455270
  4 unassigned    wu       0                0         (0/0/0)             0
  5 unassigned    wu       0                0         (0/0/0)             0
  6 unassigned    wm    4183 -  6793       20.00GB    (2611/0/0)   41945715
  7       home    wm    6794 - 12046       40.24GB    (5253/0/0)   84389445
  8       boot    wu       0 -     0        7.84MB    (1/0/0)         16065
  9 alternates    wu       1 -     2       15.69MB    (2/0/0)         32130

partition>

partition> print
Current partition table (original):
Total disk cylinders available: 12047 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders         Size            Blocks
  0       root    wm       3 -  4182       32.02GB    (4180/0/0)   67151700
  1 unassigned    wu       0                0         (0/0/0)             0
  2     backup    wm       0 - 12046       92.28GB    (12047/0/0) 193535055
  3 unassigned    wu       0                0         (0/0/0)             0
  4 unassigned    wu       0                0         (0/0/0)             0
  5 unassigned    wu       0                0         (0/0/0)             0
  6 unassigned    wm    4183 -  6793       20.00GB    (2611/0/0)   41945715
  7       home    wm    6794 - 12046       40.24GB    (5253/0/0)   84389445
  8       boot    wu       0 -     0        7.84MB    (1/0/0)         16065
  9 alternates    wu       1 -     2       15.69MB    (2/0/0)         32130

partition>
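As a hedged aside, a quick way to double-check the resized slice without going back into format is to dump the label (device name from my system):

# prtvtoc /dev/rdsk/c2t1d0s2

prtvtoc prints the VTOC, so slice 0 should now show the larger sector count before the root pool is expected to grow into it.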

The size of my pools before:

# zpool list
NAME       SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
rootpool    15G  8.54G  6.46G    56%  ONLINE  -
tank        40G  5.65G  34.3G    14%  ONLINE  -
tank2     19.9G   652K  19.9G     0%  ONLINE  -
#

Then reboot, oops!!!

It just drops to the grub> prompt, because my old ufs slice held all the boot info and I just deleted that, so it can’t find any . . . but this is a simple process to restore, as long as you have a recent DVD image handy (so it can recognise and mount the zfs pool).

Insert and boot from the dvd image, select single user mode.

Mount the rootpool as r/w on /a (it should prompt automatically for this).
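If it doesn’t prompt, the manual equivalent should be roughly the following (pool and boot environment names are from my setup, so adjust as needed):

# zpool import -f -R /a rootpool
# zfs mount rootpool/ROOT/snv_102

This imports the root pool with an alternate root of /a and mounts the boot environment’s root dataset underneath it.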

At the command prompt, type:

installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c2t1d0s0

then reboot. I love it when a plan comes together. Pool sizes after the reboot:

# zpool list
NAME       SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
rootpool    32G  8.54G  23.5G    26%  ONLINE  -
tank        40G  5.54G  34.5G    13%  ONLINE  -
tank2     19.9G   722K  19.9G     0%  ONLINE  -
#