YAFFS Robustness and Testing
Cheaper CPUs and flash memory have driven up the functionality of embedded systems, and this added functionality often requires file system storage.
The original approach to flash file storage was to use a flash translation layer (FTL), a driver that makes flash memory appear as a disk drive, in conjunction with a disk-oriented file system such as FAT. This approach has various performance and robustness limitations, which led to the development of file systems designed specifically for flash memory. These are known as flash file systems.
Flash memory has various limitations when compared with a disk. For example, flash memory pages cannot be individually rewritten; instead, a whole block must be erased and rewritten with the modified page. Typical FTLs achieve this with various logical-to-physical mapping schemes. Emulating disk-like behaviour on flash memory therefore adds an extra layer of software. This slows down write performance and adds extra state which is prone to corruption due to power failure.
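To make the cost of rewriting concrete, here is a minimal Python sketch of changing one page in a simulated erase block. The block geometry (64 pages per block) and every name in it are assumptions made for illustration; it does not model any real FTL.

```python
PAGES_PER_BLOCK = 64  # assumed geometry, for illustration only


class EraseBlock:
    """Toy model of one flash erase block."""

    def __init__(self):
        self.pages = [None] * PAGES_PER_BLOCK  # None means erased
        self.erase_count = 0
        self.program_count = 0

    def program(self, index, data):
        # A flash page can only be programmed while it is in the erased state.
        if self.pages[index] is not None:
            raise ValueError("page must be erased before it is programmed")
        self.pages[index] = data
        self.program_count += 1

    def erase(self):
        self.pages = [None] * PAGES_PER_BLOCK
        self.erase_count += 1


def rewrite_page_in_place(block, index, data):
    """Change one page the disk-like way: save, erase, reprogram everything."""
    saved = list(block.pages)
    saved[index] = data
    block.erase()
    for i, page in enumerate(saved):
        if page is not None:
            block.program(i, page)


block = EraseBlock()
for i in range(PAGES_PER_BLOCK):
    block.program(i, f"v1-{i}")
before = block.program_count
rewrite_page_in_place(block, 3, "v2-3")
pages_reprogrammed = block.program_count - before
```

In this toy model, changing a single page costs one block erase and 64 page programs. A power failure between the erase and the final program would also lose pages that were never being modified, which is the kind of extra fragile state described above.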
In the traditional FTL-based solution, the FAT file system always stores its file allocation tables in the same disk blocks. Since flash memory blocks can endure only a limited number of write/erase cycles, the FTL must move the physical location of these blocks around to prevent the high-traffic blocks from wearing out. This requires wear levelling, which is a further burden.
Some file systems (FSs) try to mitigate the performance penalties by caching a lot of data to reduce writes. While this can reduce the apparent write time, it makes the FS prone to data loss in the event of a power failure.
But flash memory also has advantages when compared to rotating media. There is no spin up time, making it more responsive. There is no read head so there is no read penalty if data is fragmented. Flash FSs can be designed to exploit these advantages.
A true flash file system (FFS) [1], on the other hand, is a single body of code designed with the features and limitations of flash in mind. FFSs can avoid using unsuitable techniques such as allocation tables and in-place rewriting.
Yaffs is an FFS originally designed specifically for NAND flash, and it has also been used with great success on NOR flash. Yaffs development started in 2001, when 32 Mbytes of flash was considered large. Since then, yaffs has been expanded to work properly with different flash types and far larger memory arrays (many Gbytes).
Yaffs was originally deployed in Linux, but was written in an OS-neutral way to facilitate modular development and testing. This turned out to be convenient when we ported it to other OSs such as WinCE, eCos, pSOS, ThreadX and VxWorks.
Since then Yaffs has been used in many application areas including point-of-sale equipment, telecoms equipment (including Android phones), industrial control systems and others, many of them critically dependent on reliability for their value.
2 Flash File System considerations
There are many criteria that are important when selecting an FFS. These are some that our customers have found to be most important:
Robustness: It really is not worth having a file system that loses or corrupts data and prevents the system from working. Some flash file systems are only robust when synced (flushed) after files have been written. That requires extra code, slows the system and leaves the file system in a non-deterministic state until the sync has been completed. Some file systems need to perform disk repair operations (e.g. check disk) in the event of a power failure. This can add considerable time to the start-up, delaying system operation.
Performance: In many embedded system designs, slow read/write performance can delay tasks and cause system degradation or even failure. It is thus important that read/write performance be acceptable. Note that many FFSs need to perform extra actions such as garbage collection. In some cases these actions can cause the file system to stall for a long time. It is thus important that these factors be taken into consideration. Performance can sometimes be improved by adding a write caching layer. These caching layers make the file system calls (write etc) proceed quickly, without actually writing the data to flash. This is risky as data in the cache will be lost on a power failure and syncing the file system will take a long time. Thus, most flash file systems offer either speed or robustness – not both.
Proven: It is very easy for a file system vendor to provide a list of claims saying “100% power safe” or similar, but what evidence do they have to back this up? All file systems, including flash file systems, are complex bodies of software with extremely complex state. They need very significant testing to prove that they work correctly.
Portable: Software is a huge investment, and it is increasingly important to be able to port the software between OSs and CPU architectures. Most flash file systems are limited to a single OS, which makes it difficult to migrate software between different platforms. The portability of yaffs extends to being able to seamlessly migrate the core yaffs code into a test framework (which enhances testing) and being easy to adapt to fresh platforms.
3 How YAFFS achieves robustness and performance
Yaffs was designed from the ground up specifically to work well with NAND flash. The author had already developed another FFS and was familiar with the design challenges of flash.
Yaffs is not just a disk-oriented FS adapted to work with flash, so it is not surprising that yaffs is very different in design from most other file systems. The key difference is that yaffs is a log-structured file system.
In essence, a log structured file system writes file system changes as a sequential log. The log structure is particularly suitable because:
All writes are at the end of the log. When a file is modified, there is no need to erase and rewrite a part of the flash. Changes are just appended to the log. Thus there is no need to erase and copy old data just to change some existing pages in a file. This tends to make writes much faster. There is no need to do a lot of caching to achieve performance. Data can be written to the log immediately meaning that sync time is very low. Data robustness is improved dramatically.
There are no allocation tables; this reduces the amount of data being written to the flash, and there are no tables to corrupt.
Since all writes are to the end of the log there are no “high traffic” blocks. This means yaffs only needs a simple wear levelling strategy.
In the event of a power failure, the file system state is easily re-created from the log. This means there is no need to perform disk repair operations to correct for a bad shutdown.
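The properties listed above can be illustrated with a minimal Python sketch of an append-only log and the replay step that rebuilds state after a restart. The record layout (object id, chunk number, data) and all names are simplifications assumed here; this is not the actual yaffs on-flash format.

```python
class LogStore:
    """Toy append-only log: each record is (object_id, chunk_no, data)."""

    def __init__(self):
        self.log = []  # stands in for flash pages written in sequence

    def write(self, object_id, chunk_no, data):
        # A modification never touches old records; it only appends a new one.
        self.log.append((object_id, chunk_no, data))


def replay(log):
    """Rebuild the current file contents by scanning the log in order."""
    state = {}
    for object_id, chunk_no, data in log:
        state.setdefault(object_id, {})[chunk_no] = data  # latest record wins
    return state


store = LogStore()
store.write("a.txt", 0, "hello")
store.write("a.txt", 1, "world")
store.write("a.txt", 0, "HELLO")  # rewrite chunk 0 by appending

# Simulate a power failure after any prefix of the log, then a remount.
after_two = replay(store.log[:2])
after_all = replay(store.log)
```

Whatever prefix of the log reached the flash before the interruption, replaying it yields a consistent state without any separate table to repair. The superseded first record for chunk 0 is simply ignored on replay and becomes space to reclaim later.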
So why don’t we see many log-structured file systems for disks? The answer is that log-structured file systems tend to spread writes around the media. That is typically very problematic for mechanical storage, which must physically move the read/write head from one location to another, making access very slow. That does not apply to flash memory. By designing specifically for flash, yaffs can write fast.
Many FFSs need to compromise robustness and performance. Thanks to the log structure, yaffs does not need to; it can provide high performance without giving up on robustness.
Those familiar with log-structured file system design will know that some log-structured file systems have problems with garbage collection (GC) [3]. Garbage collection cleans up the log and makes more free space available. Some log-structured file system designs did not pay enough attention to GC and can stall for a considerable time while it happens. Yaffs was designed differently. The potential impact of GC was considered throughout the design process. As a result, yaffs has a very simple GC model that allows a lot of flexibility in scheduling. This prevents the GC from making the file system slow and unresponsive.
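As a rough illustration of what garbage collection has to do, the Python sketch below reclaims one block per call by copying its live pages to the end of the log and erasing it. The victim selection rule (fewest live pages first) and all names are assumptions made for the sketch; the source does not describe the policy yaffs actually uses.

```python
def live_pages(block):
    """Pages in a block that still hold current data (None marks stale)."""
    return [p for p in block if p is not None]


def collect_one(blocks, append):
    """Reclaim one block: copy its live pages to the log end, then erase it.

    Returns the number of pages copied, or None if there was nothing to do.
    """
    dirty = [i for i, b in enumerate(blocks)
             if b and len(live_pages(b)) < len(b)]
    if not dirty:
        return None
    victim = min(dirty, key=lambda i: len(live_pages(blocks[i])))
    moved = live_pages(blocks[victim])
    for page in moved:
        append(page)          # live data is rewritten at the end of the log
    blocks[victim] = []       # the whole block is now erased and free
    return len(moved)


blocks = [
    ["a0", None, None, "a3"],   # 2 live, 2 stale
    [None, None, None, "b3"],   # 1 live, 3 stale: cheapest to reclaim
    ["c0", "c1", "c2", "c3"],   # fully live, not worth collecting
]
log_end = []
copied = collect_one(blocks, log_end.append)
```

Because each call does a small, bounded amount of work, a caller could run it when the system is idle or interleave it with writes rather than stalling for one long clean-up pass.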
Most file systems are developed within the context and framework of a single OS. This typically makes the code difficult to explore, debug and port. Yaffs on the other hand was developed from the start inside a portable development/testing framework. Yaffs is then ported to various operating systems through the addition of glue code.
There are many benefits to this approach:
A development framework runs in user space which provides a richer development environment than inside an OS kernel. It is much easier to attach debuggers and perform logging and create reproducible testing.
The code is structured to be portable. That makes it relatively easy to port to a new OS or RTOS with a high level of confidence.
There is a choice of porting interfaces, allowing more flexibility to determine which porting method will be simpler. For example, the Linux wrapper accesses the yaffs core directly while the Windows CE wrapper accesses the Yaffs Direct Interface.
Each different environment provides different test tools. For example, Linux provides a raft of file system test tools that do different kinds of testing from the power fail test harness. Since the vast majority of the code is in the portable yaffs core, that code gets the advantage of all test methods combined. The many millions of Android phone users that use yaffs-based devices are helping test the core code that is also used on a wide variety of different operating systems and a wide range of different device types.
Its Open Source status means any interested person can take a version from git and test it to make sure it meets their needs.
It is all very well to design software to be robust, but it needs thorough testing to build the confidence of users in the author and the supplier.
File systems have incredibly complex state so it is not at all surprising that OS designers find FSs some of the most challenging code to test.
There is no single test methodology that can test all cases, and exhaustive state testing is impossible. The best we can do is to have a suite of test methodologies that combine to provide thorough testing.
Approximately 60% of yaffs development effort is now directed towards testing and improving the test environment. We are constantly researching new methodologies.
The following are descriptions of the most important test strategies used in yaffs.
Power Fail Stress Testing
By far the most important goal of yaffs testing is to verify that yaffs is robust against corruption caused by power failure. We want yaffs to provide the basis for reliable products in typical embedded systems.
The first power fail testing was done by companies integrating yaffs into their products. Some of these built test jigs which would automatically interrupt power while running a real hardware device. This approach is better than nothing, but has some limitations:
It is relatively costly to set up. Flash has a limited lifetime meaning that test devices wear out with time and must be replaced.
It is relatively slow since each cycle takes many seconds to boot, run the application for a while and then reboot.
Many, and perhaps most, of the power interruptions will happen at times when the file system is inactive and thus be wasted cycles.
In 2008 we developed a power-fail-test environment. This simulates power failures using simulated flash memory while running various consistency checks.
The benefits of a software-based test harness are numerous:
The test harness controls the point at which the simulated power failures happen. That means every test cycle counts.
No special hardware is required.
Test loops run faster; a quad-core computer can simulate ten to twenty power failures per second. At 86,400 seconds per day, that is roughly 0.9 to 1.7 million power failure simulations per day.
The effectiveness of the new software-based testing is dramatic. A few times we have tested some user-suggested changes that had passed a week or more of real-world power fail testing only to have problems caught by the software simulation in a matter of minutes.
We regularly run the power fail simulation test over a weekend, testing millions of power fail cycles.
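The idea of a harness that chooses the failure point can be sketched in a few lines of Python. This is a toy model under stated assumptions: the simulated flash, the workload, the log format and the consistency rule are all hypothetical and far simpler than the real yaffs test environment.

```python
class PowerFail(Exception):
    """Raised by the simulated flash when the power is cut."""


class SimFlash:
    def __init__(self, fail_after):
        self.pages = []
        self.fail_after = fail_after  # number of page writes before the cut

    def write_page(self, record):
        if len(self.pages) >= self.fail_after:
            raise PowerFail()
        self.pages.append(record)


def workload(flash):
    """Write three versions of two files, one page per write."""
    for version in range(3):
        for name in ("a", "b"):
            flash.write_page((name, version))


def mount(pages):
    """Rebuild state from the log: the latest record for each file wins."""
    state = {}
    for name, version in pages:
        state[name] = version
    return state


def run_power_fail_cycles(total_writes):
    failures = 0
    for cut_point in range(total_writes + 1):
        flash = SimFlash(fail_after=cut_point)
        try:
            workload(flash)
        except PowerFail:
            failures += 1
        state = mount(flash.pages)
        # Consistency checks: exactly the writes before the cut survive, and
        # file "b" is never ahead of "a" or more than one version behind it.
        assert len(flash.pages) == min(cut_point, total_writes)
        a, b = state.get("a", -1), state.get("b", -1)
        assert b <= a <= b + 1
    return failures


failures_simulated = run_power_fail_cycles(total_writes=6)
```

Every cycle cuts power at a different write, so no cycle is wasted on an idle file system, and each remount is checked immediately. The final cycle places the cut point beyond the last write, which exercises the clean path as a control.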
Testing the Yaffs Direct Interface API
Yaffs Direct Interface (YDI) is a POSIX-like wrapper around the yaffs core. This provides a set of function calls: yaffs_open(), yaffs_write() and the like.
Each function can return numerous error codes and has to handle numerous different parameters.
Although there were already significant tests for the YDI, an extensive test harness was developed in 2010 to comprehensively test each of the error paths and generate all of the required errors.
This provides a high level of confidence that the tested interfaces do what they should and that the POSIX-like functionality provides relatively clean porting of existing code.
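A table-driven test is one common way to cover error paths like these. The Python sketch below applies the pattern to a toy in-memory wrapper; ToyFs, its methods, its flag values and its name length limit are all hypothetical stand-ins, not the YDI functions or their real behaviour. Only the error constants come from Python's standard errno module.

```python
import errno

O_CREAT, O_EXCL = 0x1, 0x2   # toy flag values, not the real ones
MAX_NAME = 255               # assumed name length limit


class ToyFs:
    """Hypothetical POSIX-like wrapper: returns -1 and records an error code."""

    def __init__(self):
        self.files = {}
        self.handles = {}
        self.last_error = 0

    def _fail(self, code):
        self.last_error = code
        return -1

    def open(self, path, flags=0):
        if len(path) > MAX_NAME:
            return self._fail(errno.ENAMETOOLONG)
        if path not in self.files:
            if not flags & O_CREAT:
                return self._fail(errno.ENOENT)
            self.files[path] = b""
        elif flags & O_CREAT and flags & O_EXCL:
            return self._fail(errno.EEXIST)
        handle = len(self.handles)
        self.handles[handle] = path
        return handle

    def write(self, handle, data):
        if handle not in self.handles:
            return self._fail(errno.EBADF)
        self.files[self.handles[handle]] += data
        return len(data)


def run_error_path_table():
    fs = ToyFs()
    fs.open("/exists", O_CREAT)
    # Each row: (description, call, expected error code)
    table = [
        ("open missing file", lambda: fs.open("/missing"), errno.ENOENT),
        ("exclusive create of existing file",
         lambda: fs.open("/exists", O_CREAT | O_EXCL), errno.EEXIST),
        ("name too long", lambda: fs.open("/" + "x" * 300), errno.ENAMETOOLONG),
        ("write to bad handle", lambda: fs.write(99, b"data"), errno.EBADF),
    ]
    results = {}
    for description, call, expected in table:
        fs.last_error = 0
        results[description] = (call() == -1 and fs.last_error == expected)
    return results


results = run_error_path_table()
```

Adding a new error path means adding one row to the table, which keeps it practical to cover every documented error code for every call.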
Linux Testing
The main purpose of the Linux testing is to test the Linux wrapper code. It does, however, still exercise the yaffs core code in different ways from other testing, and thus adds to the strength of the test fabric.
The Linux testing does not do power fail testing but instead tests other aspects such as clean shutdowns, cache integrity and write performance. This gives a useful way to assess the effects of modifications.
Linux testing is done on both real hardware and on simulated hardware (nandsim).
As with the Power Fail Stress Testing described above, the simulated testing is done using:
Fuzz Testing
Fuzz testing deliberately corrupts the flash image and checks that yaffs still mounts. This is done to ensure that yaffs does not crash on corrupted data.
Clearly fuzz testing can corrupt the contents of individual files (that’s its job), but the file system integrity should not be compromised.
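A fuzz loop of this kind can be sketched in Python as follows. The 8-byte record format, the per-record CRC32 and the mount routine are assumptions made for the sketch, not the yaffs on-flash format or its real integrity mechanisms. Only the struct, zlib and random calls are standard library functions.

```python
import random
import struct
import zlib

RECORD = struct.Struct("<HHI")  # file id, payload value, CRC32 of the first two


def make_image(n):
    """Build a clean image of n records, cycling through four file ids."""
    image = bytearray()
    for i in range(n):
        body = struct.pack("<HH", i % 4, i)
        image += body + struct.pack("<I", zlib.crc32(body))
    return image


def mount(image):
    """Scan the image, skipping any record whose checksum does not match."""
    state, rejected = {}, 0
    for offset in range(0, len(image) - RECORD.size + 1, RECORD.size):
        file_id, value, crc = RECORD.unpack_from(image, offset)
        if zlib.crc32(bytes(image[offset:offset + 4])) != crc:
            rejected += 1
            continue
        state[file_id] = value
    return state, rejected


def fuzz(rounds, seed=0):
    rng = random.Random(seed)
    mounts = 0
    for _ in range(rounds):
        image = make_image(32)
        flips = rng.randint(1, 8)
        for _ in range(flips):
            # Flip one random bit somewhere in the image.
            image[rng.randrange(len(image))] ^= 1 << rng.randrange(8)
        state, rejected = mount(image)  # must not raise
        # Each flip touches one record, so no more than `flips` can be lost.
        assert rejected <= flips
        mounts += 1
    return mounts


clean_state, clean_rejected = mount(make_image(32))
completed = fuzz(200)
```

The corrupted records are dropped, which may lose file content, but the mount always completes and the damage stays confined to the records that were actually hit.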
Code Checking
Although code checking is not testing per se, it does help to verify code changes.
The yaffs code has been checked with Coverity, a market leader in code checking.
We are investigating the use of KLEE and other tools.
Open Source Use
Each different project uses yaffs in a different way, so that collectively a wider range of operation sequences are tested.
Many of these users participate in some way on the yaffs mailing list, helping to uncover issues and generally provide feedback.
Over its lifetime, yaffs has been used as a teaching framework for various university courses in many countries around the world.
For example, Oregon State University uses Yaffs as a test bench for their post-graduate software testing course. They essentially test the test tools by injecting errors into the yaffs code and seeing if these are discovered. Of course this has the side effect of running these tests over the original Yaffs code too. Those interested in test methods are encouraged to read some of the OSU papers.
We expect to adopt and extend some of the OSU ideas and use these to extend our in-house test suite.
Many other examples of academic studies of Yaffs can be found using a standard search engine. Some examples are:
University of Utah, School of Computing, used Yaffs as a case study in their work on Swarm Testing in 2011. www.cs.utah.edu/~regehr/papers/swarm12.pdf
National Taiwan University, Dept of Computer Science, published “Efficient Initialization and Crash Recovery for Log-based File Systems Over Flash Memory” (Copyright 2006 ACM 1-59593-108-2/06/0004). http://www.cis.nctu.edu.tw/~lpchang/papers/SAC_wu_sac06.pdf
DeDupe in YAFFS, K Narayan & P S Vijayakrishnan, Comp Sci, U Wisconsin. http://pages.cs.wisc.edu/~knarayan/dedup-yaffs.pdf (2011)
Software development is an iterative process of design and testing. Robust software solutions are only possible with extensive testing.
Embedded systems increasingly need to store data in file systems as a critical part of their operation. It is thus imperative to have a file system that is predictable and robust to power failure and similar interruptions.
Yaffs has been designed from scratch to be robust to power failure and that robustness has been verified with a multi-pronged test strategy that is continuously being improved.
[1] Note that flash translation layers (FTLs) are sometimes erroneously referred to as flash file systems. For example, despite the name, M-Systems TrueFFS is not a flash file system but a flash translation layer (FTL).
[3] Refer to the “How Yaffs Works” document to understand the need for garbage collection and how yaffs does it.