Re: [Yaffs] YAFFS Disk filling up - GC stuck?

Author: Charles Manning
Date:  
To: yaffs
Subject: Re: [Yaffs] YAFFS Disk filling up - GC stuck?
On Tuesday 03 September 2013 08:50:55 walter wrote:
> Greetings all,
>
> I have been using yaffs for a few years now (with no issues) and have
> recently been experiencing a bit of a mystery. I am using openwrt and I
> should probably be posing this question/issue over there, but I have a
> feeling you guys will have some better insight into this particular
> problem. (note: the files are written directly to flash, there is no
> overlay filesystem like some OpenWRT images)
>
> The basic symptom is that the disk is reporting full, but there are not
> enough files to legitimately take up that much space. These systems were
> previously working, and are still running (in the last few months they have
> been constrained by "no free space").
>
> When I do "df -h", it reports the system is full, but when I do a sort of
> crude totaling of file sizes (du -hcs /), it is reporting significantly
> less usage.
> root@errored_system:/root# du -hcs /
> 16.3M    /
> *16.3M*    total
> root@errored_system:/root# df -h
> Filesystem                Size      Used Available Use% Mounted on
> rootfs                   58.0M     *58.0M*      8.0K 100% /
> /dev/root                58.0M     58.0M      8.0K 100% /
> tmpfs                    30.3M    444.0K     29.8M   1% /tmp
> tmpfs                   512.0K         0    512.0K   0% /dev

>
> (~41.7MB unaccounted for... that's more than can be justified by blocks
> that are not full)
>
> I have looked for files that might be sucking up this much space, but have
> not found any (big files, or lots of small files). This leads me to believe
> there is something amiss with the file system, which is why I am seeking
> your insight.
>
>
>
>
>
> I don't know much about debugging yaffs specifically, but this seems like a
> good clue:
> root@errored_system:/root# echo "-all+gc">/proc/yaffs
> root@errored_system:/root# dmesg | tail -f
> [ 3419.220000] yaffs_block_became_dirty block 3048 state 8
> [ 3419.220000] GC Selected block 3437 with 1 free, prioritised:0
> [ 3419.220000] yaffs: GC erasedBlocks 5 aggressive 1
> [ 3419.260000] yaffs_block_became_dirty block 3437 state 8
> [ 3419.270000] GC Selected block 2201 with 1 free, prioritised:0
> [ 3419.270000] yaffs: GC erasedBlocks 5 aggressive 1
> [ 3419.310000] yaffs_block_became_dirty block 2201 state 8
> [ 3419.310000] GC Selected block 2231 with 1 free, prioritised:0
> [ 3419.310000] yaffs: GC erasedBlocks 5 aggressive 1
> [ 3419.370000] yaffs_block_became_dirty block 2231 state 8
>
> These "yaffs_block_became_dirty block" records rapidly fill up the kernel
> log... (on the good systems, I do not get these messages, at least not in
> rapid succession)


block_became_dirty means that the block no longer has useful info in it and
can be erased and reused. This happens after a block has been gc'ed or when
the files on that block have been overwritten or deleted.

What is worrying is the "GC Selected block 2231 with 1 free, prioritised:0".
For some reason, the GC is being constrained to only select blocks with very
little free space on them. This means a gc cycle is not releasing much space.
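As a rough illustration (assuming the usual small-page geometry of 32
512-byte chunks per 16 KiB block, which is consistent with 3712 blocks
making up a 58 MiB partition): collecting a block with only 1 free chunk
means copying the other 31 live chunks somewhere else just to get about
half a kilobyte back.

  # assumed geometry: 32 chunks x 512 bytes per block
  echo $(( 31 * 512 ))   # ~15.5 KiB copied per collected block
  echo $(( 1 * 512 ))    # ~0.5 KiB of space actually reclaimed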

>
>
> Here's the interesting thing: they are all repeating on the same few blocks
> (2201, 2357, 2819, 3048 & 3437... same story on the other errored systems,
> only different block numbers). So my hunch is that something is preventing
> these blocks from getting GC'd and that they are somehow backlogging the
> other blocks that need freeing. Is that even possible? I am able to delete
> legitimate files and regain their space, but that doesn't really address the
> issue of the mystery disk usage. It does tell me that the GC is working at
> least in part.


What is happening is that there is very little free space in the system and
these blocks are "churning".
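The /proc/yaffs dumps you pasted further down bear this out: on the full
system every collection is aggressive (passive_gc_count is 0) and nearly a
whole block's worth of chunks is copied for each block collected, versus
only a chunk or two on the healthy system.

  # churn ratio = n_gc_copies / n_gc_blocks, from the dumps below
  echo $(( 2751776 / 88768 ))   # prints 30: ~31 of the ~32 chunks per block are live copies
  echo $(( 22402 / 11524 ))     # prints 1: only a chunk or two copied per block on the good system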

>
>
> Incidentally, I cross-referenced the yaffs_block_became_dirty blocks against
> the "bad eraseblocks" and "blocks marked bad" as reported on startup and
> they were different (i.e. the blocks in the above GC messages were not
> "known bad blocks").


The block_became_dirty blocks are good blocks.

>
>
>
>
> Is there any way to forcibly verify the integrity of the data blocks in
> yaffs and free up this mystery space? In other words, I have remote access
> to these systems, but re-flashing isn't exactly an option for me as they
> are geographically distributed (i.e. data collectors). Can I somehow force
> those block numbers to be marked as bad? (I'd imagine that wouldn't be
> wise, even if I could)


Well, looking further down at the /proc/yaffs outputs, it looks like these
are legitimately full. The question is: what with?
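A few crude ways to hunt for that from a shell, staying on the root
filesystem so tmpfs doesn't confuse the totals (assuming your busybox build
has the -x / -xdev options):

  du -xk / | sort -n | tail -20    # biggest directories on the root fs
  find / -xdev -type f | wc -l     # how many files the fs is actually holding
  find / -xdev -size +2048         # files over ~1 MiB (size is in 512-byte blocks)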

>
>
>
>
>
>
> Some more (hopefully) useful information:
>
>
> I am not writing much to flash. About once an hour I update a couple of
> values in a sqlite database. I don't know if the update operation is the
> culprit (since it's behaving on other systems), but it's the only source of
> disk IO I can think of.
>
>
>
> There are not many reported bad blocks (5-20 depending on which system I'm
> looking at)
>
>
> The board manufacturer apparently has changed the NAND chip somewhere along
> the line. I am experiencing the issue on both types (so I'm fairly certain
> it's not exclusively hardware related).
> One system:
> [    2.520000] NAND device: Manufacturer ID: 0xad, Chip ID: 0x76 (Hynix
> NAND 64MiB 3,3V 8-bit)
> Another system:
> [    2.510000] NAND device: Manufacturer ID: 0xec, Chip ID: 0x76 (Samsung
> NAND 64MiB 3,3V 8-bit)

>
>
>
>
>
>
>
> Here is the /proc/yaffs from a good system (one that isn't full)
>
> root@good_system:/root# cat /proc/yaffs
> Multi-version YAFFS built:May 25 2012 01:48:02
>
>
> Device 0 "rootfs"
> start_block.......... 0
> end_block............ 3711
> total_bytes_per_chunk 512
> use_nand_ecc......... 1
> no_tags_ecc.......... 0
> is_yaffs2............ 0
> inband_tags.......... 0
> empty_lost_n_found... 0
> disable_lazy_load.... 0
> refresh_period....... 500
> n_caches............. 10
> n_reserved_blocks.... 5
> always_check_erased.. 0
>
> data_bytes_per_chunk. 512
> chunk_grp_bits....... 0
> chunk_grp_size....... 1
> n_erased_blocks...... 294
> blocks_in_checkpt.... 0
>
> n_tnodes............. 7901
> n_obj................ 5653
> n_free_chunks........ 55483
>
> n_page_writes........ 597395
> n_page_reads......... 906712
> n_erasures........... 11524
> n_gc_copies.......... 22402
> all_gcs.............. 32575
> passive_gc_count..... 32575
> oldest_dirty_gc_count 0
> n_gc_blocks.......... 11524
> bg_gcs............... 0
> n_retired_writes..... 0
> nRetireBlocks........ 0
> n_ecc_fixed.......... 0
> n_ecc_unfixed........ 0
> n_tags_ecc_fixed..... 0
> n_tags_ecc_unfixed... 0
> cache_hits........... 100140
> n_deleted_files...... 37
> n_unlinked_files..... 22541
> refresh_count........ 0
> n_bg_deletions....... 0
>
>
>
>
>
>
>
>
> And here's one from a "disk full" system:
>
> root@errored_system:/root# cat /proc/yaffs
> Multi-version YAFFS built:May 25 2012 01:48:02
>
>
> Device 0 "rootfs"
> start_block.......... 0
> end_block............ 3711
> total_bytes_per_chunk 512
> use_nand_ecc......... 1
> no_tags_ecc.......... 0
> is_yaffs2............ 0
> inband_tags.......... 0
> empty_lost_n_found... 0
> disable_lazy_load.... 0
> refresh_period....... 500
> n_caches............. 10
> n_reserved_blocks.... 5
> always_check_erased.. 0
>
> data_bytes_per_chunk. 512
> chunk_grp_bits....... 0
> chunk_grp_size....... 1
> n_erased_blocks...... 5
> blocks_in_checkpt.... 0
>
> n_tnodes............. 23388
> n_obj................ 19836
> n_free_chunks........ 206
>
> n_page_writes........ 2912051
> n_page_reads......... 2907834
> n_erasures........... 88768
> n_gc_copies.......... 2751776
> all_gcs.............. 88768
> passive_gc_count..... 0
> oldest_dirty_gc_count 0
> n_gc_blocks.......... 88768
> bg_gcs............... 0
> n_retired_writes..... 0
> nRetireBlocks........ 0
> n_ecc_fixed.......... 1
> n_ecc_unfixed........ 0
> n_tags_ecc_fixed..... 0
> n_tags_ecc_unfixed... 0
> cache_hits........... 1851
> n_deleted_files...... 2
> n_unlinked_files..... 35064
> refresh_count........ 0
> n_bg_deletions....... 0
>


What is telling is the n_free_chunks value: the full system says there are
almost none free.
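As a rough cross-check (again assuming 32 chunks per block, and remembering
that the n_reserved_blocks worth of space is held back for garbage
collection, so this won't match df to the byte):

  echo $(( 206 * 512 ))            # ~103 KiB of free chunks in total
  echo $(( (206 - 5*32) * 512 ))   # ~23 KiB once the 5 reserved blocks are held back

which is in the same ballpark as the 8 KiB that df reports as available.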

Look too at the difference in the number of objects (n_obj) and n_tnodes.
These indicate there are far more objects in the full system.

Of course it could be that these are "zombied" objects waiting for the final
deletion of some data, but those should be cleared by a remount. It would be
interesting to know if doing a reset would clear this.
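One quick way to check for those zombied objects without rebooting is to
look for processes still holding deleted files open (plain Linux procfs,
nothing yaffs-specific; lsof is usually not on an OpenWRT image):

  ls -l /proc/[0-9]*/fd 2>/dev/null | grep deleted

If that turns up entries, the space is tied up in unlinked-but-still-open
files (a stale log or database journal held by a long-running process, say),
and restarting the offending process, or remounting, should give it back.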

-- Charles