#2944 closed defect (fixed)

FAT data corruption during unmount()

Reported by: Sebastian Huber Owned by: Sebastian Huber
Priority: normal Milestone: 4.11.3
Component: fs/fat Version: 4.11
Severity: normal Keywords:
Cc: Blocked By:
Blocking:

Description

https://lists.rtems.org/pipermail/users/2017-March/031101.html

In msdos_shut_down ( msdos_fsunmount.c ) there is a call to fat_file_close( .. ) which attempts to close a file
descriptor and write a range of metadata to that file's director entry located in another cluster:

  • fat_file_write_first_cluster_num
  • fat_file_write_file_size
  • fat_file_write_time_and_date

The problem is that this is the root node, and of course doesn't have a corresponding parent directory entry.

In addition, the "parent directory entry" cluster number is initialised to 0x1 (FAT_ROOTDIR_CLUSTER_NUM)
which is not working according to the FAT specification (cluster numbering starts at 2).
This actually creates a critical bug that overwrites random data to above sectors, because 2 is subtracted from 1
to calculate the sector number of the cluster -> through a series of function calls -> leads to a sector number at
the end of FAT2 (just below the start of the cluster region). The driver believes this is a FAT region (in fat_buf_release),
writes the sector to what it "thinks" is FAT1, proceeds to copy the changes to FAT2 -> adds FAT_LENGTH (8161) to sector,
leading to a write well into the cluster region, randomly overwriting files.

The three function calls above lead to fsck complaining about disk structure:

#######

fsck from util-linux 2.27.1
fsck.fat 3.0.28 (2015-05-16)
0x41: Dirty bit is set. Fs was not properly unmounted and some data may be corrupt.
1) Remove dirty bit
2) No action
? 2
There are differences between boot sector and its backup.
This is mostly harmless. Differences: (offset:original/backup)

65:01/00

1) Copy original to backup
2) Copy backup to original
3) No action
? 3
/ and
/APPLICAT.ION

share clusters.
Truncating second to 0 bytes because first is FAT32 root dir.

/APPLICAT.ION

File size is 4096 bytes, cluster chain length is 0 bytes.
Truncating file to 0 bytes.

Perform changes ? (y/n) n
/dev/sdm1: 14 files, 1600/1044483 clusters

########

In particular the "shared cluster" problem is caused by fat_file_write_first_cluster_num, which adds a directory
entry to the root directory cluster pointing at itself; e.g. there is a directory entry in cluster 2 pointing to
a file in cluster 2. (Note: this occurs because we have fixed the "point to cluster # 1 issue" by reading the relative
location of the root cluster node from the FAT volume info strcture).

Removing the function call in msdos_shut_down ( .. ) to close the root file descriptor solves the problem perfectly
(clean fsck). However, we're a bit unsure about the intent behind closing the root directory.

Change History (9)

comment:1 in reply to:  description ; Changed on Mar 20, 2017 at 9:21:33 PM by Chris Johns

Replying to Sebastian Huber:

Removing the function call in msdos_shut_down ( .. ) to close the root file descriptor solves the problem perfectly (clean fsck).

I assume you mean fat_file_close?

However, we're a bit unsure about the intent behind closing the root directory.

There is memory allocated in fat_file_open which you would leak.

I see the fat_file_close calls fat_buf_release and if the fs_info cache is not empty it is holding a bdbuf buffer so this would cause a leak of buffers.

What about the fat_file_close calls in the msdos init call on error? Would they also cause the same problem?

comment:2 in reply to:  1 Changed on Mar 28, 2017 at 8:15:10 AM by slemstick

Replying to Chris Johns:

Replying to Sebastian Huber:

Removing the function call in msdos_shut_down ( .. ) to close the root file descriptor solves the problem perfectly (clean fsck).

I assume you mean fat_file_close?

Yes.

However, we're a bit unsure about the intent behind closing the root directory.

There is memory allocated in fat_file_open which you would leak.

We fixed this issue by creating a special "root file close" function, by removing the call to fat_file_update() in fat_file_close() (which caused the corruption).

I see the fat_file_close calls fat_buf_release and if the fs_info cache is not empty it is holding a bdbuf buffer so this would cause a leak of buffers.

What about the fat_file_close calls in the msdos init call on error? Would they also cause the same problem?

Yes, these will cause the same issues.

To update / summarise this ticket a bit here:

We originally attempted a fix to this problem by setting the hard-coded root directory cluster number to 2, as well as the above (remove corruption caused by fat_file_update() in fat_file_close() on unmount).

However, our attempt to fix the broken root cluster numbering breaks a hashing mechanism in fat_file_open(..). This mechanism indexes open file descriptors based on 1) parent directory cluster number and 2) offset into that directory structure. The issue is that the root directory, and the file pointed to by the first directory entry in the root directory, may construct their hashes based on the same indexes:

Root directory: cluster number 2, offset 0
First file in root directory: cluster number 2, offset 0

Before, this was not a problem of course, as the root directory had the hard-coded cluster number of 1, and the keys were therefore always unique. But this can actually cause a number of new issues.

The fix to this problem is to set the hard-coded root cluster directory number back to 1, instead of drastically changing the key hashing method function calls and data structures, and trusting that removing calls to fat_file_update(on_root_node) are sufficient to avoid the data corruption issue described above.

However, there are two other places in msdos_misc.c where the hardcoded root directory cluster number - FAT_ROOTDIR_CLUSTER_NUM - is used:

msdos_get_name_node()
msdos_get_dotdot_dir_info_cluster_num()

Like this:

if ( (MSDOS_EXTRACT_CLUSTER_NUM(dotdot_node)) == 0)
{

/*

  • we handle root dir for all FAT types in the same way with the
  • ordinary directories ( through fat_file_* calls ) */

fat_dir_pos_init(dir_pos);
dir_pos->sname.cln = FAT_ROOTDIR_CLUSTER_NUM;

}

Which, to my understanding, will never occur as you should never have a cluster number below 2 in a compliant (msdos) FAT file system. Does anyone know the intent behind this?

comment:3 Changed on May 11, 2017 at 7:31:02 AM by Sebastian Huber

Milestone: 4.124.12.0

comment:4 Changed on Aug 14, 2017 at 12:21:48 AM by Chris Johns

Version: 4.114.12

comment:5 Changed on Aug 24, 2017 at 9:56:36 AM by Sebastian Huber

Owner: changed from chrisj@… to Sebastian Huber
Status: newassigned

comment:6 Changed on Sep 6, 2017 at 12:39:26 PM by Sebastian Huber

Milestone: 4.12.04.11.3
Version: 4.124.11

comment:7 Changed on Sep 6, 2017 at 12:40:16 PM by Sebastian Huber <sebastian.huber@…>

In 4d495cf/rtems:

dosfs: Fix fat_file_update()

Do not update the non-existant meta-data of the root directory.

Update #2944.

comment:8 Changed on Sep 6, 2017 at 12:41:05 PM by Sebastian Huber <sebastian.huber@…>

Resolution: fixed
Status: assignedclosed

In a3199d91/rtems:

dosfs: Fix fat_file_update()

Do not update the non-existant meta-data of the root directory.

Close #2944.

comment:9 Changed on Oct 10, 2017 at 6:50:58 AM by Sebastian Huber

Component: fsfs/fat
Note: See TracTickets for help on using tickets.