#2792 assigned defect

possible RFS deadlock

Reported by: raduma
Owned by: Chris Johns
Priority: normal
Milestone: Indefinite
Component: fs
Version: 4.11
Severity: normal
Keywords: rfs deadlock
Cc:
Blocked By:
Blocking:

Description

Hello,

I'm hitting a deadlock in our RTEMS 4.10-based system, in what appears to be a lock-inversion issue in the RFS file system driver. At least that is how it looks from my initial investigation; I'm coming here for clarification.

Scenario is as follows:

We have two tasks in our system doing file I/O operations on a single RFS partition. One task creates, writes and closes a new file roughly once a second. The other performs a constant, looping scan of the entire file system and stat's every file it encounters. (This is a somewhat contrived scenario, but it turned a randomly occurring problem into a consistently reproducible one.)
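To make the workload concrete, here is a minimal sketch of the two tasks. It is written with POSIX threads and file APIs so it is self-contained; in our system these are RTEMS tasks, and the mount point, file names and write size below are placeholders rather than our actual configuration.

{{{
/*
 * Minimal sketch of the two-task workload, using POSIX threads and file
 * APIs so it is self-contained. In the real system these are RTEMS tasks
 * and the files live on the RFS partition; the mount point, file names
 * and write size below are placeholders, not our actual configuration.
 */
#include <dirent.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define MOUNT_POINT "/mnt/rfs"   /* placeholder RFS mount point */

/* Task 1: create, write and close a new file roughly once a second. */
static void *creator_task(void *arg)
{
  char data[512];
  int  i = 0;

  (void) arg;
  memset(data, 0xA5, sizeof(data));
  for (;;) {
    char path[256];
    snprintf(path, sizeof(path), MOUNT_POINT "/file-%06d.dat", i++);
    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd >= 0) {
      write(fd, data, sizeof(data));
      close(fd);               /* close() is where this task ends up stuck */
    }
    sleep(1);
  }
  return NULL;
}

/* Task 2: continuously scan the file system and stat() every entry. */
static void *scanner_task(void *arg)
{
  (void) arg;
  for (;;) {
    DIR *dir = opendir(MOUNT_POINT);
    if (dir != NULL) {
      struct dirent *entry;
      while ((entry = readdir(dir)) != NULL) {
        char path[256];
        struct stat st;
        snprintf(path, sizeof(path), MOUNT_POINT "/%s", entry->d_name);
        stat(path, &st);       /* stat() is where this task ends up stuck */
      }
      closedir(dir);
    }
  }
  return NULL;
}

int main(void)
{
  pthread_t creator, scanner;
  pthread_create(&creator, NULL, creator_task, NULL);
  pthread_create(&scanner, NULL, scanner_task, NULL);
  pthread_join(creator, NULL);
  pthread_join(scanner, NULL);
  return 0;
}
}}}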

Symptoms are as follows:

  • Within a very short time, the system deadlocks in the file I/O layer.
  • The scanning task is blocked in a call path starting at rtems_filesystem_eval_path_start and ending in rtems_bdbuf_anonymous_wait, waiting on the access_waiters semaphore.
  • The low-frequency file-creation task is blocked in a call path starting at close(fd) and going through rtems_rfs_rtems_file_close, waiting to obtain the rtems_rfs_rtems_lock semaphore.
  • The bdbuf swapout task is sitting at the top of its loop, waiting for the event that would wake it up to do some work.

Digging and tracing through the code, what seems to be happening is the following (a simplified sketch of the inverted lock ordering follows the list):

  • The file-creator task creates files and starts writing to them, acquiring access locks on the bdbuf buffers.
  • The enumeration task scans, starts seeing the newly created files, and wants to stat them.
  • As part of stat, the file system lock handler is called and acquires the global RFS lock.
  • Still within stat, it wants to load the RFS inode and tries to acquire the corresponding block from bdbuf for reading.
  • That block may be the very one locked by the other task, so it waits.
  • The other task gets around to close()'ing the file it has open, which *WOULD* release its lock on the buffers and wake up the swapout task, which would then flush and release all the other waiting tasks.
  • *BUT* the RFS close implementation never gets that far, because it is trying to acquire the global RFS lock, which is held by the scanning task.
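If that reading is right, the two tasks end up taking the two locks in opposite orders. Below is a deliberately simplified model of that ABBA ordering using plain POSIX mutexes; the names are ours and only stand in for the real RFS/bdbuf locking (the actual buffer handling goes through bdbuf buffer states and the swapout task, not a plain mutex).

{{{
/*
 * Simplified model of the suspected inversion. The two mutexes stand in
 * for the real locks (rfs_lock for the global RFS lock taken by the file
 * system lock handler, buffer_lock for exclusive access to a bdbuf
 * buffer); they are illustrative names, not actual RTEMS identifiers.
 */
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t rfs_lock    = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t buffer_lock = PTHREAD_MUTEX_INITIALIZER;

/* Models the file-creator task: it already holds a buffer from its
 * earlier write(), then needs the RFS lock to perform close(). */
static void *creator(void *arg)
{
  (void) arg;
  pthread_mutex_lock(&buffer_lock);  /* buffer held from write()          */
  usleep(1000);                      /* let the scanner grab the RFS lock */
  pthread_mutex_lock(&rfs_lock);     /* close() -> rtems_rfs_rtems_lock   */
  pthread_mutex_unlock(&rfs_lock);
  pthread_mutex_unlock(&buffer_lock);
  return NULL;
}

/* Models the scanning task: stat() takes the RFS lock first, then needs
 * the buffer the creator still holds. */
static void *scanner(void *arg)
{
  (void) arg;
  pthread_mutex_lock(&rfs_lock);     /* file system lock handler          */
  usleep(1000);
  pthread_mutex_lock(&buffer_lock);  /* bdbuf read of the inode block     */
  pthread_mutex_unlock(&buffer_lock);
  pthread_mutex_unlock(&rfs_lock);
  return NULL;
}

int main(void)
{
  pthread_t a, b;
  pthread_create(&a, NULL, creator, NULL);
  pthread_create(&b, NULL, scanner, NULL);
  pthread_join(a, NULL);             /* with unlucky timing, never returns */
  pthread_join(b, NULL);
  return 0;
}
}}}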

So I am wondering whether my understanding of what might be going on sounds legitimate, and, if so, whether there are any mitigation strategies we could employ to work around it (aside from not doing file I/O from multiple tasks).

Change History (5)

comment:1 Changed on 09/27/16 at 23:58:43 by Chris Johns

Owner: set to Chris Johns
Status: new → assigned

Thank you for the excellent report. Do you have a test case you could attach to this ticket?

comment:2 Changed on 01/26/17 at 07:16:00 by Sebastian Huber

Milestone: 4.11.1 → 4.11.2

comment:3 Changed on 03/23/17 at 01:03:28 by Chris Johns

Milestone: 4.11.2 → 4.11.3

The 4.11.2 milestone is closing.

comment:4 Changed on 03/23/17 at 01:05:42 by Chris Johns

Version: 4.10 → 4.11

Move to the 4.11 branch.

comment:5 Changed on 02/07/18 at 23:21:18 by Chris Johns

Milestone: 4.11.3 → Indefinite

Requires funding.
