[Interest] [Semi OT] Concurrent (multi-threaded) read/write disk IO?

Till Oliver Knoll till.oliver.knoll at gmail.com
Thu Feb 5 20:21:57 CET 2015


Hi Harri, Rainer,

thanks for sharing your thoughts!

Am 05.02.15 um 15:19 schrieb Harri Pasanen:
> On 05/02/2015 14:44, Till Oliver Knoll wrote:
>>
>> Am 05.02.2015 um 14:25 schrieb Till Oliver Knoll
>> <till.oliver.knoll at gmail.com <mailto:till.oliver.knoll at gmail.com>>:
>>
>>>... I am really
>> just interested in whether concurrent read/write access should be
>> avoided in the first place these days (or not).
>> ...
> The usual answer is "it depends"..
> 
> It depends on how much data you are accessing at each write/read.   It
> also depends on the underlying filesystem and size of files / how many
> files you are dealing with.

In my concrete use case I have "ordinary single harddisk desktop
systems" in mind, that is, no embedded ("limited") hardware, but also no
dedicated file server with RAID, highly optimised "Super-Filesystem" and
so on - just "plain vanilla" desktops.

Also, in my concrete use case I have "batch resize of photos" in mind,
where each file is around 5 MByte (JPEG) to 25 MByte ("raw") in size.

I don't know how fast the actual resizing will be - I have a combined
CPU/GPU solution in mind, for the sake of getting a bit into OpenCL -
but I imagine it won't empty the "Work Queue" faster than I can fill it
by reading the original images from disk and enqueuing them into the
Work Queue. Also, I plan to have a size limit on the Work Queue, so I
imagine I won't be reading "full steam" all the time (but who knows -
maybe I end up being able to scale an image faster than I can read and
decode the JPEG data ;))

So whenever I am not reading I could use that time to empty the "Result
Queue" and write the data to disk.

And of course the assumption is that we read and write from/to the same
harddisk ;)

I guess there are still a lot of "depends" in that use case above. I was
hoping to get a general piece of advice/rule of thumb on whether it is a
good idea to have two distinct threads, reading/writing "concurrently"
from/to the harddisk, where the data is big (several MByte), but not as
big as in "streaming a movie" (on the order of GB).

> It also depends on your disk array, if you have one or more disks and
> capacity of the disks, which affects then number of read/write heads the
> disk has.  Also The NCQ* implementation and cache RAM amount in a disk
> makes a difference.

I was actually hoping that nowadays modern (say, <= 3 years old)
harddisks and Operating Systems (Windows, Linux, Mac) would handle the
above case somehow for me, given that the size of each file is up to 25
MBytes, and I could just "go ahead" and read/write. Maybe there is even
a technique which optimises concurrent read/write operations (of course
an OS/harddisk controller can only go so far in optimising concurrent
access - I guess when I try to read e.g. 10 times the same file, or even
different files, at different locations then it's "game over").

> If you are on linux, you already get a lot of optimization out of the
> box, it is typically much better than any other OS.  But even within
> linux the filesystem used makes a difference, for example some
> filesystems are good with lots of small files.   Sometimes file deletion
> is the bottleneck.
> 
> In the end in spinning drives the underlying physics of spinning media
> and moving read/write heads affect things.

In the end I think it is really the required physical movement of that
head, rather than the file system (the file system might have an
influence on how the data is "distributed" on the physical drive, but I
guess that is negligible with regard to concurrent read/write
operations, no?).

> But if you want maximum IO performance, the rule of thumb is to group
> your reads and writes, and read/write as much data as possible at once.
>   Even SSDs typically favor this.  In highly parallel supercomputer
> settings different rules may apply.

That's what my gut feeling tells me as well.

Also, Stack Overflow answers to questions like this seem to confirm it:

http://stackoverflow.com/questions/5321768/how-many-threads-for-reading-and-writing-to-the-hard-disk


On the other hand Rainer wrote:

Am 05.02.15 um 15:24 schrieb Rainer Wiesenfarth:
> From: Till Oliver Knoll
>> Am 05.02.2015 um 14:25 schrieb Till Oliver Knoll:
>>> ...
>> http://www.tomshardware.co.uk/forum/251768-32-impact-concurrent-speed
>> [...]
>
> Please note that this post is more than five years old. Things -
> namely I/O schedulers in operating systems and hard disk caching -
> have changed since then.


I was hoping so, too.

> I would _assume_ that any modern OS is capable of scheduling I/O for
> maximum performance. In addition, an own I/O scheduler would probably
> only work for bare metal access to the harddisk. Otherwise, the
> underlying file system and its potential fragmentation might void all
> your effort.
>
> Thus my approach would be to start any number of concurrent reads and
> writes that makes sense for the application side and start optimizing
> if (and only if!) throughput is too bad.

Other links that I found seem to support this: the underlying
"scheduler" figures out the best read/write strategy, and any attempt by
the application to implement that by itself would be counter-productive
(assuming "finite" read/write operations, that is, not endlessly reading
several GB of data "non-stop"):


http://superuser.com/questions/365875/can-hard-disks-read-and-write-simultaneously-on-different-tracks-how

But maybe that answer only applied to the question asked there, about
"copy/pasting a file".


So I guess what this all boils down to is: "I have to try for myself" :)
I'll let you know how it goes (the biggest problem, however, is that the
only computer in my household still having a spinning harddisk is a 15
year old laptop running Windows 2000 ;))


Thanks a lot,
  Oliver





