tech-userlevel archive
Re: Using mmap(2) in sort(1) instead of temp files
>> Given the issues about using mmap, can anybody suggest how I should
>> proceed with the implementation, or if I should at all?
> There are two potential ways where mmap(2) could help improve the speed
> of sort:
> - If you know the input file name, use a read-only mmap() of that file
> and avoid all buffering. Downside: you can not store \0 at the
> end of a line anymore and need to deal with char*/size_t pairs for
> strings.
Actually, if you mmap it PROT_READ|PROT_WRITE and MAP_PRIVATE, you could
go right ahead and store the \0s in place. But that'll cost RAM or swap
space for every page a COW fault has to copy.
It also works only when the input file fits into VM; to rephrase part
of what I wrote yesterday on tech-kern, sorting a file bigger than 4G
on a 32-bit port shouldn't break.
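A minimal sketch of the private-mapping idea, not a proposed patch. The
helper name map_and_terminate() is made up for illustration, and the
SIZE_MAX check is only there to show where the "bigger than VM" case has
to bail out to some other strategy.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: map `path` copy-on-write and overwrite every '\n'
 * with '\0' in the private copy.  The file on disk is never modified.
 * Returns the mapping base (length in *lenp, number of terminated lines
 * in *nlines), or NULL if the file is empty, too big to map, or on error.
 * A final line with no trailing newline is NOT terminated here and would
 * need separate handling.
 */
static char *
map_and_terminate(const char *path, size_t *lenp, size_t *nlines)
{
	struct stat st;
	char *base;
	size_t i, len, n = 0;
	int fd;

	if ((fd = open(path, O_RDONLY)) == -1)
		return NULL;
	if (fstat(fd, &st) == -1 || st.st_size <= 0 ||
	    (uintmax_t)st.st_size > SIZE_MAX) {
		close(fd);	/* e.g. a >4G file on a 32-bit port */
		return NULL;
	}
	len = (size_t)st.st_size;

	/* MAP_PRIVATE only needs read access to the descriptor. */
	base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	close(fd);
	if (base == MAP_FAILED)
		return NULL;

	for (i = 0; i < len; i++) {
		if (base[i] == '\n') {
			base[i] = '\0';	/* first store to a page copies it */
			n++;
		}
	}
	*lenp = len;
	*nlines = n;
	return base;
}
```

Since nearly every page of a text file contains at least one newline,
this ends up copying more or less the whole file into anonymous memory,
which is the RAM/swap cost mentioned above.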
> - You use "swap space" instead of a temporary file by doing
> mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_ANNON, -1, 0);
(well, MAP_ANON). Yes, but that has issues. The size of an mmap()ped
area is fixed, set at map time, whereas file sizes grow dynamically. I
suspect that trying to use mmap instead of temp files would amount to
implementing a rudimentary ramfs.
Furthermore, if the dataset fits in RAM, I'd say you shouldn't be using
the temporary-space paradigm at all; just slurp it in and sort it in
core. And if it fits in VM but not RAM, given the way swap is tuned
for general-purpose use instead of the kind of access patterns sort
exhibits, I suspect temp files might end up being more performant. And
if the dataset doesn't fit in VM, you'll need temp files regardless.
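For the fits-in-RAM case, "slurp it in and sort it in core" could look
roughly like the sketch below. It assumes the input is already in one
buffer with every line NUL-terminated; sort_in_core() is a made-up name,
and plain strcmp() stands in for whatever key and collation handling the
real sort(1) does.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* qsort() comparator over an array of char * line pointers. */
static int
cmp_lines(const void *a, const void *b)
{
	return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/*
 * Hypothetical in-core sort.  buf[0..len) holds NUL-terminated lines and
 * buf[len - 1] must be '\0'.  Returns a malloc()ed array of pointers to
 * the lines in sorted order (count in *nlines), or NULL on allocation
 * failure.  The line data itself is not moved.
 */
static char **
sort_in_core(char *buf, size_t len, size_t *nlines)
{
	size_t i, n = 0, k = 0;
	int at_start = 1;
	char **lines;

	for (i = 0; i < len; i++)
		if (buf[i] == '\0')
			n++;

	lines = malloc((n ? n : 1) * sizeof *lines);
	if (lines == NULL)
		return NULL;

	/* Record the start of each line. */
	for (i = 0; i < len; i++) {
		if (at_start) {
			lines[k++] = &buf[i];
			at_start = 0;
		}
		if (buf[i] == '\0')
			at_start = 1;
	}

	qsort(lines, k, sizeof *lines, cmp_lines);
	*nlines = k;
	return lines;
}
```

No temp-file machinery and no merge pass are involved, which is the
point: if everything fits, there is nothing for mmap to replace.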
If this does go in, I really think it needs an option to suppress it.
/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse%rodents-montreal.org@localhost
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B