tech-kern: Re: what is the status of the ktrace-lwp branch?

Subject: Re: what is the status of the ktrace-lwp branch?
To: Christos Zoulas <christos@zoulas.com>
From: Chuck Silvers <chuq@chuq.com>
List: tech-kern
Date: 10/30/2005 14:30:14
On Sun, Oct 30, 2005 at 01:26:57PM -0500, Christos Zoulas wrote:
> On Oct 30,  9:42am, chuq@chuq.com (Chuck Silvers) wrote:
> -- Subject: Re: what is the status of the ktrace-lwp branch?
> 
> | this issue isn't endianness, it's 32-vs-64.  the ktrace file format uses
> | this structure:
> | 
> | struct ktr_header {
> | 	int	ktr_len;		/* length of ktr_buf */
> | 	short	ktr_type;		/* trace record type */
> | 	pid_t	ktr_pid;		/* process id */
> | 	char	ktr_comm[MAXCOMLEN+1];	/* command name */
> | 	struct	timeval ktr_time;	/* timestamp */
> | 	const void *ktr_buf;
> | };
> | 
> | and then can be a variable-length blob after that.  the timeval, buf pointer
> | (what the heck is a kernel pointer doing in a file format anyway?), and the
> | contents of the variable blob will all be different when written by 32-bit
> | vs. 64-bit kernels.  there's no good way to tell which format the file is,
> | so they're basically not portable between platforms.   (oh, the implicit
> | padding in this structure is bound to cause compatibility problems as well.)
> 
> Yes, but I am not sure if it is worth-while providing cross platform
> portability for ktrace records. It is not very easy to achieve and will
> bloat the 32 bit ktrace files.

architecture-neutral file formats should always be a goal.
if the entire contents cannot be architecture-neutral, then at least
there should be an architecture-neutral framework so that the overall
structure of the file is common.  this is exactly what was done with
the user coredump format.  the new ELF core file format is not completely
common (obviously), but it's at least easy to recognize ELF core files
and tell on what architecture they were made and other bits of info
which are not hardware-specific, like the name of the process and
what signal (if any) caused the process to dump core.

this is actually very easy to achieve if we don't care about old kdump binaries
which increasingly I don't think we should, and it will not bloat 32-bit files.
there is no information in the ktrace records which intrinsically needs to be
different for 32-bit vs. 64-bit kernels.  timestamps should be the same size
regardless of the size of a kernel pointer, etc.

oh... looking further I see that someone added an optional siginfo_t to the
KTR_PSIG data, which is of course completely hardware-specific.  sigh.
that does make this a bit more complicated.  ideally, these SIG records
would contain both something to indicate the format of the siginfo_t
(ie. some unique ID for the MACHINE_ARCH) and a length so that it can
be skipped over even if the ID isn't know.

but that brings up another point, the SIG records also contain a sigset_t,
which we've changed in the past.  this type used to be 32 bits, and now it's
128 bits.  so we've already made incompatible changes to the ktrace file
format in the past, with no provision for older kdump binaries to be
able to read the new file format.


> | if we're going to make the records incompatible at all (which nick was
> | proposing to do by changing the timeval to a timespec), then we should fix
> | these portability issues at the same time, to minimize the number of flag days.
> 
> | 
> | if we make the assertion that old kdump binaries don't need to be able to
> | read new ktrace files (which seems reasonable, we've done the same with
> | both user and kernel coredumps before, and probably other things),
> | then we can redefine the file format any way we want and fix everything
> | at once, and build in support for better handling of future format changes
> | too if we want.
> 
> They can now if you "ktrace -v 0". Even v1 ktrace records can be read
> using an older kdump (the timestamps are not parsed correctly).
> I've left two diffs on ftp.netbsd.org:~christos/ktrace-lwp-{user,kernel}.diff.

the new file format not being correctly interpreted by an old kdump
(ie. the timestamp are misinterpreted) is the issue I'm talking about.
I didn't realize that the proposal included be able to specify which
file format the kernel will write, but I think this is getting very silly.
we don't support writing core files in the a.out format, we don't support
writing ktrace files in the pre-1.4 format, and we shouldn't support writing
ktrace files in the pre-lwp format anymore after this change either.


> | > 2. I understand where chuq is coming from and I think I agree with him. I
> | >    think that the uio_ is a process wide struct and not an lwp specific one.
> | >    For example, what happens when the lwp exits before the uio is completed?
> | >    On the other hand, keeping it as uio_proc, would require us to call
> | >    proc_representative_lwp(uio->uio_proc) in a bunch of places; not very
> | >    attractive either.
> | 
> | the uio_procp field was intended to identify the address space that the
> | pointers in uio_iov refer to.  as yamt points out, in our current system
> | the address space corresponds to struct vmspace since multiple processes can
> | share the same address space.
> | 
> | if we want to handle async operations, we will need to make sure that our
> | reference to the address space (via whatever structure) remains valid until
> | all async operations complete.  in some systems this is done by making
> | exiting processes wait for aio to complete, but this could be achieved
> | in other ways as well.
> | 
> | currently all the operations that use of uio_procp to refer to the kernel LWP
> | that issued the request are synchronous, to avoid the previous issue in
> | the absence of any other way to keep the proc pointer valid.  it doesn't
> | really make sense that any of these would ever become async, especially
> | because in our scheduler-activations world, the kernel LWP doesn't really
> | map to anything in particular that the user process is doing.  previously
> | we were looking to see what all non-address-space information was being
> | accessed via uio_procp so that we could consider passing this information
> | in a more explicit fashion, and the people doing that kinda got bogged down
> | in that process.  I'll dig through my old mail and see what I can find.
> 
> So what should we do? Should we commit the diff with uio_lwp and fix it
> later properly? Or should we come up with the vmspace diff now?

I think we should finish the analysis of what would go wrong if we left
the uio_procp field as-is (or changed it to be a vmspace, ie. used the
field only to refer to the address-space).  once we know what are the
current uses of uio_procp for non-address-space purposes, then we'll
have a much clearer idea how to proceed.  hopefully it will turn out
to be not so hard to fix the current abuses of uio_procp, and then
we can go with the vmspace approach.

-Chuck