Source-Changes-D archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: Leak Sanitizer - how to suppress leaks



On 12.09.2019 21:24, Robert Elz wrote:
>     Date:        Thu, 12 Sep 2019 15:12:25 +0200
>     From:        Kamil Rytarowski <n54%gmx.com@localhost>
>     Message-ID:  <2a6e1fb2-cedc-4a57-750b-45f101be9fb9%gmx.com@localhost>
> 
> 
>   | This proposal is practically equivalent of disabling leak detection at
>   | all and removes the whole purpose.
> 
> No it isn't.   Or rather, it might be, the way you describe LSan to
> work, but if that's what it is doing, then it is close to useless:
> 

Please note that in all arguments about leak sanitization, that it uses
internally the same mechanism as garbage collectors. lsan and libgc (and
others) also server similar purpose.

>   | Leak detection works simply by scanning memory (TLS, stacks, heap) for
>   | pointers to allocations. If there is no longer a reference, than there
>   | is a leak.
> 

I forgot to mention that there might be references to allocated objects
still in at least registers. Certain leaks could be skipped if at least
one thread still references it in a register.

> So the following code
> 
> 	struct list {
> 		struct list *prev;
> 		struct list *next;
> 		int age;
> 		/* other stuff not relevant here */
> 	};
> 
> 	struct list *
> 	new_list(int age1, int age2)
> 	{
> 		strust list *l1, *l2;
> 
> 		l1 = malloc(sizeof *l1);	/* check for NULL elided here */
> 		l2 = malloc(sizeof *l2); /* ditto */
> 
> 		l1->prev = NULL; l1->next = l2; l1->age = age1;
> 		l2->prev = l1; l2->next = NULL; l2->age = age2;
> 
> 		if (age1 > age2)
> 			return NULL;
> 
> 		return l1;
> 	}
> 
> has no leak (in the case age1 > age2) ?   Scanning memory would see
> a pointer to *l1 (in *l2), and a pointer to *l2 (in *l1).  Both blocks
> of memory are referenced according to that trivial analysis.   Useless.
> 

LSan can detect indirect leaks:

$ ./test2

=================================================================
==14114==ERROR: LeakSanitizer: detected memory leaks

Indirect leak of 24 byte(s) in 1 object(s) allocated from:
    #0 0x40ba65 in __interceptor_malloc
/public/llvm/projects/compiler-rt/lib/lsan/lsan_interceptors.cpp:54:3
    #1 0x403915 in new_list (/tmp/./test2+0x403915)
    #2 0x4039b4 in main (/tmp/./test2+0x4039b4)
    #3 0x40381c in ___start (/tmp/./test2+0x40381c)

Indirect leak of 24 byte(s) in 1 object(s) allocated from:
    #0 0x40ba65 in __interceptor_malloc
/public/llvm/projects/compiler-rt/lib/lsan/lsan_interceptors.cpp:54:3
    #1 0x403907 in new_list (/tmp/./test2+0x403907)
    #2 0x4039b4 in main (/tmp/./test2+0x4039b4)
    #3 0x40381c in ___start (/tmp/./test2+0x40381c)

SUMMARY: LeakSanitizer: 48 byte(s) leaked in 2 allocation(s).

$ cat test2.c


#include <unistd.h>
#include <stdlib.h>

        struct list {
                struct list *prev;
                struct list *next;
                int age;
                /* other stuff not relevant here */
        };

        struct list *
        new_list(int age1, int age2)
        {
                struct list *l1, *l2;

                l1 = malloc(sizeof *l1);        /* check for NULL elided
here */
                l2 = malloc(sizeof *l2); /* ditto */

                l1->prev = NULL; l1->next = l2; l1->age = age1;
                l2->prev = l1; l2->next = NULL; l2->age = age2;

                if (age1 > age2)
                        return NULL;

                return l1;
        }


int
main(int argc, char **argv)
{
        struct list *l;

        l = new_list(10,5);

        return 0;
}


> The "close to" just above came from there being cases that it can find.
> 
> Because of this, I also looked at the change you pade to "fix" ps.
> You added
> 
> 	free(pinfo);
> 
> But *pinfo (a struct pinfo) contains 3 pointers, each of which may
> refer to other malloc'd blocks.   Further, pinfo points to an array
> of those structs.   Because of the memory scan, those pointers (in each
> element of the array) will still have been seen, but when you free(pinfo)
> then those pointers can no longer be accessed (depending upon what the free()
> in use while the sanatiser is running does, they may not still exist
> in memory, or they might, just in a block that is now free).  Anyway
> the struct kinfo_proc2 struct kinfo_lwp and char * (prefix) entries in
> all the struct pinfo's are not being freed.  There was no leak before, but
> your "fix" introduced one (or three * N).
> 

No more leaks were reported, if there are any I will address them.

LSan/NetBSD still needs more work and it is not expected to be perfect.

> I'd suggest reverting the change you made, and instead make pinfo a
> global instead of a local in main - then its value will be seen still
> existing by the memory scan (LSan will be happy) and ps won't be wasting
> (a trivial amount of) time doing an incorrect (as it is now) free(), and
> we'll all be happier.
> 

I will revert it, but I am looking for a more generic approach.

How about adding ifdef __NO_LEAKS like:

#ifdef __NO_LEAKS
free(3)?
#endif

And in lsan/asan/valgrind/etc checks use -D__NO_LEAKS.

In general, I don't want to reinvent fixes for the same leaks in various
tools differently.

> Incidentally, what happens with code like:
> 
> 	struct foo {
> 		int a, b, c;
> 	};
> 
> 	int *
> 	make_foo(int a, int b, int c)
> 	{
> 		struct foo *p;
> 
> 		p = malloc(sizeof *p);		/* again NULL check elided here */
> 		p->a = a; p->b = b; p->c = c;
> 
> 		return (&p->b);
> 	}
> 
> 	int * global_ptr;
> 
> 	main()
> 	{
> 		global_ptr = make_foo(1,2,3);
> 
> 		/* ... */
> 
> 		{
> 			struct foo *p;
> 
> 			p = (struct foo *)((intptr_t)global_ptr -
> 				 offsetof(struct foo, b));
> 
> 			p->a++;
> 		}
> 		
> 		/* ... */
> 	}
> 



> Ugly (and stupid) but not illegal I think.   The more common case is
> a list where the back pointers point at the forward pointer, rather
> than at the struct itself, so one can do
> 
> 	if ((*(p->back) = p->next) != NULL)
> 		p->next->back = p->back
> 	free(p);
> 
> to delete an element from the middle of a list easily - which works
> even if the element to be deleted is first or last.   There if the
> next field is the first in the struct, the back field will refer to
> the malloc's memory block (by accident) but if it isn't, it won't.
> 

Feel free to experiment.

http://cdn.netbsd.org/pub/NetBSD/misc/kamil/llvm-clang-compilerrt-10.0.0beta_2019-09-11.tar.bz2

>   | I have not investigated where are the sh(1) leaks,
> 
> As far as I know it doesn't.   What I said is that it does not free
> memory it allocates before it exits.  That is not a leak.
> 

I think we need to specify the definition. Leak is a memory object
without a pointer referencing it.

Whether a leak is a problem or no-problem, it's another affair. Whether
the leak is expected, wanted or unwanted it does not change the fact
that it is a leak.

A special type of a leak is referencing memory that is no longer needed
and could be freed.

>   | In sanitizers we cannot overload or do anything special with main()
>   | (such as replace it with our own copy).
> 
> That wasn't the suggestion.   However, what was assumed a more rational
> leak detection algorithm than appears to be in use - I agree, by simply
> scanning memory you cannot do anything special with main() as when the
> code runs, apparently, main's stack frame has already gone.
> 

Its stack frame can be gone.

> But were this done a different way -- the right way is to annotate, in the
> code, all assignments to pointers , and add a ref counter to all allocated
> memory blocks - when a pointer is assigned, and it points to something
> allocated, then incr the ref count.  Whan the value of a pointer is
> overwritten, and it used to point to allocated memory, decr the ref count.
> If that decrease drops it to 0, then memory has been leaked.  In the
> exit code for every function (except main, which can be recognised, we
> know its name is "main", in this scenario) put code that checks every
> local var that is a pointer, for each which points at allocated memory,
> decrease the ref count.  If it reaches 0, that's a leak.
> 
> This will still give false errors, eg in
> 
> 	main()
> 	{
> 		init();
> 		read_config();
> 
> 		return do_the_work();
> 	}
> 
> assuming that none of init() read_config() or do_the_work() are
> called anywhere else, then memory they allocate (and particularly
> do_the_work()) which is not freed is not a leak.
> 
> This method still has problems with doubly linked lists, handling
> those properly needs code to recognise what is going on, and when
> analysing a list pointer drref, instead of checking against 0,
> scan the list and find out how many internal pointers ref the
> list elements, and ensure that (at least) one element on the list
> (ahead of the one being deleted, for singly linked lists) has
> a higher ref count than can be attributed to pointers inside the
> list itself.   That's a lot of work for big lists (and trees, etc)
> especially ones that are constantly being updated.
> 
>   | main() is the same entity as any regular fun(). Unless we decorate it
>   | with some attributes in .c code, it is not distinguishable.
> 
> Decoration is not needed, its name is its decoration.
> 

From the compiler point of view it is only slightly different than any
regular "int foo(int, char**)".

From sanitizer point of view it is not distinguishable and we cannot
(easily) make any conditions based on this. From sanitizer point of view
it does not contain any decoration.. In order to make it special we
would need to patch every .c code with main() adding extra code in it.

Another approach would be to craft a dedicated csu with
customized/patched code. But it is against the design of sanitizers to
run on unmodified basesystem libraries.

Its still possible to patch code generation part in clang/gcc and make
some conditions there.. but it is overkill (and too difficult for me as
of now). Also it would eliminate the  purpose of LSan if it would still
see every memory object as referenced.

>   | Globals are treated specially in sanitizers as sanitizers generate
>   | constructors and destructors to keep the track of them.
> 
> Sorry, that (by itself) means nothing to me,   What do those constructors
> and destructors do?
> 
> What we have here is a global pointer, which points to allocated memory,
> and which is still valid when the program exits (the memory is still
> allocated, and the global var still points to it).
> 

Actually globals are not used in LSan and can be ignored.

>   | I have benchmarked in the past that we execute for true(1) or similar
>   | applications at least 1 million instructions on CPU.
>   |
>   |
>   | $ ./singlestepper true
>   | Total count: 1639915
> 
> True is a chell script.   You're measuring the startup (including
> dynamic linking) of a fairly large program.
> 
> Try
> 
> 	main() { _exit(0); }
> 
> and cc -static
> 
> and then measure that, and see how many instructions are executed
> (if we wanted an executable "true" - that is, if that were ever
> actually (significantly) used outside shell scripts, where it is
> a builtin, that is how we would write it.)
> 
> [Aside: for some reason, probably related to crt0 or something, the
> result of this is ridiculously large, and seems to have a good fraction
> of libc linked into it.   That could be fixed...]
> 
>   | $ ./singlestepper echo
>   | Total count: 1039124
> 

Please feel free to experiment with singlestepper:

https://github.com/krytarowski/picotrace/

From my logs, I managed to get down to 143k instructions in a static
application.

Yes, it needs some investigation whether we really need this amount of
code to be executed.

I consider this as something really wrong, but I keep focus on the
tooling for now.

> Again, that will almost all be the dynamic linker.
> 

Dynamic linker execution time is a part of execution time of the whole
application.

>   | Calling one 1 dummy function in libc in comparison to that is negligible.
> 
> Sure, if one function call is all it takes - but that would have to be
> "turn off the memory scan" - ie: do no leak detection at all.   That isn't
> what is desired, that would truly make the whole thing useless.
> 

I have benchmarked it.

http://netbsd.org/~kamil/patch-00148-__suppress_leak.txt

This implementation of __suppress_leak() takes 9-10 amd64 CPU
instructions for each __suppress_leak() call.

It's not the same as time based bechmarking and could be called
irrelevant by some people.

For time based checks, in order to get >0.00s time spent, I need to
rep[eat the call 10.000.000 times.

Before getting measurable performance impact of __suppress_leak() we
will can go out of memory..

In the context of anything executed in libc/csu/application, even for a
hello world application this is negligible.

> leak detection that finds bugs like the one I invented earlier in this
> message is a good thing to have - it is very easy to forget to free
> everything that has been allocated in a complicated function that is
> bailing out in an error case.   Finding those bugs is definitely worthwhile.
> 
> But insisting that all memory that is ever allocated is explicitly freed,
> just so the sanitiser doesn't claim it has "leaked" is absurd.
> 
> If it is possible to write code to run just before exit() which frees
> memory, that memory, by definition, has not been leaked - only memory
> that we no longer have a reference to, and so cannot possibly free()
> can be rationally considered leaked.
> 
>   | We are talking here about simple programs that are expected to get some
>   | measurable optimization by skipping free(3) calls. Such as sh(1) or ps(1).
> 
> I don't know the insides of ps, I don't know that there is much that needs
> to be freed which isn't already.  None of the things that the pinfo points
> to contain pointers themselves, so all that would be needed is 3*N + 1 free()
> calls (assuming this is all there is).  Certainly I did not care about the one
> free(psinfo) call that was added - that (if it were not creating a leak)
> would be irrelevant to anyone.
> 
> It was the general principle of requiring memory to be freed - actually
> writing code to do that, and then immediately calling exit() which frees
> everything, always, that I was objecting to.   The only reason for ever
> doing that extra work is to make it easier for a half baked sanitiser
> implementation to not report non-problems as errors.
> 

Sanitizer does not distinguish problems from non-problems, it detects
that what is was designed for.

I agree that writing code and quality needs some extra effort.

>   | In all other cases I consider this style of programming as buggy.
> 
> You can consider it however you like, but it isn't.   The interface
> specs for the OS guarantee that when a process exits, all of its
> resources (memory, fds, ...) are released.   Programs (and programmers)
> are entitled to rely upon this, and allow it to work for them.
> 
> All kinds of programs build all kinds of data structs as they work, and
> then they're done, they exit() - tearing it all down for no reason is
> lunacy.
> 
>   | If there are thousands of leaked memory objects than it is not just
>   | simple optimization, but really wrong style of programming.
> 
> Not wrong, just not the style you (apparently) like.
> 

In current POSIX we are lucky as this is defined 'Memory mappings that
were created in the process shall be unmapped before the process is
destroyed.'.

In a strict C code, in hosted environments this is not specified. In a
freestanding environment this is Implementation-Defined what happens on
exit().

I might be wrong with my memory, but I recall something that early SunOS
had a problem, that unmapping the memory regions on exit was imperfect
and the resources were lost until reboot. (Don't ask me to find where I
read about it now).

With garbage collectors, certain implementations of Lisp computers were
so slow, that it was quicker to reboot a computer than wait for the
garbage collector to finish its work.

> Incidentally, do you close(0) close(1) close(2) in all your programs
> before you exit()?   (fclose(stdin) etc ae as good when stdio is being
> used).
> 

exit(3)/process termination is defined to close file descriptors.

The behavior of flushing descriptor 0 is defined in POSIX to be
performed on exit.

LSan does not look look for unclosed file descriptors so we can skip
discussing the file descriptor leaks as offtopic here.

> If not, your program, according to you, is leaking file descriptors.
> 

No.

> The kernel closes them for you during the exit processing of course,
> but in your apparent philosophy that's not good enough...
> 

Actually whenever I open a file descriptor I try to close it when no
longer needed.

If we speak about philosophy, the same philospohy of not freeing the
resources is like to avoiding checks for the return value from
malloc(3), as it hardly ever fails today. After the update to the newer
jemalloc my terminal emulator stopped to work as an underlying library
was allocating an internal buffer with a garbage as its size, this
caused malloc(3) to return ENOMEM for very large values. Unfortunately
the fallout was difficult to catch and it was detected by leot@ with a
sanitizer.

Yes, I'm for checking for the return value always, ideally using
emalloc(3) from util(3).

>   | One example
>   | of that is ATF, it randomly picks one set of pointers to free before
>   | exit, and skips others leaking the memory. It is bug.
> 
> If it is actually doing just what you say, then no, it is not a bug.
> If it is losing memory while running, so as it runs more and more
> tests (or subtests) it keeps getting bigger and bigger as (some of)
> the memory for a previous (sub-)test hasn't been freed, but is no longer
> needed, or accessible - then that would be a bug, and should be fixed.
> 
> But if it is just doing what you say, and not bothering to free memory
> that it has been using (for example) for accumulating stats on the number
> of tests run, how many failed, the names of those that failed, ... and
> at the end it prints all that and exits without freeing that memory,
> that is 100% perfectly OK.
> 

As I stated, ATF tries to free the memory, but it is not freed
diligently and the job is done only partially.

>   | As I have stated originally, we could remove free(3) from almost every
>   | basesystem program and not many would notice.
> 
> I know what you're saying, but people would notice, as ...
> 
>   | There is even a C compiler
>   | that performs only allocations and never frees the memory. It is still a
>   | correct UNIX application, regardless of taking like 2GB for building a
>   | hello world application.
> 
> people observe that kind of thing, programs that are huge when they
> should not be.
> 
> Certainly with the huge (comparatively) VM space we have available today
> it is easy to let many of these kinds of things slide as not being all
> that important.
> 

If something is not important, the right approach is to pick g/c
programming language.

> But that isn't the case we're talking about - rather memory that *cannot*
> be freed until the program exits (the data will still be needed as long as
> the program remains running) but which is clearly useless after the
> program has finished.   exit() (inside the kernel, not the libc function)
> frees all of that for us.   Using it is not incorrect, not even slightly.
> 

In POSIX it's not incorrect.

In conformant C environment it's another story, that exceeds 'slightly
incorrect' code meaning.

> kre
> 
> ps: an alternative shcnge for ps.c would be to alter
> 
> 	return eval;
> 
> at the end of main() to be
> 
> 	fflush(stdout);
> 	_exit(eval);
> 
> ps doesn't need atexit handlers (other than to flush its stdout, in
> case it has been buffered), so we do the flush manually, and then _exit
> sends back the exit code.   No atexit() handlers ever get run, and LSan
> won't get a chance to object to the pinfo value not being freed.
> 
> I might start altering all programs I work on to use  that technique.
> 
> 

Every regular NetBSD program uses atexit(3).

ps(1) uses indirectly 3 atexit(3) entries from the early bootstrap code
(see: crt0-common.c), at least for calling _rtld_exit and _fini.

Reimplementing libc/csu is possible and legal... but rather on the
borderline.

LSan still has support for detecting _exit as it intercepts it. Actually
it does not perform leak detection at this point and only sets proper
exit code.. but it could be patched to again/still detect leaks for such
smart applications. The same could be done for syscall(2),__syscall(2)
and abnormal exits.

> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> (Not really!)
> 

Please try to change the perspective and change the thinking for a
moment. We can discuss here about purity of the POSIX behavior vs C
standard vs some styles vs tastes vs real definition of the leak etc.
but it does not help to address the real problem.

As an example, please imagine that a vendor, a user, a hobbyist, a
company is doing a product, a project on top of the NetBSD codebase/system.

Let's imaging that a company needs to use a low-footprint device that is
able to execute merely /sbin/init and an application started directly by
it (or built in into it).

For an example, let's assume that the device needs to contain a
functionality to list ps(1)-style information, such as "ps -O maxrss,
rss", for example for the logging purposes and putchar() is replaced
with send a character over a wire (doesn't matter now whether it is
sensible setup).

The functionality has to be linked into a long-running process.

There are two options:
 - copy-paste code from ps(1) and adapt, calling ps_main(3, ps_argv)
 - reimplement the functionality from scratch

Let's assume that a developer copies src/bin/ps/* into the project,
changes to a custom function name, links into the product and runs. It
seems to work fine, but when deployed the memory statistics start to
look dubiously and the longevity of a device is 1 day, 1 week or 1 month
and must be rebooted interrupting a critical process that shall not be
interrupted.

After the investigation we see that the RSS/MAXRSS keeps growing because
of the pinfo struct.

For another exercise assume that we use rumpunikernel here, so a
freestanding environment built for a single application, without the
real OS and POSIX assumptions.

Back to the real life, I was there and I saw that. I saw the developer
team building an embedded multimedia device and the frustration of
picking existing open-source code chunks. The memory leaks were killing
the experience so deeply, that it was often better to revisit the idea
of copy-pasting code and reimplementing the functionality from scratch.

In another company I saw a devel team that was following the rule of NIH
of everything in the product as they could not stand the memory
requirements on MIPS 256 MB devices in existing open-source projects due
to lack of the regime of freeing unused memory objects.

The leaks bit me when I was trying to turn our applications into
libraries and make bindings. for osquery-like purposes.


If we want to catch leaks in sh(1) (in our releng builds and machines in
a future MKSANITIZER mode) that affect long-running applications, we
first need to remove the noise from uninteresting leaks. And we are back
to the original problem.

 - use free(3) optionally wrapped in #ifdef _NO_LEAKS
 - use __suppress_leak() as a generic approach for all leak detecting
software with a negligible cost on
 - use per-tool approach

I'm personally against the 3rd option as inferior, ugly in practice and
sometimes requires patching that tool (no ifdef detection for
-fsanitize=leak...).

__suppress_leak() is practically free and could be easily aliased to a
proper function for a leak detecting tool when needed.

#ifdef _NO_LEAKS + free(3) is less invasive, but needs rebuilding the
software stack for code checking.

Attachment: signature.asc
Description: OpenPGP digital signature



Home | Main Index | Thread Index | Old Index