Re: current build failure automated messages

To: Andreas Gustafsson <gson%gson.org@localhost>
Subject: Re: current build failure automated messages
From: Robert Elz <kre%munnari.OZ.AU@localhost>
Date: Tue, 19 Jul 2016 13:16:01 +0700

    Date:        Sat, 16 Jul 2016 12:15:41 +0300
    From:        Andreas Gustafsson <gson%gson.org@localhost>
    Message-ID:  <22409.64317.493648.115567%guava.gson.org@localhost>

  | It would not be hard to implement, but I'm not sure it would be useful
  | enough to justify doubling the number of messages to the list.

With the current rate of build failures, "doubling the number of messages
to the list" means one extra message every few days normally - not something
I would be too concerned about.

  | What I do when I want to know if a reported build failure has been
  | fixed is to visit the web page whose URL is given at the very end of
  | the email.

I have used that if I want to get to the log of the actual build that
failed, when the extract in the e-mail is insufficient to work out what
actually happened.    I guess it relates to how old I am these days, but
jumping on a browser is rarely my first reaction to anything - I mostly
predate www.* and basically consider http a total botch of a protocol, so,
I am almost always looking for an alternative.   Something simpler to deal
with.

  | If the page says "Build: OK" at the end, the issue has
  | been fixed.  At least for me, this is less work overall than it would
  | be to handle twice the number of emails.

I actually cannot imagine that being possible for me.   One more e-mail to
delete every few days is nothing; just switching to a browser and waiting
for it to page in takes an order of magnitude longer - let alone the
startup time if I don't have one running (which is not unusual).   Besides,
I can read e-mail on a text terminal trivially, and while that web
page would not be hard for a text browser to process, it just seems wrong
to me...

  | Most build failures are fixed quickly anyway.

Yes, that is exactly the point.   There was a build failure early this
morning (my time) - when I saw it (aside from the current-users message
about it) I had 2 e-mails from a script I created, one telling me of the
failure, and another telling me it was fixed.   Then I knew immediately
I could simply ignore the failure mail (the current-users message wasn't
even worth looking at.)   All I needed to see was the Subject lines of the
messages, and all the info was available for me to delete all of them
(in this case I didn't, as this was the first failure since I got the
script finished, and I wanted to see how well it worked, or if it worked
at all ... which is also why I waited to reply here until now, I wanted
to have something productive, IMO anyway, to share...)

  | If we're going to start sending more emails, I think adding notifications
  | saying "build is still failing after 24 hours" would be more useful
  | than "build now succeeding again".

I'm not sure about "more" useful, but that would certainly be useful,
though I think 12 hours would be a better timeout - that's long enough
for whoever broke it to have had time to fix things before causing others
to be provoked into getting involved.
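
As a rough sketch of such a timeout check (nothing in the script appended
to this message does this): assuming the time at which a failure was first
seen is recorded as seconds since the epoch, the test reduces to a single
comparison.   The function name is made up for illustration, and getting
the current time via `date +%s` is an assumption (widely available, but
not required by POSIX).

```shell
# Hypothetical helper, not part of the appended script.
# $1 = epoch seconds when the failure was first seen
# $2 = current epoch seconds (e.g. from `date +%s`, where available)
# $3 = limit in hours
# Succeeds (exit 0) once the failure has lasted at least $3 hours.
still_failing()
{
	[ $(( $2 - $1 )) -ge $(( $3 * 3600 )) ]
}
```

With a 12 hour limit, `still_failing "$FIRST_SEEN" "$(date +%s)" 12` would
start succeeding 43200 seconds after the recorded failure time.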

I am appending the script I am now using below.   One caveat - as is the way
with these things - I made a couple of minor adjustments to the script after
it worked earlier - and there has not been another failure since to validate
it still works (it should, but ...)

Currently I am just running it from cron every 15 minutes, but probably
better would be to have it triggered by the current-users build failure mail,
and then run it every N minutes until it goes to OK again, and then just
send the "now OK" mail and terminate.  That part (aside from the mail sending)
would get done by another script.
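
A minimal sketch of what that other script's retry loop might look like,
assuming the status file holds a PSTAT='...' line and that a good build is
recorded as "OK".   Both function names are hypothetical, and the trigger
from the incoming mail is not shown.

```shell
# Hypothetical helpers, not part of the appended script.

# Run the command given after the first two args every $1 seconds,
# until it succeeds or $2 attempts have been made.
poll_until()
{
	interval=$1 tries=$2
	shift 2
	while [ "$tries" -gt 0 ]
	do
		"$@" && return 0
		tries=$((tries - 1))
		[ "$tries" -gt 0 ] && sleep "$interval"
	done
	return 1
}

# Succeed if the status file named by $1 (give a path containing a /)
# records an OK build.  Assumes the file contains PSTAT='...' and that
# a good build is recorded as "OK".
status_is_ok()
{
	( . "$1" && [ "${PSTAT}" = OK ] ) 2>/dev/null
}

# Example (hypothetical path and names): re-check every 15 minutes,
# for up to a day:
#   check() { /path/to/this-script "$ADDR" i386
#	      status_is_ok /var/db/build-status/i386; }
#   poll_until 900 96 check
```

The attempt limit is there so that a build which stays broken does not
leave the loop running forever.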

Also, I have no idea of the timezone in which the log files are created, so
I am currently running the script using just local time (for whoever runs it.)
That only affects the name of the file that is fetched (and, if right near the
beginning of the month, the previous one - just in case the commit list that
causes a failure, or corrects a failure, spans the month boundary).  As it
is now, I am probably going to start attempting to fetch August's log before
it first gets created (as August will come earlier for me than many of you).
Of course, if the timezone for those files is from Japan or Australia then
all would be fine (for me).   It should probably be, and probably is, UTC,
but before I make the script work that way, I'd appreciate confirmation from
someone who knows (that is: what timezone is used when deciding it is time
to create a new log file -- i.e.: that a new month has started?)
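
For illustration only, here is how the month names could be computed if the
answer does turn out to be UTC (that is an assumption, not something
confirmed here).   The current month would come from `date -u`, and the
previous month can be derived with plain arithmetic, which avoids relying
on `date -d`.   The function name is hypothetical.

```shell
# Hypothetical helper, not part of the appended script.
# Print the month before $1, where $1 has the form YYYY.MM
prev_month()
{
	y=${1%.*} m=${1#*.}
	m=${m#0}	# drop a leading zero so 08 and 09 are not taken as octal
	if [ "$m" -eq 1 ]
	then
		y=$((y - 1)) m=12
	else
		m=$((m - 1))
	fi
	printf '%04d.%02d\n' "$y" "$m"
}

# Example, assuming the log files roll over on UTC month boundaries:
#   eval "$(date -u +'CUR=%Y.%m DAY=%d')"
#   PREV=$(prev_month "${CUR}")
```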

It would also be easier if the html markup actually marked the content
rather than just for appearance (class="build" means a different background
colour, class="ok" just means "text is green" and class="fail" "text is red",
and they're used that way... ideally there should be different classes for
different purposes, and if several of them all just happen to result in the
same appearance, that would be fine...)

Also, the script, attached below, attempts to make a directory
	/var/db/build-status
to keep track of what the current status is, for each architecture monitored,
(and some other stuff) but unless it is run as root (not recommended), it is
probably going to fail...   So just make the directory by hand before running
the script, and give it a suitable owner and permissions.   The first time
the script is run for an architecture it will send a more or less useless
e-mail which tells the current build status for that architecture (that it
does that is/was intentional...)
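
Making the directory by hand could look something like the following
sketch.   The function name is made up, and the mode (755) is just one
reasonable choice; the owner should be whichever user will run the script.

```shell
# Hypothetical helper, not part of the appended script.
# $1 = directory to create, $2 = user (name or uid) who will run the script
setup_status_dir()
{
	dir=$1 owner=$2
	mkdir -p "$dir" &&
	chown "$owner" "$dir" &&
	chmod 755 "$dir"
}

# Example, run once as root ("someuser" is a placeholder):
#   setup_status_dir /var/db/build-status someuser
```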

The script currently sends both "failed" (duplicating the current-users mail,
but without the extract from the build log file that usually contains enough
to figure out what failed) and "ok again" messages ... deleting the former
would be trivial.
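
One way that could be done, as a sketch: wrap the decision in a small
filter and only send mail when it succeeds.   This assumes a good build
shows up as STAT=OK; the function name is hypothetical.

```shell
# Hypothetical filter, not part of the appended script.
# Succeeds only for a build status that is worth a mail message.
# Assumes a successful build is reported as "OK".
want_mail()
{
	case "$1" in
	OK)	return 0;;
	*)	return 1;;
	esac
}

# Example: the pipeline that ends the script's main loop could become
#   want_mail "${STAT}" &&
#	change_commits "${TD}/${A}" "${STAT}" "${LNAME}" |
#		Mail -s "NetBSD $A Build $STAT" "${MAILADDR}"
```

Note that with this filter the first run for an architecture would only
send its status mail if that architecture is currently building OK.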

kre

#! /bin/sh

# 
# Usage: $0 e-mail-address [arch...]
#

case "$#" in
0)	printf >&2 '%s\n' "Usage: $0 e-mail-addr [arch...]"; exit 1;;
1)	MAILADDR="$1"; set -- i386 amd64;;
*)	MAILADDR="$1"; shift;;
esac

BSTAT=/var/db/build-status
test -f "${BSTAT}/pid" && kill -0 "$(cat "${BSTAT}/pid")" 2>/dev/null && exit 0

unset BGPID
TD=$(mktemp -d /tmp/BuildStatus.XXXXXXX)
trap 'X=$?;
	test -n "${BGPID}" && kill -9 ${BGPID} 2>/dev/null;
	rm -fr "${TD}"; exit $X'			0 1 2 3 13 14 15

# This is disgusting, but is what it takes with an older sh
( exec 2>/dev/null;
	(sleep 600& echo $!; wait; kill -ALRM $$)& echo $!
) >"${TD}/P"
BGPID=$(tr '\012' ' ' <"${TD}/P")

# modify this to use your favourite URL fetching tool...
# exit status from function should be 0 if it succeeds, !=0 otherwise
# First arg ($1) is the file for the output, second arg ($2) is the URL
file_from_url()
{
	wget -4 -q -t3 -O "$1" "$2" >/dev/null 2>&1
}

build_status()
{
	sed -n < "$1"						\
		-e '/<a name="2[0-9][0-9][0-9]\.[0-9][0-9]/{'	\
		-e	's/^.*name=/name=/'			\
		-e	's/>.*$//'				\
		-e	H					\
		-e	d					\
		-e '}'						\
		-e '/<div class="build">.*>build: / {'		\
		-e	x					\
		-e	's/name=.*\n//'				\
		-e	's/^\n//'				\
		-e	's/<div class="build">.*\n//'		\
		-e	's/<div class="build">.*$//'		\
		-e	x					\
		-e	H					\
		-e '}'						\
		-e '$ {'					\
		-e	g					\
		-e	's/<div class="build">.*build: /'"$2=/"	\
		-e	's/\('"$2"'=[^ ]*\) .*$/\1/'		\
		-e	's/name=/'"$3"'=/'			\
		-e	p					\
		-e	q					\
		-e '}'
}

change_commits()
{
	printf '%s\n\n' "Commits just before build status changed to $2"

	sed -n < "$1"						\
		-e '1,/name="'"$3"'"/d'				\
		-e '/<div class="build">.*>build: '"$2 /{"	\
		-e	x					\
		-e	p					\
		-e	q					\
		-e '}'						\
		-e '/<div class="commit">/!d'			\
		-e 's/<[^>]*>//g'				\
		-e H						\
		-e '$ {'					\
		-e	g					\
		-e	p					\
		-e	q					\
		-e '}'
}

test -d "${BSTAT}" || mkdir -p "${BSTAT}" || exit 2

trap 'X=$?;
	test -n "${BGPID}" && kill -9 ${BGPID} 2>/dev/null;
	rm -fr "${BSTAT}/pid" "${TD}"; exit $X'		0 1 2 3 13 14 15

printf '%s\n' "$$" > "${BSTAT}"/pid || exit 3

eval "$(date +'CUR=%Y.%m DAY=%d')"
PREV=
case "${DAY}" in
0[1234])	PREV=$(date -d 'one month ago' +'%Y.%m') ;;
esac

for A
do
	file_from_url "${TD}/$A.cur" \
		"http://releng.netbsd.org/b5reports/$A/commits-${CUR}.html" ||
			continue

	test -n "${PREV}" && {
	    file_from_url "${TD}/$A.prev" \
		"http://releng.netbsd.org/b5reports/$A/commits-${PREV}.html" ||
			continue
		cat "${TD}/$A.prev" "${TD}/$A.cur" > "${TD}/${A}" ||
			continue
	} ||
		ln "${TD}/$A.cur" "${TD}/${A}" ||
			continue

	eval "$(build_status "${TD}/${A}" STAT NAME)"

	if [ -f "${BSTAT}/${A}" ]
	then
		eval "$(cat "${BSTAT}/${A}")"
		test "${PSTAT}" = "${STAT}" && {
			test "${NAME}" = "${LNAME}" ||
				printf '%s\n'				\
					"PSTAT='${STAT}'"		\
					"LNAME='${NAME}'" >"${BSTAT}/${A}"
			continue
		}
	else
		LNAME="no-version"
	fi

	printf '%s\n' "PSTAT='${STAT}'" "LNAME='${NAME}'" >"${BSTAT}/${A}" ||
		continue

	change_commits "${TD}/${A}" "${STAT}" "${LNAME}" |
		Mail -s "NetBSD $A Build $STAT" "${MAILADDR}"
done
