Skip to content

Conversation

@sshambar
Copy link
Contributor

Many clients currently have no network timeout; connections can take several minutes to fail (or longer if multiple addresses are resolved).

Patch adds a 10 second timeout to all client connections. upsmon especially benefits from the patch as one unavailable ups could delay processing of other ups's, and also cause the watchdog to fire.

Many ups clients were also not attempting SSL, even ones that use authentication to perform actions. Patch adds flag to attempt (but not require) it.

@jimklimov
Copy link
Member

Thanks. I presume you saw comments to PR #85 which tackled this matter before? At least, yours does use upscli_tryconnect() :)
I am not in favor of a hard-coded timeout like defined here. The define can be a default, but users should be able to override it (e.g. with a common CLI option and/or envvar, at least for clients without a config file?)

Also, this deserves entries in NEWS.adoc as well as UPGRADING.adoc (as a possible behavior change and cause for troubleshooting after NUT upgrades on a deployment).

@jimklimov jimklimov added enhancement service/daemon start/stop General subject for starting and stopping NUT daemons (drivers, server, monitor); also BG/FG/Debug upsmon labels Mar 13, 2025
@jimklimov jimklimov added this to the 2.8.3 milestone Mar 13, 2025
@sshambar
Copy link
Contributor Author

Thanks. I presume you saw comments to PR #85 which tackled this matter before?

No, I hadn't searched back that far :)... I just had a machine go offline and all my upsmon's starting failing the watchdog and restarting (which failed their heartbeats)... (and my checkmk agents started stalling on upsc) -- so I decided some kind of reasonable timeout would at least fix the "connect" issue.

At least, yours does use upscli_tryconnect() :) I am not in favor of a hard-coded timeout like defined here. The define can be a default, but users should be able to override it (e.g. with a common CLI option and/or envvar, at least for clients without a config file?).

Well, several of the clients use '-t seconds' for a command completion timeout... do you want me to apply that timeout to the connect part too?

Environment config would be possible, but that gets into a much more involved "one per client vs unified client default vs nutclient.so library default vs ???"

I didn't add "alarms" to the network transport part of the program (like upsmon has), but I guess that's possible too...

Also, this deserves entries in NEWS.adoc as well as UPGRADING.adoc (as a possible behavior change and cause for troubleshooting after NUT upgrades on a deployment).

OK, I'll put something in them once we've got a consensus on what we're doing :)

@jimklimov
Copy link
Member

jimklimov commented Mar 13, 2025

I think 't' for --timeout would be good. Indeed, it is used as such for nut-scanner, for similar cause in upsrw and upscmd. Just for "testing mode" in upsdrvctl - but drivers can have their own addvar("timeout"...) or similar, and at least some of the networked ones do (netxml-ups, snmp-ups...)

Alternately, maybe better, 'T' could be used almost everywhere (so far seems only used for thread-count constraint in nut-scanner which has t as timeout) to avoid overlap with the timeout of waiting for commands to respond (which logically differs from the timeout of server connection, might as well be by orders of magnitude).

As for defaulting, I suppose there can be a new method in upsclient.{c,h} to get_default_timeout(char*) into which you would feed the value seen in CLI (if any, else the variable for the argument is NULL), and which juggles the built-in default, or envvar, or explicit request, consistently and with a single implementation. Then the programs just assign (or not) the variable for argv[i] corresponding to a -t they see (or not), and after the CLI option loop is done - call this method and get an int (or a struct timeval right away?)

The same method can be called to handle config file options, if any are provided (so maybe calling it twice, or with two args either or both of which can be null). Two args would help establish priorities better (built-in, config file, envvar, CLI option).

@sshambar
Copy link
Contributor Author

Alternately, maybe better, 'T' could be used almost everywhere (so far seems only used for thread-count

'W' appears available for all the clients; matching the ping -W option, it could support a float that can set fractional seconds.

As for defaulting, I suppose there can be a new method in upsclient.{c,h} to get_default_timeout(char*) into which you would feed the value seen in CLI (if any, else the variable for the argument is NULL), and which juggles the built-in default, or envvar, or explicit request, consistently and with a single implementation. Then the programs just assign (or not) the variable for argv[i] corresponding to a -t they see (or not), and after the CLI option loop is done - call this method and get an int (or a struct timeval right away?)

So you're thinking something like:

  void get_default_timeout(const char *opt_val, const char *config_val, struct timeval *tm);

which can be called multiple times, and if opt_val/config_val are non-NULL they "update" the internal timeout, and later calls to get_default_timeout(NULL, NULL, &tv) will return the latest value?

@sshambar
Copy link
Contributor Author

Actually, would it not be simpler to just have upscli_connect() just honor the default timeout? Then it's as easy as:

upsclient.h:
int upscli_set_default_timeout(const char *str); // <str> may be NULL, returns -1 on parse/value error
int upscli_get_default_timeout(struct timeval *tv); // -1 if no default set

<prog>.c:
#define UPSCLI_DEFAULT_TIMEOUT "10"
upscli_set_default_timeout(argval ? argval : UPSCLI_DEFAULT_TIMEOUT);
upscli_connect(...) // as before

The timeout is unset at startup, so current library users don't see a change.
Calls to upscli_set_default_timeout() set/update the timeout (NULL reverts to no default).

All nut client tools could have the new option, always call upscli_set_default_timeout() and get network timeouts with no further updates. The timeout could also apply to all network traffic (not just connect) and make clients easier to maintain. Most clients would never even call upscli_get_default_timeout()...

UPSCLI_DEFAULT_TIMEOUT could be in a common nut tool header somewhere, but isn't really in the library (which has no default)

Option ==> UPSCLI_DEFAULT_TIMEOUT is the new global default, but can be turned off with upscli_set_default_timeout(NULL)

@sshambar
Copy link
Contributor Author

As a trial, I added the api above and updated upsc (only) to use it with option -W (option -W 0 sets no timeout)

Many clients currently have no network timeout; connections can take
several minutes to fail (or longer if multiple addresses are resolved).

Added upscli_set_default_timeout() to assign a default timeout, honored
by upscli_connect().

Adds a 10 second timeout to all client connections.  upsmon especially
benefits from the patch as one unavailable ups could delay processing of
other ups's, and also cause the watchdog to fire.

upsc has a new option '-W <secs>' to set the network timeout.

Many ups clients were also not attempting SSL, even ones that use
authentication to perform actions.  Patch adds flag to attempt (but not
require) it.

Signed-off-by: Scott Shambarger <[email protected]>
@jimklimov
Copy link
Member

So you're thinking something like:

void get_default_timeout(const char *opt_val, const char *config_val, struct timeval *tm);

which can be called multiple times, and if opt_val/config_val are non-NULL they "update" the internal timeout, and later calls to get_default_timeout(NULL, NULL, &tv) will return the latest value?

Well, my idea here was that this method selects the best default timeval from given sources that may be present or absent, and the program then keeps the variable and uses it for upscli_tryconn. Comparing to your later comments, potentially this allows several such variables, rather than one hidden in the library as an initially-set-never-changed value. That could be useful e.g. in upsmon, maybe upslog or other clients that deal with several UPSes simultaneously...

The idea with (argval ? argval : UPSCLI_DEFAULT_TIMEOUT) seems neat. Perhaps this can be used with an UPSCLI_DEFAULT_TIMEOUT defined in the actual library (upsclient.h) and if some tools think they want another default - they can pass their own macro this way.

It may be even easier on processing (logic at least) somewhat, the method picks the best preferred non-NULL string and parses it into a number. MAYBE if the parsing fails, it takes the next best-liked string pointer and so on. Although then the argval being the last tried option with an invalid content would need a hard-coded number still (e.g. 0 for indefinite timeout like now?)

upscli_connect(...) // as before

This method, per #85 discussion, should connect until the system reports a failure (e.g. TCP/IP stack timeout). If we have a soft timeout to consider, consumer code should (be changed to) use upscli_tryconnect().

@jimklimov jimklimov marked this pull request as draft March 14, 2025 14:49
@sshambar
Copy link
Contributor Author

Well, my idea here was that this method selects the best default timeval from given sources that may be present or absent, and the program then keeps the variable and uses it for upscli_tryconn. Comparing to your later comments, potentially this allows several such variables, rather than one hidden in the library as an initially-set-never-changed value.

So I guess I'm unsure what the xxx_default_timeout() function does then... if the client passes in the "sources" (command option, default, config file in future?), is the function just a wrapper for strtoul() -> timeval.tv_sec? -- and clients store it locally and pass it to upscli_tryconnect()?... also, would clients then handle their own "post-connect" timeouts with alarm() or similar (as upsmon currently does)?

I guess I thought it would be nice to have a function that changed the timeouts for all network functions so that each client didn't have to "re-invent the wheel" with alarms or such (and not calling the function just keeps things the way they are now...)

That could be useful e.g. in upsmon, maybe upslog or other clients that deal with several UPSes simultaneously...

So you think upsmon/upslog might need different network timeouts for different UPSs? That's flexible, but i wonder how many people would need different timeouts for each UPS... of course, just calling the xxx_default_timeout() before handling each UPS's network traffic could support that -- perhaps a more clearly named upscli_set_net_timeout()?

The idea with (argval ? argval : UPSCLI_DEFAULT_TIMEOUT) seems neat. Perhaps this can be used with an UPSCLI_DEFAULT_TIMEOUT defined in the actual library (upsclient.h) and if some tools think they want another default - they can pass their own macro this way.

Yeah, I couldn't quite figure out a good name though, since it's only the "DEFAULT" if you actually call the function :D

It may be even easier on processing (logic at least) somewhat, the method picks the best preferred non-NULL string and parses it into a number. MAYBE if the parsing fails, it takes the next best-liked string pointer and so on. Although then the argval being the last tried option with an invalid content would need a hard-coded number still (e.g. 0 for indefinite timeout like now?)

Wouldn't it be useful to have a parse error (on a parameter value for example) actually fail the command with a useful error message rather than a warning and ignoring it?

upscli_connect(...) // as before

This method, per #85 discussion, should connect until the system reports a failure (e.g. TCP/IP stack timeout). If we have a soft timeout to consider, consumer code should (be changed to) use upscli_tryconnect().

But that would still require clients to (a) pass the timeout value about, (b) apply it to alarms() etc to interrupt other network functions... upscli_tryconnect() is useful for explicit timeouts, but a nice function to set the "default" timeout would make the interface easier to use...

@sshambar
Copy link
Contributor Author

BTW, it's worth noting that the upscli_readline(...) just calls upscli_readline_timeout(...,DEFAULT_NETWORK_TIMEOUT) (defined in common.h, and is 5).

Oddly, upscli_sendline(...) calls upscli_sendline_timeout(...,0)

@jimklimov jimklimov marked this pull request as ready for review March 30, 2025 09:37
… of the method remaining as it was

Signed-off-by: Jim Klimov <[email protected]>
…) to avoid dependency on env (locale)

Signed-off-by: Jim Klimov <[email protected]>
The standard says that static-storage arithmetic values are pre-initialized to
zero implicitly, but better safe than sorry with all the compilers out there.
The X={0} spelling for aggregates of arithmetic entities to be zero-inited for
each component should be supported since at least C89 standard.

Signed-off-by: Jim Klimov <[email protected]>
…any clients changed to 10-second default timeout instead of blocking [networkupstools#2847]

Signed-off-by: Jim Klimov <[email protected]>
@jimklimov
Copy link
Member

I think I'm done for now with formalities like man pages and news and stuff, if you could take over fixing the remaining NUT clients (I think most of them would just patch out the currently proposed numeric setting in favor of what ended up done in upsc.c) that would be great.

Maybe some special attention is needed to upsmon.c? But it is a wee hour of the night here and I begin to struggle reading the code, and a busy week awaits while I hope to cut off v2.8.3 in coming days :) So help is welcome again :)

Added -W <secs> network timeout arg to all clients to match upsc usage.
Updated help message with "common args" section for -h/-W/-V args.
Added upscli_get_default_timeout (used by upsmon)

Signed-off-by: Scott Shambarger <[email protected]>
@sshambar
Copy link
Contributor Author

I updated all the clients to use the same -W argument as upsc, and updated the "usage" text to match the "common arguments" grouping...

For use in upsmon I added upscli_get_default_timeout() to get the alarm timeout... I chose a "struct timeval *" paramenter, but the function could also return a "struct timeval" value (it's a style choice, feel free to change it :)

The client man pages don't mention -W yet (or -V or -h in most cases either)... I wasn't sure if you wanted to crowd the pages with all the "common arguments"...

I noticed that upscli_init_default_timeout() doesn't actually fail if an invalid "cli_secs" is used (since the "default_secs" is valid)... might be nice to issue at least a visible error message if the timeout is "ignored"... or even fail (but still ignore invalid "environment" values?).

Do you think other upsclient.c functions should use the default timeout too? eg upscli_readline() or upscli_sendline()

@AppVeyorBot
Copy link

@jimklimov
Copy link
Member

Thanks! I've fixed a spellchecker complaint. A pointer in the getter I think is better - less data pushed through stack/registers :)

As for default/configurable read/send timeouts, I'd say (and the amount of knobs in other programs supports this) that they are fundamentally different timeouts compared to initial-connection ones.

It may make sense to add a similar set of methods and internal setting for those (gotta read up if anyone differentiates read and recv timeouts as two different beasts too, if there's reasonable precedent for that). In either case, perhaps it makes sense to rename the ones added by this PR so far to clarify that this is about connection timeouts, before the libupsclient API publishes "unfortunate" names forever.

It is already unwieldy (bit me here) that upscli_init() is not so much about general initialization, but rather about SSL/TLS. I had put the variable init there first, before discovering that this method is actually called from only upsmon which sets up SSL/TLS keystores. It may make sense to track that the timeout was initialized before though, so upscli_init() would try to init it (0 or envvar) if it wasn't yet. That would be idiomatic and useful for envvar-tweaked clients.

@jimklimov
Copy link
Member

Failures to parse were seen with debug... changing to upslogx(LOG_WARNING,...)

…NG,...) about timeout strings that we failed to parse [networkupstools#2847]

Signed-off-by: Jim Klimov <[email protected]>
…efault_timeout_initialized [networkupstools#2847]

If not, auto-init in upscli_init() and upscli_connect() once, to
help clients not modified to call upscli_init_default_timeout()
explicitly to still benefit from NUT_DEFAULT_CONNECT_TIMEOUT envvar.

Signed-off-by: Jim Klimov <[email protected]>
@jimklimov
Copy link
Member

After the fix, runs with no logging do report the error(s):

jim@KIHOMI:~/nut$ NUT_DEFAULT_CONNECT_TIMEOUT=-3.8 ./clients/upsc -W 18,25 eco650
upscli_init_default_timeout: NUT_DEFAULT_CONNECT_TIMEOUT='-3.8' value was not recognized, ignored
upscli_init_default_timeout: cli_secs='18,25' value was not recognized, ignored
battery.charge: 100
battery.charge.low: 20
...

The wording may be not too friendly, but should get the point across :)

@AppVeyorBot
Copy link

…t to stress that "default network timeout" is about "initial connection" [networkupstools#2847]

Signed-off-by: Jim Klimov <[email protected]>
@jimklimov
Copy link
Member

Regarding naming changes (hey, ain't known as one of the hardest problems in IT for nothing) - hopefully my work for this pass is done... if CI agrees.

@jimklimov
Copy link
Member

Hm, that IP address printing misfired, will look at it later... or you can :)

…etworkupstools#2847, networkupstools#85]

Clang complained about explicit pointers:

	upsclient.c:1187:37: error: cast from 'struct sockaddr *' to 'struct sockaddr_in *' increases required alignment from 2 to 4 [-Werror,-Wcast-align]
	 1187 |                                                 struct sockaddr_in *addr_in = (struct sockaddr_in *)ai->ai_addr;
	      |                                                                               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
	upsclient.c:1192:39: error: cast from 'struct sockaddr *' to 'struct sockaddr_in6 *' increases required alignment from 2 to 4 [-Werror,-Wcast-align]
	 1192 |                                                 struct sockaddr_in6 *addr_in6 = (struct sockaddr_in6 *)ai->ai_addr;
	      |                                                                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
	2 errors generated.

Signed-off-by: Jim Klimov <[email protected]>
@AppVeyorBot
Copy link

@jimklimov
Copy link
Member

Reports from NetBSD worker would be fixed in #2870
Windows just exceeded the hour...

@jimklimov jimklimov merged commit 5e65d4a into networkupstools:master Apr 3, 2025
19 of 21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement service/daemon start/stop General subject for starting and stopping NUT daemons (drivers, server, monitor); also BG/FG/Debug upsmon

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants