
Functions raise if port dies #20

Open
benwilson512 opened this issue Jan 24, 2018 · 10 comments · May be fixed by #61

@benwilson512

Hey folks,

Maybe I'm doing something wrong, but if my statsd agent restarts, all function calls end up raising:

> Sensetra.Statsd.increment("elixir.foo", 1)
** (ArgumentError) argument error
    :erlang.port_command(Sensetra.Statsd, [<<1, 31, 189, 172, 31, 22, 97>>, "elixir.foo", 58, "1", 124, "c"])
    (statix) lib/statix/conn.ex:30: Statix.Conn.transmit/2

This seems undesirable; monitoring functions shouldn't nuke the app in the event that the port goes down.

Ideally there'd also be some mechanism for reconnecting, but at a minimum we probably want to catch this error.

@thecodeboss
Contributor

@benwilson512 I found a solution to the second problem, getting the connection to reconnect. We use a GenServer to manage our Statix connection, which looks a bit like this:

defmodule Metrics do
  use Statix
  use GenServer

  require Logger

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts)
  end

  def init(_opts) do
    # Trap exits so the port closing arrives as a message instead of
    # silently killing this process.
    Process.flag(:trap_exit, true)
    connect()
    {:ok, current_conn()}
  end

  def handle_info({:EXIT, port, reason}, %Statix.Conn{sock: __MODULE__} = state) do
    Logger.error("Port #{inspect(port)} exited with reason #{inspect(reason)}")
    {:stop, :normal, state}
  end
end

Basically this Metrics module traps exits, so if the port gets closed it can stop and be restarted by its supervisor, which will cause a new connection to be opened.

This could lead to a restart loop, but appropriate supervisor settings (restart intensity and period) let you bound how aggressively it retries.

By default the process that calls connect() isn't notified when the port dies: :normal exits from linked processes are ignored and any other exit reason kills the process outright. By explicitly trapping exits we turn the port's death into a message we can act on, which lets us reconnect.
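
To make the restart behavior concrete, here is a minimal sketch (the option values are illustrative, not from this thread) of putting Metrics under a supervisor; max_restarts and max_seconds bound the restart loop if the agent stays unreachable:

children = [Metrics]

Supervisor.start_link(children,
  strategy: :one_for_one,
  max_restarts: 3,
  max_seconds: 5
)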

--

For the other problem of increment crashing, you can override increment in the GenServer and catch the ArgumentError when it happens, though it's not very elegant.
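
A rough sketch of that workaround, written as a plain wrapper rather than a true override (the module name is illustrative, and whether the Statix-generated functions are overridable isn't addressed here):

defmodule Metrics.Safe do
  # Delegate to the Metrics module above and swallow the ArgumentError
  # raised when the underlying port is gone.
  def increment(key, value \\ 1, options \\ []) do
    Metrics.increment(key, value, options)
  rescue
    ArgumentError -> {:error, :port_closed}
  end
end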

@benwilson512
Author

@thecodeboss that solves the issue of reconnecting, but any metric calls that happen between the failure and the restart will still raise. The inability to send a metric shouldn't take down a process.

@lexmag
Owner

lexmag commented Apr 16, 2018

Hey, do you have any reliable way to trigger the port dying?

@scrogson

scrogson commented May 2, 2018

@lexmag :erlang.port_close/1
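
For example, assuming the connection is registered under Sensetra.Statsd as in the original report, something like this in iex should reproduce the crash:

:erlang.port_close(Sensetra.Statsd)
Sensetra.Statsd.increment("elixir.foo", 1)
#=> ** (ArgumentError) argument error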

@scrogson

scrogson commented May 2, 2018

I think the problem brought up in this issue is related to the usage documentation. The problem with the way it is documented is that it tells users to call the connect/0 function directly in the application's start/2 function. As I understand it, this links the port to the application process.

When the port closes and you call any of the APIs that use :erlang.port_command/2, it raises an ArgumentError, which kills the application process.
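
The documented pattern being described looks roughly like this (module names are illustrative); the port opened by connect/0 ends up linked to the process running start/2:

defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    # Opens the UDP port and links it to this (application) process.
    MyApp.Statix.connect()

    Supervisor.start_link([], strategy: :one_for_one, name: MyApp.Supervisor)
  end
end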

@lexmag
Owner

lexmag commented May 2, 2018

@scrogson thanks, but I was wondering more about how (and why) the port gets closed in a non-manual way.

@keathley

keathley commented May 2, 2018

We've seen the port die when it can't communicate with the agent, for a variety of reasons: agent restarts, netsplits, etc.

@lexmag
Owner

lexmag commented May 2, 2018

@keathley 👍.

I think the ultimate fix for this would be to provide Statix.child_spec, which starts a GenServer that monitors the port and reopens and re-registers it in case of failure.
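
Purely as a hypothetical shape of that proposal (none of this exists in Statix today, and the names are made up), such a process might look like:

defmodule Statix.Monitor do
  use GenServer

  def start_link(statix_module) do
    GenServer.start_link(__MODULE__, statix_module)
  end

  def init(statix_module) do
    Process.flag(:trap_exit, true)
    statix_module.connect()
    {:ok, statix_module}
  end

  # When the linked port exits, open and register a fresh connection
  # instead of letting metric calls keep hitting a dead port.
  def handle_info({:EXIT, _port, _reason}, statix_module) do
    statix_module.connect()
    {:noreply, statix_module}
  end
end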

@benwilson512
Author

I do think that's a solution, but I also think wrapping the port operations in a try is critical as well. Even in the best case, reconnection isn't instant; in the worst case, reconnection fails entirely (maybe the agent is down). Nothing in the application should die just because monitoring is down.

@lexmag
Owner

lexmag commented May 2, 2018

> I also think wrapping the port operations in a try is critical as well.

That's definitely what we should do, of course.
I think we should also log the missed metric in the catch/rescue.
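
A sketch of what that wrapping could look like (this is not the actual Statix source; the real transmit/2 lives in lib/statix/conn.ex):

defmodule SafeTransmit do
  require Logger

  def transmit(%Statix.Conn{sock: sock}, packet) do
    :erlang.port_command(sock, packet)
    :ok
  rescue
    ArgumentError ->
      # The port is closed; log the dropped metric instead of crashing the caller.
      Logger.error("Statix port closed, dropping metric: #{inspect(packet)}")
      {:error, :port_closed}
  end
end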
