Functions as a Service (FaaS) platforms allow you to execute code on demand in response to various events. In recent years, FaaS platforms have become a popular compute model when traffic loads are bursty, low throughput, or unpredictable, or when only a small subset of logic requires a backend server (such as with a mobile application). The FaaS architecture model allows developers to optimize the cost of cloud infrastructure by consuming compute resources only as needed, and to scale individual portions of their application rather than an entire service.

While the basic premise of Functions as a Service is pretty straightforward, it can be easy to draw incorrect assumptions about the lifecycle of a single invoked function. Recently we encountered some HTTP connection reset issues on one of our Go Lambda functions, which was a great excuse for me to dig into container reuse in Amazon Lambdas in order to debug and resolve the underlying issue.

This article will focus on two main areas:

  • Lambda Container Reuse
  • Go’s http.Transport

Container Reuse

Chances are if you’ve ever searched for documentation regarding Amazon Lambda container reuse, you’ll have stumbled on this article. The article uses Node.js as its example language of choice, and highlights a few key points:

  • Invocations of the lambda code are run in isolated sandboxes (containers)
  • Initialization code is executed once per container; handler functions can be invoked multiple times
  • Amazon may reuse the previous container: However, if you haven’t changed the code and not too much time has gone by, Lambda may reuse the previous container.
  • Containers that are reused are ‘frozen’ and then ‘thawed’. Background processes/threads that aren’t bound to the handler function’s execution will be stopped and resumed if the container is reused.

You should never depend on container reuse for proper execution of some code. Tasks that must complete with each invocation should be bound to the handler function in some way.

What We Saw

Our Go lambda function would respond to infrequent SNS events. These events are fairly spread out and don’t generally overlap in their arrival (arriving every few minutes). In practice, we saw that the same container could be reused for around 40 minutes, or potentially as long as several hours. To verify this, we included a random identifier that was printed with each handler function invocation (likewise, all container messages go to the same log group).
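
The identifier check described above can be sketched roughly as follows. This is a minimal illustration rather than our production code: the names containerID, invocations, and handleEvent are hypothetical, and main simulates the runtime delivering events so the sketch runs with only the standard library. In a real function, handleEvent would be the handler passed to lambda.Start.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// containerID lives at package scope, so it is generated once when the
// executable starts (once per container), not once per invocation.
var containerID = newID()

// invocations counts handler calls served by this process. It is not guarded
// by a mutex because this sketch assumes one invocation at a time.
var invocations int

// newID returns a short random hex string, or "unknown" if the system's
// random source fails.
func newID() string {
	b := make([]byte, 4)
	if _, err := rand.Read(b); err != nil {
		return "unknown"
	}
	return hex.EncodeToString(b)
}

// handleEvent is a hypothetical stand-in for the handler function. It logs
// the container identifier together with a per-process invocation count.
func handleEvent(message string) string {
	invocations++
	line := fmt.Sprintf("container=%s invocation=%d message=%q",
		containerID, invocations, message)
	fmt.Println(line)
	return line
}

func main() {
	// Simulate three events arriving at the same (reused) process.
	for _, m := range []string{"first", "second", "third"} {
		handleEvent(m)
	}
}
```

If the same identifier shows up across log lines that are minutes or hours apart, the container was reused between those invocations.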

These reuse patterns are likely specific to our traffic load, but the fact that the same lambda could be reused for 40+ minutes (or even hours) is pretty eye opening.

Container Reuse Explained

The Amazon article mentioned above focuses on Node.js. In dynamic/interpreted languages, the reuse model is fairly intuitive: you create a handler/main type function and Amazon can invoke that specific function when a new request comes in.

When using a compiled language like Go, it’s unclear exactly how your handler code can be reused. Initially I assumed that the executable you uploaded would be invoked each time, leading to a fresh state on each invocation, even if the underlying container was reused.

Under the hood, the aws-lambda-go library accepts a Handler function passed to the lambda.Start function. The lambda.Start method actually starts a TCP server that listens for new events (such as SNS or Web requests) and passes those along to the Handler method you provide to the Start method.

func Start(handler interface{}) {
	wrappedHandler := NewHandler(handler)
	StartHandler(wrappedHandler)
}

func StartHandler(handler Handler) {
	port := os.Getenv("_LAMBDA_SERVER_PORT")
	lis, err := net.Listen("tcp", "localhost:"+port)
	if err != nil {
		log.Fatal(err)
	}
	function := new(Function)
	function.handler = handler
	err = rpc.Register(function)
	if err != nil {
		log.Fatal("failed to register handler function")
	}
	rpc.Accept(lis)
	log.Fatal("accept should not have returned")
}

The code can be seen here.

The key takeaway here is that your executable is only invoked once per creation of a container, and any resources created and scoped to your handler function’s execution will be recreated each time the container is invoked and reused. Any resources created with a scope outside of your handler function will potentially be reused between invocations, regardless of how long the container has been frozen.

This behavior is great as it allows you to pool/reuse connections and other resources, but it does require you to be mindful of connection timeouts and the like, as resources may sit frozen for an undefined period of time. You can read more about the Execution Context of lambdas for greater detail.

In hindsight, this behavior makes perfect sense: re-using containers is a great optimization for both Amazon and developers, but if you’re newer to Lambda functions this behavior may not be intuitive (especially when uploading an executable, like with Go).

Connection Resets

Now that we understand how Go Lambda containers can be reused, let’s dive further into the connection reset issue. The behavior we were witnessing was that our HTTP request to a remote server would eventually fail due to a connection reset issue. In our application, the net/http.Client was created once and reused with each invocation of the handler function. The http.Client instance in question used a custom http.Transport object, rather than the net/http.DefaultTransport object.

Connection Pooling

As it turned out, the net/http Transport uses connection pooling/caching under the hood. This means if you’re periodically making requests to the same server, like we were, the Transport will keep around the underlying connection and reuse it on the next request. Connections that aren’t frequently reused or exhibit errors may be closed by the Transport.
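
One way to observe this pooling is with the standard net/http/httptrace package, which reports whether a request was served from a previously used connection. The sketch below is illustrative only: getAndReportReuse and reuseDemo are hypothetical helper names, and a local httptest server stands in for the remote server.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
)

// getAndReportReuse performs a GET and reports whether the Transport served
// the request from a pooled (previously used) connection.
func getAndReportReuse(client *http.Client, url string) (bool, error) {
	var reused bool
	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) { reused = info.Reused },
	}
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return false, err
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	resp, err := client.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	// Drain the body so the connection can be returned to the idle pool.
	if _, err := io.Copy(io.Discard, resp.Body); err != nil {
		return false, err
	}
	return reused, nil
}

// reuseDemo sends two sequential requests to a local test server through the
// same client and returns the reuse flag observed for each request.
func reuseDemo() (first, second bool, err error) {
	srv := httptest.NewServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) { fmt.Fprint(w, "ok") }))
	defer srv.Close()

	client := &http.Client{Transport: &http.Transport{}}
	defer client.CloseIdleConnections()

	if first, err = getAndReportReuse(client, srv.URL); err != nil {
		return first, second, err
	}
	second, err = getAndReportReuse(client, srv.URL)
	return first, second, err
}

func main() {
	first, second, err := reuseDemo()
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("first request reused a connection:", first)
	fmt.Println("second request reused a connection:", second)
}
```

The first request has to dial a new connection, while the second should pick up the idle one left in the pool. The same mechanism is what lets a connection outlive a single handler invocation when the client is created outside the handler.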

In retrospect the connection pooling makes perfect sense, and this behavior is probably implemented in most standard http libraries for programming languages or third party libraries. For example, Node.js appears to pool/reuse connections as well. However, if you’re unaware of this behavior you may assume a fresh HTTP Connection is used with each request, leading you to draw incorrect conclusions about the source of connection errors.

Fixing the Connection Resets

Our HTTP Client was using a http.Transport that looked something like this. Can you spot the error?

var Transport = &http.Transport{
	// DialContext expects a function, so pass the Dialer's DialContext method.
	DialContext: (&net.Dialer{
		Timeout: 5 * time.Second,
	}).DialContext,
	TLSHandshakeTimeout: 5 * time.Second,
}

Again, let’s look at the http.Transport documentation.

Our configuration specifies some timeouts, which are a great reliability practice, but as it turns out there are a large number of Transport fields not defined in this configuration. In Go, variables/fields are initialized with their zero value if no other value is specified. Strings have a zero value of "", bools default to false, numeric types default to a zero value of 0, and pointer types have a zero value of nil.

With this in mind, let’s look at some of the Fields not defined on the above Transport struct:

  • DisableKeepAlives bool - When true, disables the caching behavior and only uses each connection for a single request.
  • MaxIdleConns int - Limits the maximum number of idle connections in the pool for all hosts. A zero means no limit.
  • MaxIdleConnsPerHost int - Limits the number of idle connections per host. A zero means DefaultMaxIdleConnsPerHost (currently 2) is used.
  • MaxConnsPerHost int - Limits the total number of connections per host. A zero means no limit.
  • IdleConnTimeout time.Duration - Indicates how long an idle connection should remain in the pool before being closed. A zero indicates no limit to the timeout.

In this case, the lack of an IdleConnTimeout meant that our http.Client would indefinitely try to maintain a connection to the upstream server. The upstream server would close this connection after a few minutes, and a subsequent lambda invocation with the same container would encounter a connection reset error when trying to reuse this closed connection.

To verify this diagnosis, we started by setting DisableKeepAlives to true to ensure that the error was resolved when connection reuse was completely disabled. Once this solution was verified we went ahead and provided some sane default values for the other Transport configuration values.
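
A Transport with explicit limits might look something like the sketch below. The specific numbers are illustrative assumptions, not the values we shipped and not general recommendations; the main constraint is that IdleConnTimeout should be shorter than the idle timeout enforced by the upstream server. The names newTransport and client are hypothetical.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// newTransport returns a Transport with every pooling-related limit set
// explicitly. All numbers here are assumptions chosen for illustration.
func newTransport() *http.Transport {
	dialer := &net.Dialer{
		Timeout:   5 * time.Second,
		KeepAlive: 30 * time.Second,
	}
	return &http.Transport{
		DialContext:           dialer.DialContext,
		TLSHandshakeTimeout:   5 * time.Second,
		ResponseHeaderTimeout: 10 * time.Second,
		MaxIdleConns:          10,
		MaxIdleConnsPerHost:   2,
		MaxConnsPerHost:       10,
		// Close pooled connections well before the upstream server would.
		IdleConnTimeout: 30 * time.Second,
	}
}

// client is created once at package scope, so a reused container keeps its
// connection pool between handler invocations.
var client = &http.Client{
	Transport: newTransport(),
	Timeout:   15 * time.Second,
}

func main() {
	tr := client.Transport.(*http.Transport)
	fmt.Println("idle connection timeout:", tr.IdleConnTimeout)
	fmt.Println("overall request timeout:", client.Timeout)
}
```

Keeping the client at package scope preserves the benefit of connection reuse across invocations, while the explicit IdleConnTimeout bounds how long a pooled connection can sit unused before the Transport discards it.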

Zero Values?

Overall, I like zero values in Go, but this does feel like an error-prone way to design the http.Transport API. If the Transport struct adds another field in a newer version of Go, any existing explicit initialization of an http.Transport instance would use the zero value for that field, potentially resulting in a no-limit value. Not having explicit limits/timeouts is a stability anti-pattern and could lead to outages in the future.

This is definitely a case where more careful reading of the documentation on our part could have prevented this error, but it is nice when APIs/libraries make it harder for you to do the wrong thing without explicit intent/acknowledgement by the developer (ex: dangerouslySetInnerHTML in React).

Conclusions

My previous experience with Lambdas was pretty limited, so debugging this issue was a great learning experience and excuse to dig into lambda internals for Go (and other languages) as well as the net/http Client. Hopefully this experience is insightful for others as well.

Where we misstepped:

  • Poor understanding of Lambda Reuse/Lifecycle - Personally I assumed that most invocations would use a fresh instance given the spread of SNS events
  • Connection Pooling Under the Hood - http.Transport defaults to pooling/reusing connections
  • Bad http.Transport Config - Default values can be dangerous

Key Takeaways:

  • Lambda executions aren’t inherently stateless. Be mindful of where resources are instantiated in relation to the Handler function invocations and their scope.
  • Lambda reuse isn’t inherently predictable, and you shouldn’t expect/depend on a container to be reused several times. That said, it is possible for low throughput lambda containers to be reused for hours.
  • Be wary of configuring a library struct without providing explicit values for all fields.
  • Read library and FaaS docs carefully ;)