Handling network bursts with channels in Golang
In mobile networking, we sometimes face an enormous number of new connections (attach) coming in within a very short period of time, most likely right after a piece of network equipment has recovered from a failure. It does happen, even with redundant equipment and paths, triggered by totally unexpected root causes. And the fact that even one of the biggest mobile network operators faces such situations from time to time tells us to be prepared as best we can (for vendors, the most vital preparation might be the layer-eight stuff, though).
It's not hard to imagine the same kind of thing happening in other tech domains as well: the Web, databases, or any application with networking functions.
Transport protocols, especially TCP and SCTP implemented in the kernel, are smart enough to provide a retry mechanism for us, but that sometimes makes the situation worse due to the increased traffic and/or the consumption of the application's resources.
So, in this post, I'll give quick instructions on how I handle many connections arriving in a short period of time (=bursts) in our applications implemented in Golang, with a few examples.
Summary of the solution
- implement throttling with a channel to mitigate bursts
- queue with a slice and a channel, which is nice for aggregating connections
- investigate carefully which solution fits your system and its situation best
- beware of the side effects
Scenario
Suppose we have a system like this:
Connection sequence
- the server (where the program below runs) waits for new connection requests from clients at any time
- the requests are processed locally and then passed to the backend
- the responses from the backend are validated in the server and passed back to each client
In my mind “the server” is a base station and “backend” is a set of mobile core network nodes, but I think this can be applied to other systems that have multiple/distributed clients.
Resource limitations
- the system can handle 1,000 connections coming in simultaneously (=in less than 1ms)
- if there are more than 1,000, some of the requests fail at some point in the system, which makes the clients retry from the beginning
For further optimization, it is highly desirable to benchmark the whole system (measure how many requests it can handle without any optimizations) and to keep the benchmarking code in place. The parameters (the interval in the throttle, the size of the queue, etc.) may need to be tightened when some part of the system changes in the future, so running the benchmark in a CI workflow is a great idea for continuous fine-tuning.
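As a rough sketch of what such a benchmark could look like (the function name here is hypothetical, and doSomething stands for the request-processing logic used in the examples below), something like this can live next to the server code and be run with go test -bench in CI:
func BenchmarkDoSomething(b *testing.B) {
	// a dummy payload standing in for a real request; replace it with
	// representative data from your own system.
	payload := []byte("dummy request")
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		if _, err := doSomething(payload); err != nil {
			b.Fatal(err)
		}
	}
}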
Implementation
Golang's goroutines and channels work really well for this kind of networking function. Thanks to them, we can implement throttling/queuing without making our code too complicated to read and maintain.
Without optimizations
This is the initial implementation, which tries to process the requests as fast as possible and therefore makes clients retry when 1,000+ requests come in at a time in this scenario.
type server struct {
	errCh chan error
}

func (s *server) serve(ctx context.Context, l net.Listener) error {
	defer l.Close()

	// read and accept continuously and dispatch handler.
	for {
		select {
		case <-ctx.Done():
			return nil
		case err := <-s.errCh:
			if isFatal(err) {
				return err
			}
			log.Println(err)
		default:
			// do nothing and move forward
		}

		conn, err := l.Accept()
		if err != nil {
			return err
		}

		go func() {
			if err := s.handleRequest(conn); err != nil {
				s.errCh <- err
			}
		}()
	}
}

func (s *server) handleRequest(c net.Conn) error {
	defer c.Close()

	buf := make([]byte, 1500)
	n, err := c.Read(buf)
	if err != nil {
		return err
	}

	// do something here;
	// validate request, set values to some fields, etc.
	result, err := doSomething(buf[:n])
	if err != nil {
		return err
	}

	// forward the manipulated request to the backend.
	if err := forwardTo(backend, result); err != nil {
		return err
	}
	return nil
}
Throttling
Throttling means putting a certain interval (=delay) between one request and the next.
type server struct {
	errCh      chan error
	throttleCh chan net.Conn
}

func (s *server) serve(ctx context.Context, l net.Listener) error {
	defer l.Close()

	// run the throttler that waits for conns to come, in another goroutine.
	go s.throttledReqHandler()

	// read and accept continuously and dispatch handler.
	for {
		select {
		case <-ctx.Done():
			return nil
		case err := <-s.errCh:
			if isFatal(err) {
				return err
			}
			log.Println(err)
		default:
			// do nothing and move forward
		}

		conn, err := l.Accept()
		if err != nil {
			return err
		}

		// note that goroutines continue to increase in this implementation
		// while the throttle works. it can be avoided by appending to a slice
		// and reading the conns one by one with the throttle, which may be a
		// bit more costly in the case that simultaneous attach rarely happens.
		go func() {
			s.throttleCh <- conn
		}()
	}
}

func (s *server) throttledReqHandler() {
	// let the request handler run with a 200ms interval.
	throttle := time.Tick(200 * time.Millisecond)
	for {
		conn := <-s.throttleCh
		go func() {
			if err := s.handleRequest(conn); err != nil {
				s.errCh <- err
			}
		}()
		<-throttle
	}
}

func (s *server) handleRequest(c net.Conn) error {
	// do the same as above
}
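As the comment in serve() above mentions, the piling-up goroutines can be avoided by queuing accepted connections in a slice and letting the throttler drain it. A minimal sketch of that variant (a mutex-protected queue; the helper names are hypothetical) could look like this:
type server struct {
	errCh chan error

	mu        sync.Mutex
	connQueue []net.Conn
}

func (s *server) enqueue(c net.Conn) {
	s.mu.Lock()
	s.connQueue = append(s.connQueue, c)
	s.mu.Unlock()
}

func (s *server) dequeue() (net.Conn, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(s.connQueue) == 0 {
		return nil, false
	}
	c := s.connQueue[0]
	s.connQueue = s.connQueue[1:]
	return c, true
}

func (s *server) throttledReqHandler() {
	// drain the queue one connection per 200ms instead of receiving from a
	// channel, so no extra goroutines pile up behind the throttle.
	throttle := time.Tick(200 * time.Millisecond)
	for range throttle {
		conn, ok := s.dequeue()
		if !ok {
			continue
		}
		go func() {
			if err := s.handleRequest(conn); err != nil {
				s.errCh <- err
			}
		}()
	}
}
In serve(), the accept loop would then call s.enqueue(conn) directly instead of sending the connection to throttleCh from a new goroutine.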
- this idea spreads the load on the backend over time, so the peak load can be mitigated
- it is effective if the backend is not good at handling bursty traffic
- the interval should be well considered: the clients may start retrying if it's too long
- if the backend is "always crowded", this does not have much effect
- if the server is distributed, consider implementing the throttling at the entrance of the backend (where requests are aggregated), which can be more efficient and effective in most cases
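If a strictly fixed interval turns out to be too rigid, a token-bucket limiter such as the one in golang.org/x/time/rate lets a small burst through while keeping the average rate bounded. This is only a sketch under assumed values (200ms average interval, burst of 5), not a drop-in replacement:
func (s *server) throttledReqHandler(ctx context.Context) {
	// allow on average one request per 200ms, with short bursts of up to 5.
	limiter := rate.NewLimiter(rate.Every(200*time.Millisecond), 5)
	for {
		conn := <-s.throttleCh
		if err := limiter.Wait(ctx); err != nil {
			return // the context was canceled
		}
		go func() {
			if err := s.handleRequest(conn); err != nil {
				s.errCh <- err
			}
		}()
	}
}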
Queuing
With goroutines, a channel, and a slice, the server can aggregate the requests before dispatching them.
type server struct {
	errCh    chan error
	resultCh chan *result
}

func (s *server) serve(ctx context.Context, l net.Listener) error {
	defer l.Close()

	// run queued result handler that waits for result to come in another goroutine
	go s.resultForwarder()

	// read and accept continuously and dispatch handler.
	for {
		select {
		case <-ctx.Done():
			return nil
		case err := <-s.errCh:
			if isFatal(err) {
				return err
			}
			log.Println(err)
		default:
			// do nothing and move forward
		}

		conn, err := l.Accept()
		if err != nil {
			return err
		}

		go func() {
			if err := s.handleRequest(conn); err != nil {
				s.errCh <- err
			}
		}()
	}
}

func (s *server) handleRequest(c net.Conn) error {
	defer c.Close()

	buf := make([]byte, 1500)
	n, err := c.Read(buf)
	if err != nil {
		return err
	}

	// do something here;
	// validate request, set values to some fields, etc.
	result, err := doSomething(buf[:n])
	if err != nil {
		return err
	}

	// send result to queue via channel
	s.resultCh <- result
	return nil
}
func (s *server) resultForwarder() {
	var resultQueue []*result
	resultQueue = append(resultQueue, <-s.resultCh)
	for {
		select {
		// receive a result from handleRequest() and append it to the queue.
		case result := <-s.resultCh:
			resultQueue = append(resultQueue, result)
		// when 100ms has passed since this iteration started, dispatch the
		// results in resultQueue to the backend (almost) all at once.
		case <-time.After(100 * time.Millisecond):
			for _, result := range resultQueue {
				// forward the manipulated request to the backend.
				if err := forwardTo(backend, result); err != nil {
					s.errCh <- err
				}
			}
			// clear the queue
			resultQueue = resultQueue[:0]
		}
	}
}
- the key idea here is to queue the processed requests (=result) for 100ms and then dispatch all the requests in the queue at once
- this is more effective in some cases for reducing resource consumption, e.g., if you have an aggregation feature in the server (like chunking in SCTP, which can put multiple payloads in one packet)
- this is good as long as we're sure the backend can handle all at once
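One thing to be aware of: because time.After is re-armed on every loop iteration, the flush can be postponed for a long time if results keep arriving more often than every 100ms. If a guaranteed flush period matters, a variant using a ticker (a sketch reusing the same names as above) dispatches whatever has accumulated every 100ms regardless of the arrival rate:
func (s *server) resultForwarder() {
	var resultQueue []*result
	// flush on a fixed 100ms period, no matter how fast results arrive.
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case result := <-s.resultCh:
			resultQueue = append(resultQueue, result)
		case <-ticker.C:
			for _, result := range resultQueue {
				if err := forwardTo(backend, result); err != nil {
					s.errCh <- err
				}
			}
			resultQueue = resultQueue[:0]
		}
	}
}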
Disadvantages
However effective it seems, this kind of solution can only be a symptomatic treatment: it just mitigates the loss that the system would otherwise inevitably cause. In other words, having fixed values for the limits and timeouts means adding delay for a certain period of time.
To avoid side effects such as the throttle/queue becoming the bottleneck of the system even when it's not too crowded, the parameters (intervals, max connection limit, and timeout duration) should be determined with precise measurements and fine-tuned continuously. Of course, if you're aware that those limitations come from a shortage of machine resources or a poor implementation of the program, you should try resolving that first.
For a more advanced solution, dynamic throttling that collects metrics in real time and sets the limits based on them can be an option in large and critical systems. As you can see, though, it is much harder to implement and more complicated from a maintenance point of view.
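As a very rough sketch of that idea (everything here is an assumption: the metric, the thresholds, and the update rule are hypothetical), the throttle interval could be stored atomically and adjusted by a goroutine that watches some backend metric:
// currentInterval holds the throttle interval in nanoseconds; it is read by
// the throttler and updated by a metrics goroutine.
var currentInterval atomic.Int64

func adjustInterval(backendLatency time.Duration) {
	// hypothetical rule: back off when the backend is slow, speed up otherwise.
	switch {
	case backendLatency > 500*time.Millisecond:
		currentInterval.Store(int64(400 * time.Millisecond))
	case backendLatency < 100*time.Millisecond:
		currentInterval.Store(int64(100 * time.Millisecond))
	default:
		currentInterval.Store(int64(200 * time.Millisecond))
	}
}

func (s *server) throttledReqHandler() {
	for {
		conn := <-s.throttleCh
		go func() {
			if err := s.handleRequest(conn); err != nil {
				s.errCh <- err
			}
		}()
		// wait for whatever interval the metrics goroutine decided on.
		time.Sleep(time.Duration(currentInterval.Load()))
	}
}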
Wrapping-up
I've explained how to implement throttled/queued handling of requests over the network in Golang, and what should be well considered before doing so.
- benchmark the whole system to find the bottleneck and current limitations
- consider improving machine resources and the efficiency of the programs
- implement a throttled/queued request handler based on the benchmark result (that's easy in Golang!)
- continue benchmarking and fine-tuning
Keep in mind that the right way to control traffic depends heavily on the environment, so these solutions can only be samples for you. Let me know if you have any good ideas :)
Happy networking!
References
The additional Go Programming Wikis in the GitHub golang repository have a lot of nice tips. They are not well organized for reading straight through, but each article is a great help with a particular problem.