We recently launched the mobile app of our client Primed Mind during the World Series of Poker in Las Vegas. The app delivers professional audio content to people who want to improve their mindset (determination, confidence, recovery, etc.) and is targeted not only at professional poker players but also at pretty much everyone who feels they could increase their productivity and focus. This is a story about the technical side of the launch and some high-level learnings about Elixir, load testing, and life.
The API backend for the Primed Mind app was our first big production app built with Elixir and Phoenix, after we had experimented with both and used them internally. We chose Elixir over our default Ruby on Rails stack because the customer had said from the beginning that he wanted to launch big. Elixir seemed like a good fit for a case where we might have to scale more quickly than we had for other startup customers.
The initial results were pretty disappointing. From what we had heard about Elixir and from the low response times we had seen for single requests, we expected it to do a lot better. With the test setup described above and 2 AWS EC2 app servers (4 cores, 16 GB), the app would soon start responding very slowly (30+ seconds) and even start throwing errors (e.g. when checking out new database connections). All of this while the large Postgres database in the background was totally bored.
Looking closer at the app servers, RAM was definitely not an issue: the servers were not using more than 1.5 GB of RAM (including additional services running). But CPU load got very high, very quickly. After some research we discovered that the issues had nothing to do with Elixir/Phoenix in general and only applied to the registration/login requests. The culprit for the high CPU load was the bcrypt password hashing. Bcrypt is, by design, very computationally intensive, and this was amplified by the fact that the Elixir/C implementation uses "NIFs" (Native Implemented Functions), which can block the Erlang VM while they run (see: Comeonin docs). So now we knew that we would have to provide more hardware resources to handle our authentication-heavy target load (for example by switching to compute-optimized EC2 instances). For normal API requests, even with larger JSON payloads, the response times were brilliant.
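To make the CPU bottleneck concrete, here is a rough back-of-the-envelope sketch. The 0.25 seconds per hash and the burst of 1,000 sign-ups are purely hypothetical numbers chosen for illustration, not measurements from our load tests; the function names are ours as well. The only assumption about bcrypt itself is that each hash keeps one core fully busy for its whole duration.

```python
def max_hashes_per_second(cores: int, seconds_per_hash: float) -> float:
    """Upper bound on password hashes per second if every core does nothing else."""
    if cores <= 0 or seconds_per_hash <= 0:
        raise ValueError("cores and seconds_per_hash must be positive")
    return cores / seconds_per_hash


def seconds_to_drain(burst: int, cores: int, seconds_per_hash: float) -> float:
    """Time needed to work through a burst of sign-ups that all arrive at once."""
    return burst / max_hashes_per_second(cores, seconds_per_hash)


# Two app servers with 4 cores each, as in our setup.
TOTAL_CORES = 2 * 4
# Hypothetical cost of a single bcrypt hash (assumption, not a measurement).
SECONDS_PER_HASH = 0.25

if __name__ == "__main__":
    print(max_hashes_per_second(TOTAL_CORES, SECONDS_PER_HASH))
    print(seconds_to_drain(1000, TOTAL_CORES, SECONDS_PER_HASH))
```

Under these assumptions the ceiling depends only on core count and hash cost, which is why more RAM or a bigger database would not have helped and more (or faster) cores would.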
The other big learning was that we needed to change the load balancing algorithm of our HAProxy from the default roundrobin to leastconn. While this should theoretically not make too much of a difference for a setup with equally sized app servers, it turned out to make a massive difference in reality. The same load test that had yielded errors and very long response times was suddenly running super smoothly.
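One plausible reason the balancing algorithm matters is that our requests were far from equally expensive: a bcrypt-heavy login occupies a server much longer than a plain API call. The toy simulation below illustrates that effect. It is a simplified sketch, not HAProxy's actual implementation, and the workload (alternating requests lasting 10 ticks and 1 tick) is a made-up example.

```python
def peak_connections(durations, n_servers, policy):
    """Return the highest number of concurrent connections seen on any one server.

    One request arrives per tick; durations[t] is how many ticks request t takes.
    """
    finish_times = [[] for _ in range(n_servers)]
    peak = 0
    for t, duration in enumerate(durations):
        # Drop connections that have finished by the current tick.
        for s in range(n_servers):
            finish_times[s] = [f for f in finish_times[s] if f > t]
        if policy == "roundrobin":
            # Take turns, regardless of how busy each server is.
            target = t % n_servers
        elif policy == "leastconn":
            # Pick the server with the fewest open connections.
            target = min(range(n_servers), key=lambda s: len(finish_times[s]))
        else:
            raise ValueError(f"unknown policy: {policy}")
        finish_times[target].append(t + duration)
        peak = max(peak, len(finish_times[target]))
    return peak


# Hypothetical workload: slow (login-like) and fast requests alternate.
WORKLOAD = [10, 1] * 20

if __name__ == "__main__":
    print("roundrobin:", peak_connections(WORKLOAD, 2, "roundrobin"))
    print("leastconn: ", peak_connections(WORKLOAD, 2, "leastconn"))
```

In this contrived workload, taking turns sends every slow request to the same server, while picking the least busy server spreads them out. Real traffic is messier, but the same mechanism may explain part of what we observed.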
So after some intense debugging and several small additional improvements, we were finally seeing the results we had been hoping for from the beginning: smooth load tests with consistently low double-digit response times (times that would be hard to achieve even for a "hello world" request using Rails).
We were able to go quite confidently into the big launch, which in the end turned out to be smaller than expected, because the level of media/TV coverage for the Primed Mind co-founder Fedor Holz at the World Series of Poker in Vegas was not quite as high as we had hoped. Rather than huge spikes, we saw traffic that was very evenly distributed: a healthy stream of new users with an exceptional conversion rate. If you would like to improve your mindset, we recommend checking out the Primed Mind app. They are really great and focused clients, with a lot of good energy, and we are very proud of the product we built together with them.
Even though the super high peak traffic never came, we gained some extremely valuable insights about our infrastructure and our Elixir app. Without the help of StormForger, we would have had to learn them "the hard way". We are looking forward to putting this experience to use in our next Elixir project - could it be your project?