IMVU’s Startup Lessons Learned Conference Presentation

By Brett Durrett

IMVU presented at the Startup Lessons Learned Conference in San Francisco on April 23rd, 2010.  The event highlighted several companies that are being built using the “Lean Startup” framework created by Eric Ries, IMVU’s former CTO, largely based on his experiences at IMVU.

The conference was great and I had the opportunity to meet many smart entrepreneurs trying to build businesses out of great ideas. I heard many stories about the challenges early startups encounter and remembered when IMVU was at that stage. I also talked to some people from companies that are now considered “big and successful” and heard a few comments along the lines of “been there, barely survived that”.

At the conference I had the realization that IMVU as a business is not exactly a “startup” anymore. The goal of an early startup is discovering the right product and achieving a sustainable business model. IMVU has succeeded at this and is now focused on building a growing, enduring business that delivers high value to our customers and employees. Though we still feel like a startup in many ways and hold onto the lean principles that proved so valuable, we now face new challenges that are typically not considered startup challenges.

A successful startup grows into a bigger business. At IMVU, we invest heavily in our company so we can get more people working on features that delight our customers and build up the business. With more people, many of the ways you used to work no longer work. For example, frequent meetings to get feedback from everybody in the company can work when you have fewer than 15 people; when you get to 50+ people, this becomes a very expensive meeting. The overhead of making sure everyone in the meeting has the background data and context to make an informed decision simply does not scale. Joel Spolsky explains this well and provides good examples in his article, “A Little Less Conversation”.

There is a whole range of challenges in these transitions, from process to culture, and all of them have to be accommodated as a company grows. At the conference I was approached by several people who had gone through the same experience, some successfully, some not. I hope that some of what IMVU shared will help others learn from our experience and allow more people to fall into the successful category.

Check out IMVU’s video presentation available at http://bit.ly/bBpUcm

If you just want to review the slides without audio, they are here: http://bit.ly/aGXqcY


IMVU’s Agile Engineering Process

Clinton Keith, the Agile coach and trainer who introduced Agile and scrum to the video game industry in 2003, visited the IMVU offices during the Game Developers Conference in March. I spent time with Clint describing IMVU’s product development process, and we even had the opportunity to show Clint some of our scrum process in action. Clint reported a lot of interest in our successful implementation of Agile and Lean Startup methodologies from people he spoke with at GDC, and he asked if I’d be willing to do a Q&A session with him. Here is an excerpt of our interview, posted today on his Agile Game Development blog:


Agile Game Interview – Agile Engineering for Online Communities

CK: What is the overview of the IMVU engineering process?

JB: Our engineering team practices what people refer to as “Agile” or “Lean Startup” methodologies, including Test-Driven Development (TDD), pair programming, collective code ownership, and continuous integration. We refactor code relentlessly, and try to ship code in small batches. And we’ve taken continuous integration a step further: every time we check in server side code, we push the changes to our production servers immediately after the code passes our automated tests. In practice, this means we push code live to our production web servers 35-50 times per day. Taken together with our A/B split-testing framework, which we use to collect data to inform our product development decisions, we have the ability to quickly see and test the effects of new features live in production. We also practice “5 Whys” root cause analysis when something goes wrong, to avoid making the same mistakes twice.
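To make the split-testing idea concrete, here is a minimal sketch (not IMVU’s actual code) of how a deterministic A/B bucketing function might work; the function name and variant labels are illustrative:

    import hashlib

    def ab_bucket(customer_id, experiment_name, variants=("control", "treatment")):
        """Deterministically assign a customer to an experiment variant.

        Hashing the customer ID together with the experiment name means the
        same customer always sees the same variant, with no stored state.
        """
        key = ("%s:%s" % (experiment_name, customer_id)).encode("utf-8")
        digest = int(hashlib.md5(key).hexdigest(), 16)
        return variants[digest % len(variants)]

    # Example: decide which version of a page a customer sees.
    if ab_bucket(12345, "new_registration_flow") == "treatment":
        pass  # render the new flow being tested
    else:
        pass  # render the existing control experience

Because the assignment is a pure function of the customer and the experiment, every server in the cluster agrees on which variant a customer belongs to without any coordination.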

CK: How do you get so many changes out in front of customers without causing problems, either for your customers or your production web servers?

JB: I think it’s important to point out that sometimes we actually do introduce regressions and new bugs that impact our customers.  However, we try to strike a balance between minimizing negative customer impacts and maximizing our ability to innovate and test new features with real customers as early as possible. We have several mechanisms we use to help us do that, and to control how customers experience our changes. It boils down to automation on one hand, and our QA engineers on the other.

First the automation: we take TDD very seriously, and it’s an important part of our engineering culture. We try to write tests for all of our code, and when we fix bugs, we write tests to ensure they don’t regress. Next, we built our code deployment process to include what we call our “cluster immune system,” which monitors and alerts on statistically significant changes in dozens of key error rates and business metrics. If a problem is detected, the code is automatically rolled back and our engineering and operations teams are notified. Next, we have the ability to limit rollout of a feature to one or more groups of customers–so we can expose new features to only QA or Admin users, or ad-hoc customer sets. We also built an incremental rollout function that allows us to slowly increase exposure to customers while we monitor both technical and business metrics to ensure there are no big problems with how the features work in production. Finally, we build in a “kill switch” to most of our applications, so that if any problems occur later, for example, scaling issues, we have fine-grained control to turn off problematic features while we fix them.
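As an illustration of the gating logic described above, here is a minimal sketch in Python; the flag store, feature name, and field names are hypothetical, not IMVU’s actual implementation:

    import zlib

    # Hypothetical flag store; in production this would live in a database
    # or config service so it can be changed without a code deploy.
    FLAGS = {
        "new_shop_page": {
            "enabled": True,            # kill switch: False disables it everywhere
            "groups": {"qa", "admin"},  # groups that always see the feature
            "rollout_percent": 5,       # share of remaining customers exposed
        },
    }

    def feature_enabled(flag_name, customer_id, customer_groups=()):
        """Decide whether a given customer should see a feature."""
        flag = FLAGS.get(flag_name)
        if flag is None or not flag["enabled"]:
            return False  # unknown feature, or the kill switch is thrown
        if flag["groups"] & set(customer_groups):
            return True   # QA/Admin users always get early exposure
        # Stable per-customer bucket so exposure doesn't flicker per request.
        bucket = zlib.crc32(("%s:%s" % (flag_name, customer_id)).encode("utf-8")) % 100
        return bucket < flag["rollout_percent"]

Raising rollout_percent gradually exposes more customers, while flipping the enabled field gives operators a single switch to turn a problematic feature off without rolling back code.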


Read the rest of Agile Game Interview – Agile Engineering for Online Communities

IMVU’s Approach to Integrating Quality Assurance with Continuous Deployment

We’ve heard a lot of interest from folks we’ve talked to in the tech community and at conferences about our Continuous Deployment process at IMVU, and how we push code up to 50 times a day. We’ve also received some questions about how we do this without introducing regressions and new bugs into our product, and how we approach Quality Assurance in this fast-paced environment.

The reality is that we do occasionally impact customers negatively, a consequence of our high velocity and our drive to deliver features quickly and learn from them. Sometimes we impact them by delivering a feature that isn’t valuable, and sometimes we introduce regressions or new bugs into our production environment.

But we’ve delivered features our customers didn’t like even when we were moving more slowly and carefully, and it was actually more costly because it took us longer to learn and change our direction. For example, we once spent more than a month of development time working on a feature called Updates (similar to the Facebook “friend feed”), and we guessed wrong, way wrong, about how our customers would use that feature. It took us way too long to ship a feature nobody actually wanted, and the result was that we delayed delivery of a feature our customers were dying to have: Groups.

Asking customers what they want takes guesswork and internal employee opinions out of product development decisions, making it easy to resolve questions such as, “Should we build tools to let users categorize their avatar outfits, or should we build a search tool, or both?”  We make the best decisions we can based on available customer data, competitive analysis, and our combined experience and insights—then build our features as quickly as possible to test them with customers to see if we were right.

We’ve found that the costs we incur–typically bugs or unpolished but functional features–are worthwhile in the name of getting feedback from our customers as quickly as possible about our product innovations. The sooner we have feedback from our customers, the sooner we know whether we guessed right about our product decisions and the sooner we can choose to either change course or double down on a winning idea.

Does that mean we don’t worry about delivering high quality features to customers or interrupting their experience with our product? Nothing could be further from the truth.

For starters, we put a strong emphasis on automated testing. This focus on testing and the framework that supports it is key to how we’ve structured our QA team and the approach they take. Our former CTO and IMVU co-founder Eric Ries has described in detail the infrastructure we use to support Continuous Deployment, but to summarize, we have implemented:

  • A continuous integration server (Buildbot) to run and monitor our automated tests with every code commit
  • A source control commit check to stop new code from being added to the codebase when something is broken
  • A script to safely deploy software to our cluster while ensuring nothing goes wrong (our cluster immune system monitors and alerts on statistically significant regressions and automatically reverts the commit if an error occurs; a minimal sketch of that check appears after this list)
  • Real-time monitoring and alerting to inform the team immediately when a bug makes it through our deployment process
  • Root cause analysis (Five Whys) to drive incremental improvements in our deployment and product development processes
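The statistical check at the heart of a cluster immune system can be sketched in a few lines. This is a simplified illustration (a two-proportion z-test on an error rate before and after a deploy), not IMVU’s production code:

    import math

    def error_rate_regressed(errors_before, total_before, errors_after, total_after,
                             z_threshold=3.0):
        """Flag a statistically significant jump in an error rate after a deploy.

        A simple two-proportion z-test: if the post-deploy rate is more than
        z_threshold standard errors above the pre-deploy rate, roll back.
        """
        p_before = errors_before / float(total_before)
        p_after = errors_after / float(total_after)
        p_pooled = (errors_before + errors_after) / float(total_before + total_after)
        se = math.sqrt(p_pooled * (1 - p_pooled)
                       * (1.0 / total_before + 1.0 / total_after))
        if se == 0:
            return False
        return (p_after - p_before) / se > z_threshold

    # Example: 40 errors in 10,000 requests before the push, 90 after.
    if error_rate_regressed(40, 10000, 90, 10000):
        pass  # automatically revert the commit and alert the team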

This framework and our commitment to automated testing means that software engineers write tests for everything we code. We don’t have a specialized role that focuses solely on writing tests and infrastructure–that work is done by the entire engineering team. This is one factor which has allowed us to keep our QA team small. But you can’t catch all regressions or prevent all bugs with automated tests. Some customer use cases and some edge cases are too complex to test with automation. Living, breathing, intelligent people are still superior at doing multi-dimensional, multi-layered testing. Our QA Engineers leverage their insight and experience to find potential problems on more realistic terms, using and testing features in the same ways real customers might use them.
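As a hypothetical example of what “write a test so the bug can’t regress” looks like in practice (the feature and function here are invented for illustration):

    import unittest

    def format_outfit_name(name):
        """Return a display-safe outfit name, with a fallback for empty input."""
        if not name or not name.strip():
            return "Untitled outfit"
        return name.strip()

    class TestFormatOutfitName(unittest.TestCase):
        # Fixing the (imagined) bug where empty names broke the page comes
        # with tests, so the same regression can't slip through again.
        def test_empty_name_gets_fallback(self):
            self.assertEqual(format_outfit_name(""), "Untitled outfit")

        def test_whitespace_only_name_gets_fallback(self):
            self.assertEqual(format_outfit_name("   "), "Untitled outfit")

        def test_normal_name_is_trimmed(self):
            self.assertEqual(format_outfit_name(" Beach Party "), "Beach Party")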

Our QA Engineers spend at least half their time manually testing features. When we find test cases that can be automated, we add test coverage (and we have a step in our Scrum process to help ensure we catch these). We also have a Scrum process step that requires engineers to demonstrate their features to another team member–essentially, doing some basic manual testing with witnesses present to observe. Since we have far more features being built than our QA Engineers have time to test, it also forces the team to make trade-offs and answer the question, “What features will benefit most from having QA time?” Sometimes this isn’t easy to answer, and it forces our teams to consider having other members of the team, or even the entire company, participate in larger, organized testing sessions. When we think it makes sense, our QA Engineers organize and run these test sessions, helping the team to find and triage lots of issues quickly.

Our QA Engineers also have two more important responsibilities. The first is a crucial part of our Scrum planning process: writing test plans and reviewing them with product owners and technical leads. They help ensure that important customer use cases are not missed, and that the engineering team understands how features will be tested. Put another way, our QA engineers help the rest of the team consider how our customers might use the features they are building.

The second responsibility is what you might expect of QA Engineers working in an Agile environment: they work directly with software engineers during feature development, testing in-progress features as much as possible and discussing those features face-to-face with the team. By the time features are actually “ready for QA”, they have usually been used and tested at some level already, and many potential bugs have already been found and fixed.

Regardless of how much manual testing is completed by the team before releasing a feature, we rely again on automation: our infrastructure allows us to do a controlled rollout live to production, and our Cluster Immune System monitors for regressions, reducing the risk of negatively impacting customers.

Finally, once our features are live in front of customers, we watch for the results of experiments we’ve set up using our A/B split-test system, and listen for feedback coming in through our community management and customer service teams. As the feedback starts rolling in (usually immediately), we’re ready: we’ve already set aside engineering time specifically to react quickly if bugs and issues are reported, or if we need to tweak features to make them more fun or useful for customers.
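For illustration, here is one simple way a finished split test could be summarized; the numbers and helper are hypothetical, not output from IMVU’s actual A/B system:

    import math

    def summarize_split_test(control_conversions, control_n,
                             variant_conversions, variant_n):
        """Return the relative lift and a ~95% interval on the rate difference."""
        p_c = control_conversions / float(control_n)
        p_v = variant_conversions / float(variant_n)
        lift = (p_v - p_c) / p_c
        se = math.sqrt(p_c * (1 - p_c) / control_n + p_v * (1 - p_v) / variant_n)
        margin = 1.96 * se  # ~95% confidence on the difference in rates
        return lift, (p_v - p_c - margin, p_v - p_c + margin)

    # Example: 200 of 5,000 control customers converted vs. 260 of 5,000 on
    # the variant: a 30% relative lift, and the interval excludes zero.
    lift, interval = summarize_split_test(200, 5000, 260, 5000)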

We’re definitely not perfect: with constrained QA resources and a persistent drive by the team to deliver value to customers quickly, we do often ship bugs into production or deliver features that are imperfect. The great thing is that we know right away what our customers want us to do about it, and we keep on iterating.

IMVU Memcached Usage at 150k Requests per Second per Node

By: Jon W

Running a large web site is quite interesting. Not only do you have a *lot* of customers to keep happy, but you also have a *lot* of hardware to keep happy. Racks and racks of servers go into making sure that IMVU is always available, 24 hours a day, 7 days a week, except for occasional brief outages.

One of the kinds of servers we use is known as “memcached,” which is a simple way of reducing the workload for database servers by caching the result of frequent database queries in memory. This has two benefits:

  • It reduces load on the database servers (which are expensive) and moves it to machines that just have a lot of memory (which is cheap).
  • It allows us to serve web pages faster. Each web page within imvu.com is the result of several database queries; if the result of all of those queries is already in memory somewhere, we can send the page back in less than a second, whereas it would take longer than that if we had to go to the database for each query.
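A minimal sketch of this cache-aside pattern, using the python-memcached client (the key layout and query helper are hypothetical):

    import json
    import memcache  # the python-memcached client library

    mc = memcache.Client(["127.0.0.1:11211"])

    def get_user_profile(user_id, db):
        """Cache-aside read: try memcached first, fall back to the database."""
        key = "user_profile:%d" % user_id
        cached = mc.get(key)
        if cached is not None:
            return json.loads(cached)  # cache hit: no database round trip
        # Cache miss: run the expensive query, then populate the cache.
        profile = db.query_user_profile(user_id)  # hypothetical query helper
        mc.set(key, json.dumps(profile), time=300)  # expire after 5 minutes
        return profile

The expiry time bounds how stale a cached result can get; writes that must be visible immediately can also delete or overwrite the key when the underlying row changes.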

Given all of that, it’s no surprise that we have a significant number of servers running memcached to improve the speed and performance of our site. It used to be that memcached would top out around 50,000 requests per second per node, but that was a couple of years ago. It also used to be common advice to switch to (potentially lossy) UDP for memcached traffic rather than staying on (potentially laggy) TCP. However, for better or for worse, we’ve stayed on TCP for our memcached traffic for now.