
December 19, 2007

Computing Compute Resources for a Startup

Startups today need far less capital for compute resources than they did just a decade ago. This has been driven by improvements in CPU processing speeds, by people designing systems that run on cheap commodity servers, and by an overall trend of getting more computer for less. For Summize, I couldn't imagine building the technology and processing the data we have today with less than a few million dollars in hardware a decade ago. Now $25k gets you a lot of computing power. Even with cheaper hardware, these resources come with additional costs that can be real issues for a startup. For example, you still must buy and set up servers, find rack space, and deal with backups, DNS servers and many other tasks before those servers are usable. There is also the additional time and knowledge needed to perform the maintenance that keeps them running. In this post I examine some of the costs involved in deciding when to build vs rent these resources. Others can benefit from this, as it is not a problem specific to Summize but one common to compute-intensive startups.

First, a quick definition of the type of problem I am talking about here. There are many examples of computing problems that examine some data set to derive new information. This data analysis is similar to more traditional data mining, but focuses more on summarization or derivation of the data than on trying to mine some new trend with the guidance of a human. Wikipedia has a good definition of "Data Analysis": "Data analysis is the process of looking at and summarizing data with the intent to extract useful information and develop conclusions."

Here is what I see as the goal of this exercise: figure out where the tipping point is between renting resources and buying/maintaining them. Understanding these factors can help influence new designs and allow the building of cheaper systems. I am going to compare our current hosting costs to Amazon's compute cloud, as that seems to be the gold standard for renting compute power these days. Note that your costs may differ, and if they are cheaper, let me know how you did it. For those of you not familiar with Amazon's EC2 and S3 services, they rent data storage and compute services to others, thus allowing people to benefit from Amazon's economy of scale.

Here is the blurb from Amazon:

Amazon EC2 passes on to you the financial benefits of Amazon's scale. You pay a very low rate for the compute capacity you actually consume. Compare this with the significant up-front expenditures traditionally required to purchase and maintain hardware, either in-house or hosted. This frees you from many of the complexities of capacity planning, transforms what are commonly large fixed costs into much smaller variable costs, and removes the need to over-buy "safety net" capacity to handle periodic traffic spikes.

That argument holds merit from my experiences at AOL, where we were able to build out services more cheaply than others because all the needed pieces, like raised floor space, operations personnel, etc., were purchased and run at scale.

High Level Costs

Let's first look at the cost of a server and hosting it. A powerful server from Dell costs around $4,300 for a 1U server with 2 quad core CPUs (that's basically 8 processors in 1U), 16 GB of memory, and two 15k 73 GB disks with rails. Second, our service provider charges us $75 a month per 1U for power, a basic 1 Mbps of bandwidth, DNS, firewall and other necessities. This incentivizes us to reduce our U footprint, as there is a basic cost to the space; additionally, multiple CPUs / cores in a single box also reduce the setup and maintenance costs.

Amazon provides a similar machine configuration, "15 GB of memory, 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each), 1690 GB of instance storage, 64-bit platform", for $0.80 per hour -- that's $19 a day or $575 per month. The Amazon bill comes to ~$7,000 per year compared to ~$5,200 for the server and hosting for the first year. Note that I am completely eating the cost of the server in the first year. In the second year the server would only cost the rack space: $900. But we are trying to keep things as simple as possible with this analysis.

The simple answer is that a big machine costs more from Amazon than buying it yourself -- if you used it all the time.  The reality is many machines sit idle the majority of the time, so the question is at what point are these two approaches equivalent?  I used the Amazon costs and priced out similar boxes from Dell for their various configurations.
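The tipping point can be sketched with a little arithmetic. The snippet below is a rough sketch, not a full cost model: it uses the $0.80 per hour extra-large rate, the $4,300 server and the $75 a month hosting fee quoted above, charges the whole server to the first year, and ignores bandwidth and administration time. The function names are made up for illustration.

```python
DAYS_PER_YEAR = 365


def amazon_yearly_cost(rate_per_hour, hours_per_day):
    """Yearly rental bill for an instance used hours_per_day on average."""
    return rate_per_hour * hours_per_day * DAYS_PER_YEAR


def owned_first_year_cost(hardware, hosting_per_month=75):
    """Purchase price plus twelve months of rack space, all charged to year one."""
    return hardware + hosting_per_month * 12


def break_even_hours_per_day(rate_per_hour, hardware, hosting_per_month=75):
    """Average daily usage at which renting costs the same as owning in year one."""
    owned = owned_first_year_cost(hardware, hosting_per_month)
    return owned / (rate_per_hour * DAYS_PER_YEAR)


xl_break_even = break_even_hours_per_day(0.80, 4300)
print(f"Extra-large break-even: {xl_break_even:.1f} hours per day")
```

With these inputs the owned server only wins in its first year if it is busy more than roughly 17.8 hours a day.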

[Figure: Yearcosts]

This graph plots three lines for three machine types from Amazon, with the yearly cost on the Y axis and the average number of hours used per day on the X axis. The cost of purchasing the equivalent machine from Dell is $4,300 for an extra-large, $2,700 for a large and $1,000 for a small instance. First insight: the small instance from Amazon is cheaper than just renting the space from our provider ($900), without any hardware at all. If you needed to purchase that same machine it would cost you an additional $1,000, of which $130 is just for the rails to put the box in the rack. These results match well with Amazon's claim of economy of scale.

Now the equivalent of a large instance (Amazon terminology) costs about $2,700 to buy, plus $900 for hosting, or $3,600 for the first year. That is about the same as the cost of renting a large instance full time from Amazon. Again, from the second year on you would realize more savings (if an accountant is reading this, my amortization and depreciation may be off, so let me know; just don't forget that four years is a long time for today's hardware, and 2 years old is about the max of what we want).
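To make the second-year point concrete, here is a small sketch of cumulative cost over a two-year server life. It takes the $2,700 purchase price and $900 a year of hosting from above and, as a simplifying assumption, treats full-time rental of a large instance as a flat $3,600 a year (the "about the same" figure). The function name is hypothetical and no depreciation is modeled.

```python
def cumulative_costs(hardware, hosting_per_year, rental_per_year, years):
    """Return (year, cumulative owned cost, cumulative rented cost) tuples."""
    rows = []
    for year in range(1, years + 1):
        owned = hardware + hosting_per_year * year  # hardware is paid once
        rented = rental_per_year * year             # rental recurs every year
        rows.append((year, owned, rented))
    return rows


rows = cumulative_costs(hardware=2700, hosting_per_year=900,
                        rental_per_year=3600, years=2)
for year, owned, rented in rows:
    print(f"year {year}: owned ${owned:,} vs rented ${rented:,}")
```

Under these assumptions the two options tie in year one, and owning is ahead by $2,700 at the end of year two, provided the machine is in use full time.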

A quick overview: for small server needs, Amazon is hard to beat; for larger server needs the result is less clear -- there is some tipping point where Amazon is not as cost effective.  One clear direction that any architect needs to consider is how to get analysis processes to work on small EC2 instances. Now to dig into the details for bandwidth.

Bandwidth

Bandwidth is another cost for small players. For example, we pay about $500 a month for an average usage of 5-10 Mbps. Let's look at the costs for the following scenario: you upload data to a compute cloud, process it, and return some summary of the data, where the summary data set is approximately 10% of the initial data size. Below are the bandwidth costs for processing from 100MB to 5GB of data per month.

[Figure: Bandwidthcosts]

Note: Inter-node transfer of data is not charged for by Amazon or by most hosting providers, and permanent storage via S3 costs $0.15 per GB per month.
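The Amazon side of this scenario can be approximated in a few lines. This is only a sketch: the $0.10 per GB inbound rate is the figure used later in this post, and the $0.18 per GB outbound rate is an assumption inferred from the $270 charge for moving ~1,500GB out of Amazon in the scenarios below. Names are illustrative and decimal GB are used throughout.

```python
TRANSFER_IN_PER_GB = 0.10   # inbound rate used later in the post
TRANSFER_OUT_PER_GB = 0.18  # assumption: inferred from $270 for ~1,500 GB out
SUMMARY_RATIO = 0.10        # summary is ~10% of the uploaded data


def round_trip_cost(upload_gb):
    """Cost to upload raw data and download its summary."""
    inbound = upload_gb * TRANSFER_IN_PER_GB
    outbound = upload_gb * SUMMARY_RATIO * TRANSFER_OUT_PER_GB
    return inbound + outbound


for upload_gb in (0.1, 0.5, 1, 2, 5):
    print(f"{upload_gb:>4} GB uploaded: ${round_trip_cost(upload_gb):.2f}")
```

At these volumes the transfer bill stays well under a dollar a month, so bandwidth only starts to matter at crawl-scale data sizes.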

Now that we have the basic costs for hardware and bandwidth, let's talk about various scenarios in which those two factors are used.

Scenarios

So how does this all come together? Here is how I have been thinking about it. Any type of "data analysis" application is going to have the following stages:

  • Data acquisition - a combination of server and bandwidth needs
  • Data analysis -  a set of CPUs and memory for processing and summarizing the data
  • Data transfer - summarized data transferred to its final location

Our strategies can be classified into the following approaches:

  1. Acquire, analyze and summarize the data on Amazon
  2. Acquire, analyze and summarize the data on startup resources
  3. Pick the cheapest for each step and transfer when needed

Data acquisition

Using approach #3 we will examine each step in terms of cost for a mythical crawl-and-analyze cycle. First we will examine the Amazon cost associated with 'Data acquisition'. A 5-Mbps crawl rate, or ~1,500GB per month, gives us a bandwidth cost of ~$150 from Amazon. Additionally, a large server instance is needed to handle the crawl processing at a cost of $288. The storage fee for the data is $0.15 per GB, so ~1,500GB = $225. Thus, the 'Data acquisition' costs are 150+288+225=$663 per month. In comparison, our own cost would be $800 per month ($500 bandwidth + $225 server + $75 hosting). If this were the only step, using Amazon would represent a savings of about $140 per month and $1,680 per year per 5-Mbps crawl.
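The acquisition numbers can be reproduced with a short sketch. It assumes a 30-day month and decimal gigabytes, and reads the $225 server charge as the $2,700 large machine spread over twelve months, which is my interpretation rather than something stated outright. All names are illustrative.

```python
def gb_per_month(mbps, days=30):
    """Convert a sustained rate in megabits per second to GB per month."""
    megabytes_per_second = mbps / 8
    return megabytes_per_second * 86400 * days / 1000


crawl_gb = 1500  # the post's rounded figure; gb_per_month(5) gives 1620

amazon = {
    "bandwidth": crawl_gb * 0.10,  # $0.10 per GB transferred in
    "server": 288,                 # large instance for the month
    "storage": crawl_gb * 0.15,    # S3 at $0.15 per GB
}
startup = {"bandwidth": 500, "server": 225, "hosting": 75}

amazon_total = sum(amazon.values())
startup_total = sum(startup.values())
monthly_saving = startup_total - amazon_total

print(f"5 Mbps is about {gb_per_month(5):.0f} GB per month")
print(f"Amazon ${amazon_total:.0f} vs startup ${startup_total} "
      f"(saving ${monthly_saving:.0f} per month)")
```

The exact saving is $137 a month, which the text rounds to $140.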

The big difference here is our current cost of bandwidth. Converting it into a price per GB transferred, our $500 a month for ~1,500GB works out to roughly $0.33 per GB, more than three times the $0.10 from Amazon. It is interesting that the reason Amazon is cheaper is not the processing cost but the bandwidth cost.

Data analysis

The next step 'Data analysis' has the following four cases:

  1. (Amazon crawl $663 + internal processing $67) = $730
  2. (Amazon crawl $663 + bandwidth $270 + startup processing $225) = $1,158
  3. (startup crawl $800 + startup processing (2*$225)) = $1,250
  4. (startup crawl $800 + Amazon data in transfer $150 + Amazon processing $67) = $1,017

Here we see the benefit of using Amazon's variable resourcing. The cost of a server is $67 (the one week you needed it), compared to the $225 for a startup machine (full time).
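The four cases are simple sums, so they are easy to tabulate and rank. The sketch below just restates the monthly figures from the list above; the variable names are illustrative.

```python
AMAZON_CRAWL = 663      # monthly acquisition cost on Amazon
STARTUP_CRAWL = 800     # monthly acquisition cost on our own gear
AMAZON_WEEK = 67        # one week of a large instance
STARTUP_SERVER = 225    # one full-time startup machine per month
TRANSFER_OUT = 270      # moving the crawl out of Amazon
TRANSFER_IN = 150       # moving the crawl into Amazon

cases = {
    "1: Amazon crawl, Amazon processing": AMAZON_CRAWL + AMAZON_WEEK,
    "2: Amazon crawl, startup processing": AMAZON_CRAWL + TRANSFER_OUT + STARTUP_SERVER,
    "3: startup crawl, startup processing": STARTUP_CRAWL + 2 * STARTUP_SERVER,
    "4: startup crawl, Amazon processing": STARTUP_CRAWL + TRANSFER_IN + AMAZON_WEEK,
}

# Rank the cases from cheapest to most expensive.
ranked = sorted(cases.items(), key=lambda item: item[1])
cheapest_name, cheapest_cost = ranked[0]
saving_vs_next = ranked[1][1] - cheapest_cost

for name, cost in ranked:
    print(f"${cost:>5,}  case {name}")
print(f"Cheapest beats the runner-up by ${saving_vs_next} per month")
```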

Data transfer

So far the cheapest strategy has been to use Amazon for crawling and data analysis. This is cheaper than the other approaches by ~$290 a month and ~$3,400 a year. Unfortunately those savings get eaten up in an unexpected way: long term storage of the raw data.

As with most analysis, not all the questions are known a priori, so the original data is often stored for some period of time (1 year to 18 months) so that it can be re-processed as new needs arise. Let's say we keep the data for one year, so each 1.5TB of data costs an additional $2,700 per year for its long term storage. That is much more costly than buying cheap external back-up drives at $220 per TB and using them as off-line back-ups.
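The storage comparison is a one-liner each way. This sketch uses the $0.15 per GB per month S3 price and the $220 per TB drive price from the text, and treats the drives as a one-time purchase with no redundancy or handling cost, which is a simplification.

```python
def s3_storage_cost(gb, months=12, per_gb_month=0.15):
    """Cost of keeping gb of data on S3 for the given number of months."""
    return gb * per_gb_month * months


def offline_drive_cost(tb, per_tb=220):
    """One-time cost of external drives to hold tb of data."""
    return tb * per_tb


s3_year = s3_storage_cost(1500)
drives = offline_drive_cost(1.5)
print(f"S3 for one year: ${s3_year:,.0f}; external drives: ${drives:,.0f}")
```

For one month's 1.5TB crawl that is about $2,700 on S3 against about $330 in drives.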

The actual summarized data is smaller than the original, and thus less expensive to transfer. Let's say the analysis produces a summary of the data that is 10% of the original data set size. The 150GB summary would only add an additional $30 per month to transfer out of the Amazon network. But because of the long term storage issue, the original data would also need to be transferred. This adds ~$300 to our case #1, which is slightly better than doing it all ourselves (case #3) or a hybrid approach like case #4.
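Here is a rough check of the transfer-out charge. The $0.18 per GB outbound rate is an assumption inferred from the $270 charge for ~1,500GB in case #2, so the results land slightly under the rounded $30 and $300 in the text. Only case #1 and case #3 are compared, since case #3 involves no Amazon transfer.

```python
TRANSFER_OUT_PER_GB = 0.18  # assumption: inferred from $270 for ~1,500 GB

raw_gb = 1500
summary_gb = raw_gb * 0.10  # summary is ~10% of the raw data

summary_out = summary_gb * TRANSFER_OUT_PER_GB  # summary leaving Amazon
raw_out = raw_gb * TRANSFER_OUT_PER_GB          # raw data leaving for off-line storage

case1_total = 730 + summary_out + raw_out       # case #1 plus both transfers
case3_total = 1250                              # all on startup resources

print(f"Transfer out: ${summary_out + raw_out:.0f} per month")
print(f"Case #1 with transfers: ${case1_total:.0f} vs case #3: ${case3_total}")
```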

Conclusion

While your mileage may vary, there seems to be real value in variable resourcing if you can manage it with your application. The takeaway points I get from this analysis are that long term storage can eat any savings Amazon may offer from its services. Also, when large hardware setups are needed and used, it is still cheaper to buy and maintain them yourself. Lastly, bandwidth costs are still too high for start-ups.

For those of you who want to play with various Amazon pricing scenarios, Amazon provides a nice cost calculator.

I look forward to others' thoughts on this.

Abdur
abdur@summize.com


Comments

Another option to consider would be to host the crawling and archive servers at your office. It seems like these 2 components of your system would take up the most space and utilize the most bandwidth. A 15Mbs (synchronous connection; see link below) Verizon Business FIOS* connection should be more than enough to meet your bandwidth needs for the crawling as well as uploading the summary data set (just under 24 hours).

This should allow you to drastically reduce the number of servers that you colocate. The real concerns in this scenario would be the reliability of the FIOS connection and power at your office.

I would be interested in your analysis of how this option measures up to the ones you outlined above.

Verizon pricing:

http://www22.verizon.com/content/businessfios/packagesandprices/packagesandprices.htm

*Please note that I am not a VZ investor or employee.

Mike, great to meet you, and thanks for the comment; keep them coming. I completely agree that finding ways to reduce our bandwidth costs is a big win for us. While I am not ready to put servers in our office at this time, services like FIOS and Amazon put out a cost for bandwidth that others must start to match. I see this as a big win for small start-ups in general. I think what happens to our analysis is that for large crawls it is more economical to do it yourself if the bandwidth costs are brought down.

I am still excited about finding other ways to use the Amazon services as that will allow us to scale out smaller needs in a more cost effective approach.

Thanks,
Abdur

Hei Abdur, long time no talk!
First of all best wishes to you and your Team.

This post of yours is so detailed that it looks & feels more like a publishable paper than a post on a blog, even a corporate one.

Thanks for sharing your experience. To tell you the truth, Pete S. had told me nice words about EC2 in the past, and I trust Pete. I had read a thing or two about EC2 and found it (of course) quite intriguing, yet I had little sense as to how cost-effective it would be. Well, after reading your post I have a MUCH better idea about the entire story.

One little suggestion: as (so far) my knowledge of EC2 is only theoretical, could you speculate a bit on the actual experience? It sounds pretty easy to use, but is it really so? The reason why I am asking is that this part too must be added into the overall cost equation, when developer's resources are constrained.

Once again, my very best wishes to you and the entire Summize's gang.

Mirco Mannucci

Mirco great to hear from you. Since we have not deployed anything using EC2 I cannot comment on that, but here is a great post that has lots of users commenting on their experiences (http://www.highscalability.com/amazons-ec2-pay-you-grow-could-cut-your-costs-half)

I think Amazon brings two important pieces to the startup world. First, it gives small startups the first viable means for commodity computing. Second, businesses that need resources now have industry standard pricing for bandwidth and hosting to negotiate with.

Cheers and keep the comments coming...

Great post, Abdur. With a bit of polish, it could easily be published.

It seems your particular challenge is the size of your raw data sets -- they're expensive to store long term on S3 and expensive to rehost to off-the-shelf, cheap disk. Even with that, though, it seems that you save some money by using Amazon’s backend.

I can think of a handful of additional factors that you may want to consider:

- The obvious benefit to EC2 is that you only pay for machine time you actually use. In your datacenter, you're paying for machine time you don't use. When you hit a real growth curve, I'd expect that you'll build out servers ahead of expected usage, thus wasting cycles (and dollars) in the process. EC2 inherently includes a "just in time" scaling model. Your analysis assumes a static footprint in your datacenter, when in fact it’s expanding.

- There could be some optimization in CPU usage that could provide additional savings – when using Amazon’s hosts you clearly want to ensure they’re running as close to 100% as possible.

- Spreading the cost of your servers across two years (expected life)

- Adding some cost for administering your servers. This can be hard to account for at a small scale, but as server counts grow, administering hosts quickly becomes full time job(s) vs. an occasional hassle.

- Amazon’s infrastructure is probably more reliable than what you could provide in a startup environment. An intangible benefit, but outages are never good.

Even with the data transfer charges, it seems that you save some money in your model in the first year. Continued hosting there seems to make a lot of sense, given their cost structure, reliability, and built in scalability.

I wouldn't expect anything less in terms of this level of analysis from Abdur; nice one bro.

I 2nd Wise's variables for consideration too.

Good stuff Abdur.

Another variable is the actual performance you will see out of the virtual private servers that EC2 offers. Quote:

"One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor."

And see:

http://docs.amazonwebservices.com/AWSEC2/2007-08-29/DeveloperGuide/instance-types.html

Depending on how your application utilizes disk, CPU, and RAM, you may see radically different performance on a VPS setup. For example, when deploying a crawl and index farm on VPS hosting (VMWare ESX/Linux) in Europe, we found a very significant decrease in local disk random-access (scratch space) performance vs. a theoretically similar direct physical server installation. We also ended up needing to patch our software to gracefully cope with a system clock that regularly ticked backwards on such a setup.

