Deduplicating Amazon S3 Storage with StorReduce

Amazon has some pretty slick services available to consumers, and I’ve been taking advantage of their Simple Storage Service (S3) and Glacier for a fair bit of time, especially for my off-premises backups – you can read more about that in my Synology to Glacier post. The pricing is straightforward, being a combination of capacity, queries, and retrieval. And capacity is paid by the drip without any concerns about the underlying infrastructure; that’s someone else’s problem.

In S3, though, I might have some concerns around the pricing to store (and hydrate) large quantities of data, especially if I’m looking for a comprehensive backup and archive solution or large data repository.


The folks at StorReduce pinged me a while back to check out their deduplication appliance designed for Amazon S3, and it sounded neat enough to give it some eyeball time. The basic premise is that a StorReduce virtual machine is inserted either on-site or into Amazon EC2 to act as an S3 interface.

An on-site deployment makes a lot of sense for those seeding a new bucket to avoid saturating the WAN, with a migration of the appliance to EC2 being optimal once data is stored in S3. The nice thing about EC2 is that you can pick any size of VM you want based on the amount of data you plan to send over to the bucket, or just move to a larger instance type later as growth occurs.


As you can see from the diagram above from the AWS blog post, your application or API calls talk to the StorReduce appliance directly as if it were an S3 endpoint. As files are sent over, the appliance deduplicates the data in-line before it is written to the bucket. StorReduce acts as the middleman to ensure that only unique data is sent to the back-end S3 bucket, leveraging that beefy “c3” instance type with its local SSD storage and CPUs to handle the deduplication process.

I was given a trial instance by StorReduce that used a c3.4xlarge instance to chew up anywhere from 90 – 305 MB/s of data, which I had no chance of saturating from the home lab. 🙂


If you want to trial the solution yourself, there are many different options available on the AWS Marketplace, with the option name specifying the number of terabytes you can store. StorReduce provides the software at zero cost for 10 days, although you still pay for the AWS usage fees. That worked out to be a few US dollars per day when I priced out a lab using the StorReduce 20 AMI running version 2.3.4.

User Experience

Once the appliance is online, there isn’t much to do. StorReduce provides a rather granular step-by-step walkthrough for getting your hands dirty. It is pretty simple – deploy the box, set up the back-end configuration into S3, create a user, key(s), and a StorReduce bucket, and go. There’s also a pretty snazzy FAQ that answers a ton of different configuration, performance, and topology questions.


With the endpoint live, I configured S3 Browser on my desktop to view the contents of my new bucket. I tossed over several small backup files from Veeam and various other documents as a test. S3 Browser will automatically split them up into smaller file chunks, and I use the Pro version so that I can send over more than 2 streams at a time.

The StorReduce stats page reported a 93.7% deduplication ratio from the 112.1 MiB of raw data (here’s the Wikipedia page for those not familiar with mebibytes). I’d imagine that further backups would only increase that number, since my rate of change is a bit low. Ultimately, your change rate sets the upper bound on the amount of unique data – in practice it will be less, since parts of the changed data will deduplicate as well.
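To put those numbers in context, the reported ratio is just the fraction of raw bytes that never had to be written to the back-end bucket:

```python
def dedup_ratio(raw_bytes: float, stored_bytes: float) -> float:
    """Fraction of the raw data eliminated before hitting S3."""
    return 1 - stored_bytes / raw_bytes

raw = 112.1 * 1024 ** 2           # 112.1 MiB reported by the stats page
stored = raw * (1 - 0.937)        # implied unique data actually written
print(f"{dedup_ratio(raw, stored):.1%}")  # → 93.7%
```

Only that remaining ~6% of unique data incurs S3 capacity charges.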



Seems like some pretty useful tech. The UI is a bit overly simple and could use some polish, but the data you need is available, and you can upload or modify files in the buckets directly, so I won’t gripe too much – after all, you will likely spend the vast majority of your time consuming the service via some other means or the API.

One thing I didn’t tinker around with was bucket policy. Looks like you use the standard IAM formatting, which is handy.
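For illustration, a policy in that standard IAM JSON format might look something like this – the account ID, principal, and bucket name are made up for the example:

```python
import json

# Hypothetical read-only bucket policy in the standard IAM JSON format.
# Account ID, principal, and bucket name are placeholders.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadOnlyForBackupOperator",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:user/backup-operator"},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-dedup-bucket",
                "arn:aws:s3:::my-dedup-bucket/*",
            ],
        }
    ],
}

print(json.dumps(policy, indent=2))
```

Anyone who has written S3 bucket policies before should feel right at home.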


I was hoping to back up directly to StorReduce via Synology, but its S3 support is for Amazon endpoints only; I’d have to hack the DNS table to redirect it to my StorReduce endpoint. Additionally, the AWS Tools for PowerShell didn’t seem to have any way of specifying alternate endpoints, either. I guess all the tools assume S3 for now, but I hope to see more ability to use custom endpoints in the future.