Skip to main content

Azure Table Storage – Hello World to Transaction in Azure Table Storage

“We need a database therefore we need SQL Server.” How many times have you heard this? This is so often the default position for people wanting to persist data on the server and as the old adage goes, this is the hammer to every database requirement which then becomes the nail. SQL Server has simply become “the standard” for many people.
SQL Server is a beast. It does a lot.  It’s awesome at what it does, but it always does more than I actually need. Azure Table Storage is simple, at least relatively speaking. You have a table, it’s partitioned, it has rows. You put stuff into that table then you query is back out, usually by referencing the partition and row keys.

One of the things that attracted me to Table Storage is that it’s not constrained to a server or a VM or any logical construct that’s governed by finite resources (at least not “finite” within a reasonable definition), rather it’s a service. You don’t pay for CPU and RAM or put it on a particular server, you pay for the number of transactions and the amount of storage you need.

So what’s the “bad” bit? It’s something new to learn, a “foreign entity” to most people, if you like. It has its own idiosyncrasies, quite different ways of working with data and interacting with the system and you can’t SELECT * FROM Foo. But it’s easy and it is massively impressive and that’s what I want to walk you through here now.
Blobs, queues and tables

In days gone by, binary data loaded into a website (such as images in a content management system) would be stored either on the file system in IIS or within SQL Server. Both options cause problems of different natures and varying degrees. In Azure, you’d load it into blob storage and that’s one of the storage types available. There are other uses for blob storage as well but there’s no need to go into that here.
Another storage type is “queues” which are essentially just message queues. They’re very awesome and serve a very valuable role in providing asynchronicity between system components. Again, won’t go into it here but read up onHow to use the Queue Storage Service if it sounds interesting.

And finally there’s Table Storage which is obviously the focus of this post. The main point I wanted to make was that these three paradigms all exist within each and every storage account whether you elect to use them or not.

One storage account can have many tables.
Each table has many partitions.
Each partition has many rows.
Each row has a partition key, a row key and a timestamp.

You store an entity in the row that inherits from Microsoft.WindowsAzure.Storage.Table.TableEntity.
You create a partition and a row by inserting an entity with a partition key and a row key.
You retrieve a row by searching with the partition and row key and casting the result back to the entity type.

That it in a nutshell and it’s not even the simplistic version – that’s just how it works. There are features and nuances and other things you can do with it but that’s the bulk of it and it’s all I really needed in order to write this post.

Conceptually, partitions are not that different to the alphabetised sections of a phonebook (remember those paper ones they used to dump on your doorstep?) Rather than just chucking all the numbers randomly into the book, they’re nicely distributed into 26 different “partitions” (A-Z) then of course each person is neatly ordered alphabetically within there. It does wonderful things in terms of the time is takes to actually find what you need.

In Azure Table Storage, partitions can have a big impact on performance.
Let’s do the “Hello World” of Table Storage using the mentioned data structure.

Now we’ll use that access key and whack it into a connection string and it looks just like this:

<connectionStrings>
  <add name="StorageConnectionString"
    connectionString
="DefaultEndpointsProtocol=https;AccountName=haveibeenpwned;AccountKey=mykey" />
</connectionStrings>

Just like a SQL connection string (kinda). Note the endpoint protocol – you can go HTTP or HTTPS. Clearly the secure option is preferable.

Into the code itself, we begin by getting a reference to the storage instance:

var connString = ConfigurationManager.ConnectionStrings["StorageConnectionString"].ConnectionString;
var storageAccount = CloudStorageAccount.Parse(connString);

Now we’ll get a reference to the table within that storage instance and if it doesn’t already exist, we’ll just create it (obviously just a first-run thing):

var tableClient = storageAccount.CreateCloudTableClient();
var table = tableClient.GetTableReference("BreachedAccount");
table.CreateIfNotExists();

Before we can start chucking stuff in there, we need that entity I mentioned earlier so let’s create a BreachedAccount that inherits from TableEntity.

public class BreachedAccount : TableEntity
{
  public BreachedAccount() { }
  public string Websites { get; set; }
}

Notice how we haven’t created partition or row keys? They’re inherited from the TableEntity and what it means is that we can now do this:

var breachedAccount = new BreachedAccount
{
  PartitionKey = "bar.com",
  RowKey = "foo",
  Websites = "Adobe;Stratfor"
};

var insertOperation = TableOperation.Insert(breachedAccount);
table.Execute(insertOperation);

 If we were to add another attribute to the BreachedAccount entity then we’ll see that too even though we already have data there conforming to a different schema. That’s the neat thing about many NoSQL database implementations in that you’re not constrained to a single schema within the one container.

Now if I want to insert 154 Million records I don’t really want to do 154 million individual inserts as each connection does have some overhead. What I want to do is batch it and that looks more like this:

var batch = new TableBatchOperation();
batch.Insert(breachedAccount1);
batch.Insert(breachedAccount2);
batch.Insert(breachedAccount3);
table.ExecuteBatch(batch);

Batching is about more than just committing a bunch of rows at one time, it also has an impact on cost. Remember how Azure Table Storage charges you $0.0000001 per “transaction”? I want to bring that cost down a little and I can do this by batching because a batch is one transaction. However, there are some caveats.

Firstly, you can only batch records into the same partition. I can’t insert foo@bar.com and foo@fizz.com within the same batch. However, I can insert foo@bar.com and buzz@bar.com at the same time as I’m using the domain as the partition key. What this meant is that when I wrote the code to process the records I had to sort the data by domainso that I could keep all the records for the partition together and batch them. This makes sense in the context of Julie’s earlier comment about the partition being tied to a machine somewhere.

Secondly, you can only batch up to 100 rows at a time. Those 32 million Hotmail addresses? That’s 320,000 batches thank you very much. This meant my importer needed to not only enumerate through accounts ordered by domain, but each time it had a collection of 100 it needed it commit them before moving on. Per the previous point, it obviously also had to commit the batch as soon as it got to the next domain as it couldn’t commit to multiple partitions in the one batch.

Ok, so we’ve got data in the system, but that’s just the start. Of course it’s also the hardest bit so that’s good, let’s now pull records back out. Obviously I’ve designed the whole thing to be ultra fast in terms of reading data based on the email address. Remember that this is what I’ve compromised the partition and row keys out of.
It’s pretty much the same deal as earlier in terms of needing a storage account object and then a table client after which you can just invoke the “Retrieve” method and pass it the partition key (the domain) and the row key (the alias):

var retrieveOperation = TableOperation.Retrieve<BreachedAccount>("bar.com", "foo");
var retrievedResult = table.Execute(retrieveOperation);

var breachedAccount = (BreachedAccount)retrievedResult.Result;

Comments

Popular posts from this blog

Support for debugging lambda expressions with Visual Studio 2015

Anyone who uses LINQ (or lambdas in general) and the debugger will quickly discover the dreaded message “Expression cannot contain lambda expressions”. Lack of lambda support has been a limitation of the Visual Studio Debugger ever since Lambdas were added to C# and Visual Basic.  With visual studio 2015 Microsoft has added support for debugging lambda expressions. Let’s first look at an example, and then I’ll walk you through current limitations. Example To try this yourself, create a new C# Console app with this code: using System.Diagnostics; using System.Linq; class Program { static void Main() { float[] values = Enumerable.Range(0, 100).Select(i => (float)i / 10).ToArray(); Debugger.Break(); } } Then compile, start debugging, and add “values.Where(v => (int)v == 3).ToArray()” in the Watch window. You’ll be happy to see the same as what the screenshot above shows you. I am using Visual Studio 2015 Preview and it has some limitati...

gcAllowVeryLargeObjects Element

There are numerous new features coming with .NET 4.5 and here, on this blog, you can find several posts about it. But the feature we are goint to talk about today is very exciting, because we were waiting for it more than 10 years. Since .NET 1.0 the memory limit of .NET object is 2GB. This means you cannot for example create array which contains elements with more than 2GB in total. If try to create such array, you will get the OutOfMemoryException. Let’s see an example how to produce OutOfMemoryException. Before that Open Visual Studio 2012, and create C# Console Application, like picture below. First lets create simple struct with two double members like example below: 1 2 3 4 5 6 7 8 9 10 11 12 public struct ComplexNumber {      public double Re;      public double Im;      public ComplexNumber( double re, double im)      {    ...

How to allow a very large object in .net application?

Since .NET 1.0 the memory limit of .NET object is 2GB. This means you cannot for example create array which contains elements with more than 2GB in total. If try to create such array, you will get the OutOfMemoryException. Let’s see an example how to produce OutOfMemoryException. Before that Open Visual Studio, and create C# Console Application. Lets create simple struct with two double members like example below: public struct ComplexNumber { public double Re; public double Im; public ComplexNumber(double re, double im) { Re = re; Im = im; } } As we know this structure consumes about 16 bytes of memory. So if we want to create array of this type which consume more than 2GB we need to create array at least with 134217728 instances. So this sample program below creates 130000000 (about 1,97 GB) of array. int maxCount = 130000000; ComplexNumber[] arr = null; try { arr = new ComplexNumber[maxCount]; } catch (Exception ex) { Console.WriteLine(ex.Message); } So if we run t...