“We need a database therefore we need SQL
Server.” How many times have you heard this? This is so often the default
position for people wanting to persist data on the server and, as the old adage
goes, SQL Server becomes the hammer and every database requirement the nail. It
has simply become “the standard” for many people.
SQL Server is a beast. It does a lot. It’s awesome at what it
does, but it always does
more than I actually need. Azure Table Storage is simple, at least
relatively speaking. You have a table, it’s partitioned, it has rows. You put
stuff into that table then you query it back out, usually by referencing the
partition and row keys.
One of the things that attracted me to Table
Storage is that it’s not constrained to a server or a VM or any logical
construct that’s governed by finite resources (at least not “finite” within a
reasonable definition), rather it’s a service. You don’t pay for CPU and RAM or
put it on a particular server, you
pay for the number of transactions and the amount of storage you need.
So what’s the “bad” bit? It’s something new to
learn, a “foreign entity” to most people, if you like. It has its own
idiosyncrasies, quite different ways of working with data and interacting with
the system and you can’t SELECT * FROM Foo. But it’s easy and it is massively impressive and
that’s what I want to walk you through here now.
Blobs, queues and tables
In days gone by, binary data loaded into a website (such as images in a content
management system) would be stored either on the file system in IIS or within
SQL Server. Both options cause problems of different natures and varying
degrees. In Azure, you’d load it into blob storage and that’s one of
the storage types available. There are other uses for blob storage as well but
there’s no need to go into that here.
Another storage type is “queues” which are essentially just message queues. They’re
very awesome and serve a very valuable role in providing asynchronicity between
system components. Again, I won’t go into them here but read up on How to use the
Queue Storage Service if it sounds interesting.
And finally there’s Table Storage which is obviously the focus of this post. The
main point I wanted to make was that these three paradigms all exist within
each and every storage account whether you elect to use them or not.
One storage account can have many tables.
Each table has many partitions.
Each partition has many rows.
Each row has a partition key, a row key and a timestamp.
You store an entity in the row and that entity inherits from Microsoft.WindowsAzure.Storage.Table.TableEntity.
You create a partition and a row by inserting an entity with a partition key and a row key.
You retrieve a row by searching with the partition and row key and casting the result back to the entity type.
That’s it in a nutshell and it’s not even the simplistic version – that’s just how it
works. There are features and nuances and other things you can do with it but
that’s the bulk of it and it’s all I really needed in order to write this post.
Conceptually, partitions are not that different to the alphabetised sections of a phonebook
(remember those paper ones they used to dump on your doorstep?) Rather than
just chucking all the numbers randomly into the book, they’re nicely
distributed into 26 different “partitions” (A-Z) then of course each person is
neatly ordered alphabetically within there. It does wonderful things in terms
of the time it takes to actually find what you need.
In Azure Table Storage, partitions can have a similarly big impact on performance
because the partition is the unit by which the data gets physically distributed and looked up.
Let’s do the “Hello World” of Table Storage using the data structure described above.
Now we’ll use that access key and whack it into a connection string
and it looks just like this:
<connectionStrings>
<add name="StorageConnectionString"
connectionString="DefaultEndpointsProtocol=https;AccountName=haveibeenpwned;AccountKey=mykey" />
</connectionStrings>
Just like a SQL connection string (kinda). Note the endpoint protocol – you can go
HTTP or HTTPS. Clearly the secure option is preferable.
Into the code itself, we begin by getting a reference to the storage instance:
var connString = ConfigurationManager.ConnectionStrings["StorageConnectionString"].ConnectionString;
var storageAccount = CloudStorageAccount.Parse(connString);
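For reference, the snippets throughout this post assume the classic WindowsAzure.Storage package (that’s where the TableEntity base class mentioned earlier lives), so something like the following using directives are in play:
using System.Configuration;                 // ConfigurationManager
using Microsoft.WindowsAzure.Storage;       // CloudStorageAccount
using Microsoft.WindowsAzure.Storage.Table; // CloudTableClient, TableEntity, TableOperation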
Now we’ll get a reference to the table within that storage instance and if it
doesn’t already exist, we’ll just create it (obviously just a first-run thing):
var tableClient = storageAccount.CreateCloudTableClient();
var table = tableClient.GetTableReference("BreachedAccount");
table.CreateIfNotExists();
Before we can start chucking stuff in there, we need that entity I mentioned earlier
so let’s create a BreachedAccount that inherits from TableEntity.
public class BreachedAccount : TableEntity
{
    public BreachedAccount() { }

    public string Websites { get; set; }
}
Notice how we haven’t created partition or row keys? They’re inherited from the
TableEntity and what it means is that we can now do this:
var breachedAccount = new BreachedAccount
{
    PartitionKey = "bar.com",
    RowKey = "foo",
    Websites = "Adobe;Stratfor"
};
var insertOperation = TableOperation.Insert(breachedAccount);
table.Execute(insertOperation);
If we were to add another attribute to the BreachedAccount
entity then we’d see that stored too, even though the data already in there
conforms to a different schema. That’s the neat thing about many NoSQL database
implementations in that you’re not constrained to a single schema within
the one container.
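Just to sketch what that looks like (BreachDate here is a hypothetical property, purely for illustration), the entity can simply grow a new attribute later on:
// The same entity at some later point with a hypothetical extra attribute added
public class BreachedAccount : TableEntity
{
    public BreachedAccount() { }

    public string Websites { get; set; }

    // New property - no schema migration required; rows inserted before it
    // existed still read back fine and just return null for BreachDate
    public DateTime? BreachDate { get; set; }
}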
Now if I want to insert 154 million records I don’t really want
to do 154 million individual inserts as each connection does have some
overhead. What I want to do is batch it and
that looks more like this:
var batch = new TableBatchOperation();
batch.Insert(breachedAccount1);
batch.Insert(breachedAccount2);
batch.Insert(breachedAccount3);
table.ExecuteBatch(batch);
Batching is about more than just committing a
bunch of rows at one time, it also has an impact on cost. Remember how Azure
Table Storage charges you $0.0000001 per “transaction”? I want to bring
that cost down a little and I can do this by batching because a batch is one transaction. However, there are some caveats.
Firstly, you can only batch records into the same partition. I can’t insert foo@bar.com
and foo@fizz.com within the same batch. However, I can insert
foo@bar.com and buzz@bar.com at the same time as I’m using the domain as the
partition key. What this meant is that when I wrote the code to process the
records I had to sort the data by domain so that I could keep all
the records for the partition together and batch them. This
makes sense in the context of Julie’s earlier comment about the partition being
tied to a machine somewhere.
Secondly, you can only batch up to 100 rows at a time. Those 32 million Hotmail
addresses? That’s 320,000 batches thank you very much. This meant my importer
needed to not only enumerate through accounts ordered by domain, but each time
it had a collection of 100 it needed to commit them before moving on. Per the
previous point, it obviously also had to commit the batch as soon as it got to
the next domain as it couldn’t commit to multiple partitions
in the one batch.
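As a rough sketch only (not the actual importer, and assuming LINQ plus an accounts sequence already populated with the domain as PartitionKey and the alias as RowKey), the shape of that loop is something like this:
// Group by partition key (the domain), then chunk each group into batches of 100
var batches = accounts
    .GroupBy(a => a.PartitionKey)
    .SelectMany(g => g
        .Select((a, i) => new { a, i })
        .GroupBy(x => x.i / 100, x => x.a));

foreach (var chunk in batches)
{
    var batch = new TableBatchOperation();
    foreach (var account in chunk)
        batch.Insert(account);
    table.ExecuteBatch(batch); // one transaction per batch of up to 100 rows
}
In reality you wouldn’t group 154 million rows in memory like this – you’d stream the pre-sorted data and flush the batch every 100 rows or whenever the domain changes – but the constraints are the same either way.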
Ok, so we’ve got data in the system, but that’s just the start. Of course it’s also
the hardest bit so that’s good, let’s now pull records back out. Obviously I’ve
designed the whole thing to be ultra fast in terms of reading data based on the
email address. Remember that this is what the partition and row keys are composed
of (the domain and the alias respectively).
It’s pretty much the same deal as earlier in terms of needing a storage account
object and then a table client after which you can just invoke the “Retrieve”
method and pass it the partition key (the domain) and the row key (the alias):
var retrieveOperation = TableOperation.Retrieve<BreachedAccount>("bar.com", "foo");
var retrievedResult = table.Execute(retrieveOperation);
var breachedAccount = (BreachedAccount)retrievedResult.Result;
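One small thing worth calling out (this is the classic SDK behaviour as I understand it): if there’s no row under that partition and row key, Execute doesn’t throw, the Result on the TableResult just comes back null, so a lookup for an account that isn’t in any breach looks something like this:
var retrieveOperation = TableOperation.Retrieve<BreachedAccount>("bar.com", "foo");
var retrievedResult = table.Execute(retrieveOperation);
var breachedAccount = retrievedResult.Result as BreachedAccount;

if (breachedAccount == null)
{
    // Nothing stored under that partition and row key - the account isn't in any breach
}
And that’s really the point of composing the keys out of the email address: every lookup is a single point query against one partition and one row, which is what makes it so fast.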