Windows Azure Tables are a non-relational, key-value-pair, storage system suitable for storing massive amounts of unstructured data. Whereas relational stores such as SQL Server, with highly normalized designs, are optimized for storing data so that queries are easy to produce, the non-relational stores like Table Storage are optimized for simple retrieval and fast inserts. This article will cover the very basics of Windows Azure Table storage and provide you with resources and suggested topics to continue your learning.
Some people, when first learning about the Windows Azure platform, find it hard to understand the purpose of the Table Storage feature. This is especially true of those who are familiar with developing applications using highly relational data. To get a good understanding of how a Key-Value Pair system differs from a traditional relational database you can read Buck Woody’s article on the topic in his continuing series: Data Science Laboratory System - Key/Value Pair Systems.
The code examples provided in this article are written using the Windows Azure .NET SDK 2.2 with Visual Studio 2013; however, like most services on Windows Azure this SDK simply calls a REST based API in the background for you. This underlying REST API allows for a variety of program languages and platforms to use Windows Azure services and the Table Storage service is no different. You can find documentation on using the Table Storage service from node.js, Java, PHP, Python and Ruby at http://windowsazure.com.
Getting Started
To get started using the Table service, we’ll first need to have a Windows Azure account and create a Storage Account. You can get a free trial account or, if you have a MSDN Subscription, you can sign up for your Windows Azure benefits in order to try out the examples included in this article. Once you have an Azure account you can then create a storage account that can then be used to store Tables.
To create a storage account, log in to the Windows Azure management portal. After you log in to the portal you can quickly create a Storage Account by clicking on the large NEW icon at the bottom left hand of the portal.
From the expanding menu select the ‘Data Services’ option, then ‘Storage’ and finally, ‘Quick Create’.
You will now need to provide a name for your storage account in the URL textbox. This name is used as part of the URL for the service endpoint and so it must be globally unique. The portal will alert you if you select a name that is already in use. The next step is for you to select a location for your storage account by choosing one of the data center locations in the dropdown. This location will be the primary storage location for your data, or more simply, your account will reside in this Data Center. You may also see an additional dropdown if you have more than one Windows Azure subscription. This allows you to select the subscription that the account will be related to.
All storage accounts are stored in triplicate, with transactionally-consistent copies in the primary data center. In addition to that redundancy, you can also choose to have ‘Geo Replication’ enabled for the storage account. ’Geo Replication’ means that the Windows Azure Table data that you place into the account will be replicated in triplicate to another data center within the same region. So, if you select ‘East US’ for your primary storage location, your account will also have a triplicate copy stored in the West US data center. This mapping is done automatically by Microsoft and you can’t control the location of your secondary replication, but it will never be outside of a region so you don’t have to worry about your West US based account somehow getting replicated to Europe or Asia as part of the Geo Replication feature. Storage accounts that have Geo Replication enabled are referred to as geo redundant storage (GRS) and cost slightly more than accounts that do not have it enabled, which are called locally redundant storage (LRS).
Click ‘Create Storage Account’ once you have selected the location and provided an account name. In a few moments the Windows Azure portal will then generate the storage account for you. When the account is fully created, you will see a status of Online. By selecting the new storage account in the portal, you can retrieve one of the access keys we will need in order to work with the storage account.
Click on the ‘Manage Access Keys’ at the bottom of the screen to display the storage account name and two 512 bit storage access keys used to authenticate requests to the storage account. Whoever has these keys will have complete control over your storage account short of deleting the entire account. They would have the ability to upload BLOBs, modify table data and destroy queues. These account keys should be treated as a secret in the same way that you would guard passwords or a private encryption key. Both of these keys are active so that if necessary you can perform a rolling change in configuration for all applications utilizing the accounts in case one of the keys is compromised.
Your storage account is now created and we have what we need to work with it. For now, get a copy of the Primary Access Key by clicking on the ‘copy’ icon next to the text box. Now that we have our storage account set up we can talk about Table Storage.
What are Tables and Entities?
In Windows Azure Table Storage, the term ‘Table’ is used to describe a grouping of entities. You can loosely think of an entity as a row of data, but it’s more like a collection of properties and values that were stored together within a table. Unlike relational databases, the entities inside of a table do not need to have the same structure or schema. This means that we might have an entity that stores properties about a product in the same table as an entity that stores properties about the product options.
There are some rules about entities: each entity can have up to 252 properties but the size of an entity with all of the properties and values cannot exceed 1 MB. Table storage entities support the following data types: Byte array, Boolean, DateTime, Double, GUID, Int32, Int64 and String (up to 64KB in size). There are an additional three required system properties that must exist on every entity: PartitionKey, RowKey and TimeStamp. The partition key is way to group entities within a table and control the scalability of the table which we will touch on in a bit. The row key is a unique identifier for an entity within a given partition. The combination of partition key and row key is the unique identifier for an entity within a table, comparable to a primary key in a relational database. TheTimestamp property represents the last time the entity was modified and is managed by the Storage sub-system. Any change you make to Timestamp will be ignored.
There is no direct table-specific limit to how much data you can store within a table. The size is restricted only by the allowable size of a Windows Azure Storage account which is currently 200 TB, or 100 TB if the storage account was created prior to June 7th, 2012. A storage account can hold any combination of Windows Azure Tables, BLOBs or Queues up to the allowable size of the account. There is a reason that there is a difference in the allowable size depending on when the storage account was created. Starting on that date, accounts are created on the newer infrastructure of Windows Azure Storage which drastically increased the throughput and scalability of the system.
Store some Data!
Now that we know about Tables and entities, let’s store some data. Most applications will be using a client library to write data into the tables, or calling the REST API directly. For our example we will use a simple C# Console application that will create a table in a storage account and then add an entity to the table using the 2.1 Windows Azure Storage Library for .NET.
Using Visual Studio, create a C# Console application from the standard template. By default the created project will not have a reference to the storage library, but we can add that easily using the NuGet package manager.
Right-click on the project and select ‘Manage NuGet Packages…’ from the context menu.
This will load up the Package Manager UI. Select the ‘Online’ tab from the Package Manager dialog, and search for ‘Azure Storage’. As of the time of this writing version 2.1.0.3 was available. Select the Windows Azure Storage package and click ‘Install’.
The same result can be achieved via the Package Manager Console if you prefer to manage your packages from a command line. Type ‘Install-Package WindowsAzure.Storage’ into the Package Manager Console and the assemblies will be added to your project just as they would be via the UI show above. You’ll see that several assemblies have been added to the project references including ones for OData, spatial and Entity Data Model (EDM). These are used by the client library when working with table storage.
Open the program.cs file and add the following statements to the top of the file:
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Auth;
using Microsoft.WindowsAzure.Storage.Table;
Add the following to the Main method:
string accountName = "YOUR_ACCOUNT_NAME";
string accountKey = "YOUR_ACCOUNT_KEY";
try
{
StorageCredentials creds = new StorageCredentials(accountName, accountKey);
CloudStorageAccount account = new CloudStorageAccount(creds, useHttps: true);
CloudTableClient client = account.CreateCloudTableClient();
CloudTable table = client.GetTableReference("sportingproducts");
table.CreateIfNotExists();
Console.WriteLine(table.Uri.ToString());
}
catch (Exception ex)
{
Console.WriteLine(ex);
}
Console.WriteLine("Done... press a key to end.");
Console.ReadKey();
You will need to modify the code above and change the value for the accountName to match your own storage account name. Then provide one of the account storage keys from the portal to assign to the accountKey variable.
Walking through the code above you’ll see that we use the accountName and accountKey variables to create aStorageCredential object. This is then used to create a CloudStorageAccount object, which is the root object we use for access to any of the storage subsystems: BLOBs, queues and, of this article Tables. You’ll see that we pass in the credentials for the storage account as well as indicate that we want to access to the storage account using HTTPS. When we make a call against table storage from the storage library, the actual call to the service behind the scenes is made against the REST based API. Each call is signed using the credentials and, if we specify to use HTTPS, it is sent encrypted over the wire.
Note that there are other ways to create the CloudStrorageAccount object, such as using the staticCloudStorageAccount.Parse method if you have a full storage account connection string. In your production code you should store the credentials or the connection string in configuration and read the values from there, or have them passed in to your code so that you aren’t hard coding the account to be used.
After the CloudStorageAccount is created the code then creates a CloudTableClient object, which is used as a façade to work with Table Storage directly. The code then creates a CloudTable object using theGetTableReference method of the CloudTableClient object. This is just a reference for the client library to use, it hasn’t made a call to the REST API yet at all. The next line, table.CreateIfNotExists(), will actually make the first call to the Table service REST API and, if a table named “sportingproducts” doesn’t already exist within the storage account it will create it. Note that the call to CreateIfNotExists is idempotent, meaning we can call it multiple times and it will only ensure the table is created. If the table already existed no action would be taken and no data that might already exist within the table would be changed.
After the table is created we write to the console the URL of the table. Remember that the table service, like all the Windows Azure Storage services, exists as a REST based API so every table has its own resource location, or URI. Calls against this table, such as inserts, selects, etc., are all sent to this URI.
Now that a table has been created we can add entities to it. First, we define an entity. Add the following class to the project:
public class SportingProductEntity : TableEntity
{
public SportingProductEntity(string category, string sku)
: base(category, sku) { }
public SportingProductEntity() { }
public string ProductName { get; set; }
public string Description { get; set; }
}
You’ll notice that we inherit our entity from TableEntity, which is a base class that provides the required properties of PartitionKey, RowKey and Timestamp. Note that you don’t have to inherit from this base class, but if you choose not you will want to implement the ITableEntity interface, including implementing some methods that are handled for your on the TableEntity object. The Table storage classes and method in the storage library assume your entities will either inherit from TableEntity or implement the ITableEntity interface. Choosing to not do either of these is possible, but is beyond the scope of this article.
Every entity stored must have a partition key and row key provided. The first constructor in the example uses a category of sporting goods as the partition key and the product SKU as the row key. The constructor passes those values along to the base TableEntity constructor that takes the partition key and row key parameters. Later in the article we will cover more on this choice for our keys and how it affects queries. You will notice that we also define a second constructor that has no parameters. This is required so that the object can be deserialized later when being retrieved from storage by the client library’s default implementation within TableEntity.
We have defined an entity, so now we can create one and add it to the table. Add the following code immediately after we write out the URI to the console in the Main method:
SportingProductEntity entity = new SportingProductEntity("Baseball", "BBt1032")
{
ProductName = "Louisville Slugger",
Description = "A great bat for any level of player!"
};
TableOperation insertOperation = TableOperation.Insert(entity);
table.Execute(insertOperation);
Console.WriteLine("Entity inserted!");
Here the code is creating an instance of the SportingProductEntity object providing the category of ‘Baseball’ and a SKU of ‘BBt1032’. A TableOperation object is created to insert the entity. If you look at the static methods forTableOperation you’ll see you can perform multiple types of operations such as Delete, Merge, InsertOrMerge, orReplace. The next line of code executes the command against the table. When this line of code executes behind the scenes, an OData insert command against the table is generated and sent to the REST based API, which then inserts the row into the table.
With the addition of this code, our method is actually no longer idempotent. If you run this code more than once you’ll receive an exception. Digging down into the exception information, you’ll find the error code “EntityAlreadyExists”. This is because the Insert has already occurred and the unique entity, as defined by the partition key and row key, is already in the table.
Change the creation of the TableOperation line to be the following:
TableOperation insertOperation = TableOperation.InsertOrReplace(entity);
Now, when you run the code, you will no longer receive the exception. We have told the system to either insert the entity if it is not present, or to completely replace the entity with this instance if the entity is already in the table. The way that you decide to deal with key collisions in your own solutions will depend on where the data is coming from and your own business requirements. As mentioned earlier in the article, you have a number of table operation options to choose from.
Viewing your Data
In our code example, we wrote a query in C# to add data to our table; however, most developers are used to a tool like SQL Server Management Studio to query and view data. Having a good Windows Azure Storage tool makes it easy to ensure that your code is working correctly with the storage system and you are getting the results you are looking for. There are a variety of storage tools out there. Most of these tools provide a mechanism for querying your storage tables as well as being able to manipulate the data within the tables. For example, LINQPad is a good tool for querying Table Storage and with the Windows Azure SDK for .NET you can use the Server Explorer to query your tables directly from within Visual Studio.
In this article we’ll use Cerebrata’s Azure Management Studio to load up some additional data in oursportingproducts table and then view it. For just a quick sample we will upload a simple csv file. Here is a sample of the data from the file:
From within Azure Management Studio I find the table from the storage account I wish to upload the data to and right-click the table. From the context menu I can select Upload > From CSV….
Once I’ve selected my CSV file, I will be prompted for some mapping information.
Note that I’ve set the checkbox to indicate that the name of the columns, which will map to the properties, exist in the top row. I’ve also indicated that the text qualifier is a double quote. If you notice that line 6 from the sample file has a comma in the description, so I need to surround the full description in this qualifier since commas are my column delimiter. You can check out the format preview if you like to see how the mapping looks for the first few lines of the file. Click Next where you are satisfied.
On the next screen I can modify the data types for the columns that I’m importing, and also set up what values should map into the PartitionKey and RowKey-Values. You can also determine if you want to replace an entity if it already exists with the data you are importing. Once you are set with these, click on ‘OK’ to perform the import. Azure Management Studio has a ‘Transfers‘ window that will show you the progress of the import.
Once the import is complete I can refresh my sportingproducts table view in Azure Management Studio to see the data.
If you don’t see all of your data, note that Azure Management Studio doesn’t pull back your entire table when it shows the view. It only pulls back the first 1,000 entities by default (which you can change in the settings).
As mentioned earlier, there are a variety of Windows Azure Storage tools out there. When looking for any tool, you should try several and pick one that best suits your needs.
Querying the Table
We’ve successfully added data to the table both via code and by importing with a tool. While the Storage tools allow us to verify that we uploaded what we expected from our code, the applications are likely to be using code to retrieve their data. Let’s look at a very simple sample query: To do this we replace the code in the Try catch of theMain method with the following code:
try
{
StorageCredentials creds = new StorageCredentials(accountName, accountKey);
CloudStorageAccount account = new CloudStorageAccount(creds, useHttps: true);
CloudTableClient client = account.CreateCloudTableClient();
CloudTable table = client.GetTableReference("sportingproducts");
TableOperation retrieveOperation = TableOperation.Retrieve<SportingProductEntity>("Baseball","BBt1032");
TableResult query = table.Execute(retrieveOperation);
if (query.Result != null)
{
Console.WriteLine("Product: {0}", ((SportingProductEntity)query.Result).ProductName);
}
else
{
Console.WriteLine("The Product was not found.");
}
}
catch (Exception ex)
{
Console.WriteLine(ex);
}
When you execute this code, you’ll see that the entity was retrieved and the product name was displayed. In this case you will see that we queried by the partition key and row key, so essentially we were performing a lookup by the entity’s primary key. There are numerous ways of querying for the data you have in storage, but knowing the exact partition key and row key will always be the fastest.
For the new Table Service Layer you can also use the lightweight query class TableQuery. This class allows you to construct a query to send to the server that is more complex than simply asking for the PartitionKey and RowKey; however, the syntax is not as easy as working with LINQ. If our data had an integer property for StockOnHand we could search for all items in the Baseball category that we have in stock using the following:
//....
CloudTable table = client.GetTableReference("sportingproducts");
TableQuery<SportingProductEntity> baseballInventoryQuery = new TableQuery<SportingProductEntity>().Where(
TableQuery.CombineFilters(
TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "Baseball"),
TableOperators.And,
TableQuery.GenerateFilterConditionForInt("StockOnHand", QueryComparisons.GreaterThan, 0)));
var baseballInventory = table.ExecuteQuery(baseballInventoryQuery);
if (baseballInventory.Any())
{
foreach (SportingProductEntity product in baseballInventory)
{
Console.WriteLine("Product: {0} as {1} items in stock", product.ProductName, product.StockOnHand);
}
}
else
{
Console.WriteLine("No inventory was not found. Better order more!");
}
Note that the construction of the filter we want to apply on the server happens by building up the TableQuery. We then execute the query against the table. The result of the execution is an IEnumerable, so from there we can apply any of the LINQ operators we are familiar with; however, all of that logic will be applied on the client-side.
In previous versions of the .NET Storage library prior to 2.x you could use LINQ to query your tables, which under the hood used WCF Data Services. When the 2.0 Storage library was released it provided the new Table Service layer which utilizes the new Microsoft.Data.OData library. This new approach offers greater performance and lower latency than the previous table service layer, but didn't ship with an IQueryable interface until version 2.1. While you can still use the 1.x style of queries with the previous classes found under theMicrosoft.WindowsAzure.Storage.Table.DataServices namespace, or even the 2.1 "fluent" mechanism shown above, you can also use the new IQueryable option using the CreateQuery method off the table reference. The same query above using the "fluent" approach would look like the following if we used the new IQueryableapproach:
//....
CloudTable table = client.GetTableReference( "sportingproducts");
var baseballInventoryQuery = ( from entry in table.CreateQuery<SportingProductEntity>()
where entry.PartitionKey == "Baseball" && entry.StockOnHand > 0
select entry);
var baseballInventory = baseballInventoryQuery.ToList();
if (baseballInventory.Any())
{
foreach ( SportingProductEntity product in baseballInventory)
{
Console .WriteLine( "Product: {0} as {1} items in stock" , product.ProductName, product.StockOnHand);
}
}
else
{
Console .WriteLine( "No inventory was not found. Better order more!" );
}
Note that the call to ToList on the query actually executed the request and performed the filter on the server. The query would have also executed if the code had simply iterated the results; however, the Any() extention method is not supported by the provider so the code execute the query first. Not all extension methods will be supported so you may need to execute your query before performing additional operations, but be aware that once the query is executed the additional operations will occur client side.
A word about partitions and queries
The scalability you get from a Windows Azure Table comes from how you partition your data using the partition key. For example, let’s look at the data we used above. We choose to use categories as the partitions key. In the table sample below you’ll see that we have at least two partitions: Cricket and Baseball. Each entity is stored with a category as the partition key and uses the product SKU as the row key.
As mentioned in the code example above, the best possible query to run is one that includes both partition key and row key as this is the primary key for an entity. In the code sample, we knew the category and the SKU for what we were looking for so the query will be the fastest we can achieve. It might not always be the possible for you to know both keys, so it is best to understand the flow of how you will access this data in order to get the best performance. This category partition approach might be acceptable if, for the vast majority of time, the flow of our data-access starts at the category. For instance, the approach above may make sense if the flow of our solution was to retrieve all of the products within a category to display on a product page and then a user can select an individual product where we already know the category.
All entities with the same partition key within the same table are guaranteed to be accessed via the same partition server. A partition server knows about all the data that exists within one or more partitions. The number of partitions a partition server is responsible for depends on how much data is within the partitions and how often they are being accessed. The Windows Azure Storage system manages where partitions are stored and keeps a mapping of what partitions are managed by which partition servers. This mapping is then stored at the front end layer, which is made up of servers that are load-balanced for all requests to storage. As requests for data come into the system, the front end nodes direct the request to the correct partition server, or servers, to retrieve the data. That means that when we query for all of the data in the “Cricket” category we will only have to talk to a single partition server in Windows Azure to retrieve our data because all of that data resides in a single partition.
Remember that a table can have many partitions so a query against a table may actually cross one or more partition servers. In the example above, our query wouldn’t be very efficient at all if we only had a product SKU. This is because we do not know the partition key. In essence, our query for a specific SKU would have to be executed against every partition server that stored data for our table in order to find the specific matching SKU. In addition, since a row key itself is only unique to the partition it lives in, the query will not assume that once it has found a single result that the query is complete. In order for the query to be complete, the system will need to look through the entire table which may be spread across multiple partition servers. This is what in relational databases we would call a table scan which is very inefficient.
It is very important to choose the correct partitioning scheme. You may need to try multiple options to see what works best for you, or even store the same data in different ways in order to optimize the system for your scenarios. This is not unlike when companies use a data warehouse to store the same data as their transactional systems in a manner which is easier to build analytical queries from.
Continuation Tokens
One of the things to keep in mind when using Table Storage is that, in some cases, your query may not complete on a single call to the REST based API. There are a lot of reasons this might occur, such as when the query crosses multiple partition servers or the amount of data coming back is quite large. When this happens, the storage service will return continuation tokens with the query results. These continuation tokens can then be used to continue the query by making additional calls.
The good news is that the .NET storage library can completely abstract the continuation tokens from you and your code. In the code example above we used the TableOperation.Retrieve and CloudTable.Execute to perform the query. Using this mechanism the storage library handles the continuation tokens for you and keeps making calls back to the storage service until all of the results are found. There are methods on the CloudTable class, such as ExecuteQuerySegmented which return a TableResultSegment object. Think of this as a partial result in which your code can look for continuation token and decide if it should continue the query or not. This approach gives you much more control over your queries. For example it provides a means for you to do manual paging of results or ensure that you don’t accidentally pull back millions of entities.
If you do decide to use the segmented queries remember to always check for the existence of continuation tokens rather than rely on checking to see if the last result had any rows. It is very possible to receive an empty result set but still have continuation tokens. This happens most often when a query crosses partition server boundaries but that particular partition didn’t have any results to include.
Do I get Transactions?
One of the most common questions that come up when talking about Table Storage is regarding whether transactions are supported. Table storage does support batch transactions against data within the same table and the same partition. There are several rules about the transactions though: all of the entities must exist within the same partition of the table, the number of entities in the transaction can’t exceed 100 and the entire batch being sent to the server cannot exceed 4 MB in size. As you can see, there are limitations to the level of transaction support you get which revolves around the partition. This is another good reason why choosing your partition key scheme is very important.
Summary
Windows Azure Table Storage is a highly scalable, Key-Value pair NoSQL storage that is offered as a service. Developers on the Windows Azure Platform should become familiar with how Table Storage works and how it differs from the relational databases they are used to. Knowing how table storage works will help you determine if it is a good fit for your particular requirements
The features and capabilities of table storage continue to grow. At the BUILD conference in June it was announced by the Windows Azure Storage team that they plan to support CORS for table storage and JSON results by the end of 2013, which will be very welcome features.
This article only scratched the surface using Table Storage. To become more familiar with the service I highly recommend researching the following topics and links:
- Windows Azure Storage: What’s Coming, Best Practices and Internals – This is a video from the BUILD conference.
- Windows Azure Storage Client Library 2.0 Tables Deep Dive – The 2.x version of the client library was a major change from previous versions. This blog post from the Windows Azure Storage Team helps shed some light on the differences. Note that this came out when 2.0 shipped, and there has already been other versions shipped.
- Windows Azure Storage Scalability and Performance Targets – It is very important to understand the scalability and throughput needs of our solution and compare those against the known targets for the storage system itself to make sure it matches up.
- If you need to securely provide a means for code to directly call the Table Storage APIs but you don’t want to provide them the access keys for a storage account you can create a Signed Access Signature for them to use.
- Understand how to leverage Retry policies when using the Client Library through the TableRequestOptions parameter that can be passed to most of the query operations. This is very important for dealing with transient errors that might occur in distributed systems.
- Understanding how you are charged for using the Table storage service. Currently you are charged by how much data you store in your account as well as the number of transactions against your account. The amount of these charges have been going down as the cloud vendors make file storage a commodity, but it is still important to understand how you are charged no matter how small the amounts are.
- You can control serialization of entities beyond the reflection based serialization provided by default in theTableEntity base class by either manually implementing your own using the ITableEntity interface for your entities, or by using the EntityResolver delegate.
- Getting the Most out of Windows Azure Storage – This is a great video on by Joe Giardino that covers many of the new features of the storage library as well as details best practices.
Comments
Post a Comment