My Data Day at Build – Azure Elastic Databases, Azure Search, IoT, Azure Stream Analytics, and more
[This post is by Jeff Mlakar, a member of the Business Intelligence Team at Bennett Adelson. Follow us @BIatBA and @JeffMlakar]
Today was Day 2 for me at Microsoft Build. And it was all about DATA. From the new Azure SQL Database Elastic Databases, to Azure Search, to Big Data, to Azure Stream Analytics, and Azure Machine Learning. All data. And, more amazingly, all data in the cloud. I remember when Azure’s data offerings were limited to their blob and table storage and the beginnings of SQL Server Data Services. To see how much the data offerings in Azure have exploded is surprising and exciting. Even today’s keynote was heavy on Azure Machine Learning. Analyzing mapped human genomes in R and exposing that algorithm as an api in the cloud so anyone with a (now surprisingly accessible) mapped human genome can create a heat map of their health risks shows how what is happening in data analytics can really make personal differences in our lives. And, I gotta say, when thinking about IoT and ML, I didn’t see cow pedometer artificial insemination coming. “AI meets AI”…
My first session:
Modern Data in Azure
Presented by Lance Olson, et al. In many ways, it was perfect that this session kicked off my day of data. This demo was presented as a tour of Azure data offerings, primarily from a developer point of view. As in, it wasn’t just an explanation of different data storage types in Azure, as I was expecting. They built a web app and brought in each form of data technology as it was needed for the application. A nice approach. The app was called the WingTip Ticketing application, and would be expanded on in the next session to be a SaaS offering. The first data offering being added to the ticketing application was:
Azure SQL Database Elastic Database Pool
The ticket ordering was to be handled by a relational database, Azure SQL Database. The argument for using the brand new Elastic Database Pool (announced today and in preview) was that it would make sense to logically and physically partition the database by artist, as some artists might naturally have far more load than others, depending on demand. They demonstrated ticket sales loads against a standard database and against a newly sharded Elastic Database Pool, sharded by artist. The load was measured in DTUs, or Database Throughput Units. I’ll explain this more in a bit as it was heavily covered in the next session. I was impressed by how performance could be increased, but I was more impressed with how easy it all was to configure. This includes setting up the sharding strategy, and integrating the Elastic Database Client Library into the web application code. Elastic Databases were covered thoroughly in the next session, so I’ll come back to them then.
Azure Document DB
They used Azure’s NoSQL document database service, Azure Document DB to handle ratings and reviews, with the thought that this data would be largely unstructured. For those who don’t work with document databases, you can basically think of a document as a record in a table whose schema is fluid. This way when new data is added, no schema changes are needed, just updates to the Model, View, and Controller in the web. Documents could be created with the code:
Documents doc = Client.CreateDocument(Collection.SelfLine, entity);
Even SQL-like queries could be created using
Azure Search was utilized to provide users with an intuitive search box in the ticketing application. Azure Search is a fully managed Search-as-a-Service in the cloud. It reminds me of working with ElasticSearch in the way you set up indexes, analyzers, and suggestions, though since we are in the Azure cloud it is FAR easier to provision and set up. Once an index and indexer was set up and data populated, wiring up the search was easy using the namespace:
And using the SearchIndexClient for operations like:
They showed the use of better scoring in Search, as well as suggestions. They didn’t demo any highlighting of hits capability, but I asked them afterwards and they said this was available.
Apache Storm for HD Insight
Some Big Data work was then done for the interesting example of upping the search results score based on number of recent tweets. They used Apache Storm on HDInsight with a spout to twitter based on a hashtag of a fictional music star. They bolted this to our Azure Search index and then had us in the audience tweet to the hashtag. When the hard-coded number of 10 tweets for the hashtag was met, that artist’s score would increase in the Search results. A compelling example.
All-in-all a great tour of many newer Azure data offerings. It was like 4 sessions in one.
My next session:
Building Highly Scalable and Available SaaS Applications with Azure SQL Database
Presented by Bill Gibson and Torsten Grabs. This session was more of a deep-dive into the new elastic capabilities of Azure SQL Databases, like I mentioned before. For me, I kept trying to get my head around how this was different from Federations. I’m starting to get the idea now, though, that this is not just a logical separation, but a strong data sharding strategy that can handle predicted and unpredicted loads while saving you from having to write all the routing code that you had to with Federations.
We’re back in the WingTip Ticketing application (a tongue-twister name all presenters were having trouble with). This time we’re making the application to be a SaaS offering, with different customers using the service for their own ticketing pages. We’re shown how to set up our elastic databases in PowerShell scripts. We create a database, a database per customer, we register these databases with the ShardMap, and then we add our customers to traffic manager rules. What we end up with is a collection of customer databases and a common customer catalog database. Not that different from a Federation, but without the usual bottlenecks.
Establishing the connection is achieves as follows:
SqlConnection conn = SaasSharding.GetCustomerShardmap.OpenConnectionForKey(
passing in the customer’s key.
We’re shown how we can scale our databases’ min and max DTUs via the portal (see pic below), PowerShell, Rest APIs, or T-SQL.
I mentioned DTUs (Database Throughput Units) before, so let me elaborate on at least my current understanding of what it means. Of 4 dimensions of performance: Reads, Writes, ComputeCPU, and Memory, a DTU is the max value of these 4, after they have (somehow) been adjusted. I’ll have to read their whitepaper sometime to see exactly how this is done.
In the end, it all looks good. Easy to manage and with the possibility of handling unpredictable usage.
Building Data Analytics Pipelines using Azure Data Factory, HDInsight, Azure ML, and More
Presented by Mike Flasko. Boy, if ever there was a session that made me feel like I didn’t know anything about data flows, this was it. Azure Data Factories, named so because they resemble a Henry Ford assembly line for data, are a shift for me in my SSIS-centric ideas of data flows. They are a new preview service for modeling and executing the data analytics pipeline. In a visual designer in the Azure portal, we create data sets (be they tables or files), activities (like Hadoop jobs, custom code, ML models, etc), and pipelines (a series of Activities) to complete a data analytics load process.
The Data Set source is defined in a json document. Activities to partition data are done via a Hive script on an HDInsight cluster. We combine and aggregate data in an activity defined by a JSON object. A final activity is used to call an Azure ML scoring activity. We don’t need to know its inner workings. Only the schema of the input and output and how to call the algorithm.
The end result is a process that takes cell phone log data, combines it with our existing customer data, aggregates it, and spits out a data set that says what the probability is of each customer cancelling their service. This end result is then easily (also using the Factory) sent to PowerBI for a lovely dashboard.
This is all still really new to me and I really need to study up on this. I have left over questions like how would you handle workflows for bad data and what is the best way to promote a factory from staging to prod (Mike answered the latter for me after the session: leaving the factory as is and swapping linked services definitions to make the factory run against production). One way or another, the data game is changing, and this session was an excellent introduction to the brave new world.
Gaining Real-Time IoT Insights using Azure Stream Analytics, Azure ML, and Power BI
Presented by Bryan Hollar et al. Azure Steam Analytics was just released to GA 2 weeks ago and this is the first I’ve gotten to see it in action. After seeing case studies from Fujitsu and the Kinect team, ASA implementation was mapped out for us. The shift is from thinking about reporting on data at rest, to data in motion. For example, we could analyze how many twitter users switched sentiment on a topic within a minute in the last ten minutes. SAQL (Stream Analytics Query Language), makes this easy by being a flavor of SQL mixed with temporal extensions. You’re analyzing within a time window, and these windows could be tumbling, hopping, or sliding depending on how you’ve set up your queries. For example:
SELECT Topic, COUNT(*) AS TotalTweets
FROM TwitterStream TIMESTAMP BY CreatedAt
GROUP BY Topic, HoppingWindow(second,10,5)
With Azure Stream Analytics, your data flow pipeline is set to pull from existing event hubs, analyze, and persist (or display) its results. The processed data doesn’t even need to be persisted to be reported on. ASA basically exists in the same place in the data flow pipeline as ML. But where ML would take the “cold path” of analyzing large sets of data at rest, ASA is analyzing the data as it streams. Though now, for a brief preview, ML is integrated into ASA. I’m told you’ll be able to sign up for this preview at the ASA team blog:
As far as integrating the Internet of Things, it’s basically a matter of configuring your event hub in your data pipeline. So, there’s not much difference pulling from twitter or the Internet of Things. You can then configure your output to be PowerBI (also in preview) for a real-time dashboard. The most impressive IoT and ASA example was by Fujitsu who showed an impressive app that geo-mapped energy consumption data and could zero in on spikes right down to an area of a building in real time. Though, now that I think about it, they may be tied with pedometer analysis to tell when your cows were in heat. And now I finally know why they call it a “Heat Map”.
Building Big Data Applications Using Azure HDInsight Service
Presented by Asad Khan. The final session of my day was a tour-de-force of Big Data in Azure. It was four one-hour sessions compressed to one and it was a doozy. Started with 30 seconds of the basics of big data: caring about volume, velocity, variety, and variability of data. They mentioned how Apache Hadoop is an Open Source platform for large amounts of unstructured data, but how the managed infrastructure of Azure makes HDInsight an enticing implementation. They covered HBase for NoSQL, and taught us how to use Storm for streams of data, by showing us how to build spouts for twitter and bolts for Signal R to display data on the web in real time. It was an ambitious session, but pulled off very well.
I’m just now realizing that of everything I’ve mentioned, Big Data with HDInsight is the old man at the ripe old age of a year and a half. Speaks volumes to how much Microsoft is invested in growing its cloud data offerings. Can’t wait to see what’s next!