These days, virtually no corner of the enterprise remains untouched by IT. From customer-facing websites and brick-and-mortar storefronts to mobile apps and clouds, IT—and how well an organization puts it to work—separates the leaders from the laggards. A new report from the Business Performance Innovation (BPI) Network, “Accelerating Business Transformation through IT Innovation,” offers insights into this rapidly evolving space.
Among the key findings: nearly 70 percent of global managers surveyed by the BPI Network believe technology has become “far more important” to their business. However, turning that conviction into reality is a problem. Less than half (47 percent) of the 250 executives polled rate the level of innovation within their IT groups as “good” or “very high,” while 52 percent rate it “poor” or say their organization is “just making progress.” In addition, only 42 percent believe their IT groups are doing a good job of becoming a more strategic, responsive and valued business partner.
“Business managers today are frustrated with the sluggish pace of innovation coming from their IT staffs,” said Tom Murphy, editorial director for BPI Network. “They believe new business-oriented metrics should be used to measure IT performance.” This translates into a new scorecard for the IT organization. “They want to see new ideas for value creation coming from IT. They want faster and cheaper development of applications. And some want to see the same metrics applied to both IT and the business teams.”
The report identified a second major issue: “Business leaders are very sophisticated in their understanding of new hybrid IT technologies, including advanced data center systems and the use of the cloud for app development, data processing, disaster recovery and secure data storage,” Murphy noted. “They know the capabilities, they see competitors benefiting from them, and they want their own IT groups to provide new and innovative tools in days and weeks, rather than months or years.”
Navigating the boundaries between business and IT is critical. CIOs recognize the need to work with business leaders to develop a common language and to collaborate on setting and meeting shared metrics, the report found. However, leaders, including CTOs, who resist these changes will face growing pushback from the CEO and CFO as they seek funds for traditional IT approaches, Murphy said. The report also found that funding is shifting from long-term CapEx budgets to more flexible OpEx budgets as organizations move from costly, on-premises data centers to cloud-based business models that are funded on an as-needed basis.
“Instead of asking for millions of dollars to build a data center that will support the organization for a decade, business leaders are more likely to use credit cards to add services needed in the short-term, such as when they need extra capacity during the peak sales seasons,” Murphy said.
The report recommends that CIOs focus on six core issues that define today’s hybrid IT: design an architecture that provides consistency across on-premises and off-premises environments; understand whether applications belong in an enterprise data center or the cloud; develop a more holistic view of data; implement automation and management using managed services; better understand IT consumption models; and adjust and adapt security to fit a hybrid model.
“CIOs should work closely with their peers to manage the organizational change,” Murphy said. “They should be prepared to work toward this goal in small steps, which will show progress and build buy-in across the organization.”
– See more at: www.cioinsight.com
By James Kobielus
To survive the competitive struggles, every fresh technological innovation must find clear use-cases in the marketplace. There must be some specific itch that the new approach can scratch at least as well, and hopefully much better, than the alternatives.
As the mania for Apache Spark grows in the big-data analytics arena, we must remember that it’s still an unproven technology. The early crop of commercial solutions that implement Spark haven’t yet converged on distinctive use-cases that call for Spark and not, say, Hadoop, NoSQL, or established low-latency analytics technologies. What is Spark’s application sweet-spot?
When you ponder Spark’s prospects, you must consider a related question. What exactly are the core deployment models and use-cases for which Spark is best suited in today’s crowded big data marketplace? What differentiators does Spark have over rival platforms, whether open source or proprietary, for addressing these requirements? And do these differentiators, taken as a whole, provide sufficient impetus for Spark to find its commercial sweet-spot rapidly and thereby achieve widespread adoption?
With these questions hanging in your mind, here are the principal deployment models in which Spark may prove its value in real-world applications:
The Internet of Things (IoT) may spell the end of data centers as we’ve traditionally known them. Data centers’ core functions — processing and storage — are increasingly being decentralized out to the network’s edges. The IoT is also greatly expanding the need for distributed, massively parallel processing of huge amounts of machine and sensor data of all sorts. Not just that, but the analytics required in these “fog computing” scenarios will increasingly emphasize low-latency, massively parallel processing of machine learning and graph analytics algorithms of great complexity.
As I detail here, fogs are clouds in which the primary processing nodes are network-edge endpoints, such as sensor-laden IoT devices. Fogs distribute the storage, bandwidth and other cloud resources out to the IoT endpoints, most of which are embedded deeply in the hardware infrastructure of the end applications. These fog requirements feel tailor-made for Spark, which includes an interactive real-time query tool (Shark), a machine-learning library (MLlib), a streaming-analytics engine (Spark Streaming), and a graph-analysis engine (GraphX). As the IoT industry converges, sometimes haltingly, toward a common fog infrastructure, Spark may just fulfill that niche better than any other open source platform.
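The discretized micro-batch model behind Spark Streaming can be sketched in a few lines of plain Python. This is a toy simulation, not Spark code, and the event names are invented for illustration:

```python
from collections import Counter

def micro_batch_counts(stream, batch_size=3):
    """Consume an event stream in fixed-size micro-batches and emit a
    running count after each batch -- a toy model of the discretized-
    stream approach that Spark Streaming uses."""
    totals = Counter()
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            totals.update(batch)   # process one micro-batch
            batch = []
            yield dict(totals)     # snapshot of running state
    if batch:                      # flush the final partial batch
        totals.update(batch)
        yield dict(totals)

# Hypothetical sensor events from an IoT deployment.
events = ["temp", "motion", "temp", "temp", "door", "motion", "temp"]
snapshots = list(micro_batch_counts(events, batch_size=3))
print(snapshots[-1])  # final running counts
```

Real Spark Streaming distributes each micro-batch across a cluster and checkpoints the running state, but the batching-plus-incremental-aggregation shape is the same.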
Spark, which can run on top of HDFS, clearly has the ability to shoulder practically any Hadoop cloud deployment model and use-case, not just those associated with the IoT. As a start, Spark can access and process data stored in HDFS, HBase, Cassandra, and any other Hadoop-supported storage system. As a general-purpose cloud platform, Spark boasts performance advantages vis-à-vis Hadoop, most notably its ability to parallelize models in real-time across distributed in-memory clusters. And unlike Hadoop’s MapReduce, Spark can combine SQL, streaming, and graph analytics within cloud analytics applications. Clearly, the cloud market seems ripe for Spark, especially in an era when distributed, heterogeneous storage layers, streaming low-latency middleware, and in-memory cloud platforms are in the ascendance.
Spark may ride its adoption in IoT and cloud environments to become ubiquitous for stream-computing applications of all kinds. Some industry observers question whether Spark truly supports all the key requirements for robust stream processing. One might argue that other open source stream-computing platforms, such as Apache Storm and Apache Samza, have better performance, functionality, or development features than Spark for these use-cases. But one might just as well argue that Spark’s advantages as a fog and cloud analytics platform lessen the need for it also to be the slam-dunk choice for stream computing. If Spark can support streaming analytics reasonably well for the majority of use-cases, it might become the standard there as well.
Apache Spark supports SQL, machine-learning, graph, and streaming analysis against a range of data types, and in multiple development languages.
What might prevent Spark from achieving widespread adoption in any or all of these markets is not just the presence of established platforms and tools (e.g., Hadoop) that adequately address 90% of the core use-cases. Over the next two to three years, the key obstacle to widespread Spark adoption may simply be Spark’s immaturity, the paucity of field-proven, enterprise-grade Spark platforms, and the lack of a well-developed ecosystem of Spark tools, libraries, and applications.
Considering that many enterprises have now committed to Hadoop and various NoSQL platforms as their strategic big-data platforms, they may be reluctant to commit to Spark until it has truly proved its value in a sufficient number of real-world deployments. Likewise, most organizations with stream-computing requirements have already committed themselves to a commercial solution or perhaps an alternative open source platform.
Spark is the latest shiny new big-data bauble. To make the most of its “next-big-thing” status, Spark promoters will need to generate actual user demand for the technology. Advocates should avoid pitching it at customers who’ve become jaded by the incessant drumbeat for all things big data, especially Hadoop.
Spark will naturally float to its proper level in the big data ocean. Hyping it out of proportion to its competitive differentiation would only inspire a backlash. That would be counterproductive for Spark in the long run, deterring potential users before they’ve had their first serious opportunity to kick the tires.
James Kobielus is IBM’s Big Data Evangelist. He is an industry veteran who spearheads IBM’s thought leadership activities in big data, data science, enterprise data warehousing, advanced analytics, Hadoop, business intelligence, data management, and next best action.
Seen at informationweek.com.
By David Linthicum | InfoWorld
Good news for Microsoft: “Strong sales of cloud products to businesses helped lift Microsoft’s revenue by 18 percent last quarter, though its profits declined,” reports Reuters. Microsoft CEO Satya Nadella said in a call with financial analysts that commercial cloud revenue grew almost 150 percent year over year, to an annualized run rate of $4.4 billion, driven primarily by sales of Office 365, the Azure IaaS and PaaS products, and Dynamics CRM Online.
You have to hand it to Microsoft. It has taken the “slow and steady” approach to cloud computing, and it’s worked out. Indeed, many pundits called Microsoft late to the market, but thanks to the company’s existing developer base and loyal Office and operating system technology users, it’s gained real traction in the cloud.
As a result, the cloud competition is shaping up to be a three-provider race, with Microsoft now a serious contender alongside Google and Amazon Web Services.
Today, AWS is the provider to beat in the cloud race, of course, but both Microsoft and Google are making strong efforts. Google has a sound PaaS and IaaS offering, and like Microsoft is gaining share. Although Google doesn’t have the same installed base of software as Microsoft, Google has nonetheless innovated its way to higher market share.
Indeed, a new report by Jillian Mirandi projects Google’s overall cloud business — Google Cloud Platform, Google App Engine, and Google Compute Engine — will grow 84 percent this year to hit $1.6 billion in revenue.
If you’re doing the math, that puts Google in third, Microsoft in second, and AWS in first place.
AWS has very few negatives as a public cloud provider, and it hasn’t stubbed its toes in the last few years as many people had expected. At this point I don’t think that’s likely to happen, so AWS’s pace will remain strong.
Although AWS’s lead is huge (and well deserved), the market is beginning to normalize, and Google and Microsoft are accelerating their market penetration. I don’t see this as a neck-and-neck-and-neck race anytime soon — AWS’s lead is too strong — but I do see three strong providers, which helps make the cloud market and the various technologies powering it much more interesting.
This article, “Amazon can no longer take its cloud leadership for granted,” originally appeared at InfoWorld.com. Read more of David Linthicum’s Cloud Computing blog and track the latest developments in cloud computing at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.
Seen at infoworld.com.
By Chris Kanaracus | IDG News Service
Microsoft and SAP’s long-standing partnership is being strengthened with the pending certification of SAP’s ERP (enterprise resource planning) and other software for deployment on the Azure cloud infrastructure service.
By the end of the second quarter, SAP’s Business Suite, Business All-in-One, mobile platform, Adaptive Server Enterprise database, and the developer version of the Hana in-memory computing platform will be certified for Azure, the companies said Monday.
SAP’s Cloud Appliance Library will make it possible to launch preconfigured SAP software packages to Azure within just a few minutes, they added.
Under the agreement, Microsoft will support customers if a problem crops up at the infrastructure level, while SAP would take over if the issue involves an application error, said Kevin Ichhpurani, senior vice president, head of business development and strategic ecosystem at SAP.
If the problem’s source can’t be immediately targeted, SAP and Microsoft would work together to resolve it, he added.
Also Monday, SAP and Microsoft announced the general availability of an integration between SAP BusinessObjects and Microsoft Power BI through Excel; an upcoming release of SAP’s Gateway that will tie together SAP applications and Office 365; and plans for SAP mobile applications that support Windows and Windows Phone 8.1.
More details about the partnership announcement are expected to be released at SAP’s Sapphire conference in June.
SAP’s announcement follows moves by both Oracle and Infor to certify their software for Azure.
The deal has positive implications for both Microsoft and SAP, said analyst Ray Wang, chairman and founder of Constellation Research.
“This is SAP trying to show its cloud cred with a Microsoft partnership, and Microsoft trying to show enterprise cloud cred with SAP,” he said.
But SAP’s arrangement with Microsoft only goes so far, given that it doesn’t currently include Business One, which competes with some members of Microsoft’s Dynamics ERP family, or the enterprise edition of Hana.
Monday’s announcement is only the beginning, according to Ichhpurani; more SAP software will be heading to Azure over time.
The deal makes sense for both SAP and Microsoft, according to Holger Mueller, vice president and principal analyst at Constellation Research.
“Application vendors keep hearing from their customers that they need to deploy on known, standard IaaS infrastructures, not proprietary ones,” he said. SAP recently launched an IaaS built on top of Hana.
Meanwhile, IaaS vendors such as Microsoft are looking to gain more workloads in order to get a return on their investments, Mueller added.
The Excel integration with Business Objects is another benefit for Microsoft, Mueller said. Excel users will be able to work with data held in Business Objects using the familiar Excel interface. This interoperability will further strengthen Excel’s position “as the key data analysis and exploration tool,” he said.
Chris Kanaracus covers enterprise software and general technology breaking news for The IDG News Service. Chris’ email address is Chris_Kanaracus@idg.com
Seen at infoworld.com.
By Mikael Ricknäs | IDG News Service
Amazon Web Services is making a pitch for enterprises’ high-performance databases to run on its infrastructure, launching new instances optimized for the task.
The R3 instance family has been added to Amazon RDS (Relational Database Service), which takes care of the administrative grunt work for databases such as MySQL and SQL Server.
The new instances are optimized for memory-intensive applications and have the lowest cost per GB of RAM among all of Amazon’s RDS instance types. They can be used to run the kinds of demanding database workloads often found in gaming, enterprise, social media, web, and mobile applications, Amazon said in a blog post.
The underlying hardware uses the latest Intel Xeon Ivy Bridge processors and delivers higher sustained memory bandwidth with lower network latency and jitter compared to existing instances.
The five R3 instances have between 15.25GB and 244GB of RAM and between two and 32 virtual CPUs. The highest-performing instance has a network speed of up to 10Gbps.
Right now, users can launch databases based on version 5.6 of MySQL, PostgreSQL, or SQL Server. Support for versions 5.1 and 5.5 of MySQL is in the works, as is support for Oracle’s database, Amazon said.
On-demand pricing for MySQL R3 instances starts at US$0.240 per hour in the US West region. They are available from Amazon’s datacenters in Europe, the Asia Pacific region and the U.S. The company expects to make them available from Beijing, São Paulo and the GovCloud in the near future as well.
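As a rough back-of-the-envelope check on that rate (illustrative arithmetic only; actual AWS bills vary by region, usage, and current pricing):

```python
# Approximate monthly cost of the cheapest MySQL R3 instance at the
# quoted $0.240/hour on-demand rate, assuming an always-on instance
# and a 30-day month.
hourly_rate = 0.240
hours_per_month = 24 * 30
monthly_cost = hourly_rate * hours_per_month
print(f"${monthly_cost:.2f}/month")  # → $172.80/month
```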
Send news tips and comments to mikael_ricknas@idg.com
Seen at infoworld.com.
By Paul Krill | InfoWorld
A quick option for building Web and iOS apps is on the horizon from a group of developers in Europe. Hoodie is an open source tool for building Web applications in days, via a library described as being easier to use than jQuery.
“Hoodie is a software platform that abstracts all important back-end operations, like handling payments, sending emails, security and permissions, synchronizing data, etc., into a simple API and is extensible via plug-ins,” said Jan Lehnardt, co-inventor of Hoodie and CEO of The Neighborhoodie Software, which is overseeing the technology.
The eventual target audiences are user experience and visual designers, front-end developers, and people with few development skills who can produce user experiences in HTML, CSS, and jQuery.
“Ultimately, we want to empower people to solve their personal and professional problems. Much like Excel, Access, or Lotus Notes enable billions of regular people to be productive in their businesses and at home, Hoodie wants to make sure people can use computers and the Web creatively,” Lehnardt said.
Hoodie is not akin to codeless development environments such as Mendix or Outsystems — at least not for now. “There might be a fully integrated UI builder with little or no code based on Hoodie, and it might come from us — or someone else — but this is out of scope for now and for a while. We’d love to see this, though,” Lehnardt said.
The technology’s website positions Hoodie as having an “Offline First” and “noBackend” architecture for front-end-only apps on the Web and on iOS, with one Hoodie port featuring front-end bits for iOS in Objective-C code. With the noBackend architecture, application builders build full-stack applications without thinking about the back end. Offline First, meanwhile, is “an invitation for a dialogue and a public research project to establish a language, design patterns, and technological solutions to survive in the sometimes-connected world of today,” Lehnardt said. Applications are offline by default. Interactions in Hoodie feature client software on either a browser or device with back-end operations occurring in the cloud. CouchDB is leveraged for document storage.
Lehnardt described Hoodie’s current development stage as “probably around beta.” But a Hoodie-based application already is in production with about 10,000 users. Still planned for Hoodie are comprehensive documentation, a tryout platform, and a hosting solution. The company is inviting Node.js developers to extend Hoodie with plug-ins.
This story, “Open source Hoodie is tailored for quick app dev,” was originally published at InfoWorld.com.
By Serdar Yegulalp | InfoWorld
For proof of how radically Microsoft has evolved, especially in terms of its approach to its own software stack, look no further than the newly open sourced and cross-platform incarnation of ASP.Net, named ASP.Net vNext. And for an example of how wise that strategy has been, look no further than the efforts of one programmer to get ASP.Net vNext running on OS X and Linux.
As reported by eWeek, Australian .Net developer Graeme Christie was able to port ASP.Net vNext to those platforms — with a fair amount of help from Microsoft. “Microsoft [is] fully integrating Mono [an earlier .Net framework developed for platforms other than Windows] and Linux into their build environment and test matrix,” Christie wrote in a blog post, “and [is] actively working with the community to make Mono a top class platform for hosting ASP.Net.”
Microsoft is not planning to deploy versions of vNext directly for OS X or Linux, but rather to foster the development of vNext on top of Mono for those platforms. A rough parallel would be how a major multiplatform application — for example, Mozilla’s Firefox — might be ported to a new platform: first by enthusiasts, then later with aid from its original creators to ensure the resulting product is up to snuff.
To that end, getting vNext running on OS X and Linux right now requires some heavy lifting. The process that Christie describes is a multistep procedure that requires first building Mono, then Microsoft’s K Version Manager and K Runtime Environment tools. In other words, it’s not a one-click installation or deployment process — yet.
Making ASP.Net vNext open source is by itself a major step forward for the platform, but it’s far from the only enhancement. Web Platform Team architect Scott Hanselman wrote in detail about what else vNext offers developers, including more granular deployment for apps (such as allowing each app to have its own sub-edition of the .Net Framework with its own set of packages), optimizations for low-memory and high-throughput scenarios, and cloud- and server-optimized versions of libraries.
As committed as Microsoft is to open source in its post-Ballmer incarnation, the company has also been prudent about how it goes about doing so. For one, Microsoft elected to use the Apache 2.0 license for ASP.Net vNext, the same license Google used for Android — most likely for the same reasons, since the Apache license allows a company to use open source code without also having to free up any proprietary enhancements it might make. vNext could remain at the heart of any number of commercial projects — for example, Microsoft’s for-pay development tools or its server products — without impacting their licensing.
Microsoft has been inching toward open-sourcing ASP.Net for some time, mainly via the pieces associated with it. In 2009, the ASP.Net MVC framework was open-sourced under Microsoft’s MS-PL license, although licensing the underlying code seemed like a pipe dream kept out of reach more by Microsoft’s intractability on the issue than any underlying intellectual property issues. But now all that has changed, and it’s up to developers to see how much more headway the new ASP.Net makes on platforms other than Windows.
This story, “Microsoft’s new open source ASP.Net can run on Linux, OS X,” was originally published at InfoWorld.com.
By Rick Grehan
Apache Cassandra is a free, open source NoSQL database designed to manage very large data sets (think petabytes) across large clusters of commodity servers. Among many distinguishing features, Cassandra excels at scaling writes as well as reads, and its “master-less” architecture makes creating and expanding clusters relatively straightforward. For organizations seeking a data store that can support rapid and massive growth, Cassandra should be high on the list of options to consider.
Cassandra comes from an auspicious lineage. It was influenced not only by Google’s Bigtable, from which it inherits its data architecture, but also Amazon’s Dynamo, from which it borrows its distribution mechanisms. Like Dynamo, nodes in a Cassandra cluster are completely symmetrical, all having identical responsibilities. Cassandra also employs Dynamo-style consistent hashing to partition and replicate data. (Dynamo is Amazon’s highly available key-value storage system, on which DynamoDB is based.)
Cassandra’s impressive hierarchy of caching mechanisms and carefully orchestrated disk I/O ensures speed and data safety. Its storage architecture is similar to a log-structured merge tree: Write operations are sent first to a persistent commit log (ensuring a durable write), then to a write-back cache called a memtable. When the memtable fills, it is flushed to an SSTable (sorted string table) on disk. All disk writes are appends — large sequential writes, not random writes — and therefore very efficient. Periodically, the SSTable files are merged and compacted.
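The write path described above can be sketched as a toy log-structured store in plain Python. This is a simplification, not Cassandra’s actual implementation; compaction and on-disk formats are omitted:

```python
import bisect

class ToyLSMStore:
    """Toy sketch of Cassandra's write path: every write is appended to
    a commit log (durability), buffered in an in-memory memtable, and
    flushed to an immutable, sorted SSTable when the memtable fills.
    Reads check the memtable first, then SSTables newest to oldest."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []       # append-only durable log
        self.memtable = {}         # write-back cache
        self.sstables = []         # list of sorted (key, value) lists
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # sequential append
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):  # newest first
            i = bisect.bisect_left(sstable, (key,))
            if i < len(sstable) and sstable[i][0] == key:
                return sstable[i][1]
        return None

store = ToyLSMStore(memtable_limit=2)
store.write("a", 1)
store.write("b", 2)   # fills the memtable, triggering a flush
store.write("a", 3)   # newer value shadows the flushed one
print(store.read("a"), store.read("b"))  # → 3 2
```

Note how every disk "write" here is a sorted batch append, which is why the real structure is so friendly to sequential I/O.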
A Cassandra cluster is organized as a ring, and it uses a partitioning strategy to distribute data evenly. The preferred partitioner is the RandomPartitioner, which generates a 128-bit consistent hash to determine data placement. The partitioner is assisted by another component called a “snitch,” which maps between a node’s IP address and its physical location in a rack or data center.
When Cassandra writes data, that data is written to multiple nodes so that it remains available in the event of node failure. The nodes to which a given data element is written are called “replica nodes.” Cassandra uses the snitch to ensure that the replica nodes for any particular piece of information are not in the same rack. Otherwise, if the rack were to fail, the data element and all its replica copies would be lost.
Should one or more nodes in a cluster become overutilized, Cassandra rebalances the cluster with the aid of “virtual nodes,” or vnodes. An entirely logical construct, a vnode is essentially a container for a range of database rows. Because each physical node is assigned multiple vnodes, Cassandra can rebalance simply by moving a virtual node from an overloaded cluster member to less burdened members. Using virtual nodes makes load balancing more efficient because it allows Cassandra to rebalance by moving small amounts of data from multiple sources to a destination.
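A minimal sketch of consistent hashing with vnodes and rack-aware replica placement, in plain Python. The node and rack names are hypothetical, and the 32-bit MD5-based ring is a simplification of Cassandra’s real partitioners:

```python
import hashlib
from bisect import bisect_right

def token(value):
    """Stable hash onto a 0..2**32 ring (Cassandra uses a 128-bit
    hash; 32 bits keeps the toy readable)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2 ** 32)

class ToyRing:
    """Consistent-hash ring with virtual nodes and a rack-aware
    'snitch': replicas for a key are the next distinct physical nodes
    walking clockwise, skipping racks that are already used."""

    def __init__(self, nodes, vnodes_per_node=8):
        self.racks = dict(nodes)   # {node_name: rack_name}
        self.ring = sorted(
            (token(f"{node}-vnode-{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )

    def replicas(self, key, replication_factor=3):
        start = bisect_right(self.ring, (token(key),))
        chosen, used_racks = [], set()
        for offset in range(len(self.ring)):
            _, node = self.ring[(start + offset) % len(self.ring)]
            if node not in chosen and self.racks[node] not in used_racks:
                chosen.append(node)
                used_racks.add(self.racks[node])
            if len(chosen) == replication_factor:
                break
        return chosen

ring = ToyRing({"n1": "rack-a", "n2": "rack-a",
                "n3": "rack-b", "n4": "rack-c"})
print(ring.replicas("user:42"))  # three nodes, all in distinct racks
```

Because each physical node owns many small token ranges, rebalancing after adding a node means handing over a few vnodes from several peers rather than splitting one large range.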
To maintain write throughput in the face of node failures, Cassandra uses “hinted handoffs.” A node receiving a write request will attempt to deliver the request to the replica node responsible for the data. If that fails, the recipient node (referred to as the “coordinator node”) will save the request as a “hint” — a reminder to replay the write operation when the unreachable replica node becomes available. If the coordinator node knows beforehand that the replica node is unreachable, the hint is saved immediately.
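The hinted-handoff behavior can be mimicked with a toy coordinator. This is a sketch under simplified assumptions; real Cassandra persists hints to disk and expires them after a configurable window:

```python
class ToyCoordinator:
    """Sketch of hinted handoff: writes destined for a downed replica
    are stored locally as 'hints' and replayed when it recovers."""

    def __init__(self, replicas):
        self.replicas = replicas   # {name: {"up": bool, "data": {}}}
        self.hints = []            # (replica_name, key, value)

    def write(self, replica_name, key, value):
        replica = self.replicas[replica_name]
        if replica["up"]:
            replica["data"][key] = value
        else:
            self.hints.append((replica_name, key, value))  # save a hint

    def replica_recovered(self, replica_name):
        self.replicas[replica_name]["up"] = True
        remaining = []
        for name, key, value in self.hints:
            if name == replica_name:
                self.replicas[name]["data"][key] = value   # replay
            else:
                remaining.append((name, key, value))
        self.hints = remaining

cluster = ToyCoordinator({"r1": {"up": True, "data": {}},
                          "r2": {"up": False, "data": {}}})
cluster.write("r1", "k", "v")
cluster.write("r2", "k", "v")      # r2 is down, so a hint is saved
cluster.replica_recovered("r2")    # hint is replayed on recovery
print(cluster.replicas["r2"]["data"])  # → {'k': 'v'}
```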
Hinted handoffs are one of Cassandra’s consistency repair features. Another, called “read repair,” comes into play during read request processing. Depending on the consistency level chosen (explained below), Cassandra may satisfy a read request by reading only one of the replica nodes. Even so, it will issue background reads to all the replica nodes, and verify that all have the latest version of the data. Those that don’t are sent write operations to ensure that all copies of the data are up-to-date and consistent.
A prominent benefit of an RDBMS is its adherence to ACID — atomicity, consistency, isolation, and durability — principles, which guarantee repeatable, deterministic behavior in a multiclient setting and help ensure data safety in spite of system failure. Nonrelational databases like Cassandra eschew ACID guarantees on the basis that they become performance-limiting as the database scales in both quantity of data and I/O requests.
Cassandra is described as being “eventually consistent.” When data is written to Cassandra, that data is not necessarily written simultaneously on all replica nodes. As described earlier, some cluster members might be temporarily unreachable. However, hinted handoffs ensure all nodes eventually catch up, and the system becomes consistent. Similarly, read repairs catch and correct inconsistencies when the data moves in the other direction, from Cassandra to the outside world.
This notion that different nodes in a cluster might possess inconsistent copies of a given data element might make you uneasy. The good news is that you can tune Cassandra’s consistency level. For instance, you can control the level of consistency that a write operation has achieved — how many replica nodes have written the data — before the write is acknowledged as successful to the issuing client application.
Similarly, on read operations, you can control how many replica nodes have responded before the response is returned to the client. This tunable consistency level ranges from Any, which means the request completes if any node responds, to All, which means the request only completes if all replica nodes have responded. Midway between Any and All are consistency levels such as Quorum, which allows requests to complete if a majority of replica nodes have responded. Cassandra’s tunable consistency is a powerful feature that lets you balance speed and consistency or trade one for the other. Want speed? Pick Any. Want full consistency? Pick All.
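The overlap rule behind tunable consistency is simple arithmetic: a read is guaranteed to see the latest acknowledged write whenever the write and read replica sets must intersect, i.e. W + R > RF. A sketch, using plain integers rather than the driver’s actual consistency-level enums:

```python
def is_strongly_consistent(replication_factor, write_level, read_level):
    """True when every read replica set must overlap every write
    replica set: W + R > RF."""
    return write_level + read_level > replication_factor

def quorum(replication_factor):
    """A majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1

rf = 3
w = r = quorum(rf)                  # QUORUM writes and QUORUM reads
print(quorum(rf), is_strongly_consistent(rf, w, r))   # → 2 True
print(is_strongly_consistent(rf, 1, 1))               # ONE/ONE → False
```

QUORUM on both sides always satisfies the overlap rule, which is why quorum reads plus quorum writes are the usual recipe for strong consistency without paying for All.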
Because Cassandra is distributed, a cluster’s members require a mechanism for discovering one another and communicating state information. This is where Cassandra’s Gossip protocol comes in. As you might suspect, Gossip gets its name from the human activity of passing information throughout a group via apparently random, person-to-person conversations.
Certain nodes in a cluster are designated as “seed” nodes. Each second, a timer on a Cassandra node fires, initiating communication with two or three randomly selected nodes in the cluster, one of which must be a seed node. Consequently, seed nodes will tend to have the most up-to-date view of a cluster. (When a new node is added to a cluster, it first contacts a seed node.)
Cassandra works to keep Gossip communication efficient. Each node maintains two sorts of state. HeartBeatState tracks a version number, incremented whenever the node’s information changes, and a generation number that increases each time the node restarts. ApplicationState tracks the operational state of the node (such as the current load). Nodes exchange digests of HeartBeatState information with one another. If differences are found, the nodes then exchange digests of ApplicationState info, and ultimately the ApplicationState data itself. In addition, the Gossip algorithm first seeks to resolve differences that are “farther apart” (in terms of version numbers), since those are more likely to embody the widest inconsistencies.
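One round of digest-style reconciliation can be sketched as follows. This is a drastic simplification: real Gossip is a three-phase SYN/ACK/ACK2 exchange and tracks generation separately from version:

```python
def gossip_round(node_a, node_b):
    """One toy gossip exchange: two nodes compare versioned state maps
    and each pulls the entries where the other side is newer.
    States are {peer_name: (version, payload)}."""
    for src, dst in ((node_a, node_b), (node_b, node_a)):
        for peer, (version, payload) in src.items():
            if peer not in dst or dst[peer][0] < version:
                dst[peer] = (version, payload)   # pull newer state

# Hypothetical cluster views held by two nodes before gossiping.
a = {"n1": (5, "load=0.2"), "n2": (1, "load=0.9")}
b = {"n1": (3, "load=0.7"), "n3": (2, "load=0.1")}
gossip_round(a, b)
print(a == b)   # both views converge; the newest n1 entry wins
```

After enough random pairwise rounds, every node's view converges, which is why the protocol tolerates lost messages and node churn so well.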
RDBMS users familiar with SQL should feel right at home with CQL, the Cassandra Query Language, which can be executed from the Python-based Cassandra shell utility (cqlsh) or through any of several client drivers. Client drivers are available from websites like Planet Cassandra, where you’ll find CQL-enabled drivers for Java, C#, Node.js, PHP, and others.
In the past, drivers communicated with a Cassandra cluster using a Thrift API — Thrift being a framework for creating what amounts to language-independent remote procedure calls for client and server. Cassandra’s Thrift API is now considered a legacy feature, as the CQL specification defines not only the CQL language, but an on-the-wire communication protocol as well.
CQL’s syntax resembles its relational cousin’s. It has SELECT, INSERT, UPDATE, and DELETE statements, and these are accompanied by FROM and WHERE clauses. In addition, CQL’s data types are what you would expect. You’ll find integers, floats and doubles, blobs, and more. Of course, there are differences. For one, CQL has no JOIN operation. And when you write a FROM clause, you specify column families — though, as of the latest version of CQL, the term “table” is used in place of “column family.” CQL also lets you specify the desired consistency level for any operation, but its real benefit is that it is a data management language quickly grasped by relational programmers, and is independent of a specific programming API.

Installing Cassandra is reasonably straightforward, particularly if you download the DataStax Community edition, which bundles a Web-based management application called OpsCenter. I downloaded and installed the tarball version of Cassandra on my Ubuntu Linux system (the apt-get version for some reason refused to install) and found that the real work lies in configuring a Cassandra cluster. The cassandra.yaml configuration file holds scads of tunable parameters for the node and its cluster.
For example, you can set the number of tokens that will be assigned to the node, which controls the proportion of data (relative to other nodes) that the node will be responsible for. (This is useful if your cluster is composed of heterogeneous hardware because more powerful members can be configured to handle heavier loads.) Happily, for a small trial installation, you need only configure the listening IP address for the current node and the IP addresses of the cluster’s seed nodes.
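The handful of settings just mentioned live in cassandra.yaml (the file name in recent distributions); a minimal trial-node configuration might look like the following, with all addresses and values illustrative:

```yaml
# Minimal cassandra.yaml settings for a small trial node (values illustrative)
cluster_name: 'Test Cluster'
num_tokens: 256                 # token count; raise it on more powerful hardware
listen_address: 192.168.1.21    # this node's listening IP
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "192.168.1.10,192.168.1.11"   # the cluster's seed nodes
```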
OpsCenter runs a server process on your management host that communicates with agent processes executing on the cluster’s nodes. The agents gather usage and performance information and send it to the server, which provides a browser-based user interface for viewing the aggregated results. With OpsCenter, you can browse data, examine throughput graphs, manage column families, initiate cluster rebalancing, and so on. (As an aside, I was unable to get OpsCenter working successfully on my Linux installation. The DataStax Community Edition installation on Windows worked, but only partially; it was unable to connect to the agent service.)
While documentation — primarily in the form of FAQs, wikis, and blogs — exists on the Apache Cassandra site and the Planet Cassandra site, DataStax is the most comprehensive source for Cassandra documentation and tutorials. In fact, Planet Cassandra’s Getting Started page more or less points you to the DataStax pages.
DataStax maintains documentation of both current and previous versions; as Cassandra is updated, you can troubleshoot any earlier installations you continue to run. The Web pages are well hyperlinked and provide plenty of diagrams. Along with video tutorials, you’ll also find reference guides for Java and C# drivers, as well as developer blogs on Cassandra internals.
Until recently, Cassandra provided no transactional capabilities. However, the latest release of Cassandra (version 2.0) adds “lightweight transactions” that employ an atomic “compare and set” architecture. In CQL, this is manifested as a conditional IF clause on INSERT and UPDATE commands. The data is modified if a particular condition is true. You can imagine a CQL INSERT statement that will only add a new row if the row does not exist, and the presence of the transactional IF test will guarantee that the INSERT is atomic for the database.
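That conditional insert (in CQL, `INSERT ... IF NOT EXISTS`) can be pictured with a toy compare-and-set store. This is a simplified model, not Cassandra's machinery: a plain dict stands in for the table, and a lock stands in for the coordination Cassandra performs to make the check-and-write a single atomic step.

```python
# Toy model of a Cassandra 2.0 "lightweight transaction": an insert that
# is applied only if the row is absent, with check and write done atomically.
# CQL shape: INSERT INTO users (name, email) VALUES (...) IF NOT EXISTS;

import threading

table = {}                    # primary key -> row
lock = threading.Lock()       # stand-in for Cassandra's coordination round

def insert_if_not_exists(key, row):
    """Return True if the row was applied, False if it already existed."""
    with lock:                # compare and set must be one atomic step
        if key in table:
            return False      # the conditional insert was not applied
        table[key] = row
        return True

print(insert_if_not_exists("jsmith", {"email": "jsmith@example.com"}))  # True
print(insert_if_not_exists("jsmith", {"email": "other@example.com"}))   # False
```

The second call fails because the row already exists, which is precisely the race the atomic IF test eliminates.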
Cassandra 2.0 also improves response performance with “eager retries.” If a given replica is slow to respond to a read request, Cassandra will send that request to other replicas if there’s a chance the other replicas might respond prior to the request timeout. With version 2.0, Cassandra now handles the removal of stale index entries “lazily.” In the past, stale entries were cleaned up immediately, which required a synchronization lock. The new technique avoids the throughput-constricting lock.
While Cassandra is a complicated system, its symmetrical treatment of cluster nodes makes it surprisingly easy to get up and running. The SQL-like nature of CQL is a great benefit, making it quicker and easier for developers moving from RDBMS environments to become productive.
Nevertheless, the learning curve for Cassandra is significant. It’s a good idea to set up a small to modest development cluster and do plenty of experimenting, particularly with your data schema and configuration parameters. Performance issues can become significant as the application scales up.
This story, “Cassandra lowers the barriers to big data” was originally published by InfoWorld.
Joab Jackson, IDG News Service (New York Bureau)
April 03, 2014
As it rolled out tools and features for coders at its Build developer conference Thursday, Microsoft showed that it is ready to embrace technologies and platforms not invented within its walls.
Rather than relying solely on internal tools, the Azure cloud services platform has incorporated a number of non-Microsoft technologies, including popular open source tools such as the Chef and Puppet configuration management software, the OAuth authorization standard, and the Hadoop data processing platform.
The company has also taken steps to incorporate open source into its product roadmaps, by releasing the code for its new compiler and setting up a foundation for managing open source .Net projects.
“Clearly Microsoft’s message is its support of multi-platform. It will take any part of your stack, it doesn’t have to be just Microsoft software,” said Al Hilwa, IDC research program director for software development. “This is good for Microsoft and good for the ecosystem.”
Microsoft’s Azure strategy is to “enable developers to use the best of Windows ecosystem and the best of the Linux ecosystem together … and one that enables you to build great applications and services that work on every device,” Scott Guthrie, Microsoft’s new executive vice president overseeing the cloud and enterprise group, told the audience of developers and IT professionals.
On the developer side, the company announced that it has open-sourced its next generation compiler for C# and Visual Basic, code-named Roslyn.
To date, compilers have been “black boxes,” explained C# lead architect Anders Hejlsberg.
Roslyn is unique as a compiler because it has a set of APIs (application programming interfaces) that can feed information about a project as it is being compiled to Microsoft’s Visual Studio IDE (integrated development environment) and third-party development tools.
Hejlsberg demonstrated how Visual Studio can offer helpful tips through an “interactive prompt,” using feedback from the compiler. For instance, it can flag libraries that have been called but not used in the program code.
Microsoft is hoping that other vendors will incorporate the API into their software development tools. Developers can also now add their own features into C# and have the compiler recognize them. Open-sourcing the compiler may also lead to efforts to create versions of C# for other platforms.
The company released Visual Studio 2013 Update 2 Release Candidate.
One new capability allows for two-way communication between the Visual Studio IDE and browsers.
Typically, when developers write code for a Web application in Visual Studio, they can check to see if it runs correctly by running it in a browser.
Now, using a technology known as Browser Link, developers can edit the source code directly in the browser. Browser Link will write the changes back to the source code file in Visual Studio. If a file such as a related style sheet is not open, Visual Studio can open the file and make the change as well.
Browser Link works on “any open browser,” in Microsoft’s words; the company named Google Chrome and Firefox, in addition to Internet Explorer.
In addition to open-sourcing C#, Microsoft has also started an organization, called the .Net Foundation, to manage additional open source .Net projects from Microsoft and others.
The company also announced the general availability of Visual Studio Online, a hosted version of the IDE that works within Azure and is incorporated into Microsoft Team Foundation Service to enable rapid DevOps-styled development.
On the cloud side of operations, Azure has incorporated two of the industry’s leading open source configuration management tools, Chef and Puppet. Users can deploy these technologies to quickly boot up, configure or reconfigure large numbers of virtual machines.
Microsoft has also redesigned the Azure portal, giving it a much more flexible interface. It builds on the Windows Tile design, allowing users to add their own tiles that can display live information, such as metrics of how well the user’s operations are performing. One tile even keeps a tally of the bill that the user has accumulated in the current billing cycle, which should help eliminate any surprises when the monthly payment comes due, Guthrie noted.
Guthrie touted a wide range of other Azure improvements and new features as well.
Azure now offers staging support. This feature allows a Web developer to set up a working copy of an application that is about to go live in a full production setting, for final testing. This eliminates the need to do the final test on the live production version of the application.
Also new with Azure is Traffic Manager, a service that can route application requests to the copy of a distributed application that is closest to the requester’s geographic origin, potentially lowering latency times for users.
Microsoft has taken further steps in integrating its Active Directory (AD) directory services into Azure.
Now enterprises can use their AD directories to authenticate mobile users, providing a single sign on option for employees and partners that allows them to use the same password for desktop and mobile device access to an organization’s resources.
This AD support has also been incorporated into Microsoft’s Office 365 hosted Office service.
On the data side, Azure’s SQL Server service now offers more space and a higher promised service level agreement. Users now can store up to 500GB of data, rather than 150GB. Microsoft is also guaranteeing that the service will remain in operation for at least 99.95 percent of the time.
The company has also added a backup service that allows users to revert the database back to an earlier state any time in the prior 31 days. This “roll-back” feature would be valuable to a database administrator who accidentally deletes data or makes some other mistake that could cause irreparable loss of data.
Microsoft has also updated its HDInsight Hadoop service to run the latest version of Hadoop, version 2.2, and to incorporate the Hadoop YARN (Yet Another Resource Negotiator) scheduler that can be used to process jobs based on streaming data.
Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab’s e-mail address is Joab_Jackson@idg.com
Seen on cio.com.
In this post, we will discuss how we built the personalized onboarding flows for the mobile app. It will briefly cover the end-to-end architecture of the mobile stack and how we use LinkedIn’s recommendation engine and A/B testing frameworks with the onboarding graph-based API.
LinkedIn Mobile Stack
The LinkedIn mobile stack consists of a frontend tier of node.js servers that provide a set of APIs accessible over HTTP to the mobile clients. The LinkedIn iPhone, Android, and mobile-web clients all share this same set of APIs. The node.js tier fetches and aggregates data from one or more internal LinkedIn rest.li-powered services, such as the recommendation service. The mobile stack also heavily uses Voldemort, a distributed key-value store, and LiX, LinkedIn’s internal member segmentation and A/B testing platform, to learn and iterate on the best possible user experience in the mobile clients.
LiX is used to decide if and how a feature should be displayed to a given user. With LiX, we can enable a feature for a specific subset of users based on member attributes such as language, geo, and date of enrollment into LinkedIn. It enables us to roll out a feature to the entire member base while controlling the segment we want to target. A LiX experiment is configured through a graphical, web-based user interface. For the personalized onboarding feature, LiX controls who sees the flow and what set of screens are in the flow.
Design Goals
Building a feature on mobile often requires making tradeoffs. This post will focus on the following three general mobile design goals as they apply to building this onboarding feature on mobile.
Building a Delightful Experience
Speed plays a large part in creating a delightful mobile experience. To achieve speed, the entire onboarding process needed to be short enough that users could quickly start using the app.
Achieving Relevance
The onboarding experience is successful only if our users gain value from completing it.
Creating a Flexible Design
A flexible design was essential to enable us to rapidly experiment with these three relevancy dimensions. We needed to be able to target the onboarding experience to specific sets of users, vary the order, number, and types of screens in the onboarding flow, and recommend the best items for each user for each type of content.
How We Built It
Targeting
To establish which members are eligible for the guided onboarding experience, we use LiX. Using LiX, we divide the population of members into two groups, those who should see the onboarding flow and those who should not, based on a recent LinkedIn join date or a low number of connections.
Once we determine that a member is eligible for the guided flow, we perform a lookup in Voldemort to determine the last time at which the member saw the guided flow. If enough time has passed since they last saw it, then we will redisplay the flow.
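The targeting and throttling steps above can be sketched as a single eligibility check. All names, thresholds, and windows here are hypothetical stand-ins: the segment test mimics the LiX criteria (recent join date or low connection count), and the dict lookup mimics the Voldemort last-seen read.

```python
# Sketch of the onboarding eligibility check (names and thresholds hypothetical).
from datetime import datetime, timedelta

REDISPLAY_INTERVAL = timedelta(days=30)   # assumed "enough time has passed" window

def should_show_onboarding(member, last_seen_store, now):
    """LiX-style segment test plus a Voldemort-style last-seen throttle."""
    recently_joined = (now - member["join_date"]) < timedelta(days=14)
    few_connections = member["connection_count"] < 30
    if not (recently_joined or few_connections):
        return False                      # member is not in the target segment
    last_seen = last_seen_store.get(member["id"])   # key-value lookup
    return last_seen is None or (now - last_seen) >= REDISPLAY_INTERVAL

now = datetime(2014, 4, 1)
member = {"id": "42", "join_date": datetime(2014, 3, 25), "connection_count": 3}
print(should_show_onboarding(member, {}, now))           # True: new, never shown
print(should_show_onboarding(member, {"42": now}, now))  # False: just shown
```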
Personalization
Each screen is built using data from LinkedIn’s recommendation service. Content recommendations are generated via a set of Hadoop jobs running over the hundreds of millions of profiles, connections, and activities on LinkedIn. Impression and actions taken in the onboarding experience are fed back into this system to improve recommendation relevance in subsequent recommendation flows.
The set and order of screens within the flow are customized based on the demographic information provided by the member when signing up for LinkedIn. All flows begin with an optional address book import step. If the member declines the address book import option, then a different set of screens is shown.
For example, we distinguish between students and members who are employed; each group sees a different sequence of screens.
Onboarding Flow Graph API
To ensure a fast initial load, quick inter-screen transitions, and to present the user with an indication of how many screens remain in the flow, we chose a directed graph data structure. The graph defines the screens, the transitions between screens, and the API endpoints specifying where to get recommendation data for each screen.
Every vertex in the graph represents a single screen and contains two properties, node and edges.
node contains the information needed to fetch the recommendations and the text needed to render the screen.
Decoupled Data Fetch
Our graph does not contain the recommendations for each screen. Instead, it contains a resourcePath property that specifies another API endpoint from which the client can fetch recommendations. This decouples the graph flow control structure from the data on each screen.
Clients can prefetch data for screens before the screen is displayed without sacrificing latency on the initial load of the graph data structure itself.
The recommendation data model for each screen is independent of the type of content being recommended. This allows us to introduce a new recommendation type without updating the clients. For example, we could introduce university recommendations for students, by simply adding a new node to the graph for students.
{
"influencers": {
"node": {
"resourcePath": "/li/v2/onboard/influencers",
"logo": "influencer",
"title": "Get insights from the world's top minds",
"subtitle": "Follow LinkedIn Influencers to hear what industry leaders have to say.",
"type": "influencers",
"submitToastText": "Following",
"postResourcePath": "/li/v2/onboard/influencers"
},
"edges": {
...
}
}
}
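Given vertices shaped like the JSON above, a client can walk the graph and prefetch each screen's recommendation payload from its resourcePath before the screen is shown. In this sketch, fetch() is a stand-in for the real HTTP request:

```python
# Client-side prefetch sketch: collect each screen's resourcePath and fetch
# its recommendations ahead of display (fetch() stands in for an HTTP GET).

graph = {
    "influencers": {"node": {"resourcePath": "/li/v2/onboard/influencers"},
                    "edges": {"default": {"dest": "channels"}}},
    "channels":    {"node": {"resourcePath": "/li/v2/onboard/channels"},
                    "edges": {"default": {"dest": None}}},
}

def fetch(path):
    return {"path": path, "items": []}    # placeholder for the real request

def prefetch(graph):
    """Fetch recommendation data for every screen, keyed by screen name."""
    return {name: fetch(v["node"]["resourcePath"]) for name, v in graph.items()}

cache = prefetch(graph)
print(sorted(cache))                      # ['channels', 'influencers']
```

Because the flow-control structure arrives in one small payload and the per-screen data arrives separately, the initial load stays fast while later screens appear instantly.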
edges enumerates all possible transitions from the current screen to other screens. edges is composed of a default object and an optional options array.
Changing the order of the screens in the flow means returning a different set of edges in the graph. Using LiX, we can experiment with screen orders for different segments of the population, again without updating the clients.
{
"influencers": {
"node": {
...
},
"edges": {
"default": {
"dest": "channels"
},
"options": [
{
"dest": null,
"predicate": [
"abi"
]
}
]
}
}
}
The default object has a dest property that specifies the identifier for the next screen. The dest property on an edge is always another named property on the graph. A dest property with the value of null means that the flow should terminate.
Each option in the options array has a dest property and a predicate array.
Predicate
The predicate array specifies one or more identifiers. The client evaluates each identifier to true or false.
If all of the identifiers in the predicate array are true, then the option evaluates to true and the client must navigate to the screen specified by dest. If the option is false, then the client must continue to evaluate the other options in the options array until exhausted. If no options evaluate to true, then the client navigates to the dest specified by default.
{
"options": [
{
"dest": null,
"predicate": [
"abi"
]
},
{
"dest": "channels",
"predicate": [
"foo"
]
},
{
"dest": "groups",
"predicate": [
"bar",
"baz"
]
}
]
}
For example, one predicate identifier we use is abi. If the user has chosen to import her address book, then abi will evaluate to true. By using this identifier in a graph vertex’s edges array, we can control the screen flow based on whether or not the user has performed an address book import.
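The edge-selection rule described above fits in a few lines of client code. This is a sketch of the rule as stated, with flags as a hypothetical map of predicate identifiers to client-evaluated booleans:

```python
# Edge selection: the first option whose predicate identifiers all evaluate
# to true wins; otherwise fall back to the default edge.

def next_screen(edges, flags):
    """flags maps predicate identifiers (e.g. 'abi') to booleans."""
    for option in edges.get("options", []):
        if all(flags.get(p, False) for p in option["predicate"]):
            return option["dest"]          # None means: terminate the flow
    return edges["default"]["dest"]

edges = {
    "default": {"dest": "channels"},
    "options": [{"dest": None, "predicate": ["abi"]}],
}

print(next_screen(edges, {"abi": True}))   # None: the flow terminates
print(next_screen(edges, {"abi": False}))  # 'channels'
```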
This type of predicate evaluation lets the server define the flow while the client reacts to runtime state, such as the outcome of the address book import.
Root
To enter into the graph data structure, we provide a separate root node.
{
"root": {
"default": {
"dest": "pymk",
"pathLength": 4
},
"options": [{
"dest": "m2m",
"predicate": [
"abi"
],
"pathLength": 4
}]
}
}
This represents the incoming edge to the graph. It has nearly the same structure as a generic graph edge described above with the addition of pathLength, a hint to the client about how many steps there are in each possible flow originating from this edge. This value is used to display the step indicator in the UI.
Putting it all together, we arrive at the full graph structure.
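A client-side walk of the full structure resolves the root edge first, then follows vertex edges until a destination resolves to null. This sketch uses a small hypothetical graph in the shape shown above:

```python
# Walk the onboarding graph: resolve the root edge, then follow vertices
# until an edge resolves to None (null). flags carries client-side state.

def choose(edge, flags):
    for opt in edge.get("options", []):
        if all(flags.get(p, False) for p in opt["predicate"]):
            return opt["dest"]
    return edge["default"]["dest"]

def walk(graph, flags):
    screens = []
    dest = choose(graph["root"], flags)
    while dest is not None:
        screens.append(dest)
        dest = choose(graph[dest]["edges"], flags)
    return screens

graph = {
    "root":     {"default": {"dest": "pymk"},
                 "options": [{"dest": "m2m", "predicate": ["abi"]}]},
    "pymk":     {"edges": {"default": {"dest": "channels"}}},
    "m2m":      {"edges": {"default": {"dest": "channels"}}},
    "channels": {"edges": {"default": {"dest": None}}},
}

print(walk(graph, {"abi": True}))   # ['m2m', 'channels']
print(walk(graph, {"abi": False}))  # ['pymk', 'channels']
```

Changing the flow for a segment of members then reduces to serving a different set of edges, with no client update needed.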
Conclusion
Building our APIs using a directed graph structure enables us to quickly experiment with the guided onboarding experience.
The onboarding feature has been enabled for over a month now. So far, we have observed that 76% of members complete the entire flow. Furthermore, 72% of members who complete the flow perform at least one action. On average, members who interact with a screen perform 3.7 actions on that screen.
We believe that by further tuning the ordering and types of recommendations surfaced in the onboarding experience flow, we can enable our new members to quickly get the most out of what LinkedIn has to offer.
Seen on linkedin.com.