<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Think Big</title>
	<atom:link href="http://thinkbiganalytics.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://thinkbiganalytics.com</link>
	<description>Making big data come alive.</description>
	<lastBuildDate>Wed, 19 Jun 2013 00:09:14 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Campaigning in the Age of Big Data</title>
		<link>http://thinkbiganalytics.com/campaigning-in-the-age-of-big-data/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=campaigning-in-the-age-of-big-data</link>
		<comments>http://thinkbiganalytics.com/campaigning-in-the-age-of-big-data/#comments</comments>
		<pubDate>Wed, 29 May 2013 17:34:40 +0000</pubDate>
		<dc:creator>Dan Mallinger</dc:creator>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[CaseStudy]]></category>

		<guid isPermaLink="false">http://thinkbiganalytics.com/?p=3176</guid>
		<description><![CDATA[In the past, political campaigns did little targeting of individual voters. Where it existed, it was carried out by local groups that understood the locality, such as churches. While locally successful, these efforts could not share insights upward in the campaign or easily with other local efforts in the nation. And these efforts lacked the data analysis...<a href="http://thinkbiganalytics.com/campaigning-in-the-age-of-big-data/"><div class="blog-read-more">Read More</div></a>]]></description>
				<content:encoded><![CDATA[<p>In the past, political campaigns did little targeting of individual voters. Where it existed, it was carried out by local groups that understood the locality, such as churches. While locally successful, these efforts could not share insights upward in the campaign or easily with other local efforts in the nation. And these efforts lacked the data analysis to learn where their assumptions were wrong and correct them.</p>
<p>A few years ago, as Big Data took hold of the private sector, it was a story of economic margins. Companies knew that modeling individual consumers, machines, and employees was a revenue-positive proposition, but the costs of collecting and analyzing that data were prohibitive. Big Data is changing the story: by providing tools that can handle new volumes of data cost-effectively, it enables companies to do this modeling, for example to tailor the experiences of individual customers.</p>
<p>Campaign managers are now learning a similar story: Big Data can empower campaigns to understand, target, and customize content to individual voters. The Big Data of voting includes surveys, web traffic, voter history, social media, varied consumer data, and more. Here are four ways it will change campaigning:</p>
<p><strong>1. Swing Voters: </strong>One technical definition of a &#8220;swing voter&#8221; is a voter whose ratings of two candidates on the issues are sufficiently similar. In a world where voters increasingly publish their opinions on social media, share information with campaigns through internet surveys, and visit news websites that share cookies with one another, campaigns have many sources of information about individual voters. The data reveal which issues matter most in the election as well as which voters are likely to share values with multiple candidates. Campaigns will use this information to target these voters and improve their margins.</p>
<p><strong>2. Custom Messaging:</strong> Candidates are known for spinning different messages by region. From gun laws to &#8220;cheesy grits,&#8221; candidates know that voters differ in values. Campaigns will use Big Data to identify these differences and leverage them. When you and your cousin down the street log into YouTube, you will each see a different message from the same campaign, all to ensure the most relevant information gets to each of you. Moreover, this content will be tested by showing slight variations to similar voters and evaluating their responses. Everything that could be mentioned in the cable news cycle will be rigorously tested and optimized, then customized for you.</p>
<p><strong>3. Unlikely Voters:</strong> While almost 90% of elderly, highly educated persons vote, they are atypical for the country. The overall numbers increased again in 2012 but left over 40% of voters as untapped potential, with the number not dropping much below that in the last century. Campaigners understand that voters are social and seek to act like those around them. They also understand that a latent demand of 40% is game-changing. They will use Big Data to target messages explicitly at people whose values align with the candidate but who are unlikely to vote. These messages will show an opponent attacking those values and similar others standing up to vote and defend the group. While &#8220;Rock the Vote&#8221; is an intriguing idea, not everyone wants to be a rockstar. Through tools like Facebook and YouTube, campaigners will start showing voters that people like them, and those in their social group, are voting for their candidate.</p>
<p><strong>4. Identifying Influencers:</strong> Voting is known as a &#8220;normative decision,&#8221; meaning that voters unconsciously attempt to vote as they feel their friends would. Influencing voting, whether the choice is whom to vote for or whether to vote at all, is largely a matter of influencing how voters feel their peer group would act. By leveraging social media, campaigns will target key influencers to shape these perceptions. Unlike in traditional media, these influencers will not just be those who are vocal but those seen as representative of a social clique. In this way, campaigners will leverage the politics of small groups to win national elections.</p>
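<p>As a rough illustration of the swing-voter definition in item 1, the sketch below scores how far apart a voter’s issue ratings of two candidates are. The rating scale, the issue weights, and the threshold are hypothetical choices made for this example, not values any campaign is known to use.</p>

```python
from math import sqrt


def rating_gap(ratings_a, ratings_b, weights):
    """Weighted root-mean-square gap between one voter's ratings of two candidates."""
    total_weight = sum(weights.values())
    squared = sum(
        weights[issue] * (ratings_a[issue] - ratings_b[issue]) ** 2
        for issue in weights
    )
    return sqrt(squared / total_weight)


def is_swing_voter(ratings_a, ratings_b, weights, threshold=1.0):
    """Treat the voter as a swing voter when the gap is within the (assumed) threshold."""
    return threshold >= rating_gap(ratings_a, ratings_b, weights)


# Hypothetical voter: ratings on a 1-5 scale, with the economy weighted double.
weights = {"economy": 2.0, "healthcare": 1.0}
candidate_a = {"economy": 4, "healthcare": 3}
candidate_b = {"economy": 4, "healthcare": 5}
print(is_swing_voter(candidate_a, candidate_b, weights, threshold=1.5))
```

<p>Weighting by issue lets the issues that matter most in a given election dominate the comparison; where to set the threshold is a judgment call that would need to be validated against real voter data.</p>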
<p>Data science cannot change the underlying psychology of voting. Voter behavior is still largely explained by familiarity bias and normative decision making, two concepts that can be summarized by the questions: &#8220;Which name and face seem most familiar to me?&#8221; and &#8220;Who would my friends vote for?&#8221; But data science can change the way politicians garner votes and how effective they are in campaigns.</p>
]]></content:encoded>
			<wfw:commentRss>http://thinkbiganalytics.com/campaigning-in-the-age-of-big-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>BOSTON’s STRONG with Big Data</title>
		<link>http://thinkbiganalytics.com/bostons-strong-with-big-data/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=bostons-strong-with-big-data</link>
		<comments>http://thinkbiganalytics.com/bostons-strong-with-big-data/#comments</comments>
		<pubDate>Wed, 01 May 2013 18:31:35 +0000</pubDate>
		<dc:creator>Rick Farnell</dc:creator>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Blogroll]]></category>
		<category><![CDATA[Data Science]]></category>

		<guid isPermaLink="false">http://thinkbiganalytics.com/?p=3011</guid>
		<description><![CDATA[Sadly, the 2013 Boston Marathon may mark the day that the world witnessed first hand the value of Big Data. We started Think Big Analytics in 2010, when only a handful of companies even knew what Big Data meant or what value it could create.  It may be the horrific events of the 2013 Boston...<a href="http://thinkbiganalytics.com/bostons-strong-with-big-data/"><div class="blog-read-more">Read More</div></a>]]></description>
				<content:encoded><![CDATA[<p dir="ltr">Sadly, the 2013 Boston Marathon may mark the day that the world witnessed firsthand the value of Big Data.</p>
<p dir="ltr">We started Think Big Analytics in 2010, when only a handful of companies even knew what Big Data meant or what value it could create. It may be the horrific events of the 2013 Boston Marathon, and what transpired leading to the capture of the suspects, that mark the point in time when the world (outside of Silicon Valley) fully comprehends Big Data and the value it can create.</p>
<p dir="ltr">While we were not involved with the investigation, here’s what we all know: Authorities and intelligence officials collected, stored and analyzed video footage from street cameras and countless digital videos and photographs taken near the finish line that were posted to social media sites. Investigators then released photos of the two suspects to the public. The suspects murdered a policeman, robbed a convenience store and stole a car. Utilizing GPS device tracking of the hijacked automobile, the police engaged in a car chase into Watertown, MA. Twenty-five hours after releasing the suspects’ photos, police shut down an entire metropolitan area, located the final suspect and took him into custody.</p>
<p dir="ltr">Numerous Big Data applications were likely utilized in the capture of the suspects:</p>
<ol>
<li dir="ltr">
<p dir="ltr"><strong>Streaming Video Analytics</strong> – Video footage from the finish line on Boylston Street and surrounding streets was likely used to analyze millions of people and narrow in on possible suspects.</p>
</li>
<li dir="ltr">
<p dir="ltr"><strong>Social Data Analytics</strong> – Twitter, Facebook and LinkedIn were utilized for communication surrounding the event. It is likely that many photos and videos posted to these sites were used in the days of analysis leading to Thursday’s posting of the photos of the two suspects. Social media was often the primary way in which information went out to the public, leaving TV and other media to lag behind and try to catch up with what was first posted and circulated online.</p>
</li>
<li dir="ltr">
<p dir="ltr"><strong>Mobile Phone Analytics</strong> – Phone records, calling patterns and GPS locators were likely utilized to narrow down the list of people who might have had a connection to the suspects.</p>
</li>
<li dir="ltr">
<p dir="ltr"><strong>GPS/Device Data Analytics</strong> – Authorities tracked the hijacked car, which led to the initial car chase into Watertown.</p>
</li>
<li dir="ltr">
<p dir="ltr"><strong>Financial Analytics</strong> – Once suspects were identified it is likely that all transactions that they made were analyzed.</p>
</li>
<li dir="ltr">
<p dir="ltr"><strong>Big Data Analytics</strong> – Numerous intelligence agencies and authorities collaborated to share data and information in order to compress the analysis of the situation and identify the suspects in a timely manner.</p>
</li>
</ol>
<p dir="ltr">These applications could only come alive and provide invaluable, timely analytic insights by using modern Big Data technologies for data capture, data storage, data integration, data sharing, analysis and alerts. Now when people ask me, “What is this Big Data thing?” I can tell them that Big Data analytics allowed Boston to analyze camera, video, picture, web traffic, social, mobile, financial and GPS data to capture the Boston bombing suspects. I have a feeling that, once people realize this, they’ll change their position from “what is the value of Big Data?” to “Thank God for the value you can create with Big Data!”</p>
<p dir="ltr">Our hearts and prayers go out to the individuals and families impacted by the events surrounding the Boston Marathon.  Stay strong forever! </p>
]]></content:encoded>
			<wfw:commentRss>http://thinkbiganalytics.com/bostons-strong-with-big-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Innovator&#8217;s Dilemma for Established Analytics Vendors</title>
		<link>http://thinkbiganalytics.com/innovators-dilemma-for-established-analytics-vendors/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=innovators-dilemma-for-established-analytics-vendors</link>
		<comments>http://thinkbiganalytics.com/innovators-dilemma-for-established-analytics-vendors/#comments</comments>
		<pubDate>Tue, 02 Apr 2013 00:02:22 +0000</pubDate>
		<dc:creator>Ron Bodkin</dc:creator>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Data Science]]></category>

		<guid isPermaLink="false">http://thinkbiganalytics.com/?p=2708</guid>
		<description><![CDATA[Until recently, large scale data processing, analysis and computational statistics meant working with companies like SAS, Informatica, Teradata, SPSS, and Oracle. Today, those companies face an &#8220;innovator&#8217;s dilemma.&#8221; Do they keep their prices high to maintain revenue and risk losing customers to new Big Data and Analytics vendors, or embrace the new, lower-cost computing options...<a href="http://thinkbiganalytics.com/innovators-dilemma-for-established-analytics-vendors/"><div class="blog-read-more">Read More</div></a>]]></description>
				<content:encoded><![CDATA[<p dir="ltr">Until recently, large scale data processing, analysis and computational statistics meant working with companies like SAS, Informatica, Teradata, SPSS, and Oracle. Today, those companies face an &#8220;innovator&#8217;s dilemma.&#8221; Do they keep their prices high to maintain revenue and risk losing customers to new Big Data and Analytics vendors, or embrace the new, lower-cost computing options and reduce their large legacy revenue stream? Tomorrow may be too late for them to establish themselves in this evolving market, or protect their customer base.</p>
<p dir="ltr">The Innovator&#8217;s Dilemma: When New Technologies Cause Great Firms to Fail, published by Harvard Business School professor Clayton Christensen in 1997, outlines the dilemma now faced by SAS and the others. Responses to this dilemma can be seen as Microsoft moves from their highly profitable shrink-wrapped Office desktop software to push their cloud version to compete with Google Docs and other online options. Siebel didn&#8217;t move to the cloud when Salesforce.com entered their market, and they wound up selling to Oracle in 2005 at a substantial discount.</p>
<p dir="ltr">As mentioned in an earlier post, Oracle has missed its revenue projections in three of the last eight quarters. Teradata&#8217;s stock is currently 31 percent below its 52-week high. Meanwhile, new open source technologies dramatically lower the cost of Big Data statistical analysis.</p>
<p dir="ltr">The innovator&#8217;s dilemma facing SAS and the rest of the group is being forced by the success of open source technologies like R, Hadoop, and NoSQL. R is an open source statistical programming language now used by over 2 million analysts. With open source projects like RHadoop, it can be scaled to run on clusters of computers using Hadoop, the open source standard for processing large data sets across clusters. This combination offers an order of magnitude more performance at an order of magnitude lower price.</p>
<p dir="ltr">The ability for companies to run big clusters that analyze data more affordably than ever before is already here. The issue is not technical but commercial: if SAS, SPSS, Oracle, and the others price their products to win Big Data contracts, their established customers could save a lot of money by porting to the new architecture, and the vendors would lose a lot of revenue.</p>
<p dir="ltr">As happens in technology on a regular basis, new startups like us, Revolution Analytics, and others offer better performance at lower prices. We don&#8217;t do everything SAS and Oracle and the others do, but what we do, we do well. The ability to harness massive resources at disruptively low cost is enabling a revolution whereby the enterprise can tap all its data and move from intuition to data-driven decisions.</p>
]]></content:encoded>
			<wfw:commentRss>http://thinkbiganalytics.com/innovators-dilemma-for-established-analytics-vendors/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Programming Trends to Watch: Logic and Probabilistic Programming</title>
		<link>http://thinkbiganalytics.com/programming-trends-to-watch-logic-and-probabilistic-programming/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=programming-trends-to-watch-logic-and-probabilistic-programming</link>
		<comments>http://thinkbiganalytics.com/programming-trends-to-watch-logic-and-probabilistic-programming/#comments</comments>
		<pubDate>Thu, 28 Mar 2013 08:01:11 +0000</pubDate>
		<dc:creator>Dean Wampler</dc:creator>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Data Science]]></category>

		<guid isPermaLink="false">http://thinkbiganalytics.com/?p=2697</guid>
		<description><![CDATA[I’ve made the argument (PDF) that Functional Programming (FP) is the best way to approach data problems. Why? Because our work with data is essentially Mathematics and FP is inspired by the same, so it emphasizes the abstractions that are most appropriate for data analysis. Object-oriented Programming, on the other hand, doesn’t promote the same useful abstractions, which is...<a href="http://thinkbiganalytics.com/programming-trends-to-watch-logic-and-probabilistic-programming/"><div class="blog-read-more">Read More</div></a>]]></description>
				<content:encoded><![CDATA[<p>I’ve made the <a href="http://polyglotprogramming.com/papers/MapReduceAndItsDiscontents.pdf" target="_blank">argument</a> (PDF) that <a href="http://www.haskell.org/haskellwiki/Functional_programming" target="_blank">Functional Programming</a> (FP) is the best way to approach data problems. Why? Because our work with data is essentially Mathematics and FP is inspired by the same, so it emphasizes the abstractions that are most appropriate for data analysis. Object-oriented Programming, on the other hand, doesn’t promote the same useful abstractions, which is why I consider the use of Java in <em>Big Data</em> applications to be counterproductive. (There are of course hybrid programming languages, like <a href="http://scala-lang.org/" target="_blank">Scala</a>, <a href="http://fsharp.org/" target="_blank">F#</a>, and <a href="http://ocaml.org/" target="_blank">OCaml</a>, that combine both <em>paradigms</em>. If you want a comprehensive comparison of programming paradigms, consider <a href="http://www.info.ucl.ac.be/~pvr/paradigms.html" target="_blank">this chart</a>.) In fact, SQL can be considered a functional programming language, since it is derived from <a href="http://en.wikipedia.org/wiki/Set_operations_(SQL)" target="_blank">Set Theory</a>, although it has lots of limitations as a language.</p>
<p>I believe there are two other emerging trends in programming worth watching that will impact the data world.</p>
<p><a href="http://en.wikipedia.org/wiki/Logic_programming" target="_blank">Logic Programming</a>, like FP, is actually not new at all, but it is seeing a resurgence of interest, especially in the <a href="https://github.com/swannodette/logic-tutorial" target="_blank">Clojure community</a>. Rules engines, like <a href="http://www.jboss.org/drools" target="_blank">Drools</a>, are an example category of logic programming that has been in use for a long time.</p>
<p>In logic programming, you write programs using the concepts of Logic, such as <a href="http://en.wikipedia.org/wiki/First_order_logic" target="_blank">first order logic</a>. Simply stated, you specify conditions or constraints (e.g., rules) that must be satisfied, known “facts” about the system you’re modeling, and the runtime finds the values of the system’s variables that satisfy the conditions. One way to think of it is to imagine the runtime searching the space of all possible answers for those that satisfy the conditions and facts.</p>
<p>Why is this interesting for data? Logic programming is a declarative and concise way to express problems that can be framed this way. Hence, if your problem fits the logic programming model, you can work quickly and efficiently, in just the same way that SQL queries are a very concise and expressive way to ask questions of data and to perform analytics.</p>
<p>For example, a classic use of logic programming has been <a href="http://en.wikibooks.org/wiki/Introduction_to_Philosophy/Logic/Fault_Diagnosis" target="_blank">fault diagnosis</a>; given observed events or symptoms and knowledge of the system, what are the possible underlying faults that caused the observations? This approach is applicable for diagnosing malfunctions in cars, chemical plants, medical problems, etc.</p>
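<p>To make the “search the space of all possible answers” picture concrete, here is a minimal sketch of fault diagnosis written as a brute-force search in Python. It is not a real logic programming runtime, and the knowledge base (the fault and symptom names) is hypothetical, invented only for this example.</p>

```python
from itertools import combinations

# Hypothetical knowledge base: each fault and the symptoms it causes.
CAUSES = {
    "dead_battery": {"no_crank", "dim_lights"},
    "bad_starter": {"no_crank"},
    "empty_tank": {"cranks_no_start"},
}


def diagnoses(observed, causes=CAUSES):
    """Return the minimal fault sets whose predicted symptoms equal the observations."""
    faults = sorted(causes)
    found = []
    # Try smaller fault sets first so that minimal explanations are found first.
    for size in range(len(faults) + 1):
        for candidate in combinations(faults, size):
            # Skip any candidate that merely extends an explanation already found.
            if any(set(previous).issubset(candidate) for previous in found):
                continue
            predicted = set().union(*(causes[fault] for fault in candidate))
            if predicted == set(observed):
                found.append(candidate)
    return found


print(diagnoses({"no_crank", "dim_lights"}))
```

<p>A real logic programming system would let you state the rules and facts declaratively and would search far more cleverly, but the contract is the same: you describe what must hold, and the runtime finds the assignments that satisfy it.</p>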
<p>There’s one catch, though. Most logic programming systems assume we have absolute knowledge; facts are yes/no, true/false, or some fixed value, while constraints are absolute and comprehensive. Many, if not most, real-world scenarios aren’t so clear cut. <a href="http://en.wikipedia.org/wiki/Probabilistic_model" target="_blank">Probabilistic modeling</a> has proven most fruitful for these scenarios, where knowledge and constraints are imprecise and contain gaps, and where we don’t require absolute answers either. In our example, a list of faults is great, but which one is most likely? What is the probability that an observed event was a false alarm? How do we know we’re monitoring <em>all</em> the relevant data?</p>
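<p>The “which one is most likely?” question can be sketched with Bayes’ rule. The example below assumes, for simplicity, that exactly one explanation is active at a time, and every probability in it is made up for illustration.</p>

```python
def fault_posterior(priors, likelihoods, symptom):
    """Bayes' rule over mutually exclusive explanations of one observed symptom."""
    joint = {cause: priors[cause] * likelihoods[cause][symptom] for cause in priors}
    evidence = sum(joint.values())
    return {cause: weight / evidence for cause, weight in joint.items()}


# Hypothetical numbers: prior chance of each explanation, and P(symptom | explanation).
priors = {"bad_starter": 0.02, "dead_battery": 0.08, "false_alarm": 0.90}
likelihoods = {
    "bad_starter": {"no_crank": 0.9},
    "dead_battery": {"no_crank": 0.8},
    "false_alarm": {"no_crank": 0.01},
}
posterior = fault_posterior(priors, likelihoods, "no_crank")
most_likely = max(posterior, key=posterior.get)
print(most_likely, posterior)
```

<p>Instead of a flat list of candidate faults, the result is a ranking, including an explicit probability that the observation was a false alarm.</p>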
<p>I’ll cite a few examples of great interest today. <a href="http://en.wikipedia.org/wiki/Recommender_system" target="_blank">Recommendation engines</a> are widely used in social networks and ecommerce. For example, Netflix might observe that you rent action movies more often than romantic comedies, but does that reflect a hard and fast rule for you? What about romantic comedies with car chases? You’re <em>probably</em> going to rent another action movie the next time, but there’s a nonzero chance a romantic comedy will appeal to you some day.</p>
<p>Self-navigating robots have a model of the world, e.g., a map of the terrain, and sensors used to detect where they are. There are sources of error and uncertainty. Real sensors aren’t 100% accurate. The map could have errors, and obstacles that are not represented on the map could be in the way (like people crossing the street!). So, the world is modeled probabilistically and the robot calculates its most likely location, given its measurements and how they correlate to the map.</p>
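<p>One common way to do this calculation is a histogram filter (a discrete Bayes filter). The sketch below uses a toy one-dimensional, cyclic map of colored cells; the map, the sensor accuracy, and the motion noise are all assumptions chosen for illustration.</p>

```python
def sense(belief, world, measurement, p_hit=0.6, p_miss=0.2):
    """Measurement update: weight each cell by how well it matches the sensor reading."""
    weighted = [
        b * (p_hit if cell == measurement else p_miss)
        for b, cell in zip(belief, world)
    ]
    total = sum(weighted)
    return [w / total for w in weighted]


def move(belief, step, p_exact=0.8, p_under=0.1, p_over=0.1):
    """Motion update on a cyclic map: the robot may under- or overshoot by one cell."""
    n = len(belief)
    return [
        p_exact * belief[(i - step) % n]
        + p_under * belief[(i - step + 1) % n]
        + p_over * belief[(i - step - 1) % n]
        for i in range(n)
    ]


def localize(world, measurements, motions):
    """Start from a uniform belief, then alternate sensing and moving."""
    belief = [1.0 / len(world)] * len(world)
    for measurement, step in zip(measurements, motions):
        belief = move(sense(belief, world, measurement), step)
    return belief


world = ["green", "red", "red", "green", "green"]
belief = localize(world, measurements=["red", "red"], motions=[1, 1])
most_likely_cell = max(range(len(belief)), key=belief.__getitem__)
print(most_likely_cell, [round(b, 3) for b in belief])
```

<p>Seeing red twice with a one-cell move in between suggests the robot passed over the two red cells, so after the final move the belief peaks on the cell just beyond them. Neither the sensor nor the motion is trusted completely, which is why the other cells keep some probability.</p>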
<p>Finally, how do we automate the understanding and processing of human language? If you’ve used a voice-recognition system, like <em>Siri</em> on an iPhone, you’ve used just one example of the amazing progress we’ve made. Fundamentally, we now think of human language as a probabilistic process, where previously we thought of it as the outcome of sophisticated internal models. This <a href="http://www.tor.com/blogs/2011/06/norvig-vs-chomsky-and-the-fight-for-the-future-of-ai" target="_blank">argument between Noam Chomsky and Peter Norvig</a> illustrates the sea-change in our thinking.</p>
<p>We already have powerful probabilistic modeling techniques and tools, such as <a href="http://en.wikipedia.org/wiki/Bayesian_network" target="_blank">Bayesian networks</a>, <a href="http://en.wikipedia.org/wiki/Markov_network" target="_blank">Markov networks</a>, and their variants, generically called <a href="http://en.wikipedia.org/wiki/Graphical_model" target="_blank">Probabilistic Graphical Models</a> (because they model probabilities about systems using graphs). Implementations are available in many languages. There are <a href="http://zinkov.com/posts/2012-10-04-ml-book-reviews/" target="_blank">excellent textbooks</a>, including <a href="http://aima.cs.berkeley.edu/" target="_blank">Artificial Intelligence, A Modern Approach</a>, by Russell and Norvig, that describe them. However, deep technical expertise is required to understand and use these techniques effectively.</p>
<p>We’re on the verge of moving to the next level, <em>probabilistic programming</em> languages and systems that make it easier to build probabilistic models, where the modeling concepts are promoted to first-class primitives in new languages, with underlying runtimes that do the hard work of inferring answers, similar to the way that logic programming languages work already. The ultimate goal is to enable end users with limited programming skills, like domain experts, to build effective probabilistic models, without requiring the assistance of Ph.D.-level machine learning experts, much the way that SQL is widely used today.</p>
<p>DARPA, the research arm of the U.S. Department of Defense, considers this trend important enough that they are starting an initiative to promote it, called <a href="http://www.solers.com/BAAinfo-reg/ppaml/index.htm" target="_blank">Probabilistic Programming for Advanced Machine Learning</a>, which is also described in this <a href="http://www.wired.com/dangerroom/2013/03/darpa-machine-learning-2/all/1" target="_blank">Wired article</a>.</p>
<p>This is the next logical step in the <em>democratization of data</em>, making the sophisticated analysis of large data sets accessible to a wider audience. It’s amazing how universal SQL knowledge has become. I often meet very nontechnical people who have learned enough basic SQL to get the answers they need for themselves. Achieving the same level of fluency in logic and probabilistic programming will be harder, even if good languages and tools are developed, because the core concepts are harder for people to grasp. Still, it’s an important challenge and the results will benefit us all.</p>
]]></content:encoded>
			<wfw:commentRss>http://thinkbiganalytics.com/programming-trends-to-watch-logic-and-probabilistic-programming/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Op-Ed: Cracks in the Oracle Empire</title>
		<link>http://thinkbiganalytics.com/big_data_affects_database_oracle/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=big_data_affects_database_oracle</link>
		<comments>http://thinkbiganalytics.com/big_data_affects_database_oracle/#comments</comments>
		<pubDate>Mon, 25 Mar 2013 17:29:12 +0000</pubDate>
		<dc:creator>Ron Bodkin</dc:creator>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Blogroll]]></category>

		<guid isPermaLink="false">http://thinkbiganalytics.com/?p=2661</guid>
		<description><![CDATA[Will March 2013 be the turning point where Big Data begins to noticeably eat away at the database establishment? In &#8220;Cracks in the Oracle Empire,&#8221; Wall Street Journal writer Steve Rosenbush reported Oracle Corporation executives blaming the sales force for &#8220;a disappointing quarter.&#8221; In fact, this is Oracle&#8217;s third miss in the last eight quarters,...<a href="http://thinkbiganalytics.com/big_data_affects_database_oracle/"><div class="blog-read-more">Read More</div></a>]]></description>
				<content:encoded><![CDATA[<p dir="ltr">Will March 2013 be the turning point where Big Data begins to noticeably eat away at the database establishment? In &#8220;Cracks in the Oracle Empire,&#8221; Wall Street Journal writer Steve Rosenbush reported Oracle Corporation executives blaming the sales force for &#8220;a disappointing quarter.&#8221; In fact, this is Oracle&#8217;s third miss in the last eight quarters, according to <a href="http://blogs.wsj.com/cio/2013/03/22/fast-cheaper-and-not-oracle/" target="_blank">Michael Hickins.</a></p>
<p dir="ltr">Technology always changes, and Oracle, a pioneer in enterprise relational database technology, has been on top for years. Deservedly so, based on their many successful products. But innovation and excitement happen most often in smaller, hungrier, more nimble companies, and employees are drawn by that energy.</p>
<p dir="ltr">Just like automotive engineers who want to leave Detroit to join innovators like Tesla, programmers and salespeople interested in Big Data want to join companies pushing into new areas. The core open source technologies of Big Data, like Hadoop, Cassandra, MongoDB, Storm, and R, are fundamentally changing the economics of enterprise storage and processing.</p>
<p dir="ltr">When sales expenses increase, mainly due to turnover, that is a clear sign that Oracle is under pressure from new competitors, including Big Data. Looking beyond Oracle, at Think Big we see most traditional enterprise software, hardware, and legacy systems integration firms struggling to be relevant in a new world. New companies, powered by lean, efficient innovation that embraces diverse data sets to create value in a company&#8217;s core offerings, are gaining market share.</p>
<p dir="ltr">Oracle&#8217;s top, most experienced sales reps are leaving to sell Big Data, among other technologies, because they are finding that selling legacy approaches is getting a lot harder. The transition speed is accelerating.</p>
<p dir="ltr">Because Think Big is part of the new breed of Big Data companies, I&#8217;m often asked to brief Wall Street analysts about how this wave is remaking the world order. The trends have been clear, but now we&#8217;re starting to see just how fast this transition is rippling through the markets. March 2013 may be remembered as the turning point from legacy relational databases, like Oracle, to the reduced cost and increased performance of Big Data.</p>
]]></content:encoded>
			<wfw:commentRss>http://thinkbiganalytics.com/big_data_affects_database_oracle/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How Hackings Against the NYT, Twitter, and the Federal Reserve Might Have Been Prevented with Big Data</title>
		<link>http://thinkbiganalytics.com/how-hackings-against-the-nyt-twitter-and-the-federal-reserve-might-have-been-prevented-with-big-data/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=how-hackings-against-the-nyt-twitter-and-the-federal-reserve-might-have-been-prevented-with-big-data</link>
		<comments>http://thinkbiganalytics.com/how-hackings-against-the-nyt-twitter-and-the-federal-reserve-might-have-been-prevented-with-big-data/#comments</comments>
		<pubDate>Fri, 22 Feb 2013 23:03:09 +0000</pubDate>
		<dc:creator>Rick Farnell</dc:creator>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://thinkbiganalytics.com/?p=2437</guid>
		<description><![CDATA[Earlier this year, Twitter admitted they lost personal information on 250,000 or so users to hackers. Other companies, including the New York Times and the Federal Reserve, reported hackers had been inside their systems. Companies of all sizes, and the network security community, battle constantly against advanced, persistent threats. It’s time to fight back on...<a href="http://thinkbiganalytics.com/how-hackings-against-the-nyt-twitter-and-the-federal-reserve-might-have-been-prevented-with-big-data/"><div class="blog-read-more">Read More</div></a>]]></description>
				<content:encoded><![CDATA[<p>Earlier this year, Twitter admitted they lost personal information on 250,000 or so users to hackers. Other companies, including the New York Times and the Federal Reserve, reported hackers had been inside their systems. Companies of all sizes, and the network security community, battle constantly against advanced, persistent threats. It’s time to fight back on a level playing field.  A new weapon in this fight is the ability to use Big Data techniques to build comprehensive predictive analytic applications on your data.</p>
<p>Big Data, and the analysis of massive data sets including all data going back years and years, has been nearly impossible to manage until recently. New technologies such as Hadoop and NoSQL, in the hands of a new wave of Big Data vendors, now make it possible for companies to cost-effectively store, process, and analyze 100 percent of their data. Before, they could store and analyze only a fraction of their information, due to the cost of traditional data warehouse systems.</p>
<p>A new role now exists to fight back against the hacker, and the market is recognizing it as the Data Scientist. Today’s Data Scientists can build analytic applications that detect problems and analyze small signals in the long tail of data, so that teams can troubleshoot and take preventive action before problems materialize. Now that companies can easily store and keep on hand 100 percent of their data, viruses and malware hidden inside company data, just as on personal computers, can be located. Delayed-action malware is designed to &#8220;sleep&#8221; in the data until that data is no longer current, then wake up and enable the attack. Companies that don’t manage old data have a poor chance of ever detecting its presence.</p>
<p>Built on Big Data technology, Device Data Analytic Applications read individual machine records and find patterns before they become problems. For a manufacturer, the problem may be trends pointing to device or part failures. For every company, the pattern may be hidden malware ready to take down critical systems or, worse, steal customer information or financial records.</p>
<p>Many firms are already tackling the problem of both external and internal threats.   Think Big has helped several clients use data science to understand how disparate computers are coordinating and communicating to prepare for attacks months in advance.  These “botnets” are armies of infected machines repeatedly making network requests to a central “commander,” who moves to different network locations throughout the day.  By comparing network requests across machines, using both supervised and unsupervised algorithms, data scientists bubble up only the malicious traffic, the location of the infected machines, and the potential locations of the commander.  This type of solution will change the way the security community looks at threats and has already been shown to improve the lead-time on threat detection by as much as three months.</p>
<p>Other clients of Think Big have looked to their own intranets for security issues.  Like most companies, they have known for years that by the time viruses or malware are detectable via software, the infections have often already taken place.  Infected computers may perform malicious behaviors such as attacking non-infected corporate machines.  By utilizing intranet network patterns, data science, and Hadoop, these clients build network-path-based propensity models over an acyclic graph centered on the infection point.  Across the graph, the propensity for a machine to be infected is measured as a function of machine characteristics as well as its place in the network.  With this data science power, companies are able to alert security teams to audit certain machines rather than waiting for antivirus software to identify infections.</p>
<p>What is it worth to organizations to have the flexibility to run 50x more unique analytics on their data? To run analysis previously not possible?  To do so at a fraction of the cost?  With the right Big Data tools and analytics applications applied to your entire data pool, you may not have to admit hackers have done a better job than you of exploiting the value of your own data.</p>
]]></content:encoded>
			<wfw:commentRss>http://thinkbiganalytics.com/how-hackings-against-the-nyt-twitter-and-the-federal-reserve-might-have-been-prevented-with-big-data/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>The Race is On with Open Source</title>
		<link>http://thinkbiganalytics.com/the-race-is-on-with-open-source/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=the-race-is-on-with-open-source</link>
		<comments>http://thinkbiganalytics.com/the-race-is-on-with-open-source/#comments</comments>
		<pubDate>Thu, 21 Feb 2013 19:47:42 +0000</pubDate>
		<dc:creator>Ron Bodkin</dc:creator>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Hive]]></category>

		<guid isPermaLink="false">http://thinkbiganalytics.com/?p=2340</guid>
		<description><![CDATA[At Think Big Analytics we&#8217;re focused on helping customers create measurable value from Big Data. Fast access to that data is becoming mandatory as enterprises move beyond the first phase of adoption of Big Data. Over the last few months a number of projects and products have been announced to address this need, both within Hadoop and...<a href="http://thinkbiganalytics.com/the-race-is-on-with-open-source/"><div class="blog-read-more">Read More</div></a>]]></description>
				<content:encoded><![CDATA[<p>At Think Big Analytics we&#8217;re focused on helping customers create measurable value from Big Data. Fast access to that data is becoming mandatory as enterprises move beyond the first phase of adoption of Big Data. Over the last few months a number of projects and products have been announced to address this need, both within Hadoop and in related architectures: <a href="http://thinkbiganalytics.com/leading_big_data_technologies/real-time-query/" target="_blank">Impala</a>, <a href="http://hadapt.com/">Hadapt</a>, <a href="http://thinkbiganalytics.com/leading_big_data_technologies/real-time-query/" target="_blank">Shark,</a> <a href="http://www.splicemachine.com/">Splice Machine</a>, <a href="http://incubator.apache.org/drill/">Drill</a>, <a href="http://aws.amazon.com/redshift/">Redshift</a>, <a href="http://www.platfora.com/">Platfora</a>, <a href="http://www.pentahobigdata.com/ecosystem/capabilities/instaview">Instaview</a>, and now <a href="http://hortonworks.com/blog/100x-faster-hive/">Stinger</a> to improve performance in core <a href="http://thinkbiganalytics.com/leading_big_data_technologies/hadoop/" target="_blank">Hive.</a></p>
<p>The race is on. With these technologies all driving faster queries for Hadoop clusters, the winners are our customers. I view this as an example of healthy competition in the open source ecosystem, where the community is investing in a number of different technologies. It&#8217;s also worth noting that while there&#8217;s competition in the approaches, the open source contributions to projects like Hive, MapReduce, and YARN are likely to benefit all the technologies based on the open source Hadoop ecosystem, like Hive and Impala. I believe that open source efforts will gain traction and participation and ultimately become the dominant approach.</p>
<p>Getting Hive to run queries in under 30 seconds will be welcome progress. We&#8217;ve heard that future enhancements to Hive that leverage YARN could further reduce minimum query times to 10 seconds. Even given such 10-30 second queries, we believe Impala will continue to lead the way in faster performance, allowing for subsecond queries. The net effect is that there will be dramatic improvements in query technologies for Hadoop and Big Data. I think core Hive will get a lot better thanks to these initiatives. Stinger will spur investment in new approaches and lead tools like Impala to become more full featured more quickly, with scalable processing times and User Defined Functions. Customers will have more choices and better options. I predict Hive will remain a workhorse for production jobs and large scale processing but that Impala will become the go-to tool for fast analytics. I think it will allow most of the important analytic queries to run 10x faster. But if you ever need more horsepower, Hive will be right there, ready to run on the same cluster.</p>
<div>
<p>Why is this becoming so important? In the first phase of adoption, Scalability and Cost Containment, archival and ETL are dominant use cases and just being able to run queries at all is a killer app. But as organizations mature and tackle the more advanced stages of agile analytics and business innovation, it&#8217;s critical to have quick access to data of all sizes. Data analysts want to be able to run queries in a second when exploring information. Data scientists want to do that and to investigate complex data sets, finding outliers in a few seconds.</p>
<div>
<dl id="attachment_2342">
<dt><img alt="Big Data Adoption Stages" src="http://thinkbiganalytics.com/wp-content/uploads/2013/02/Screen-Shot-2013-02-21-at-11.31.05-AM.png" width="491" height="304" /></dt>
<dd>Big Data Adoption Stages</dd>
</dl>
</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://thinkbiganalytics.com/the-race-is-on-with-open-source/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Much Hadoop about Nothing? We Don’t Think So.</title>
		<link>http://thinkbiganalytics.com/much-hadoop-about-nothing-we-dont-think-so/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=much-hadoop-about-nothing-we-dont-think-so</link>
		<comments>http://thinkbiganalytics.com/much-hadoop-about-nothing-we-dont-think-so/#comments</comments>
		<pubDate>Tue, 05 Feb 2013 01:30:17 +0000</pubDate>
		<dc:creator>Ron Bodkin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://50.97.96.110/~thinkbig/?p=1777</guid>
		<description><![CDATA[In the mid 2000’s when I was Chief Architect and later, VP Engineering, at Quantcast, we were using one of the first Hadoop clusters as a low-cost way to process the vast amount of data needed to directly measure many of the largest websites in the world. We even helped a skeptical Facebook see the...<a href="http://thinkbiganalytics.com/much-hadoop-about-nothing-we-dont-think-so/"><div class="blog-read-more">Read More</div></a>]]></description>
				<content:encoded><![CDATA[<p>In the mid-2000s, when I was Chief Architect and later VP Engineering at Quantcast, we were using one of the first Hadoop clusters as a low-cost way to process the vast amount of data needed to directly measure many of the largest websites in the world. We even helped a skeptical Facebook see the value of this new, cool technology. Back then, we were processing petabytes of data each day to apply predictive models to offer our customers new features that we couldn’t offer before, all thanks to Hadoop. For example, we developed “lookalike” programs, a new advertising approach to find millions of new people who are likely to respond to any one of thousands of ad campaigns. We also built realtime systems powered by NoSQL databases to compute tens of millions of these predictive models each second, all within milliseconds.</p>
<p>When I saw the power of what Hadoop, NoSQL, and predictive analytics could do, I knew enterprises would benefit greatly if they could find a way to intelligently process the vast amounts of data they were collecting, or their “Big Data”. I also knew it would be a very long time before the skills and products would mature to the point where Big Data could be considered “plug and play.” Images of my experience building C-bridge came into view, as I recognized that the “skills and technology gap” that had made C-bridge successful back then was about to repeat itself.</p>
<p>So, on Easter Sunday 2010, Katie, Rick, and I decided to start Think Big Analytics. Our goal was to recruit the brightest data scientists and data engineers to fill the knowledge gap that is so critical to the success of any Big Data project. From the very beginning, we knew we could differentiate ourselves from traditional system integrators by developing a unique methodology, combining it with the best talent, and working side-by-side with clients to implement Big Data projects that would produce amazing results. Although we are often unable to speak about them publicly, we have helped our Fortune 50 client list exceed their business goals by using Big Data to:</p>
<ul>
<li dir="ltr">Launch and tailor new product offerings to consumers;</li>
<li dir="ltr">Build predictive patterns to prevent device failures before they occur and to support optimization;</li>
<li dir="ltr">Identify ways to increase operational efficiencies.</li>
</ul>
<p>Our goal is to provide services that allow clients to harness the power of Big Data to create radical new value by automating decisions, tailoring interactions to consumers, and enabling intelligent networks of devices.  Along the way, we intend to turn the traditional consulting model on its head, evangelizing fast, nimble projects that emphasize learning instead of the long, drawn-out ones that people envision when they hear the word “consultant.” I agree with Geoffrey Moore that <a href="http://www.linkedin.com/today/post/article/20130122182426-110300724-the-tide-has-turned" target="_blank">The Tide Has Turned</a>: Big Data is going to usher in a new wave of innovation and economic growth across a range of industries, and companies will need purpose-built consultancies to enable that transformation. That’s why we created Think Big, and we operate focusing on one thing only: Help Clients Achieve the Promise and Value of Big Data.</p>
<p>To do so requires a perfect blend of technology, skills and planning, the three of which encompass Think Big’s three major service offerings: Imagine, Illuminate and Implement. Big Data is complex and no one should enter into it blindly so we offer our “Imagine” services to ensure roadmap and use cases are properly defined and prioritized. We offer “Illuminate” services, which include expert training and side-by-side mentoring so our clients are prepared internally to fully realize the value their data can bring. Finally, we offer “Implement” services, which really bring our customers’ Big Data to life. The combination of our data scientists and engineers and our proven “test and learn” methodology enables our clients to innovate by building large-scale Big Data analytics, data integration and real-time systems for device data, advertising, network data, consumer recommendations, retail and financial market data, and more.</p>
<p>Recently, there has been backlash about <a href="http://blogs.wsj.com/cio/2013/01/25/much-hadoop-about-nothing" target="_blank">Hadoop</a> because companies are struggling to find the ROI behind their Big Data technology. The question that people should be asking isn’t “Is this ‘Much Hadoop about Nothing’?” but rather “Do you have the right partner?” Think Big is here to help.</p>
]]></content:encoded>
			<wfw:commentRss>http://thinkbiganalytics.com/much-hadoop-about-nothing-we-dont-think-so/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Predicting the future of Big Data: With Big Data comes big value and the need for Data Scientists.</title>
		<link>http://thinkbiganalytics.com/with-big-data-come-big-insights-big-value-and-the-need-for-data-scientists/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=with-big-data-come-big-insights-big-value-and-the-need-for-data-scientists</link>
		<comments>http://thinkbiganalytics.com/with-big-data-come-big-insights-big-value-and-the-need-for-data-scientists/#comments</comments>
		<pubDate>Tue, 05 Feb 2013 01:02:20 +0000</pubDate>
		<dc:creator>Rick Farnell</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://50.97.96.110/~thinkbig/?p=1749</guid>
		<description><![CDATA[Is all the Big Data hype justified? In a word, yes. How did we get here? Our database systems of yesteryear were not designed to store, process and analyze the massive amount of data being generated by today’s devices and applications. The pace at which we are adding data and connected devices like mobile phones,...<a href="http://thinkbiganalytics.com/with-big-data-come-big-insights-big-value-and-the-need-for-data-scientists/"><div class="blog-read-more">Read More</div></a>]]></description>
				<content:encoded><![CDATA[<p>Is all the Big Data hype justified? In a word, yes.</p>
<p>How did we get here? Our database systems of yesteryear were not designed to store, process and analyze the massive amount of data being generated by today’s devices and applications. We are adding data and connected devices, such as mobile phones, tablets, smart TVs, connected vehicles, and any device or machine that connects home via the internet, at an unprecedented pace.  Companies like Google set the tone with their commitment to storing anything that’s digitized and building data products and services based on the relationships and search patterns of where users meet the data.  Enter the Big Data Era.  We are witnessing the realization of Sun’s tagline “The Network is the Computer” except that it’s more accurate to say “The Network is the Data”.  Business leaders are investing in new ways to capture more information than they ever thought of before, but with one major strategy funded in parallel: building a team of data scientists that can live in the data, along with unprecedented processing power, to build analytics that create measurable value for their organizations.</p>
<p>But what’s different?  We’ve had big data for years now.  What’s different is the ability to use new forms of data, new data storage techniques, new data processing techniques and new data analytics techniques to conduct hundreds of business experiments in the same time that our legacy processes and systems would allow one project in the past. The pace at which the data is expanding and changing and the opportunity to create new data partnerships cannot be harnessed with brittle systems not built or designed for this new world.  Business change in our old world is costly, time consuming and extremely difficult.  What if, as a business leader, you could create an organization that resulted in a 400% increase in the number of new projects that get to production?  What is this pace of innovation worth to you, to your career, to your teams, to your company, to your customers, to your shareholders? We are working with companies to execute fast-paced test-and-learn approaches with their data science teams that allow them to continually come up with new ideas and roll them into action to see what outcomes they drive.  If the outcomes are negative, they modify the approach; if the outcomes are positive, they try to understand why, and what events led up to the positive outcome, and then try to predict and repeat that pattern more often.</p>
<p>We started Think Big in 2010, at a time when most people, even in Silicon Valley, didn&#8217;t quite know what Big Data or Big Data analytics were.  Our team predicted the demand that was building and the market momentum of open source software innovation, low-cost cloud services, and the connected-device explosion, and we chose to build a services company 100% focused on building big data analytics applications side by side with our clients.  Our clients want to use more data to improve their business outcomes.  After working with some of the most innovative and respected companies in their industries, we see three primary Big Data application areas emerge:</p>
<ol>
<li dir="ltr">Device Data Analytics</li>
<li dir="ltr">Customer Recommendations</li>
<li dir="ltr">Analytics Laboratory</li>
</ol>
<p>&nbsp;</p>
<p>More to come on each of these Big Data application areas in my future blogs.  </p>
<p>One Big Data pattern that we can predict with 100% accuracy is that, like all technology advances of our past, from the printing press to computers, the internet, smart phones, the cloud, and tablets, Big Data, which now brings them all together, will create demand for entrepreneurial-minded people to innovate and create new things: new products, new services, new companies, and new jobs.  We’re hiring!</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://thinkbiganalytics.com/with-big-data-come-big-insights-big-value-and-the-need-for-data-scientists/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Announcing Stampede: Flyweight Workflow Tool for *nix</title>
		<link>http://thinkbiganalytics.com/announcing-stampede-flyweight-workflow-tool-for-nix/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=announcing-stampede-flyweight-workflow-tool-for-nix</link>
		<comments>http://thinkbiganalytics.com/announcing-stampede-flyweight-workflow-tool-for-nix/#comments</comments>
		<pubDate>Tue, 08 Jan 2013 03:17:04 +0000</pubDate>
		<dc:creator>Dean Wampler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://50.97.96.110/~thinkbig/?p=1852</guid>
		<description><![CDATA[When you’re building nontrivial workflows, you need a tool that lets you express the dependencies between tasks, schedule their execution, detect failures and attempt retries, etc. You also want that tool to be concise, easy to use, yet powerful. Welcome to Stampede, the workflow tool that works as Cthulhu intended for *nix systems, using make for dependency management and task...<a href="http://thinkbiganalytics.com/announcing-stampede-flyweight-workflow-tool-for-nix/"><div class="blog-read-more">Read More</div></a>]]></description>
				<content:encoded><![CDATA[<p>When you’re building nontrivial workflows, you need a tool that lets you express the dependencies between tasks, schedule their execution, detect failures and attempt retries, etc. You also want that tool to be concise, easy to use, yet powerful.</p>
<p>Welcome to <em>Stampede</em>, the workflow tool that works as <a href="http://en.wikipedia.org/wiki/Cthulhu">Cthulhu</a> intended for *nix systems, using <code>make</code> for dependency management and task sequencing, <code>bash</code> for scripting, and <code>cron</code> for scheduling.</p>
<p><em>Stampede</em> originated as an alternative workflow tool for <a href="http://hadoop.apache.org/">Hadoop</a>, but it is not limited to Hadoop scenarios.</p>
<h2 id="embracingtheunixphilosophy">Embracing the Unix Philosophy</h2>
<p><em>Stampede</em> was born out of frustration with heavyweight “enterprisey” tools that are hard and frustrating to use. We have a ~40-year tradition, <em>the Unix Philosophy</em>, of flyweight, flexible tools that compose together to build sophisticated applications.</p>
<p>How can you specify dependencies between tasks? <code>Make</code> does this concisely and flexibly. How do you script the tasks themselves? One of the powerful Unix shells, such as <code>bash</code>, is platform portable and supports the concise expression of complex tasks. How do you schedule when a workflow should start? <code>Cron</code> and its sibling <code>at</code> make this easy.</p>
<p><em>Stampede</em> won’t appeal to you unless you know <code>make</code> and <code>bash</code>. It doesn’t provide a GUI (at least not yet). It’s a tool for <a href="http://polyglotprogramming.com/">polyglot programmers</a>, developers who use a diverse set of languages and tools, adopting the most appropriate tool for a given job. If the word <a href="http://devops.com/">DevOps</a> means anything to you, then <em>Stampede</em> is the tool for you.</p>
<h2 id="howdoesitwork">How Does It Work?</h2>
<p>In fact, <em>Stampede</em> is less than meets the eye. Really. Most of its power comes from <code>make</code>, <code>bash</code>, and other *nix command-line tools, like <code>date</code>, <code>mkdir</code>, and their friends. However, those tools by themselves aren’t quite enough for convenient development of workflows, which we call <em>stampedes</em>.</p>
<p>So, <em>Stampede</em> adds lots of helper tools, mostly <code>bash</code> scripts, to make it easier to do common IT tasks, like specifying yesterday’s date for an ETL process, watching for a file to appear in a drop zone from an FTP process and then starting to process it, retrying a failed workflow every hour until it succeeds, etc. <em>Stampede</em> also includes a driver script, called <code>stampede</code>, that does various environment setup steps before calling <code>make</code>. Your actual workflows (<em>stampedes</em>) are defined in <code>Makefiles</code>.</p>
<p>In principle, <em>Stampede</em> can support any *nix environment, but currently we only support Linux and Mac OS X. So, we require <code>bash</code> for scripts and <a href="http://www.gnu.org/software/make/">GNU Make</a>, since these are the standard tools distributed with Linux, Mac OS X, and also Cygwin. Cygwin support should be possible and we welcome patches if anyone wants to take it on. Any Unix system with <code>bash</code> and GNU <code>make</code> installed should also be able to run <em>Stampede</em> out of the box. Patches are welcome if you encounter problems.</p>
<p>Here is an example <code>Makefile</code> for a fictitious Hadoop workflow, taken from the distribution’s Hadoop example. We’ll use the environment variable <code>$STAMPEDE_HOME</code> to reference where you installed <em>Stampede</em>. (Its value is set by the <code>stampede</code> driver script when you run a workflow.) The <code>Makefile</code> comments describe what’s going on:</p>
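<p>To give a feel for what such helpers involve, here is a minimal, hypothetical <code>bash</code> sketch. It is not <em>Stampede</em>&#8217;s actual implementation: the function names <code>yesterday_ymd</code> and <code>wait_for_file</code> are made up for this illustration, the timeout and interval are plain seconds, and the date helper assumes GNU <code>date</code> (the <code>-d</code> option is not available in the BSD <code>date</code> shipped with Mac OS X).</p>
<pre><code>#!/usr/bin/env bash
# Hypothetical helpers for illustration only; not part of Stampede.

# Print yesterday's date as YYYY-MM-DD, or with another separator
# passed as the first argument. Assumes GNU date (the -d option).
yesterday_ymd() {
  local sep="${1:--}"
  date -d yesterday "+%Y${sep}%m${sep}%d"
}

# Poll until a file exists.
#   $1: file to wait for
#   $2: maximum time to wait, in seconds
#   $3: interval between checks, in seconds
# Returns 0 as soon as the file exists, 1 if the timeout is reached.
wait_for_file() {
  local file="$1" timeout="$2" interval="$3"
  local waited=0
  until [ -f "$file" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 0
}
</code></pre>
<p>With these definitions, <code>wait_for_file /var/ftp/drop-zone/orders.gzip 14400 600</code> would wait up to four hours, checking every ten minutes, for a (made-up) file to arrive. Returning a nonzero status on timeout matters because <code>make</code> stops a workflow when a command fails.</p>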
<pre><code># Example Makefile for a Stampede project for a Hadoop workflow.
# For more details, see $STAMPEDE_HOME/examples/hadoop/README.md.

# Call the "ymd" and "yesterday-ymd" tools (bash scripts that 
# are part of Stampede) to get the YYYY-MM-DD for today and 
# yesterday, respectively, e.g., 2013-01-01 and 2012-12-31:
YMD           = $(shell ymd '-')
YESTERDAY_YMD = $(shell yesterday-ymd '-')

# Local (as opposed to HDFS) file system location where FTP'ed incoming
# files are dropped. 
DROP_ZONE = /var/ftp/drop-zone

# Locations in HDFS for the ingested files for yesterday.
HDFS_FTP_YYMD_DIR = /ftp/${YESTERDAY_YMD}
HDFS_ORDERS       = /orders/${YESTERDAY_YMD}

# Data from our "partners", BargainMonster.com and ElectronicsHut.com
BM_FILE      = bargain-monster-orders-${YESTERDAY_YMD}.gzip
EH_FILE      = electronics-hut-orders-${YESTERDAY_YMD}.gzip
BM_FTP_FILE  = ${DROP_ZONE}/${BM_FILE}
EH_FTP_FILE  = ${DROP_ZONE}/${EH_FILE}

# Data used by our recommendation engine that analyzes click streams and orders.
RECOMMENDER_DATA_DIR = /recommendation-engine/clicks-orders

# The location for Hive's internal/managed tables, given by the property:
#   hive.metastore.warehouse.dir
HIVE_WAREHOUSE_DIR = $(shell hive-prop --print-value hive.metastore.warehouse.dir)

# URL for the NameNode.
HADOOP_NAMENODE = $(shell mapreduce-prop --print-value )

HADOOP = hadoop
PIG    = pig
HIVE   = hive
SQOOP  = sqoop

all: etl analysis export
  @echo Hadoop stampede finished!

etl: ingest cleanse

ingest: from-production-db from-ftp-drop-zone

# Use Sqoop to ingest yesterday's click stream data from the production database.
from-production-db:
  @echo "Ingesting clickstream data for yesterday: ${YESTERDAY_YMD} (today: ${YMD})"
  ${SQOOP} import \
    --connect jdbc:mysql://db-server:3306/clickstream-prod \
    --username some_user -P \
    --table adclicks \
    --where "ymd = '${YESTERDAY_YMD}'" \
    --num-mappers 5 \
    --hive-import

from-ftp-drop-zone: ${BM_FTP_FILE} ${EH_FTP_FILE}

# Wait up to 4 hours, checking every 10 minutes, for yesterday's data from 
# BargainMonster.com and ElectronicsHut.com of orders that originated
# as ad clicks. Once each arrives, put it in HDFS.
${BM_FTP_FILE} ${EH_FTP_FILE}: ${HDFS_FTP_YYMD_DIR}
  @try-for 4h 10m 'test -f $@'
  ${HADOOP} fs -put $@ ${HDFS_FTP_YYMD_DIR} 

${HDFS_FTP_YYMD_DIR}:
  ${HADOOP} fs -mkdir ${HDFS_FTP_YYMD_DIR}

# Use Pig for data cleansing. Pass in parameters that tell the "cleanse-orders.pig"
# script the location of the input and where to write the output (both in HDFS).
cleanse:
  ${PIG} \
    -param INPUT_DIR=${HDFS_FTP_YYMD_DIR} \
    -param OUTPUT_DIR=${HDFS_ORDERS} \
    -f cleanse-orders.pig 

analysis: reports-analysis recommendations-analysis

# Treat the output directory of the Pig script, "${HDFS_ORDERS}" as the
# location of a partition for a Hive external "orders" table. The Hive script
# "clicks-orders-report.hql" will use ALTER TABLE to add this partition, so
# we pass in the location as an $ORDERS_DIR defined variable. The other 
# variable we'll define is "YMD" which will be used for processing; we set it 
# to yesterday's date. The script will also use the internal "adclicks" table 
# created by the previous Sqoop task in the workflow.
reports-analysis:
  ${HIVE} \
    --define ORDERS_DIR=${HDFS_ORDERS} \
    --define YMD=${YESTERDAY_YMD} \
    -f clicks-orders-report.hql 

# A custom Hadoop job that updates the data for a recommendation engine. 
# We assume the Hive clicks data is in the Hive "warehouse" location, inside
# a "finance" database (in a subdirectory named "finance.db"), and an
# "adclicks" subdirectory for the table data.
recommendations-analysis:
  ${HADOOP} \
    jar /usr/local/mycompany/clicks-orders-recommendations.jar \
    --clicks=${HIVE_WAREHOUSE_DIR}/finance.db/adclicks \
    --orders=${HDFS_ORDERS} \
    --ymd=${YESTERDAY_YMD} \
    --output=${RECOMMENDER_DATA_DIR}

# Using Sqoop, export the results of both analysis steps back to tables in
# another database.
export: reports-analysis-export recommendations-analysis-export

reports-analysis-export:
  ${SQOOP} export \
    --connect jdbc:mysql://db-server:3306/orders-warehouse \
    --username uname -P \
    --table clicks_orders \
    --num-mappers 5 \
    --export-dir ${HIVE_WAREHOUSE_DIR}/finance.db/clicks_orders_analysis

recommendations-analysis-export:
  ${SQOOP} export \
    --connect jdbc:mysql://db-server:3306/recommendations-prod \
    --username uname -P \
    --table clicks_orders_recommendations \
    --num-mappers 5 \
    --export-dir ${RECOMMENDER_DATA_DIR}
</code></pre>
<h2 id="hadoopsupport">Hadoop Support</h2>
<p><em>Stampede</em> originated as a tool for Hadoop-related projects, although it’s not limited to those scenarios.</p>
<p>As you can see from the previous example, because Hadoop tools have command-line interfaces, we simply call them in the <code>Makefile</code>.</p>
<p>The additional Hadoop support consists of <code>bash</code> scripts and compiled Java code in the <code>$STAMPEDE_HOME/bin/hadoop</code> directory.</p>
<p>Currently, there are three additional tools provided by <em>Stampede</em> for determining configuration property settings for <em>MapReduce</em>, <em>Hive</em>, and <em>Pig</em>, by actually running those tools, as opposed to reading static configuration files. More Hadoop-specific tools are planned, e.g., basic integration with the <em>JobTracker</em>, <em>NameNode</em>, and <em>HCatalog</em>.</p>
<h2 id="wheretogofromhere">Where to Go from Here</h2>
<p>Clone the <a href="https://github.com/ThinkBigAnalytics/stampede">Stampede GitHub repo</a> (release downloads are TBD) and follow the instructions in the <a href="https://github.com/ThinkBigAnalytics/stampede">README</a> for installing <em>Stampede</em> and using it. You’ll also find the Hadoop example we discussed above in <code>$STAMPEDE_HOME/examples/hadoop</code>. See also the Stampede <a href="https://github.com/ThinkBigAnalytics/stampede/wiki">Wiki</a>.</p>
<p>We hope you find <em>Stampede</em> useful. Consider joining our Google Group, <a href="https://groups.google.com/forum/#!forum/stampede-users">stampede-users</a>, and following us on Twitter <a href="https://twitter.com/StampedeWkFlow">@StampedeWkFlow</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://thinkbiganalytics.com/announcing-stampede-flyweight-workflow-tool-for-nix/feed/</wfw:commentRss>
		<slash:comments>38</slash:comments>
		</item>
	</channel>
</rss>
