Friday, April 15, 2016

Hadoop Technologies - explained - by @alfredosahagun, Asian Tech Colab

What's Hadoop? Named after the little yellow toy elephant that belonged to its inventor's son, Hadoop is a distributed computing and processing framework from Apache, designed to store and manage big data. With Hadoop it is now faster and more reliable for business users to profile, transform, and cleanse big data (whether it resides on Hadoop or anywhere else) through an intuitive user interface. Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.

In simpler terms, Hadoop is software built to handle huge arrays of data. Remember that in its first two decades of growth the World Wide Web (www) went exponentially from hundreds of pages to many millions. In the beginning, search results were compiled by human editors; soon automated search engines were developed, and these improved significantly from early engines such as Yahoo and AltaVista, which had to cope with ever more massive files and information. Better algorithms followed, the most renowned being Google's. The key advance there was the faster and more accurate return of results, since Google's algorithm was able to store and process data distributed across many machines and rank web search results by relevance.

To understand the beginnings of Hadoop we must look at what was happening around that same time in the open-source and Linux world. A new open-source web search engine project called Nutch, much like Google in spirit, was started in 2002 by Doug Cutting and Mike Cafarella. A year after Google published its white paper describing the MapReduce framework, Cutting and Cafarella, inspired by it, applied those concepts to an open-source framework supporting distribution for the Nutch search engine project. They set out to distribute the data and the computations of the search engine across many different computers, so that multiple tasks could run simultaneously and speed up the whole process, and so that big data could be arranged and distributed across information nodes. This was the beginning of Hadoop.

Four years after Nutch started, in 2006, Doug Cutting joined Yahoo Inc., bringing with him his newborn Nutch engine and his ideas about automated distributed data storage and processing. Nutch was then divided into two main projects: the web-crawler function kept the name Nutch, while the distributed computing and processing function became Hadoop (after the name of Doug Cutting's son's toy elephant!). Hadoop kept the parallel-processing idea and added concepts such as data locality, scheduling work on the nodes where the data already resides. In 2008 Yahoo officially launched Hadoop in production, and the project moved under the maintenance and management of the Apache Software Foundation (ASF), a global community of contributing programmers. A year later, in 2009, Doug Cutting left Yahoo. In 2011 Yahoo spun off its commercial Hadoop distribution business as Hortonworks, which competes with distributions from companies such as MapR Technologies, and more recently, in 2013, Greenplum (by then part of Pivotal) released the Pivotal HD distribution of Hadoop.
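To make the MapReduce idea concrete, here is a minimal word-count sketch in Java against the standard org.apache.hadoop.mapreduce API; it is the classic first example of the model, and the class names (WordCount, TokenizerMapper, IntSumReducer) follow the conventional Hadoop tutorials rather than anything specific to Nutch or Yahoo's code.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Word count, the "hello world" of MapReduce. Map tasks run in
// parallel on whichever nodes hold blocks of the input (data
// locality); the framework then shuffles all counts for the same
// word to a single reduce call.
public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // emit (word, 1) for every token
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();           // add up the 1s for this word
      }
      result.set(sum);
      context.write(key, result);   // emit (word, total count)
    }
  }
}

Each map task processes only its local slice of the input, and the framework handles moving the intermediate (word, 1) pairs between nodes, which is exactly the "distribute the data and the calculations" idea described above.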
But what makes Hadoop so important for quickly analyzing big data?

First, as explained above, parallel processing and the selection of relevant distributed data make it quicker to store and process huge amounts of information. Remember that the growth of information on the internet is also exponential: social media and new innovations record ever more data as text, video, sound, and other media. Nowadays many devices and gadgets store your personal information and connect to the internet to send it; this connection of gadgets, appliances, and devices in general to central online services is often referred to as the Internet of Things (IoT).

Second, processing power scales: the more computing nodes Hadoop uses, the faster it can process big data. (Big data is an all-encompassing term for any collection of data sets so large and complex that they become difficult to process using traditional data processing applications; the challenges of big data include analysis, capture, curation, search, sharing, storage, transfer, visualization, and privacy violations.)

Third, it has high fault tolerance, since data processing is protected against hardware failure: if a node goes down, its processes are automatically redirected to other nodes, while multiple copies of all data are stored automatically. (A driver sketch illustrating this configuration closes this post.)

And finally, Hadoop is more flexible than traditional relational databases, its open-source framework keeps its cost low, and little administration is needed, since more data is easily handled by adding more nodes.

SQL-like relational query layers, such as Apache Hive, are now being developed on top of Hadoop's MapReduce engine. The current difficulties of Hadoop include the further development and improvement of MapReduce itself, which, though good for problems easily divided into independent units and for simple information requests, still faces tough challenges in handling interactive analytic tasks efficiently. MapReduce is file-intensive, and its algorithms require multiple shuffle phases to redistribute data among the nodes. Another area of opportunity for Hadoop is the development of tools for data standardization, governance, quality, and management. The final area of improvement is the security of its data; at the present time, the Kerberos authentication protocol is a great step toward keeping the Hadoop framework and environment secure. The Bloor Group recently referred to the Hadoop ecosystem as the ugly duckling turned swan. At the present time, research on Hadoop is sponsored by the International Institute for Analytics (IIA) and SAS.

by @AlfredoSahagun of @3nglishOnline content developer, @englishxspanish translation service, @institutoidiomas online language teaching
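Here, finally, is the driver sketch promised above: a minimal example (assuming the WordCount classes from earlier in the post) of configuring and submitting such a job. The class name WordCountDriver and the explicit dfs.replication setting are illustrative choices, not a prescribed configuration; 3 is the usual HDFS default.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver for the WordCount job sketched earlier. Input and output
// live in HDFS, which keeps multiple copies of every block; if a
// node dies mid-job, its tasks are rescheduled on a node holding
// another replica -- the fault tolerance described above.
public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // dfs.replication controls how many copies HDFS keeps of each
    // block written by this client (3 is the common default).
    conf.set("dfs.replication", "3");

    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    // The reducer also works as a map-side combiner here, cutting
    // the amount of data shuffled between nodes.
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Scaling out means adding nodes, not changing this code.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}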
