by Ventana Research | 2012-01-13 | Article ID: QT12-03 | Article Type: QuickTake
Beginning with Version 9.1, introduced earlier this year, Informatica’s flagship product has been able to access data stored in the Hadoop Distributed File System (HDFS) as either a source or a target for information management processes. However, it could not manipulate or transform the data within the Hadoop environment. Informatica’s HParser is designed to close that gap. Using DT Studio, Informatica’s Eclipse-based integrated development environment (IDE), organizations can use a graphical user interface to create data transformation routines that parse the information in log files and other types of data typically processed with Hadoop. Once developed, these routines are deployed to the Hadoop cluster and invoked as part of the MapReduce scripts, which enables them to use the full distributed processing and parallel execution capabilities of Hadoop. Using a graphical environment to develop these routines should make it easier and faster to create the code necessary to parse the data. Our benchmark research shows that staffing and training are the two biggest obstacles to leveraging Hadoop, so tools like HParser that minimize the specialized skills required can be valuable to organizations deploying Hadoop.
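To illustrate the kind of hand-written parsing logic that a graphical tool aims to replace, here is a minimal sketch of the map step of a log-processing job in Python. It is not HParser code and does not use any Informatica API. The Apache Common Log Format input, the function names parse_log_line and map_lines, and the choice of emitting the request path and response size are all assumptions made for the example.

```python
import re

# Assumed input format: Apache Common Log Format, e.g.
# 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" in the size field means no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def map_lines(lines):
    """Hypothetical map step: emit tab-separated (path, size) pairs.

    Lines that fail to parse are skipped rather than aborting the job.
    """
    for line in lines:
        record = parse_log_line(line.rstrip("\n"))
        if record is not None:
            yield f"{record['path']}\t{record['size']}"
```

In a Hadoop Streaming job, a script like this would typically feed standard input to map_lines and print each result, leaving the aggregation to a reducer. Even this small case requires regular-expression and MapReduce knowledge, which is the skills burden described above.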
Business intelligence vendors and information management vendors alike have embraced Hadoop. We expect to see more investment from Informatica and others as organizations work to make Hadoop a disciplined part of their IT infrastructure processes. As our research shows, integration is one of the top four issues for organizations working with Hadoop. The more that existing products can be extended to incorporate Hadoop or new products can be developed to make Hadoop easier to use, the more widespread its use will become. Die-hard MapReduce programmers may not feel that they need HParser. However, enterprise IT organizations already using Informatica should find it a welcome addition in their efforts to deal with Hadoop-based data sources.