“Common Sense” Project Management (Part 1)

Wednesday, March 10th, 2010

It has often been said that there is nothing “common” about common sense. Nowhere have I found that truer than in the area of project management. The intent of this series of blogs is to explore some of the more common subjective reasons why some projects succeed and some fail. I believe that there are some very important hallmarks of a successful project that are often undervalued because they deal with some of the more subjective aspects of leadership.

There are many factors that differentiate a successful IT project from a mediocre one. Surprisingly, unsuccessful IT projects more often result from a failure to follow some simple “common sense” principles of leadership than from using the wrong project management methodology or implementing technology that is too difficult. I don’t want to discount the benefit of all the new project management methodologies and processes available today, and their importance to a project. However, I believe that there are other intrinsic factors critical to the successful execution of a project that, while more subjective, are every bit as important as some of the more quantified aspects of project management.

My opinion is based on 34 years of working on a variety of technical projects, both as a participating team member and as the project manager. I admit that many of the important lessons that I have learned about project management stem from having done it wrong and learning from the experience (i.e. good judgment comes from experience, and experience comes from bad judgment). I’ve worked on some extremely successful projects as well as participated in a few “death marches”. Through these experiences, I discovered some traits that were often present in the successful projects and absent from the unsuccessful ones. It is these successful traits that I want to explore further.

While there are many facets to these “success traits”, they all have at their core a basic understanding of human nature. Over the years, I have encountered some extremely smart people who, while capable of keeping up with the ever-increasing tempo of technological change, are clueless about what motivates and demotivates people. They appear to be unaware of the negative consequences of their leadership style and their impact on the project, and then wonder why their project is performing so poorly. It is like having a new car with a powerful engine and insisting on driving with the parking brake on, and then complaining that the car doesn’t perform as promised. I have actually had conversations with people who, when I pointed out the “parking brake” in their situation, were surprised that it would have an impact on their project.

I have never met a technology professional whose goal it was to do a bad job. Everyone wants to be successful and feel good about what they do. While sometimes people are miscast in their role on a project, too many times it is a culmination of these subjective factors that leads to poor project performance, not a lack of talent on the individual’s part.

The following are four areas that have played a critical role in the successful projects that I have been part of over the years, and that I will be exploring in subsequent postings.

1. Enlightened “people” management and strong project leadership.

2. Adequate communications with BOTH the customer and the project team.

3. Understanding the customer and how to determine their “real” requirements.

4. Risk analysis/avoidance (i.e. how to prepare for things going “bump” in the night).


Search Wizards Speak: An Interview with Tim Estes

Tuesday, February 2nd, 2010

NOTE: The following is the full interview that Stephen E. Arnold conducted with our CEO, Tim Estes. You can find this interview and others at www.ArnoldIT.com.

In a taxi from Baltimore-Washington Airport to a speaking engagement in Washington, DC, a colleague and I were discussing my search blog. We were sharing a taxi with two other people. One of them asked, “Are you the fellow who writes about search and content processing?” I replied, “I was.”

The person asking the question introduced himself as a reader and began to tell me about his company’s technology. I took his card, did some research, and this interview is one outcome of that encounter.

Digital Reasoning, based in Franklin, Tennessee, has developed a suite of software that adds value to content.

I learned that the company develops technologies that help solve the problem of information overload. The company’s tools allow users to read, understand, and make use of vast amounts of data.

Digital Reasoning has patented its technology that, according to the firm’s Web site, “deeply, conceptually searches within unstructured data, analyzes it and presents dynamic visual results with minimal human intervention. It reads everything, forgets nothing and gets smarter as you use it.”

I followed up with the company’s chief executive officer, Tim Estes. The full-text of my interview with him appears below.

What is “digital reasoning”?

Digital Reasoning is unique in the market in its ability to bootstrap a model from the data down to the entity level and then start resolving entities and aggregating their connections to give you a much better picture of the data. We are a real summarization technology that is not limited to the a priori model or ontology that is applied. I think this is where the market is going – but time will tell.

What is your background?

I went to the University of Virginia.

That’s interesting. My son attended UVA.

Quite a coincidence.

I’m a philosopher by training. Ironically – when I graduated we had T-shirts that said: “Philosophy – I’m in it for the money.” My background was in semiotics, Philosophy of Language and Philosophy of Mathematics. A principal area of interest was in the works of Wittgenstein and Leibnitz. I have a passion to find hidden structure in things and proceed from the assumption that the world is held together by necessary and intrinsic order (thus the Leibnitz bias). In founding the company, the idea was that with sufficient introspection of mathematical and structural invariances that present themselves inside of data, a “model” would emerge from the data that could allow software to execute on imprecise goals using learned contexts.

Were there key influencers that shaped your firm’s technical approach?

I credit two primary influences with driving me to start the company. One was a brilliant article written by David Gelernter called “The Second Coming” and the other was an interview that Bill Gates gave in Red Herring in the Spring of 2000. Bottom line – they both pointed forward to a day when all software would learn and the other software would be commoditized and simply infrastructure. Digital Reasoning was really about trying to bridge that gap – and it still is. We saw the most opportunity and challenging problems in the area of having systems understand unstructured data to be able to help bootstrap the context necessary for a new level of software automation – i.e. ambient intelligence in software that could prioritize, summarize, and make a reasonable level of proxy decisions for humans that are overloaded with information. To me – most of the buzzwords in search are just repackaging these core ideas.

Faceted navigation, for instance, is really just prioritization and summarization that draws more out of the user to substitute for a system not having sufficient context or understanding of a user’s intention to bring back the right results. It has the ancillary benefit of surfacing connections or facets in the data that probably were not known at the outset (this is the summarization function that lists of mentioned entities or histograms of hits over time give you).

What was the trigger in your career that made search and retrieval and content processing focal points? Weren’t there other, easier opportunities for you to use your technical training and expertise?

Well – Digital Reasoning pretty much is my career. I started it in my 3rd year in school and have been doing it ever since. I can’t think of anything else that would make sense in the Industry – I’d probably be teaching if I weren’t running DRSI.

I suppose after 9/11, I could have taken a route to get into the Government and Intelligence space as a Blue Badger. But given my age – just turned 30 last year – I doubt at the time I could have had the impact I wanted. Now – after 8-9 years of working hard problems in this space, I think we are really starting to make a difference.

What type of performance can a licensee expect with your system?

Digital Reasoning’s core product offering is called “Synthesys.” It is designed to take an enterprise from disparate data silos (both structured and unstructured), ingest and understand the data at an entity level (down to the “who, what, and wheres” that are mentioned inside of documents), make it searchable, linkable, and provide back key statistics (BI type functionality). It can work in an online/real-time type fashion given its performance capabilities.

Synthesys is unique because it does a really good job at entity resolution directly from unstructured data. Having the name “Umar Farouk Abdul Mutallab” misspelled somewhere in the data is not a big deal for us – because we create concepts based on the patterns of usage in the data and that’s pretty hard to hide. It is necessarily true that a word grounds its meaning to the things in the data that are of the same pattern of usage. If it wasn’t the case no receiving agent could understand it. We’ve figured out how to reverse engineer that mental process of “grounding” a word. So you can have Abdulmutallab ten different ways and it doesn’t matter. If the evidence links in any statistically significant way – we pull it together.
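As a rough illustration of the general idea of grouping spellings by “patterns of usage” (this is a toy sketch of distributional similarity, not Digital Reasoning’s algorithm, which the interview does not describe), the Python below compares the words that surround two name variants. The names, sentences, window size, and function names are all invented for the example.

```python
from collections import Counter
from math import sqrt


def context_vector(name, sentences, window=3):
    """Count the words appearing within `window` tokens of `name`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token == name:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(t for t in tokens[lo:hi] if t != name)
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Invented toy corpus: two spellings used the same way, one unrelated name.
sentences = [
    "smithson wired funds to the account in zurich",
    "smythson wired funds to an account in zurich",
    "garcia taught physics at the local school",
]

same_usage = cosine(context_vector("smithson", sentences),
                    context_vector("smythson", sentences))
different_usage = cosine(context_vector("smithson", sentences),
                         context_vector("garcia", sentences))
print(same_usage, different_usage)
```

Two spellings that keep the same company score close to 1, while an unrelated name scores 0. A real system would need far more evidence and a significance test before merging two entities.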

Synthesys trials can be had at around $50k or so (depending on specifics). Enterprise deals are substantially higher – but that is true of just about everyone in our space. We offer all of the typical high-level features you’d find in players in Unstructured Data Analytics – entity extraction, geotagging, faceted navigation, query suggestion, etc. But few, if any of them, can really resolve entities accurately without a lot of “humans in the loop.”

The system can index ~10 million files on large single systems. We are in testing on a large distributed model for Synthesys with a government customer right now where we will crack 150M files on less than a dozen servers. The new model is proven to be horizontally scalable and implements the first “eventually consistent” model for a player in our space that we are aware of. It is our hope to prove web scale (i.e. billions of documents) before too long.

Most of our throughput is tied into memory/caching. For instance, with four cores and 12 GB of memory and standard SATA drives, you would probably see ingestion in the hundreds of KB per second up until the single millions of documents and then degradation as caches start to get lower and miss more often.

The number of new companies entering the search and content processing “space” is increasing. What’s your view on too many hungry mouths and too few chocolate chip cookies?

I think that it is a lot of noise in the system. One of the areas that is particularly disappointing right now is the lack of innovation in the eDiscovery area. Most of that market is using technology that got lifecycled out of the Intel/Defense space 5-10 years ago. In enterprise search, I suppose the many mouths will lead to natural Darwinian results.

My only hope is that the new companies offer some real innovation and don’t rehash the same old marketing (“Bring Order to Information Chaos.” Etc.) with the same failed approaches (extract, load a DB, search it with more metadata, etc.). I think the sophisticated IT buyer/CIO is pretty tired of being promised more than can be delivered in this space.

Like the old commercial – we are hopefully going to be getting into a “where’s the beef?” type attitude soon.

Finally – I think that while the academic conferences and contests have been interesting – I think there needs to be a better way to prove that these technologies generalize to a real customer’s data. Everything looks pretty once the data gets well formed and cleaned up. Boy don’t those Palantir demos look really cool – but what happens when you really hit the junk we call data in real businesses or Federal enterprises? We need to focus on the real data – not the slickest demos. The people in the Intel community especially understand the “bait and switch” of demoing on clean, structured data and then having to face the reality of their data on the inside where these demos never seem to work against the large amount of noise.

When the market leaders get honest about the challenges of noisy data and start delivering predictable quality over that real data, that’s when we (speaking as a member of the unstructured data analytics market) will get our credibility back.

What are the functions that you want to deliver to your customers?

Well, I think we want all data to be available to users from a content/entity level versus a document level. Documents are containers of facts and ideas. We don’t have time to read 1/10th of what we want to or need to. We need summarization and prioritization feeding visualization. We’d like to see that as common practice.

In the Intel business – why do we read stuff before we start creating charts and graphs of key connections? Because the software is too stupid to do it for us right now in an automated fashion. That needs to change. Our analysts are overworked, our managers have to consume too much at too high a level, and we are drowning in email and Facebook/Twitter feeds. Something has to sit between us and the firehose of content and status updates that are overwhelming us. It’s not just new tools to navigate it and read. It’s really something quite different – show me what it means in a snapshot and let me dig in to whatever looks important and novel. And, do it as fast as Google but from a concept or entity-centric point of view.

That’s what we deliver in our Defense/Intel efforts and it’s what we look to deliver to other contexts and markets as we expand into those this coming year.

Are you able to give me some insight into new features you will be offering your licensees in the next release?

I don’t want to go into too much detail. But on the backend side, we have two major efforts going on that we believe will disrupt the market. First, we have a real answer to entity resolution that works at scale. Right now we are integrating it with the ability to apply it to both structured and unstructured data. That’s going to be a real killer. It conceptually integrates the actual entities in enterprise data and does so with minimal a priori modeling and customization (especially compared to the other approaches on the market today).

Next, we are implementing a backend that is very similar to what Amazon has as software infrastructure. It is going to allow horizontal scalability of the underlying storage and processing and allow for multiple datacenters and clouds to synchronize this understanding. This means that Digital Reasoning is positioned to have a real offering for understanding data in the hybrid cloud space.

There’s a push to create mash ups–that is, search results that deliver answers or reports. What’s your view of this trend?

I think it’s pretty useful so long as the quality of analytics is good. It’s always tricky when you automate a process that has a 0.8 F-measure (F1) at best on noisy data. You end up getting some very humorous mistakes. But that’s the price of the early stage of disruptive technologies. If we can create supplemental processes (like ensembles that are tuned toward recall paired with others toward precision) we can emulate what’s worked well in the medical community in terms of the testing process. I want to credit Ted Senator (used to be at DARPA, now at SAIC) with the above analogy. He used it in a paper a few years ago and I think it is still one of the better analogies I’ve heard in this space.
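One way to read the recall-then-precision pairing, by analogy with a sensitive screening test followed by a more specific confirmatory one, is a two-stage filter. The Python below is only a sketch under that reading: the candidate strings, both scores, and both thresholds are invented, and the two scoring models are stand-ins for whatever a real ensemble would use.

```python
def screen(candidates, threshold=0.2):
    """Recall-oriented pass: keep anything that might be a hit."""
    return [c for c in candidates if c["cheap_score"] >= threshold]


def confirm(candidates, threshold=0.8):
    """Precision-oriented pass: keep only well-supported hits."""
    return [c for c in candidates if c["careful_score"] >= threshold]


# Invented candidates for a hypothetical "is this a place name?" task.
candidates = [
    {"text": "Paris", "cheap_score": 0.90, "careful_score": 0.95},
    {"text": "Paris Hilton", "cheap_score": 0.60, "careful_score": 0.10},
    {"text": "the", "cheap_score": 0.05, "careful_score": 0.00},
]

flagged = screen(candidates)   # wide net, tolerates false alarms
accepted = confirm(flagged)    # narrow net, applied only to what was flagged
print([c["text"] for c in flagged], [c["text"] for c in accepted])
```

The cheap pass can afford to be wrong often because the careful pass only has to look at what was flagged, which is the economy the testing analogy points at.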

What sets your technology apart from some other vendors’ systems?

Our solution is generally complementary to the Oracle/MySQL/MSSQL solutions we find in the government and enterprise. It can be stood up on its own – this is the default – but we don’t have issues integrating into the broader enterprise with those other systems.

I think I’ve covered the differentiation point already – but really the ability to find entities, resolve them, and then retain their connections to other entities and all related data is a pretty big differentiation. We also believe that scale and speed are differentiators for us. While others may index for search faster, few if any can match our depth of understanding of the data at scale or with the speed we have.

Our approach is fundamentally different from 90% or more of the market, because we have a real bias against trying to leverage a priori models against the data (i.e. exhaustive extraction or ontology type models). Digital Reasoning tells you what you didn’t already know and also sorts out data easily so you can find what you expect to find if it’s there – we deal with both the knowns and the unknowns elegantly. That’s how we are different. We’re particularly good at enabling the discovery of the non-obvious and the unknown from noisy unstructured data.

Semantic systems have been getting quite a bit of coverage, yet the Powerset technology and other semantic players like Hakia.com have been slow out of the gate. What’s your view on semantics and natural language processing? Are these technologies ready for prime time?

It’s getting there. I have a fundamental disagreement with the Extract, Transform, Load (ETL) for text type approach, however. It tends to work well in fixed/stable domains and poorly in domains with evolving semantics and noisy data. I think that is exactly what we see right now in terms of the limitations. I think this approach will ultimately succumb to approaches that can bootstrap from the data (this is a variation of the Peter Norvig camp on the problem). We are still waiting for the iPod of learning algorithms that works at scale to really show how futile all of this a priori modeling investment really is.

I also think that most of these guys probably were optimistic about their ability to scale their analytics to web scale and got caught off guard with how hard it is to go from tens of millions or hundreds of millions of pages and work at tens of billions of pages. It’s just a hard problem. Google succeeds because 6-7/10 hits on the first page helps them keep their business model rolling. Trying to get 9/10 on much more semantically narrow domains is at least an order of magnitude harder problem if not two.

A number of vendors have shown me very fancy interfaces. The interfaces take center stage and the information within the interface gets pushed to the background. Are we entering an era of eye candy instead of results that are relevant to the user?

We are always taken in by the demo. It’s pretty typical. People and enterprises want an information savior – and the demo is like a “miracle proof” even if it is really more Wizard of Oz than anything else. I think that the real work in this space is not being done by the demo artists. It’s being done by those that can make sense of the data while asking less and less of the user.

I think that “Intelligence Augmentation” – something that Palantir was blogging about recently – is very much a cop out. It basically states we still want the human to have to do all of this work but we are going to make it a lot less onerous on them. This doesn’t solve the problem at all. Sure – most of the time investment in applying machine learning algorithms is data normalization – but that’s the point. If we had algorithms that were smart enough to create a model from mathematical order in the data that meant something to a human, we wouldn’t have to ETL it into a specific schema. Data normalization is a machine learning problem. I think that is where they miss the boat. The Intelligence Augmentation approach (left alone) creates false assurance that the user is making progress when, actually, key items are being missed due to the fact the software has no real, evolving understanding of the data. We need computers to see the whole picture of what’s going on in millions or billions of messages because there is no way a human can. No visualization can roll up that many nodes to make it tractable for a human to understand. Any visualization without the capacity to understand the underlying data in sophisticated ways is just doing a disservice to the mission.

Like all complex problems, we need substantial automation to grow productivity. To us, understanding data is a lot like automated landing systems in aircraft. At some point in the not-too-distant past it simply became too much for human beings to manage all of the complex subsystems in a commercial jet aircraft. Now pilots only manage those items in emergencies and focus on the major judgment-oriented tasks in flight (direction, altitude, etc.).

We need automated awareness systems across most information-centric activities. That’s the real meat. Visualization is a means to present this underlying capacity for maximum utility. It is not the utility itself.

What text processing functions do you offer?

Currently we offer indexing, entity extraction, geotagging, search, faceted search, relationship extraction (basic), and dynamic graph generation from those relationships. Our entity extraction and language processing is being rebuilt into a next generation capability right now. We plan on offering anaphora resolution, in-document co-reference, and deeper extraction in future releases. We are currently English only but also plan to pick up other languages. We hope to do that this year (it’s not a technology issue for us), but that depends on competing customer demands. Right now, there is a lot of business supporting English since that is what nearly all of the analysts are using.

Also, our new horizontally scalable backend will be in the next release along with new entity resolution capabilities against structured and unstructured data. Other bells and whistles too – but those are the majors.

What is it that you think people are looking for from content technology?

People are looking for semantic technology to help them read less and understand more. Sounds simple right? They don’t readily trust the summarization part – so that’s an area that needs a big step up.

A major source of discontent is the upfront cost of building models (the ETL bias) to turn unstructured data into structured data. This is probably the biggest holdback in the enterprise (especially in a tight budgetary environment). They are tired of software that has an even bigger up front deployment and maintenance cost. Given how we solve the problem, we expect to have a compelling story here.

I think the other big piece that is holding back semantic technology is the obsession with search and reactive applications. Enterprises need to start looking at how to use semantic technology more proactively and vendors need to be delivering better solutions here.

What are the hot trends in search for the next 12 to 24 months?

I think faceted navigation is going to become standard- even passé. The trick will be how well this can happen from noisy data. That’s where it will be interesting to compare what Endeca has (which is heavy on up front modeling of your data) to what Nova Spivack is working on over at Radar Networks (probably a much more elegant approach).

I think the wave that is coming, however, is how do we get into proactive applications in semantics and search – i.e. ambient awareness yielding autonomous action by systems where the principal data streams are unstructured. That’s the next big wave. We are working that both in our direct business in Defense/Intel and in new markets. We expect to pursue partnerships with existing enterprise players during the coming year. Beyond that – well we’ll see.

Where can people get more information?

Our Web site has some current information. Blogging has been a little slow recently since we’ve really been maxed out with new items taking up time from the likely internal contributors but we hope to get a little more diligent on that in the coming months. We’ve got some material on request – we’ve actually got a ton of material but we like to understand the need first so we can maximize both our potential customers’ time as well as ours.

ArnoldIT Comment

Digital Reasoning has captured the attention of a number of US government agencies. The firm’s profile in the commercial sector is on the upswing. Those with a need to know what’s relevant to a particular concept or topic in a large flow of content will find that Digital Reasoning’s approach offers an alternative to the older, one-size-fits-all solutions from vendors with technology dating from the mid 1990s. The company is aggressive and committed to making its licensees get full value from the company’s patented technology. More information is available from http://www.digitalreasoning.com.

Stephen E. Arnold, February 2, 2010


We Are Hiring!

Friday, May 1st, 2009

Digital Reasoning Systems is looking for Java expertise.

Please see our Careers page under About Us and use the contact form for more information.


Digital Reasoning’s Products and Services Now Available on GSA Schedule

Thursday, April 2nd, 2009

Today we announced that our complete product line and services are now available on the GSA Schedule number GS-35F-4153D.

Special thanks to Intelligent Decisions, a solid partner and VAR reseller for us.

Net effect: It’s easier than ever for government agencies to take advantage of Digital Reasoning solutions. Please contact us if you have any questions or would like to learn more about procuring our solutions via the GSA schedule.


Digital Reasoning’s Products Added to NASA’s SEWP Contract

Wednesday, January 21st, 2009

From the press release:

Digital Reasoning Systems, Inc., the intelligence-software innovator, today announced that its product line is now available to federal agencies on the NASA Solutions for Enterprise-Wide Procurement (SEWP) contract. A SEWP listing allows all government agencies the ability to procure the company’s products at discounted prices.

Digital Reasoning products added to the SEWP contract are:

  • Interceptor: Interceptor allows you the ability to look through all of your data rapidly and easily discover what is inside
  • GeoLocator: GeoLocator is a tool that extracts populated places from your data and returns the extracted locations with their geo-coordinates
  • Synthesys: Synthesys allows you the ability to easily create applications that leverage vast amounts of unstructured data

The stated vision of the SEWP contract is “to be the premier customer-focused contract vehicle for Federal Government purchase of mission critical, state of the art IT products.”

“It’s a major coup for Digital Reasoning to be added to the SEWP contract. It makes our unique technology accessible across the Federal Government and positions us as a true platform for unstructured data analytics”, said Tim Estes, CEO of Digital Reasoning Systems.


Market Catches Up to Digital Reasoning

Saturday, October 4th, 2008

From the press release:

In a recent article for CNET News, Stephanie Olsen explained that investment in web technology initially dealt with commercializing the Web, helping companies like Amazon.com and eBay get on their way. The second wave of investment has been about helping people socialize and connect through sites like Flickr, YouTube, and Facebook. The third, she writes “will be about making sense of all the data people create around the Web, and then searching for patterns in the data to improve the delivery of personalized content, search results, or advertising.”

To make sense of the data, Olsen proposes, will require “building an intelligent system that understands the relationships between Web sites and how people use them” with the use of algorithms that understand keywords, context, and natural language on a massive scale. “VCs (Venture Capitalists), for example, are looking to so-called semantic technology to significantly boost the amount of searches that result in an advertising click. Right now, an estimated 30 percent to 40 percent of Web searches do not return advertising revenue. But if a search engine understood the context of a person’s Web search more often, those numbers would improve, they say.”

“Simply put, the problem is information overload – there is so much good information that you have to look really hard to find the great information that you care about most”, said Tim Estes, CEO of Digital Reasoning Systems.

The typical approach to understanding unstructured data involves having to either read the data manually or use a keyword search tool. Both methods present challenges to accuracy and efficiency. Manual reading, while reliable, can take an inordinate amount of time. A keyword search, while fast, returns only limited results.

At Digital Reasoning, we apply advanced algorithms to solve both problems without sacrificing quality. In fact, because the software builds its models of meaning from the data and understands concepts – the end product is better information.

Since 2002, Digital Reasoning has worked with a variety of organizations and agencies both large and small to help them make sense of what is in their data. We are proud of the patented technology we’ve developed and our clients’ successes in the federal and intelligence market. Now we are making that same technology available to commercial clients.


What is the Synthesys Platform?

Thursday, August 28th, 2008

You may have noticed SynthesysSDK.com.

This begets the obvious question: What is the Synthesys Platform?

The short answer: Digital Reasoning’s Synthesys Platform provides the first true Software Development Kit (SDK) and server platform for Unstructured Data Analytics (UDA).

The slightly longer answer: The Synthesys Platform helps you find unexpected, critical knowledge hidden in your data. Synthesys takes unstructured text as input, uses entity extraction with strong semantic relationship analysis to operate on the input, and then outputs abstracted knowledge objects. You can then use these objects (people, places, connections, etc.) to understand and analyze what’s important.

For an in depth answer and to speak to us about possibly joining our limited beta, please contact us via the form on SynthesysSDK.com.

You can also attend one of our upcoming events or tech talks, for example my “Hacking the Meaning in Human Communication” presentation at the upcoming Tulsa TechFest in early October.

Watch for much more information on the Synthesys Platform on this blog in the weeks to come.


Measurement improves software development

Thursday, July 10th, 2008

There are two possible outcomes:

If the result confirms the hypothesis, then you’ve made a measurement.

If the result is contrary to the hypothesis, then you’ve made a discovery.

Enrico Fermi

A couple of years ago we started a process of programming that was very different than anything I’ve seen in the last 15 years or so that I’ve been at it. We had a challenge given to us to produce a geographical location service built upon our entity extraction technology. It was an interesting exercise which at the time we had no experience doing. The object of the game is to read in text documents, discover location references, disambiguate them, look them up in a gazetteer and mark them up with the coordinates. This can be done either as an additional final section or, the more difficult case, in-line.
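A minimal sketch of the in-line case, for illustration only: the gazetteer entries, the approximate coordinates, the `<place>` tag format, and the function name are all invented here, and the hard part (disambiguating between places that share a name) is skipped entirely.

```python
import re

# Hypothetical toy gazetteer: place name -> approximate (latitude, longitude).
GAZETTEER = {
    "Rio de Janeiro": (-22.91, -43.17),
    "Tulsa": (36.15, -95.99),
}


def mark_up(text, gazetteer=GAZETTEER):
    """Wrap each known place name in the text with an in-line coordinate tag."""
    # Try longer names first so a multi-word name is preferred over any
    # shorter gazetteer entry it happens to contain.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")

    def tag(match):
        lat, lon = gazetteer[match.group(1)]
        return f'<place lat="{lat}" lon="{lon}">{match.group(1)}</place>'

    return pattern.sub(tag, text)


marked = mark_up("Flights from Tulsa to Rio de Janeiro resume Monday.")
print(marked)
```

An exact-match lookup like this only finds names it already knows and will mark every mention, including non-place uses of the same string, so it is a poor stand-in for real entity extraction.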

So off we went. Now the very first attempts at measuring this were done by me. I had had a lot of statistics in college but never thought I’d really get to use it. I came up with my own measures which were pretty close to recall and precision. Giving both numbers just didn’t fly with the management at the time. It was confusing. They wanted one number. After a little research I discovered the standard measures: recall, precision and the mysterious F1 (or F-measure), which combines the other two into a single number.

In the case of this task we defined tokens as either relevant or irrelevant. If the token represented a PPL (populated place) then it was relevant. Otherwise it was irrelevant. So if a relevant item was marked up with the correct location it was a true positive. If it was not marked up, or was marked up with the wrong location, it was a false negative. If an irrelevant item was marked up it was a false positive. The debates raged on what to do in the case where the system found a location but just did not disambiguate it correctly, and over what to do when tokens were improperly co-located (as in, what if “Rio de Janeiro” came up as “de Janeiro” instead?). Ultimately we decided to keep it simple. Any error below the level of marking something right or wrong was deemed just a detail.
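
A minimal Python sketch of that bookkeeping might look like the following. It is not the original test harness: the function names, the token positions and the sample locations are all made up for illustration. It assumes a relevant token with a missing or wrong markup counts as a miss, and a marked-up irrelevant token counts as a false alarm.

```python
def tally(reference, system):
    """Count hits, misses and false alarms at the token level.

    reference: token position -> correct location (relevant tokens only)
    system:    token position -> location the system marked up
    """
    hits = misses = false_alarms = 0
    for token, expected in reference.items():
        if system.get(token) == expected:
            hits += 1        # relevant token, correct location
        else:
            misses += 1      # relevant token, no markup or wrong location
    for token in system:
        if token not in reference:
            false_alarms += 1  # irrelevant token was marked up
    return hits, misses, false_alarms


def precision_recall_f1(hits, misses, false_alarms):
    """Precision, recall and their harmonic mean (F1)."""
    marked = hits + false_alarms
    relevant = hits + misses
    precision = hits / marked if marked else 0.0
    recall = hits / relevant if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hand-made example data (hypothetical).
reference = {3: "Paris, FR", 9: "Rio de Janeiro, BR", 14: "Nashville, US"}
system = {3: "Paris, FR", 9: "Rio de Janeiro, BR", 14: "Nashville, AR", 20: "Mobile, US"}

counts = tally(reference, system)  # 2 hits, 1 miss, 1 false alarm
print(counts, precision_recall_f1(*counts))
```

Multiplying the F1 value by 100 gives a single "out of 100" score of the kind management asked for.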

It took a lot of measurements and a lot of debate but we got it to work. This learning process produced a lot of healthy discussion and when we did finally decide on what formulas were best everyone could clearly see how to proceed.

The first day we calculated the f-measure of our geo-coordinate markup service it came out at an astoundingly low 37 out of 100. I went over the numbers several times. Management wasn’t happy. What was decided next ended up being a great model for future development. We were put in a conference room with our computers and a white board. We were told not to leave until the f-measure was above 80. The way the development worked, we had one person who worked on the trained categories system and another who did the application programming. I was doing measurements and creating reference sets. Three of us working towards one task, side by side.

We would discuss potential strategies and would then run them through the test harness. Every strategy would impact recall and precision. Often this would show how these concepts are opposed. As one is increased the other is decreased. What you are looking for is opposition that is not equal such that the f-measure rises. You want the decrease to be smaller than the increase. While it seems obvious most people don’t program that way. They come up with a bunch of ideas, implement them and just accept the measurements they get. In our case each change was tested. Yes it was slow but it separated out the good ideas from the bad ideas. We also, in this way, discovered other weaknesses that were fixed. If we had not been looking at this on a case by case basis we would have missed the subtle clues that helped us iron out the other parts of the system that were contributing to the final result.
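
The test-every-change loop can be sketched in Python as below. This is only an illustration under stated assumptions: `fake_harness`, the strategy names and every number in `EFFECTS` are invented stand-ins, not measurements from the real system. The point is the acceptance rule: a change stays only if the f-measure goes up.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def greedy_accept(run_harness, strategies):
    """Try strategies one at a time; keep each only if F1 rises."""
    accepted = []
    best = f1(*run_harness(accepted))
    for name in strategies:
        score = f1(*run_harness(accepted + [name]))
        if score > best:
            accepted.append(name)
            best = score
    return accepted, best


# Hypothetical stand-in for a real test harness: each strategy shifts
# precision and recall by a fixed, made-up amount.
EFFECTS = {
    "loosen_match": (-0.05, +0.15),      # loses a little precision, gains recall
    "drop_short_tokens": (+0.02, -0.12), # gains a little precision, loses recall
    "context_filter": (+0.05, -0.02),
}


def fake_harness(active):
    precision, recall = 0.60, 0.50  # made-up baseline
    for name in active:
        dp, dr = EFFECTS[name]
        precision += dp
        recall += dr
    return precision, recall


accepted, best = greedy_accept(fake_harness, list(EFFECTS))
print(accepted, round(best, 3))
```

With these invented numbers the first and third strategies are kept and the second is rejected, because its loss in recall outweighs its gain in precision.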

I believe that honestly measuring your tools’ accuracy is important not just for sales and customer reassurance but also for the whole development life cycle. Efforts are underway to allow the unsupervised portion of the DRS system to aid in getting the Geo Reasoning system at or above 90 f-measure. Right now 75-80 is state of the art. Every point of f-measure gain beyond 80 is far more difficult to achieve than all the ones prior. However a learning system should be capable of this feat. More on that later.

Getting your Mojo back with Dojo

Wednesday, July 9th, 2008

Matthew Russell, Director of Advanced Technology, joined Digital Reasoning in October of 2007 and has been making an impact since day one. A talented and dedicated programmer, Matthew’s long hours and creative energy have been focused on improving the Interceptor user interface, architecting web applications and devising innovative ways to embed the company’s core technology platform into commercial products.

It has been a busy year for Matthew. Despite the relocation to Nashville and devoting countless hours to the work at Digital Reasoning, Matthew managed to complete his first book, “Dojo: The Definitive Guide”, published by O’Reilly Media and released on June 17th. We talked to Matthew about his book, the writing experience and plans for the future.

Q: What is Dojo?
A: Dojo is a piece of client-side technology – JavaScript-based – that creates a great user experience on the web. It’s a toolkit; technically speaking, it’s something you can use to create a great user experience in a web browser.

Q: What makes Dojo superior to other JavaScript toolkits?
A: The overall architecture is very well thought out. It’s industrial strength, it’s battle tested. Big blue chip companies are using it. And it has tremendous breadth and depth. It doesn’t just solve a little narrow problem, it can solve lots and lots of different kinds of problems, but the solutions aren’t just cursory – they are very involved.

Q: How and when were you introduced to Dojo?
A: At a previous company, a colleague and I worked on all these applications for the intelligence community and one really common issue with intelligence datasets was that there was generally a lot of data that needed to be displayed in a tabular format. We started to scope out what other people had done… other technologies in the JavaScript toolkit realm, and Dojo was one of those. From there, I started to learn all the other things Dojo automates and makes simpler.

Q: What other writing have you done and how did this book come about?
A: I had a great professor while studying Computer Science at the Air Force Academy who was my thesis advisor and he cultivated writing in a way that while writing your thesis you would produce enough materials for white papers and technical papers. I started writing fairly frequently for O’Reilly on the MacDevCenter site at the time. So, I had been doing development for Dojo and thought it would be a neat thing to write about. I sent in a pitch for an article on the topic and it sort of escalated and eventually someone got back to me and said maybe we want to write something bigger – maybe a book.

Q: How long did it take to write the book and what was that experience like?
A: The actual book writing process took roughly 10 months. I signed the contract last July and I put the finishing touches on it the first week of June. I would estimate I spent roughly 1200 hours writing the book. One thing about writing a book – it’s not just about knowing the material from a technical standpoint. There’s so much overhead. How do I organize these thoughts? What information do I put in what chapter? What’s the most logical ordering for chapters? How do I keep the content written in such a way that it engages the reader and doesn’t become boring, dry, technical material? I think I stayed true to that O’Reilly style of keeping it fun and engaging the readers. The hardest thing about writing the book in my opinion is that it has always been a moonlighting effort for me, it’s not my daytime job. So, if you can imagine, way more than 50% of your nights and weekends, for almost a year, being taken up. After you’ve been to work, had a long hard day, okay, you come home, eat dinner, bore your family for a while, then sit there for six hours writing till the wee hours of the morning – that’s the hardest part.

Q: What are your expectations for the book?
A: I personally always looked at a book as being successful if it goes into a second edition. It must have been good enough to keep selling beyond that first threshold. I think they’re printing between 8,000 and 12,000 copies of my book. I would be really happy if it goes into a second edition.

Q: As a result of writing “Dojo: The Definitive Guide” you’ve had a few new opportunities to share your expertise on the subject. Tell us about being invited to speak at OSCON, The Open Source Conference, and the June article in Linux Journal.
A: I was encouraged by my O’Reilly editor to submit a proposal for a talk, and I would imagine that having O’Reilly care enough to publish a book on the topic in the first place, probably helped some. Getting in to do the talk wasn’t a given, but having the book probably helped. My OSCON talk is on a component of Dojo called GFX. It’s a sub-project of Dojo that allows a developer to do drawing and animation on the screen using one of many backends…SVG, Microsoft Silverlight, VML and in theory you could plug in any kind of drawing backend into it, you write the code according to this GFX API, pick the backend you want to render it with and it just happens. You write the code once and point it anywhere.
I submitted a proposal for Linux Journal last summer. I was just perusing their site and noticed they had an issue coming out about web technology. I thought it might be a good way to get Dojo out there into the mainstream even further than the book.

Q: What have you learned going through this book-writing process?
A: I’ve really come to appreciate just how much work it is. The next time I see a typo in a book I’m going to give the author a lot more slack than I used to.
Knowing technical content is one thing. Being able to communicate is another thing. Being able to communicate technical content is a third thing. Then writing a book about it is entirely different.

Digital Reasoning is fortunate to employ some of the best and brightest minds in their fields and Matthew Russell is no exception. You can find his book, “Dojo: The Definitive Guide”, on bookshelves now. Subscribers to Linux Journal can click here to read Matthew’s article “Dojo: the JavaScript Toolkit with Industrial-Strength Mojo”.

Measuring associative networks for quality of analytics

Tuesday, June 24th, 2008

“I personally think we developed language because of our deep inner need to complain.” Jane Wagner

When it comes to most text analytical tools the only measures given are recall and precision. Some may tie them up nicely and appropriately in the f-measure which is simply the harmonic mean of those two numbers. Usually the discussion of quality ends there and you are quickly whisked off into discussions of functionality, user interface and processing speed.
As I wrote before there are many issues with measuring NLP tools. One of those issues is a lack of accredited measures to apply to them. At DRS we have at the root of our analysis something called the associative network. You can read about how these work in theory and examine a few examples of them. Generally there are a lot of problems with them revolving around their explosive need for memory and the time it takes to process them. At DRS we’ve solved a lot of those problems and find that medium sized corpora work just fine on your standard 2GB laptop. Let me briefly explain what an associative network is, as we’ve defined it.
An Associative Network is a set of related elements from a distribution of elements based on shared features to one or more elements selected from that distribution. Essentially, it is supposed to give you ranked elements that are semantically “closer” to the element(s) provided for comparison. The effectiveness of Associative Networks generally turns on (a) the selection of features of the elements in the distribution to compare and (b) the features of the element provided for comparison that are relevant in ranking. For instance, if I were to provide “fly” as a linguistic element to a set of linguistic elements in a data set, I might want “flying”, “traveling”, and “moving” as my expected associations. This, of course, assumes the “sense” of “fly” is as a predicate and not as an entity (such as an insect). If the latter were the case, I might expect the associations to be “insect”, “bug”, and “fruit fly”.
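
To make the idea concrete, here is a toy Python sketch that ranks elements by how many features they share with a query element. It is not how the DRS system works internally: Jaccard overlap is a simple similarity swapped in for illustration, and the feature sets and the "term/sense" keys are invented by hand.

```python
def jaccard(a, b):
    """Share of features two elements have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def associate(query, features, top_n=3):
    """Return the elements closest to `query`, best first."""
    q = features[query]
    scored = [(jaccard(q, f), name) for name, f in features.items() if name != query]
    scored = [(s, n) for s, n in scored if s > 0]   # drop unrelated elements
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_n]]


# Hand-made features (hypothetical). The sense is part of the key so that
# "fly" as a predicate and "fly" as an entity carry different features.
features = {
    "fly/predicate": {"subj:bird", "subj:plane", "prep:to", "adv:quickly"},
    "flying/predicate": {"subj:bird", "subj:plane", "prep:to"},
    "traveling/predicate": {"subj:plane", "prep:to", "adv:quickly", "subj:person"},
    "moving/predicate": {"prep:to", "adv:quickly", "subj:person"},
    "fly/entity": {"mod:buzzing", "verb:swat", "mod:tiny"},
    "insect/entity": {"mod:buzzing", "mod:tiny", "verb:swat", "verb:crawl"},
    "bug/entity": {"mod:tiny", "verb:swat", "verb:crawl"},
    "fruit fly/entity": {"mod:tiny", "mod:buzzing", "mod:ripe", "verb:breed"},
}

print(associate("fly/predicate", features))
print(associate("fly/entity", features))
```

With this toy data the predicate sense of "fly" pulls in the other predicates and the entity sense pulls in the insects, which is the behavior described above.
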
The key above is to recognize features about the elements as used in the data (“fly” as predicate and “fly” as entity would have very different features if properly measured) and which features are apt for comparison (the string “fly” may be insufficient to specify the appropriate set of features to prioritize because its sense may be ambiguous without the user selecting “entity” or “predicate” as a qualifier on “fly”). The ideal Associative Network solves the traditional Natural Language Processing problems of automatic thesauri creation, clustering of semantic nearest neighbors, and brings us very close to effective, unsupervised sense disambiguation technology. Those are some ways that Digital Reasoning applies its Associative Network technology.
This technology is exciting and is very new in commercial grade applications. It is important to understand the strengths and weaknesses of this tool. If you were evaluating an analytical tool it would be important for you to evaluate the accuracy of such a system and its utility. I was asked months ago to come up with a measure for Associative Networks. Since I am lazy I went and looked high and low for someone else’s measure first! Sadly there wasn’t anything out there. So I started to analyze what was coming out of our tool. Every attempt failed to produce something I would want to show because the scientific side of me rejected the processes I was developing. The problem was subjectivity. Your hard sciences like physics have unambiguous predictions from theory. In the Standard Model the strange quark’s charge is always -1/3 of the elementary charge, and experiments have consistently agreed with that prediction. When we get to softer sciences things start to get a little more ambiguous and subjective. As I stated in a prior post you have to reject subjectivity as much as possible.
So there I was staring at input terms for the associative net and the resulting list of associated terms given as output. What, therefore, defines good associations for “tree” or “Teddy Roosevelt” or “quark”? When I looked at the various ways in which the associative network can be tweaked (there are many variables that control the process) and the fact that different corpora will produce different associations, I began to think there was no way to measure this, at least not in a non-subjective way. Subjectively I can look at a list and using my own knowledge say whether the list “looked right” or not. That is hardly a measure. It certainly isn’t scientific.
Throwing all subjectivity out the window I needed to find a scientific method. I had to make predictions and prove them out through experimentation. Then it hit me. It’s not just the associations. It’s the network. There should be a way of looking at two terms and predicting a third in relationship to both. So, taking an analyst’s approach I looked at a document from one of my corpora and found that the USS Nimitz has 5,900 sailors and the reactor has a peak-output of 190 MW. OK, now we are talking. The intersection of associations between USS Nimitz and Sailors should contain 5,900 and the intersection of USS Nimitz and peak-output should contain 190 MW. It seems so simple and yet it eluded me for 2 months trying to solve this problem. I am currently working on a test of this concept and the write-up of the theory. I am sure I will come across some interesting issues and along the way discover more ways of testing associative nets and other semantically related data organization tools. By making these methods open it allows them to be used widely. By making them general in use (this method could be used on a wide variety of systems, including humans) they will have much more universal applicability. I’ll use this place as my initial forum to announce the results of the experiment and methods one can use to replicate the experiment.
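
A rough Python sketch of the intersection test, while the real experiment is still being written up: the `network` dictionary below is a hand-built stand-in for whatever an associative network actually returns, and the function names are assumptions made for illustration. Only the Nimitz facts come from the example above.

```python
def intersection_contains(network, term_a, term_b, expected):
    """True if `expected` is associated with both terms."""
    shared = network.get(term_a, set()) & network.get(term_b, set())
    return expected in shared


def prediction_score(network, predictions):
    """Fraction of (term_a, term_b, expected) predictions that hold."""
    if not predictions:
        return 0.0
    passed = sum(intersection_contains(network, a, b, e) for a, b, e in predictions)
    return passed / len(predictions)


# Hypothetical association lists, written by hand for this example.
network = {
    "USS Nimitz": {"5,900", "190 MW", "carrier", "reactor"},
    "sailors": {"5,900", "crew", "carrier"},
    "peak-output": {"190 MW", "reactor"},
}

# Predictions drawn from facts stated in a source document.
predictions = [
    ("USS Nimitz", "sailors", "5,900"),
    ("USS Nimitz", "peak-output", "190 MW"),
]

print(prediction_score(network, predictions))
```

Because each prediction either holds or fails, the score does not depend on anyone judging whether a list "looks right".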
