ID:	6085	Fixed in:
Issue Date:	2011-02-23 15:37 AEST	Owner:	CVS Support
Last Modified:	2012-03-16 12:17 AEST	Reporter:	Arthur Barrett
Current Est.	0.0 hours	% Complete:	0.0
Status:	NEW /	Severity:	enhancement

Affected:	unspecified
Description:	enh: server: integrate UIMA/Tika for search of bugs/files

Actions:

2011-02-23 15:37 AEST by Arthur Barrett - Server should have integration with Unstructured Information Management Architecture (UIMA).

UIMA is an industry standard for content analytics (an OASIS standard), apparently the only standard.

From wikipedia: an example is a logistics analysis software system that could convert unstructured data 
such as repair logs and service notes into relational tables. These tables can then be used by automated 
tools to detect maintenance or manufacturing problems.

I think this is the sort of feature that commercial users of SCCM would find useful - and therefore a 
good future enhancement.

The UIMA overview diagram even includes source code and issue management:
http://uima.apache.org/images/UimaIs.png

Previously the Bugzilla/Bonsai/Tinderbox toolchain included glimpse, which we proposed to customers as a search 
and knowledge management solution, including at meetings in LA back in 2004.  Glimpse 
was always challenging because it was very Unix-centric (though we did have it running on our old NT4 
server with Bugzilla 2.18 and CVSNT) and commercially licensed (dual licensed, from memory).  The UIMA 
alternative is under an Apache license, so it is friendly to both commercial and LGPL/GPL use.

UIMA consists of:
* components
* infrastructure
* frameworks


The UIMA components include:
* Annotators - extracting structured information from unstructured data.
* repositories


The UIMA infrastructure includes:
* tooling
* server


The UIMA framework is available for C++, plus Java and UIMA-AS/JMS (Java Message 
Service/ActiveMQ).  

The frameworks run the components.  Additional infrastructure support components include a simple 
server which makes the results of UIMA processing available in a simple, XML-based format (i.e. as a REST 
service). 

The major goal of UIMA is to transform unstructured information to structured information by 
orchestrating analysis engines to detect entities or relations and thus to build the bridge between the 
unstructured and the structured world.  
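
To make the "unstructured to structured" idea concrete, here is a rough sketch of what an annotator does, in plain Python. This is not the UIMA API: the Annotation class, the annotate() function and both regex patterns are made up for illustration, and a real annotator would be written against the UIMA framework instead.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A typed span over the source text (begin/end are character offsets)."""
    type: str
    begin: int
    end: int
    text: str


# Hypothetical patterns (assumptions, not from UIMA): "bug 1234" or "#1234"
# for issue references, and names ending in a few source-code extensions.
BUG_REF = re.compile(r"(?:bug\s+|#)(\d+)", re.IGNORECASE)
FILE_REF = re.compile(r"\b[\w./-]+\.(?:cpp|h|java|c)\b")


def annotate(text):
    """Return annotations for bug references and file names, in text order."""
    found = []
    for m in BUG_REF.finditer(text):
        # Only the digits are annotated, not the "bug " or "#" prefix.
        found.append(Annotation("BugRef", m.start(1), m.end(1), m.group(1)))
    for m in FILE_REF.finditer(text):
        found.append(Annotation("FileRef", m.start(), m.end(), m.group()))
    return sorted(found, key=lambda a: a.begin)
```

Run over a commit message, this yields typed spans (bug numbers, file names) that could be loaded into a table and queried, which is the kind of bridge described above.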

Apparently there is already a Tika Analysis Engine for UIMA, so perhaps we could run Tika on any incoming 
committed files?
http://tika.apache.org/0.9/formats.html
and
http://uima.apache.org/sandbox.html#tika.annotator

So in short:
* we integrate with Tika and the Tika Annotator during commit
   ( basically offering a UIMA Collection Reader)
* we integrate with UIMA Analysis in cvscontrol/evsmanager to use the results
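
The two steps above might be wired together roughly as follows. This is only a sketch under assumptions: extract_text() stands in for the Tika step and just reads plain text, and read_commit() and run_engines() are hypothetical names, not part of Tika, UIMA or CVSNT.

```python
from pathlib import Path


def extract_text(path):
    """Stand-in for the Tika step: only handles plain text files here."""
    return Path(path).read_text(encoding="utf-8", errors="replace")


def read_commit(paths, extract=extract_text):
    """Collection-reader role: yield one document per committed file."""
    for p in paths:
        yield {"path": str(p), "text": extract(p)}


def run_engines(documents, engines):
    """Apply each named analysis function to each document's text.

    Returns {path: {engine_name: result}} so a front end could look up
    the results per file later.
    """
    results = {}
    for doc in documents:
        results[doc["path"]] = {
            name: analyse(doc["text"]) for name, analyse in engines.items()
        }
    return results
```

A commit hook would call read_commit() with the list of committed files, and the management tool would read whatever run_engines() stored. The extractor is passed in as a parameter so that the plain-text stand-in can be swapped for a real Tika call without touching the rest.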


Based on research conducted today I think that loading the data into UIMA is already practical, 
but querying it and displaying the results is still a distant promise; see:
SemanticSearch 
http://www.alphaworks.ibm.com/tech/uima

2012-03-16 12:17 AEST by Arthur Barrett - Tika also seems to be related to the Lucene project.  There is also a CLucene, but it's based on an old 
(2.3.2) version of Lucene (latest is 3.5).

Note from customer:


The free Java-based full text search engine "Lucene", also used by Eclipse and Jira, is this:
	http://lucene.apache.org/core/

There also seem to be Lucene ports to other languages:
	http://wiki.apache.org/lucene-java/LuceneImplementations

E.g. one for C++:
	http://sourceforge.net/projects/clucene/
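
For reference, the core idea behind a full text search engine of this kind is an inverted index. A toy version in plain Python is sketched below. It is not the Lucene or CLucene API: TinyIndex and its methods are invented for illustration, and it leaves out ranking, stemming and on-disk storage.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: maps each lowercased term to the ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Index every word of text under doc_id."""
        for term in re.findall(r"\w+", text.lower()):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every term in the query (AND semantics)."""
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return set()
        hits = [self.postings.get(t, set()) for t in terms]
        return set.intersection(*hits)
```

Bug descriptions and committed file contents could both be added under their own ids, so a single query returns matching bugs and files together.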
