Search code

Integration with Black Duck Code Sight (BDCS) is no longer supported in TeamForge 17.1 and later. TeamForge 17.1 (or later) is equipped with its own code search function powered by Elasticsearch. You can do away with BDCS integration while upgrading to TeamForge 17.1 (or later) and set up TeamForge Code Search, which is now one of the integral services of TeamForge. This section discusses the features of TeamForge Code Search and what it takes to set up Code Search on your site.

TeamForge Code Search powered by Elasticsearch

TeamForge's Code Search feature uses Elasticsearch as its engine for the indexing and retrieval of documents. Elasticsearch, used in TeamForge, has no customizations other than what it provides via its engine and API. For more information about Elasticsearch, refer to its documentation.

TeamForge Code Search (Elasticsearch): How it works

TeamForge uses Elasticsearch engine 5.1.1. Elasticsearch provides the engine for the indexing and retrieval of documents. TeamForge is equipped with its own indexers that extract source files from the SCM repositories and send them to the Elasticsearch engine for indexing. These indexers run as an hourly cron job. When the job starts, it calls an internal TeamForge API that returns the list of repositories to be indexed on the specified SCM integration server. The indexer then processes the repositories, one at a time, starting with the repository that was most recently committed to. For each repository, it determines if a full indexing is needed or just an incremental indexing is sufficient to catch up since the last indexing operation. It then goes down the list of files and processes them for indexing. Only files that need to be indexed are extracted. For example, large binary files are not extracted as they would not be indexed anyway. As the indexers crawl a repository, it would skip any file that is greater than 1MB. To index a file it has to be UTF8 text. Files are scanned for file encoding and converted to UTF8 if required. Binary files are skipped by the indexers as would be files that cannot be converted to UTF8 format. All files that are skipped are logged along with the reason.

The engine in Elasticsearch is configured to convert the source to lower case and tokenize it. A Camel case filter is used so that common programming elements are broken into individual tokens based on camel case and other common separators used in function names etc. A set of common programming terms (if, then, else, return, exit etc.) are not indexed by default.

How the data is indexed does effect the search results as partial matches are not generally returned. For example, consider a function name in the source code such as: TeamForgeHelper. This would be indexed so that a search for any of these terms would find a match: Team, Forge, Helper, TeamForgeHelper. Nothing else would match, including "TeamForge" or "Tea".

Search results are returned to users with TeamForge RBAC applied to it at a repository level. In other words, a user must have permission to view the entire repository to be able to search it. This works with SVN Path-based permissions (PBP), but only if PBPs are being used to restrict write access. If PBP's are used to deny the user view access to certain paths in the repository, then TeamForge Code Search denies the user the ability to search the repository.

Considerations while upgrading to TeamForge 17.1 (or later)

Consider the following points while upgrading from TeamForge 16.10 (or earlier) to TeamForge 17.1 (or later) versions.

Code Search is now an integral part of TeamForge, which is installed by default during TeamForge installation.
You can install TeamForge Code Search either on the TeamForge Application Server or on a separate SCM integration Server. It is recommended to install TeamForge Code Search on a separate server if your site's indexing needs are considerably high (large number of repositories, for example).
TeamForge Code Search works with GIT and SVN repositories. TeamForge Code Search has no support for CVS repositories.
Installation of Elasticsearch is determined by adding the "codesearch" identifier to the SERVICES token of either the TeamForge Application Server or the SCM Integration Server (if Code Search runs on a separate server). Refer to the site-options.conf configuration section of TeamForge installation/upgrade instructions for more information.
Elasticsearch needs 2GB of JVM heap size by default on a TeamForge site. You must have adequate RAM to accomodate the JVM heap requirements of Elasticsearch in addition to the JVM heap requirements of other components such as Jboss, integrated applications, and so on.
If required, you can increase the JVM heap size for Elasticsearch. Set the ELASTICSEARCH_JAVA_OPTS token with the Elasticsearch JVM heap size required for your site.
Elasticsearch stores every document, which for code search is each source file that is indexed, and it has its index itself as well. As a rule of thumb, the index for a repository would be roughly the same size as the working copy for that repository but in practice it will likely be smaller. Consider that all binary files and all files greater than 1 MB are not indexed at all. So, obviously any repository that has a large working copy due to these types of files will have a much smaller index. By default, Elasticsearch compresses the original files using LZ4 compression. It also offers a "best_compression" codec that compresses the originals using DEFLATE algorithm. In TeamForge, indexes have been updated to use the maximum compression.
The Elasticsearch data and logs are stored in /opt/collabnet/teamforge folder. Log rotation, startup and so on are all managed the same way as other TeamForge services.
As Black Duck Code Sight (BDCS) is not supported:
- remove all BDCS tokens from your site-options.conf file while upgrading to TeamForge 17.1 (or later). TeamForge create runtime fails otherwise. See Site options change log for a list of obsolete site-options.conf tokens.
- there is no migration support for existing BDCS indexes. All the repositories should be indexed afresh post upgrade to TeamForge 17.1 (or later). The list of repositories to index, remains the same though. After upgrading to 17.1 (or later), whatever repositories were being indexed by BDCS are indexed by the new Code Search without any user intervention. However, if you had customized indexing for one or more repositories using the BDCS administration UI, such information will not have been migrated and would need to be done again, but that can now be done via the TeamForge UI.