Integration with Black Duck Code Sight (BDCS) is no longer supported in TeamForge 17.1 and later.
TeamForge 17.1 (or later) is equipped with its own code search function powered by Elasticsearch.
You can do away with BDCS integration while upgrading to TeamForge 17.1 (or later) and set up
TeamForge Code Search, which is now one of the integral services of TeamForge. This section
discusses the features of TeamForge Code Search and what it takes to set up Code Search on your
site.
TeamForge Code Search powered by Elasticsearch
TeamForge's Code Search feature uses Elasticsearch as its engine for the indexing and retrieval of
documents. The Elasticsearch instance used in TeamForge has no customizations beyond what
Elasticsearch itself provides through its engine and API. For more information about
Elasticsearch, refer to its documentation.
TeamForge Code Search (Elasticsearch): How it works
TeamForge uses Elasticsearch engine 5.1.1. Elasticsearch provides
the engine for the indexing and retrieval of documents. TeamForge is equipped with its own
indexers that extract source files from the SCM repositories and send them to the
Elasticsearch engine for indexing. These indexers run as an hourly cron job. When the job
starts, it calls an internal TeamForge API that returns the list of repositories to be
indexed on the specified SCM integration server. The indexer then processes the
repositories, one at a time, starting with the repository that was most recently committed
to. For each repository, it determines whether a full index is needed or whether an incremental
index is sufficient to catch up since the last indexing operation. It then goes down the
list of files and processes them for indexing. Only files that need to be indexed are
extracted. For example, large binary files are not extracted because they would not be indexed
anyway. As the indexers crawl a repository, they skip any file larger than 1 MB. A file must be
UTF-8 text to be indexed: files are scanned for their encoding and converted to UTF-8 if
required. Binary files are skipped by the indexers, as are files that cannot be converted to
UTF-8. Every skipped file is logged along with the reason.
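The per-file checks described above can be sketched as follows. This is only an illustration, not the TeamForge indexer: the function name, the NUL-byte test for binary content, and the list of fallback encodings are assumptions, and the 1 MB limit is taken to mean 1,048,576 bytes.

```python
import codecs
from typing import Optional, Tuple

MAX_FILE_BYTES = 1024 * 1024           # "1 MB" from the text; exact byte count assumed
FALLBACK_ENCODINGS = ("utf-8", "cp1252")  # hypothetical list, tried in order


def prepare_for_indexing(data: bytes) -> Tuple[Optional[str], str]:
    """Return (text, detail); text is None when the file would be skipped.

    For indexed files, detail is the encoding that was detected.
    For skipped files, detail is the reason that would be logged.
    """
    if len(data) > MAX_FILE_BYTES:
        return None, "larger than 1 MB"

    # A UTF-16 byte order mark identifies text even though it contains NUL bytes.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        try:
            return data.decode("utf-16"), "utf-16"
        except UnicodeDecodeError:
            return None, "cannot be converted to UTF-8"

    # Crude binary heuristic (an assumption): text files rarely contain NUL bytes.
    if b"\x00" in data:
        return None, "binary content"

    for encoding in FALLBACK_ENCODINGS:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return None, "cannot be converted to UTF-8"
```

The decoded Python string stands in for the "converted to UTF-8" content, since it can be re-encoded as UTF-8 before being sent to the search engine.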
The engine in Elasticsearch is configured to convert the source to lower case and tokenize
it. A camel-case filter is used so that common programming elements are broken into
individual tokens based on camel case and other common separators used in function names,
and so on. A set of common programming terms (if, then, else, return, exit, and so on) is not
indexed by default.
How the data is indexed affects the search results, because partial matches are generally not
returned. For example, consider a function name in the source code such as TeamForgeHelper.
This would be indexed so that a search for any of these terms would find a match: Team,
Forge, Helper, TeamForgeHelper. No other term would match, not even "TeamForge" or
"Tea".
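The matching behavior in the TeamForgeHelper example can be imitated with a small tokenizer. This is a rough sketch and not the Elasticsearch analyzer that TeamForge configures: the regular expressions, the function names, and the stop-word set (limited to the five terms named above) are assumptions made for illustration.

```python
import re
from typing import Set

# Only the terms named in the text; the real list is longer ("etc.").
STOP_WORDS = {"if", "then", "else", "return", "exit"}

IDENTIFIER = re.compile(r"[A-Za-z0-9_]+")
# Splits on camel-case boundaries; underscores act as separators.
CAMEL_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def tokenize(source: str) -> Set[str]:
    """Lower-cased tokens: each whole identifier plus its camel-case parts."""
    tokens = set()
    for word in IDENTIFIER.findall(source):
        tokens.add(word.lower())
        for part in CAMEL_PART.findall(word):
            tokens.add(part.lower())
    return tokens - STOP_WORDS


def matches(source: str, term: str) -> bool:
    """A search term matches only if it equals a whole token."""
    return term.lower() in tokenize(source)
```

With this sketch, `matches("TeamForgeHelper", "Forge")` is true, while `matches("TeamForgeHelper", "TeamForge")` and `matches("TeamForgeHelper", "Tea")` are false, because neither string is a token on its own.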
Search results are returned to users with TeamForge RBAC applied at the repository
level. In other words, a user must have permission to view the entire repository to be able
to search it. This works with SVN path-based permissions (PBPs), but only if PBPs are
used to restrict write access. If PBPs are used to deny the user view access to certain
paths in the repository, then TeamForge Code Search denies the user the ability to search
the repository.
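The repository-level rule above amounts to an all-or-nothing filter on search hits. The following sketch models it with hypothetical types (RepoAccess, Hit) that are not part of any TeamForge API; it only illustrates that write-restricting PBPs leave a repository searchable, while any view-denying PBP removes the whole repository from the results.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List


@dataclass(frozen=True)
class RepoAccess:
    """Hypothetical summary of one user's access to one repository."""
    can_view: bool
    view_denied_paths: FrozenSet[str] = frozenset()
    write_denied_paths: FrozenSet[str] = frozenset()


@dataclass(frozen=True)
class Hit:
    """Hypothetical search hit: a file path inside a repository."""
    repository: str
    path: str


def can_search(access: RepoAccess) -> bool:
    # The user must be able to view the entire repository.
    # Write restrictions are irrelevant to searching.
    return access.can_view and not access.view_denied_paths


def filter_hits(hits: Iterable[Hit], access_by_repo: Dict[str, RepoAccess]) -> List[Hit]:
    """Keep only hits from repositories the user may search."""
    return [
        hit for hit in hits
        if hit.repository in access_by_repo and can_search(access_by_repo[hit.repository])
    ]
```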
Considerations while upgrading to TeamForge 17.1 (or later)
Consider the following points while upgrading from TeamForge
16.10 (or earlier) to TeamForge 17.1 (or later) versions.
- Code Search is now an integral part of TeamForge and is installed by default during TeamForge
installation.
- You can install TeamForge Code Search either on the TeamForge Application Server or on a
separate SCM Integration Server. Installing TeamForge Code Search on a separate server is
recommended if your site's indexing needs are high (a large number of repositories, for
example).
- TeamForge Code Search works with Git and SVN repositories. It does not support CVS
repositories.
- Elasticsearch is installed on the server whose SERVICES token includes the "codesearch"
identifier: either the TeamForge Application Server or the SCM Integration Server (if Code
Search runs on a separate server). Refer to the site-options.conf configuration section of the
TeamForge installation/upgrade instructions for more information.
- Elasticsearch needs 2 GB of JVM heap size by default on a TeamForge site. You must have
adequate RAM to accommodate the JVM heap requirements of Elasticsearch in addition to the
JVM heap requirements of other components such as JBoss, integrated applications, and so
on.
- If required, you can increase the JVM heap size for Elasticsearch. Set the
ELASTICSEARCH_JAVA_OPTS token to the Elasticsearch JVM heap size required for your site.
- Elasticsearch stores every document (for Code Search, each indexed source file) in addition
to the index itself. As a rule of thumb, the index for a repository is roughly the same size
as the working copy for that repository, but in practice it is likely to be smaller. Binary
files and files larger than 1 MB are not indexed at all, so any repository whose working copy
is large because of such files will have a much smaller index. By default, Elasticsearch
compresses the original files using LZ4 compression. It also offers a "best_compression"
codec that compresses the originals using the DEFLATE algorithm. In TeamForge, indexes have
been set to use this maximum compression.
- The Elasticsearch data and logs are stored in the /opt/collabnet/teamforge folder. Log
rotation, startup, and so on are managed the same way as for other TeamForge services.
- As Black Duck Code Sight (BDCS) is not supported:
  - Remove all BDCS tokens from your site-options.conf file while upgrading to TeamForge 17.1
    (or later). Otherwise, TeamForge create runtime fails. See Site options change log for a
    list of obsolete site-options.conf tokens.
  - There is no migration support for existing BDCS indexes. All repositories must be indexed
    afresh after upgrading to TeamForge 17.1 (or later). The list of repositories to index
    remains the same, though: whatever repositories were being indexed by BDCS are indexed by
    the new Code Search without any user intervention. However, if you had customized indexing
    for one or more repositories using the BDCS administration UI, that customization is not
    migrated and must be redone. You can now do this through the TeamForge UI.
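To get a feel for the index sizing rule of thumb in the list above, you could total the files in a working copy that are eligible for indexing. The script below is a rough planning aid under stated assumptions, not a TeamForge tool: the function names, the NUL-byte sniff for binary files, and the skipped .git and .svn metadata folders are illustrative choices. The result is an upper bound on the stored source, since it ignores compression and files that cannot be converted to UTF-8.

```python
import os

MAX_FILE_BYTES = 1024 * 1024   # files larger than 1 MB are not indexed
SCM_DIRS = {".git", ".svn"}    # SCM metadata folders, skipped (assumption)


def looks_binary(path: str, sample_size: int = 8192) -> bool:
    """Crude heuristic: a NUL byte in the first few KB suggests a binary file."""
    with open(path, "rb") as handle:
        return b"\x00" in handle.read(sample_size)


def estimate_indexable_bytes(working_copy: str) -> int:
    """Sum the sizes of files that would plausibly be indexed."""
    total = 0
    for root, dirs, files in os.walk(working_copy):
        # Prune metadata folders in place so os.walk does not descend into them.
        dirs[:] = [d for d in dirs if d not in SCM_DIRS]
        for name in files:
            path = os.path.join(root, name)
            if os.path.islink(path):
                continue
            size = os.path.getsize(path)
            if size > MAX_FILE_BYTES or looks_binary(path):
                continue
            total += size
    return total
```

Comparing this figure across repositories gives a relative sense of which ones will dominate disk usage on the Code Search server.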