A comparative study of grayscale conversion techniques applied to SIFT descriptors

In computer vision, gradient-based tracking is usually performed on monochromatic inputs, yet few studies consider the influence of the chosen color-to-grayscale conversion technique. This paper evaluates the impact of these conversion algorithms on tracking and homography calculation, both fundamental steps of augmented reality applications. Eighteen color-to-grayscale algorithms were investigated. The observations led the authors to conclude that the choice of method can cause significant discrepancies in overall performance. As a related finding, experiments also showed that the pure color channels (R, G, B) yielded more stability and precision than the other approaches.


I. INTRODUCTION
Tracking algorithms based on descriptors often use grayscale images to detect and extract features [11][13][2]. One of the reasons for using grayscale images instead of full-color images is to simplify the three-dimensional color space (R, G and B) into a single-dimensional, i.e., monochromatic, representation. This is important because it reduces computational cost and simplifies the algorithms. However, according to [10], the decision as to which color-to-grayscale mechanism should be used is still little explored. Studies tend to assume that, due to the robustness of the descriptors, the chosen grayscale conversion technique has little influence on the final result.
Several different methods are used in computer vision to perform color-to-grayscale conversion. The most common are techniques based on weighted averages of the red, green and blue channels, e.g., Intensity and Luminance [7]. Moreover, there are methods that adopt alternative strategies to generate a more accurate representation of brightness, such as Luma and Lightness [9], although none of these techniques was originally developed specifically for tracking or pattern recognition.
In [10], a case study is presented to demonstrate that there are significant changes in tracking results for different color-to-grayscale conversion algorithms. The influence of these methods was also analyzed using SIFT and SURF descriptors ([12] and [2], respectively), Local Binary Patterns (LBP) [15] and Geometric Blur [3]. Nevertheless, the mentioned studies did not examine the pure red, green and blue channels as options for color-to-grayscale conversion.
Since color-to-grayscale conversion results are a function of the three pure channels (R, G and B), converted images lose part of the information they contained: a pixel in the input image, formerly represented by three dimensions in the color space, is then represented by a single one. Furthermore, there is no guarantee that the various color-to-grayscale conversion techniques converge to the same grayscale intensity value, i.e., the input image changes depending on the grayscale conversion method. Having said that, this paper aims to replicate the experiments in [10] and perform further experiments using the pure R, G and B channels as grayscale conversion techniques themselves.

II. METHODS
This section describes the basic concepts involved in the experiments and results presented in this article. It encompasses color system concepts, gamma correction, color-to-grayscale conversion, SIFT and feature matching.

A. Color systems
Color-to-grayscale conversion algorithms all generate a monochromatic representation of the original color image as output. There are many color space representations, each one designed for a specific purpose and with its own coordinate system, for example, CMY [7], CMYK [7], RGBE [6], RGBW [5], HSL [1], HSV [1] and HSI [1]. The most common way of representing a pixel in a color space is RGB [7].
HSL and HSV are other common representations of points in the color space. These color space representations are cylindrical in shape and rearrange the geometry of the color space in an attempt to be more intuitive than the cube-shaped representation employed in the RGB model. The HSL, HSV and RGB representations of the color space can be seen in Figure 1.

B. Gamma Correction
Gamma correction is a nonlinear operation used to control signal amplitude. In digital image processing, this operation is used to control brightness and reduce gradient variation in video or still-image systems. Human vision, under common illumination conditions (neither pitch black nor blindingly bright), follows an approximate gamma function, with greater sensitivity to relative differences between darker tones than between lighter ones. Gamma correction follows the power law [16] according to Equation 1:

f(x) = A · x^γ    (1)

where x is the pixel intensity in the image matrix representation, A is a scalar and γ is the correction parameter. The usual values for A and γ are 1 and 1/2.2, respectively [16]. Figure 2 displays examples of different values of A and γ. We denote the gamma-corrected channels as R', G' and B'. Gamma correction may also be applied to a function's output; for example, the Luma algorithm corresponds to the Luminance algorithm applied to gamma-corrected input (R', G' and B').
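A minimal sketch of the power-law correction described above, applied to an 8-bit channel (the helper name `gamma_correct` and the normalization to [0, 1] are illustrative assumptions, not from the paper):

```python
import numpy as np

def gamma_correct(channel, A=1.0, gamma=1.0 / 2.2):
    """Apply the power law f(x) = A * x**gamma to an 8-bit channel.

    Intensities (0-255) are normalized to [0, 1] before the power law
    and rescaled afterwards, so 0 and 255 remain fixed points when A = 1.
    """
    x = np.asarray(channel, dtype=np.float64) / 255.0
    out = A * np.power(x, gamma)
    return np.clip(out * 255.0, 0, 255).astype(np.uint8)

# With gamma = 1/2.2, mid-gray values brighten while the extremes stay put.
print(gamma_correct(np.array([0, 128, 255])))
```

Note that with γ = 1/2.2 the mapping is concave, which matches the text's point that darker tones receive proportionally more of the output range.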

C. Color-to-Grayscale Conversion Algorithm
This section briefly describes the color-to-grayscale methods studied in this research. The most popular approach to the conversion problem is the Luminance algorithm, which approximates the image gradient to human visual perception [4]. Every grayscale conversion technique is a function G that receives a color image in R^(3×m×n) as input and outputs a monochromatic image in R^(m×n). Since all digital images used in this research are 8-bit per channel, the discrete pixel values respect a limit of L = 2^8 possible intensity levels; in other words, all values of the color input and the grayscale output lie between 0 and 255. On this scale, 0 represents black and 255 represents white. It is also assumed that R, G and B stand for a linear representation of the red, green and blue channels, respectively. The color-to-grayscale conversion produces a pixel matrix with values between 0 and 255. Table I shows the conversion algorithms used in [10].
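The formulas of Table I are not reproduced here; as a hedged illustration, the sketch below implements three common conversion functions G (the Rec. 601 luminance weights are a standard assumption, and the function names are illustrative, not the paper's):

```python
import numpy as np

def intensity(rgb):
    # Unweighted average of the three channels.
    return rgb.mean(axis=-1)

def luminance(rgb, w=(0.299, 0.587, 0.114)):
    # Weighted average approximating perceived brightness
    # (Rec. 601 weights assumed here).
    return rgb @ np.asarray(w)

def pure_channel(rgb, idx):
    # idx = 0, 1, 2 selects the pure R, G or B channel.
    return rgb[..., idx]

pixel = np.array([[200.0, 100.0, 50.0]])  # one RGB pixel
print(intensity(pixel))        # ~116.67
print(luminance(pixel))        # ~124.2
print(pure_channel(pixel, 0))  # 200.0
```

Each function maps the (..., 3) color array to a single-channel array of the same spatial shape, which is the R^(3×m×n) → R^(m×n) reduction described above.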
In addition to replicating the experiments, this paper also extends the research by adding six more grayscale conversion techniques: the pure channels (R, G and B) and the pure channels corrected by the gamma function (R', G' and B'). The idea is to evaluate tracking behavior using a grayscale that is an actual output of the camera rather than a function of the values given by this camera.

D. Scale Invariant Feature Transform
The image feature generation used in Lowe's method transforms an image into a large collection of feature vectors, each of which describes a point, named a keypoint, in the image. These vectors are called descriptors; they are invariant to translation, scaling and rotation, partially invariant to illumination changes, and robust to local geometric distortion. First, the keypoint locations are defined as maxima and minima of a Difference of Gaussians (DoG) function applied in scale space to a series of smoothed and resampled images [12]. Low-contrast candidate points and points along an edge are discarded using interpolation of the samples and the Hessian matrix [12]. Dominant orientations are then assigned to the localized keypoints. These steps ensure that the keypoints are more stable for matching and recognition. SIFT descriptors are finally obtained by considering the pixels within a radius around each key location.
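As a rough illustration of the DoG step, the sketch below blurs an image at two scales with a separable Gaussian and subtracts the results. This is a simplified stand-in for Lowe's full scale-space pyramid, and all names are illustrative:

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel with radius ~3 sigma."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

def blur(img, sigma):
    """Separable Gaussian blur: convolve rows, then columns."""
    k = gaussian_kernel(sigma)
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, tmp)

def dog(img, sigma1, sigma2):
    """Difference of Gaussians between two smoothing scales."""
    return blur(img, sigma2) - blur(img, sigma1)
```

On an isolated bright point, the DoG response is negative at the center and its total sum is near zero, which is the band-pass behavior SIFT exploits to localize keypoints across scales.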

E. Feature matching
The SIFT descriptors are stored and indexed, and then matched against the SIFT descriptors of the other image. The best candidate match for each keypoint is found by identifying its nearest neighbor in the database of keypoints from the training images. The nearest neighbor is defined as the keypoint with minimum Euclidean distance from the given descriptor vector. In some cases, however, the second-closest match may be very near the first, due to noise or other reasons. To handle this, the ratio of the closest distance to the second-closest distance is taken; if it is greater than 0.8, the match is rejected. This eliminates around 90% of false matches while discarding only 5% of correct matches [12]. The non-rejected points are called good matches.
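The ratio test described above can be sketched as follows, using a brute-force nearest-neighbor search rather than the indexed search real SIFT implementations use (function and variable names are illustrative):

```python
import numpy as np

def ratio_test_match(desc_a, desc_b, ratio=0.8):
    """Lowe's ratio test: keep a match only if the nearest neighbor is
    clearly closer than the second-nearest (distance ratio < `ratio`).

    desc_a, desc_b: arrays of shape (n, d) and (m, d) of descriptors.
    Returns a list of (index_in_a, index_in_b) good matches.
    """
    good = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)  # Euclidean distances
        order = np.argsort(dists)
        nearest, second = dists[order[0]], dists[order[1]]
        if nearest < ratio * second:
            good.append((i, int(order[0])))
    return good

a = np.array([[0.0, 0.0], [5.0, 5.0]])
b = np.array([[0.1, 0.0], [5.0, 5.1], [0.2, 0.1]])
print(ratio_test_match(a, b))
```

An ambiguous descriptor, i.e. one whose two closest candidates in `desc_b` are nearly equidistant, is simply dropped instead of producing a likely false match.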

III. EXPERIMENT
The experiment performed in this paper aimed to study the influence of grayscale conversion on descriptor-based tracking behavior and its results. This work focused on the influence of different grayscale conversions on SIFT descriptors [12].
For each image chosen as a template, four points were selected, as shown in Figure 3. These points were chosen because they have no detectable feature characteristics, i.e., they could only be estimated through homography calculation. Each tracked template path was compared to a corresponding ground truth, which specifies the real path of these four selected points along the video frames.
The performance of the different color-to-grayscale conversion techniques was analyzed based on their ability to maintain object tracking throughout the video. In other words, for a particular grayscale conversion algorithm, the number of frames that established sufficient correlation between the template and the actual scene was verified. This allows for homography estimation and the identification of the pre-established points in Figure 3. These were the points compared to the ground truth. Algorithms were ranked regarding their tracking stability using Equation 2 as a score function.
where v is the number of tested videos, n_i is the number of frames in video i where SIFT obtained sufficient features for homography calculation, and N_i is the total number of frames in video i. The experiment is summarized in Algorithm 1.
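Given these definitions and the 0-10 range of the scores reported in Section IV, one plausible form for the score function of Equation 2 is the sketch below; the scaling factor of 10 is an assumption inferred from the reported values, not stated in the text:

```latex
S \;=\; \frac{10}{v} \sum_{i=1}^{v} \frac{n_i}{N_i}
```

Under this form, an algorithm that sustains homography calculation in every frame of every video scores exactly 10.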

Algorithm 1 Experiment Steps
Feature extraction of the grayscale template T
Generate SIFT descriptors for the grayscale template T
for all video frames do
    if enough good matches then
        Pose estimation and corner point estimation
    end if
    Compare with the ground truth
end for

IV. RESULTS
The experiment was performed using five videos for each of the four templates, 20 samples in total. Each video had an average length of 10 seconds, recorded at 30 fps. The videos were named with the template initials followed by a number. The templates used can be seen in Figure 4. To increase the experiment's robustness, each video was filmed under randomly varied recording conditions. The set of all videos and their specifications can be seen in Table II.

A. Good matching
To calculate the homography, it is necessary to have at least four correspondence points [8]. The first part of the experiment verifies whether all grayscale conversions are capable of producing enough good matches to calculate the homography. As shown in Tables III, IV, V and VI, all color-to-grayscale algorithms are able to produce enough good matches in the majority of frames. However, the results in the next section show that good matching does not imply good homography.
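Once a homography H is estimated from the good matches, the four selected points are mapped through it in homogeneous coordinates. A minimal sketch (the helper `apply_homography` is illustrative, not from the paper):

```python
import numpy as np

def apply_homography(H, pts):
    """Map 2-D points through a 3x3 homography H.

    Points are lifted to homogeneous coordinates [x, y, 1], multiplied
    by H, and divided by the resulting scale component.
    """
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # to homogeneous
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]             # back to Cartesian

# An identity homography leaves the four template corners unchanged.
corners = np.array([[0.0, 0.0], [100.0, 0.0], [100.0, 100.0], [0.0, 100.0]])
print(apply_homography(np.eye(3), corners))
```

Because H has 8 degrees of freedom and each point correspondence constrains 2 of them, at least four correspondences are needed, matching the requirement from [8] cited above.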

B. Homography
The first template studied was that of Figure 4a. The default color-to-grayscale conversion algorithm in OpenCV [4] is Luminance. Considering that OpenCV is widely used in computer vision applications, it was expected that this algorithm would have a satisfactory outcome. The results using the proposed score metric are reported in Table VII.
Note in Table VII that the pure red channel obtained the highest score (9.24), even under varying lighting, change of camera and/or motion blur. The pure green channel had similar results to the red in many cases, with a score of 9.15. Since the Luminance algorithm is a weighted average of the three channels and the blue channel achieved a score of 0, Luminance may have been influenced by the blue channel's poor performance.
Since the red channel had the best score and the blue one the worst, template 2 was chosen essentially because it is a red-and-blue image. The gradient difference is therefore reduced, making it harder to extract image characteristics; consequently, SIFT descriptors were expected to perform badly. This template was used to exemplify a case where a single pure channel would not perform better than a function of all pure channels. The results using template 2 are in Table VIII.
The first noteworthy information in Table VIII is the red channel's performance, whose score decreased from 9.24 to 8.02, a drastic but expected result, as the template contained a lot of red; hence, no considerable gradient differences existed. The green channel had the best performance in this test, with a score of 9.39. This result might seem counter-intuitive at first, as the template used (Figure 4b) was manipulated to have no green intensity; however, it is important to note that all colors captured by CCD sensor models [14] (used in the majority of current digital cameras) are composed of the R, G and B channels. That means that all pixels are composed of a combination of those three colors. Other relevant results are the blue channel's performance (score 8.13) and the Luminance performance (score 5.50). The overall outcome suggests that, even when pure channels present good results, a function of these three channels (such as Luminance) will not necessarily present good results as well.
Template 3 was chosen in order to test the pure green channel; thus, this template naturally has little gradient difference in the green channel. For this test, it was expected that the pure green channel's tracking performance would decrease, as happened with the red channel in the previous test using template 2. Template 3 results are shown in Table IX. As shown in Table IX, the red channel achieved the best performance in the group, with a score of 9.97, similar to the test with template 1. As expected, the green channel had a performance similar to that of the red channel in the previous tests, its score decreasing from 9.39 to 8.10. Again, the traditional Luminance approach reached a lower score than all pure channel approaches.
Based on the tests using templates 2 and 3, it was possible to notice that the predominance of a single color in a template may be prejudicial to SIFT descriptors and to feature-based tracking. Furthermore, multiple-channel mix functions achieved a visibly lower performance when compared to pure channel approaches.
As modern cameras usually adopt the CCD system, at least one of the three primary channels should identify gradient variations that allow for feature extraction and tracking.To test this, an evaluation was conducted using a template that essentially presented only one channel.
Template 4 was produced by editing template 3: it kept virtually none of its original blue and red intensities and received a green intensity boost. After this modification, the expected result was to undermine the performance of the green channel and other green-dependent approaches (Luminance, for example). Experiment results can be examined in Table X.
As expected, the results in Table X display the low scores achieved by all approaches, including the pure channels, as template 4 had almost no gradient variation. The green channel's score plummeted from 8.10 to 1.34, a predictable result given the previous template manipulation, and the red channel had the best result, with a score of 5.78. Luminance scored 0.45, still lower than all pure channel approaches, and was surpassed by Intensity (score 3.49) because the green channel's influence on that method is lower.
The next step of the research was to evaluate the precision of the point estimation when compared to the ground truth. Figure 5 shows a path estimated using SIFT descriptors compared to the ground truth. This analysis used the mean squared error (MSE) for each template in each video as a metric to evaluate tracking precision. Results are displayed in Table XI. As seen in Table XI, only the red, green, Luminance, Intensity and Luster algorithms were able to calculate the homography in all cases. Among these five algorithms, the pure red and green channels perform better than the others, as they have the lowest MSE, obtaining inferior results only in the synthetic case (the experiment with template 4).
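A hedged sketch of such an MSE metric over tracked point paths; the exact averaging used in the paper is not specified, so this is one plausible formulation with illustrative names:

```python
import numpy as np

def tracking_mse(estimated, ground_truth):
    """Mean squared error between estimated and ground-truth point paths.

    Both arrays have shape (frames, points, 2); the metric averages the
    squared Euclidean distance over every tracked point in every frame.
    """
    diff = np.asarray(estimated) - np.asarray(ground_truth)
    return np.mean(np.sum(diff**2, axis=-1))

gt = np.zeros((2, 4, 2))        # 2 frames, 4 points at the origin
est = np.full((2, 4, 2), 1.0)   # every estimate off by (1, 1)
print(tracking_mse(est, gt))    # squared distance 1^2 + 1^2 = 2.0
```

A lower value means the four estimated points stayed closer to the ground-truth path across the whole video, which is the sense in which the red and green channels outperform the other algorithms here.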

V. CONCLUSION
The initial results show a significant variation in SIFT output and performance according to the grayscale method used to process the input frame. After comparing the results, one can point out that the pure channels (R, G and B) are better than the other approaches, generating numerically consistent outputs, which proved very effective for tracking using SIFT descriptors.
The computational cost to perform this type of tracking is negligible compared to that of all the other algorithms implemented, considering the absence of any test, operation or adjustment beyond the direct assignment of the channel values. Among the primary channels, the top performers were red and green. The blue channel had an unsatisfactory performance when compared to the red and green channels.

TABLE III. Template 1: % of frames with enough good matching.

SBC Journal on Interactive Systems, volume 6, number 2, 2015. ISSN: 2236-3297