The C++ approach

A warning for Python programmers

Don’t skip this section. Some of the concepts apply to you as well.

Here’s a function that takes two inputs, the x-momentum and the y-momentum, and returns one output, the transverse momentum.

Listing 27: A simple C++ function to be used with RDataFrame.
#include <cmath>

// Transverse momentum from the x- and y-components of the momentum.
float pt_func( float xmom, float ymom ) {
    return std::sqrt( xmom*xmom + ymom*ymom );
}

Following the usual C++ convention, you’d define this function before you define your main routine. [1]

Too simple?

Obviously, this function is so simple that you’re not likely to define it separately just to pass it to RDataFrame::Define() (though see the section on lambda expressions later on). The point is to start with something simple as a “skeleton” for you to see how to create more complex functions of your own.

In order to use this function pt_func on a dataframe, you could do:

Listing 28: How to apply pt_func to each entry in an n-tuple
auto pt_dataframe = dataframe.Define("pt",pt_func,{"px","py"});

Note how this differs from what we’ve done before:

Listing 29: Our earlier approach to defining a new column in our n-tuple.
auto pt_dataframe = dataframe.Define("pt","sqrt(px*px + py*py)");

In Listing 29, we supply the function in the form of a text string, which ROOT hands to its internal just-in-time (JIT) compiler. In Listing 28, we let the C++ compiler compile the function from Listing 27, and we pass that function’s C++ “programming layer” name to the Define() method.

However, that’s not enough for RDataFrame::Define() to use pt_func. It has to be told which n-tuple columns to supply as arguments to the function. That’s why we also have to provide a list {"px","py"} as a third argument to Define. [2][3]
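To make the role of the column list concrete, here is a minimal sketch in plain C++ with no ROOT involved. ToyRow, toy_define, and toy_example are hypothetical names invented for this illustration; they are not part of ROOT, and the real Define is far more general. The sketch only shows the idea that column names are looked up in the data and their values are handed to the function by position, so the function's own parameter names (xmom, ymom) never have to match the column names (px, py).

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>
#include <vector>

// The function from Listing 27; its parameter names are private to C++.
float pt_func( float xmom, float ymom ) {
    return std::sqrt( xmom*xmom + ymom*ymom );
}

// Hypothetical stand-in for one n-tuple entry: column name -> value.
using ToyRow = std::map<std::string, float>;

// Hypothetical helper (not part of ROOT): look up the listed columns by
// name and pass their values to a two-argument function in list order.
float toy_define( float (*func)(float, float),
                  const ToyRow& row,
                  const std::vector<std::string>& columns ) {
    assert( columns.size() == 2 );  // one column name per function argument
    return func( row.at(columns[0]), row.at(columns[1]) );
}

// One made-up entry, using the column names from the text.
float toy_example() {
    ToyRow row = { {"px", 3.0f}, {"py", 4.0f}, {"pz", 12.0f} };
    return toy_define( pt_func, row, {"px", "py"} );
}
```

In this toy version, swapping the list to {"py","px"} would swap which value arrives as xmom and which as ymom; the order of the names in the list is what ties columns to arguments.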

This gives us a recipe:

  • Define a function that returns a value; e.g.,

    float some_function( float value1, float value2, ... ) {
       // Lines of code that use value1, value2, ...
       // to calculate a result.
       return result;
    }
    
  • Use that function in a Define, supplying the n-tuple columns to be passed to the function as a list of strings:

    auto new_dataframe =
        dataframe.Define("new_column",some_function,{"column1", "column2", ...});
    

If you’re writing a function that will be called by Filter, the recipe is almost the same, except that the function has to return a boolean result (true, false). For example:

Listing 30: An example of a function that could be used as an argument to Filter
bool energy_cut( float energy ) {
    return energy < 145;
}
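As a rough analogy for what a cut function does, the following plain C++ sketch applies energy_cut to a list of made-up energy values and keeps only those that pass. The helper toy_filter is hypothetical and is not how ROOT implements Filter; it just shows that a function returning true or false decides which entries survive. With a real dataframe you would instead pass the function and a column list to Filter, for instance a list naming whichever column holds the energy in your n-tuple.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// The cut from Listing 30: true means "keep this entry".
bool energy_cut( float energy ) {
    return energy < 145;
}

// Hypothetical helper (not part of ROOT): copy only the values for
// which the cut function returns true.
std::vector<float> toy_filter( const std::vector<float>& energies,
                               bool (*cut)(float) ) {
    std::vector<float> kept;
    std::copy_if( energies.begin(), energies.end(),
                  std::back_inserter(kept), cut );
    return kept;
}
```

Note that the comparison is strict: a value of exactly 145 does not pass this cut.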

[1] You could also define this function after your main routine, and just include a forward declaration before the main routine.
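A minimal sketch of that forward-declaration layout follows. The function run_analysis is a hypothetical stand-in for the body of your main routine, used here only so the fragment can be compiled and checked on its own.

```cpp
#include <cmath>

// Forward declaration: gives the compiler the function's signature
// before any code that calls it.
float pt_func( float xmom, float ymom );

// Hypothetical stand-in for the code in your main routine.
float run_analysis() {
    return pt_func( 3.0f, 4.0f );
}

// The full definition can now come after the code that uses it.
float pt_func( float xmom, float ymom ) {
    return std::sqrt( xmom*xmom + ymom*ymom );
}
```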

[2] Could we have avoided the need to specify {"px","py"} to Define if we’d used those names in the definition of pt_func? For example,

float pt_func( float px, float py ) {
    return sqrt( px*px + py*py );
}

You’ve probably already guessed that the answer is no. Remember, names that are defined in the programming layer have no meaning to ROOT’s internal layer. Even if we choose to use the same name in the programming layer as in the internal layer, ROOT has no direct way of matching those names between layers.

[3] If we omit the list of columns in Define, ROOT will assume that the user function takes every column in the n-tuple as an argument. For the extremely simple n-tuple tree1, you might be able to live with that; e.g.,

float pt_func( float c2, float eb, int ev,
               float xmom, float ymom, float zmom,
               float zvertex ) {
    return sqrt( xmom*xmom + ymom*ymom );
}

Then we could omit that third argument to Define:

auto pt_dataframe = dataframe.Define("pt",pt_func);

However:

  • The compiler will toss out a lot of warning messages about “unused variables”. This is accurate, since our function does not refer (for example) to zvertex in its body.

  • You can’t always control the order of the columns in an n-tuple. In particular, if you look at the n-tuple that you created using Snapshot, you may see that the method did not necessarily add the new columns to the end of the n-tuple.

  • The n-tuples in real experiments often have hundreds of columns. It’s impractical to list them all in the function definition. If you don’t, you may get “function not found” error messages when you compile your program; the number of arguments in your function (like the two in pt_func) won’t match the number of arguments assumed by the compiler (hundreds?).
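The argument-count mismatch in the last point can be illustrated with a small compile-time check in standard C++ (C++17 or later is assumed for std::is_invocable_v). This is only a sketch of the general rule that a two-argument function cannot be called with seven values; it does not reproduce ROOT's own error messages. The names PtFuncPtr, ok_with_two, and ok_with_seven are invented for this example.

```cpp
#include <cmath>
#include <type_traits>

// The two-argument function from Listing 27.
float pt_func( float xmom, float ymom ) {
    return std::sqrt( xmom*xmom + ymom*ymom );
}

// Type of a pointer to pt_func, used for the compile-time checks below.
using PtFuncPtr = decltype(&pt_func);

// pt_func can be called with two float arguments...
constexpr bool ok_with_two =
    std::is_invocable_v<PtFuncPtr, float, float>;

// ...but not with one value per column of a seven-column n-tuple.
constexpr bool ok_with_seven =
    std::is_invocable_v<PtFuncPtr,
                        float, float, int, float, float, float, float>;

static_assert( ok_with_two, "pt_func accepts two arguments" );
static_assert( !ok_with_seven, "pt_func does not accept seven arguments" );
```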