Functions represent some of the most powerful aspects of the R language.
They really mark the transition from being a user
of R to being a programmer of R.
The basic idea is that you can type at the command
line to explore some data and run some code.
But eventually you'll probably get to the point where
you need to do something a little bit more complex,
a little bit more than can be expressed
in a single line or maybe two lines.
And if you have to do this over and over again, then you're
usually going to want to encode that functionality in a function.
I'm going to talk about functions in three parts here.
First I'll talk about the basics of how
functions are written in R.
Then I'm going to talk a little bit about lexical
scoping and the scoping rules for the R language.
And last, I'm going to end with a little example.
So, functions in R are created using the function() directive,
and functions are stored as R objects just like anything else.
So you might have a vector of integers, a list of
different things, a data frame, and then you have a function.
In particular, R functions are
R objects of class "function".
The basic construct here is that you assign
to some object, here I call it f, the
function directive, which takes some
arguments, and then inside the curly braces
there is R code, which does whatever the function does.
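As a minimal sketch of that construct (the argument names x and y and the body are placeholders chosen only for illustration):

```r
# Assign a function object to the name f.
f <- function(x, y) {
    x + y
}

class(f)    # "function"
f(2, 3)     # 5
```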
One nice thing about R is that functions
are what are called first class objects.
So you can treat a function just like pretty much any other R object.
Importantly, this means that you can
pass functions as arguments to other functions.
This is actually
a very useful feature in statistics. Functions can also be nested,
so you can define a function inside of another function, and we'll
see what the implications of this are when we talk about lexical scoping.
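Both ideas can be sketched in a few lines. The names summarize_with and make_power are hypothetical, made up for this illustration:

```r
# A function that takes another function as an argument.
summarize_with <- function(values, fun) {
    fun(values)
}
summarize_with(c(1, 2, 3, 4), mean)    # 2.5
summarize_with(c(1, 2, 3, 4), max)     # 4

# A function defined inside another function and returned as a value.
make_power <- function(n) {
    pow <- function(x) {
        x^n
    }
    pow
}
square <- make_power(2)
square(3)    # 9
```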
The return value of a function is simply the
very last R expression in the function body to be evaluated.
So there's no special expression required for returning something from a function,
although there is a function called return(),
which we'll talk about in a second.
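Here is a small sketch of both styles, using two hypothetical functions: one relies on the last evaluated expression, the other uses return() to leave early.

```r
above_ten <- function(x) {
    x[x > 10]    # last expression evaluated, so this is the return value
}

first_negative <- function(x) {
    for (value in x) {
        if (value < 0) {
            return(value)    # explicit early exit from the function
        }
    }
    NA    # reached only if no negative value was found
}

above_ten(c(5, 12, 20))         # 12 20
first_negative(c(3, -1, -7))    # -1
first_negative(c(3, 1))         # NA
```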
Functions have what are called named arguments,
and the named arguments can potentially have default values.
A lot of these features are useful when
you're designing functions that may be used by other people.
For example, you may have a function that has a lot
of different arguments so you can tweak a lot of different things.
But most of the time, you don't have to change all those arguments.
You may only care about one or two.
So it's useful for some of the arguments to have default values.
First of all, there are the formal arguments, which
are the arguments that are included in the function definition.
So if you go back to the previous slide, the formal
arguments are the ones included inside the function definition there.
The formals() function takes a function as an input
and returns a list of all the formal arguments of that function.
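A quick sketch of formals() in use. The function f and its arguments are invented for the example; sd is the standard deviation function from the stats package.

```r
f <- function(a, b = 2, verbose = FALSE) {
    a + b
}

names(formals(f))     # "a" "b" "verbose"
formals(f)$b          # 2, the default value of b
names(formals(sd))    # "x" "na.rm"
```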
Not every function call in R makes use of all the formal arguments.
For example, if a function has ten different arguments, you may
not have to specify a value for all ten of them.
Function arguments can be missing, or they
may have default values that are used when they are not specified by the user.
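To make this concrete, here is a sketch with a hypothetical function, scale_values, that has two arguments with defaults and one argument that may be left missing, checked with missing():

```r
scale_values <- function(x, center = 0, scale = 1, label) {
    if (missing(label)) {
        label <- "unlabelled"    # fallback when the caller omits label
    }
    list(values = (x - center) / scale, label = label)
}

scale_values(c(2, 4))               # uses center = 0 and scale = 1
scale_values(c(2, 4), scale = 2)    # overrides only scale
```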
R function arguments can be matched positionally or by name.
This is key both when
you're writing a function and when you're calling it.
For example, take a look at the function sd, which calculates the standard
deviation of a set of numbers. sd takes an input x, which is the name
of the argument and which is going to be a vector of data.
There's a second argument called na.rm, and this controls whether
the missing values in the data should be removed or not.
The default value for na.rm is FALSE.
So by default, if you have missing data in the set of
numbers for which you want to calculate the
standard deviation, the missing values will not be removed, and the result is NA.
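A short sketch of what na.rm does, using a made-up vector with one missing value:

```r
x <- c(1, 2, 3, NA)

sd(x)                  # NA, because na.rm defaults to FALSE
sd(x, na.rm = TRUE)    # 1, the standard deviation of 1, 2, 3
```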
So, here I'm
simulating some data, just a hundred
normal random variables, and there's no missing data here.
If I just calculate sd on the vector,
it'll give me an estimate of the standard deviation.
If I say x equals mydata, that's the same thing.
Here I've named the argument, but otherwise
the data are the same, so it'll calculate the same standard deviation.
In the first example I didn't
name the argument,
so it defaulted to passing mydata as the first argument of the function.
In the next example here, I'm going to name both arguments.
I'm going to say x equals mydata, and na.rm equals FALSE.
That calculates the same thing as before.
Now when I name the arguments, I don't have to put them in any special order.
So for example, I could reverse the order of the arguments here,
say na.rm equals FALSE first, and then say x
equals mydata second, and that will produce exactly the same
result because I've named the arguments.
Now, what happens if I name one argument and don't name the other?
Well, what happens is that the named argument is set, and
you can think of it as being removed from the argument list, and
then any other things that are passed will be matched
to the remaining function arguments in the order in which they come.
So for example, sd, after you remove the na.rm
argument, only has one argument left, and so mydata
is assigned to that argument.
So all these expressions return exactly the same value.
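The calls described above can be written out as follows. The set.seed() call is an addition, there only to make the sketch repeatable, and the result names are arbitrary:

```r
set.seed(1)             # added so the simulated data are reproducible
mydata <- rnorm(100)    # a hundred normal random variables, no missing data

r1 <- sd(mydata)                       # positional
r2 <- sd(x = mydata)                   # named
r3 <- sd(x = mydata, na.rm = FALSE)    # both arguments named
r4 <- sd(na.rm = FALSE, x = mydata)    # both named, order reversed
r5 <- sd(na.rm = FALSE, mydata)        # one named, one positional

# All five calls compute the same value.
c(r1, r2, r3, r4, r5)
```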
Although all these expressions are
equivalent, I don't recommend all of them equally.
For example, I don't necessarily recommend reversing the order of the
arguments just because you can, even though it's valid if you name them,
because that can lead to some confusion.
Positional matching and matching by name can be mixed, and this
is often quite useful for functions that have very long argument lists.
For example, the lm function here, which
fits linear models to data, has this argument list.
The first argument is the formula, the second is
the data, and then subset, weights, et cetera.
You can see that the first five arguments here don't have any default value.
So the user supplies them, or leaves them missing when lm can do without them.
But then the method, the model, and the x arguments all have
default values, so if you don't specify
them, those values are used by default.
And so the following two function calls are equivalent.
I could specify the data first, then the formula, then the model,
and then the subset argument,
or I could specify the formula first, the data second,
the subset, and then say model is equal to FALSE.
Now the reason why the first one is okay is
that I matched the data argument by name.
You can imagine that it's taken out of the argument
list now. Then y ~ x isn't specified by name,
so it's given to the first argument that hasn't already been matched,
which in this case is the formula.
Model equals FALSE has been matched by name, so
I can get rid of that from the argument list too.
And then 1:100 has to be assigned
to the first argument that has not yet been matched.
In this case formula was already matched, and data was already matched,
so the next one is subset.
So 1:100 gets assigned to the subset argument.
This is a somewhat confusing way to call lm,
and I don't recommend that you do it this way.
I wrote it this way just to demonstrate
how positional matching and matching by name can work together.
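Written out, the two calls look like this. The data frame is simulated here purely so the sketch runs; its contents and the names fit1 and fit2 are assumptions, not part of the original example.

```r
set.seed(1)
mydata <- data.frame(x = rnorm(150))
mydata$y <- 2 * mydata$x + rnorm(150)

# data and model matched by name; y ~ x and 1:100 fall to the first
# unmatched arguments, formula and subset.
fit1 <- lm(data = mydata, y ~ x, model = FALSE, 1:100)

# The more conventional ordering: formula, data, subset, then model by name.
fit2 <- lm(y ~ x, mydata, 1:100, model = FALSE)

all.equal(coef(fit1), coef(fit2))    # TRUE
```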
A more common usage for lm, though, is the second
version here, which says lm of y ~ x,
so there is a formula there.
The next one is mydata, which is the
data set you're going to grab the data from,
and then the subset argument. The first three arguments
are commonly specified when you call lm.
The rest you may or may not specify, and so
if you just want to specify one of the later arguments,
it's easier to call it out by name.
Most of the time, named arguments are useful on the command line
when you have a long argument list and you want to use the defaults for everything
except one of the arguments, which may be in the middle or near the end
of the list, and you usually
can't remember exactly which argument it
is, whether it's the fourth, or the sixth,
or the tenth argument in the argument list.
So you just call it by name, and that way
you don't have to remember the order of the arguments in
the argument list.
Another example where this comes in handy is plotting, because
many of the plot functions have very long argument lists,
nearly all of which have default values, and you
may only want to tweak one specific argument.
So it's useful not to have to remember
where that argument falls in the argument list.
Function arguments can also be partially matched,
which is useful primarily for interactive work,
not so much for programming.
When you call a function, if an argument has a very long name,
you can match it partially: you type part of the argument name,
and as long as there's a unique match, the R
system will match the argument and assign the value to the correct one.
The order of operations that
R uses is this: first it checks for an exact match.
So if you name an argument,
it checks to see if there's
an argument that exactly matches that name.
If there's no exact match, it looks for a partial match.
And then if that doesn't work, it looks for a positional match.
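A small sketch of partial matching and the matching order. The function describe and its two deliberately similar argument names are hypothetical, invented for this illustration:

```r
x <- c(1, 2, 3, NA)
sd(x, na = TRUE)    # "na" partially matches na.rm, so this returns 1

describe <- function(value, values_only = FALSE) {
    if (values_only) value else paste("value:", value)
}

# Exact match wins: "value" is matched to value, even though it is also
# a prefix of values_only.
describe(value = 3)    # "value: 3"

# An ambiguous partial match is an error: "val" fits both arguments.
ambiguous <- tryCatch(describe(val = 3), error = function(e) "error")
ambiguous              # "error"
```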
Coding standards in R are really important because they help you make your code
readable and allow you and other people to understand what's going on in your code.
Now, of course, just like with any
other matter of style, whether
it's your clothing or whatever it is, it's difficult
to get everyone to agree on one set of ideas.
But I think there are a couple of very basic,
minimal standards that are important when you're coding in R.
So I'm just going to talk a little bit about some of
the coding standards that I think are important when you're writing
R code, and that I think will help make your code more readable
and more usable by others, if that's what you're trying to achieve.
The first principle, which I think is very
important in pretty much any programming language, not just
R, is that you should always write your code
using a text editor and save it as a text file.
A text
file is a kind of basic standard.
It usually doesn't have any formatting or any
special appearance; it's just text.
Typically it's going to be
ASCII text, but if you're working
in a non-English language
there may be other standard text encodings.
The basic idea is that a text file can be read by pretty much any
basic editing program.
These days, when you're writing something, there are a
lot of different tools that you can use.
If you're writing a book or a webpage or something like that, there
are all kinds of tools you can use to write those things.
But when you're writing code, you should always try to
use a text editor, because that's the
least common denominator, and it makes it so that
everyone will be able to access your code and improve upon it.
The second principle, which is very
important for readability, is to indent your code.
Indenting is something that's often hotly debated on mailing lists
and other discussion groups in
terms of how much indenting is appropriate.
I'm not going to get into that debate, although I do have some recommendations.
I think the most important thing
is that you understand why indenting is important.
Indenting is the idea that nested blocks of code
should be spaced over to the right a little bit more
than the code around them, so you can see how the
flow of the program goes based on the indenting alone.
Coupled with indenting is the third principle, which I think
is very simple: limit the width of your code.
Once you have indenting, it's possible to
indent off to the right forever, so you need
to limit on the right hand side how wide
your code is going to be, and usually this is
determined by the number of columns of text.
One possibility is to limit your code to about 80 columns of
text, so that the width of your code never exceeds that.
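As a rough illustration of these two principles together, here is a hypothetical function, classify_values, laid out with four-space indents and every line well under 80 columns. The function itself is invented only to show the layout.

```r
classify_values <- function(x, cutoff = 0) {
    labels <- character(length(x))
    for (i in seq_along(x)) {
        if (is.na(x[i])) {
            labels[i] <- "missing"
        } else if (x[i] > cutoff) {
            labels[i] <- "high"
        } else {
            labels[i] <- "low"
        }
    }
    labels
}

classify_values(c(-1, 2, NA))    # "low" "high" "missing"
```

The indenting alone shows where the loop and each branch of the if block begin and end.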
So, let's take a look at a quick example here.
Here you can see I've got RStudio open
with a simple code file with some R code in it.
First of all, let me just mention that
the editor in RStudio is a text editor.
It will always save the R files that you write as text files.
So we've already got that handled.
But you can see the indenting scheme here is equal to one space.
So every indent is one space.
And you can see that all the code is
mashed together here on the left hand side.
It's difficult to tell where the if blocks are,
where the else blocks are,
and where the function begins and ends.
So this indenting scheme makes the code not
very readable in this case.
We can change the indenting in RStudio.
We just go up to the Preferences menu here
and go to Code Editing.
Let me just change it to four.
And you can see that the margin column is set to
80 characters, so it will show you the margin when you've reached 80 characters.
Now I'm going to select all with Cmd+A, and then press Cmd+I to reindent.
So now you can see that the
indenting is a little bit nicer.
You can see where the function begins and ends, you can see where the
if blocks start and end, and the
structure of the program is much more obvious.
I'm going to change this one more time, though, because my personal
preference for indenting is to use eight spaces,
so I'm going to change this to eight,
hit OK, select all, and press Cmd+I.
And now you can see why
I prefer the eight spaces: it
really makes the structure of the code very obvious.
The spacing is nice and clear,
and it makes the code very readable in general.
So you can see that indenting is very important.
The biggest problem you might have is with too little indenting.
If you don't indent at all, or if you only use
a very small amount, the code becomes very mashed together.
So I recommend at least four
spaces for an indent, and I
prefer eight spaces for an indent, just
because it makes the code much more readable
and spaces it out much more nicely.
One of the advantages of having something like an
eight space indent, coupled with an 80 character margin
on the right hand side, is that it forces you
to think about your code in a slightly different way.
For example, if you have eight space
indents and you're going to have a for loop nested within
another for loop within another for loop, every time you nest another
for loop you have to indent over eight more spaces.
By the time you get to maybe your fourth nested for loop, you're
pretty much hitting the right hand side at the 80 column margin.
So the nice thing about the eight space
indent, coupled with the 80 column margin, is that it
prevents you from making some very
fundamental mistakes with code readability.
For example, with an eight space indent and an 80 column
margin, you might not feasibly be able to do more than
two nested for loops, but I think that's really
the boundary of what is readable in terms of code.
Typically, except for some special cases, a
three or four level nested for loop is
difficult to read, and you're probably better off
splitting it into separate functions or something like that.
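One way to sketch that kind of split is shown below. The task and all three function names are hypothetical: a job that would otherwise need three nested for loops (over matrices, rows, and values) is divided so that each function holds a single loop.

```r
# Innermost loop: add up the values in one row.
row_total <- function(row) {
    total <- 0
    for (value in row) {
        total <- total + value
    }
    total
}

# Middle loop: one total per row of a matrix.
matrix_row_totals <- function(m) {
    totals <- numeric(nrow(m))
    for (i in seq_len(nrow(m))) {
        totals[i] <- row_total(m[i, ])
    }
    totals
}

# Outer loop: apply the above to every matrix in a list.
all_row_totals <- function(matrices) {
    results <- vector("list", length(matrices))
    for (k in seq_along(matrices)) {
        results[[k]] <- matrix_row_totals(matrices[[k]])
    }
    results
}

mats <- list(matrix(1:4, nrow = 2), matrix(1:6, nrow = 3))
all_row_totals(mats)    # row totals 4 6 and 5 7 9
```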
So a good indenting policy not only
makes the code more readable, it can actually force you
to think about writing your code in a slightly different way.
That's a really nice advantage of having a logical
indenting policy coupled with a right-hand side restriction.
The last thing I want to talk about is limiting the length of your functions.
Functions in R can theoretically go on for quite
a long time, just like in any other language. But
just like in any other language, I think the logical thing
to do with a function is to limit it to one basic activity.
So for example, if your function is named read the data,
then your function should simply read the data. It should not read
the data, process it, fit a model, and then print some output.
Logical steps like
that should probably be split into separate functions.
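A minimal sketch of that separation, assuming four hypothetical functions, one per step. The data are supplied inline as text so the example runs without an external file.

```r
read_data <- function(text) {
    read.csv(text = text)           # only reads
}

process_data <- function(df) {
    df[complete.cases(df), ]        # only drops rows with missing values
}

fit_model <- function(df) {
    lm(y ~ x, data = df)            # only fits the model
}

report_fit <- function(fit) {
    print(coef(fit))                # only prints output
}

raw <- "x,y\n1,2\n2,4\n3,NA\n4,8"
clean <- process_data(read_data(raw))
fit <- fit_model(clean)
report_fit(fit)
```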
There are a couple of advantages to doing this.
First of all, it's nice to be able to
have a function written on a single page of code,
so you don't have to scroll endlessly to see
where all the code for the function goes.
If you can fit the entire function on one screen of the
editor, then you can look at the whole function and see what it does all at once.
Another advantage of splitting up your code into
logical functions is that if you use tools like traceback,
the profiler, or the debugger, these often tell you
where in the function call stack you are when a problem occurs.
If you have multiple functions that are logically divided
into separate pieces, then when a bug occurs and you know
that it occurs in a certain
function, you know where to go to fix things.
But if you just have a single function that goes
on forever and a bug occurs, then the only thing that the debugger or
the traceback or the profiler can tell you
is that there's a problem in this one function.
It's difficult for them to tell you where exactly the problem occurs.
So splitting up your functions has a secondary benefit, which
is that it can help you in debugging and profiling.
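Here is a small sketch of how that plays out, with two hypothetical functions. It uses tryCatch() and conditionCall() rather than an interactive traceback() so that it runs as a script; the idea is the same, since the error records which function it came from.

```r
check_positive <- function(values) {
    if (any(values <= 0)) {
        stop("values must be positive")
    }
    values
}

log_transform <- function(values) {
    log(check_positive(values))
}

failure <- tryCatch(log_transform(c(1, -2, 3)), error = function(e) e)

conditionMessage(failure)          # "values must be positive"
deparse(conditionCall(failure))    # "check_positive(values)"
```

Because the check lives in its own small function, the error points at check_positive rather than at one long function that does everything.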
So limiting the size of your functions is
very useful for readability and for debugging.
Of course, it's easy to go overboard and
end up with a hundred different three-line functions.
That's not really what
you want to do.
You just want to make sure that the separation into different functions
is logical, and that each function
does one thing in particular.
So those are my basic guidelines for writing code in R.
There are, of course, many other things that you might be able to think about.
But then we start bordering on areas that
we might disagree on.
So I'm not going to talk too much more
about coding standards, but the basic ideas are: always
use a text editor; always indent your code, I'd say at least four spaces;
limit on the right hand side how wide your code can be;
and always limit the size of your functions, so that
they're grouped into logical pieces of your program.
With those four things, I think
your code will be much more readable.
It'll be readable to you, it'll be readable to others, and it'll make
writing R code much more useful to everyone.