Starting from this chapter we will explore the ruby source code, beginning
with the declarations of object structures.
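What conditions are required for objects to exist? Many explanations can be given, but in reality there are three conditions that must be met: an object must have an identity, it must be possible to call methods on it, and it must keep data for each instance.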
In this chapter, we are going to confirm these three features one by one.
The most interesting file in this quest will be ruby.h, but we will also
briefly look at other files such as object.c, class.c or variable.c.
VALUE and objects
In ruby, the contents of an object are expressed by a C structure, always
handled via a pointer. A different kind of structure is used for each class,
but the pointer type will always be VALUE (figure 1).

Figure 1: VALUE and structure
Here is the definition of VALUE:
▼ VALUE
typedef unsigned long VALUE;
(ruby.h)
In practice, a VALUE must be cast to different types of structure pointers.
Therefore, if an unsigned long and a pointer have different sizes, ruby
will not work well. Strictly speaking, it will not work when pointer types are
bigger than sizeof(unsigned long). Fortunately, no recent machine has this
characteristic, although some time ago there were quite a few of them.
Several structures are available, depending on the class of the object:
struct RObject | all things for which none of the following applies
struct RClass | class object
struct RFloat | floating-point numbers
struct RString | string
struct RArray | array
struct RRegexp | regular expression
struct RHash | hash table
struct RFile | IO, File, Socket, etc.
struct RData | all the classes defined at C level, except the ones mentioned above
struct RStruct | Ruby’s Struct class
struct RBignum | big integers
For example, for a string object, struct RString is used, so we will have
something like the following.

Figure 2: String object
Let’s look at the definition of a few object structures.
▼ Examples of object structure
/* structure for ordinary objects */
struct RObject {
    struct RBasic basic;
    struct st_table *iv_tbl;
};

/* structure for strings (instance of String) */
struct RString {
    struct RBasic basic;
    long len;
    char *ptr;
    union {
        long capa;
        VALUE shared;
    } aux;
};

/* structure for arrays (instance of Array) */
struct RArray {
    struct RBasic basic;
    long len;
    union {
        long capa;
        VALUE shared;
    } aux;
    VALUE *ptr;
};
(ruby.h)
Before looking at every one of them in detail, let’s begin with something more general.
First, as VALUE is defined as unsigned long, it must be cast before
being used. That’s why Rxxxx() macros have been made for each object
structure. For example, for struct RString there is RSTRING(), for struct
RArray there is RARRAY(), etc. These macros are used like this:
VALUE str = ....;
VALUE arr = ....;
RSTRING(str)->len; /* ((struct RString*)str)->len */
RARRAY(arr)->len; /* ((struct RArray*)arr)->len */
Another important point to mention is that all object structures start with a
member basic of type struct RBasic. As a result, whatever the type of
structure pointed to by VALUE, if you cast this VALUE to struct RBasic*,
you will be able to access the content of basic.

Figure 3: struct RBasic
As you may have guessed, struct RBasic has been designed to contain some important
information shared by all object structures. The definition of struct RBasic
is the following:
▼ struct RBasic
struct RBasic {
    unsigned long flags;
    VALUE klass;
};
(ruby.h)
flags are multipurpose flags, mostly used to register the structure type
(for instance struct RObject). The type flags are named T_xxxx, and can be
obtained from a VALUE using the macro TYPE(). Here is an example:
VALUE str;
str = rb_str_new("", 0); /* creates a Ruby string (its structure is RString) */
TYPE(str); /* the return value is T_STRING */
The names of these T_xxxx flags are directly linked to the corresponding type
name, like T_STRING for struct RString and T_ARRAY for struct RArray.
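The two ideas above (every structure begins with the same basic member, and the type is recorded in flags) can be tried out in a few lines of C. What follows is only a sketch under assumptions: value_t, struct basic, the TAG_ constants, type_of() and the make_ helpers are names invented for this illustration, and the tag values are made up. It uses uintptr_t instead of unsigned long so that it also works on machines where a pointer is wider than a long.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for VALUE and struct RBasic (not ruby's definitions). */
typedef uintptr_t value_t;

enum { TAG_STRING = 0x07, TAG_ARRAY = 0x08, TAG_MASK = 0x3f };  /* made-up tag values */

struct basic { unsigned long flags; value_t klass; };

/* Two different object layouts that both start with struct basic. */
struct str_obj { struct basic basic; long len; char *ptr; };
struct ary_obj { struct basic basic; long len; value_t *ptr; };

/* Read the structure type without knowing the concrete structure:
   a pointer to a structure also points to its first member. */
int type_of(value_t v)
{
    assert(v != 0);
    return (int)(((struct basic *)v)->flags & TAG_MASK);
}

/* Helpers for the example: tag an object and hand it out as a value_t. */
value_t make_str(struct str_obj *s)
{
    s->basic.flags = TAG_STRING;
    s->basic.klass = 0;
    return (value_t)s;
}

value_t make_ary(struct ary_obj *a)
{
    a->basic.flags = TAG_ARRAY;
    a->basic.klass = 0;
    return (value_t)a;
}
```

Once an object has been turned into a value_t, type_of() is the only way left to find out which structure it was, which is exactly the situation the compiler can no longer help with.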
The other member of struct RBasic, klass, contains the class this object
belongs to. As the klass member is of type VALUE, what is stored is (a
pointer to) a Ruby object. In short, it is a class object.

Figure 4: object and class
The relation between an object and its class will be detailed in the “Methods” section of this chapter.
By the way, the name of this member is klass rather than class so that it does not
cause any conflict when the file is processed by a C++ compiler, where class is a
reserved word.
I said that the type of structure is stored in the flags member of struct
RBasic. But why do we have to store the type of structure? It’s to be able to
handle all different types of structure via VALUE. If you cast a pointer to
a structure to VALUE, the type information is lost, so the compiler
won’t be able to help. Therefore we have to manage the type ourselves. That’s
the consequence of being able to handle all the structure types in a unified
way.
OK, but the structure used is determined by the class, so why are the structure type and the class stored separately? Being able to find the structure type from the class should be enough. There are two reasons for not doing this.
The first one is (I’m sorry for contradicting what I said before) that in fact
there are structures that do not have a struct RBasic (i.e. they have no
klass member). For example struct RNode, which will appear in the second
part of the book. However, flags is guaranteed to be at the beginning
even in special structures like this. So if you put the type of
structure in flags, all the object structures can be differentiated in one
unified way.
The second reason is that there is no one-to-one correspondence between class
and structure. For example, all the instances of classes defined at the Ruby
level use struct RObject, so finding a structure from a class would require
keeping the correspondence between each class and structure. That’s why it’s
easier and faster to put the information about the type in the structure.
basic.flags
Saying only that basic.flags is used for different things
including the type of structure leaves me uneasy, so here’s a general
illustration of it (figure 5). There is no need to understand everything
right away; I just wanted to show its uses because it was bothering me.

Figure 5: Use of flags
Looking at the diagram, it seems that 21 bits are not used on 32-bit
machines. On these additional bits, the flags FL_USER0 to FL_USER8 are
defined, and are used for a different purpose for each structure. In the
diagram I also put FL_USER0 (FL_SINGLETON) as an example.
VALUE
As I said, VALUE is an unsigned long. As VALUE is a pointer, it may look
like void* would also be all right, but there is a reason for not doing
this. In fact, VALUE can also hold something that is not a pointer. The 6 cases
for which VALUE is not a pointer are the following:
1. small integers
2. symbols
3. true
4. false
5. nil
6. Qundef
I’ll explain them one by one.
As in Ruby all data are objects, integers are also objects. However, as there are lots of different instances of integers, expressing them as structures would risk slowing down execution. For example, when incrementing from 0 to 50000, we would hesitate to create 50000 objects just for that.
That’s why in ruby, to some extent, small integers are treated specially and
embedded directly into VALUE. “Small” means signed integers that can be held
in sizeof(VALUE)*8-1 bits. In other words, on 32-bit machines, the integers
have 1 bit for the sign and 30 bits for the integer part. Integers in this
range will belong to the Fixnum class and the other integers will belong to
the Bignum class.
Now let’s look at the INT2FIX() macro that converts a C int
to a Fixnum, and confirm that Fixnum values are directly embedded in VALUE.
▼ INT2FIX
#define INT2FIX(i) ((VALUE)(((long)(i))<<1 | FIXNUM_FLAG))
#define FIXNUM_FLAG 0x01
(ruby.h)
In brief, shift 1 bit to the left, and bitwise or it with 1.
0110100001000 | before conversion
1101000010001 | after conversion
That means that a Fixnum as VALUE will always be an odd number. On the other
hand, as Ruby object structures are allocated with malloc(), they are
generally placed at addresses that are multiples of 4. So they do not overlap with the
values of Fixnum as VALUE.
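To make the encoding concrete, here is a small sketch of the same tagging scheme written as functions. It is not ruby's code: value_t, FIXNUM_TAG, FIX_MAX, FIX_MIN, long_to_fix(), is_fix() and fix_to_long() are names made up for this example, and the decoding assumes the usual two's complement representation.

```c
#include <assert.h>
#include <limits.h>

typedef unsigned long value_t;     /* stand-in for VALUE */
#define FIXNUM_TAG 0x01UL          /* same bit as FIXNUM_FLAG in the text */

/* Largest and smallest integers that fit in sizeof(value_t)*8-1 bits. */
#define FIX_MAX (LONG_MAX >> 1)
#define FIX_MIN (-FIX_MAX - 1)

/* Encode: shift left by one bit and set the lowest bit.
   The shift is done on an unsigned value so that negative inputs
   do not cause undefined behaviour in C. */
value_t long_to_fix(long i)
{
    assert(i >= FIX_MIN && i <= FIX_MAX);
    return ((unsigned long)i << 1) | FIXNUM_TAG;
}

/* A tagged integer always has its lowest bit set, so it is always odd. */
int is_fix(value_t v)
{
    return (int)(v & FIXNUM_TAG);
}

/* Decode: v is 2*i + 1, so (v - 1) / 2 recovers i exactly.
   Converting v to long first assumes two's complement wrap-around. */
long fix_to_long(value_t v)
{
    assert(is_fix(v));
    return ((long)v - 1) / 2;
}
```

The bit pattern 0110100001000 is 3336 in decimal, and long_to_fix(3336) gives 6673, which is 1101000010001 in binary.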
Also, to convert int or long to VALUE, we can use macros like
INT2NUM() or LONG2NUM(). Any conversion macro XXXX2XXXX with a name
containing NUM can manage both Fixnum and Bignum. For example if
INT2NUM() can’t convert an integer into a Fixnum, it will automatically
convert it to Bignum. NUM2INT() will convert both Fixnum and Bignum to
int. If the number can’t fit in an int, an exception will be raised, so
there is no need to check the value range.
What are symbols?
As this question is quite troublesome to answer, let’s start with the reasons
why symbols were necessary. First, let’s start with the ID type used inside
ruby. It’s like this:
▼ ID
typedef unsigned long ID;
(ruby.h)
This ID is a number having a one-to-one association with a string. However,
it is not possible to associate every string in this world with
a numerical value, so the one-to-one relationship only holds
inside one ruby process. I’ll speak of the method to find an ID in the
next chapter “Names and name tables”.
In language implementations, there are a lot of names to handle. Method names
or variable names, constant names, file names, class names… It’s
troublesome to handle all of them as strings (char*), because of memory
management and memory management and memory management… Also, lots of
comparisons would certainly be necessary, but comparing strings character by
character will slow down the execution. That’s why strings are not handled
directly; something else is associated with them and used instead. And generally
that “something” will be an integer, as integers are the simplest to handle.
These IDs appear as symbols in the Ruby world. Up to ruby 1.4, the
values of ID were converted to Fixnum, but used as symbols. Even today
these values can be obtained using Symbol#to_i. However, as real use results
came piling up, it was understood that making Fixnum and Symbol the same
was not a good idea, so since 1.6 an independent class Symbol has been
created.
Symbol objects are used a lot, especially as keys for hash tables. That’s
why Symbol, like Fixnum, was made to be embedded in VALUE. Let’s look at the
ID2SYM() macro converting an ID to a Symbol object.
▼ ID2SYM
#define SYMBOL_FLAG 0x0e
#define ID2SYM(x) ((VALUE)(((long)(x))<<8|SYMBOL_FLAG))
(ruby.h)
When shifting 8 bits left, x becomes a multiple of 256, that means a
multiple of 4. Then, after a bitwise or (in this case it’s the same as
adding) with 0x0e (14 in decimal), the VALUE expressing the symbol is not
a multiple of 4. Nor is it an odd number. So it does not overlap the range of
any other VALUE. Quite a clever trick.
Finally, let’s see the reverse conversion of ID2SYM(), SYM2ID().
▼ SYM2ID()
#define SYM2ID(x) RSHIFT((long)x,8)
(ruby.h)
RSHIFT is a bit shift to the right. As a right shift may or may not preserve the sign
depending on the platform, it was made a macro.
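The round trip between an ID and its symbol can be sketched in the same way as for Fixnum. Again, this is an illustration and not ruby's code: value_t, name_id, SYM_TAG, SYM_SHIFT, id_to_sym(), is_sym() and sym_to_id() are invented names. The sketch works on unsigned values, so the question of whether a right shift keeps the sign does not arise here.

```c
#include <assert.h>

typedef unsigned long value_t;   /* stand-in for VALUE */
typedef unsigned long name_id;   /* stand-in for ID */

#define SYM_TAG   0x0eUL         /* same value as SYMBOL_FLAG in the text */
#define SYM_SHIFT 8

/* Encode: move the ID out of the low byte and put the tag there. */
value_t id_to_sym(name_id id)
{
    return (id << SYM_SHIFT) | SYM_TAG;
}

/* A symbol is recognised by its low byte alone. */
int is_sym(value_t v)
{
    return (v & 0xffUL) == SYM_TAG;
}

/* Decode: drop the low byte again. */
name_id sym_to_id(value_t v)
{
    assert(is_sym(v));
    return v >> SYM_SHIFT;
}
```

For example id_to_sym(3) is 0x30e, an even number that is not a multiple of 4, so it can be confused neither with a Fixnum nor with an aligned pointer.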
true false nil
These three are Ruby special objects. true and false represent the boolean
values. nil is an object used to denote that there is no object. Their
values at the C level are defined like this:
▼ true false nil
#define Qfalse 0 /* Ruby's false */
#define Qtrue 2 /* Ruby's true */
#define Qnil 4 /* Ruby's nil */
(ruby.h)
This time the values are even numbers, but as values such as 0 or 2 can’t be used by pointers, they can’t overlap with other
VALUE. It’s because usually the first block of
virtual memory is not allocated, to make programs dereferencing a NULL
pointer crash.
And as Qfalse is 0, it can also be used as false at C level. In practice, in
ruby, when a function returns a boolean value, it’s often made to return an
int or VALUE, and returns Qtrue/Qfalse.
For Qnil, there is a macro dedicated to check if a VALUE is Qnil or not,
NIL_P().
▼ NIL_P()
#define NIL_P(v) ((VALUE)(v) == Qnil)
(ruby.h)
The name ending with p is a notation coming from Lisp denoting that it is a
function returning a boolean value. In other words, NIL_P means “is the
argument nil?”. It seems the “p” character comes from “predicate”. This
naming rule is used at many different places in ruby.
Also, in Ruby, false and nil are false and all the other objects are true.
However, in C, nil (Qnil) is true. That’s why in C a Ruby-style macro,
RTEST(), has been created.
▼ RTEST()
#define RTEST(v) (((VALUE)(v) & ~Qnil) != 0)
(ruby.h)
As in Qnil only the third lowest bit is 1, in ~Qnil only the third lowest
bit is 0. Then only Qfalse and Qnil become 0 with a bitwise and.
!=0 has been added to be certain to only have 0 or 1, to satisfy the
requirements of the glib library that only wants 0 or 1
([ruby-dev:11049]).
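The bit trick is easy to check by hand. The sketch below repeats the expression of RTEST() as a function and uses the three constant values quoted above; value_t, the Q_ names and ruby_truthy() are names chosen for this example only.

```c
#include <assert.h>

typedef unsigned long value_t;   /* stand-in for VALUE */

/* The values of Qfalse, Qtrue and Qnil quoted in the text. */
#define Q_FALSE 0UL
#define Q_TRUE  2UL
#define Q_NIL   4UL

/* Same expression as RTEST(): clear the one bit that distinguishes
   Qnil from Qfalse, then compare with 0 so the result is exactly 0 or 1. */
int ruby_truthy(value_t v)
{
    return (v & ~Q_NIL) != 0;
}
```

Note that the value 1, which is how the Fixnum 0 is represented, is true here, just as 0 is true in Ruby.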
By the way, what is the ‘Q’ of Qnil? ‘R’ I would have understood but why
‘Q’? When I asked, the answer was “Because it’s like that in Emacs”. I did
not have the fun answer I was expecting…
Qundef
▼ Qundef
#define Qundef 6 /* undefined value for placeholder */
(ruby.h)
This value is used to express an undefined value in the interpreter. It can’t be found at all at the Ruby level.
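Putting the non-pointer cases together, a VALUE can be sorted into one of seven kinds just by looking at its bits. The following sketch shows one way to do it; enum kind, classify() and is_immediate() are names invented here, and the order of the tests is my own choice, not taken from ruby.

```c
#include <assert.h>

typedef unsigned long value_t;   /* stand-in for VALUE */

enum kind {
    KIND_FIXNUM, KIND_SYMBOL, KIND_FALSE, KIND_TRUE,
    KIND_NIL, KIND_UNDEF, KIND_POINTER
};

/* Decide what a value_t holds from its bit pattern alone. */
enum kind classify(value_t v)
{
    if (v & 0x01UL) return KIND_FIXNUM;              /* lowest bit set */
    if ((v & 0xffUL) == 0x0eUL) return KIND_SYMBOL;  /* low byte is the symbol tag */
    switch (v) {
      case 0: return KIND_FALSE;
      case 2: return KIND_TRUE;
      case 4: return KIND_NIL;
      case 6: return KIND_UNDEF;
    }
    return KIND_POINTER;  /* anything else is assumed to be a structure address */
}

/* True when the value is not a pointer to an object structure. */
int is_immediate(value_t v)
{
    return classify(v) != KIND_POINTER;
}
```

Only a value for which is_immediate() is false may be cast to a structure pointer and dereferenced.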
I already brought up the three important points of a Ruby object, that is having an identity, being able to call a method, and keeping data for each instance. In this section, I’ll explain in a simple way the structure linking objects and methods.
struct RClass
In Ruby, classes exist as objects during the execution. Of course. So there
must be a structure for class objects. That structure is struct RClass. Its
structure type flag is T_CLASS.
As classes and modules are very similar, there is no need to differentiate their
content. That’s why modules also use the struct RClass structure, and are
differentiated by the T_MODULE structure flag.
▼ struct RClass
struct RClass {
    struct RBasic basic;
    struct st_table *iv_tbl;
    struct st_table *m_tbl;
    VALUE super;
};
(ruby.h)
First, let’s focus on the m_tbl (Method TaBLe) member. struct st_table is
a hash table used everywhere in ruby. Its details will be explained in the
next chapter “Names and name tables”, but basically, it is a table mapping
names to objects. In the case of m_tbl, it keeps the
correspondence between the name (ID) of the methods possessed by this class
and the method entities themselves.
The fourth member super keeps, as its name suggests, the superclass. As it’s a
VALUE, it’s (a pointer to) the class object of the superclass. In Ruby there
is only one class that has no superclass (the root class): Object.
However I already said that all Object methods are defined in the Kernel
module, Object just includes it. As modules are functionally similar to
multiple inheritance, it may seem that having just super is problematic, but
in ruby some clever changes are made to make it look like single
inheritance. The details of this process will be explained in the fourth
chapter “Classes and modules”.
Because of this, super of the structure of Object points to struct
RClass of the Kernel object. Only the super of Kernel is NULL. So
contrary to what I said, if super is NULL, this RClass is the Kernel
object (figure 6).

Figure 6: Class tree at the C level
With classes structured like this, you can easily imagine the method call
process. The m_tbl of the object’s class is searched, and if the method was
not found, the m_tbl of super is searched, and so on. If there is no more
super, that is to say the method was not found even in Object, then it
must not be defined.
The sequential search process in m_tbl is done by search_method().
▼ search_method()
static NODE*
search_method(klass, id, origin)
    VALUE klass, *origin;
    ID id;
{
    NODE *body;

    if (!klass) return 0;
    while (!st_lookup(RCLASS(klass)->m_tbl, id, &body)) {
        klass = RCLASS(klass)->super;
        if (!klass) return 0;
    }

    if (origin) *origin = klass;
    return body;
}
(eval.c)
This function searches the method named id in the class object klass.
RCLASS(value) is the macro doing:
((struct RClass*)(value))
st_lookup() is a function that searches in st_table the value
corresponding to a key. If the value is found, the function returns true and
puts the found value at the address given in third parameter (&body).
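If you want to play with this search outside of ruby, the loop can be reproduced with very little code. The sketch below is a simplification under assumptions: a plain array stands in for st_table, a function pointer stands in for the method body, and every name in it (struct klass, lookup_one(), find_method(), the sample classes and IDs) is invented for the example.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long name_id;      /* stand-in for ID */
typedef int (*method_fn)(void);     /* stand-in for a method body */

struct method_entry { name_id id; method_fn body; };

/* Simplified class object: a plain array replaces st_table. */
struct klass {
    const struct method_entry *m_tbl;   /* methods defined directly in this class */
    size_t n_methods;
    const struct klass *super;          /* NULL when there is no superclass */
};

/* Search one class only; returns 1 and stores the body when found. */
static int lookup_one(const struct klass *k, name_id id, method_fn *body)
{
    size_t i;
    for (i = 0; i < k->n_methods; i++) {
        if (k->m_tbl[i].id == id) {
            *body = k->m_tbl[i].body;
            return 1;
        }
    }
    return 0;
}

/* Same loop shape as search_method(): try the class, then each superclass. */
method_fn find_method(const struct klass *k, name_id id, const struct klass **origin)
{
    method_fn body = NULL;

    if (!k) return NULL;
    while (!lookup_one(k, id, &body)) {
        k = k->super;
        if (!k) return NULL;
    }
    if (origin) *origin = k;
    return body;
}

/* Sample data: Child inherits from Base and overrides to_s. */
enum { ID_TO_S = 1, ID_HASH = 2, ID_MISSING = 3 };
int base_to_s(void)  { return 1; }
int base_hash(void)  { return 2; }
int child_to_s(void) { return 3; }

const struct method_entry base_methods[]  = { { ID_TO_S, base_to_s }, { ID_HASH, base_hash } };
const struct method_entry child_methods[] = { { ID_TO_S, child_to_s } };
const struct klass base_class  = { base_methods, 2, NULL };
const struct klass child_class = { child_methods, 1, &base_class };
```

Looking up ID_TO_S from child_class stops at the child, looking up ID_HASH continues to base_class, and origin reports in which class the method was finally found.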
Nevertheless, doing this search each time whatever the circumstances would be
too slow. That’s why in reality, once called, a method is cached. So starting
from the second time it will be found without following super one by one.
This cache and its search will be seen in the 15th chapter “Methods”.
In this section, I will explain the implementation of the third essential condition, instance variables.
rb_ivar_set()
Instance variables are what allows each object to store characteristic data.
Having it stored in the object itself (i.e. in the object structure) may seem
all right but how is it in practice? Let’s look at the function
rb_ivar_set() that puts an object in an instance variable.
▼ rb_ivar_set()
/* assign val to the instance variable id of obj */
VALUE
rb_ivar_set(obj, id, val)
    VALUE obj;
    ID id;
    VALUE val;
{
    if (!OBJ_TAINTED(obj) && rb_safe_level() >= 4)
        rb_raise(rb_eSecurityError,
                 "Insecure: can't modify instance variable");
    if (OBJ_FROZEN(obj)) rb_error_frozen("object");
    switch (TYPE(obj)) {
      case T_OBJECT:
      case T_CLASS:
      case T_MODULE:
        if (!ROBJECT(obj)->iv_tbl)
            ROBJECT(obj)->iv_tbl = st_init_numtable();
        st_insert(ROBJECT(obj)->iv_tbl, id, val);
        break;
      default:
        generic_ivar_set(obj, id, val);
        break;
    }
    return val;
}
(variable.c)
rb_raise() and rb_error_frozen() are both error checks. Error checks are
necessary, but they are not the main part of the processing, so you should ignore
them at first read.
After removing the error handling, only the switch remains, but this
switch (TYPE(obj)) {
  case T_aaaa:
  case T_bbbb:
    ...
}
form is characteristic of ruby. TYPE() is the macro returning the type
flag of the object structure (T_OBJECT, T_STRING, etc.). In other words as
the type flag is an integer constant, we can branch depending on it with a
switch. Fixnum or Symbol do not have structures, but inside TYPE() a
special treatment is done to properly return T_FIXNUM and T_SYMBOL, so
there’s no need to worry.
Well, let’s go back to rb_ivar_set(). It seems only the treatments of
T_OBJECT, T_CLASS and T_MODULE are different. These 3 have been chosen on
the basis that their second member is iv_tbl. Let’s confirm it in practice.
▼ iv_tbl
/* TYPE(val) == T_OBJECT */
struct RObject {
    struct RBasic basic;
    struct st_table *iv_tbl;
};

/* TYPE(val) == T_CLASS or T_MODULE */
struct RClass {
    struct RBasic basic;
    struct st_table *iv_tbl;
    struct st_table *m_tbl;
    VALUE super;
};
(ruby.h)
iv_tbl is the Instance Variable TaBLe. It stores instance variable names and
their corresponding value.
In rb_ivar_set(), let’s look again at the code for the structures having
iv_tbl.
if (!ROBJECT(obj)->iv_tbl)
    ROBJECT(obj)->iv_tbl = st_init_numtable();
st_insert(ROBJECT(obj)->iv_tbl, id, val);
break;
ROBJECT() is a macro that casts a VALUE into a struct
RObject*. It’s possible that obj points to a struct RClass, but as
we’re only going to access the second member no problem will occur.
st_init_numtable() is a function creating a new st_table. st_insert() is
a function doing associations in a st_table.
In conclusion, this code does the following: if iv_tbl does not exist, it
creates it, then stores the [variable name → object] association.
Warning: as struct RClass is a class object, this instance variable table is
for the use of the class object itself. In Ruby programs, it corresponds to
something like the following:
class C
  @ivar = "content"
end
generic_ivar_set()
For objects for which the structure used is not T_OBJECT, T_MODULE, or
T_CLASS, what happens when modifying an instance variable?
▼ rb_ivar_set() in the case there is no iv_tbl
default:
    generic_ivar_set(obj, id, val);
    break;
(variable.c)
Control is transferred to generic_ivar_set(). Before looking at this
function, let’s first explain its general idea.
Structures that are not T_OBJECT, T_MODULE or T_CLASS do not have an
iv_tbl member (the reason why they do not have it will be explained later).
However, a mechanism linking an instance to a struct st_table would allow
such instances to have instance variables. In ruby, this was solved by using a
global st_table, generic_iv_tbl (figure 7), for these associations.

Figure 7: generic_iv_table
Let’s see this in practice.
▼ generic_ivar_set()
static st_table *generic_iv_tbl;

static void
generic_ivar_set(obj, id, val)
    VALUE obj;
    ID id;
    VALUE val;
{
    st_table *tbl;

    /* for the time being you should ignore this */
    if (rb_special_const_p(obj)) {
        special_generic_ivar = 1;
    }
    /* initialize generic_iv_tbl if it does not exist */
    if (!generic_iv_tbl) {
        generic_iv_tbl = st_init_numtable();
    }

    /* the treatment itself */
    if (!st_lookup(generic_iv_tbl, obj, &tbl)) {
        FL_SET(obj, FL_EXIVAR);
        tbl = st_init_numtable();
        st_add_direct(generic_iv_tbl, obj, tbl);
        st_add_direct(tbl, id, val);
        return;
    }
    st_insert(tbl, id, val);
}
(variable.c)
rb_special_const_p() is true when its parameter is not a pointer. However,
as this if part requires knowledge of the garbage collector, we’ll skip it
for now. I’d like you to check it again after reading chapter 5 “Garbage
collection”.
st_init_numtable() already appeared some time ago. It creates a new hash
table.
st_lookup() searches a value corresponding to a key. In this case it
searches for what’s attached to obj. If an attached value can be found, the
whole function returns true and stores the value at the address (&tbl) given
as third parameter. In short, !st_lookup(...) can be read “if a value can’t
be found”.
st_insert() was also already explained. It stores a new association in a
table.
st_add_direct() is similar to st_insert(), but the part before adding the
association that checks if the key was already stored or not is different. In
other words, in the case of st_add_direct(), if a key already registered is
being used, two associations linked to this same key will be stored.
st_add_direct() can be used when the check for existence has already been
done, as is the case here, or when a new table has just been created.
FL_SET(obj, FL_EXIVAR) is the macro that sets the FL_EXIVAR flag in the
basic.flags of obj. The basic.flags flags are all named FL_xxxx and
can be set using FL_SET(). These flags can be unset with FL_UNSET(). The
EXIVAR from FL_EXIVAR seems to be the abbreviation of EXternal Instance
VARiable.
These flags are set to speed up the reading of instance
variables. If FL_EXIVAR is not set, we know immediately, without even searching
generic_iv_tbl, that the object has no instance variables. And
of course a bit check is way faster than searching a struct st_table.
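The combination of an external table and a flag can be imitated with a toy program. The sketch below is a simplification under assumptions: a small fixed array replaces the hash tables, the table is flat (object and variable id together form the key) instead of being a table of tables as in ruby, and all the names (struct obj, FLAG_EXIVAR, ext_ivar_set(), ext_ivar_get(), ext_search_count()) are invented. A counter records how many times the table is really searched, so the effect of the flag can be observed.

```c
#include <assert.h>
#include <stddef.h>

#define FLAG_EXIVAR 0x1UL      /* made-up bit standing in for FL_EXIVAR */
#define MAX_ENTRIES 16

struct obj { unsigned long flags; };

/* One (object, variable id) -> value association in the global side table. */
struct entry { const struct obj *owner; unsigned long id; long val; };

static struct entry side_table[MAX_ENTRIES];
static size_t n_entries;
static unsigned long n_searches;   /* how often the table was actually searched */

static struct entry *find_entry(const struct obj *o, unsigned long id)
{
    size_t i;
    n_searches++;
    for (i = 0; i < n_entries; i++)
        if (side_table[i].owner == o && side_table[i].id == id)
            return &side_table[i];
    return NULL;
}

/* Store a value and mark the object as having external instance variables. */
void ext_ivar_set(struct obj *o, unsigned long id, long val)
{
    struct entry *e = find_entry(o, id);
    if (!e) {
        assert(n_entries < MAX_ENTRIES);
        e = &side_table[n_entries++];
        e->owner = o;
        e->id = id;
        o->flags |= FLAG_EXIVAR;
    }
    e->val = val;
}

/* Returns 1 and stores the value when found. The flag is tested first,
   so objects that never had a variable set skip the search entirely. */
int ext_ivar_get(const struct obj *o, unsigned long id, long *val)
{
    const struct entry *e;
    if (!(o->flags & FLAG_EXIVAR)) return 0;
    e = find_entry(o, id);
    if (!e) return 0;
    *val = e->val;
    return 1;
}

unsigned long ext_search_count(void) { return n_searches; }
```

Reading a variable from an object whose flag is not set leaves the search counter unchanged, which is the saving described above.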
Now you should understand how the instance variables are stored, but why are
there structures without iv_tbl? Why is there no iv_tbl in struct
RString or struct RArray? Couldn’t iv_tbl be part of RBasic?
Well, this could have been done, but there are good reasons why it was not. As
a matter of fact, this problem is deeply linked to the way ruby manages
objects.
In ruby, memory used by, for example, string data (char[]) is directly
allocated using malloc(). However, the object structures are handled in a
particular way. ruby allocates them in clusters, and then distributes them
from these clusters. As the diversity of types (and sizes) of structures is
difficult to handle at allocation time, a type (union) that combines all
structures, RVALUE, was declared and an array of this type is managed. As this
type’s size is the same as that of its biggest member, if there is only
one big structure, there is a lot of unused space. That’s why doing as much as
possible to keep structures at similar sizes is desirable. The details about
RVALUE will be explained in chapter 5 “Garbage collection”.
Generally the most used structure is struct RString. After that, in programs
there are struct RArray (array), RHash (hash), RObject (user defined
object), etc. However, this struct RObject only uses the space of struct
RBasic + 1 pointer. On the other hand, struct RString, RArray and RHash
take the space of struct RBasic + 3 pointers. In other words, when putting a
struct RObject in the shared entity, the space for 2 pointers is useless.
And beyond that, if RString had 4 pointers, RObject would use less than
half the size of the shared entity. As you would expect, it’s wasteful.
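The arithmetic can be checked with sizeof. The structures below are mock-ups that only follow the member counts given above (struct RBasic plus 1 pointer, struct RBasic plus 3 pointer-sized members); they are not ruby's definitions, and union slot and wasted_in_small() are names made up for the example.

```c
#include <assert.h>
#include <stddef.h>

struct basic { unsigned long flags; unsigned long klass; };

/* Mock layouts following the member counts described in the text. */
struct small_obj { struct basic basic; void *iv_tbl; };                    /* basic + 1 pointer */
struct big_obj   { struct basic basic; long len; char *ptr; long capa; };  /* basic + 3 members */

/* One slot type for every kind of object, in the spirit of RVALUE. */
union slot {
    struct small_obj small;
    struct big_obj big;
};

/* Bytes left unused when a small_obj occupies a slot. */
size_t wasted_in_small(void)
{
    return sizeof(union slot) - sizeof(struct small_obj);
}
```

On the usual 32-bit and 64-bit platforms the slot is as large as struct big_obj, and wasted_in_small() comes out as the size of 2 pointers.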
So what would be gained by putting iv_tbl in every structure is little more than
some saved memory and speed-up, and we do not even know whether the feature would be used often. In fact,
generic_iv_tbl was not introduced before ruby 1.2, so it was not possible
to use instance variables in String or Array at that time. Nevertheless it
was not much of a problem. Making large amounts of memory useless just for
such a functionality looks stupid.
If you take all this into consideration, you can conclude that increasing the size of object structures does not do any good.
rb_ivar_get()
We saw the rb_ivar_set() function that sets variables, so let’s see quickly
how to get them.
▼ rb_ivar_get()
VALUE
rb_ivar_get(obj, id)
    VALUE obj;
    ID id;
{
    VALUE val;

    switch (TYPE(obj)) {
      /* (A) */
      case T_OBJECT:
      case T_CLASS:
      case T_MODULE:
        if (ROBJECT(obj)->iv_tbl &&
            st_lookup(ROBJECT(obj)->iv_tbl, id, &val))
            return val;
        break;
      /* (B) */
      default:
        if (FL_TEST(obj, FL_EXIVAR) || rb_special_const_p(obj))
            return generic_ivar_get(obj, id);
        break;
    }
    /* (C) */
    rb_warning("instance variable %s not initialized", rb_id2name(id));

    return Qnil;
}
(variable.c)
The structure is exactly the same as in rb_ivar_set().
(A) For struct RObject or RClass, we search the variable in iv_tbl. As
iv_tbl can also be NULL, we must check it before using it. Then if
st_lookup() finds the relation, it returns true, so the whole if can be
read as “If the instance variable has been set, return its value”.
(C) If no correspondence could be found, in other words if we read an
instance variable that has not been set, we first leave the if then the
switch. rb_warning() will then issue a warning and nil will be returned.
That’s because you can read instance variables that have not been set in Ruby.
(B) On the other hand, if the structure is neither struct RObject nor
RClass, the instance variable table is searched in generic_iv_tbl. What
generic_ivar_get() does can be easily guessed, so I won’t explain it. I’d
rather want you to focus on the if.
I already told you that generic_ivar_set() sets the FL_EXIVAR flag to make
the check faster.
And what is rb_special_const_p()? This function returns true when its
parameter obj does not point to a structure. As no structure means no
basic.flags, no flag can be set, and FL_xxxx() will always return false.
That’s why these objects have to be treated specially.
In this section we’ll see simply, among object structures, what the important ones contain and how they are handled.
struct RString
struct RString is the structure for the instances of the String class and
its subclasses.
▼ struct RString
struct RString {
    struct RBasic basic;
    long len;
    char *ptr;
    union {
        long capa;
        VALUE shared;
    } aux;
};
(ruby.h)
ptr is a pointer to the string, and len the length of that string. Very
straightforward.
Rather than a string, Ruby’s string is more a byte array, and can contain any
byte including NUL. So when thinking at the Ruby level, ending the string
with NUL does not mean anything. As C functions require NUL, for
convenience the ending NUL is there, however, it is not included in len.
When dealing with a string coming from the interpreter or an extension
library, you can write RSTRING(str)->ptr or RSTRING(str)->len, and access
ptr and len. But there are some points to pay attention to.
1. you have to check beforehand that str really points to a struct RString
2. you can read the members, but you must not modify them
3. you must not store RSTRING(str)->ptr in something like a local variable and
use it later
Why is that? First, there is an important software engineering principle:
Don’t arbitrarily tamper with someone’s data. Interface functions are there
for a reason. However, there are also concrete reasons in ruby’s design
why you should not do such things as modifying the members or storing a pointer, and
they are related to the fourth member aux. However, to explain properly how to
use aux, we have to explain first a little more of the characteristics of
Ruby’s strings.
Ruby’s strings can be modified (are mutable). By mutable I mean after the following code:
s = "str" # create a string and assign it to s
s.concat("ing") # append "ing" to this string object
p(s) # show the string
the content of the object pointed by s will become “string”. It’s
different from Java or Python string objects. Java’s StringBuffer is closer.
And what’s the relation? First, mutable means the length (len) of the string
can change. We have to increase or decrease the allocated memory size each time
the length changes. We can of course use realloc() for that, but generally
malloc() and realloc() are heavy operations. Having to realloc() each
time the string changes is a huge burden.
That’s why the memory pointed by ptr has been allocated with a size a little
bigger than len. Because of that, if the added part can fit into the
remaining memory, it’s taken care of without calling realloc(), so it’s
faster. The structure member aux.capa contains the length including this
additional memory.
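Here is a sketch of the idea, reduced to a small growable buffer. It is not how ruby's strings are implemented: struct buf, buf_init(), buf_cat() and buf_free() are invented names, the doubling growth policy is an arbitrary choice for the example, and the reallocs counter exists only so that the saving can be observed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable byte buffer with len, capa and ptr as in the text. */
struct buf {
    long len;       /* bytes in use, not counting the trailing NUL */
    long capa;      /* bytes available for content */
    char *ptr;
    int reallocs;   /* bookkeeping for the example only */
};

int buf_init(struct buf *b, long capa)
{
    assert(capa >= 0);
    b->ptr = malloc((size_t)capa + 1);
    if (!b->ptr) return -1;
    b->ptr[0] = '\0';
    b->len = 0;
    b->capa = capa;
    b->reallocs = 0;
    return 0;
}

/* Append n bytes; memory is reallocated only when the spare room is used up. */
int buf_cat(struct buf *b, const char *s, long n)
{
    assert(n >= 0);
    if (b->len + n > b->capa) {
        long new_capa = (b->len + n) * 2;   /* growth policy chosen for the sketch */
        char *p = realloc(b->ptr, (size_t)new_capa + 1);
        if (!p) return -1;
        b->ptr = p;
        b->capa = new_capa;
        b->reallocs++;
    }
    memcpy(b->ptr + b->len, s, (size_t)n);
    b->len += n;
    b->ptr[b->len] = '\0';   /* convenience NUL, not counted in len */
    return 0;
}

void buf_free(struct buf *b)
{
    free(b->ptr);
    b->ptr = NULL;
    b->len = b->capa = 0;
}
```

With an initial capacity of 8, appending "str" and then "ing" needs no reallocation at all; only an append that goes beyond the capacity pays for a realloc().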
So what is this other aux.shared? It’s to speed up the creation of literal
strings. Have a look at the following Ruby program.
while true do # repeat indefinitely
a = "str" # create a string with "str" as content and assign it to a
a.concat("ing") # append "ing" to the object pointed by a
p(a) # show "string"
end
Whatever the number of times you repeat the loop, the fourth line’s p has to
show "string". That’s why the code "str" should create, each time, a string
object holding a different char[]. However, if no change occurs for a lot of
strings, useless copies of char[] can be created many times. It would be better
to share one common char[].
The trick that allows this to happen is aux.shared. String objects created
with a literal use one shared char[]. When a change occurs, the string is
copied in unshared memory, and the change is done on this new copy. This
technique is called “copy-on-write”. When using a shared char[], the flag
ELTS_SHARED is set in the object structure’s basic.flags, and aux.shared
contains the original object. ELTS seems to be the abbreviation of
ELemenTS.
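Copy-on-write itself fits in a few lines. The sketch below is only an illustration of the technique: struct cow_str and its functions are invented names, a simple shared member plays the role of the ELTS_SHARED flag, and the string just borrows the bytes of a C literal instead of keeping a reference to an original object as aux.shared does.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct cow_str {
    long len;
    char *ptr;
    int shared;   /* plays the role of the ELTS_SHARED flag */
};

/* Create a string that borrows the bytes of a literal instead of copying them. */
struct cow_str cow_from_literal(const char *lit)
{
    struct cow_str s;
    s.len = (long)strlen(lit);
    s.ptr = (char *)lit;   /* shared memory: never written through this pointer */
    s.shared = 1;
    return s;
}

/* Append tail. The result always goes into fresh private memory,
   so the bytes of a shared literal are left untouched. */
int cow_cat(struct cow_str *s, const char *tail)
{
    long n = (long)strlen(tail);
    char *p = malloc((size_t)(s->len + n) + 1);
    if (!p) return -1;
    memcpy(p, s->ptr, (size_t)s->len);
    memcpy(p + s->len, tail, (size_t)n + 1);   /* also copies the NUL */
    if (!s->shared) free(s->ptr);              /* only private memory is ours to free */
    s->ptr = p;
    s->len += n;
    s->shared = 0;
    return 0;
}

void cow_free(struct cow_str *s)
{
    if (!s->shared) free(s->ptr);
    s->ptr = NULL;
    s->len = 0;
}
```

Two strings created from the same literal share one char[] until one of them is modified; the modified one then gets its own copy and the other still sees "str".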
But, well, let’s return to our talk about RSTRING(str)->ptr. Even if
consulting the pointer is OK, you must not modify it, first because the value
of len or capa will no longer agree with the content, and also because when
modifying strings created as literals, aux.shared has to be separated.
To finish this section about RString, let’s write some examples of how to use
it. str is a VALUE that points to RString.
RSTRING(str)->len; /* length */
RSTRING(str)->ptr[0]; /* first character */
str = rb_str_new("content", 7); /* create a string with "content" as its content;
                                   the second parameter is the length */
str = rb_str_new2("content"); /* create a string with "content" as its content;
                                 its length is calculated with strlen() */
rb_str_cat2(str, "end"); /* concatenate a C string to a Ruby string */
struct RArray
struct RArray is the structure for the instances of Ruby’s array class
Array.
▼ struct RArray
struct RArray {
    struct RBasic basic;
    long len;
    union {
        long capa;
        VALUE shared;
    } aux;
    VALUE *ptr;
};
(ruby.h)
Except for the type of ptr, this structure is almost the same as struct
RString. ptr points to the content of the array, and len is its length.
aux is exactly the same as in struct RString: aux.capa is the “real”
length of the memory pointed to by ptr, and if ptr is shared, aux.shared
stores the original array object it is shared with.
From this structure, it’s clear that Ruby’s Array is an array and not a
list. So when the number of elements changes significantly, a realloc() must
be done, and if an element is inserted anywhere other than at the end, a
memmove() occurs. But even so, on current machines these operations are
impressively fast.
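To make the cost of these two operations concrete, here is a minimal sketch of a growable array. It is a toy model and not ruby's implementation: struct toy_ary and the toy_ary_* functions are hypothetical, elements are plain long instead of VALUE, and the doubling growth policy is an assumption chosen for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy model only; this is not ruby's struct RArray. */
struct toy_ary {
    long len;     /* number of elements in use */
    long capa;    /* number of elements ptr has room for */
    long *ptr;
};

void toy_ary_init(struct toy_ary *a)
{
    a->len = 0;
    a->capa = 0;
    a->ptr = NULL;
}

/* Grow the buffer when it is full; doubling is an assumed policy. */
static void toy_ary_reserve(struct toy_ary *a)
{
    if (a->len < a->capa) return;
    a->capa = a->capa ? a->capa * 2 : 4;
    a->ptr = realloc(a->ptr, (size_t)a->capa * sizeof *a->ptr);
    assert(a->ptr != NULL);
}

/* Append at the end: no existing element has to move. */
void toy_ary_push(struct toy_ary *a, long v)
{
    toy_ary_reserve(a);
    a->ptr[a->len++] = v;
}

/* Insert at pos (0 <= pos <= len): the tail is shifted with memmove(). */
void toy_ary_insert(struct toy_ary *a, long pos, long v)
{
    assert(pos >= 0 && pos <= a->len);
    toy_ary_reserve(a);
    memmove(a->ptr + pos + 1, a->ptr + pos,
            (size_t)(a->len - pos) * sizeof *a->ptr);
    a->ptr[pos] = v;
    a->len++;
}
```

Pushing only touches the buffer when len reaches capa, while inserting in the middle always moves every element after the insertion point.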
Because the structure is so similar, the way to access it is also similar to
RString. You can read the RARRAY(arr)->ptr and RARRAY(arr)->len members, but
you must not set them, and so on. We’ll only look at simple examples:
/* manage an array from C */
VALUE ary;
ary = rb_ary_new();             /* create an empty array */
rb_ary_push(ary, INT2FIX(9));   /* push a Ruby 9 */
RARRAY(ary)->ptr[0];            /* look what's at index 0 */
rb_p(RARRAY(ary)->ptr[0]);      /* do p on ary[0] (the result is 9) */

# manage an array from Ruby
ary = []      # create an empty array
ary.push(9)   # push 9
ary[0]        # look what's at index 0
p(ary[0])     # do p on ary[0] (the result is 9)
struct RRegexp
It’s the structure for the instances of the regular expression class Regexp.
▼ struct RRegexp
struct RRegexp {
    struct RBasic basic;
    struct re_pattern_buffer *ptr;
    long len;
    char *str;
};
(ruby.h)
ptr is the regular expression after compilation. str is the string before
compilation (the source code of the regular expression), and len is this
string’s length.
As the Regexp object handling code doesn’t appear in this book, we won’t see
how to use it. Even if you use it in extension libraries, the interface
functions are enough unless you want to use it in a very particular way.
struct RHash
struct RHash is the structure for Ruby’s Hash objects.
▼ struct RHash
struct RHash {
    struct RBasic basic;
    struct st_table *tbl;
    int iter_lev;
    VALUE ifnone;
};
(ruby.h)
It’s a wrapper for struct st_table. st_table will be detailed in the next
chapter “Names and name tables”.
ifnone is the value returned when a key does not have an attached value; its
default is nil. iter_lev is there to make the hash table reentrant
(multithread safe).
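The role of ifnone can be illustrated with a small sketch. It is only a toy: struct toy_hash, struct toy_entry and toy_hash_aref() are hypothetical names, the table is a fixed array searched linearly rather than a struct st_table, and values are plain long instead of VALUE.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model only; this is not ruby's struct RHash. */
struct toy_entry {
    const char *key;
    long value;
};

struct toy_hash {
    const struct toy_entry *tbl;  /* stand-in for struct st_table *tbl */
    size_t n;                     /* number of entries in tbl */
    long ifnone;                  /* returned when the key is absent */
};

/* Look up key; fall back to ifnone when no entry matches. */
long toy_hash_aref(const struct toy_hash *h, const char *key)
{
    size_t i;
    assert(h != NULL && key != NULL);
    for (i = 0; i < h->n; i++) {
        if (strcmp(h->tbl[i].key, key) == 0)
            return h->tbl[i].value;
    }
    return h->ifnone;
}
```

The lookup never fails: a missing key simply yields whatever default the hash object carries.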
struct RFile
struct RFile is a structure for instances of the built-in IO class and
its subclasses.
▼ struct RFile
struct RFile {
    struct RBasic basic;
    struct OpenFile *fptr;
};
(ruby.h)
▼ OpenFile
typedef struct OpenFile {
    FILE *f;                    /* stdio ptr for read/write */
    FILE *f2;                   /* additional ptr for rw pipes */
    int mode;                   /* mode flags */
    int pid;                    /* child's pid (for pipes) */
    int lineno;                 /* number of lines read */
    char *path;                 /* pathname for file */
    void (*finalize) _((struct OpenFile*)); /* finalize proc */
} OpenFile;
(rubyio.h)
All the members have been moved into struct OpenFile. As there aren’t many
instances of IO objects, it’s fine to do it this way. The purpose of each
member is written in the comments. Basically, it’s a wrapper around C’s stdio.
struct RData
struct RData has a different character from what we have seen so far. It is
the structure used to implement extension libraries.
Of course, classes created in extension libraries need structures of their own,
but as the types of these structures depend on the class being created, it’s
impossible to know their size or layout in advance. That’s why a “structure
for managing a pointer to a user-defined structure” has been created on
ruby’s side. This structure is struct RData.
▼ struct RData
struct RData {
    struct RBasic basic;
    void (*dmark) _((void*));
    void (*dfree) _((void*));
    void *data;
};
(ruby.h)
data is a pointer to the user-defined structure,
dfree is the function used to free this structure, and
dmark is the function called when the “mark” phase of mark and sweep occurs.
Because explaining struct RData in full is still too complicated, for
the time being let’s just look at its representation (figure 8). You’ll read
a detailed explanation of its members in chapter 5 “Garbage collection”, where
they’ll be presented once again.

Figure 8: Representation of struct RData
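As a rough companion to figure 8, the following sketch shows how a wrapper holding data, dmark and dfree could manage a user-defined structure. It is a toy under stated assumptions: struct toy_data, struct point, toy_wrap() and toy_release() are hypothetical names, the basic member is omitted, and toy_release() only imitates what a collector might do when the wrapper is no longer reachable.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model only; this is not ruby's struct RData (basic is omitted). */
struct toy_data {
    void (*dmark)(void *);   /* called during the mark phase, may be NULL */
    void (*dfree)(void *);   /* called to release data, may be NULL */
    void *data;              /* the user-defined structure */
};

/* A hypothetical user-defined structure an extension might wrap. */
struct point {
    int x;
    int y;
};

/* Wrap a user pointer together with its mark and free functions. */
struct toy_data *toy_wrap(void *data,
                          void (*dmark)(void *),
                          void (*dfree)(void *))
{
    struct toy_data *d = malloc(sizeof *d);
    assert(d != NULL);
    d->dmark = dmark;
    d->dfree = dfree;
    d->data = data;
    return d;
}

/* Imitates what a collector would do once the wrapper is unreachable. */
void toy_release(struct toy_data *d)
{
    if (d->dfree) d->dfree(d->data);
    free(d);
}
```

The wrapper never needs to know the size or layout of struct point; everything type-specific is delegated to the two function pointers.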
