Make edits to ch8 as a result of edits/questions from nostarch

Carol (Nichols || Goulding) 2016-11-20 14:28:54 -05:00
parent f3475a6652
commit 635f7b5202
4 changed files with 480 additions and 341 deletions

@ -1,11 +1,14 @@
# Fundamental Collections
Rust's standard library includes a number of really useful data structures
called *collections*. Most other types represent one specific value, but
collections can contain multiple values inside of them. Each collection has
different capabilities and costs, and choosing an appropriate one for the
situation you're in is a skill you'll develop over time. In this chapter, we'll
go over three collections which are used very often in Rust programs:
called *collections*. Most other data types represent one specific value, but
collections can contain multiple values. Unlike the built-in array and tuple
types, the data these collections point to is stored on the heap, which means
the amount of data does not need to be known at compile time and can grow or
shrink as the program runs. Each kind of collection has different capabilities
and costs, and choosing an appropriate one for the situation you're in is a
skill you'll develop over time. In this chapter, we'll go over three
collections which are used very often in Rust programs:
* A *vector* allows us to store a variable number of values next to each other.
* A *string* is a collection of characters. We've seen the `String` type

@ -2,31 +2,45 @@
The first type we'll look at is `Vec<T>`, also known as a *vector*. Vectors
allow us to store more than one value in a single data structure that puts all
the values next to each other in memory.
the values next to each other in memory. Vectors can only store values of the
same type. They are useful in situations where you have a list of items, such
as the lines of text in a file or the prices of items in a shopping cart.
### Creating a New Vector
To create a new vector, we can call the `new` function:
To create a new, empty vector, we can call the `Vec::new` function:
```rust
let v: Vec<i32> = Vec::new();
```
Note that we added a type annotation here. Since we don't actually do
anything with the vector, Rust doesn't know what kind of elements we intend to
store. This is an important point. Vectors are homogeneous: they may store many
values, but those values must all be the same type. Vectors are generic over
the type stored inside them (we'll talk about Generics more thoroughly in
Chapter 10), and the angle brackets here tell Rust that this vector will hold
Note that we added a type annotation here. Since we aren't inserting any values
into this vector, Rust doesn't know what kind of elements we intend to store.
This is an important point. Vectors are homogeneous: they may store many values,
but those values must all be the same type. Vectors are implemented using
generics; Chapter 10 covers how to use generics in your own types. For now,
all you need to know is that the `Vec` type provided by the standard library
can hold any type, and when a specific `Vec` holds a specific type, the type
goes within angle brackets. We've told Rust that the `Vec` in `v` will hold
elements of the `i32` type.
That said, in real code, we very rarely need to do this type annotation since
Rust can infer the type of value we want to store once we insert values. Let's
look at how to modify a vector next.
In real code, Rust can infer the type of value we want to store once we insert
values, so you rarely need to do this type annotation. It's more common to
create a `Vec` that has initial values, and Rust provides the `vec!` macro for
convenience. The macro will create a new `Vec` that holds the values we give
it. This will create a new `Vec<i32>` that holds the values `1`, `2`, and `3`:
```rust
let v = vec![1, 2, 3];
```
Because we've given initial `i32` values, Rust can infer that the type of `v`
is `Vec<i32>`, and the type annotation isn't necessary. Let's look at how to
modify a vector next.
### Updating a Vector
To put elements in the vector, we can use the `push` method:
To create a vector and then add elements to it, we can use the `push` method:
```rust
let mut v = Vec::new();
@ -37,18 +51,10 @@ v.push(7);
v.push(8);
```
Since these numbers are `i32`s, Rust infers the type of data we want to store
in the vector, so we don't need the `<i32>` annotation.
We can improve this code even further. Creating a vector with some initial
values like this is very common, so there's a macro to do it for us:
```rust
let v = vec![5, 6, 7, 8];
```
This macro does a similar thing to our previous example, but it's much more
convenient.
As with any variable, as we discussed in Chapter 3, if we want to be able to
change its value, we need to make it mutable with the `mut` keyword. The
numbers we place inside are all `i32`s, and Rust infers this from the data, so
we don't need the `Vec<i32>` annotation.
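As a small sketch of the points above, the following wraps the `push` calls in a function. The name `build_numbers` is only an illustration and is not part of the chapter's code. The element type is never written at the `Vec::new()` call; Rust works it out from the values and the return type.

```rust
/// Hypothetical helper: builds a vector by pushing values one at a time.
fn build_numbers() -> Vec<i32> {
    // `mut` is required because `push` changes the vector.
    // No `Vec<i32>` annotation is needed on `v`; the type is inferred.
    let mut v = Vec::new();

    v.push(5);
    v.push(6);
    v.push(7);
    v.push(8);

    v
}

fn main() {
    let v = build_numbers();
    println!("{:?}", v);
}
```

Removing `mut` from the declaration of `v` should make the compiler reject the first `push` call.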
### Dropping a Vector Drops its Elements
@ -63,18 +69,20 @@ Like any other `struct`, a vector will be freed when it goes out of scope:
} // <- v goes out of scope and is freed here
```
When the vector gets dropped, it will also drop all of its contents, so those
integers are going to be cleaned up as well. This may seem like a
When the vector gets dropped, all of its contents will also be dropped, meaning
the integers it holds will be cleaned up. This may seem like a
straightforward point, but can get a little more complicated once we start to
introduce references to the elements of the vector. Let's tackle that next!
### Reading Elements of Vectors
Now that we know how creating and destroying vectors works, knowing how to read
their contents is a good next step. There are two ways to reference a value
stored in a vector. In the following examples of these two ways, we've
annotated the types of the values that are returned from these functions for
extra clarity:
Now that you know how to create, update, and destroy vectors, knowing how to
read their contents is a good next step. There are two ways to reference a
value stored in a vector. In the examples, we've annotated the types of the
values that are returned from these functions for extra clarity.
This example shows both ways of accessing a value in a vector: indexing syntax
and the `get` method:
```rust
let v = vec![1, 2, 3, 4, 5];
@ -83,13 +91,17 @@ let third: &i32 = &v[2];
let third: Option<&i32> = v.get(2);
```
First, note that we use the index value of `2` to get the third element:
vectors are indexed by number, starting at zero. Secondly, the two different
ways to get the third element are using `&` and `[]`s and using the `get`
method. The square brackets give us a reference, and `get` gives us an
`Option<&T>`. The reason we have two ways to reference an element is so that we
can choose the behavior we'd like to have if we try to use an index value that
the vector doesn't have an element for:
There are a few things to note here. First, we use the index value of `2`
to get the third element: vectors are indexed by number, starting at zero.
Second, there are two ways to get the third element: using `&` and `[]`,
which gives us a reference, or using the `get` method with the index passed as
an argument, which gives us an `Option<&T>`.
The reason Rust has two ways to reference an element is so that you can choose
how the program behaves when you try to use an index value that the vector
doesn't have an element for. As an example, what should a program do if it has
a vector that holds five elements and then tries to access an element at index
100, like this:
```rust,should_panic
let v = vec![1, 2, 3, 4, 5];
@ -98,23 +110,45 @@ let does_not_exist = &v[100];
let does_not_exist = v.get(100);
```
With the `[]`s, Rust will cause a `panic!`. With the `get` method, it will
instead return `None` without `panic!`ing. Deciding which way to access
elements in a vector depends on whether we consider an attempted access past
the end of the vector to be an error, in which case we'd want the `panic!`
behavior, or whether this will happen occasionally under normal circumstances
and our code will have logic to handle getting `Some(&element)` or `None`.
When you run this, you will find that with the first `[]` method, Rust will
cause a `panic!` when a non-existent element is referenced. This method would
be preferable if you want your program to consider an attempt to access an
element past the end of the vector to be a fatal error that should crash the
program.
Once we have a valid reference, the borrow checker will enforce the ownership
and borrowing rules we covered in Chapter 4 in order to ensure this and other
references to the contents of the vector stay valid. This means in a function
that owns a `Vec`, we can't return a reference to an element since the `Vec`
will be cleaned up at the end of the function:
When the `get` method is passed an index that is outside the vector, it will
return `None` without `panic!`ing. You would use this if accessing an element
beyond the range of the vector will happen occasionally under normal
circumstances. Your code can then have logic to handle having either
`Some(&element)` or `None`, as we discussed in Chapter 6. For example, the
index could be coming from a person entering a number. If they accidentally
enter a number that's too large and your program gets a `None` value, you could
tell the user how many items are in the current `Vec` and give them another
chance to enter a valid value. That would be more user-friendly than crashing
the program for a typo!
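Here is a minimal sketch of that kind of handling. The `describe` function and its messages are made up for illustration; the only standard library calls it relies on are `get` and `len`.

```rust
/// Hypothetical helper: turns a user-supplied index into a friendly message
/// instead of panicking when the index is out of range.
fn describe(items: &[i32], index: usize) -> String {
    match items.get(index) {
        // `get` returned a reference to the element at `index`.
        Some(value) => format!("Item {} is {}", index, value),
        // `get` returned `None`: tell the user how many items there are.
        None => format!(
            "There is no item {}; the list has {} items",
            index,
            items.len()
        ),
    }
}

fn main() {
    let v = vec![1, 2, 3, 4, 5];

    println!("{}", describe(&v, 2));
    println!("{}", describe(&v, 100));
}
```

A real program would probably loop and ask for another index in the `None` case; this sketch only builds the message.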
#### Invalid References
Once the program has a valid reference, the borrow checker will enforce the
ownership and borrowing rules covered in Chapter 4 to ensure this reference and
any other references to the contents of the vector stay valid. This means that
in a function that owns a `Vec`, we can't return a reference to an element in
the `Vec` to be used outside the function since the `Vec` will be cleaned up at
the end of the function. Try it out with the following:
<!-- TODO: fix this code example https://github.com/rust-lang/book/issues/273 -->
```rust,ignore
fn element() -> String {
    let list = vec![String::from("hi"), String::from("bye")];
    list[1]
} // <-- list goes out of scope here

fn main() {
    let e = element();
    println!("{}", e); // <-- we can't have a reference to an element of
                       // list out here since list was cleaned up at the end
                       // of the element function.
}
```
@ -130,8 +164,8 @@ error: cannot move out of indexed content [--explain E0507]
Since `list` goes out of scope and gets cleaned up at the end of the function,
the reference `list[1]` cannot be returned because it would outlive `list`.
Here's another example of code that looks like it should be allowed, but it
won't compile because the references actually aren't valid anymore:
Here's another example of code that looks like it should be allowed, but won't
compile because the references aren't valid:
```rust,ignore
let mut v = vec![1, 2, 3, 4, 5];
@ -144,43 +178,49 @@ v.push(6);
Compiling this will give us this error:
```text
error: cannot borrow `v` as mutable because it is also borrowed as immutable
[--explain E0502]
|>
5 |> let first = &v[0];
|> - immutable borrow occurs here
7 |> v.push(6);
|> ^ mutable borrow occurs here
9 |> }
|> - immutable borrow ends here
error[E0502]: cannot borrow `v` as mutable because it is also borrowed as immutable
|
4 | let first = &v[0];
| - immutable borrow occurs here
5 |
6 | v.push(6);
| ^ mutable borrow occurs here
7 | }
| - immutable borrow ends here
```
This violates one of the ownership rules we covered in Chapter 4: the `push`
method needs to have a mutable borrow to the `Vec`, and we aren't allowed to
have any immutable borrows while we have a mutable borrow.
method needs to have a mutable borrow to the `Vec`, and Rust doesn't allow any
immutable borrows in the same scope as a mutable borrow.
Why is it an error to have a reference to the first element in a vector while
we try to add a new item to the end, though? Due to the way vectors work,
adding a new element onto the end might require allocating new memory and
copying the old elements over to the new space if there wasn't enough room to
put all the elements next to each other where the vector was. If this happened,
our reference would be pointing to deallocated memory. For more on this, see
[The Nomicon](https://doc.rust-lang.org/stable/nomicon/vec.html).
Rust disallows holding a reference to the first element in a vector while
adding a new item to the end because of the way vectors work. Adding a new
element onto the end of the vector might require allocating new memory and
copying the old elements over to the new space if there isn't enough room to
put all the elements next to each other where the vector currently is. In that
case, the reference to the first element would be pointing to deallocated
memory. The borrowing rules prevent programs from ending up in that
situation.
> Note: For more on this, see [The Nomicon][nomicon].
[nomicon]: https://doc.rust-lang.org/stable/nomicon/vec.html
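One way to stay within the borrowing rules, sketched below, is to copy the value out of the vector rather than hold a reference to it across the `push`. This works here because `i32` is a `Copy` type; for other element types you might clone the value or finish using the reference before pushing. The function name `first_then_push` is hypothetical.

```rust
/// Hypothetical helper: reads the first element, then pushes, without
/// keeping a borrow of the vector alive across the `push` call.
fn first_then_push() -> (i32, Vec<i32>) {
    let mut v = vec![1, 2, 3, 4, 5];

    // `v[0]` without `&` copies the `i32` out, so no reference into the
    // vector's memory exists when `push` possibly reallocates it.
    let first = v[0];

    v.push(6);

    (first, v)
}

fn main() {
    let (first, v) = first_then_push();
    println!("first = {}, v = {:?}", first, v);
}
```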
### Using an Enum to Store Multiple Types
Let's put vectors together with what we learned about enums in Chapter 6. At
the beginning of this section, we said that vectors will only store values that
are all the same type. This can be inconvenient; there are definitely use cases
for needing to store a list of things that might be different types. Luckily,
the variants of an enum are all the same type as each other, so when we're in
this scenario, we can define and use an enum!
At the beginning of this chapter, we said that vectors can only store values
that are all the same type. This can be inconvenient; there are definitely use
cases for needing to store a list of things of different types. Luckily, the
variants of an enum are all defined under the same enum type. When we need to
store elements of different types in a vector, we can define and use an
enum!
For example, let's say we're going to be getting values for a row in a
spreadsheet. Some of the columns contain integers, some floating point numbers,
For example, let's say we want to get values from a row in a spreadsheet, where
some of the columns in the row contain integers, some floating point numbers,
and some strings. We can define an enum whose variants will hold the different
value types. All of the enum variants will then be the same type, that of the
enum. Then we can create a vector that, ultimately, holds different types:
value types, and then all of the enum variants will be considered the same
type, that of the enum. Then we can create a vector that holds that enum and
so, ultimately, holds different types:
```rust
enum SpreadsheetCell {
@ -196,20 +236,41 @@ let row = vec![
];
```
This has the advantage of being explicit about what types are allowed in this
vector. If we allowed any type to be in a vector, there would be a chance that
the vector would hold a type that would cause errors with the operations we
performed on the vector. Using an enum plus a `match` where we access elements
in a vector like this means that Rust will ensure at compile time that we
always handle every possible case.
The reason Rust needs to know exactly what types will be in the vector at
compile time is so that it knows exactly how much memory on the heap will be
needed to store each element. A secondary advantage to this is that we can be
explicit about what types are allowed in this vector. If Rust allowed a vector
to hold any type, there would be a chance that one or more of the types would
cause errors with the operations performed on the elements of the vector. Using
an enum plus a `match` means that Rust will ensure at compile time that we
always handle every possible case, as we discussed in Chapter 6.
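As a brief recap of what such a `match` might look like, here is a sketch. The variant names `Int`, `Float`, and `Text` and the `describe_cell` function are assumptions made for this illustration. Because the `match` has no catch-all arm, adding a new variant to the enum later would make this function fail to compile until the new case is handled.

```rust
// Assumed variants for illustration: one per kind of spreadsheet value.
enum SpreadsheetCell {
    Int(i32),
    Float(f64),
    Text(String),
}

/// Hypothetical helper: every variant must have an arm, or this won't compile.
fn describe_cell(cell: &SpreadsheetCell) -> String {
    match cell {
        SpreadsheetCell::Int(i) => format!("integer {}", i),
        SpreadsheetCell::Float(f) => format!("float {}", f),
        SpreadsheetCell::Text(s) => format!("text {}", s),
    }
}

fn main() {
    let row = vec![
        SpreadsheetCell::Int(3),
        SpreadsheetCell::Text(String::from("blue")),
        SpreadsheetCell::Float(10.12),
    ];

    // Each element has the same type, `SpreadsheetCell`, so one loop handles them all.
    for cell in &row {
        println!("{}", describe_cell(cell));
    }
}
```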
Using an enum for storing different types in a vector does imply that we need
to know the set of types we'll want to store at compile time. If that's not the
case, instead of an enum, we can use a trait object. We'll learn about those in
Chapter 23.
<!-- Can you briefly explain what the match is doing here, as a recap? How does
it mean we always handle every possible case? I'm not sure it's totally clear.
-->
<!-- Because this is a focus of chapter 6 rather than this chapter's focus, we
don't think we should repeat it here as well, but we added a reference. /Carol
-->
If you don't know, when you're writing a program, the exhaustive set of types
the program will get at runtime to store in a vector, the enum technique won't
work. Instead, you can use a trait object, which we'll cover in
Chapter 13.
Now that we've gone over some of the most common ways to use vectors, be sure
to take a look at the API documentation for other useful methods defined on
`Vec` by the standard library. For example, in addition to `push` there's a
`pop` method that will remove and return the last element. Let's move on to the
next collection type: `String`!
to take a look at the API documentation for all of the many useful methods
defined on `Vec` by the standard library. For example, in addition to `push`
there's a `pop` method that will remove and return the last element. Let's move
on to the next collection type: `String`!
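As a quick illustration of `pop`, the sketch below drains a vector from the end. `pop` returns an `Option`, giving `None` once the vector is empty, which is what ends the loop. The `pop_all` function is a hypothetical name used only here.

```rust
/// Hypothetical helper: removes elements from the end until none are left,
/// returning them in the order they were popped.
fn pop_all(mut v: Vec<i32>) -> Vec<i32> {
    let mut popped = Vec::new();

    // `pop` returns `Some(last_element)` or `None` when the vector is empty.
    while let Some(last) = v.pop() {
        popped.push(last);
    }

    popped
}

fn main() {
    let popped = pop_all(vec![1, 2, 3]);
    println!("{:?}", popped);
}
```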
<!-- Do you mean the Rust online documentation here? Are you not including it
in the book for space reasons? We might want to justify sending them out of the
book if we don't want to cover it here -->
<!-- Yes, there are many, many methods on Vec: https://doc.rust-lang.org/stable/std/vec/struct.Vec.html
Also there are occasionally new methods available with new versions of the
language, so there's no way we can be comprehensive here. We want the reader to
use the API documentation in these situations since the purpose of the online
docs is to be comprehensive and up to date. I personally wouldn't expect a book
like this to duplicate the info that's in the API docs, so I don't think a
justification is necessary here. /Carol -->

@ -1,32 +1,41 @@
## Strings
We've already talked about strings a bunch in Chapter 4, but let's take a more
in-depth look at them now.
in-depth look at them now. Strings are an area that new Rustaceans commonly get
stuck on. This is due to a combination of three things: Rust's propensity for
making sure to expose possible errors, strings being a more complicated data
structure than many programmers give them credit for, and UTF-8. These things
combine in a way that can seem difficult when coming from other languages.
### Many Kinds of Strings
Strings are in the collections chapter because they are
implemented as a collection of bytes plus some methods to provide useful
functionality when those bytes are interpreted as text. In this section, we'll
talk about the operations on `String` that every collection type has, like
creating, updating, and reading. We'll also discuss the ways in which `String`
is different from the other collections, namely how indexing into a `String` is
complicated by the differences between how people and computers interpret
`String` data.
Strings are a common place for new Rustaceans to get stuck. This is due to a
combination of three things: Rust's propensity for making sure to expose
possible errors, strings being a more complicated data structure than many
programmers give them credit for, and UTF-8. These things combine in a way that
can seem difficult coming from other languages.
### What is a String?
Before we can dig into those aspects, we need to talk about what exactly we
even mean by the word 'string'. Rust actually only has one string type in the
core language itself: `&str`. We talked about *string slices* in Chapter 4:
they're a reference to some UTF-8 encoded string data stored somewhere else.
String literals, for example, are stored in the binary output of the program,
and are therefore string slices.
mean by the term 'string'. Rust actually only has one string type in the core
language itself: `str`, the string slice, which is usually seen in its borrowed
form, `&str`. We talked about *string slices* in Chapter 4: these are
references to some UTF-8 encoded string data stored elsewhere. String literals,
for example, are stored in the binary output of the program, and are therefore
string slices.
Rust's standard library is what provides the type called `String`. This is a
growable, mutable, owned, UTF-8 encoded string type. When Rustaceans talk about
'strings' in Rust, they usually mean "`String` and `&str`". This chapter is
largely about `String`, and these two types are used heavily in Rust's standard
library. Both `String` and string slices are UTF-8 encoded.
The type called `String` is provided in Rust's standard library rather than
coded into the core language, and is a growable, mutable, owned, UTF-8 encoded
string type. When Rustaceans talk about 'strings' in Rust, they usually mean
both the `String` and the string slice `&str` types, not just one of those.
This section is largely about `String`, but both these types are used heavily
in Rust's standard library. Both `String` and string slices are UTF-8 encoded.
Rust's standard library also includes a number of other string types, such as
`OsString`, `OsStr`, `CString`, and `CStr`. Library crates may provide even
more options for storing string data. Similarly to the `*String`/`*Str` naming,
more options for storing string data. Similar to the `*String`/`*Str` naming,
they often provide an owned and borrowed variant, just like `String`/`&str`.
These string types may store different encodings or be represented in memory in
a different way, for example. We won't be talking about these other string
@ -35,15 +44,18 @@ them and when each is appropriate.
### Creating a New String
Let's look at how to do the same operations on `String` as we did with `Vec`,
starting with creating one. Similarly, `String` has `new`:
Many of the same operations available with `Vec` are available with `String` as
well, starting with the `new` function to create a string, like so:
```rust
let s = String::new();
```
Often, we'll have some initial data that we'd like to start the string off with.
For that, there's the `to_string` method:
This creates a new empty string called `s` that we can then load data into.
Often, we'll have some initial data that we'd like to start the string off
with. For that, we use the `to_string` method, which is available on any type
that implements the `Display` trait, which string literals do:
```rust
let data = "initial contents";
@ -54,19 +66,20 @@ let s = data.to_string();
let s = "initial contents".to_string();
```
This form is equivalent to using `to_string`:
This creates a string containing `initial contents`.
We can also use the function `String::from` to create a `String` from a string
literal. This is equivalent to using `to_string`:
```rust
let s = String::from("Initial contents");
let s = String::from("initial contents");
```
Since strings are used for so many things, there are many different generic
APIs that make sense for strings. There are a lot of options, and some of them
can feel redundant because of this, but they all have their place! In this
case, `String::from` and `.to_string` end up doing the exact same thing, so
which you choose is a matter of style. Some people use `String::from` for
literals, and `.to_string` for variables. Most Rust style is pretty
uniform, but this specific question is one of the most debated.
Because strings are used for so many things, there are many different generic
APIs that can be used for strings, so there are a lot of options. Some of them
can feel redundant, but they all have their place! In this case, `String::from`
and `.to_string` end up doing the exact same thing, so which you choose is a
matter of style.
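To check that claim for yourself, a sketch like the following builds the same text both ways and compares the results. The `make_both` function is a hypothetical name used only for this illustration.

```rust
/// Hypothetical helper: creates a `String` from the same text in two ways.
fn make_both(text: &str) -> (String, String) {
    let from_function = String::from(text);
    let from_method = text.to_string();

    (from_function, from_method)
}

fn main() {
    let (a, b) = make_both("initial contents");

    // Both values are `String`s holding the same bytes.
    println!("equal: {}", a == b);
}
```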
Remember that strings are UTF-8 encoded, so we can include any properly encoded
data in them:
@ -87,80 +100,85 @@ let hello = "Hola";
### Updating a String
A `String` can be changed and can grow in size, just like a `Vec` can.
A `String` can grow in size and its contents can change, just like the
contents of a `Vec`, by pushing more data into it. In addition, `String` has
concatenation operations implemented with the `+` operator for convenience.
#### Push
#### Appending to a String with Push
We can grow a `String` by using the `push_str` method to append another
string:
We can grow a `String` by using the `push_str` method to append a string slice:
```rust
let mut s = String::from("foo");
s.push_str("bar");
```
`s` will contain "foobar" after these two lines.
`s` will contain "foobar" after these two lines. The `push_str` method takes a
string slice because we don't necessarily want to take ownership of the
argument. For example, it would be unfortunate if we weren't able to use `s2`
after appending its contents to `s1`:
The `push` method will add a `char`:
```rust
let mut s1 = String::from("foo");
let s2 = String::from("bar");
s1.push_str(&s2);
```
The `push` method is defined to take a single character as an argument and add
it to the `String`:
```rust
let mut s = String::from("lo");
s.push('l');
```
`s` will contain "lol" after this point.
After this, `s` will contain "lol".
We can make any `String` contain the empty string with the `clear` method:
#### Concatenation with the + Operator or the `format!` Macro
```rust
let mut s = String::from("Noooooooooooooooooooooo!");
s.clear();
```
Now `s` will be the empty string, "".
#### Concatenation
Often, we'll want to combine two strings together. One way is to use the `+`
operator:
Often, we'll want to combine two existing strings together. One way is to use
the `+` operator like this:
```rust
let s1 = String::from("Hello, ");
let s2 = String::from("world!");
let s3 = s1 + &s2;
let s3 = s1 + &s2; // Note that s1 has been moved here and can no longer be used
```
This code will make `s3` contain "Hello, world!" There's some tricky bits here,
though, that come from the type signature of `+` for `String`. The signature
for the `add` method that the `+` operator uses looks something like this:
After this code runs, the `String` `s3` will contain `Hello, world!`. The
reason that `s1` is no longer valid after the addition and the reason that we
used a reference to `s2` both have to do with the signature of the method that
gets called when we use the `+` operator. The `+` operator uses the `add`
method, whose signature looks something like this:
```rust,ignore
fn add(self, s: &str) -> String {
```
This isn't exactly what the actual signature is in the standard library because
`add` is defined using generics there. Here, we're just looking at what the
signature of the method would be if `add` was defined specifically for
`String`. This signature gives us the clues we need in order to understand the
tricky bits of `+`.
This isn't the exact signature that's in the standard library; there `add` is
defined using generics. Here, we're looking at the signature of `add` with
concrete types substituted for the generic ones, which is what happens when we
call this method with `String` values. This signature gives us the clues we
need to understand the tricky bits of the `+` operator.
First of all, `s2` has an `&`. This is because of the `s` argument in the `add`
function: we can only add a `&str` to a `String`, we can't add two `String`s
together. Remember back in Chapter 4 when we talked about how `&String` will
coerce to `&str`: we write `&s2` so that the `String` will coerce to the proper
type, `&str`.
First of all, `s2` has an `&`, meaning that we are adding a *reference* to the
second string to the first string. This is because of the `s` argument in the
`add` function: we can only add a `&str` to a `String`; we can't add two
`String`s together. Remember back in Chapter 4 when we talked about how
`&String` will coerce to `&str`: we write `&s2` so that the `String` will
coerce to the proper type, `&str`. Because this method does not take ownership
of the argument, `s2` will still be valid after this operation.
Secondly, `add` takes ownership of `self`, which we can tell because `self`
does *not* have an `&` in the signature. This means `s1` in the above example
will be moved into the `add` call and no longer be a valid variable after that.
So while `let s3 = s1 + &s2;` looks like it will copy both strings and create a
new one, this statement actually takes ownership of `s1`, appends a copy of
`s2`'s contents, then returns ownership of the result. In other words, it looks
like it's making a lot of copies, but isn't: the implementation is more
efficient than copying.
Second, we can see in the signature that `add` takes ownership of `self`,
because `self` does *not* have an `&`. This means `s1` in the above example
will be moved into the `add` call and no longer be valid after that. So while
`let s3 = s1 + &s2;` looks like it will copy both strings and create a new one,
this statement actually takes ownership of `s1`, appends a copy of `s2`'s
contents, then returns ownership of the result. In other words, it looks like
it's making a lot of copies, but isn't: the implementation is more efficient
than copying.
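If we do need `s1` afterwards, one option (a sketch, not the only approach) is to hand `add` a clone of `s1` to consume. This costs an extra copy of `s1`'s contents, which is exactly the copy that plain `s1 + &s2` avoids. The `greet` function is a hypothetical name for this illustration.

```rust
/// Hypothetical helper: concatenates two strings while keeping both
/// originals usable, at the cost of cloning the first one.
fn greet() -> (String, String, String) {
    let s1 = String::from("Hello, ");
    let s2 = String::from("world!");

    // `s1.clone()` is moved into `add`; `s1` itself is untouched.
    // `&s2` is only borrowed, so `s2` stays valid too.
    let s3 = s1.clone() + &s2;

    (s1, s2, s3)
}

fn main() {
    let (s1, s2, s3) = greet();
    println!("{:?} + {:?} = {:?}", s1, s2, s3);
}
```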
If we need to concatenate multiple strings, this behavior of `+` gets
unwieldy:
If we need to concatenate multiple strings, the behavior of `+` gets unwieldy:
```rust
let s1 = String::from("tic");
@ -182,17 +200,32 @@ let s3 = String::from("toe");
let s = format!("{}-{}-{}", s1, s2, s3);
```
<!-- Are we going to discuss the format macro elsewhere at all? If not, some
more info here might be good, this seems like a really useful tool. Is it only
used on strings? -->
<!-- No, we weren't planning on it. We thought it would be sufficient to
mention that it works the same way as `println!` since we've covered how
`println!` works in Ch 2, "Printing Values with `println!` Placeholders" and
Ch 5, "Adding Useful Functionality with Derived Traits". `format!` can be
used on anything that `println!` can; using `{}` in the format string works
with anything that implements the `Display` trait and `{:?}` works with
anything that implements the `Debug` trait. Do you have any thoughts on how we
could make the similarities with `format!` and `println!` clearer than what we
have in the next paragraph without repeating the `println!` content too much?
/Carol -->
This code will also set `s` to "tic-tac-toe". The `format!` macro works in the
same way as `println!`, but instead of printing the output to the screen, it
returns a `String` with the contents. This version is much easier to read than
all of the `+`s.
returns a `String` with the contents. This version is much easier to read, and
also does not take ownership of any of its arguments.
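The following sketch shows that the arguments remain usable after `format!` by using them a second time. The function name `tic_tac_toe` is made up for this illustration.

```rust
/// Hypothetical helper: formats the same three strings twice to show that
/// `format!` only borrows its arguments.
fn tic_tac_toe() -> (String, String) {
    let s1 = String::from("tic");
    let s2 = String::from("tac");
    let s3 = String::from("toe");

    let dashed = format!("{}-{}-{}", s1, s2, s3);

    // s1, s2, and s3 were not moved, so we can use them again.
    let joined = format!("{}{}{}", s1, s2, s3);

    (dashed, joined)
}

fn main() {
    let (dashed, joined) = tic_tac_toe();
    println!("{} / {}", dashed, joined);
}
```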
### Indexing into Strings
In many other languages, accessing individual characters in a string by
referencing the characters by index is a valid and common operation. In Rust,
however, if we try to access parts of a `String` using indexing syntax, we'll
get an error. That is, this code:
referencing them by index is a valid and common operation. In Rust, however, if
we try to access parts of a `String` using indexing syntax, we'll get an error.
That is, this code:
```rust,ignore
let s1 = String::from("hello");
@ -231,69 +264,77 @@ UTF-8. What about this example, though?
let len = "Здравствуйте".len();
```
There are two answers that potentially make sense here: the first is 12, which
is the number of letters that a person would count if we asked someone how long
this string was. The second, though, is what Rust's answer is: 24. This is the
number of bytes that it takes to encode "Здравствуйте" in UTF-8, because each
character takes two bytes of storage.
A person asked how long the string is might say 12. However, Rust's answer
is 24. This is the number of bytes that it takes to encode "Здравствуйте" in
UTF-8, since each character in this string takes two bytes of storage.
Therefore, an index into the string's bytes will not always correlate to a
valid character.
By the same token, imagine this invalid Rust code:
To demonstrate, consider this invalid Rust code:
```rust,ignore
let hello = "Здравствуйте";
let answer = &h[0];
let answer = &hello[0];
```
What should the value of `answer` be? Should it be `З`, the first letter? When
encoded in UTF-8, the first byte of `З` is `208`, and the second is `151`. So
should `answer` be `208`? `208` is not a valid character on its own, though.
Plus, for Latin letters, this would not return the answer most people would
expect: `&"hello"[0]` would then return `104`, not `h`.
encoded in UTF-8, the first byte of `З` is `208`, and the second is `151`, so
`answer` should in fact be `208`, but `208` is not a valid character on its
own. Returning `208` is likely not what a person would want if they asked for
the first letter of this string, but that's the only data that Rust has at byte
index 0. Returning the byte value is probably not what people want, even with
only Latin letters: `&"hello"[0]` would return `104`, not `h`. To avoid
returning an unexpected value and causing bugs that might not be discovered
immediately, Rust chooses to not compile this code at all and prevent
misunderstandings earlier.
#### Bytes and Scalar Values and Grapheme Clusters! Oh my!
This leads to another point about UTF-8: there are really three relevant ways
to look at strings, from Rust's perspective: bytes, scalar values, and grapheme
clusters. If we look at the string "नमस्ते", it is ultimately stored as a `Vec`
of `u8` values that looks like this:
to look at strings, from Rust's perspective: as bytes, scalar values, and
grapheme clusters (the closest thing to what people would call 'letters').
If we look at the Hindi word "नमस्ते" written in the Devanagari script, it is
ultimately stored as a `Vec` of `u8` values that looks like this:
```text
[224, 164, 168, 224, 164, 174, 224, 164, 184, 224, 165, 141, 224, 164, 164, 224, 165, 135]
```
That's 18 bytes. But if we look at them as Unicode scalar values, which are
what Rust's `char` type is, those bytes look like this:
That's 18 bytes, and is how computers ultimately store this data. If we look at
them as Unicode scalar values, which are what Rust's `char` type is, those
bytes look like this:
```text
['न', 'म', 'स', '्', 'त', 'े']
```
There are six `char` values here. Finally, if we look at them as grapheme
clusters, which is the closest thing to what humans would call 'letters', we'd
get this:
There are six `char` values here, but the fourth and sixth are not letters:
they're diacritics that don't make sense on their own. Finally, if we look at
them as grapheme clusters, we'd get what a person would call the four letters
that make up this word:
```text
["न", "म", "स्", "ते"]
```
Four elements! It turns out that even within 'grapheme cluster', there are
multiple ways of grouping things. Convinced that strings are actually really
complicated yet?
Rust provides different ways of interpreting the raw string data that computers
store so that each program can choose the interpretation it needs, no matter
what human language the data is in.
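
Two of these three views are available from the standard library. The sketch below assumes a hypothetical helper named `bytes_and_chars`; it collects the raw bytes and the `char` values of a string so they can be compared. Grapheme clusters are deliberately left out, since the standard library does not provide them.

```rust
// Hypothetical helper: collects the byte view and the scalar-value view.
fn bytes_and_chars(s: &str) -> (Vec<u8>, Vec<char>) {
    // `bytes` yields each raw `u8`; `chars` yields each Unicode scalar value.
    (s.bytes().collect(), s.chars().collect())
}
```

For the word "नमस्ते", the first vector should hold the 18 bytes listed above and the second the six `char` values.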
Another reason that indexing into a `String` to get a character is not available
is that indexing operations are expected to always be fast. This isn't possible
with a `String`, since Rust would have to walk through the contents from the
beginning to the index to determine how many valid characters there were, no
matter how we define "character".
A final reason Rust does not allow you to index into a `String` to get a
character is that indexing operations are expected to always take constant time
(O(1)). It isn't possible to guarantee that performance with a `String`,
though, since Rust would have to walk through the contents from the beginning
to the index to determine how many valid characters there were.
All of these problems mean that Rust does not implement `[]` for `String`, so
we cannot directly do this.
### Slicing Strings
However, indexing the bytes of a string is very useful, and is not expected to
be fast. While we can't use `[]` with a single number, we *can* use `[]` with
a range to create a string slice from particular bytes:
However, indexing the *bytes* of a string is very useful, and is not expected
to be fast. While we can't use `[]` with a single number, we _can_ use `[]`
with a range to create a string slice containing particular bytes:
```rust
let hello = "Здравствуйте";
@ -302,8 +343,8 @@ let s = &hello[0..4];
```
Here, `s` will be a `&str` that contains the first four bytes of the string.
Earlier, we mentioned that each of these characters was two bytes, so that means
that `s` will be "Зд".
Earlier, we mentioned that each of these characters was two bytes, so that
means that `s` will be "Зд".
What would happen if we did `&hello[0..1]`? The answer: it will panic at
runtime, in the same way that accessing an invalid index in a vector does:
@ -313,11 +354,16 @@ thread 'main' panicked at 'index 0 and/or 1 in `Здравствуйте` do not
character boundary', ../src/libcore/str/mod.rs:1694
```
You should use this with caution, since it can cause your program to crash.
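
If a panic is not acceptable, one option is the `get` method on `str`, which returns an `Option` instead of panicking. This is a sketch of that alternative rather than something the chapter prescribes, and `checked_slice` is a hypothetical helper name:

```rust
// Hypothetical helper: returns `None` instead of panicking when the range
// is out of bounds or does not fall on character boundaries.
fn checked_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}
```

With `hello` bound to "Здравствуйте", `checked_slice(hello, 0, 4)` should be `Some("Зд")`, while `checked_slice(hello, 0, 1)` should be `None` because byte 1 is in the middle of `З`.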
### Methods for Iterating Over Strings
If we do need to perform operations on individual characters, the best way to
do that is using the `chars` method. Calling `chars` on "नमस्ते" gives us the six
Rust `char` values:
Luckily, there are other ways we can access elements in a `String`.
If we need to perform operations on individual characters, the best way to do
so is to use the `chars` method. Calling `chars` on "नमस्ते" separates out and
returns six values of type `char`, and you can iterate over the result in order
to access each element:
```rust
for c in "नमस्ते".chars() {
@ -337,8 +383,7 @@ This code will print:
```
The `bytes` method returns each raw byte, which might be appropriate for your
domain, but remember that valid UTF-8 characters may be made up of more than
one byte:
domain:
```rust
for b in "नमस्ते".bytes() {
@ -356,15 +401,30 @@ This code will print the 18 bytes that make up this `String`, starting with:
// ... etc
```
There are crates available on crates.io to get grapheme clusters from `String`s.
But make sure to remember that valid UTF-8 characters may be made up of more
than one byte.
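
To see where each character starts in the underlying bytes, the standard `char_indices` method pairs every `char` with its byte offset. The helper name `char_offsets` below is hypothetical and only serves the illustration:

```rust
// Hypothetical helper: lists each `char` together with the byte index
// at which it begins.
fn char_offsets(s: &str) -> Vec<(usize, char)> {
    s.char_indices().collect()
}
```

For "Зд" the offsets should be 0 and 2, since each of these characters occupies two bytes.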
Getting grapheme clusters from `String`s is complex, so this functionality is
not provided by the standard library. There are crates available on crates.io
if this is the functionality you need.
<!-- Can you recommend some, or maybe just say why we aren't outlining the
method here, ie it's complicated and therefore best to use a crate? -->
<!-- We're trying not to mention too many crates in the book. Most crates are
provided by the community, so we don't want to mention some and not others and
seem biased towards certain crates, plus crates can change more quickly (and
new crates can be created) than the language and this book will. /Carol -->
### Strings are Not so Simple
To summarize, strings are complicated. Different programming languages make
different choices about how to present this complexity to the programmer. Rust
has chosen to attempt to make correct handling of `String` data be the default
has chosen to make the correct handling of `String` data the default behavior
for all Rust programs, which does mean programmers have to put more thought
into handling UTF-8 data upfront. This tradeoff exposes us to more of the
complexity of strings than we have to handle in other languages, but will
prevent us from having to handle errors involving non-ASCII characters later in
our development lifecycle.
into handling UTF-8 data upfront. This tradeoff exposes more of the complexity
of strings than other programming languages do, but this will prevent you from
having to handle errors involving non-ASCII characters later in your
development lifecycle.
Let's switch to something a bit less complex: Hash Map!

@ -7,55 +7,72 @@ into memory. Many different programming languages support this kind of data
structure, but often with a different name: hash, map, object, hash table, or
associative array, just to name a few.
We'll go over the basic API in this chapter, but there are many more goodies
hiding in the functions defined on `HashMap` by the standard library. As always,
check the standard library documentation for more information.
Hash maps are useful when you want to be able to look up data not by an
index, as you can with vectors, but by using a key that can be of any type. For
example, in a game, you could keep track of each team's score in a hash map
where each key is a team's name and the values are each team's score. Given a
team name, you can retrieve its score.
We'll go over the basic API of hash maps in this chapter, but there are many
more goodies hiding in the functions defined on `HashMap` by the standard
library. As always, check the standard library documentation for more
information.
### Creating a New Hash Map
We can create an empty `HashMap` with `new`, and add elements with `insert`:
We can create an empty `HashMap` with `new`, and add elements with `insert`.
Here we're keeping track of the scores of two teams whose names are Blue and
Yellow. The Blue team will start with 10 points and the Yellow team starts with
50:
```rust
use std::collections::HashMap;
let mut map = HashMap::new();
let mut scores = HashMap::new();
map.insert(1, "hello");
map.insert(2, "world");
scores.insert(String::from("Blue"), 10);
scores.insert(String::from("Yellow"), 50);
```
Note that we need to `use` the `HashMap` from the collections portion of the
standard library. Of our three fundamental collections, this one is the least
often used, so it has a bit less support from the language. There's no built-in
macro to construct them, for example, and they're not in the prelude, so we
need to add a `use` statement for them.
Note that we need to first `use` the `HashMap` from the collections portion of
the standard library. Of our three fundamental collections, this one is the
least often used, so it's not included in the features imported automatically
in the prelude. Hash maps also have less support from the standard library;
there's no built-in macro to construct them, for example.
Just like vectors, hash maps store their data on the heap. This `HashMap` has
keys of type `String` and values of type `i32`. Like vectors, hash maps are
homogeneous: all of the keys must have the same type, and all of the values must
homogeneous: all of the keys must have the same type, and all of the values must
have the same type.
If we have a vector of tuples, we can convert it into a hash map with the
`collect` method. The first element in each tuple will be the key, and the
second element will be the value:
Another way of constructing a hash map is by using the `collect` method on an
iterator of tuples, where each tuple consists of a key and its value. The
`collect` method gathers up data into a number of collection types, including
`HashMap`. For example, if we have the team names and initial scores in two
separate vectors, we can use the `zip` method to create an iterator of tuples
where "Blue" is paired with 10, and so forth. Then we can use the `collect`
method to turn that iterator of tuples into a `HashMap`:
```rust
use std::collections::HashMap;
let data = vec![(1, "hello"), (2, "world")];
let teams = vec![String::from("Blue"), String::from("Yellow")];
let initial_scores = vec![10, 50];
let map: HashMap<_, _> = data.into_iter().collect();
let scores: HashMap<_, _> = teams.iter().zip(initial_scores.iter()).collect();
```
The type annotation `HashMap<_, _>` is needed here because it's possible to
`collect` into many different data structures, so Rust doesn't know which we
want. For the type parameters for the key and value types, however, we can use
underscores and Rust can infer the types that the hash map contains based on the
types of the data in our vector.
`collect` into many different data structures, and Rust doesn't know which you
want unless you specify. For the type parameters for the key and value types,
however, we use underscores and Rust can infer the types that the hash map
contains based on the types of the data in the vector.
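
One detail worth noting: `iter` yields references, so a map collected that way holds references to the keys and values. If the map should own its data, a possible variation (a sketch, with `build_scores` as a hypothetical function name) is to consume the vectors with `into_iter` and spell out the types:

```rust
use std::collections::HashMap;

// Hypothetical helper: consumes both vectors so the resulting map owns
// its `String` keys and `i32` values.
fn build_scores(teams: Vec<String>, initial_scores: Vec<i32>) -> HashMap<String, i32> {
    // `zip` pairs the items up; `collect` uses the return type to decide
    // which collection to build.
    teams.into_iter().zip(initial_scores.into_iter()).collect()
}
```

Here the return type plays the role of the `HashMap<_, _>` annotation, so no extra annotation is needed on `collect`.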
For types that implement the `Copy` trait like `i32` does, the values are
copied into the hash map. If we insert owned values like `String`, the values
will be moved and the hash map will be the owner of those values:
### Hash Maps and Ownership
For types that implement the `Copy` trait, like `i32`, the values are copied
into the hash map. For owned values like `String`, the values will be moved and
the hash map will be the owner of those values:
```rust
use std::collections::HashMap;
@ -68,13 +85,13 @@ map.insert(field_name, field_value);
// field_name and field_value are invalid at this point
```
We would not be able to use the variables `field_name` and `field_value` after
We would not be able to use the bindings `field_name` and `field_value` after
they have been moved into the hash map with the call to `insert`.
If we insert references to values, the values themselves will not be moved into
the hash map. The values that the references point to must be valid for at least
as long as the hash map is valid, though. We will talk more about these issues
in the Lifetimes section of Chapter 10.
If we insert references to values into the hash map, the values themselves will
not be moved into the hash map. The values that the references point to must be
valid for at least as long as the hash map is valid, though. We will talk more
about these issues in the Lifetimes section of Chapter 10.
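
As a rough sketch of the reference case, the hypothetical function `name_lengths` below stores string slices that borrow from a slice of `String`s. The strings are not moved, so the caller can keep using them, but the map cannot outlive them:

```rust
use std::collections::HashMap;

// Hypothetical helper: the keys are `&str` references into `names`,
// so `names` keeps ownership of the text.
fn name_lengths(names: &[String]) -> HashMap<&str, usize> {
    let mut lengths = HashMap::new();
    for name in names {
        // Only a reference is inserted; nothing is moved out of `names`.
        lengths.insert(name.as_str(), name.len());
    }
    lengths
}
```

After calling `name_lengths(&names)`, the `names` vector should still be fully usable.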
### Accessing Values in a Hash Map
@ -83,18 +100,20 @@ We can get a value out of the hash map by providing its key to the `get` method:
```rust
use std::collections::HashMap;
let mut map = HashMap::new();
let mut scores = HashMap::new();
map.insert(1, "hello");
map.insert(2, "world");
scores.insert(String::from("Blue"), 10);
scores.insert(String::from("Yellow"), 50);
let value = map.get(&2);
let team_name = String::from("Blue");
let score = scores.get(&team_name);
```
Here, `value` will have the value `Some("world")`, since that's the value
associated with the `2` key. "world" is wrapped in `Some` because `get` returns
an `Option<V>`. If there's no value for that key in the hash map, `get` will
return `None`.
Here, `score` will have the value that's associated with the Blue team, and the
result will be `Some(10)`. The result is wrapped in `Some` because `get`
returns an `Option<V>`; if there's no value for that key in the hash map, `get`
will return `None`. The program will need to handle the `Option` in one of
the ways that we covered in Chapter 6.
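
One way to handle that `Option` is a `match`, as in this sketch. The helper name `score_or_zero` and the choice of 0 as a fallback are assumptions made for the example:

```rust
use std::collections::HashMap;

// Hypothetical helper: returns the team's score, or 0 if the team
// has no entry in the map.
fn score_or_zero(scores: &HashMap<String, i32>, team: &str) -> i32 {
    match scores.get(team) {
        // `get` returns a reference, so dereference to copy the `i32` out.
        Some(score) => *score,
        None => 0,
    }
}
```

With the Blue team at 10 points, `score_or_zero(&scores, "Blue")` should return 10, and a team that was never inserted should return 0.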
We can iterate over each key/value pair in a hash map in a similar manner as we
do with vectors, using a `for` loop:
@ -102,101 +121,98 @@ do with vectors, using a `for` loop:
```rust
use std::collections::HashMap;
let mut map = HashMap::new();
let mut scores = HashMap::new();
map.insert(1, "hello");
map.insert(2, "world");
scores.insert(String::from("Blue"), 10);
scores.insert(String::from("Yellow"), 50);
for (key, value) in &map {
for (key, value) in &scores {
println!("{}: {}", key, value);
}
```
This will print:
This will print each pair, in an arbitrary order:
```text
1: hello
2: world
Yellow: 50
Blue: 10
```
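
Because the iteration order is arbitrary, output that needs to be stable has to be sorted explicitly. A minimal sketch, assuming a hypothetical helper named `sorted_score_lines`:

```rust
use std::collections::HashMap;

// Hypothetical helper: formats each pair, then sorts the lines so the
// result does not depend on the hash map's internal order.
fn sorted_score_lines(scores: &HashMap<String, i32>) -> Vec<String> {
    let mut lines = Vec::new();
    for (team, score) in scores {
        lines.push(format!("{}: {}", team, score));
    }
    lines.sort();
    lines
}
```

For the two teams above, this should always produce the Blue line before the Yellow line.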
### Updating a Hash Map
Since each key can only have one value, when we want to change the data in a
hash map, we have to decide how to handle the case when a key already has a
value assigned. We could choose to replace the old value with the new value. We
could choose to keep the old value and ignore the new value, and only add the
new value if the key *doesn't* already have a value. Or we could change the
existing value. Let's look at how to do each of these!
<!-- So the quantity of keys must be defined up front, that's not growable?
That could be worthy saying -->
<!-- No, the number of keys is growable, it's just that for EACH individual
key, there can only be one value. I've tried to clarify. /Carol -->
While the number of keys and values is growable, each individual key can only
have one value associated with it at a time. When we want to change the data in
a hash map, we have to decide how to handle the case when a key already has a
value assigned. We could choose to replace the old value with the new value,
completely disregarding the old value. We could choose to keep the old value
and ignore the new value, and only add the new value if the key *doesn't*
already have a value. Or we could combine the old value and the new value.
Let's look at how to do each of these!
#### Overwriting a Value
If we insert a key and a value, then insert that key with a different value,
the value associated with that key will be replaced. Even though this code
calls `insert` twice, the hash map will only contain one key/value pair, since
we're inserting with the key `1` both times:
If we insert a key and a value into a hash map, then insert that same key with a
different value, the value associated with that key will be replaced. Even
though the following code calls `insert` twice, the hash map will only contain
one key/value pair because we're inserting the value for the Blue team's key
both times:
```rust
use std::collections::HashMap;
let mut map = HashMap::new();
let mut scores = HashMap::new();
map.insert(1, "hello");
map.insert(1, "Hi There");
scores.insert(String::from("Blue"), 10);
scores.insert(String::from("Blue"), 25);
println!("{:?}", map);
println!("{:?}", scores);
```
This will print `{1: "Hi There"}`.
This will print `{"Blue": 25}`. The original value of 10 has been overwritten.
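
The overwritten value is not lost silently: `insert` returns an `Option` holding the previous value for that key, if there was one. The sketch below uses a hypothetical function name, `overwrite_demo`, to capture what each call returns:

```rust
use std::collections::HashMap;

// Hypothetical helper: returns what each `insert` call gave back,
// plus the final number of entries.
fn overwrite_demo() -> (Option<i32>, Option<i32>, usize) {
    let mut scores = HashMap::new();

    // No previous value for "Blue", so this returns `None`.
    let first = scores.insert(String::from("Blue"), 10);
    // The key exists now, so the old value comes back as `Some(10)`.
    let second = scores.insert(String::from("Blue"), 25);

    (first, second, scores.len())
}
```

The map should end up with a single entry, since both calls used the same key.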
#### Only Insert If the Key Has No Value
It's common to want to see if there's some sort of value already stored in the
hash map for a particular key, and if not, insert a value. hash maps have a
special API for this, called `entry`, that takes the key we want to check as an
argument:
It's common to want to check if a particular key has a value and, if it does
not, insert a value for it. Hash maps have a special API for this, called
`entry`, that takes the key we want to check as an argument. The return value
of the `entry` function is an enum, `Entry`, that represents a value that might
or might not exist. Let's say that we want to check if the key for the Yellow
team has a value associated with it. If it doesn't, we want to insert the value
50, and the same for the Blue team. With the entry API, the code for this
looks like:
```rust
use std::collections::HashMap;
let mut map = HashMap::new();
map.insert(1, "hello");
let mut scores = HashMap::new();
scores.insert(String::from("Blue"), 10);
let e = map.entry(2);
scores.entry(String::from("Yellow")).or_insert(50);
scores.entry(String::from("Blue")).or_insert(50);
println!("{:?}", scores);
```
Here, the value bound to `e` is a special enum, `Entry`. An `Entry` represents a
value that might or might not exist. Let's say that we want to see if the key
`2` has a value associated with it. If it doesn't, we want to insert the value
"world". In both cases, we want to return the resulting value that now goes
with `2`. With the entry API, it looks like this:
The `or_insert` method on `Entry` returns the value for the `Entry`'s key if it
exists, and if not, inserts its argument as the new value for the `Entry`'s key
and returns that. This is much cleaner than writing the logic ourselves, and in
addition, plays more nicely with the borrow checker.
```rust
use std::collections::HashMap;
let mut map = HashMap::new();
map.insert(1, "hello");
map.entry(2).or_insert("world");
map.entry(1).or_insert("Hi There");
println!("{:?}", map);
```
The `or_insert` method on `Entry` does exactly this: returns the value for the
`Entry`'s key if it exists, and if not, inserts its argument as the new value
for the `Entry`'s key and returns that. This is much cleaner than writing the
logic ourselves, and in addition, plays more nicely with the borrow checker.
This code will print `{1: "hello", 2: "world"}`. The first call to `entry` will
insert the key `2` with the value "world", since `2` doesn't have a value
already. The second call to `entry` will not change the hash map since `1`
already has the value "hello".
This code will print `{"Yellow": 50, "Blue": 10}`. The first call to `entry`
will insert the key for the Yellow team with the value 50, since the Yellow
team doesn't have a value already. The second call to `entry` will not change
the hash map since the Blue team already has the value 10.
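
The value that `or_insert` returns can also be used directly. In this sketch, `score_or_insert` is a hypothetical helper that reports whichever value ends up stored for the key:

```rust
use std::collections::HashMap;

// Hypothetical helper: inserts `default` only if `team` has no value yet,
// then returns the value now stored for `team`.
fn score_or_insert(scores: &mut HashMap<String, i32>, team: &str, default: i32) -> i32 {
    // `or_insert` gives back `&mut i32`; dereference to copy the number out.
    *scores.entry(team.to_string()).or_insert(default)
}
```

With Blue already at 10, asking for Blue with a default of 50 should return 10, while asking for Yellow should insert and return 50.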
#### Update a Value Based on the Old Value
Another common use case for hash maps is to look up a key's value and then update
it, using the old value. For instance, if we wanted to count how many times
Another common use case for hash maps is to look up a key's value and then update
it, based on the old value. For instance, if we wanted to count how many times
each word appeared in some text, we could use a hash map with the words as keys
and increment the value to keep track of how many times we've seen that word.
If this is the first time we've seen a word, we'll first insert the value `0`.
@ -217,40 +233,39 @@ println!("{:?}", map);
```
This will print `{"world": 2, "hello": 1, "wonderful": 1}`. The `or_insert`
method actually returns a mutable reference (`&mut V`) to the value in the
hash map for this key. Here we store that mutable reference in the `count`
variable, so in order to assign to that value we must first dereference
`count` using the asterisk (`*`). The mutable reference goes out of scope at
the end of the `for` loop, so all of these changes are safe and allowed by the
borrowing rules.
method actually returns a mutable reference (`&mut V`) to the value for this
key. Here we store that mutable reference in the `count` variable, so in order
to assign to that value we must first dereference `count` using the asterisk
(`*`). The mutable reference goes out of scope at the end of the `for` loop, so
all of these changes are safe and allowed by the borrowing rules.
### Hashing Function
By default, `HashMap` uses a cryptographically secure hashing function that can
provide resistance to Denial of Service (DoS) attacks. This is not the fastest
hashing algorithm out there, but the tradeoff for better security that comes
with the drop in performance is a good default tradeoff to make. If you profile
your code and find that the default hash function is too slow for your
purposes, you can switch to another function by specifying a different
*hasher*. A hasher is an object that implements the `BuildHasher` trait. We'll
be talking about traits and how to implement them in Chapter 10.
with the drop in performance is worth it. If you profile your code and find
that the default hash function is too slow for your purposes, you can switch to
another function by specifying a different *hasher*. A hasher is a type that
implements the `BuildHasher` trait. We'll be talking about traits and how to
implement them in Chapter 10.
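
To give a feel for what switching hashers might look like, here is a sketch that plugs a hand-written FNV-1a hasher into `HashMap` through the standard `BuildHasherDefault` adapter. The `Fnv1a` type, the `FnvHashMap` alias, and the function name are all hypothetical and exist only for this example. Note that FNV-1a does not provide the DoS resistance of the default hasher, so this trades safety for speed.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// Hypothetical, non-cryptographic FNV-1a hasher, written for illustration.
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf29ce484222325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for Fnv1a {
    fn write(&mut self, bytes: &[u8]) {
        for byte in bytes {
            self.0 ^= u64::from(*byte);
            self.0 = self.0.wrapping_mul(0x100000001b3); // FNV 64-bit prime
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

// `BuildHasherDefault` implements `BuildHasher` for any hasher that
// implements `Default`, creating a fresh hasher for each key.
type FnvHashMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1a>>;

fn scores_with_custom_hasher() -> FnvHashMap<String, i32> {
    let mut scores = FnvHashMap::default();
    scores.insert(String::from("Blue"), 10);
    scores.insert(String::from("Yellow"), 50);
    scores
}
```

Apart from how it is constructed, the resulting map is used in the same way as one built with `HashMap::new`.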
## Summary
Vectors, strings, and hash maps will take you far in programs where you need to
store, access, and modify data. Some programs you are now equipped to write and
might want to try include:
store, access, and modify data. Here are some exercises you should now be
equipped to solve:
* Given a list of integers, use a vector and return their mean (average),
median (when sorted, the value in the middle position), and mode (the value
that occurs most often; a hash map will be helpful here).
* Convert strings to Pig Latin, where the first consonant of each word gets
moved to the end with an added "ay", so "first" becomes "irst-fay". Words that
start with a vowel get an h instead ("apple" becomes "apple-hay"). Remember
about UTF-8 encoding!
* Using a hash map and vectors, create a text interface to allow a user to add
1. Given a list of integers, use a vector and return the mean (average), median
(when sorted, the value in the middle position), and mode (the value that
occurs most often; a hash map will be helpful here) of the list.
2. Convert strings to Pig Latin, where the first consonant of each word is
moved to the end of the word with an added "ay", so "first" becomes
"irst-fay". Words that start with a vowel get "hay" added to the end instead
("apple" becomes "apple-hay"). Remember about UTF-8 encoding!
3. Using a hash map and vectors, create a text interface to allow a user to add
employee names to a department in the company. For example, "Add Sally to
Engineering" or "Add Ron to Sales". Then let the user retrieve a list of all
Engineering" or "Add Amir to Sales". Then let the user retrieve a list of all
people in a department or all people in the company by department, sorted
alphabetically.