Wording edits for strings

This commit is contained in:
Carol (Nichols || Goulding) 2016-09-16 13:31:45 -04:00
parent 8f1b922945
commit 854614fc29

@ -1,42 +1,38 @@
# Strings
We've already talked about strings a bunch in chapter four, but let's take a
more in-depth look at them now.
We've already talked about strings a bunch in Chapter 4, but let's take a more
in-depth look at them now.
## Many Kinds of Strings
Strings are a common place for new Rustaceans to get stuck. This is due to a
combination of three things: Rust's propensity for making sure to expose
possible errors, that strings are a more complicated data structure than many
possible errors, strings being a more complicated data structure than many
programmers give them credit for, and UTF-8. These things combine in a way that
can seem difficult when you're used to other languages.
Before we can dig into those things, we need to talk about what exactly we even
mean by the word 'string'. Rust-the-language has only one string type: `&str`.
We talked about these string slices in chapter four: they're a reference to some
We talked about these string slices in Chapter 4: they're a reference to some
UTF-8 encoded string data stored somewhere else. String literals, for example,
are stored in the binary output of your program, and are therefore string
slices.
Rust's standard library also provides a type called `String`. This is a growable,
mutable, UTF-8 encoded string type. When Rustaceans talk about 'strings' in Rust,
they usually mean "`String` and `&str`." This chapter is largely about `String`,
and these two types are used heavily in Rust's standard library. This is what
Rustaceans mean when they say "Rust strings are UTF-8," since both `String` and
string slices are UTF-8 encoded.
Rust's standard library also provides a type called `String`. This is a
growable, mutable, owned, UTF-8 encoded string type. When Rustaceans talk about
'strings' in Rust, they usually mean "`String` and `&str`". This chapter is
largely about `String`, and these two types are used heavily in Rust's standard
library. Both `String` and string slices are UTF-8 encoded.
Rust's standard library also includes a number of other string types, such as
`OsString`, `OsStr`, `CString`, and `CStr`. Library crates may provide even
more options for string string data. As you can see from the `*String`/`*Str`
more options for storing string data. As you can see from the `*String`/`*Str`
naming, they often provide an owned and borrowed variant, just like
`String`/`&str`. These string types may store different encodings, be
represented in memory in a different way, or all kinds of other things. We
won't be talking about them in this chapter, see their API documentation for
more about how to use them, and when each is appropriate.
`String`/`&str`. These string types may store different encodings or be
represented in memory in a different way, for example. We won't be talking
about these other string types in this chapter; see their API documentation for
more about how to use them and when each is appropriate.
Many options! As I said, strings are surprisingly complex. Most of the time,
we'll be using `String` as an owned string type, though. So let's talk about
how to use it.
## Creating a New String
Let's look at how to do the same operations on `String` as we did with `Vec`,
@ -69,10 +65,10 @@ APIs that make sense for strings. There are a lot of options, and some of them
can feel redundant because of this, but they all have their place! In this
case, `String::from` and `.to_string` end up doing the exact same thing, so
which you choose is a matter of style. Some people use `String::from` for
literals, and `.to_string` for variables bindings. Most Rust style is pretty
uniform, but this specific question is one of the most-debated.
literals, and `.to_string` for variable bindings. Most Rust style is pretty
uniform, but this specific question is one of the most debated.
Don't forget that strings are UTF-8 encoded, and so you can include any
Don't forget that strings are UTF-8 encoded, so you can include any
properly encoded data in them:
```rust
@ -91,37 +87,38 @@ let hello = "Hola";
## Updating a String
A `String` can be changed and can grow in size, just like a `Vec` can.
### Push
You can grow a `String` by using the `push_str` method to append another
string:
```rust
let mut s = String::from("foo");
s.push_str("bar");
// s will be "foobar" here
```
And `push` will add a `char`:
`s` will contain "foobar" after these two lines.
The `push` method will add a `char`:
```rust
let mut s = String::from("lo");
s.push('l');
// s will be "lol" here
```
You can make any string into the empty string with the `clear` method:
`s` will contain "lol" after this point.
You can make any `String` contain the empty string with the `clear` method:
```rust
let mut s = String::from("Noooooooooooooooooooooo!");
s.clear();
// s will be "" here
```
Now `s` will be the empty string, "".
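
As a rough sketch of how these three methods fit together, here are two small helpers. The names `repeat_with_separator` and `reset` are hypothetical and exist only for this illustration; they are not part of the standard library.

```rust
// Hypothetical helper: appends `word` to a new `String` `times` times,
// putting a single `sep` character between repetitions.
fn repeat_with_separator(word: &str, sep: char, times: usize) -> String {
    let mut s = String::new();
    for i in 0..times {
        if i > 0 {
            s.push(sep); // `push` appends exactly one `char`
        }
        s.push_str(word); // `push_str` appends a whole string slice
    }
    s
}

// Hypothetical helper: empties the `String` in place and reports
// whether it is now empty.
fn reset(s: &mut String) -> bool {
    s.clear();
    s.is_empty()
}
```

Note that `push_str` takes a `&str`, so the string being appended is only borrowed and can still be used afterwards.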
### Concatenation
Often, you'll want to combine two strings together. One way is to use the `+`
@ -130,34 +127,37 @@ operator:
```rust
let s1 = String::from("Hello, ");
let s2 = String::from("world!");
let s3 = s1 + &s2;
// s3 is "Hello, world!"
```
There's some tricky bits here, though! They come through the type signature
of `+` for strings. It looks something like this:
This code will make `s3` contain "Hello, world!" There are some tricky bits here,
though, that come from the type signature of `+` for `String`. The signature
for the `add` method that the `+` operator uses looks something like this:
```rust,ignore
fn add(self, s: &str) -> String {
```
I say 'something' because `add` is generic, and so this is what the signature
would be if it isn't. But this signature gives us all the clues we need about
the tricky bits of `+`.
This isn't exactly what the actual signature is in the standard library because
`add` is defined using generics there. Here, we're just looking at what the
signature of the method would be if `add` was defined specifically for
`String`. This signature gives us the clues we need in order to understand the
tricky bits of `+`.
First of all, you'll notice that `s2` has an `&`. This is because of the `s`
argument in the function: you can only add a `&str` to a `String`, you can't
add two `String`s together. Remember back in chapter four when we talked about
how `&String` will coerce to `&str`? That's why it's `&s2`: so that it will
coerce to the proper type.
argument in the `add` function: you can only add a `&str` to a `String`; you
can't add two `String`s together. Remember back in Chapter 4 when we talked
about how `&String` will coerce to `&str`: we write `&s2` so that the `String`
will coerce to the proper type, `&str`.
Secondly, `add` takes ownership of `self`. This means that `s1` in the above
example will move. So while `let s3 = s1 + &s2;` looks like it will copy the
two strings and create a new one, it actually takes ownership of `s1`, appends
a copy of `s2`'s contents, and then returns ownership back. In other words, it
looks like it's making a lot of copies, but isn't.
Secondly, `add` takes ownership of `self`, which we can tell because `self`
does *not* have an `&` in the signature. This means `s1` in the above example
will be moved into the `add` call and no longer be a valid binding after that.
So while `let s3 = s1 + &s2;` looks like it will copy both strings and create a
new one, this statement actually takes ownership of `s1`, appends a copy of
`s2`'s contents, then returns ownership of the result. In other words, it looks
like it's making a lot of copies, but isn't: the implementation is more
efficient than copying.
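
To make the ownership behavior concrete, here is a minimal sketch with two hypothetical helpers (`concat_moving` and `concat_keeping` are names made up for this example). The first gives up its left-hand `String`; the second makes an explicit copy first so that the caller keeps both inputs.

```rust
// Hypothetical helper: `s1` is moved in, and `+` reuses it as the start
// of the result, appending a copy of `s2`'s contents.
fn concat_moving(s1: String, s2: &str) -> String {
    s1 + s2
}

// Hypothetical helper: borrows both inputs and pays for an explicit copy
// of `s1`, so the caller can keep using both originals.
fn concat_keeping(s1: &str, s2: &str) -> String {
    s1.to_string() + s2
}
```

With `concat_moving(a, &b)`, the binding `a` can no longer be used after the call, while `b` is only borrowed and stays valid.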
If you need to concatenate multiple strings, this behavior of `+` gets
unwieldy:
@ -168,12 +168,11 @@ let s2 = String::from("tac");
let s3 = String::from("toe");
let s = s1 + "-" + &s2 + "-" + &s3;
// s will be 'tic-tac-toe' here
```
With all of these `+`s and `"`s, it gets hard to see what's going on. For more
complicated string combining, we can use the `format!` macro:
`s` will be "tic-tac-toe" at this point. With all of the `+` and `"`
characters, it gets hard to see what's going on. For more complicated string
combining, we can use the `format!` macro:
```rust
let s1 = String::from("tic");
@ -181,109 +180,106 @@ let s2 = String::from("tac");
let s3 = String::from("toe");
let s = format!("{}-{}-{}", s1, s2, s3);
// s will be 'tic-tac-toe' here
```
The `format!` macro works in the same way as `println!`, but instead of
printing the output to the screen, it returns a `String` with the contents.
This version is much easier to read than all of the `+`s.
This code will also set `s` to "tic-tac-toe". The `format!` macro works in the
same way as `println!`, but instead of printing the output to the screen, it
returns a `String` with the contents. This version is much easier to read than
all of the `+`s.
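
One more difference from `+` is worth a short sketch: `format!` only borrows its arguments, so none of the inputs are moved. The helper name `join_with_dashes` below is hypothetical.

```rust
// Hypothetical helper: builds a new `String` from three borrowed pieces.
// `format!` reads its arguments by reference, so nothing is moved.
fn join_with_dashes(s1: &str, s2: &str, s3: &str) -> String {
    format!("{}-{}-{}", s1, s2, s3)
}
```

After calling `join_with_dashes(&s1, &s2, &s3)`, all three original `String`s are still usable.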
## Indexing Strings
## Indexing into Strings
If you try to access a string with the indexing syntax, you'll get an error. In
other words, this:
In many other languages, accessing individual characters in a string by
referencing the characters by index is a valid and common operation. In Rust,
however, if we try to access parts of a `String` using indexing syntax, we'll
get an error. That is, this code:
```rust,ignore
let s1 = String::from("hello");
let h = s1[0];
```
will give you this:
will result in this error:
```text
error: the trait bound `std::string::String: std::ops::Index<_>` is not satisfied [--explain E0277]
--> <anon>:4:14
error: the trait bound `std::string::String: std::ops::Index<_>` is not
satisfied [--explain E0277]
|>
4 |> let s3 = s1[0];
|> ^^^^^
|> let h = s1[0];
|> ^^^^^
note: the type `std::string::String` cannot be indexed by `_`
```
That note tells the story: Rust strings don't support indexing like this. So
The error and the note tell the story: Rust strings don't support indexing. So
the follow-up question is, why not? In order to answer that, we have to talk a
bit about how Rust stores strings in memory.
### Internal Representation
A `String` is a wrapper over a `Vec<u8>`. Let's take a look at some of our
examples from before. First, this one:
properly-encoded UTF-8 example strings from before. First, this one:
```rust
let len = "Hola".len();
```
In this case, `len` will be four bytes long: each of these letters takes one
byte when encoded in UTF-8. What about this one, though?
In this case, `len` will be four, which means the `Vec` storing the string
"Hola" is four bytes long: each of these letters takes one byte when encoded in
UTF-8. What about this example, though?
```rust
let len = "Здравствуйте".len();
```
There are two answers that make sense here: the first is 12, which is the number
of letters that it would be if you asked someone. The second, though, is the
real answer here: 24. This is the number of bytes that it takes to encode
"Здравствуйте" in UTF-8: each character is two bytes.
There are two answers that potentially make sense here: the first is 12, which
is the number of letters that a person would count if we asked someone how long
this string was. The second, though, is what Rust's answer is: 24. This is the
number of bytes that it takes to encode "Здравствуйте" in UTF-8, because each
character takes two bytes of storage.
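
Both answers can be computed with standard methods. The sketch below uses two hypothetical wrapper functions, `byte_len` and `scalar_count`, just to give the two measurements names.

```rust
// Hypothetical wrapper: the number of bytes in the UTF-8 encoding.
// This is what `len` reports.
fn byte_len(s: &str) -> usize {
    s.len()
}

// Hypothetical wrapper: the number of Unicode scalar values (`char`s).
// Counting them requires walking the whole string.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}
```

For "Hola" both functions return 4, while for "Здравствуйте" `byte_len` returns 24 and `scalar_count` returns 12.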
By the same token, imagine this invalid Rust code:
```rust,ignore
let hello = "Здравствуйте";
let answer = &hello[0];
```
What should the value of `answer` be? Should it be `З`, the first letter? Or
should it be `208`? When you encode `З` in UTF-8, the first byte is `208`, and
the second is `151`. If it's `208`, well, that's not a valid character on its
own. So... do we make `[]` return an integer? For latin letters, this would
then not return the answer you'd expect: `&"hello"[0]` would then give `104`,
not `h`.
What should the value of `answer` be? Should it be `З`, the first letter? When
you encode `З` in UTF-8, the first byte is `208`, and the second is `151`. So
should `answer` be `208`? `208` is not a valid character on its own, though.
Plus, for Latin letters, this would not return the answer most people would
expect: `&"hello"[0]` would then return `104`, not `h`.
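
Although `[]` is not available, we can ask for each interpretation explicitly. This is a small sketch with hypothetical helper names (`first_byte` and `first_char`), using the standard `bytes` and `chars` iterators.

```rust
// Hypothetical helper: the first byte of the UTF-8 encoding, if any.
fn first_byte(s: &str) -> Option<u8> {
    s.bytes().next()
}

// Hypothetical helper: the first Unicode scalar value, if any.
fn first_char(s: &str) -> Option<char> {
    s.chars().next()
}
```

Both return an `Option` because the string might be empty. Naming the method makes the choice between "byte" and "character" visible at the call site instead of hiding it behind `[]`.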
This leads to another point about UTF-8: there are really three relevant ways
to look at strings, from Rust's perspective: bytes, scalar values, and grapheme
clusters. If we look at "नमस्ते ", it ultimately boils down to:
clusters. If we look at the string "नमस्ते ", it is ultimately stored as a `Vec`
of `u8` values that looks like this:
```text
[224, 164, 168, 224, 164, 174, 224, 164, 184, 224, 165, 141, 224, 164, 164, 224, 165, 135, 32]
```
That's 19 bytes. But if you look at them as Unicode scalar values, which are what
Rust's `char` type is, those bytes look like this:
That's 19 bytes. But if we look at them as Unicode scalar values, which are
what Rust's `char` type is, those bytes look like this:
```text
['न', 'म', 'स', '्', 'त', 'े', ' ']
```
There are seven of them, and the last one isn't even visible! Finally, if you
look at them as grapheme clusters, which is the closest thing to what humans
would call 'letters', you'd get this:
There are seven `char` values here, and the last one isn't even visible!
Finally, if we look at them as grapheme clusters, which is the closest thing
to what humans would call 'letters', we'd get this:
```text
["न", "म", "स्", "ते", " "]
```
Five elemtns, and there's still that empty character on the end. It turns out
Five elements, and there's still that empty character on the end. It turns out
that even within 'grapheme cluster', there are multiple ways of grouping
things. Have we convinced you strings are actually really complicated yet?
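
The first two views are available from the standard library. As a sketch, the hypothetical helper `scalar_offsets` below pairs each `char` with the byte offset where its encoding starts, which shows how the bytes and the scalar values line up. Splitting into grapheme clusters is not shown here because, as far as we know, the standard library does not provide it; an external crate would be needed for that.

```rust
// Hypothetical helper: each Unicode scalar value together with the byte
// offset at which its UTF-8 encoding begins.
fn scalar_offsets(s: &str) -> Vec<(usize, char)> {
    s.char_indices().collect()
}
```

For the 19-byte string above, the seven offsets are 0, 3, 6, 9, 12, 15, and 18: each Devanagari scalar value takes three bytes, and the trailing space takes one.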
Furthermore, `[]` implies *O(1)* access time. But because UTF-8 is a
variable-length encoding, implementing `[]` on strings would be *O(n)* instead,
which would make it significantly worse-performing than what people expect.
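
To see where the cost comes from, here is a minimal sketch of looking up a character by position. The helper name `nth_char` is hypothetical; it relies on the standard `chars` iterator, which has to decode the string from the beginning because characters can have different byte widths.

```rust
// Hypothetical helper: the n-th Unicode scalar value (zero-based), if any.
// The iterator decodes from the start of the string, so the work grows
// with `n` rather than being constant.
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}
```

Writing the lookup this way makes the linear walk explicit, which is arguably more honest than an indexing operator that looks like a constant-time operation.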
All of these problems means that we decided to not implement `[]` for strings, so
you cannot directly do this.
All of these problems mean that Rust does not implement `[]` for `String`, so
we cannot directly do this.
However.