mirror of
https://github.com/rust-lang-cn/book-cn.git
synced 2025-01-23 23:50:25 +08:00
Wording edits for strings
This commit is contained in:
parent
8f1b922945
commit
854614fc29
@ -1,42 +1,38 @@
|
||||
# Strings
|
||||
|
||||
We've already talked about strings a bunch in chapter four, but let's take a
|
||||
more in-depth look at them now.
|
||||
We've already talked about strings a bunch in Chapter 4, but let's take a more
|
||||
in-depth look at them now.
|
||||
|
||||
## Many Kinds of Strings
|
||||
|
||||
Strings are a common place for new Rustaceans to get stuck. This is due to a
|
||||
combination of three things: Rust's propensity for making sure to expose
|
||||
possible errors, that strings are a more complicated data structure than many
|
||||
possible errors, strings being a more complicated data structure than many
|
||||
programmers give them credit for, and UTF-8. These things combine in a way that
|
||||
can seem difficult when you're used to other languages.
|
||||
|
||||
Before we can dig into those things, we need to talk about what exactly we even
|
||||
mean by the word 'string'. Rust-the-language has only one string type: `&str`.
|
||||
We talked about these string slices in chapter four: they're a reference to some
|
||||
We talked about these string slices in Chapter 4: they're a reference to some
|
||||
UTF-8 encoded string data stored somewhere else. String literals, for example,
|
||||
are stored in the binary output of your program, and are therefore string
|
||||
slices.
|
||||
|
||||
Rust's standard library also provides a type called `String`. This is a growable,
|
||||
mutable, UTF-8 encoded string type. When Rustaceans talk about 'strings' in Rust,
|
||||
they usually mean "`String` and `&str`." This chapter is largely about `String`,
|
||||
and these two types are used heavily in Rust's standard library. This is what
|
||||
Rustaceans mean when they say "Rust strings are UTF-8," since both `String` and
|
||||
string slices are UTF-8 encoded.
|
||||
Rust's standard library also provides a type called `String`. This is a
|
||||
growable, mutable, owned, UTF-8 encoded string type. When Rustaceans talk about
|
||||
'strings' in Rust, they usually mean "`String` and `&str`". This chapter is
|
||||
largely about `String`, and these two types are used heavily in Rust's standard
|
||||
library. Both `String` and string slices are UTF-8 encoded.
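
As a minimal sketch of the owned/borrowed distinction described above (the function names `byte_len` and `with_exclamation` are made up for illustration and are not part of the standard library), a function that only reads string data can take a `&str`, while one that builds new data returns a `String`:

```rust
// Borrowing API: accepts any UTF-8 string data without taking ownership.
fn byte_len(text: &str) -> usize {
    text.len()
}

// Owning API: builds a new, growable `String` from borrowed data.
fn with_exclamation(text: &str) -> String {
    let mut owned = String::from(text); // copy the data into an owned buffer
    owned.push('!');                    // only an owned `String` can grow
    owned
}

fn main() {
    let literal: &str = "hello"; // string slice pointing into the binary
    let owned: String = with_exclamation(literal);

    println!("{} has {} bytes", literal, byte_len(literal));
    println!("{} has {} bytes", owned, byte_len(&owned));
}
```

Taking `&str` as the parameter type lets the same function accept both string literals and borrowed `String` values.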
|
||||
|
||||
Rust's standard library also includes a number of other string types, such as
|
||||
`OsString`, `OsStr`, `CString`, and `CStr`. Library crates may provide even
|
||||
more options for string string data. As you can see from the `*String`/`*Str`
|
||||
more options for storing string data. As you can see from the `*String`/`*Str`
|
||||
naming, they often provide an owned and borrowed variant, just like
|
||||
`String`/`&str`. These string types may store different encodings, be
|
||||
represented in memory in a different way, or all kinds of other things. We
|
||||
won't be talking about them in this chapter, see their API documentation for
|
||||
more about how to use them, and when each is appropriate.
|
||||
`String`/`&str`. These string types may store different encodings or be
|
||||
represented in memory in a different way, for example. We won't be talking
|
||||
about these other string types in this chapter; see their API documentation for
|
||||
more about how to use them and when each is appropriate.
|
||||
|
||||
Many options! As I said, strings are surprisingly complex. Most of the time,
|
||||
we'll be using `String` as an owned string type, though. So let's talk about
|
||||
how to use it.
|
||||
## Creating a New String
|
||||
|
||||
Let's look at how to do the same operations on `String` as we did with `Vec`,
|
||||
@ -69,10 +65,10 @@ APIs that make sense for strings. There are a lot of options, and some of them
|
||||
can feel redundant because of this, but they all have their place! In this
|
||||
case, `String::from` and `.to_string` end up doing the exact same thing, so
|
||||
which you choose is a matter of style. Some people use `String::from` for
|
||||
literals, and `.to_string` for variables bindings. Most Rust style is pretty
|
||||
uniform, but this specific question is one of the most-debated.
|
||||
literals, and `.to_string` for variable bindings. Most Rust style is pretty
|
||||
uniform, but this specific question is one of the most debated.
|
||||
|
||||
Don't forget that strings are UTF-8 encoded, and so you can include any
|
||||
Don't forget that strings are UTF-8 encoded, so you can include any
|
||||
properly encoded data in them:
|
||||
|
||||
```rust
|
||||
@ -91,37 +87,38 @@ let hello = "Hola";
|
||||
|
||||
## Updating a String
|
||||
|
||||
A `String` can be changed and can grow in size, just like a `Vec` can.
|
||||
|
||||
### Push
|
||||
|
||||
You can grow a `String` by using the `push_str` method to append another
|
||||
string:
|
||||
|
||||
```rust
|
||||
let mut s = String::from("foo");
|
||||
|
||||
s.push_str("bar");
|
||||
|
||||
// s will be "foobar" here
|
||||
```
|
||||
|
||||
And `push` will add a `char`:
|
||||
`s` will contain "foobar" after these two lines.
|
||||
|
||||
The `push` method will add a `char`:
|
||||
|
||||
```rust
|
||||
let mut s = String::from("lo");
|
||||
|
||||
s.push('l');
|
||||
|
||||
// s will be "lol" here
|
||||
```
|
||||
|
||||
You can make any string into the empty string with the `clear` method:
|
||||
`s` will contain "lol" after this point.
|
||||
|
||||
You can make any `String` contain the empty string with the `clear` method:
|
||||
|
||||
```rust
|
||||
let mut s = String::from("Noooooooooooooooooooooo!");
|
||||
|
||||
s.clear();
|
||||
|
||||
// s will be "" here
|
||||
```
|
||||
|
||||
Now `s` will be the empty string, "".
|
||||
|
||||
### Concatenation
|
||||
|
||||
Often, you'll want to combine two strings together. One way is to use the `+`
|
||||
@ -130,34 +127,37 @@ operator:
|
||||
```rust
|
||||
let s1 = String::from("Hello, ");
|
||||
let s2 = String::from("world!");
|
||||
|
||||
let s3 = s1 + &s2;
|
||||
|
||||
// s3 is "Hello, world!"
|
||||
```
|
||||
|
||||
There's some tricky bits here, though! They come through the type signature
|
||||
of `+` for strings. It looks something like this:
|
||||
This code will make `s3` contain "Hello, world!" There are some tricky bits here,
|
||||
though, that come from the type signature of `+` for `String`. The signature
|
||||
for the `add` method that the `+` operator uses looks something like this:
|
||||
|
||||
```rust,ignore
|
||||
fn add(self, s: &str) -> String {
|
||||
```
|
||||
|
||||
I say 'something' because `add` is generic, and so this is what the signature
|
||||
would be if it isn't. But this signature gives us all the clues we need about
|
||||
the tricky bits of `+`.
|
||||
This isn't exactly what the actual signature is in the standard library because
|
||||
`add` is defined using generics there. Here, we're just looking at what the
|
||||
signature of the method would be if `add` was defined specifically for
|
||||
`String`. This signature gives us the clues we need in order to understand the
|
||||
tricky bits of `+`.
|
||||
|
||||
First of all, you'll notice that `s2` has an `&`. This is because of the `s`
|
||||
argument in the function: you can only add a `&str` to a `String`, you can't
|
||||
add two `String`s together. Remember back in chpater four when we talked about
|
||||
how `&String` will coerce to `&str`? That's why it's `&s2`: so that it will
|
||||
coerce to the proper type.
|
||||
argument in the `add` function: you can only add a `&str` to a `String`; you
|
||||
can't add two `String`s together. Remember back in Chapter 4 when we talked
|
||||
about how `&String` will coerce to `&str`: we write `&s2` so that the `String`
|
||||
will coerce to the proper type, `&str`.
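
The coercion can be seen in a small sketch. The helper `greet` is a hypothetical function written for this illustration; it takes a `&str`, and the caller passes a `&String`:

```rust
// Hypothetical helper: the parameter type is `&str`, matching the
// right-hand side that `+` expects for `String`.
fn greet(name: &str) -> String {
    let greeting = String::from("Hello, ");
    greeting + name
}

fn main() {
    let name = String::from("world!");

    // `&name` has type `&String`; it coerces to `&str` at the call site.
    let message = greet(&name);

    println!("{}", message);
    // `name` was only borrowed, so it is still usable here.
    println!("{}", name);
}
```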
|
||||
|
||||
Secondly, `add` takes ownership of `self`. This means that `s1` in the above
|
||||
example will move. So while `let s3 = s1 + &s2;` looks like it will copy the
|
||||
two strings and create a new one, it actually takes ownership of `s1`, appends
|
||||
a copy of `s2`'s contents, and then returns ownership back. In other words, it
|
||||
looks like it's making a lot of copies, but isn't.
|
||||
Secondly, `add` takes ownership of `self`, which we can tell because `self`
|
||||
does *not* have an `&` in the signature. This means `s1` in the above example
|
||||
will be moved into the `add` call and no longer be a valid binding after that.
|
||||
So while `let s3 = s1 + &s2;` looks like it will copy both strings and create a
|
||||
new one, this statement actually takes ownership of `s1`, appends a copy of
|
||||
`s2`'s contents, then returns ownership of the result. In other words, it looks
|
||||
like it's making a lot of copies, but isn't: the implementation is more
|
||||
efficient than copying.
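
To make the move visible, here is a sketch with two hypothetical helpers: `concat_consuming` mirrors what `+` does with its left operand, and `concat_keeping` makes an explicit copy first when the original must stay usable:

```rust
// Takes ownership of `left`, appends `right`, and hands the result back.
fn concat_consuming(left: String, right: &str) -> String {
    left + right
}

// Copies `left` first, so the caller's value is untouched.
fn concat_keeping(left: &str, right: &str) -> String {
    left.to_string() + right
}

fn main() {
    let s1 = String::from("Hello, ");
    let s2 = String::from("world!");

    let kept = concat_keeping(&s1, &s2); // s1 is only borrowed here
    let s3 = concat_consuming(s1, &s2);  // s1 is moved here

    // println!("{}", s1); // would not compile: `s1` was moved above
    println!("{} / {} / {}", kept, s3, s2); // `s2` is still valid
}
```

The explicit copy in `concat_keeping` is the cost you pay for keeping the left-hand string; plain `+` avoids it by consuming that operand.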
|
||||
|
||||
If you need to concatenate multiple strings, this behavior of `+` gets
|
||||
unwieldy:
|
||||
@ -168,12 +168,11 @@ let s2 = String::from("tac");
|
||||
let s3 = String::from("toe");
|
||||
|
||||
let s = s1 + "-" + &s2 + "-" + &s3;
|
||||
|
||||
// s will be 'tic-tac-toe' here
|
||||
```
|
||||
|
||||
With all of these `+`s and `"`s, it gets hard to see what's going on. For more
|
||||
complicated string combining, we can use the `format!` macro:
|
||||
`s` will be "tic-tac-toe" at this point. With all of the `+` and `"`
|
||||
characters, it gets hard to see what's going on. For more complicated string
|
||||
combining, we can use the `format!` macro:
|
||||
|
||||
```rust
|
||||
let s1 = String::from("tic");
|
||||
@ -181,109 +180,106 @@ let s2 = String::from("tac");
|
||||
let s3 = String::from("toe");
|
||||
|
||||
let s = format!("{}-{}-{}", s1, s2, s3);
|
||||
|
||||
// s will be 'tic-tac-toe' here
|
||||
```
|
||||
|
||||
The `format!` macro works in the same way as `println!`, but instead of
|
||||
printing the output to the screen, it returns a `String` with the contents.
|
||||
This version is much easier to read than all of the `+`s.
|
||||
This code will also set `s` to "tic-tac-toe". The `format!` macro works in the
|
||||
same way as `println!`, but instead of printing the output to the screen, it
|
||||
returns a `String` with the contents. This version is much easier to read than
|
||||
all of the `+`s.
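
One further difference worth sketching: unlike `+`, `format!` only borrows its arguments, so every input remains usable afterwards. The wrapper `join_with_dashes` below is a made-up name for this example:

```rust
// Hypothetical wrapper around `format!`; all parameters are borrowed.
fn join_with_dashes(a: &str, b: &str, c: &str) -> String {
    format!("{}-{}-{}", a, b, c)
}

fn main() {
    let s1 = String::from("tic");
    let s2 = String::from("tac");
    let s3 = String::from("toe");

    let s = join_with_dashes(&s1, &s2, &s3);
    println!("{}", s);

    // None of the inputs were moved, including the first one.
    println!("{} {} {}", s1, s2, s3);
}
```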
|
||||
|
||||
## Indexing Strings
|
||||
## Indexing into Strings
|
||||
|
||||
If you try to access a string with the indexing syntax, you'll get an error. In
|
||||
other words, this:
|
||||
In many other languages, accessing individual characters in a string by
|
||||
referencing the characters by index is a valid and common operation. In Rust,
|
||||
however, if we try to access parts of a `String` using indexing syntax, we'll
|
||||
get an error. That is, this code:
|
||||
|
||||
```rust,ignore
|
||||
let s1 = String::from("hello");
|
||||
|
||||
let h = s1[0];
|
||||
```
|
||||
|
||||
will give you this:
|
||||
will result in this error:
|
||||
|
||||
```text
|
||||
error: the trait bound `std::string::String: std::ops::Index<_>` is not satisfied [--explain E0277]
|
||||
--> <anon>:4:14
|
||||
error: the trait bound `std::string::String: std::ops::Index<_>` is not
|
||||
satisfied [--explain E0277]
|
||||
|>
|
||||
4 |> let s3 = s1[0];
|
||||
|> ^^^^^
|
||||
|> let h = s1[0];
|
||||
|> ^^^^^
|
||||
note: the type `std::string::String` cannot be indexed by `_`
|
||||
```
|
||||
|
||||
That note tells the story: Rust strings don't support indexing like this. So
|
||||
The error and the note tell the story: Rust strings don't support indexing. So
|
||||
the follow-up question is, why not? In order to answer that, we have to talk a
|
||||
bit about how Rust stores strings in memory.
|
||||
|
||||
### Internal Representation
|
||||
|
||||
A `String` is a wrapper over a `Vec<u8>`. Let's take a look at some of our
|
||||
examples from before. First, this one:
|
||||
properly-encoded UTF-8 example strings from before. First, this one:
|
||||
|
||||
```rust
|
||||
let len = "Hola".len();
|
||||
```
|
||||
|
||||
In this case, `len` will be four bytes long: each of these letters takes one
|
||||
byte when encoded in UTF-8. What about this one, though?
|
||||
In this case, `len` will be four, which means the `Vec` storing the string
|
||||
"Hola" is four bytes long: each of these letters takes one byte when encoded in
|
||||
UTF-8. What about this example, though?
|
||||
|
||||
```rust
|
||||
let len = "Здравствуйте".len();
|
||||
```
|
||||
|
||||
There are two answers that make sense here: the first is 12, which is the number
|
||||
of letters that it would be if you asked someone. The second, though, is the
|
||||
real answer here: 24. This is the number of bytes that it takes to encode
|
||||
"Здравствуйте" in UTF-8: each character is two bytes.
|
||||
There are two answers that potentially make sense here: the first is 12, which
|
||||
is the number of letters that a person would count if we asked someone how long
|
||||
this string was. The second, though, is what Rust's answer is: 24. This is the
|
||||
number of bytes that it takes to encode "Здравствуйте" in UTF-8, because each
|
||||
character takes two bytes of storage.
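
Both answers can be computed. In this sketch the helper name `byte_and_char_counts` is an assumption for illustration; `len` reports bytes, while `chars().count()` counts Unicode scalar values:

```rust
// Returns (number of bytes, number of Unicode scalar values).
fn byte_and_char_counts(text: &str) -> (usize, usize) {
    (text.len(), text.chars().count())
}

fn main() {
    for text in ["Hola", "Здравствуйте"] {
        let (bytes, chars) = byte_and_char_counts(text);
        println!("{}: {} bytes, {} chars", text, bytes, chars);
    }
}
```

Note that counting `char` values walks the whole string, whereas `len` just reads the stored byte length.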
|
||||
|
||||
By the same token, imagine this invalid Rust code:
|
||||
|
||||
```rust,ignore
|
||||
let hello = "Здравствуйте";
|
||||
|
||||
let answer = &hello[0];
|
||||
```
|
||||
|
||||
What should the value of `answer` be? Should it be `З`, the first letter? Or
|
||||
should it be `208`? When you encode `З` in UTF-8, the first byte is `208`, and
|
||||
the second is `151`. If it's `208`, well, that's not a valid character on its
|
||||
own. So... do we make `[]` return an integer? For latin letters, this would
|
||||
then not return the answer you'd expect: `&"hello"[0]` would then give `104`,
|
||||
not `h`.
|
||||
What should the value of `answer` be? Should it be `З`, the first letter? When
|
||||
you encode `З` in UTF-8, the first byte is `208`, and the second is `151`. So
|
||||
should `answer` be `208`? `208` is not a valid character on its own, though.
|
||||
Plus, for Latin letters, this would not return the answer most people would
|
||||
expect: `&"hello"[0]` would then return `104`, not `h`.
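
The byte values mentioned above can be checked by asking for the bytes explicitly. This is a sketch; `first_two_bytes` is a hypothetical helper, and `as_bytes` is the standard method that exposes the underlying `u8` data:

```rust
// Returns the first two bytes of the string, if it has at least two.
fn first_two_bytes(text: &str) -> Option<(u8, u8)> {
    let bytes = text.as_bytes();
    if bytes.len() >= 2 {
        Some((bytes[0], bytes[1]))
    } else {
        None
    }
}

fn main() {
    let hello = "Здравствуйте";

    // The first letter occupies two bytes in UTF-8.
    println!("{:?}", first_two_bytes(hello));

    // Asking for a `char` instead yields the whole first letter.
    println!("{:?}", hello.chars().next());

    // For ASCII text, the first byte is the code of the letter itself.
    println!("{}", "hello".as_bytes()[0]);
}
```

Indexing the byte slice is allowed because each element there is unambiguously one `u8`.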
|
||||
|
||||
This leads to another point about UTF-8: there are really three relevant ways
|
||||
to look at strings, from Rust's perspective: bytes, scalar values, and grapheme
|
||||
clusters. If we look at "नमस्ते ", it ultimately boils down to:
|
||||
clusters. If we look at the string "नमस्ते ", it is ultimately stored as a `Vec`
|
||||
of `u8` values that looks like this:
|
||||
|
||||
```text
|
||||
[224, 164, 168, 224, 164, 174, 224, 164, 184, 224, 165, 141, 224, 164, 164, 224, 165, 135, 32]
|
||||
```
|
||||
|
||||
That's 19 bytes. But if you look at them as Unicode scalar values, which are what
|
||||
Rust's `char` type is, those bytes look like this:
|
||||
That's 19 bytes. But if we look at them as Unicode scalar values, which are
|
||||
what Rust's `char` type is, those bytes look like this:
|
||||
|
||||
```text
|
||||
['न', 'म', 'स', '्', 'त', 'े', ' ']
|
||||
```
|
||||
|
||||
There are seven of them, and the last one isn't even visible! Finally, if you
|
||||
look at them as grapheme clusters, which is the closest thing to what humans
|
||||
would call 'letters', you'd get this:
|
||||
There are seven `char` values here, and the last one isn't even visible!
|
||||
Finally, if we look at them as grapheme clusters, which is the closest thing
|
||||
to what humans would call 'letters', we'd get this:
|
||||
|
||||
```text
|
||||
["न", "म", "स्", "ते", " "]
|
||||
```
|
||||
|
||||
Five elemtns, and there's still that empty character on the end. It turns out
|
||||
Five elements, and there's still that empty character on the end. It turns out
|
||||
that even within 'grapheme cluster', there are multiple ways of grouping
|
||||
things. Have we convinced you strings are actually really complicated yet?
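
The first two views are available directly from the standard library. The sketch below spells the example string with `\u{...}` escapes so the exact scalar values are unambiguous; the constant `NAMASTE` and both helper names are assumptions made for this illustration. Grapheme clusters are not covered here because splitting them needs functionality outside the standard library.

```rust
// The example string, including its trailing space, written as escapes.
const NAMASTE: &str = "\u{0928}\u{092E}\u{0938}\u{094D}\u{0924}\u{0947} ";

// View 1: the raw UTF-8 bytes.
fn bytes_of(text: &str) -> Vec<u8> {
    text.bytes().collect()
}

// View 2: the Unicode scalar values (`char`s).
fn scalars_of(text: &str) -> Vec<char> {
    text.chars().collect()
}

fn main() {
    println!("{:?}", bytes_of(NAMASTE));
    println!("{:?}", scalars_of(NAMASTE));
}
```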
|
||||
|
||||
Furthermore, `[]` implies *O(1)* access time. But because UTF-8 is a
|
||||
variable-length encoding, implementing `[]` on strings would be *O(n)* instead,
|
||||
which would make it significantly worse-performing than what people expect.
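
A short sketch of why the position of a character cannot be computed without scanning: with a variable-length encoding, the byte offset of each `char` depends on the widths of all the characters before it. The helper names `byte_offsets` and `nth_char` are hypothetical.

```rust
// Byte offset at which each `char` of the string starts.
fn byte_offsets(text: &str) -> Vec<usize> {
    text.char_indices().map(|(offset, _)| offset).collect()
}

// Finds the n-th `char` by walking the string from the beginning.
fn nth_char(text: &str, n: usize) -> Option<char> {
    text.chars().nth(n)
}

fn main() {
    let mixed = "aЗb"; // one-byte, two-byte, one-byte characters

    // The offsets are not evenly spaced, so there is no simple
    // "index times width" formula to jump to a character.
    println!("{:?}", byte_offsets(mixed));
    println!("{:?}", nth_char(mixed, 2));
}
```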
|
||||
|
||||
All of these problems means that we decided to not implement `[]` for strings, so
|
||||
you cannot directly do this.
|
||||
All of these problems mean that Rust does not implement `[]` for `String`, so
|
||||
we cannot directly do this.
|
||||
|
||||
However.
|
||||
|
||||
|